Statistical Signal Processing of Complex-Valued Data.pdf

April 5, 2018 | Author: mehdicheraghi506 | Category: Polarization (Waves), Modulation, Complex Number, Correlation And Dependence, Covariance Matrix


Statistical Signal Processing of Complex-Valued Data

Complex-valued random signals are embedded into the very fabric of science and engineering, yet the usual assumptions made about their statistical behavior are often a poor representation of the underlying physics. This book deals with improper and noncircular complex signals, which do not conform to classical assumptions, and it demonstrates how correct treatment of these signals can have significant payoffs. The book begins with detailed coverage of the fundamental theory and presents a variety of tools and algorithms for dealing with improper and noncircular signals. It provides a comprehensive account of the main applications, covering detection, estimation, and signal analysis of stationary, nonstationary, and cyclostationary processes. Providing a systematic development from the origin of complex signals to their probabilistic description makes the theory accessible to newcomers. This book is ideal for graduate students and researchers working with complex data in a range of research areas from communications to oceanography.

PETER J. SCHREIER is an Associate Professor in the School of Electrical Engineering and Computer Science, The University of Newcastle, Australia. He received his Ph.D. in electrical engineering from the University of Colorado at Boulder in 2003. He currently serves on the Editorial Board of the IEEE Transactions on Signal Processing, and on the IEEE Technical Committee Machine Learning for Signal Processing.

LOUIS L. SCHARF is Professor of Electrical and Computer Engineering and Statistics at Colorado State University. He received his Ph.D. from the University of Washington at Seattle. He has since received numerous awards for his research contributions to statistical signal processing, including an IEEE Distinguished Lectureship, an IEEE Third Millennium Medal, and the Technical Achievement and Society Awards from the IEEE Signal Processing Society. He is a Life Fellow of the IEEE.

Statistical Signal Processing of Complex-Valued Data
The Theory of Improper and Noncircular Signals

PETER J. SCHREIER University of Newcastle, New South Wales, Australia

LOUIS L. SCHARF Colorado State University, Colorado, USA

CAMBRIDGE UNIVERSITY PRESS

Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo, Delhi, Dubai, Tokyo

Cambridge University Press
The Edinburgh Building, Cambridge CB2 8RU, UK

Published in the United States of America by Cambridge University Press, New York

www.cambridge.org
Information on this title: www.cambridge.org/9780521897723

© Cambridge University Press 2010

This publication is in copyright. Subject to statutory exception and to the provision of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

First published in print format 2010

ISBN-13 978-0-511-67772-4 eBook (NetLibrary)
ISBN-13 978-0-521-89772-3 Hardback

Cambridge University Press has no responsibility for the persistence or accuracy of urls for external or third-party internet websites referred to in this publication, and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.

Contents

Preface
Notation

Part I Introduction

1 The origins and uses of complex signals
  1.1 Cartesian, polar, and complex representations of two-dimensional signals
  1.2 Simple harmonic oscillator and phasors
  1.3 Lissajous figures, ellipses, and electromagnetic polarization
  1.4 Complex modulation, the Hilbert transform, and complex analytic signals
    1.4.1 Complex modulation using the complex envelope
    1.4.2 The Hilbert transform, phase splitter, and analytic signal
    1.4.3 Complex demodulation
    1.4.4 Bedrosian’s theorem: the Hilbert transform of a product
    1.4.5 Instantaneous amplitude, frequency, and phase
    1.4.6 Hilbert transform and SSB modulation
    1.4.7 Passband filtering at baseband
  1.5 Complex signals for the efficient use of the FFT
    1.5.1 Complex DFT
    1.5.2 Twofer: two real DFTs from one complex DFT
    1.5.3 Twofer: one real 2N-DFT from one complex N-DFT
  1.6 The bivariate Gaussian distribution and its complex representation
    1.6.1 Bivariate Gaussian distribution
    1.6.2 Complex representation of the bivariate Gaussian distribution
    1.6.3 Polar coordinates and marginal pdfs
  1.7 Second-order analysis of the polarization ellipse
  1.8 Mathematical framework
  1.9 A brief survey of applications

2 Introduction to complex random vectors and processes
  2.1 Connection between real and complex descriptions
    2.1.1 Widely linear transformations
    2.1.2 Inner products and quadratic forms
  2.2 Second-order statistical properties
    2.2.1 Extending definitions from the real to the complex domain
    2.2.2 Characterization of augmented covariance matrices
    2.2.3 Power and entropy
  2.3 Probability distributions and densities
    2.3.1 Complex Gaussian distribution
    2.3.2 Conditional complex Gaussian distribution
    2.3.3 Scalar complex Gaussian distribution
    2.3.4 Complex elliptical distribution
  2.4 Sufficient statistics and ML estimators for covariances: complex Wishart distribution
  2.5 Characteristic function and higher-order statistical description
    2.5.1 Characteristic functions of Gaussian and elliptical distributions
    2.5.2 Higher-order moments
    2.5.3 Cumulant-generating function
    2.5.4 Circularity
  2.6 Complex random processes
    2.6.1 Wide-sense stationary processes
    2.6.2 Widely linear shift-invariant filtering
  Notes

Part II Complex random vectors

3 Second-order description of complex random vectors
  3.1 Eigenvalue decomposition
    3.1.1 Principal components
    3.1.2 Rank reduction and transform coding
  3.2 Circularity coefficients
    3.2.1 Entropy
    3.2.2 Strong uncorrelating transform (SUT)
    3.2.3 Characterization of complementary covariance matrices
  3.3 Degree of impropriety
    3.3.1 Upper and lower bounds
    3.3.2 Eigenvalue spread of the augmented covariance matrix
    3.3.3 Maximally improper vectors
  3.4 Testing for impropriety
  3.5 Independent component analysis
  Notes

4 Correlation analysis
  4.1 Foundations for measuring multivariate association between two complex random vectors
    4.1.1 Rotational, reflectional, and total correlations for complex scalars
    4.1.2 Principle of multivariate correlation analysis
    4.1.3 Rotational, reflectional, and total correlations for complex vectors
    4.1.4 Transformations into latent variables
  4.2 Invariance properties
    4.2.1 Canonical correlations
    4.2.2 Multivariate linear regression (half-canonical correlations)
    4.2.3 Partial least squares
  4.3 Correlation coefficients for complex vectors
    4.3.1 Canonical correlations
    4.3.2 Multivariate linear regression (half-canonical correlations)
    4.3.3 Partial least squares
  4.4 Correlation spread
  4.5 Testing for correlation structure
    4.5.1 Sphericity
    4.5.2 Independence within one data set
    4.5.3 Independence between two data sets
  Notes

5 Estimation
  5.1 Hilbert-space geometry of second-order random variables
  5.2 Minimum mean-squared error estimation
  5.3 Linear MMSE estimation
    5.3.1 The signal-plus-noise channel model
    5.3.2 The measurement-plus-error channel model
    5.3.3 Filtering models
    5.3.4 Nonzero means
    5.3.5 Concentration ellipsoids
    5.3.6 Special cases
  5.4 Widely linear MMSE estimation
    5.4.1 Special cases
    5.4.2 Performance comparison between LMMSE and WLMMSE estimation
  5.5 Reduced-rank widely linear estimation
    5.5.1 Minimize mean-squared error (min-trace problem)
    5.5.2 Maximize mutual information (min-det problem)
  5.6 Linear and widely linear minimum-variance distortionless response estimators
    5.6.1 Rank-one LMVDR receiver
    5.6.2 Generalized sidelobe canceler
    5.6.3 Multi-rank LMVDR receiver
    5.6.4 Subspace identification for beamforming and spectrum analysis
    5.6.5 Extension to WLMVDR receiver
  5.7 Widely linear-quadratic estimation
    5.7.1 Connection between real and complex quadratic forms
    5.7.2 WLQMMSE estimation
  Notes

6 Performance bounds for parameter estimation
  6.1 Frequentists and Bayesians
    6.1.1 Bias, error covariance, and mean-squared error
    6.1.2 Connection between frequentist and Bayesian approaches
    6.1.3 Extension to augmented errors
  6.2 Quadratic frequentist bounds
    6.2.1 The virtual two-channel experiment and the quadratic frequentist bound
    6.2.2 Projection-operator and integral-operator representations of quadratic frequentist bounds
    6.2.3 Extension of the quadratic frequentist bound to improper errors and scores
  6.3 Fisher score and the Cramér–Rao bound
    6.3.1 Nuisance parameters
    6.3.2 The Cramér–Rao bound in the proper multivariate Gaussian model
    6.3.3 The separable linear statistical model and the geometry of the Cramér–Rao bound
    6.3.4 Extension of Fisher score and the Cramér–Rao bound to improper errors and scores
    6.3.5 The Cramér–Rao bound in the improper multivariate Gaussian model
    6.3.6 Fisher score and Cramér–Rao bounds for functions of parameters
  6.4 Quadratic Bayesian bounds
  6.5 Fisher–Bayes score and Fisher–Bayes bound
    6.5.1 Fisher–Bayes score and information
    6.5.2 Fisher–Bayes bound
  6.6 Connections and orderings among bounds
  Notes

7 Detection
  7.1 Binary hypothesis testing
    7.1.1 The Neyman–Pearson lemma
    7.1.2 Bayes detectors
    7.1.3 Adaptive Neyman–Pearson and empirical Bayes detectors
  7.2 Sufficiency and invariance
  7.3 Receiver operating characteristic
  7.4 Simple hypothesis testing in the improper Gaussian model
    7.4.1 Uncommon means and common covariance
    7.4.2 Common mean and uncommon covariances
    7.4.3 Comparison between linear and widely linear detection
  7.5 Composite hypothesis testing and the Karlin–Rubin theorem
  7.6 Invariance in hypothesis testing
    7.6.1 Matched subspace detector
    7.6.2 CFAR matched subspace detector
  Notes

Part III Complex random processes

8 Wide-sense stationary processes
  8.1 Spectral representation and power spectral density
  8.2 Filtering
    8.2.1 Analytic and complex baseband signals
    8.2.2 Noncausal Wiener filter
  8.3 Causal Wiener filter
    8.3.1 Spectral factorization
    8.3.2 Causal synthesis, analysis, and Wiener filters
  8.4 Rotary-component and polarization analysis
    8.4.1 Rotary components
    8.4.2 Rotary components of random signals
    8.4.3 Polarization and coherence
    8.4.4 Stokes and Jones vectors
    8.4.5 Joint analysis of two signals
  8.5 Higher-order spectra
    8.5.1 Moment spectra and principal domains
    8.5.2 Analytic signals
  Notes

9 Nonstationary processes
  9.1 Karhunen–Loève expansion
    9.1.1 Estimation
    9.1.2 Detection
  9.2 Cramér–Loève spectral representation
    9.2.1 Four-corners diagram
    9.2.2 Energy and power spectral densities
    9.2.3 Analytic signals
    9.2.4 Discrete-time signals
  9.3 Rihaczek time–frequency representation
    9.3.1 Interpretation
    9.3.2 Kernel estimators
  9.4 Rotary-component and polarization analysis
    9.4.1 Ellipse properties
    9.4.2 Analytic signals
  9.5 Higher-order statistics
  Notes

10 Cyclostationary processes
  10.1 Characterization and spectral properties
    10.1.1 Cyclic power spectral density
    10.1.2 Cyclic spectral coherence
    10.1.3 Estimating the cyclic power-spectral density
  10.2 Linearly modulated digital communication signals
    10.2.1 Symbol-rate-related cyclostationarity
    10.2.2 Carrier-frequency-related cyclostationarity
    10.2.3 Cyclostationarity as frequency diversity
  10.3 Cyclic Wiener filter
  10.4 Causal filter-bank implementation of the cyclic Wiener filter
    10.4.1 Connection between scalar CS and vector WSS processes
    10.4.2 Sliding-window filter bank
    10.4.3 Equivalence to FRESH filtering
    10.4.4 Causal approximation
  Notes

Appendix 1 Rudiments of matrix analysis
  A1.1 Matrix factorizations
    A1.1.1 Partitioned matrices
    A1.1.2 Eigenvalue decomposition
    A1.1.3 Singular value decomposition
  A1.2 Positive definite matrices
    A1.2.1 Matrix square root and Cholesky decomposition
    A1.2.2 Updating the Cholesky factors of a Grammian matrix
    A1.2.3 Partial ordering
    A1.2.4 Inequalities
  A1.3 Matrix inverses
    A1.3.1 Partitioned matrices
    A1.3.2 Moore–Penrose pseudo-inverse
    A1.3.3 Projections

Appendix 2 Complex differential calculus (Wirtinger calculus)
  A2.1 Complex gradients
    A2.1.1 Holomorphic functions
    A2.1.2 Complex gradients and Jacobians
    A2.1.3 Properties of Wirtinger derivatives
  A2.2 Special cases
  A2.3 Complex Hessians
    A2.3.1 Properties
    A2.3.2 Extension to complex-valued functions

Appendix 3 Introduction to majorization
  A3.1 Basic definitions
    A3.1.1 Majorization
    A3.1.2 Schur-convex functions
  A3.2 Tests for Schur-convexity
    A3.2.1 Specialized tests
    A3.2.2 Functions defined on D
  A3.3 Eigenvalues and singular values
    A3.3.1 Diagonal elements and eigenvalues
    A3.3.2 Diagonal elements and singular values
    A3.3.3 Partitioned matrices

References
Index

Preface

Complex-valued random signals are embedded into the very fabric of science and engineering, being essential to communications, radar, sonar, geophysics, oceanography, optics, electromagnetics, acoustics, and other applied sciences. A great many problems in detection, estimation, and signal analysis may be phrased in terms of two channels’ worth of real signals. It is common practice in science and engineering to place these signals into the real and imaginary parts of a complex signal. Complex representations bring economies and insights that are difficult to achieve with real representations. In the past, it has often been assumed – usually implicitly – that complex random signals are proper and circular. A proper complex random variable is uncorrelated with its complex conjugate, and a circular complex random variable has a probability distribution that is invariant under rotation in the complex plane. These assumptions are convenient because they simplify computations and, in many aspects, make complex random signals look and behave like real random signals. Yet, while these assumptions can often be justified, there are also many cases in which proper and circular random signals are very poor models of the underlying physics. This fact has been known and appreciated by oceanographers since the early 1970s, but it has only recently been accepted across disciplines by acousticians, optical scientists, and communication theorists. This book develops the tools and algorithms that are necessary to deal with improper complex random variables, which are correlated with their complex conjugate, and with noncircular complex random variables, whose probability distribution varies under rotation in the complex plane. Accounting for the improper and noncircular nature of complex signals can have big payoffs. In digital communications, it can lead to a significantly improved tradeoff between spectral efficiency and power consumption. 
In array processing, it can enable us to estimate with increased accuracy the direction of arrival of one or more signals impinging on a sensor array. In independent component analysis, it may be possible to blindly separate Gaussian sources – something that is impossible if these sources are proper. In the electrical engineering literature, the story of improper and noncircular complex signals began with Brown and Crane, Gardner, van den Bos, Picinbono, and their coworkers. They have laid the foundations for the theory we aim to review and extend in this research monograph, and to them we dedicate this book. The story is continuing, with work by a number of our colleagues who are publishing new findings as we write this preface. We have tried to stay up to date with their work by referencing it as carefully as we have been able. We ask their forbearance for results not included.

Outline of this book

The book can be divided into three parts. Part I (Chapters 1 and 2) gives an overview and introduction to complex random vectors and processes. In Chapter 1, we describe the origins and uses of complex signals. The chapter answers the following question: why do engineers and applied scientists represent real measurable effects by complex signals? Chapter 2 lays the foundation for the remainder of the book by introducing important concepts and definitions for complex random vectors and processes, such as widely linear transformations, complementary correlations, the multivariate improper Gaussian distribution, and complementary power spectra of wide-sense stationary processes. Chapter 2 should be read before proceeding to any of the later chapters.

Part II (Chapters 3–7) deals with complex random vectors and their application to correlation analysis, estimation, performance bounding, and detection. In Chapter 3, we discuss in detail the second-order description of a complex random vector. In particular, we are interested in those second-order properties that are invariant under either widely unitary or widely linear transformation. This leads us to a test for impropriety and applications in independent component analysis (ICA). Chapter 4 treats the assessment of multivariate association between two complex random vectors. We provide a unifying treatment of three popular correlation-analysis techniques: canonical correlation analysis, multivariate linear regression, and partial least squares. We also present several generalized likelihood-ratio tests for the correlation structure of complex Gaussian data, such as sphericity, independence within one data set, and independence between two data sets. Chapter 5 is on estimation. Here we are interested in linear and widely linear least-squares problems, wherein parameter estimators are constrained to be linear or widely linear in the measurement and the performance criterion is mean-squared error or squared error under a constraint. Chapter 6 deals with performance bounds for parameter estimation. We consider quadratic performance bounds of the Weiss–Weinstein class, the most notable representatives of which are the Cramér–Rao and Fisher–Bayes bound. Chapter 7 addresses detection, where the problem is to determine which of two or more competing models best describes experimental measurements. In order to demonstrate the role of widely linear and widely quadratic forms in the theory of hypothesis testing, we concentrate on hypothesis testing within Gaussian measurement models.

Part III (Chapters 8–10) deals with complex random processes, both continuous- and discrete-time. Throughout this part, we focus on second-order spectral properties, and optimum linear (or widely linear) minimum mean-squared error filtering. Chapter 8 discusses wide-sense stationary (WSS) processes, with a focus on the role of the complementary power spectral density in rotary-component and polarization analysis. WSS processes admit a spectral representation in terms of the Fourier basis, which allows a frequency interpretation. The transform-domain description of a WSS signal is a spectral process with orthogonal increments. For nonstationary signals, we have to sacrifice either the Fourier basis and thus its frequency interpretation, or the orthogonality of the transform-domain representation. In Chapter 9, we will discuss both possibilities, which leads either to the Karhunen–Loève expansion or the Cramér–Loève spectral representation. The latter is the basis for bilinear time–frequency representations. Then, in Chapter 10 we treat cyclostationary processes. They are an important class of nonstationary processes that have periodically varying correlation properties. They can model periodic phenomena occurring in science and technology, including communications, meteorology, oceanography, climatology, astronomy, and economics.

Three appendices provide background material. Appendix 1 presents rudiments of matrix analysis. Appendix 2 introduces Wirtinger calculus, which enables us to compute generalized derivatives of a real function with respect to complex parameters. Finally, Appendix 3 discusses majorization, which is used at several places in this book. Majorization introduces a preordering of vectors, and it will allow us to optimize certain scalar real-valued functions with respect to real vector-valued parameters.

This book is mainly targeted at researchers and graduate students who rely on the theory of signals and systems to conduct their work in signal processing, communications, radar, sonar, optics, electromagnetics, acoustics, oceanography, geophysics, and geography. Although it is not primarily intended as a textbook, chapters of the book may be used to support a special-topics course at a second-year graduate level. We would expect readers to be familiar with basic probability theory, linear systems, and linear algebra, at a level covered in a typical first-year graduate course.

Acknowledgments

We would like to thank Dr. Patrik Wahlberg for giving us detailed feedback on many chapters of this book. We further thank Dr. Phil Meyler of Cambridge University Press for his support throughout the writing of this book. Peter Schreier acknowledges financial support from the Australian Research Council (ARC) under its Discovery Project scheme, and thanks Colorado State University, Ft. Collins, USA, for its hospitality during a five-month study leave in the winter and spring of 2008 in the northern hemisphere. Louis Scharf acknowledges years of research support by the Office of Naval Research and the National Science Foundation (NSF), and thanks the University of Newcastle, Australia, for its hospitality during a one-month study leave in the autumn of 2009 in the southern hemisphere.

Peter J. Schreier
Newcastle, New South Wales, Australia

Louis L. Scharf
Ft. Collins, Colorado, USA

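Notation

Conventions

$\langle x, y \rangle$    inner product
$\|x\|$    norm (usually Euclidean)
$\hat{x}$    estimate of $x$
$\tilde{x}$    complementary quantity to $x$
$x \perp y$    $x$ is orthogonal to $y$

Vectors and matrices

$\mathbf{x}$    column-vector with components $x_i$
$\mathbf{x} \prec \mathbf{y}$    $\mathbf{x}$ is majorized by $\mathbf{y}$
$\mathbf{x} \prec_w \mathbf{y}$    $\mathbf{x}$ is weakly majorized by $\mathbf{y}$
$\mathbf{X}$    matrix with components $(\mathbf{X})_{ij} = X_{ij}$
$\mathbf{X} > \mathbf{Y}$    $\mathbf{X} - \mathbf{Y}$ is positive definite
$\mathbf{X} \ge \mathbf{Y}$    $\mathbf{X} - \mathbf{Y}$ is positive semidefinite (nonnegative definite)
$\mathbf{X}^*$    complex conjugate
$\mathbf{X}^T$    transpose
$\mathbf{X}^H = (\mathbf{X}^T)^*$    Hermitian (conjugate) transpose
$\mathbf{X}^\dagger$    Moore–Penrose pseudo-inverse
$\langle \mathbf{X} \rangle$    subspace spanned by columns of $\mathbf{X}$
$\underline{\mathbf{x}} = \begin{bmatrix} \mathbf{x} \\ \mathbf{x}^* \end{bmatrix}$    augmented vector
$\underline{\mathbf{X}} = \begin{bmatrix} \mathbf{X}_1 & \mathbf{X}_2 \\ \mathbf{X}_2^* & \mathbf{X}_1^* \end{bmatrix}$    augmented matrix

Functions

$x(t)$    continuous-time signal
$x[k]$    discrete-time signal
$\hat{x}(t)$    Hilbert transform of $x(t)$; estimate of $x(t)$
$X(f)$    scalar-valued Fourier transform of $x(t)$
$\mathbf{X}(f)$    vector-valued Fourier transform of $\mathbf{x}(t)$
$\mathbf{X}(f)$    matrix-valued Fourier transform of $\mathbf{X}(t)$
$X(z)$    scalar-valued z-transform of $x[k]$
$\mathbf{X}(z)$    vector-valued z-transform of $\mathbf{x}[k]$
$\mathbf{X}(z)$    matrix-valued z-transform of $\mathbf{X}[k]$
$x(t) * y(t) = (x * y)(t)$    convolution of $x(t)$ and $y(t)$

Commonly used symbols and operators

$\arg(x) = \angle x$    argument (phase) of complex $x$
$\mathbb{C}$    field of complex numbers
$\mathbb{C}_*^{2n}$    set of augmented vectors $\underline{\mathbf{x}} = [\mathbf{x}^T, \mathbf{x}^H]^T$, $\mathbf{x} \in \mathbb{C}^n$
$\delta(x)$    Dirac δ-function (distribution)
$\det(\mathbf{X})$    matrix determinant
$\mathrm{diag}(\mathbf{X})$    vector of diagonal values $X_{11}, X_{22}, \ldots, X_{nn}$ of $\mathbf{X}$
$\mathrm{Diag}(x_1, \ldots, x_n)$    diagonal or block-diagonal matrix with diagonal elements $x_1, \ldots, x_n$
$\mathbf{e}$    error vector
$E(x)$    expectation of $x$
$\mathrm{ev}(\mathbf{X})$    vector of eigenvalues of $\mathbf{X}$, ordered decreasingly
$\mathbf{I}$    identity matrix
$\mathrm{Im}\, x$    imaginary part of $x$
$\mathbf{K}$    matrix of canonical/half-canonical correlations $k_i$
$\boldsymbol{\Lambda}$    matrix of eigenvalues $\lambda_i$
$\mathbf{m}_x$    sample mean vector of $\mathbf{x}$
$\boldsymbol{\mu}_x$    mean vector of $\mathbf{x}$
$p_x(x)$    probability density function (pdf) of $x$ (often used without subscript)
$\mathbf{P}_U$    orthogonal projection onto subspace $U$
$P_{xx}(f)$    power spectral density (PSD) of $x(t)$
$\tilde{P}_{xx}(f)$    complementary power spectral density (C-PSD) of $x(t)$
$\underline{\mathbf{P}}_{xx}(f)$    augmented PSD matrix of $x(t)$
$\mathbf{Q}$    error covariance matrix
$\rho$    correlation coefficient; degree of impropriety; coherence
$\mathbb{R}$    field of real numbers
$\mathbf{R}_{xy}$    cross-covariance matrix of $\mathbf{x}$ and $\mathbf{y}$
$\tilde{\mathbf{R}}_{xy}$    complementary cross-covariance matrix of $\mathbf{x}$ and $\mathbf{y}$
$\underline{\mathbf{R}}_{xy}$    augmented cross-covariance matrix of $\mathbf{x}$ and $\mathbf{y}$
$\mathbf{R}_{xy}$    covariance matrix of composite vector $[\mathbf{x}^T, \mathbf{y}^T]^T$
$r_{xy}(t, \tau)$    cross-covariance function of $x(t)$ and $y(t)$
$\tilde{r}_{xy}(t, \tau)$    complementary cross-covariance function of $x(t)$ and $y(t)$
$\mathrm{Re}\, x$    real part of $x$
$\mathrm{sgn}(x)$    sign of $x$
$\mathrm{sv}(\mathbf{X})$    vector of singular values of $\mathbf{X}$, ordered decreasingly
$\mathbf{S}_{xx}$    sample covariance matrix of $\mathbf{x}$
$\tilde{\mathbf{S}}_{xx}$    sample complementary covariance matrix of $\mathbf{x}$
$\underline{\mathbf{S}}_{xx}$    augmented sample covariance matrix of $\mathbf{x}$
$S_{xx}(\nu, f)$    (Loève) spectral correlation of $x(t)$
$\tilde{S}_{xx}(\nu, f)$    (Loève) complementary spectral correlation of $x(t)$
$\mathbf{T} = \begin{bmatrix} \mathbf{I} & j\mathbf{I} \\ \mathbf{I} & -j\mathbf{I} \end{bmatrix}$    real-to-complex transformation matrix
$\mathrm{tr}(\mathbf{X})$    trace
$\mathbf{W}$    Wiener (linear or widely linear minimum mean-squared error) filter matrix
$\mathcal{W}^{m \times n}$    set of $2m \times 2n$ augmented matrices
$\mathbf{x} = \mathbf{u} + j\mathbf{v}$    complex message/source
$\boldsymbol{\xi}$    internal (latent) description of $\mathbf{x}$
$x(t) = u(t) + jv(t)$    complex continuous-time message/source signal
$\xi(f)$    spectral process corresponding to $x(t)$
$\mathbf{y} = \mathbf{a} + j\mathbf{b}$    complex measurement/observation
$y(t)$    complex continuous-time measurement/observation signal
$\upsilon(f)$    spectral process corresponding to $y(t)$
$\boldsymbol{\omega}$    internal (latent) description of $\mathbf{y}$
$\Omega$    sample space
$\mathbf{0}$    zero vector or matrix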

Part I Introduction

1 The origins and uses of complex signals

Engineering and applied science rely heavily on complex variables and complex analysis to model and analyze real physical effects. Why should this be so? That is, why should real measurable effects be represented by complex signals? The ready answer is that one complex signal (or channel) can carry information about two real signals (or two real channels), and the algebra and geometry of analyzing these two real signals as if they were one complex signal brings economies and insights that would not otherwise emerge. But ready answers beg for clarity. In this chapter we aim to provide it. In the bargain, we intend to clarify the language of engineers and applied scientists who casually speak of complex velocities, complex electromagnetic fields, complex baseband signals, complex channels, and so on, when what they are really speaking of is the x- and y-coordinates of velocity, the x- and y-components of an electric field, the in-phase and quadrature components of a modulating waveform, and the sine and cosine channels of a modulator or demodulator. For electromagnetics, oceanography, atmospheric science, and other disciplines where two-dimensional trajectories bring insight into the underlying physics, it is the complex representation of an ellipse that motivates an interest in complex analysis. For communication theory and signal processing, where amplitude and phase modulations carry information, it is the complex baseband representation of a real bandpass signal that motivates an interest in complex analysis. In Section 1.1, we shall begin with an elementary introduction to complex representations for Cartesian coordinates and two-dimensional signals. Then we shall proceed to a discussion of phasors and Lissajous figures in Sections 1.2 and 1.3. We will find that phasors are a complex representation for the motion of an undamped harmonic oscillator and Lissajous figures are a complex representation for polarized electromagnetic fields. 
The study of communication signals in Section 1.4 then leads to the Hilbert transform, the complex analytic signal, and various principles for modulating signals. Section 1.5 demonstrates how real signals can be loaded into the real and imaginary parts of a complex signal in order to make efficient use of the fast Fourier transform (FFT).

The second half of this chapter deals with complex random variables and signals. In Section 1.6, we introduce the univariate complex Gaussian probability density function (pdf) as an alternative parameterization for the bivariate pdf of two real correlated Gaussian random variables. We will see that the well-known form of the univariate complex Gaussian pdf models only a special case of the bivariate real pdf, where the two real random variables are independent and have equal variances. This special case is called proper or circular, and it corresponds to a uniform phase distribution of the complex random variable. In general, however, the complex Gaussian pdf depends not only on the variance but also on another term, which we will call the complementary variance. In Section 1.7, we extend this discussion to complex random signals. Using the polarization ellipse as an example, we will find an interplay of reality/complexity, propriety/impropriety, and wide-sense stationarity/nonstationarity. Section 1.8 provides a first glance at the mathematical framework that underpins the study of complex random variables in this book. Finally, Section 1.9 gives a brief survey of some recent papers that apply the theory of improper and noncircular complex random signals in communications, array processing, machine learning, acoustics, optics, and oceanography.
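As a rough numerical illustration of the distinction between the variance and the complementary variance, the following sketch estimates both from simulated zero-mean samples. It assumes the usual definitions, with the variance taken as E|x|^2 and the complementary variance as E[x^2]; the function name `second_order_stats`, the sample size, and the chosen standard deviations are illustrative choices, not part of the text.

```python
import random


def second_order_stats(samples):
    """Estimate (variance, complementary variance) of zero-mean complex samples.

    Assumed definitions: variance = E|x|^2, complementary variance = E[x^2].
    """
    n = len(samples)
    variance = sum(abs(x) ** 2 for x in samples) / n   # real and nonnegative
    complementary = sum(x * x for x in samples) / n    # complex in general
    return variance, complementary


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 20000

# Proper case: independent real and imaginary parts with equal variances.
proper = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]

# Improper case: independent parts, but with unequal variances (4 and 1).
improper = [complex(rng.gauss(0, 2), rng.gauss(0, 1)) for _ in range(n)]

var_p, comp_p = second_order_stats(proper)
var_i, comp_i = second_order_stats(improper)

print("proper:   variance ~", round(var_p, 2), " |complementary| ~", round(abs(comp_p), 2))
print("improper: variance ~", round(var_i, 2), " |complementary| ~", round(abs(comp_i), 2))
```

In the proper case the complementary variance estimate should be close to zero, whereas in the improper case it should be close to the difference of the two real variances, so the variance alone no longer describes the second-order behavior.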

1.1 Cartesian, polar, and complex representations of two-dimensional signals

It is commonplace to represent two Cartesian coordinates $(u, v)$ in their two polar coordinates $(A, \theta)$, or as the single complex coordinate $x = u + jv = Ae^{j\theta}$. The real coordinates $(u, v) \longleftrightarrow (A, \theta)$ are thus equivalent to the complex coordinates $u + jv \longleftrightarrow Ae^{j\theta}$. The virtue of this complex representation is that it leads to an economical algebra and an evocative geometry, especially when polar coordinates $A$ and $\theta$ are used. This virtue extends to vector-valued coordinates $(\mathbf{u}, \mathbf{v})$, with complex representation $\mathbf{x} = \mathbf{u} + j\mathbf{v}$. For example, $\mathbf{x}$ could be a mega-vector composed by stacking scan lines from a stereoscopic image, in which case $\mathbf{u}$ would be the image recorded by camera one and $\mathbf{v}$ would be the image recorded by camera two. In oceanographic applications, $\mathbf{u}$ and $\mathbf{v}$ could be the two orthogonal components of surface velocity and $\mathbf{x}$ would be the complex velocity. Or $\mathbf{x}$ could be a window’s worth of a discrete-time communications signal. In the context of communications, radar, and sonar, $\mathbf{u}$ and $\mathbf{v}$ are called the in-phase and quadrature components, respectively, and they are obtained as sampled-data versions of a continuous-time signal that has been demodulated with a quadrature demodulator. The quadrature demodulator itself is designed to extract a baseband information-bearing signal from a passband carrying signal. This is explained in more detail in Section 1.4.

The virtue of complex representations extends to the analysis of time-varying coordinates $(u(t), v(t))$, which we call two-dimensional signals, and which we represent as the complex signal $x(t) = u(t) + jv(t) = A(t)e^{j\theta(t)}$. Of course, the next generalization of this narrative would be to vector-valued complex signals $\mathbf{x}(t) = [x_1(t), x_2(t), \ldots, x_N(t)]^T$, a generalization that produces technical difficulties, but not conceptual ones. The two best examples are complex-demodulated signals in a multi-sensor antenna array, in which case $x_k(t)$ is the complex signal recorded at sensor $k$, and complex-demodulated signals in spectral subbands of a wideband communication signal, in which case $x_k(t)$ is the complex signal recorded in subband $k$. When these signals are themselves sampled in time, then the vector-valued discrete-time sequence is $\mathbf{x}[n]$, with $\mathbf{x}[n] = \mathbf{x}(nT)$ a sampled-data version of $\mathbf{x}(t)$.
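The equivalence between the Cartesian, polar, and complex descriptions can be checked with a few lines of code. The sketch below uses only Python's standard `cmath` and `math` modules; the numerical values and the variable names are arbitrary illustrations. It also hints at the economical algebra mentioned above: a rotation of the point $(u, v)$, which needs a 2 × 2 matrix in real coordinates, becomes a single complex multiplication.

```python
import cmath
import math

# Cartesian coordinates (u, v) and the single complex coordinate x = u + jv.
u, v = 3.0, 4.0
x = complex(u, v)

# Polar coordinates (A, theta), so that x = A * exp(j * theta).
A, theta = cmath.polar(x)
x_back = cmath.rect(A, theta)  # back from polar to Cartesian form

# Rotation by phi in real coordinates: a 2 x 2 matrix acting on (u, v).
phi = math.pi / 2
u_rot = math.cos(phi) * u - math.sin(phi) * v
v_rot = math.sin(phi) * u + math.cos(phi) * v

# The same rotation in the complex representation: one multiplication.
x_rot = x * cmath.exp(1j * phi)

print("A =", A, " theta =", theta)
print("real rotation:   ", (u_rot, v_rot))
print("complex rotation:", x_rot)
```

Both rotations should agree up to floating-point rounding, and the polar form makes the effect obvious: the amplitude $A$ is unchanged and the phase $\theta$ is advanced by the rotation angle.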


This introductory account of complex signals gives us the chance to remake a very important point. In engineering and applied science, measured signals are real. Correspondingly, in all of our examples, the components u and v are real. It is only our representation x that is complex. Thus one channel’s worth of complex signal serves to represent two channels’ worth of real signals. There is no fundamental reason why this would have to be done. We aim to make the point in this book that the algebraic economies, probabilistic computations, and geometrical insights that accrue to complex representations justify their use. The examples of the next several sections give a preview of the power of complex representations.

1.2 Simple harmonic oscillator and phasors

The damped harmonic oscillator models damped pendulums and second-order electrical and mechanical systems. A measurement (of position or voltage) in such a system obeys the second-order, homogeneous, linear differential equation

$$\frac{d^2}{dt^2}u(t) + 2\xi\omega_0\frac{d}{dt}u(t) + \omega_0^2 u(t) = 0. \tag{1.1}$$

The corresponding characteristic equation is

$$s^2 + 2\xi\omega_0 s + \omega_0^2 = 0. \tag{1.2}$$

If the damping coefficient $\xi$ satisfies $0 \le \xi < 1$, the system is called underdamped, and the quadratic equation (1.2) has two complex conjugate roots $s_1 = -\xi\omega_0 + j\sqrt{1-\xi^2}\,\omega_0$ and $s_2 = s_1^*$. The real homogeneous response of the damped harmonic oscillator is then

$$u(t) = \tfrac{1}{2}Ae^{j\theta}e^{s_1 t} + \tfrac{1}{2}Ae^{-j\theta}e^{s_1^* t} = \mathrm{Re}\,\{Ae^{j\theta}e^{s_1 t}\} = Ae^{-\xi\omega_0 t}\cos\!\left(\sqrt{1-\xi^2}\,\omega_0 t + \theta\right), \tag{1.3}$$

and $A$ and $\theta$ may be determined from the initial values of $u(t)$ and $(d/dt)u(t)$ at $t = 0$. The real response (1.3) is the sum of two complex-conjugate modal responses, or equivalently the real part of $Ae^{j\theta}e^{s_1 t}$. In anticipation of our continuing development, we might say that $Ae^{j\theta}e^{s_1 t}$ is a complex representation of the real signal $u(t)$. For the undamped system with damping coefficient $\xi = 0$, we have $s_1 = j\omega_0$ and the solution is

$$u(t) = \mathrm{Re}\,\{Ae^{j\theta}e^{j\omega_0 t}\} = A\cos(\omega_0 t + \theta). \tag{1.4}$$

In this case, $Ae^{j\theta}e^{j\omega_0 t}$ is the complex representation of the real signal $A\cos(\omega_0 t + \theta)$. The complex signal in its polar form

$$x(t) = Ae^{j(\omega_0 t+\theta)} = Ae^{j\theta}e^{j\omega_0 t}, \qquad t \in \mathbb{R}, \tag{1.5}$$

is called a rotating phasor. The rotator $e^{j\omega_0 t}$ rotates the stationary phasor $Ae^{j\theta}$ at the angular rate of $\omega_0$ radians per second. The rotating phasor is periodic with period $2\pi/\omega_0$, thus overwriting itself every $2\pi/\omega_0$ seconds. Euler's identity allows us to express the rotating phasor in its Cartesian form as

$$x(t) = A\cos(\omega_0 t + \theta) + jA\sin(\omega_0 t + \theta). \tag{1.6}$$
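As a numerical illustration of (1.3) and (1.5), the following minimal sketch evaluates the complex modal response and compares its real part with the closed-form damped cosine. The parameter values and the function names `modal_response` and `real_response` are assumptions chosen only for this example.

```python
import cmath
import math

def modal_response(t, A, theta, xi, omega0):
    """Complex representation A e^{j theta} e^{s1 t} of the underdamped response."""
    s1 = complex(-xi * omega0, math.sqrt(1.0 - xi**2) * omega0)
    return A * cmath.exp(1j * theta) * cmath.exp(s1 * t)

def real_response(t, A, theta, xi, omega0):
    """Closed form of (1.3): A e^{-xi w0 t} cos(sqrt(1 - xi^2) w0 t + theta)."""
    wd = math.sqrt(1.0 - xi**2) * omega0
    return A * math.exp(-xi * omega0 * t) * math.cos(wd * t + theta)

# Illustrative (assumed) parameter values
A, theta, xi, omega0 = 2.0, 0.3, 0.1, 2.0 * math.pi
times = [0.05 * k for k in range(40)]

# The real part of the complex modal response matches the damped cosine
max_err = max(abs(modal_response(t, A, theta, xi, omega0).real
                  - real_response(t, A, theta, xi, omega0)) for t in times)

# Undamped case (xi = 0): the rotating phasor keeps constant magnitude A
phasor_magnitudes = [abs(modal_response(t, A, theta, 0.0, omega0)) for t in times]
```

With `xi = 0` the trajectory stays on a circle of radius `A`, which is the rotating phasor of (1.5).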

Figure 1.1 Stationary and rotating phasors.

Thus, the complex representation of the undamped simple harmonic oscillator turns out to be the trajectory in the complex plane of a rotating phasor of radian frequency ω0 , with starting point Aejθ at t = 0. The rotating phasor of Fig. 1.1 is illustrative. The rotating phasor is one of the most fundamental complex signals we shall encounter in this book, as it is a basic building block for more complicated signals. As we build these more complicated signals, we will allow A and θ to be correlated random processes.

1.3 Lissajous figures, ellipses, and electromagnetic polarization

We might say that the circularly rotating phasor $x(t) = Ae^{j\theta}e^{j\omega_0 t} = A\cos(\omega_0 t + \theta) + jA\sin(\omega_0 t + \theta)$ is the simplest of Lissajous figures, consisting of real and imaginary parts that are $\pi/2$ radians out of phase. A more general Lissajous figure allows complex signals of the form

$$x(t) = u(t) + jv(t) = A_u\cos(\omega_0 t + \theta_u) + jA_v\cos(\omega_0 t + \theta_v). \tag{1.7}$$

Here the real part $u(t)$ and the imaginary part $v(t)$ can be mismatched in amplitude and phase. This Lissajous figure overwrites itself with period $2\pi/\omega_0$ and turns out an ellipse in the complex plane. (This is still not the most general Lissajous figure, since Lissajous figures generally also allow different frequencies in the $u$- and $v$-components.) In electromagnetic theory, this complex signal would be the time-varying position of the electric field vector in the $(u, v)$-plane perpendicular to the direction of propagation. Over time, as the electric field vector propagates, it turns out an elliptical corkscrew in three-dimensional space. But in the two-dimensional plane perpendicular to the direction of propagation, it turns out an ellipse, so the electric field is said to be elliptically polarized. As this representation shows, the elliptical polarization may be modeled, and in fact produced, by the superposition of a one-dimensional, linearly polarized, component of the form $A_u\cos(\omega_0 t + \theta_u)$ in the $u$-direction and another of the form $A_v\cos(\omega_0 t + \theta_v)$ in the $v$-direction.

But there is more. Euler's identity may be used to write the electric field vector as

$$\begin{aligned}
x(t) &= \tfrac{1}{2}A_u e^{j\theta_u}e^{j\omega_0 t} + \tfrac{1}{2}A_u e^{-j\theta_u}e^{-j\omega_0 t} + \tfrac{1}{2}jA_v e^{j\theta_v}e^{j\omega_0 t} + \tfrac{1}{2}jA_v e^{-j\theta_v}e^{-j\omega_0 t} \\
&= \underbrace{\tfrac{1}{2}\left(A_u e^{j\theta_u} + jA_v e^{j\theta_v}\right)}_{A_+e^{j\theta_+}} e^{j\omega_0 t} + \underbrace{\tfrac{1}{2}\left(A_u e^{-j\theta_u} + jA_v e^{-j\theta_v}\right)}_{A_-e^{-j\theta_-}} e^{-j\omega_0 t}.
\end{aligned} \tag{1.8}$$

This representation of the two-dimensional electric field shows it to be the superposition of a two-dimensional, circularly polarized, component of the form $A_+e^{j\theta_+}e^{j\omega_0 t}$ and another of the form $A_-e^{-j\theta_-}e^{-j\omega_0 t}$. The first rotates counterclockwise (CCW) and is said to be left-circularly polarized. The second rotates clockwise (CW) and is said to be right-circularly polarized. In this representation, the complex constants $A_+e^{j\theta_+}$ and $A_-e^{-j\theta_-}$ fix the amplitude and phase of their respective circularly polarized components.

The circular representation of the ellipse makes it easy to determine the orientation of the ellipse and the lengths of the major and minor axes. In fact, by noting that the magnitude-squared of $x(t)$ is $|x(t)|^2 = A_+^2 + 2A_+A_-\cos(\theta_+ + \theta_- + 2\omega_0 t) + A_-^2$, it is easy to see that $|x(t)|^2$ has a maximum value of $(A_+ + A_-)^2$ at $\theta_+ + \theta_- + 2\omega_0 t = 2k\pi$, and a minimum value of $(A_+ - A_-)^2$ at $\theta_+ + \theta_- + 2\omega_0 t = (2k+1)\pi$. This orients the major axis of the ellipse at angle $(\theta_+ - \theta_-)/2$ and fixes the major and minor axis lengths at $2(A_+ + A_-)$ and $2|A_+ - A_-|$. A typical polarization ellipse is illustrated in Fig. 1.2.

Figure 1.2 A typical polarization ellipse.
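The conversion from the linear parameters $(A_u, \theta_u, A_v, \theta_v)$ to the circular parameters of (1.8) can be sketched numerically. The example below is a minimal illustration under assumed parameter values; the function name `circular_components` is hypothetical. It checks that both descriptions trace the same ellipse and that the sampled extreme radii agree with $A_+ + A_-$ and $|A_+ - A_-|$.

```python
import cmath
import math

def circular_components(Au, theta_u, Av, theta_v):
    """Complex constants (A+ e^{j theta+}, A- e^{-j theta-}) from (1.8)."""
    c_plus = 0.5 * (Au * cmath.exp(1j * theta_u) + 1j * Av * cmath.exp(1j * theta_v))
    c_minus = 0.5 * (Au * cmath.exp(-1j * theta_u) + 1j * Av * cmath.exp(-1j * theta_v))
    return c_plus, c_minus

# Illustrative (assumed) linear-polarization parameters
Au, theta_u, Av, theta_v = 2.0, 0.4, 1.0, -0.9
omega0 = 1.0

c_plus, c_minus = circular_components(Au, theta_u, Av, theta_v)
A_plus, theta_plus = abs(c_plus), cmath.phase(c_plus)
A_minus, theta_minus = abs(c_minus), -cmath.phase(c_minus)

semi_major = A_plus + A_minus            # half of the major axis length
semi_minor = abs(A_plus - A_minus)       # half of the minor axis length
orientation = 0.5 * (theta_plus - theta_minus)

# Sample one period of (1.7) and of its circular decomposition
N = 3600
points, recon_err = [], 0.0
for k in range(N):
    t = 2.0 * math.pi * k / (N * omega0)
    x = Au * math.cos(omega0 * t + theta_u) + 1j * Av * math.cos(omega0 * t + theta_v)
    x_circ = c_plus * cmath.exp(1j * omega0 * t) + c_minus * cmath.exp(-1j * omega0 * t)
    recon_err = max(recon_err, abs(x - x_circ))
    points.append(x)

radii = [abs(x) for x in points]
k_max = max(range(N), key=lambda k: radii[k])
# Angle of the sampled point of maximum radius, compared modulo pi
gap = (cmath.phase(points[k_max]) - orientation) % math.pi
orientation_gap = min(gap, math.pi - gap)
```

The comparison of angles is done modulo $\pi$ because the major axis is a line through the origin, so both of its ends are equally valid.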

Jones calculus

It is clear that the polarization ellipse $x(t)$ may be parameterized either by four real parameters $(A_u, A_v, \theta_u, \theta_v)$ or by two complex parameters $(A_+e^{j\theta_+}, A_-e^{-j\theta_-})$. In the first case, we modulate the real basis $(\cos(\omega_0 t), \sin(\omega_0 t))$, and in the second case, we modulate the complex basis $(e^{j\omega_0 t}, e^{-j\omega_0 t})$. If we are interested only in the path that the electric field vector describes, and do not need to evaluate $x(t_0)$ at a particular time $t_0$, knowing the phase differences $\theta_u - \theta_v$ or $\theta_+ - \theta_-$ rather than the phases themselves is sufficient. The choice of parameterization, whether real or complex, is somewhat arbitrary, but it is common to use the Jones vector $[A_u, A_v e^{j(\theta_u-\theta_v)}]$ to describe the state of polarization. This is illustrated in the following example.


Example 1.1. The Jones vectors for four basic states of polarization are (note that we do not follow the convention of normalizing Jones vectors to unit norm):

$$\begin{aligned}
\begin{bmatrix}1\\0\end{bmatrix} &\longleftrightarrow x(t) = \cos(\omega_0 t) && \text{horizontal, linear polarization,}\\
\begin{bmatrix}0\\1\end{bmatrix} &\longleftrightarrow x(t) = j\cos(\omega_0 t) && \text{vertical, linear polarization,}\\
\begin{bmatrix}1\\j\end{bmatrix} &\longleftrightarrow x(t) = e^{j\omega_0 t} && \text{CCW (left-) circular polarization,}\\
\begin{bmatrix}1\\-j\end{bmatrix} &\longleftrightarrow x(t) = e^{-j\omega_0 t} && \text{CW (right-) circular polarization.}
\end{aligned}$$

Various polarization filters can be coded with two-by-two complex matrices that selectively pass components of the polarization. For example, consider these two polarization filters, and their corresponding Jones matrices:

$$\begin{aligned}
\begin{bmatrix}1 & 0\\0 & 0\end{bmatrix} &\quad \text{horizontal, linear polarizer,}\\
\frac{1}{2}\begin{bmatrix}1 & -j\\j & 1\end{bmatrix} &\quad \text{CCW (left-) circular polarizer.}
\end{aligned}$$

The first of these passes horizontal linear polarization and rejects vertical linear polarization. Such polarizers are used to reduce vertically polarized glare in Polaroid sunglasses. The second passes CCW circular polarization and rejects CW circular polarization. And so on.
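The pass and reject behavior described in Example 1.1 amounts to a two-by-two matrix-vector product. A minimal sketch follows; the helper `apply_jones` and the constant names are hypothetical and introduced only for this illustration.

```python
def apply_jones(matrix, vector):
    """Multiply a 2x2 Jones matrix (nested tuples) by a Jones vector (pair)."""
    (a, b), (c, d) = matrix
    p, q = vector
    return (a * p + b * q, c * p + d * q)

# Jones vectors of Example 1.1 (not normalized to unit norm)
HORIZONTAL = (1, 0)
VERTICAL = (0, 1)
CCW_CIRCULAR = (1, 1j)
CW_CIRCULAR = (1, -1j)

# Jones matrices of Example 1.1
HORIZONTAL_POLARIZER = ((1, 0), (0, 0))
CCW_POLARIZER = ((0.5, -0.5j), (0.5j, 0.5))

passed_h = apply_jones(HORIZONTAL_POLARIZER, HORIZONTAL)   # unchanged
blocked_v = apply_jones(HORIZONTAL_POLARIZER, VERTICAL)    # rejected
passed_ccw = apply_jones(CCW_POLARIZER, CCW_CIRCULAR)      # unchanged
blocked_cw = apply_jones(CCW_POLARIZER, CW_CIRCULAR)       # rejected
```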

1.4 Complex modulation, the Hilbert transform, and complex analytic signals

When analyzing the damped harmonic oscillator or the elliptically polarized electric field, the appropriate complex representations present themselves naturally. We now establish that this is so, as well, in the theory of modulation. Here the game is to modulate a baseband, information-bearing, signal onto a passband carrier signal that can be radiated from a real antenna onto a real channel. When the aim is to transmit information from here to there, then the channel may be "air," cable, or fiber. When the aim is to transmit information from now to then, then the channel may be a magnetic recording channel.

Actually, since a sinusoidal carrier signal can be modulated in amplitude and phase, the game is to modulate two information-bearing signals onto a carrier, suggesting again that one complex signal might serve to represent these two real signals and provide insight into how they should be designed. In fact, as we shall see, without the notion of a complex analytic signal, electrical engineers might never have discovered the Hilbert transform and single-sideband (SSB) modulation as the most spectrally efficient way to modulate one real channel of baseband information onto a passband carrier. Thus modulation theory provides the proper context for the study of the Hilbert transform and complex analytic signals.

1.4.1 Complex modulation using the complex envelope

Let us begin with two real information-bearing signals $u(t)$ and $v(t)$, which are combined in a complex baseband signal as

$$x(t) = u(t) + jv(t) = A(t)e^{j\theta(t)} = A(t)\cos\theta(t) + jA(t)\sin\theta(t). \tag{1.9}$$

The amplitude $A(t)$ and phase $\theta(t)$ are real. We take $u(t)$ and $v(t)$ to be lowpass signals with Fourier transforms supported on a baseband interval of $-\Omega < \omega < \Omega$. The representation $A(t)e^{j\theta(t)}$ is a generalization of the stationary phasor, wherein the fixed radius and angle of a phasor are replaced by a time-varying radius and angle. It is a simple matter to go back and forth between $x(t)$ and $(u(t), v(t))$ and $(A(t), \theta(t))$. From $x(t)$ we propose to construct the real passband signal

$$p(t) = \mathrm{Re}\,\{x(t)e^{j\omega_0 t}\} = A(t)\cos(\omega_0 t + \theta(t)) = u(t)\cos(\omega_0 t) - v(t)\sin(\omega_0 t). \tag{1.10}$$

In accordance with standard communications terminology, we call $x(t)$ the complex baseband signal or complex envelope of $p(t)$, $A(t)$ and $\theta(t)$ the amplitude and phase of the complex envelope, and $u(t)$ and $v(t)$ the in-phase and quadrature(-phase) components. The term "quadrature component" refers to the fact that it is in phase quadrature ($+\pi/2$ out of phase) with respect to the in-phase component. We say the complex envelope $x(t)$ complex-modulates the complex carrier $e^{j\omega_0 t}$, when what we really mean is that the real amplitude and phase $(A(t), \theta(t))$ real-modulate the amplitude and phase of the real carrier $\cos(\omega_0 t)$; or the in-phase and quadrature signals $(u(t), v(t))$ real-modulate the real in-phase carrier $\cos(\omega_0 t)$ and the real quadrature carrier $\sin(\omega_0 t)$. These are three equivalent ways of saying exactly the same thing. Figure 1.3(a) suggests a diagram for complex modulation. In Fig. 1.3(b), we stress the point that complex channels are actually two parallel real channels.

It is worth noting that when $\theta(t)$ is constant (say zero), then modulation is amplitude modulation only. In the complex plane, the complex baseband signal $x(t)$ writes out a trajectory $x(t) = A(t)$ that does not leave the real line. When $A(t)$ is constant (say 1), then modulation is phase modulation only. In the complex plane, the complex baseband signal $x(t)$ writes out a trajectory $x(t) = e^{j\theta(t)}$ that does not leave the unit circle. In general quadrature modulation, both $A(t)$ and $\theta(t)$ are time-varying, and they combine to write out quite arbitrary trajectories in the complex plane. These trajectories are composed of real part $u(t) = A(t)\cos\theta(t)$ and imaginary part $v(t) = A(t)\sin\theta(t)$, or of amplitude $A(t)$ and phase $\theta(t)$.
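The three equivalent descriptions of the passband signal in (1.10) can be compared numerically. The sketch below assumes two hypothetical lowpass signals `u` and `v` and an arbitrary carrier frequency; none of these values come from the text.

```python
import cmath
import math

omega0 = 2.0 * math.pi * 10.0   # carrier frequency in rad/s (assumed)

def u(t):
    """In-phase component: a hypothetical slowly varying signal."""
    return 1.0 + 0.5 * math.cos(2.0 * math.pi * 0.5 * t)

def v(t):
    """Quadrature component: a hypothetical slowly varying signal."""
    return 0.3 * math.sin(2.0 * math.pi * 0.25 * t)

def passband_three_ways(t):
    """Evaluate p(t) via the complex, polar, and quadrature forms of (1.10)."""
    x = complex(u(t), v(t))                      # complex envelope (1.9)
    A, theta = abs(x), cmath.phase(x)
    p_complex = (x * cmath.exp(1j * omega0 * t)).real
    p_polar = A * math.cos(omega0 * t + theta)
    p_quadrature = u(t) * math.cos(omega0 * t) - v(t) * math.sin(omega0 * t)
    return p_complex, p_polar, p_quadrature

samples = [passband_three_ways(0.001 * k) for k in range(2000)]
# Largest disagreement between the three forms over all sampled times
max_spread = max(max(s) - min(s) for s in samples)
```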

Figure 1.3 (a) Complex and (b) quadrature modulation.

Figure 1.4 Baseband spectrum X(ω) (solid line) and passband spectrum P(ω) (dashed line).

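If the complex signal $x(t)$ has Fourier transform $X(\omega)$, denoted $x(t) \longleftrightarrow X(\omega)$, then $x(t)e^{j\omega_0 t} \longleftrightarrow X(\omega - \omega_0)$ and $x^*(t) \longleftrightarrow X^*(-\omega)$. Thus,

$$p(t) = \mathrm{Re}\,\{x(t)e^{j\omega_0 t}\} = \tfrac{1}{2}x(t)e^{j\omega_0 t} + \tfrac{1}{2}x^*(t)e^{-j\omega_0 t} \tag{1.11}$$

has Hermitian-symmetric Fourier transform

$$P(\omega) = \tfrac{1}{2}X(\omega - \omega_0) + \tfrac{1}{2}X^*(-\omega - \omega_0). \tag{1.12}$$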

Because $p(t)$ is real, its Fourier transform satisfies $P(\omega) = P^*(-\omega)$. Thus, the real part of $P(\omega)$ is even, and the imaginary part is odd. Moreover, the magnitude $|P(\omega)|$ is even, and the phase $\angle P(\omega)$ is odd. Fanciful spectra $X(\omega)$ and $P(\omega)$ are illustrated in Fig. 1.4.

1.4.2 The Hilbert transform, phase splitter, and analytic signal

If the complex baseband signal $x(t)$ can be recovered from the passband signal $p(t)$, then the two real channels $u(t)$ and $v(t)$ can be easily recovered as $u(t) = \mathrm{Re}\,x(t) = \tfrac{1}{2}[x(t) + x^*(t)]$ and $v(t) = \mathrm{Im}\,x(t) = [1/(2j)][x(t) - x^*(t)]$. But how is $x(t)$ to be recovered from $p(t)$? The real operator Re in the definition of $p(t)$ is applied to the complex signal $x(t)e^{j\omega_0 t}$ and returns the real signal $p(t)$. Suppose there existed an inverse operator $\Phi$, i.e., a linear, convolutional, complex operator, that could be applied to the real signal $p(t)$ and return the complex signal $x(t)e^{j\omega_0 t}$. Then this complex signal could be complex-demodulated for $x(t) = e^{-j\omega_0 t}e^{j\omega_0 t}x(t)$. The complex operator $\Phi$ would have to be defined by an impulse response $\phi(t) \longleftrightarrow \Phi(\omega)$, whose Fourier transform $\Phi(\omega)$ were zero for negative frequencies and 2 for positive frequencies, in order to return the signal $x(t)e^{j\omega_0 t} \longleftrightarrow X(\omega - \omega_0)$.

This brings us to the Hilbert transform, the phase splitter, and the complex analytic signal. The Hilbert transform of a signal $p(t)$ is denoted $\hat{p}(t)$, and defined as the linear shift-invariant operation

$$\hat{p}(t) = (h \ast p)(t) = \int_{-\infty}^{\infty} h(t - \tau)p(\tau)\,d\tau \longleftrightarrow (HP)(\omega) = H(\omega)P(\omega) = \hat{P}(\omega). \tag{1.13}$$

The impulse response $h(t)$ and complex frequency response $H(\omega)$ of the Hilbert transform are defined to be, for $t \in \mathbb{R}$ and $\omega \in \mathbb{R}$,

$$h(t) = \frac{1}{\pi t} \longleftrightarrow -j\,\mathrm{sgn}(\omega) = H(\omega). \tag{1.14}$$

Here $\mathrm{sgn}(\omega)$ is the function

$$\mathrm{sgn}(\omega) = \begin{cases} 1, & \omega > 0, \\ 0, & \omega = 0, \\ -1, & \omega < 0. \end{cases} \tag{1.15}$$

So $h(t)$ is real and odd, and $H(\omega)$ is imaginary and odd. From the Hilbert transform $h(t) \longleftrightarrow H(\omega)$ we define the phase splitter

$$\phi(t) = \delta(t) + jh(t) \longleftrightarrow 1 - j^2\,\mathrm{sgn}(\omega) = 2\Theta(\omega) = \Phi(\omega). \tag{1.16}$$

The complex frequency response of the phase splitter is $\Phi(\omega) = 2\Theta(\omega)$, where $\Theta(\omega)$ is the standard unit-step function. The convolution of the complex filter $\phi(t)$ and the real signal $p(t)$ produces the analytic signal $y(t) = p(t) + j\hat{p}(t)$, with Fourier transform identity

$$y(t) = (\phi \ast p)(t) = p(t) + j\hat{p}(t) \longleftrightarrow P(\omega) + \mathrm{sgn}(\omega)P(\omega) = 2(\Theta P)(\omega) = Y(\omega). \tag{1.17}$$

Recall that the Fourier transform $P(\omega)$ of a real signal $p(t)$ has Hermitian symmetry $P(-\omega) = P^*(\omega)$, so $P(\omega)$ for $\omega < 0$ is redundant. In the polar representation $P(\omega) = B(\omega)e^{j\Psi(\omega)}$, we have $B(\omega) = B(-\omega)$ and $\Psi(-\omega) = -\Psi(\omega)$. Therefore, the corresponding spectral representations for real $p(t)$ and complex analytic $y(t)$ are

$$p(t) = \int_{-\infty}^{\infty} P(\omega)e^{j\omega t}\,\frac{d\omega}{2\pi} = 2\int_{0}^{\infty} B(\omega)\cos(\omega t + \Psi(\omega))\,\frac{d\omega}{2\pi}, \tag{1.18}$$

$$y(t) = \int_{0}^{\infty} 2P(\omega)e^{j\omega t}\,\frac{d\omega}{2\pi} = 2\int_{0}^{\infty} B(\omega)e^{j(\omega t + \Psi(\omega))}\,\frac{d\omega}{2\pi}. \tag{1.19}$$

The analytic signal replaces the redundant two-sided spectral representation (1.18) by the efficient one-sided representation (1.19). Equivalently, the analytic signal uses a linear combination of amplitude- and phase-modulated complex exponentials to represent a linear combination of amplitude- and phase-modulated cosines. We might say the analytic signal $y(t)$ is a bandwidth-efficient representation for its corresponding real signal $p(t)$.

Example 1.2. Begin with the real signal $\cos(\omega_0 t)$. Its Hilbert transform is the real signal $\sin(\omega_0 t)$ and its analytic signal is the complex exponential $e^{j\omega_0 t}$. This is easily established from the Fourier-series expansions for $e^{j\omega_0 t}$, $\cos(\omega_0 t)$, and $\sin(\omega_0 t)$. This result extends to the complex Fourier series

$$x(t) = \sum_{m=1}^{M} A_m e^{j\theta_m} e^{j2\pi\frac{m}{T}t}. \tag{1.20}$$

This complex signal is analytic, with spectral lines at positive frequencies of $2\pi m/T$ only. Its real and imaginary parts are

$$u(t) = \sum_{m=1}^{M} A_m\cos\!\left(2\pi\frac{m}{T}t + \theta_m\right) = \sum_{m=-M}^{M} \frac{A_m}{2}e^{j\theta_m}e^{j2\pi\frac{m}{T}t}, \tag{1.21}$$

$$\hat{u}(t) = \sum_{m=1}^{M} A_m\sin\!\left(2\pi\frac{m}{T}t + \theta_m\right) = \sum_{m=-M}^{M} \mathrm{sgn}(m)\frac{A_m}{2j}e^{j\theta_m}e^{j2\pi\frac{m}{T}t}, \tag{1.22}$$

where $A_0 = 0$, $A_{-m} = A_m$, and $\theta_{-m} = -\theta_m$. Of course $\hat{u}(t)$ is the Hilbert transform of $u(t)$.

But caution is required. Define the complex signal $e^{j\psi(t)} = \cos\psi(t) + j\sin\psi(t)$, with $\psi(t)$ real. Then $\sin\psi(t)$ is the Hilbert transform of $\cos\psi(t)$ if and only if $e^{j\psi(t)}$ is analytic, meaning that its Fourier transform is one-sided. This means that $\sin\psi(t)$ is the Hilbert transform of $\cos\psi(t)$ only in special cases.

The imaginary part of a complex analytic signal is the Hilbert transform of its real part, and its Fourier transform is causal (zero for negative frequencies). There is also a dual for an analytic Fourier transform. Its imaginary part is the Hilbert transform of its real part, and its inverse Fourier transform is a causal signal (zero for negative time). This is called the Kramers–Kronig relation, after the physicists who first answered the question of what could be said about the spectrum of a causal signal.
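For sampled periodic signals, the phase splitter can be imitated by weighting DFT bins with $1 + \mathrm{sgn}(\text{frequency})$. The following is a minimal sketch under that discrete-time assumption, not the continuous-time operator itself. The helper names (`dft`, `idft`, `analytic_signal`), the direct $O(N^2)$ transforms, the treatment of the Nyquist bin, and all numerical values are choices made for this illustration only.

```python
import cmath
import math

def dft(x):
    """Direct O(N^2) DFT, adequate for a short illustration."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * m * n / N) for n in range(N))
            for m in range(N)]

def idft(X):
    """Inverse of dft()."""
    N = len(X)
    return [sum(X[m] * cmath.exp(2j * math.pi * m * n / N) for m in range(N)) / N
            for n in range(N)]

def analytic_signal(p):
    """Discrete phase splitter for even-length p: weight bins by 1 + sgn(frequency)."""
    N = len(p)
    # DC and Nyquist bins get weight 1, positive bins 2, negative bins 0
    weights = [1.0] + [2.0] * (N // 2 - 1) + [1.0] + [0.0] * (N // 2 - 1)
    return idft([w * Pm for w, Pm in zip(weights, dft(p))])

N = 32

# Example 1.2: the analytic signal of a cosine is the complex exponential
k0 = 3
cosine = [math.cos(2.0 * math.pi * k0 * n / N) for n in range(N)]
y_cos = analytic_signal(cosine)
err_cos = max(abs(y_cos[n] - cmath.exp(2j * math.pi * k0 * n / N)) for n in range(N))

# Fourier series (1.20) with hypothetical amplitudes A_m and phases theta_m
amps = {1: 1.0, 2: 0.5, 5: 0.25}
phases = {1: 0.2, 2: -1.0, 5: 0.7}
x = [sum(amps[m] * cmath.exp(1j * (phases[m] + 2.0 * math.pi * m * n / N)) for m in amps)
     for n in range(N)]
y = analytic_signal([xn.real for xn in x])     # rebuild x from its real part only
err_series = max(abs(a - b) for a, b in zip(y, x))
```

The imaginary part of `y_cos` is the sampled sine, which is the discrete counterpart of the statement that $\sin(\omega_0 t)$ is the Hilbert transform of $\cos(\omega_0 t)$.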

Figure 1.5 Complex demodulation with the phase splitter: (a) one complex channel and (b) two real channels.

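1.4.3 Complex demodulation

Let's recap. The output of the phase splitter applied to a passband signal $p(t) = \mathrm{Re}\,\{x(t)e^{j\omega_0 t}\} \longleftrightarrow \tfrac{1}{2}X(\omega - \omega_0) + \tfrac{1}{2}X^*(-\omega - \omega_0) = P(\omega)$ is the analytic signal

$$y(t) = (\phi \ast p)(t) \longleftrightarrow (\Phi P)(\omega) = 2(\Theta P)(\omega) = Y(\omega). \tag{1.23}$$

Under the assumption that the bandwidth $\Omega$ of the spectrum $X(\omega)$ satisfies $\Omega < \omega_0$, we have

$$Y(\omega) = \begin{cases} X(\omega - \omega_0), & \omega > 0, \\ 0, & \omega \le 0. \end{cases} \tag{1.24}$$

From here, we only need to shift $y(t)$ down to baseband to obtain the complex baseband signal

$$x(t) = y(t)e^{-j\omega_0 t} = (\phi \ast p)(t)e^{-j\omega_0 t} \longleftrightarrow 2(\Theta P)(\omega + \omega_0) = X(\omega). \tag{1.25}$$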

Two diagrams of a complex demodulator are shown in Fig. 1.5. The Hilbert transform is an idealized convolution operator that can only be approximated in practice. The usual approach to approximating it is to complex-demodulate $p(t)$ as

$$e^{-j\omega_0 t}p(t) \longleftrightarrow P(\omega + \omega_0) = \tfrac{1}{2}X(\omega) + \tfrac{1}{2}X^*(-\omega - 2\omega_0). \tag{1.26}$$

Again, under the assumption that $X(\omega)$ is bandlimited, this demodulated signal may be lowpass-filtered (with cutoff frequency $\Omega$ and a gain of 2 to undo the factor $\tfrac{1}{2}$) for $x(t) \longleftrightarrow X(\omega)$. Of course there are no ideal lowpass filters, so either the Hilbert transform is approximated and followed by a complex demodulator, or a complex demodulator is followed by an approximate lowpass filter. Evidently, in the case of a complex bandlimited baseband signal modulating a complex carrier, a phase splitter followed by a complex demodulator is equivalent to a complex demodulator followed by a lowpass filter. But how general is this? Bedrosian's theorem answers this question.
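The equivalence of the two demodulation routes can be checked on a sampled periodic signal, where ideal filters reduce to masks on DFT bins. This is a sketch under that idealized discrete-time assumption; the bin numbers, the coefficients of the baseband signal, and the helper names are hypothetical.

```python
import cmath
import math

def dft(x):
    """Direct O(N^2) DFT, adequate for a short illustration."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * m * n / N) for n in range(N))
            for m in range(N)]

def idft(X):
    """Inverse of dft()."""
    N = len(X)
    return [sum(X[m] * cmath.exp(2j * math.pi * m * n / N) for m in range(N)) / N
            for n in range(N)]

# Samples, carrier bin, and baseband half-width in bins (all assumed)
N, k0, band = 64, 16, 3

def rotate(sig, k):
    """Multiply by exp(j 2 pi k n / N), i.e. shift the spectrum by k bins."""
    return [s * cmath.exp(2j * math.pi * k * n / N) for n, s in enumerate(sig)]

# Hypothetical bandlimited complex baseband signal x[n]
coeffs = {-2: 0.5 - 0.2j, -1: 0.3j, 0: 1.0, 1: -0.4, 3: 0.1 + 0.6j}
x = [sum(c * cmath.exp(2j * math.pi * k * n / N) for k, c in coeffs.items())
     for n in range(N)]
p = [s.real for s in rotate(x, k0)]            # real passband signal, as in (1.10)

# Route 1: phase splitter, then complex demodulation, as in (1.25)
split = [1.0] + [2.0] * (N // 2 - 1) + [1.0] + [0.0] * (N // 2 - 1)
y = idft([w * Pm for w, Pm in zip(split, dft(p))])
x_route1 = rotate(y, -k0)

# Route 2: complex demodulation (1.26), then an ideal lowpass filter with gain 2
D = dft(rotate(p, -k0))
lowpass = [2.0 if min(m, N - m) <= band else 0.0 for m in range(N)]
x_route2 = idft([w * Dm for w, Dm in zip(lowpass, D)])

err1 = max(abs(a - b) for a, b in zip(x_route1, x))
err2 = max(abs(a - b) for a, b in zip(x_route2, x))
```

Both routes recover `x` here because the baseband half-width is well below the carrier bin, which mirrors the condition $\Omega < \omega_0$.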

1.4.4 Bedrosian's theorem: the Hilbert transform of a product

Let $u(t) = u_1(t)u_2(t)$ be a product signal and let $\hat{u}(t)$ denote the Hilbert transform of this product. If $U_1(\omega) = 0$ for $|\omega| > \Omega$ and $U_2(\omega) = 0$ for $|\omega| < \Omega$, then $\hat{u}(t) = u_1(t)\hat{u}_2(t)$. That is, the lowpass, slowly varying, factor may be regarded as constant when calculating the Hilbert transform. The proof is a frequency-domain proof:

$$\begin{aligned}
(H(U_1 \ast U_2))(\omega) &= -j\,\mathrm{sgn}(\omega)(U_1 \ast U_2)(\omega) \\
&= -j\int_{-\infty}^{\infty} U_1(\nu)U_2(\omega - \nu)\,\mathrm{sgn}(\omega)\,\frac{d\nu}{2\pi} \\
&= -j\int_{-\infty}^{\infty} U_1(\nu)U_2(\omega - \nu)\,\mathrm{sgn}(\omega - \nu)\,\frac{d\nu}{2\pi} \\
&= (U_1 \ast (HU_2))(\omega).
\end{aligned} \tag{1.27}$$

Actually, this proof is not as simple as it looks. It depends on the fact that $\mathrm{sgn}(\omega - \nu) = \mathrm{sgn}(\omega)$ over the range of values $\nu$ for which the integrand is nonzero.

Example 1.3. If $a(t)$ is a real lowpass signal with bandwidth $\Omega < \omega_0$, then the Hilbert transform of $u(t) = a(t)\cos(\omega_0 t)$ is $\hat{u}(t) = a(t)\sin(\omega_0 t)$. Hence, the analytic signal $x(t) = u(t) + j\hat{u}(t)$ computed from the real amplitude-modulated signal $u(t)$ is $x(t) = a(t)(\cos(\omega_0 t) + j\sin(\omega_0 t)) = a(t)e^{j\omega_0 t}$.
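Example 1.3 can be checked on sampled periodic signals, for which the Hilbert transform becomes multiplication of DFT bins by $-j\,\mathrm{sgn}$. The sketch below makes that discrete-time assumption and uses hypothetical test signals. It also includes a case in which the amplitude factor is not lowpass relative to the carrier, so that the bandwidth condition is violated and the product rule fails.

```python
import cmath
import math

def dft(x):
    """Direct O(N^2) DFT, adequate for a short illustration."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * m * n / N) for n in range(N))
            for m in range(N)]

def idft(X):
    """Inverse of dft()."""
    N = len(X)
    return [sum(X[m] * cmath.exp(2j * math.pi * m * n / N) for m in range(N)) / N
            for n in range(N)]

def hilbert(p):
    """Hilbert transform of an even-length periodic sequence via -j sgn on DFT bins."""
    N = len(p)
    sgn = [0.0] + [1.0] * (N // 2 - 1) + [0.0] + [-1.0] * (N // 2 - 1)
    return [h.real for h in idft([-1j * s * Pm for s, Pm in zip(sgn, dft(p))])]

N, kc = 64, 10                                   # samples and carrier bin (assumed)
carrier_cos = [math.cos(2.0 * math.pi * kc * n / N) for n in range(N)]
carrier_sin = [math.sin(2.0 * math.pi * kc * n / N) for n in range(N)]

def product_rule_error(a):
    """Max deviation between H{a cos} and a sin, the claim of Example 1.3."""
    u_hat = hilbert([an * c for an, c in zip(a, carrier_cos)])
    return max(abs(h - an * s) for h, an, s in zip(u_hat, a, carrier_sin))

# Lowpass amplitude: highest harmonic 3 is below the carrier bin 10
a_low = [1.0 + 0.5 * math.cos(2.0 * math.pi * 2 * n / N)
         + 0.2 * math.sin(2.0 * math.pi * 3 * n / N) for n in range(N)]
# Amplitude with harmonic 12 above the carrier bin: the condition is violated
a_wide = [math.cos(2.0 * math.pi * 12 * n / N) for n in range(N)]

err_ok = product_rule_error(a_low)
err_violated = product_rule_error(a_wide)
```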

1.4.5 Instantaneous amplitude, frequency, and phase

So far, we have spoken rather loosely of amplitude and phase modulation. If we modulate two real signals $a(t)$ and $\psi(t)$ onto a cosine to produce the real signal $p(t) = a(t)\cos(\omega_0 t + \psi(t))$, then this language seems unambiguous: we would say the respective signals amplitude- and phase-modulate the cosine. But is it really unambiguous? The following example suggests that the question deserves thought.

Example 1.4. Let's look at a "purely amplitude-modulated" signal

$$p(t) = a(t)\cos(\omega_0 t). \tag{1.28}$$

Assuming that $a(t)$ is bounded such that $0 \le a(t) \le A$, there is a well-defined function

$$\psi(t) = \cos^{-1}\!\left[\frac{1}{A}p(t)\right] - \omega_0 t. \tag{1.29}$$

We can now write $p(t)$ as

$$p(t) = a(t)\cos(\omega_0 t) = A\cos(\omega_0 t + \psi(t)), \tag{1.30}$$

which makes it look like a "purely phase-modulated" signal.

This example shows that, for a given real signal $p(t)$, the factorization $p(t) = a(t)\cos(\omega_0 t + \psi(t))$ is not unique. In fact, there is an infinite number of ways for $p(t)$ to be factored into "amplitude" and "phase." We can resolve this ambiguity by resorting to the complex envelope of $p(t)$. The complex envelope $x(t) = e^{-j\omega_0 t}(p(t) + j\hat{p}(t))$, computed from the real bandpass signal $p(t)$ for a given carrier frequency $\omega_0$, is uniquely factored as

$$x(t) = A(t)e^{j\theta(t)}. \tag{1.31}$$

We call $A(t) = |x(t)|$ the instantaneous amplitude of $p(t)$, $\theta(t) = \angle x(t)$ the instantaneous phase of $p(t)$, and the derivative of the instantaneous phase $(d/dt)\theta(t)$ the instantaneous frequency of $p(t)$. This argument works also for $\omega_0 = 0$.

1.4.6 Hilbert transform and SSB modulation

There is another important application of the Hilbert transform, again leading to the definition of a complex signal from a real signal. In this case the aim is to modulate one real channel $u(t)$ onto a carrier. The direct way would be to use double-sideband suppressed carrier (DSB-SC) modulation of the form $u(t)\cos(\omega_0 t)$, whose spectrum is shown in Fig. 1.6(a). However, since $u(t)$ is real, its Fourier transform satisfies $U(-\omega) = U^*(\omega)$, so half the bandwidth is redundant. The alternative is to Hilbert-transform $u(t)$, construct the analytic signal $x(t) = u(t) + j\hat{u}(t)$, and complex-modulate with it to form the real passband signal $p(t) = u(t)\cos(\omega_0 t) - \hat{u}(t)\sin(\omega_0 t)$. This signal, illustrated in Fig. 1.6(b), is bandwidth-efficient and it is said to be single-sideband (SSB) modulated for obvious reasons. Without the notion of the Hilbert transform and the complex analytic signal, no such construction would have been possible.

Figure 1.6 DSB-SC (a) and SSB modulation (b).

1.4.7 Passband filtering at baseband

Consider the problem of linear shift-invariant filtering of the real passband signal $p(t)$ with a filter whose real-valued passband impulse response is $g(t)$ and whose passband frequency response is $G(\omega)$. The filter output is

$$(g \ast p)(t) \longleftrightarrow (GP)(\omega). \tag{1.32}$$

Instead of filtering at passband, we can filter at baseband. Similarly to the definition of the complex baseband signal $x(t)$ in (1.25), we define the complex baseband impulse response and frequency response as

$$g_b(t) = \tfrac{1}{2}(\phi \ast g)(t)e^{-j\omega_0 t} \longleftrightarrow G_b(\omega) = (\Theta G)(\omega + \omega_0), \tag{1.33}$$

where $\Theta$ again denotes the unit-step function. We note that the definition of a complex baseband impulse response includes a factor of 1/2 that is not included in the definition of the complex baseband signal $x(t)$. Therefore,

$$G(\omega) = G_b(\omega - \omega_0) + G_b^*(-\omega - \omega_0), \tag{1.34}$$

whereas

$$P(\omega) = \tfrac{1}{2}X(\omega - \omega_0) + \tfrac{1}{2}X^*(-\omega - \omega_0). \tag{1.35}$$

The filter output is

$$(GP)(\omega) = \tfrac{1}{2}\left[G_b(\omega - \omega_0) + G_b^*(-\omega - \omega_0)\right]\left[X(\omega - \omega_0) + X^*(-\omega - \omega_0)\right]. \tag{1.36}$$

If $G_b(\omega)$ and $X(\omega)$ are bandlimited with bandwidth $\Omega < \omega_0$, then $G_b(\omega - \omega_0)X^*(-\omega - \omega_0) \equiv 0$ and $G_b^*(-\omega - \omega_0)X(\omega - \omega_0) \equiv 0$. This means that

$$\begin{aligned}
(GP)(\omega) &= \tfrac{1}{2}\left[G_b(\omega - \omega_0)X(\omega - \omega_0) + G_b^*(-\omega - \omega_0)X^*(-\omega - \omega_0)\right] \\
&= \tfrac{1}{2}\left[(G_bX)(\omega - \omega_0) + (G_bX)^*(-\omega - \omega_0)\right],
\end{aligned} \tag{1.37}$$

and in the time domain,

$$(g \ast p)(t) = \mathrm{Re}\,\{(g_b \ast x)(t)e^{j\omega_0 t}\}. \tag{1.38}$$

Figure 1.7 Passband filtering at baseband.

Hence, passband filtering can be performed at baseband. In the implementation shown in Fig. 1.7, the passband signal p(t) is complex-demodulated and lowpass-filtered to produce the complex baseband signal x(t) (as discussed in Section 1.4.3), then filtered using the complex baseband impulse response gb (t), and finally modulated back to passband. This is what is done in most practical applications. This concludes our discussion of complex signals for general modulation of the amplitude and phase of a sinusoidal carrier. The essential point is that, once again, the representation of two real modulating signals as a complex signal leads to insights and economies of reasoning that would not otherwise emerge. Moreover, complex signal theory allows us to construct bandwidth-efficient versions of amplitude modulation, using the Hilbert transform and the complex analytic signal.

1.5 Complex signals for the efficient use of the FFT

There are four Fourier transforms: the continuous-time Fourier transform, the continuous-time Fourier series, the discrete-time Fourier transform (DTFT), and the discrete-time Fourier series, usually called the discrete Fourier transform (DFT). In practice, the DFT is always computed using the fast Fourier transform (FFT). All of these transforms are applied to a complex signal and they return a complex transform, or spectrum. When they are applied to a real signal, then the returned complex spectrum has Hermitian symmetry in frequency, meaning the negative-frequency half of the spectrum has been (inefficiently) computed when it could have been determined by simply complex conjugating an efficiently computed positive-frequency half of the spectrum.

One might be inclined to say that the analytic signal solves this problem by placing a real signal in the real part and the real Hilbert transform of this signal in the imaginary part to form a complex analytic signal, whose spectrum no longer has Hermitian symmetry (being zero for negative frequencies). However, again, this special non-Hermitian spectrum is known to be zero for negative frequencies and therefore the negative-frequency part of the spectrum has been computed inefficiently for its zero values. So, the only way to exploit the Fourier transform efficiently is to use it to simultaneously Fourier-transform two real signals that have been composed into the real and imaginary parts of a complex signal, or to compose two subsampled versions of a length-$2N$ real signal into the real and imaginary parts of a length-$N$ complex signal.

Let's first review the $N$-point DFT and its important identities. Then we will illustrate its efficient use for transforming two real discrete-time signals of length $N$, and for transforming a single real signal of length $2N$, using just one length-$N$ DFT of a complex signal. This treatment is adapted from Mitra (2006).

1.5.1 Complex DFT

We shall denote the DFT of the length-$N$ sequence $\{x[n]\}_{n=0}^{N-1}$ by the length-$N$ sequence $\{X[m]\}_{m=0}^{N-1}$ and establish the shorthand notation $\{x[n]\}_{n=0}^{N-1} \longleftrightarrow \{X[m]\}_{m=0}^{N-1}$. The $m$th DFT coefficient $X[m]$ is computed as

$$X[m] = \sum_{n=0}^{N-1} x[n]W_N^{-mn}, \qquad W_N = e^{j2\pi/N}, \tag{1.39}$$

and these DFT coefficients are inverted for the original signal as

$$x[n] = \frac{1}{N}\sum_{m=0}^{N-1} X[m]W_N^{mn}. \tag{1.40}$$

The complex number $W_N = e^{j2\pi/N}$ is an $N$th root of unity. When raised to the powers $n = 0, 1, \ldots, N-1$, it visits all the $N$th roots of unity.

An important symmetry of the DFT is $\{x^*[n]\}_{n=0}^{N-1} \longleftrightarrow \{X^*[(N-m)_N]\}_{m=0}^{N-1}$, which reads "the DFT of a complex-conjugated sequence is the DFT of the original sequence, complex-conjugated and cyclically reversed in frequency." The notation $(N-m)_N$ stands for "$N-m$ modulo $N$" so that $X[(N)_N] = X[0]$.

Now consider a length-$2N$ sequence $\{x[n]\}_{n=0}^{2N-1}$ and its length-$2N$ DFT sequence $\{X[m]\}_{m=0}^{2N-1}$. Call $\{e[n] = x[2n]\}_{n=0}^{N-1}$ the even polyphase component of $x$, and $\{o[n] = x[2n+1]\}_{n=0}^{N-1}$ the odd polyphase component of $x$. Their DFTs are $\{e[n]\}_{n=0}^{N-1} \longleftrightarrow \{E[m]\}_{m=0}^{N-1}$ and $\{o[n]\}_{n=0}^{N-1} \longleftrightarrow \{O[m]\}_{m=0}^{N-1}$. The DFT of $\{x[n]\}_{n=0}^{2N-1}$ is, for $m = 0, 1, \ldots, 2N-1$,

$$\begin{aligned}
X[m] = \sum_{n=0}^{2N-1} x[n]W_{2N}^{-mn} &= \sum_{n=0}^{N-1} e[n]W_N^{-mn} + W_{2N}^{-m}\sum_{n=0}^{N-1} o[n]W_N^{-mn} \\
&= E[(m)_N] + W_{2N}^{-m}\,O[(m)_N].
\end{aligned} \tag{1.41}$$

Here we have used $W_{2N}^{-2mn} = W_N^{-mn}$ for the even samples and $W_{2N}^{-m(2n+1)} = W_{2N}^{-m}W_N^{-mn}$ for the odd samples, so the twiddle factor that multiplies the odd-sample DFT is $W_{2N}^{-m} = e^{-j2\pi m/(2N)}$. That is, the $2N$-point DFT is computed from two $N$-point DFTs. In fact, this is the basis of the decimation-in-time FFT.
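The DFT definition, the conjugate symmetry, and the assembly of a length-$2N$ DFT from its even and odd polyphase components can be sketched directly. The code below uses a direct $O(N^2)$ evaluation and a hypothetical test sequence. The twiddle factor is written out explicitly as $e^{-j2\pi m/(2N)}$, which is what splitting the length-$2N$ sum into even and odd samples yields.

```python
import cmath
import math

def dft(x):
    """Direct O(N^2) evaluation of the DFT sum (1.39)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * m * n / N) for n in range(N))
            for m in range(N)]

def dft_from_polyphase(x):
    """Length-2N DFT assembled from the N-point DFTs of even and odd samples."""
    N = len(x) // 2
    E = dft(x[0::2])                     # DFT of the even polyphase component
    O = dft(x[1::2])                     # DFT of the odd polyphase component
    return [E[m % N] + cmath.exp(-2j * math.pi * m / (2 * N)) * O[m % N]
            for m in range(2 * N)]

# Hypothetical complex test sequence of length 2N = 16
x = [complex(math.cos(0.7 * n), math.sin(0.2 * n * n)) for n in range(16)]
direct = dft(x)
assembled = dft_from_polyphase(x)
max_err = max(abs(a - b) for a, b in zip(direct, assembled))

# Conjugate symmetry: the DFT of x* equals X*[(N - m) mod N]
N = len(x)
conj_dft = dft([xn.conjugate() for xn in x])
sym_err = max(abs(conj_dft[m] - direct[(N - m) % N].conjugate()) for m in range(N))
```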

1.5.2 Twofer: two real DFTs from one complex DFT

Begin with the two real length-$N$ sequences $\{u[n]\}_{n=0}^{N-1}$ and $\{v[n]\}_{n=0}^{N-1}$. From them form the complex signal $\{x[n] = u[n] + jv[n]\}_{n=0}^{N-1}$. DFT this complex sequence for $\{x[n]\}_{n=0}^{N-1} \longleftrightarrow \{X[m]\}_{m=0}^{N-1}$. Now note that $u[n] = \tfrac{1}{2}(x[n] + x^*[n])$, and $v[n] = [1/(2j)](x[n] - x^*[n])$. So for $m = 0, 1, \ldots, N-1$,

$$U[m] = \frac{1}{2}\left(X[m] + X^*[(N-m)_N]\right), \tag{1.42}$$

$$V[m] = \frac{1}{2j}\left(X[m] - X^*[(N-m)_N]\right). \tag{1.43}$$

In this way, the $N$-point DFT is applied to a complex $N$-sequence $x$ and efficiently returns an $N$-point DFT sequence $X$, from which the DFTs for $u$ and $v$ are extracted frequency-by-frequency.

Figure 1.8 Using one length-N complex DFT to compute the DFT for a length-2N real signal.
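The extraction in (1.42) and (1.43) can be sketched as follows. The helper name `two_real_dfts` and the test sequences are hypothetical, and a direct $O(N^2)$ DFT stands in for an FFT to keep the example self-contained.

```python
import cmath
import math

def dft(x):
    """Direct O(N^2) DFT; an FFT would be used in practice."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * m * n / N) for n in range(N))
            for m in range(N)]

def two_real_dfts(u, v):
    """DFTs of two real length-N sequences from one complex DFT, per (1.42)-(1.43)."""
    N = len(u)
    X = dft([complex(a, b) for a, b in zip(u, v)])
    U = [0.5 * (X[m] + X[(N - m) % N].conjugate()) for m in range(N)]
    V = [(X[m] - X[(N - m) % N].conjugate()) / 2j for m in range(N)]
    return U, V

# Hypothetical real test sequences of length N = 12
u = [math.cos(0.9 * n) + 0.1 * n for n in range(12)]
v = [math.sin(0.4 * n * n) for n in range(12)]

U, V = two_real_dfts(u, v)
err_u = max(abs(a - b) for a, b in zip(U, dft(u)))
err_v = max(abs(a - b) for a, b in zip(V, dft(v)))
```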

1.5.3 Twofer: one real 2N-DFT from one complex N-DFT

Begin with the real length-$2N$ sequence $\{u[n]\}_{n=0}^{2N-1}$ and subsample it on its even and odd integers to form the real polyphase length-$N$ sequences $\{e[n] = u[2n]\}_{n=0}^{N-1}$ and $\{o[n] = u[2n+1]\}_{n=0}^{N-1}$. From these form the complex $N$-sequence $\{x[n] = e[n] + jo[n]\}_{n=0}^{N-1}$. DFT this for $\{x[n]\}_{n=0}^{N-1} \longleftrightarrow \{X[m]\}_{m=0}^{N-1}$. Extract the DFTs for $\{e[n]\}_{n=0}^{N-1} \longleftrightarrow \{E[m]\}_{m=0}^{N-1}$ and $\{o[n]\}_{n=0}^{N-1} \longleftrightarrow \{O[m]\}_{m=0}^{N-1}$, according to (1.42) and (1.43). Then for $m = 0, 1, \ldots, 2N-1$, construct the $2N$-DFT $\{u[n]\}_{n=0}^{2N-1} \longleftrightarrow \{U[m]\}_{m=0}^{2N-1}$ according to (1.41). In this way, the DFT of a real $2N$-sequence is efficiently computed with the DFT of one complex $N$-sequence, followed by simple frequency-by-frequency computations. A hardware diagram is given in Fig. 1.8.

The two examples in this section show that the complex FFT can be made efficient for the Fourier analysis of real signals by constructing one complex signal from two real signals, providing one more justification for the claim that complex representations of two real signals bring efficiencies not otherwise achievable.
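The recipe of this subsection can be sketched in a few lines. The function name `real_dft_2n` and the test sequence are hypothetical, a direct $O(N^2)$ DFT stands in for an FFT, and the odd-sample DFT is combined with the twiddle factor $e^{-j2\pi m/(2N)}$ that arises when the length-$2N$ sum is split into even and odd samples.

```python
import cmath
import math

def dft(x):
    """Direct O(N^2) DFT; an FFT would be used in practice."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * m * n / N) for n in range(N))
            for m in range(N)]

def real_dft_2n(u):
    """DFT of a real length-2N sequence from one complex length-N DFT."""
    N = len(u) // 2
    # Pack even samples into the real part and odd samples into the imaginary part
    X = dft([complex(e, o) for e, o in zip(u[0::2], u[1::2])])
    # Separate the two real DFTs, as in (1.42) and (1.43)
    E = [0.5 * (X[m] + X[(N - m) % N].conjugate()) for m in range(N)]
    O = [(X[m] - X[(N - m) % N].conjugate()) / 2j for m in range(N)]
    # Recombine with the twiddle factor exp(-j 2 pi m / (2N))
    return [E[m % N] + cmath.exp(-2j * math.pi * m / (2 * N)) * O[m % N]
            for m in range(2 * N)]

# Hypothetical real test sequence of length 2N = 16
u = [math.sin(0.3 * n) + 0.05 * n * n for n in range(16)]
U = real_dft_2n(u)
err = max(abs(a - b) for a, b in zip(U, dft(u)))

# Hermitian symmetry of the spectrum of a real signal
M = len(u)
herm_err = max(abs(U[m] - U[(M - m) % M].conjugate()) for m in range(M))
```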

1.6 The bivariate Gaussian distribution and its complex representation

How is a complex random variable $x$, a complex random vector $\mathbf{x}$, a complex random signal $x(t)$, or a complex vector-valued random signal $\mathbf{x}(t)$ statistically described using probability distributions and moments? This question will occupy many of the ensuing chapters in this book. However, as a preview of our methods, we will offer a sketchy account of complex second-order moments and the Gaussian probability density function for the complex scalar $x = u + jv$. A more general account for vector-valued $\mathbf{x}$ will be given in Chapter 2.

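1.6.1 Bivariate Gaussian distribution

The real components $u$ and $v$ of the complex scalar random variable $x = u + jv$, which may be arranged in a vector $\mathbf{z} = [u, v]^{\mathrm T}$, are said to be bivariate Gaussian distributed, with mean zero and covariance matrix $\mathbf{R}_{zz}$, if their joint probability density function (pdf) is

$$p_{uv}(u, v) = \frac{1}{2\pi \det^{1/2}\mathbf{R}_{zz}} \exp\left\{-\frac12 \begin{bmatrix} u & v \end{bmatrix} \mathbf{R}_{zz}^{-1} \begin{bmatrix} u \\ v \end{bmatrix}\right\} = \frac{1}{2\pi \det^{1/2}\mathbf{R}_{zz}} \exp\left\{-\frac12 q_{uv}(u, v)\right\}. \tag{1.44}$$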

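Here the quadratic form $q_{uv}(u, v)$ and the covariance matrix $\mathbf{R}_{zz}$ of the composite vector $\mathbf{z}$ are defined as follows:

$$q_{uv}(u, v) = \begin{bmatrix} u & v \end{bmatrix} \mathbf{R}_{zz}^{-1} \begin{bmatrix} u \\ v \end{bmatrix}, \tag{1.45}$$

$$\mathbf{R}_{zz} = E(\mathbf{z}\mathbf{z}^{\mathrm T}) = \begin{bmatrix} E(u^2) & E(uv) \\ E(vu) & E(v^2) \end{bmatrix} = \begin{bmatrix} R_{uu} & \sqrt{R_{uu}}\sqrt{R_{vv}}\,\rho_{uv} \\ \sqrt{R_{uu}}\sqrt{R_{vv}}\,\rho_{uv} & R_{vv} \end{bmatrix}. \tag{1.46}$$

In the right-most parameterization of $\mathbf{R}_{zz}$, the terms are

$R_{uu} = E(u^2)$: variance of the random variable $u$,

$R_{vv} = E(v^2)$: variance of the random variable $v$,

$R_{uv} = \sqrt{R_{uu}}\sqrt{R_{vv}}\,\rho_{uv} = E(uv)$: correlation of the random variables $u$, $v$,

$\rho_{uv} = \dfrac{R_{uv}}{\sqrt{R_{uu}}\sqrt{R_{vv}}}$: correlation coefficient of the random variables $u$, $v$.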

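As in (A1.38), the inverse of the covariance matrix $\mathbf{R}_{zz}^{-1}$ may be factored as

$$\mathbf{R}_{zz}^{-1} = \begin{bmatrix} 1 & 0 \\ -(\sqrt{R_{uu}}/\sqrt{R_{vv}})\rho_{uv} & 1 \end{bmatrix} \begin{bmatrix} 1/[R_{uu}(1-\rho_{uv}^2)] & 0 \\ 0 & 1/R_{vv} \end{bmatrix} \begin{bmatrix} 1 & -(\sqrt{R_{uu}}/\sqrt{R_{vv}})\rho_{uv} \\ 0 & 1 \end{bmatrix}. \tag{1.47}$$

Using (A1.3) we find $\det \mathbf{R}_{zz} = R_{uu}(1-\rho_{uv}^2)R_{vv}$, and from here the bivariate pdf $p_{uv}(u, v)$ may be written as

$$p_{uv}(u, v) = \frac{1}{(2\pi R_{uu}(1-\rho_{uv}^2))^{1/2}} \exp\left\{-\frac{1}{2R_{uu}(1-\rho_{uv}^2)}\left(u - \frac{\sqrt{R_{uu}}}{\sqrt{R_{vv}}}\rho_{uv}v\right)^2\right\} \times \frac{1}{(2\pi R_{vv})^{1/2}} \exp\left\{-\frac{1}{2R_{vv}}v^2\right\}. \tag{1.48}$$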


The term $(\sqrt{R_{uu}}/\sqrt{R_{vv}})\rho_{uv}v$ is the conditional mean estimator of $u$ from $v$ and $e = u - (\sqrt{R_{uu}}/\sqrt{R_{vv}})\rho_{uv}v$ is the error of this estimator. Thus the bivariate pdf $p_{uv}(u, v)$ factors into a zero-mean Gaussian pdf for the error $e$, with variance $R_{uu}(1-\rho_{uv}^2)$, and a zero-mean Gaussian pdf for $v$, with variance $R_{vv}$. The error $e$ and $v$ are independent.

From $u = r\cos\theta$, $v = r\sin\theta$, $du\,dv = r\,dr\,d\theta$, it is possible to change variables and obtain the pdf for the polar coordinates $(r, \theta)$:

$$p_{r\theta}(r, \theta) = r \cdot p_{uv}(u, v)\big|_{u = r\cos\theta,\; v = r\sin\theta}. \tag{1.49}$$

From here it is possible to integrate over r to obtain the marginal pdf for θ and over θ to obtain the marginal pdf for r . But this sequence of steps is so clumsy that it is hard to find formulas in the literature for these marginal pdfs. There is an alternative, which demonstrates again the power of complex representations.

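1.6.2 Complex representation of the bivariate Gaussian distribution

Let's code the real random variables $u$ and $v$ as

$$\begin{bmatrix} u \\ v \end{bmatrix} = \frac12 \begin{bmatrix} 1 & 1 \\ -j & j \end{bmatrix} \begin{bmatrix} x \\ x^* \end{bmatrix}. \tag{1.50}$$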

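Then the quadratic form $q_{uv}(u, v)$ in the definition of the bivariate Gaussian distribution (1.45) may be written as

$$q_{uv}(u, v) = \frac14 \begin{bmatrix} x^* & x \end{bmatrix} \begin{bmatrix} 1 & j \\ 1 & -j \end{bmatrix} \mathbf{R}_{zz}^{-1} \begin{bmatrix} 1 & 1 \\ -j & j \end{bmatrix} \begin{bmatrix} x \\ x^* \end{bmatrix} = \begin{bmatrix} x^* & x \end{bmatrix} \underline{\mathbf{R}}_{xx}^{-1} \begin{bmatrix} x \\ x^* \end{bmatrix}, \tag{1.51}$$

where the covariance matrix $\underline{\mathbf{R}}_{xx}$ and its inverse $\underline{\mathbf{R}}_{xx}^{-1}$ are

$$\underline{\mathbf{R}}_{xx} = E\left\{\begin{bmatrix} x \\ x^* \end{bmatrix} \begin{bmatrix} x^* & x \end{bmatrix}\right\} = \begin{bmatrix} R_{xx} & \tilde{R}_{xx} \\ \tilde{R}_{xx}^* & R_{xx} \end{bmatrix} = \begin{bmatrix} 1 & j \\ 1 & -j \end{bmatrix} \mathbf{R}_{zz} \begin{bmatrix} 1 & 1 \\ -j & j \end{bmatrix}, \tag{1.52}$$

$$\underline{\mathbf{R}}_{xx}^{-1} = \frac14 \begin{bmatrix} 1 & j \\ 1 & -j \end{bmatrix} \mathbf{R}_{zz}^{-1} \begin{bmatrix} 1 & 1 \\ -j & j \end{bmatrix} = \frac{1}{R_{xx}^2 - |\tilde{R}_{xx}|^2} \begin{bmatrix} R_{xx} & -\tilde{R}_{xx} \\ -\tilde{R}_{xx}^* & R_{xx} \end{bmatrix}. \tag{1.53}$$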

The new terms in this representation of the quadratic form $q_{uv}(u, v)$ bear comment. So let's consider the elements of $\underline{\mathbf{R}}_{xx}$. The variance term $R_{xx}$ is

$$R_{xx} = E|x|^2 = E[(u + jv)(u - jv)] = R_{uu} + R_{vv} + j0. \tag{1.54}$$

This variance alone is an incomplete characterization for the bivariate pair $(u, v)$, and it carries no information at all about $\rho_{uv}$, the correlation coefficient between the random variables $u$ and $v$. But $\underline{\mathbf{R}}_{xx}$ contains another complex second-order moment

$$\tilde{R}_{xx} = E x^2 = E[(u + jv)(u + jv)] = R_{uu} - R_{vv} + j2\sqrt{R_{uu}}\sqrt{R_{vv}}\,\rho_{uv}, \tag{1.55}$$


which we will call the complementary variance. The complementary variance is the correlation between $x$ and its conjugate $x^*$. It is zero if and only if $R_{uu} = R_{vv}$ and $\rho_{uv} = 0$. This is the so-called proper case. All others are improper.

Now let's introduce the complex correlation coefficient $\rho$ between $x$ and $x^*$ as

$$\rho = \frac{\tilde{R}_{xx}}{R_{xx}}. \tag{1.56}$$

Thus, we may write $\tilde{R}_{xx} = R_{xx}\rho$ and $R_{xx}^2 - |\tilde{R}_{xx}|^2 = R_{xx}^2(1 - |\rho|^2)$. The complex correlation coefficient $\rho = |\rho|e^{j\psi}$ satisfies $|\rho| \le 1$. If $|\rho| = 1$, then $x = \tilde{R}_{xx}R_{xx}^{-1}x^* = \rho x^* = e^{j\psi}x^*$ with probability 1. Equivalent conditions for $|\rho| = 1$ are $R_{uu} = 0$, or $R_{vv} = 0$, or $\rho_{uv} = \pm 1$. The first of these conditions makes the complex signal $x$ purely imaginary and the second makes it real. The third condition means $v = \tan(\psi/2)u$ and $x = [1 + j\tan(\psi/2)]u$. All these cases with $|\rho| = 1$ are called maximally improper because the support of the pdf for the complex random variable $x$ degenerates into a line in the complex plane.

There are three real parameters $R_{uu}$, $R_{vv}$, and $\rho_{uv}$ required to determine the bivariate pdf for $(u, v)$, and these may be obtained from the three real values $R_{xx}$, $\operatorname{Re}\rho$, and $\operatorname{Im}\rho$ (or alternatively, $R_{xx}$, $\operatorname{Re}\tilde{R}_{xx}$, and $\operatorname{Im}\tilde{R}_{xx}$) using the following inverse formulas of (1.54) and (1.55):

$$R_{uu} = \tfrac12 R_{xx}(1 + \operatorname{Re}\rho), \tag{1.57}$$

$$R_{vv} = \tfrac12 R_{xx}(1 - \operatorname{Re}\rho), \tag{1.58}$$

$$\rho_{uv} = \frac{\operatorname{Im}\rho}{\sqrt{1 - (\operatorname{Re}\rho)^2}}. \tag{1.59}$$

Now, by replacing the quadratic form $q_{uv}(u, v)$ in (1.44) with the expression (1.51), and noting that $\det \underline{\mathbf{R}}_{xx} = 4 \det \mathbf{R}_{zz}$, we may record the complex representation of the pdf for the bivariate Gaussian distribution or, equivalently, the pdf for complex $x$:

$$p_x(x) \triangleq p_{uv}(u, v) = \frac{1}{\pi \det^{1/2}\underline{\mathbf{R}}_{xx}} \exp\left\{-\frac12 \begin{bmatrix} x^* & x \end{bmatrix} \underline{\mathbf{R}}_{xx}^{-1} \begin{bmatrix} x \\ x^* \end{bmatrix}\right\} = \frac{1}{\pi R_{xx}\sqrt{1 - |\rho|^2}} \exp\left\{-\frac{|x|^2 - \operatorname{Re}(\rho x^{*2})}{R_{xx}(1 - |\rho|^2)}\right\}. \tag{1.60}$$

This shows that the bivariate pdf for the real pair $(u, v)$ can be written in terms of the complex variable $x$. Yet the formula (1.60) is not what most people expect to see when they talk about the pdf of a complex Gaussian random variable $x$. In fact, it is often implicitly assumed that $x$ is proper, i.e., $\rho = 0$ and $x$ is not correlated with $x^*$. Then the pdf takes on the simple and much better-known form

$$p_x(x) = \frac{1}{\pi R_{xx}} \exp\left\{-\frac{|x|^2}{R_{xx}}\right\}. \tag{1.61}$$

But it is clear from our development that (1.61) models only a special case of the bivariate pdf for $(u, v)$ where $R_{uu} = R_{vv}$ and $\rho_{uv} = 0$. In general, we need to incorporate both the variance $R_{xx}$ and the complementary variance $\tilde{R}_{xx} = R_{xx}\rho$. That is, even in this very simple bivariate case, we need to take into account the correlation between $x$ and its complex conjugate $x^*$. In Chapter 2, this line of argument is generalized to derive the complex representation of the multivariate pdf $p_{\mathbf{u}\mathbf{v}}(\mathbf{u}, \mathbf{v})$.
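As a numerical illustration, the real form (1.44) and the complex form (1.60) can be evaluated at the same point and compared. This is a sketch under stated assumptions: the function names and the test parameters are made up for the example, and nondegenerate moments ($|\rho| < 1$) are assumed.

```python
import numpy as np

def pdf_real(u, v, Ruu, Rvv, rho_uv):
    """Bivariate Gaussian pdf (1.44) in terms of (Ruu, Rvv, rho_uv)."""
    Ruv = np.sqrt(Ruu * Rvv) * rho_uv
    Rzz = np.array([[Ruu, Ruv], [Ruv, Rvv]])
    z = np.array([u, v])
    q = z @ np.linalg.solve(Rzz, z)            # quadratic form (1.45)
    return np.exp(-0.5 * q) / (2 * np.pi * np.sqrt(np.linalg.det(Rzz)))

def pdf_complex(x, Rxx, rho):
    """Same pdf written in terms of x = u + jv, per (1.60); needs |rho| < 1."""
    one_minus = 1 - abs(rho) ** 2
    q = (abs(x) ** 2 - (rho * np.conj(x) ** 2).real) / (Rxx * one_minus)
    return np.exp(-q) / (np.pi * Rxx * np.sqrt(one_minus))

# Assumed example parameters, mapped to (Rxx, rho) via (1.54)-(1.56).
Ruu, Rvv, rho_uv = 2.0, 0.5, 0.3
Rxx = Ruu + Rvv
rho = (Ruu - Rvv + 2j * np.sqrt(Ruu * Rvv) * rho_uv) / Rxx
p_a = pdf_real(0.7, -0.4, Ruu, Rvv, rho_uv)
p_b = pdf_complex(0.7 - 0.4j, Rxx, rho)
```

The two values `p_a` and `p_b` should agree to rounding error. Setting `rho = 0` in `pdf_complex` reduces it to the proper form (1.61).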

1.6.3 Polar coordinates and marginal pdfs

What could be the virtue of the complex representation for the real bivariate pdf? One answer is this: with the change from Cartesian to polar coordinates, the bivariate pdf takes the simple form

$$p_{r\theta}(r, \theta) = \frac{r}{\pi R_{xx}\sqrt{1 - |\rho|^2}} \exp\left\{-\frac{r^2[1 - |\rho|\cos(2\theta - \psi)]}{R_{xx}(1 - |\rho|^2)}\right\}, \tag{1.62}$$

where $x$ and $\rho$ have been given their polar representations $x = re^{j\theta}$ and $\rho = |\rho|e^{j\psi}$. It is now a simple matter to integrate this bivariate pdf to obtain the marginal pdfs for the radius $r$ and the angle $\theta$. The results, to be explored more fully in Chapter 2, are

$$p_r(r) = \frac{2r}{R_{xx}\sqrt{1 - |\rho|^2}} \exp\left\{-\frac{r^2}{R_{xx}(1 - |\rho|^2)}\right\} I_0\!\left(\frac{r^2|\rho|}{R_{xx}(1 - |\rho|^2)}\right), \quad r > 0, \tag{1.63}$$

$$p_\theta(\theta) = \frac{\sqrt{1 - |\rho|^2}}{2\pi[1 - |\rho|\cos(2\theta - \psi)]}, \quad -\pi < \theta \le \pi. \tag{1.64}$$

Here $I_0$ is the modified Bessel function of the first kind of order 0, defined as

$$I_0(z) = \frac{1}{\pi}\int_0^{\pi} e^{z\cos\theta}\,d\theta. \tag{1.65}$$

These results show that the parameters $(R_{xx}, |\rho|, \psi)$ for complex $x = u + jv$, rather than the parameters $(R_{uu}, R_{vv}, \rho_{uv})$ for real $(u, v)$, are the most natural parameterization for the joint and marginal pdfs of the polar coordinates $r$ and $\theta$. These marginals are illustrated in Fig. 1.9 for $\psi = \pi/2$ and various values of $|\rho|$. In the proper case $\rho = 0$, we see that the marginal pdf for $r$ is Rayleigh and the marginal pdf for $\theta$ is uniform. Because of the uniform phase distribution, a proper Gaussian random variable is also called circular. The larger $|\rho|$, the more improper (or noncircular) $x$ becomes, and the marginal distribution for $\theta$ develops two peaks at $\theta = \psi/2 = \pi/4$ and $\theta = \psi/2 - \pi$. At the same time, the maximum of the pdf for $r$ is shifted to the left. However, the change in the marginal for $r$ is not as dramatic as the change in the marginal for $\theta$.
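A quick numerical check of the marginals (1.63) and (1.64) is to confirm that each integrates to one and that the angle pdf peaks where $\cos(2\theta - \psi) = 1$. The sketch below uses plain Riemann sums on uniform grids. The parameter values and the truncation of the $r$-axis at 8 are assumptions chosen so that `np.i0` does not overflow; a scaled Bessel routine would be needed for larger arguments.

```python
import numpy as np

def p_r(r, Rxx, abs_rho):
    """Marginal pdf of the radius, per (1.63)."""
    s = Rxx * (1 - abs_rho ** 2)
    return (2 * r / (Rxx * np.sqrt(1 - abs_rho ** 2))
            * np.exp(-r ** 2 / s) * np.i0(r ** 2 * abs_rho / s))

def p_theta(theta, abs_rho, psi):
    """Marginal pdf of the angle, per (1.64)."""
    return np.sqrt(1 - abs_rho ** 2) / (
        2 * np.pi * (1 - abs_rho * np.cos(2 * theta - psi)))

Rxx, abs_rho, psi = 1.0, 0.8, np.pi / 2       # assumed example values

# Radius: the integrand vanishes at r = 0 and is negligible at r = 8,
# so a plain sum coincides with the trapezoidal rule.
dr = 8.0 / 16000
r = np.arange(16001) * dr
area_r = np.sum(p_r(r, Rxx, abs_rho)) * dr

# Angle: periodic integrand, so an equispaced sum over one period suffices.
K = 4096
theta = -np.pi + 2 * np.pi * np.arange(K) / K
pt = p_theta(theta, abs_rho, psi)
area_theta = np.sum(pt) * 2 * np.pi / K
theta_peak = theta[np.argmax(pt)]             # one of the two peaks
```

Both areas should be close to 1, and `theta_peak` should equal $\psi/2$ modulo $\pi$.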

Figure 1.9 Marginal pdfs for magnitude $r$ and angle $\theta$ in the general bivariate Gaussian distribution for $R_{xx} = 1$, $|\rho| = \{0, 0.6, 0.8, 0.95\}$, and $\psi = \pi/2$. [Left panel: $p_r(r)$ versus $r$; right panel: $p_\theta(\theta)$ versus $\theta/\pi$.]

1.7 Second-order analysis of the polarization ellipse

In the previous section we found that the second-order description of a complex Gaussian random variable needs to take into account complementary statistics. The same holds for the second-order description of complex random signals. As an obvious extension of our previous definition, we shall call a zero-mean complex signal $x(t)$, with correlation function $r_{xx}(t, \tau) = E[x(t + \tau)x^*(t)]$, proper if its complementary correlation function $\tilde{r}_{xx}(t, \tau) = E[x(t + \tau)x(t)] \equiv 0$. The following example is a preview of results that will be explored in much more detail in Chapters 8 and 9.

Example 1.5. Our discussion of the polarization ellipse in Section 1.3 has said nothing about the second-order statistical behavior of the complex coefficients $C_+ = A_+e^{j\theta_+}$ and $C_- = A_-e^{-j\theta_-}$. We show here how the second-order moments of these coefficients determine the second-order behavior of the complex signal $x(t) = C_+e^{j\omega_0 t} + C_-e^{-j\omega_0 t}$. We begin with the second-order Hermitian correlation function

$$\begin{aligned} r_{xx}(t, \tau) &= E[x(t + \tau)x^*(t)] \\ &= E[C_+e^{j\omega_0(t+\tau)} + C_-e^{-j\omega_0(t+\tau)}][C_+^*e^{-j\omega_0 t} + C_-^*e^{j\omega_0 t}] \\ &= E[C_+C_+^*]e^{j\omega_0\tau} + 2\operatorname{Re}\{E[C_+C_-^*]e^{j2\omega_0 t}e^{j\omega_0\tau}\} + E[C_-C_-^*]e^{-j\omega_0\tau} \end{aligned}$$

and the second-order complementary correlation function

$$\begin{aligned} \tilde{r}_{xx}(t, \tau) &= E[x(t + \tau)x(t)] \\ &= E[C_+e^{j\omega_0(t+\tau)} + C_-e^{-j\omega_0(t+\tau)}][C_+e^{j\omega_0 t} + C_-e^{-j\omega_0 t}] \\ &= E[C_+C_+]e^{j2\omega_0 t}e^{j\omega_0\tau} + 2E[C_+C_-]\cos(\omega_0\tau) + E[C_-C_-]e^{-j2\omega_0 t}e^{-j\omega_0\tau}. \end{aligned}$$

Several observations are in order:

• the signal $x(t)$ is real if and only if $C_-^* = C_+$
• the signal is wide-sense stationary (WSS), that is $r_{xx}(t, \tau) = r_{xx}(0, \tau)$ and $\tilde{r}_{xx}(t, \tau) = \tilde{r}_{xx}(0, \tau)$, if and only if $E[C_+C_-^*] = 0$, $E[C_+C_+] = 0$, and $E[C_-C_-] = 0$
• the signal is proper, that is $\tilde{r}_{xx}(t, \tau) = 0$, if and only if $E[C_+C_+] = 0$, $E[C_+C_-] = 0$, and $E[C_-C_-] = 0$
• the signal is proper and WSS if and only if $E[C_+C_-^*] = 0$, $E[C_+C_+] = 0$, $E[C_+C_-] = 0$, and $E[C_-C_-] = 0$
• the signal cannot be nonzero, real, and proper, because $C_-^* = C_+$ makes $E[C_+C_-] = E[C_+C_+^*] > 0$

So a signal can be proper and WSS, proper and nonstationary, improper and WSS, or improper and nonstationary. If the signal is real then it is improper, but it can be WSS or nonstationary.
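The two closed-form correlation functions of Example 1.5 can be verified exactly on a finite ensemble. In the sketch below, which is an illustration and not part of the original example, the expectation $E[\cdot]$ is taken as the average over a finite set of equally likely realizations $(C_+, C_-)$, so no Monte Carlo error is involved. The second ensemble, with independent phases drawn from $\{0, \pi/2, \pi, 3\pi/2\}$, is an assumed construction that makes all four moments in the "proper and WSS" condition vanish.

```python
import numpy as np

def signal(Cp, Cm, w0, t):
    """x(t) = C+ exp(j w0 t) + C- exp(-j w0 t) for every ensemble member."""
    return Cp * np.exp(1j * w0 * t) + Cm * np.exp(-1j * w0 * t)

def ensemble_correlations(Cp, Cm, w0, t, tau):
    """Hermitian and complementary correlations by direct ensemble averaging."""
    x1, x0 = signal(Cp, Cm, w0, t + tau), signal(Cp, Cm, w0, t)
    return np.mean(x1 * np.conj(x0)), np.mean(x1 * x0)

def closed_form_correlations(Cp, Cm, w0, t, tau):
    """The two closed-form expressions derived in Example 1.5."""
    E = np.mean
    ep, em = np.exp(1j * w0 * tau), np.exp(-1j * w0 * tau)
    r = (E(Cp * np.conj(Cp)) * ep
         + 2 * np.real(E(Cp * np.conj(Cm)) * np.exp(2j * w0 * t) * ep)
         + E(Cm * np.conj(Cm)) * em)
    r_compl = (E(Cp * Cp) * np.exp(2j * w0 * t) * ep
               + 2 * E(Cp * Cm) * np.cos(w0 * tau)
               + E(Cm * Cm) * np.exp(-2j * w0 * t) * em)
    return r, r_compl

# Assumed proper, WSS ensemble: all 16 pairs of independent quarter-turn phases.
phases = np.exp(1j * np.pi / 2 * np.arange(4))
Cp_wss, Cm_wss = [g.ravel() for g in np.meshgrid(phases, 0.5 * phases)]
```

For an arbitrary ensemble the two functions should return the same pair of values; for `Cp_wss`, `Cm_wss` the complementary correlation should vanish and the Hermitian correlation should not depend on `t`.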

1.8 Mathematical framework

This chapter has shown that there are conceptual differences between one complex signal and two real signals. In this section, we will explain why there are also mathematical differences. This will allow us to provide a first glimpse at the mathematical framework that underpins much of what is to come later in this book. Consider the simple $\mathbb{C}$-linear relationship between two complex scalars $x = u + jv$ and $y = a + jb$:

$$y = kx. \tag{1.66}$$

We may write this relationship in terms of real and imaginary parts as

$$\begin{bmatrix} a \\ b \end{bmatrix} = \underbrace{\begin{bmatrix} \operatorname{Re}k & -\operatorname{Im}k \\ \operatorname{Im}k & \operatorname{Re}k \end{bmatrix}}_{\mathbf{M}} \begin{bmatrix} u \\ v \end{bmatrix}. \tag{1.67}$$

The $2 \times 2$ matrix $\mathbf{M}$ has a special structure and is determined by two real numbers $\operatorname{Re}k$ and $\operatorname{Im}k$. On the other hand, a general linear transformation on $\mathbb{R}^2$ is

$$\begin{bmatrix} a \\ b \end{bmatrix} = \underbrace{\begin{bmatrix} M_{11} & M_{12} \\ M_{21} & M_{22} \end{bmatrix}}_{\mathbf{M}} \begin{bmatrix} u \\ v \end{bmatrix}, \tag{1.68}$$

where all four elements of $\mathbf{M}$ may be chosen freely. This $\mathbb{R}^2$-linear expression is $\mathbb{C}$-linear, i.e., it can be expressed as $y = kx$, if and only if $M_{11} = M_{22}$ and $M_{21} = -M_{12}$. In the more general case, where $M_{11} \ne M_{22}$ and/or $M_{21} \ne -M_{12}$, the complex equivalent of (1.68) is the linear–conjugate-linear, or widely linear, transformation

$$y = k_1x + k_2x^*. \tag{1.69}$$

Widely linear transformations depend linearly on $x$ and its conjugate $x^*$. The two complex coefficients $k_1$ and $k_2$ have a one-to-one correspondence to the four real coefficients $M_{11}$, $M_{12}$, $M_{21}$, $M_{22}$, which we derive in Section 2.1.

Traditionally, widely linear transformations have been employed only reluctantly. If the $\mathbb{R}^2$-linear transformation (1.68) does not satisfy $M_{11} = M_{22}$ and $M_{21} = -M_{12}$, most people would prefer the $\mathbb{R}^2$-linear representation (1.68) over the $\mathbb{C}$-widely linear representation (1.69). As a matter of fact, in classical complex analysis, functions that depend on $x^*$ are not even considered differentiable even if they are differentiable when expressed as two-dimensional functions in terms of real and imaginary parts (see Appendix 2). In this book, we aim to make the point that a complex representation is not only feasible but also can be much more powerful and elegant, even if it leads to expressions involving complex conjugates.

A key tool will be the augmented representation of widely linear transformations, where we express (1.69) as

$$\begin{bmatrix} y \\ y^* \end{bmatrix} = \begin{bmatrix} k_1 & k_2 \\ k_2^* & k_1^* \end{bmatrix} \begin{bmatrix} x \\ x^* \end{bmatrix}, \qquad \underline{\mathbf{y}} = \underline{\mathbf{K}}\,\underline{\mathbf{x}}. \tag{1.70}$$

This representation utilizes the augmented vectors

$$\underline{\mathbf{y}} = \begin{bmatrix} y \\ y^* \end{bmatrix} \quad\text{and}\quad \underline{\mathbf{x}} = \begin{bmatrix} x \\ x^* \end{bmatrix} \tag{1.71}$$

and the augmented matrix

$$\underline{\mathbf{K}} = \begin{bmatrix} k_1 & k_2 \\ k_2^* & k_1^* \end{bmatrix}. \tag{1.72}$$

Augmented vectors and matrices are underlined. The augmented representation obviously has some built-in redundancy but it will turn out to be very useful and convenient as we develop the first- and second-order theories of improper signals. The space of augmented complex vectors $[x, y]^{\mathrm T}$, where $y = x^*$, is denoted $\mathbb{C}^2_*$, and it is isomorphic to $\mathbb{R}^2$. However, $\mathbb{C}^2_*$ is only an $\mathbb{R}$-linear (or $\mathbb{C}$-widely linear), but not a $\mathbb{C}$-linear, subspace of $\mathbb{C}^2$. It satisfies all properties of a linear subspace except that it is not closed under multiplication with a complex scalar $\alpha$ because $\alpha x^* \ne (\alpha x)^*$. Similarly, augmented matrices form a matrix algebra that is closed under addition, multiplication, inversion, and multiplication with a real scalar but not under multiplication with a complex scalar.

We have already had a first exposure to augmented vectors and matrices in our discussion of the complex Gaussian distribution in Section 1.6. In fact, we now recognize that the covariance matrix defined in (1.52) is

$$\underline{\mathbf{R}}_{xx} = E\,\underline{\mathbf{x}}\,\underline{\mathbf{x}}^{\mathrm H} = \begin{bmatrix} R_{xx} & \tilde{R}_{xx} \\ \tilde{R}_{xx}^* & R_{xx} \end{bmatrix}. \tag{1.73}$$

This is the covariance matrix of the augmented vector $\underline{\mathbf{x}} = [x, x^*]^{\mathrm T}$, and $\underline{\mathbf{R}}_{xx}$ itself is an augmented matrix because $R_{xx} = R_{xx}^*$. This is why we call $\underline{\mathbf{R}}_{xx}$ the augmented covariance matrix of $x$. The augmented covariance matrix $\underline{\mathbf{R}}_{xx}$ is more than simply a convenient way of keeping track of both the variance $R_{xx} = E|x|^2$ and the complementary variance $\tilde{R}_{xx} = Ex^2$. By combining $R_{xx}$ and $\tilde{R}_{xx}$ into $\underline{\mathbf{R}}_{xx}$, we gain access to the large number of results on $2 \times 2$ matrices. For instance, we know that any covariance matrix, including $\underline{\mathbf{R}}_{xx}$, must be positive semidefinite and thus have nonnegative determinant, $\det \underline{\mathbf{R}}_{xx} \ge 0$. This immediately leads to $|\tilde{R}_{xx}|^2 \le R_{xx}^2$, a simple upper bound on the magnitude of the complementary variance.
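The correspondence between a real $2 \times 2$ matrix $\mathbf{M}$ and the pair $(k_1, k_2)$ can be illustrated numerically. The sketch below assumes the scalar case of the relation derived in Section 2.1, namely $k_1 = \tfrac12[M_{11} + M_{22} + j(M_{21} - M_{12})]$ and $k_2 = \tfrac12[M_{11} - M_{22} + j(M_{21} + M_{12})]$; the function names are illustrative.

```python
import numpy as np

def widely_linear_coeffs(M):
    """(k1, k2) such that a + jb = k1 x + k2 x* when [a, b] = M [u, v]."""
    k1 = 0.5 * (M[0, 0] + M[1, 1] + 1j * (M[1, 0] - M[0, 1]))
    k2 = 0.5 * (M[0, 0] - M[1, 1] + 1j * (M[1, 0] + M[0, 1]))
    return k1, k2

def augmented(k1, k2):
    """Augmented matrix K of (1.72)."""
    return np.array([[k1, k2], [np.conj(k2), np.conj(k1)]])

def is_augmented(K):
    """Check the block pattern: south row is the conjugate, swapped north row."""
    return bool(np.isclose(K[1, 1], np.conj(K[0, 0]))
                and np.isclose(K[1, 0], np.conj(K[0, 1])))

# A C-linear map y = kx corresponds to k2 = 0.
k = 0.8 - 0.3j
M_linear = np.array([[k.real, -k.imag], [k.imag, k.real]])
k1_lin, k2_lin = widely_linear_coeffs(M_linear)
```

Applying `widely_linear_coeffs` to a general `M` and comparing `k1 * x + k2 * np.conj(x)` with `M @ [u, v]` confirms (1.69), and the product of two matrices built by `augmented` again passes `is_augmented`, in line with the closure property stated above.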

1.9 A brief survey of applications

The following is a brief survey of a few applications of improper and noncircular complex random signals, without any attempt at a complete bibliography. Our aim here is to indicate the breadth of applications spanning areas as diverse as communications and oceanography. We apologize in advance to authors whose work has not been included.

Currently, much of the research utilizing improper random signals concerns applications in communications. So what has sparked this recent interest in impropriety? There is an important result (cf. Results 2.15 and 2.16) stating that wide-sense stationary analytic signals, and also complex baseband representations of wide-sense stationary real bandpass signals, must be proper. On the other hand, nonstationary analytic signals and complex baseband representations of nonstationary real bandpass signals can be improper. In digital communications, thermal noise is assumed to be wide-sense stationary, but the transmitted signals are nonstationary (in fact, as we will discuss in Chapter 9, they are cyclostationary). This means that the analytic and complex baseband representations of thermal noise are always proper, whereas the analytic and complex baseband representations of the transmitted data signal are potentially improper. In optimum maximum-likelihood detection, only the noise is assigned statistical properties. The likelihood function, which is the probability density function of the received signal conditioned on the transmitted signal, is proper because the noise is proper. Therefore, in maximum-likelihood detection, it is irrelevant whether or not the transmitted signal is improper. The communications research until the 1990s focused on optimal detection strategies based on maximum likelihood and was thus prone to overlook the potential impropriety of the data signal.
However, when more complicated scenarios are considered, such as multiuser or space–time communications, maximum likelihood is no longer a viable detection strategy because it is computationally too expensive. In these scenarios, detection is usually based on suboptimum algorithms that are less complex to implement. Many of these suboptimum detection algorithms do assign statistical properties to the signal. Hence, the potentially improper nature of signals must be taken into account when designing these detection algorithms. In mobile multiuser communications, this leads to a significantly improved tradeoff between spectral efficiency and power consumption. Important examples of digital modulation schemes that produce improper complex baseband signals are Binary Phase Shift Keying (BPSK), Pulse Amplitude Modulation (PAM), Gaussian Minimum Shift Keying (GMSK), Offset Quaternary Phase Shift Keying (OQPSK), and baseband (but not passband) Orthogonal Frequency Division Multiplexing (OFDM), which is commonly called Discrete Multitone (DMT). A small sample of papers addressing these issues is Yoon and Leib (1997), Gelli et al. (2000),

Lampe et al. (2002), Gerstacker et al. (2003), Nilsson et al. (2003), Napolitano and Tanda (2004), Witzke (2005), Buzzi et al. (2006), Jeon et al. (2006), Mirbagheri et al. (2006), Chevalier and Pipon (2006), Taubock (2007), and Cacciapuoti et al. (2007). Improper baseband communication signals can also arise due to imbalance between their in-phase and quadrature (I/Q) components. This can be caused by amplifier or receiver imperfections, or by communication channels that are not rotationally invariant. If the in-phase and quadrature components of a signal are subject to different gains, or if the phase-offset between them is not exactly 90◦ , even rotationally invariant modulation schemes such as Quaternary Phase Shift Keying (QPSK) become improper at the receiver. I/Q imbalance degrades the signal-to-noise ratio and thus bit error rate performance. Some papers proposing ways of compensating for I/Q imbalance in various types of communication systems include Anttila et al. (2008), Rykaczewski et al. (2008), and Zou et al. (2008). Morgan (2006) and Morgan and Madsen (2006) present techniques for wideband system identification when the system (e.g., a wideband wireless communication channel) is not rotationally invariant. Array processing is the generic term applied to processing the output of an array of sensors. An important example of array processing is beamforming, which allows directional signal transmission or reception, either for radio or for sound waves. Besides the perhaps obvious applications in radar, sonar, and wireless communications, array processing is also employed in fields such as seismology, radio astronomy, and biomedicine. Often the aim is to estimate the direction of arrival (DOA) of one or more signals of interest impinging on a sensor array. If the signals of interest or the interference are improper (as they would be if they originated, e.g., from a BPSK transmitter), this can be exploited to achieve higher DOA resolution. 
Some papers addressing the adaptation of array-processing algorithms to improper signals include Charge et al. (2001), McWhorter and Schreier (2003), Haardt and Roemer (2004), Delmas (2004), Chevalier and Blin (2007), and Römer and Haardt (2009). Another area where the theory of impropriety has led to important advances is machine learning. Much interest is centered around independent component analysis (ICA), which is a technique for separating a multivariate signal into additive components that are as independent as possible. A typical application of ICA is to functional magnetic resonance imaging (fMRI), which measures neural activity in the brain or spinal cord. The fMRI signal is naturally modeled as a complex signal (see Adali and Calhoun (2007) for an introduction to complex ICA applied to fMRI data). In the past, it has been assumed that the fMRI signal is proper, when in fact it isn’t. As we will discuss in Section 3.5, impropriety is even a desirable property because it enables the separation of signals that would otherwise not be separable. The impropriety of the fMRI signal is now recognized, and techniques for ICA of complex signals that exploit impropriety have been proposed by DeLathauwer and DeMoor (2002), Eriksson and Koivunen (2006), Adali et al. (2008), Novey and Adali (2008a, 2008b), Li and Adali (2008), and Ollila and Koivunen (2009), amongst others. Machine-learning techniques are often implemented using neural networks. Examples of neural-network implementations of signal processing algorithms utilizing complementary statistics are given by Goh and Mandic (2007a, 2007b). A comprehensive

account of complex-valued nonlinear adaptive filters is provided in the research monograph by Mandic and Goh (2009). The theory of impropriety has also found recent applications in acoustics (e.g., Rivet et al. (2007)) and optics. In optics, the standard correlation function is called the phase-insensitive correlation, and the complementary correlation function is called the phase-sensitive correlation. In a recent paper, Shapiro and Erkmen (2007) state that “Optical coherence theory for the complex envelopes of passband fields has been concerned, almost exclusively, with correlations that are all phase insensitive, despite decades of theoretical and experimental work on the generation and applications of light with phase-sensitive correlations. This paper begins the process of remedying that deficiency . . . .” More details on the work with phase-sensitive light can be found in Erkmen and Shapiro (2006) and the Ph.D. dissertation by Erkmen (2008). Maybe our discussion of polarization analysis in Sections 8.4 and 9.4 can make a contribution to this topic. It is interesting to note that perhaps the first fields of research that recognized the importance of the complementary correlation and made consistent use of complex representations are oceanography and geophysics. The seminal paper by Mooers (1973), building upon prior work by Gonella (1972), presented techniques for the cross-spectrum analysis of bivariate time series by modeling them as complex-valued time series. Mooers realized that the information in the standard correlation function – which he called the inner-cross correlation – must be complemented by the complementary correlation function – which he called the outer-cross correlation – to fully model the second-order behavior of bivariate time series. He also recognized that the complex-valued description yields the desirable property that coherences are invariant under coordinate rotation.
Moreover, testing for coherence between a pair of complex-valued time series is significantly simplified compared with the real-valued description. Early examples using this work are the analysis of wind fields by Burt et al. (1974) and the interpretation of ocean-current spectra by Calman (1978). Mooers’ work is still frequently cited today, and it provides the basis for our discussion of polarization analysis in Section 8.4.

2

Introduction to complex random vectors and processes

This chapter lays the foundation for the remainder of the book by introducing key concepts and definitions for complex random vectors and processes. The structure of this chapter is as follows. In Section 2.1, we relate descriptions of complex random vectors to the corresponding descriptions in terms of their real and imaginary parts. We will see that operations that are linear when applied to real and imaginary parts generally become widely linear (i.e., linear–conjugate-linear) when applied to complex vectors. We introduce a matrix algebra that enables a convenient description of these widely linear transformations. Section 2.2 introduces a complete second-order statistical characterization of complex random vectors. The key finding is that the information in the standard, Hermitian, covariance matrix must be complemented by a second, complementary, covariance matrix. We establish the conditions that a pair of Hermitian and complementary covariance matrices must satisfy, and show what role the complementary covariance matrix plays in power and entropy. In Section 2.3, we explain that probability distributions and densities for complex random vectors must be interpreted as joint distributions and densities of their real and imaginary parts. We present two important distributions: the complex multivariate Gaussian distribution and its generalization, the complex multivariate elliptical distribution. These distributions depend both on the Hermitian covariance matrix and on the complementary covariance matrix, and their well-known versions are obtained for the zero complementary covariance matrix. In Section 2.4, we establish that the Hermitian sample covariance and complementary sample covariance matrices are maximum-likelihood estimators and sufficient statistics for the Hermitian covariance and complementary covariance matrices. The sample covariance matrix is complex Wishart distributed. 
In Section 2.5, we introduce characteristic and cumulant-generating functions, and use these to derive higher-order moments and cumulants of complex random vectors. We then discuss circular random vectors whose probability distributions are invariant under rotation. Circular random vectors may be regarded as an extension of proper random vectors, for which rotation invariance holds only for second-order moments. In Section 2.6, we extend some of these ideas and concepts to continuous-time complex random processes. However, we treat only second-order properties of wide-sense stationary processes and widely linear shift-invariant filtering of them, postponing more advanced topics, such as higher-order statistics and circularity, to Chapter 8.

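2.1 Connection between real and complex descriptions

Let $\Omega$ be the sample space of a random experiment, and $\mathbf{u}: \Omega \longrightarrow \mathbb{R}^n$ and $\mathbf{v}: \Omega \longrightarrow \mathbb{R}^n$ be two real random vectors defined on $\Omega$. From $\mathbf{u}$ and $\mathbf{v}$ we construct three closely related vectors. The first is the real composite random vector $\mathbf{z}: \Omega \longrightarrow \mathbb{R}^{2n}$, obtained by stacking $\mathbf{u}$ on $\mathbf{v}$:

$$\mathbf{z} = \begin{bmatrix} \mathbf{u} \\ \mathbf{v} \end{bmatrix}. \tag{2.1}$$

The second vector is the complex random vector $\mathbf{x}: \Omega \longrightarrow \mathbb{C}^n$, obtained by composing $\mathbf{u}$ and $\mathbf{v}$ into its real and imaginary parts:

$$\mathbf{x} = \mathbf{u} + j\mathbf{v}. \tag{2.2}$$

The third vector is the complex augmented random vector $\underline{\mathbf{x}}: \Omega \longrightarrow \mathbb{C}^{2n}_*$, obtained by stacking $\mathbf{x}$ on top of its complex conjugate $\mathbf{x}^*$:

$$\underline{\mathbf{x}} = \begin{bmatrix} \mathbf{x} \\ \mathbf{x}^* \end{bmatrix}. \tag{2.3}$$

The space of complex augmented vectors, whose bottom $n$ entries are the complex conjugates of the top $n$ entries, is denoted by $\mathbb{C}^{2n}_*$. Augmented vectors will always be underlined. The complex augmented vector $\underline{\mathbf{x}}$ is related to the real composite vector $\mathbf{z}$ as

$$\underline{\mathbf{x}} = \mathbf{T}_n\mathbf{z} \iff \mathbf{z} = \tfrac12\mathbf{T}_n^{\mathrm H}\underline{\mathbf{x}}, \tag{2.4}$$

where the real-to-complex transformation

$$\mathbf{T}_n = \begin{bmatrix} \mathbf{I} & j\mathbf{I} \\ \mathbf{I} & -j\mathbf{I} \end{bmatrix} \in \mathbb{C}^{2n \times 2n} \tag{2.5}$$

is unitary up to a factor of 2:

$$\mathbf{T}_n\mathbf{T}_n^{\mathrm H} = \mathbf{T}_n^{\mathrm H}\mathbf{T}_n = 2\mathbf{I}. \tag{2.6}$$

The complex augmented random vector $\underline{\mathbf{x}}: \Omega \longrightarrow \mathbb{C}^{2n}_*$ is obviously an equivalent redundant representation of $\mathbf{z}: \Omega \longrightarrow \mathbb{R}^{2n}$. But far from being a vice, this redundancy will be turned into an evident virtue as we develop the algebra of improper complex random vectors. Whenever the size of $\mathbf{T}_n$ is clear, we will drop the subscript $n$ for economy.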
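The relations (2.4) to (2.6) translate directly into NumPy. The following minimal sketch builds $\mathbf{T}_n$ and maps between the composite and augmented descriptions; the function names are illustrative.

```python
import numpy as np

def real_to_complex_T(n):
    """T_n of (2.5): maps z = [u; v] to the augmented vector [x; x*]."""
    I = np.eye(n)
    return np.block([[I, 1j * I], [I, -1j * I]])

def augment(z):
    """Composite real vector z = [u; v] -> augmented complex vector, per (2.4)."""
    n = z.size // 2
    return real_to_complex_T(n) @ z

def unaugment(x_aug):
    """Augmented vector -> composite real vector, z = (1/2) T^H x_aug."""
    n = x_aug.size // 2
    z = 0.5 * real_to_complex_T(n).conj().T @ x_aug
    return z.real          # imaginary part is zero up to rounding
```

For example, `augment(np.array([1.0, 2.0, 3.0, 4.0]))` stacks $\mathbf{x} = [1 + 3j,\ 2 + 4j]$ on top of its conjugate, and `unaugment` recovers the original real vector.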

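2.1.1 Widely linear transformations

If a real linear transformation $\mathbf{M} \in \mathbb{R}^{2m \times 2n}$ is applied to the composite real vector $\mathbf{z}: \Omega \longrightarrow \mathbb{R}^{2n}$, it yields a real composite vector $\mathbf{w}: \Omega \longrightarrow \mathbb{R}^{2m}$,

$$\mathbf{w} = \begin{bmatrix} \mathbf{a} \\ \mathbf{b} \end{bmatrix} = \begin{bmatrix} \mathbf{M}_{11} & \mathbf{M}_{12} \\ \mathbf{M}_{21} & \mathbf{M}_{22} \end{bmatrix} \begin{bmatrix} \mathbf{u} \\ \mathbf{v} \end{bmatrix} = \mathbf{M}\mathbf{z}, \tag{2.7}$$

where $\mathbf{M}_{ij} \in \mathbb{R}^{m \times n}$. The augmented complex version of $\mathbf{w}$ is

$$\underline{\mathbf{y}} = \begin{bmatrix} \mathbf{y} \\ \mathbf{y}^* \end{bmatrix} = \mathbf{T}_m \begin{bmatrix} \mathbf{a} \\ \mathbf{b} \end{bmatrix} = \left(\tfrac12\mathbf{T}_m\mathbf{M}\mathbf{T}_n^{\mathrm H}\right)(\mathbf{T}_n\mathbf{z}) = \underline{\mathbf{H}}\,\underline{\mathbf{x}}, \tag{2.8}$$

with $\mathbf{y} = \mathbf{a} + j\mathbf{b}$. The matrix $\underline{\mathbf{H}} \in \mathbb{C}^{2m \times 2n}$ is called an augmented matrix because it satisfies a particular block pattern, where the southeast block is the conjugate of the northwest block, and the southwest block is the conjugate of the northeast block:

$$\underline{\mathbf{H}} = \tfrac12\mathbf{T}_m\mathbf{M}\mathbf{T}_n^{\mathrm H} = \begin{bmatrix} \mathbf{H}_1 & \mathbf{H}_2 \\ \mathbf{H}_2^* & \mathbf{H}_1^* \end{bmatrix}, \tag{2.9}$$

$$\mathbf{H}_1 = \tfrac12[\mathbf{M}_{11} + \mathbf{M}_{22} + j(\mathbf{M}_{21} - \mathbf{M}_{12})], \qquad \mathbf{H}_2 = \tfrac12[\mathbf{M}_{11} - \mathbf{M}_{22} + j(\mathbf{M}_{21} + \mathbf{M}_{12})].$$

Hence, $\underline{\mathbf{H}}$ is an augmented description of the widely linear or linear–conjugate-linear transformation¹

$$\mathbf{y} = \mathbf{H}_1\mathbf{x} + \mathbf{H}_2\mathbf{x}^*. \tag{2.10}$$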

Obviously, the set of complex linear transformations, $\mathbf{y} = \mathbf{H}_1\mathbf{x}$ with $\mathbf{H}_2 = \mathbf{0}$, is a subset of the set of widely linear transformations. A complex linear transformation (sometimes called strictly linear for emphasis) has the equivalent real representation

$$\begin{bmatrix} \mathbf{a} \\ \mathbf{b} \end{bmatrix} = \begin{bmatrix} \mathbf{M}_{11} & \mathbf{M}_{12} \\ -\mathbf{M}_{12} & \mathbf{M}_{11} \end{bmatrix} \begin{bmatrix} \mathbf{u} \\ \mathbf{v} \end{bmatrix}. \tag{2.11}$$

Even though the representation (2.9) contains some redundancy in that the northern blocks determine the southern blocks, it will prove to be very powerful in due course. For instance, it enables easy concatenation of widely linear transformations.

Let's recap. Linear transformations on $\mathbb{R}^{2n}$ are linear on $\mathbb{C}^n$ only if they have the particular structure (2.11). Otherwise, the equivalent operation on $\mathbb{C}^n$ is widely linear. Representing $\mathbb{R}$-linear operations as $\mathbb{C}$-widely linear operations often provides more insight. However, from a hardware implementation point of view, $\mathbb{R}$-linear transformations are usually preferable over $\mathbb{C}$-widely linear transformations because the former require fewer real operations (additions and multiplications) than the latter.

We will let $\mathcal{W}^{m \times n}$ denote the set of $2m \times 2n$ augmented matrices that satisfy the pattern (2.9). Elements of $\mathcal{W}^{m \times n}$ are always underlined. For $m = n$, the set $\mathcal{W}^{n \times n}$ is a real matrix algebra that is closed under addition, multiplication, inversion, and multiplication by a real, but not complex, scalar.

Example 2.1. Let us show that $\mathcal{W}^{n \times n}$ is closed under inversion. Using the matrix-inversion lemma (A1.42) in Appendix 1, the inverse of the block matrix

$$\underline{\mathbf{H}} = \begin{bmatrix} \mathbf{H}_1 & \mathbf{H}_2 \\ \mathbf{H}_2^* & \mathbf{H}_1^* \end{bmatrix}$$

can be calculated as

$$\underline{\mathbf{H}}^{-1} = \begin{bmatrix} (\mathbf{H}_1 - \mathbf{H}_2\mathbf{H}_1^{-*}\mathbf{H}_2^*)^{-1} & -(\mathbf{H}_1 - \mathbf{H}_2\mathbf{H}_1^{-*}\mathbf{H}_2^*)^{-1}\mathbf{H}_2\mathbf{H}_1^{-*} \\ -(\mathbf{H}_1^* - \mathbf{H}_2^*\mathbf{H}_1^{-1}\mathbf{H}_2)^{-1}\mathbf{H}_2^*\mathbf{H}_1^{-1} & (\mathbf{H}_1^* - \mathbf{H}_2^*\mathbf{H}_1^{-1}\mathbf{H}_2)^{-1} \end{bmatrix} = \begin{bmatrix} \mathbf{N}_1 & \mathbf{N}_2 \\ \mathbf{N}_2^* & \mathbf{N}_1^* \end{bmatrix} = \underline{\mathbf{N}},$$

which has the block structure (2.9).

When working with the augmented matrix algebra $\mathcal{W}$ we often require that all factors in matrix factorizations represent widely linear transformations. If that is the case, we need to ensure that all factors satisfy the block pattern (2.9). If a factor $\mathbf{H} \notin \mathcal{W}$, it does not represent a widely linear transformation, since applying $\mathbf{H}$ to an augmented vector $\underline{\mathbf{x}}$ would yield a vector whose last $n$ entries are not the conjugate of the first $n$ entries.

Example 2.2. Consider the Cholesky factorization of a positive definite matrix $\underline{\mathbf{H}} \in \mathcal{W}^{n \times n}$ into a lower-triangular and upper-triangular factor. If we require that the Cholesky factors be widely linear transformations, then $\underline{\mathbf{H}} = \mathbf{X}\mathbf{X}^{\mathrm H}$ with $\mathbf{X}$ lower triangular and $\mathbf{X}^{\mathrm H}$ upper triangular will not work since generally $\mathbf{X} \notin \mathcal{W}^{n \times n}$. Instead, we determine the Cholesky factorization of the equivalent real matrix

$$\tfrac12\mathbf{T}^{\mathrm H}\underline{\mathbf{H}}\,\mathbf{T} = \mathbf{L}\mathbf{L}^{\mathrm H}, \tag{2.12}$$

and transform $\mathbf{L}$ into the augmented complex notation as $\underline{\mathbf{N}} = \tfrac12\mathbf{T}\mathbf{L}\mathbf{T}^{\mathrm H}$. Then

$$\underline{\mathbf{H}} = \underline{\mathbf{N}}\,\underline{\mathbf{N}}^{\mathrm H} \tag{2.13}$$

is the augmented complex representation of the Cholesky factorization of $\underline{\mathbf{H}}$ with $\underline{\mathbf{N}} \in \mathcal{W}^{n \times n}$. Note that (2.13) simply reexpresses (2.12) in the augmented algebra $\mathcal{W}^{n \times n}$ but $\underline{\mathbf{N}}$ itself is not generally lower triangular, and neither are its blocks $\mathbf{N}_1$ and $\mathbf{N}_2$. An exception is the block-diagonal case: if $\underline{\mathbf{H}}$ is block-diagonal, $\underline{\mathbf{N}}$ is block-diagonal and the diagonal block $\mathbf{N}_1$ (and $\mathbf{N}_1^*$) is lower triangular.
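Examples 2.1 and 2.2 can be reproduced numerically. The sketch below is an illustration under stated assumptions: the random positive definite real matrix `M` stands in for the equivalent real matrix on the left of (2.12), and all function names are hypothetical.

```python
import numpy as np

def real_to_complex_T(n):
    """T_n of (2.5)."""
    I = np.eye(n)
    return np.block([[I, 1j * I], [I, -1j * I]])

def augmented_from_real(M, m, n):
    """Augmented matrix (1/2) T_m M T_n^H of (2.9) for a real 2m x 2n matrix M."""
    return 0.5 * real_to_complex_T(m) @ M @ real_to_complex_T(n).conj().T

def is_augmented(H):
    """Check the block pattern [[H1, H2], [H2*, H1*]]."""
    m, n = H.shape[0] // 2, H.shape[1] // 2
    return bool(np.allclose(H[m:, n:], H[:m, :n].conj())
                and np.allclose(H[m:, :n], H[:m, n:].conj()))

rng = np.random.default_rng(0)
n = 3
A = rng.standard_normal((2 * n, 2 * n))
M = A @ A.T + np.eye(2 * n)           # real symmetric positive definite
H = augmented_from_real(M, n, n)      # positive definite augmented matrix
H_inv = np.linalg.inv(H)              # Example 2.1: stays in the algebra
L = np.linalg.cholesky(M)             # real Cholesky factor, M = L L^T
N = augmented_from_real(L, n, n)      # Example 2.2: N = (1/2) T L T^H
```

Here `H_inv` and `N` should both pass `is_augmented`, and `N @ N.conj().T` should reproduce `H`, even though `N` is generally not lower triangular.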

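2.1.2 Inner products and quadratic forms

Consider the two $2n$-dimensional real composite vectors $\mathbf{w} = [\mathbf{a}^T, \mathbf{b}^T]^T$ and $\mathbf{z} = [\mathbf{u}^T, \mathbf{v}^T]^T$, the corresponding $n$-dimensional complex vectors $\mathbf{y} = \mathbf{a} + j\mathbf{b}$ and $\mathbf{x} = \mathbf{u} + j\mathbf{v}$, and their complex augmented descriptions $\underline{\mathbf{y}} = \mathbf{T}\mathbf{w}$ and $\underline{\mathbf{x}} = \mathbf{T}\mathbf{z}$. We may now relate the inner products defined on $\mathbb{R}^{2n}$, $\mathbb{C}_*^{2n}$, and $\mathbb{C}^n$ as

\[ \mathbf{w}^T\mathbf{z} = \tfrac{1}{2}\,\underline{\mathbf{y}}^H\underline{\mathbf{x}} = \operatorname{Re}\,(\mathbf{y}^H\mathbf{x}). \tag{2.14} \]

Thus, the usual inner product $\mathbf{w}^T\mathbf{z}$ defined on $\mathbb{R}^{2n}$ equals (up to a factor of 1/2) the inner product $\underline{\mathbf{y}}^H\underline{\mathbf{x}}$ defined on $\mathbb{C}_*^{2n}$, and also the real part of the usual inner product $\mathbf{y}^H\mathbf{x}$ defined on $\mathbb{C}^n$. In this book, we will compute inner products on $\mathbb{C}_*^{2n}$ as well as inner products on $\mathbb{C}^n$. These inner products are discussed in more detail in Section 5.1.

Another common real-valued expression is the quadratic form $\mathbf{z}^T\mathbf{M}\mathbf{z}$, which may be written as a (real-valued) widely quadratic form in $\mathbf{x}$:

\[ \mathbf{z}^T\mathbf{M}\mathbf{z} = \tfrac{1}{2}(\mathbf{z}^T\mathbf{T}^H)\left(\tfrac{1}{2}\mathbf{T}\mathbf{M}\mathbf{T}^H\right)(\mathbf{T}\mathbf{z}) = \tfrac{1}{2}\,\underline{\mathbf{x}}^H\underline{\mathbf{H}}\,\underline{\mathbf{x}}. \tag{2.15} \]

The augmented matrix $\underline{\mathbf{H}}$ and the real matrix $\mathbf{M}$ are connected as before in (2.9). Thus, we obtain

\[ \mathbf{z}^T\mathbf{M}\mathbf{z} = \mathbf{x}^H\mathbf{H}_1\mathbf{x} + \operatorname{Re}\,(\mathbf{x}^H\mathbf{H}_2\mathbf{x}^*). \tag{2.16} \]

Widely quadratic forms are discussed in more detail in the context of widely quadratic estimation in Section 5.7.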
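The identities (2.14) to (2.16) are easy to confirm numerically. Below is a small sketch, assuming NumPy, the map $\mathbf{T} = [\mathbf{I}, j\mathbf{I}; \mathbf{I}, -j\mathbf{I}]$, and a symmetric real matrix $\mathbf{M}$; the random test data and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
I = np.eye(n)
T = np.block([[I, 1j * I], [I, -1j * I]])

# Real composite vectors and their augmented complex descriptions.
w = rng.standard_normal(2 * n)
z = rng.standard_normal(2 * n)
y_aug, x_aug = T @ w, T @ z
y, x = y_aug[:n], x_aug[:n]

# Inner products (2.14); np.vdot conjugates its first argument.
ip_real = w @ z
ip_aug = 0.5 * np.vdot(y_aug, x_aug).real
ip_cplx = np.vdot(y, x).real

# Quadratic forms (2.15) and (2.16) for a symmetric real M.
A = rng.standard_normal((2 * n, 2 * n))
M = A + A.T
H_aug = 0.5 * T @ M @ T.conj().T
H1, H2 = H_aug[:n, :n], H_aug[:n, n:]
q_real = z @ M @ z
q_aug = 0.5 * np.vdot(x_aug, H_aug @ x_aug).real
q_blocks = np.vdot(x, H1 @ x).real + np.vdot(x, H2 @ x.conj()).real
```

All three inner products agree, and so do the three quadratic forms, up to floating-point rounding.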

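2.2 Second-order statistical properties

In order to characterize the second-order statistical properties of $\mathbf{x} = \mathbf{u} + j\mathbf{v}$, we consider the composite real random vector $\mathbf{z}$. Its mean vector is

\[ \boldsymbol{\mu}_z = E\mathbf{z} = \begin{bmatrix} E\mathbf{u} \\ E\mathbf{v} \end{bmatrix} = \begin{bmatrix} \boldsymbol{\mu}_u \\ \boldsymbol{\mu}_v \end{bmatrix} \tag{2.17} \]

and its covariance matrix is

\[ \mathbf{R}_{zz} = E(\mathbf{z} - \boldsymbol{\mu}_z)(\mathbf{z} - \boldsymbol{\mu}_z)^T = \begin{bmatrix} \mathbf{R}_{uu} & \mathbf{R}_{uv} \\ \mathbf{R}_{uv}^T & \mathbf{R}_{vv} \end{bmatrix} \tag{2.18} \]

with $\mathbf{R}_{uu} = E(\mathbf{u} - \boldsymbol{\mu}_u)(\mathbf{u} - \boldsymbol{\mu}_u)^T$, $\mathbf{R}_{uv} = E(\mathbf{u} - \boldsymbol{\mu}_u)(\mathbf{v} - \boldsymbol{\mu}_v)^T$, and $\mathbf{R}_{vv} = E(\mathbf{v} - \boldsymbol{\mu}_v)(\mathbf{v} - \boldsymbol{\mu}_v)^T$. The augmented mean vector of $\mathbf{x}$ is

\[ \underline{\boldsymbol{\mu}}_x = E\underline{\mathbf{x}} = \mathbf{T}\boldsymbol{\mu}_z = \begin{bmatrix} \boldsymbol{\mu}_x \\ \boldsymbol{\mu}_x^* \end{bmatrix} = \begin{bmatrix} \boldsymbol{\mu}_u + j\boldsymbol{\mu}_v \\ \boldsymbol{\mu}_u - j\boldsymbol{\mu}_v \end{bmatrix} \tag{2.19} \]

and the augmented covariance matrix of $\mathbf{x}$ is

\[ \underline{\mathbf{R}}_{xx} = E(\underline{\mathbf{x}} - \underline{\boldsymbol{\mu}}_x)(\underline{\mathbf{x}} - \underline{\boldsymbol{\mu}}_x)^H = \mathbf{T}\mathbf{R}_{zz}\mathbf{T}^H = \begin{bmatrix} \mathbf{R}_{xx} & \widetilde{\mathbf{R}}_{xx} \\ \widetilde{\mathbf{R}}_{xx}^* & \mathbf{R}_{xx}^* \end{bmatrix} = \underline{\mathbf{R}}_{xx}^H. \tag{2.20} \]

The augmented covariance matrix $\underline{\mathbf{R}}_{xx}$ is a member of the matrix algebra $\mathcal{W}^{n\times n}$. Its northwest block is the usual (Hermitian) covariance matrix

\[ \mathbf{R}_{xx} = E(\mathbf{x} - \boldsymbol{\mu}_x)(\mathbf{x} - \boldsymbol{\mu}_x)^H = \mathbf{R}_{uu} + \mathbf{R}_{vv} + j(\mathbf{R}_{uv}^T - \mathbf{R}_{uv}) = \mathbf{R}_{xx}^H \tag{2.21} \]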

and its northeast block is the complementary covariance matrix

\[ \widetilde{\mathbf{R}}_{xx} = E(\mathbf{x} - \boldsymbol{\mu}_x)(\mathbf{x} - \boldsymbol{\mu}_x)^T = \mathbf{R}_{uu} - \mathbf{R}_{vv} + j(\mathbf{R}_{uv}^T + \mathbf{R}_{uv}) = \widetilde{\mathbf{R}}_{xx}^T, \tag{2.22} \]

which uses a regular transpose rather than a Hermitian (conjugate) transpose. Other names for $\widetilde{\mathbf{R}}_{xx}$ include pseudo-covariance matrix, conjugate covariance matrix, and relation matrix.² It is important to note that both $\mathbf{R}_{xx}$ and $\widetilde{\mathbf{R}}_{xx}$ are required for a complete second-order characterization of $\mathbf{x}$. There is, however, an important special case in which the complementary covariance vanishes.

Definition 2.1. If the complementary covariance matrix vanishes, $\widetilde{\mathbf{R}}_{xx} = \mathbf{0}$, $\mathbf{x}$ is called proper, otherwise $\mathbf{x}$ is called improper.

The conditions for propriety on the covariance and cross-covariance of real and imaginary parts $\mathbf{u}$ and $\mathbf{v}$ are

\[ \mathbf{R}_{uu} = \mathbf{R}_{vv}, \tag{2.23} \]
\[ \mathbf{R}_{uv} = -\mathbf{R}_{uv}^T. \tag{2.24} \]

The second condition, (2.24), requires $\mathbf{R}_{uv}$ to have zero diagonal elements, but its off-diagonal elements may be nonzero. When $x = u + jv$ is scalar, then $R_{uv} = 0$ is necessary for propriety. If $\mathbf{x}$ is proper, its complementary covariance matrix $\widetilde{\mathbf{R}}_{xx} = \mathbf{0}$ and its Hermitian covariance matrix is

\[ \mathbf{R}_{xx} = 2\mathbf{R}_{uu} - 2j\mathbf{R}_{uv} = 2\mathbf{R}_{vv} + 2j\mathbf{R}_{uv}^T, \tag{2.25} \]

so its augmented covariance matrix $\underline{\mathbf{R}}_{xx}$ is block-diagonal. If complex $x$ is proper and scalar, then $R_{xx} = 2R_{uu} = 2R_{vv}$. It is easy to see that propriety is preserved by strictly linear transformations, which are represented by block-diagonal augmented matrices.
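As an illustration of (2.20) to (2.25), the following sketch builds the Hermitian and complementary covariance matrices from the real blocks and compares them with the blocks of $\mathbf{T}\mathbf{R}_{zz}\mathbf{T}^H$. It assumes NumPy; the function names and the numerical matrices are made up for the example.

```python
import numpy as np

def complex_covariances(Ruu, Ruv, Rvv):
    """Hermitian covariance (2.21) and complementary covariance (2.22)."""
    Rxx = Ruu + Rvv + 1j * (Ruv.T - Ruv)
    Rxx_compl = Ruu - Rvv + 1j * (Ruv.T + Ruv)
    return Rxx, Rxx_compl

def augmented_covariance(Rzz):
    """Augmented covariance T Rzz T^H of (2.20)."""
    n = Rzz.shape[0] // 2
    I = np.eye(n)
    T = np.block([[I, 1j * I], [I, -1j * I]])
    return T @ Rzz @ T.conj().T

# General (improper) case: a random real covariance matrix.
rng = np.random.default_rng(0)
n = 2
A = rng.standard_normal((2 * n, 2 * n))
Rzz = A @ A.T
Ruu, Ruv, Rvv = Rzz[:n, :n], Rzz[:n, n:], Rzz[n:, n:]
Rxx, Rxx_compl = complex_covariances(Ruu, Ruv, Rvv)
R_aug = augmented_covariance(Rzz)

# Proper case: Ruu = Rvv and Ruv skew-symmetric, conditions (2.23) and (2.24).
S = np.array([[2.0, 0.5], [0.5, 1.0]])
K = np.array([[0.0, 0.3], [-0.3, 0.0]])
Rxx_p, Rxx_compl_p = complex_covariances(S, K, S)
```

In the proper case the complementary covariance vanishes and the Hermitian covariance reduces to $2\mathbf{R}_{uu} - 2j\mathbf{R}_{uv}$, as in (2.25).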

2.2.1 Extending definitions from the real to the complex domain

A general question that has divided researchers is how to extend definitions from the real to the complex case. As an example, consider the definition of uncorrelatedness. Two real random vectors $\mathbf{z}$ and $\mathbf{w}$ are called uncorrelated if their cross-covariance matrix is zero:

\[ \mathbf{R}_{zw} = E(\mathbf{z} - \boldsymbol{\mu}_z)(\mathbf{w} - \boldsymbol{\mu}_w)^T = \mathbf{0}. \tag{2.26} \]

There are now two philosophies for a corresponding definition for two complex vectors $\mathbf{x}$ and $\mathbf{y}$. One could argue for the classical definition that calls $\mathbf{x}$ and $\mathbf{y}$ uncorrelated if

\[ \mathbf{R}_{xy} = E(\mathbf{x} - \boldsymbol{\mu}_x)(\mathbf{y} - \boldsymbol{\mu}_y)^H = \mathbf{0}. \tag{2.27} \]

This only considers the usual cross-covariance matrix but not the complementary cross-covariance matrix. On the other hand, if we consider $\mathbf{x}$ and $\mathbf{y}$ to be equivalent complex descriptions of real $\mathbf{z} = [\mathbf{u}^T, \mathbf{v}^T]^T$ and $\mathbf{w} = [\mathbf{a}^T, \mathbf{b}^T]^T$ as

\[ \mathbf{x} = \mathbf{u} + j\mathbf{v} \;\Leftrightarrow\; \underline{\mathbf{x}} = \mathbf{T}\mathbf{z}, \tag{2.28} \]
\[ \mathbf{y} = \mathbf{a} + j\mathbf{b} \;\Leftrightarrow\; \underline{\mathbf{y}} = \mathbf{T}\mathbf{w}, \tag{2.29} \]

then the condition equivalent to (2.26) is that the augmented cross-covariance matrix be zero:

\[ \underline{\mathbf{R}}_{xy} = E(\underline{\mathbf{x}} - \underline{\boldsymbol{\mu}}_x)(\underline{\mathbf{y}} - \underline{\boldsymbol{\mu}}_y)^H = \mathbf{0}. \tag{2.30} \]

Thus, (2.30) requires $\mathbf{R}_{xy} = \mathbf{0}$ but also $\widetilde{\mathbf{R}}_{xy} = \mathbf{0}$. If $\mathbf{x}$ and $\mathbf{y}$ have zero mean, an equivalent statement in the Hilbert space of second-order random variables is that the first definition (2.27) requires $\mathbf{x} \perp \mathbf{y}$, whereas the second definition (2.30) requires $\mathbf{x} \perp \mathbf{y}$ and $\mathbf{x} \perp \mathbf{y}^*$.

Which of these two definitions is more compelling? The first school of thought, which ignores complementary covariances in definitions, treats real and complex descriptions differently. This leads to unusual and counterintuitive results such as the following.

• Two uncorrelated Gaussian random vectors need not be independent (because their complementary covariance matrix need not be diagonal).
• A wide-sense stationary analytic signal may describe a nonstationary real signal (because the complementary covariance function of the analytic signal need not be shift-invariant).

We would like to avoid these displeasing results, and therefore always adhere to the following general principle in this book: definitions and conditions derived for the real and complex domains must be equivalent. This means that complementary covariances must be considered if they are nonzero.

2.2.2 Characterization of augmented covariance matrices

A matrix $\underline{\mathbf{R}}_{xx}$ is the augmented covariance matrix of a complex random vector $\mathbf{x}$ if and only if (1) it satisfies the block pattern (2.9), i.e., $\underline{\mathbf{R}}_{xx} \in \mathcal{W}^{n\times n}$, and (2) it is Hermitian and positive semidefinite. Condition (1) needs to be enforced when factoring an augmented covariance matrix into factors that represent widely linear transformations. Then all factors must be members of $\mathcal{W}^{n\times n}$. A particularly important example is the eigenvalue decomposition of $\underline{\mathbf{R}}_{xx}$, which will be presented in Chapter 3 (Result 3.1). Condition (2) leads to characterizations of the individual blocks of $\underline{\mathbf{R}}_{xx}$, i.e., the covariance matrix $\mathbf{R}_{xx}$ and the complementary covariance matrix $\widetilde{\mathbf{R}}_{xx}$.³

Result 2.1. If $\mathbf{R}_{xx}$ is nonsingular, the following three conditions are necessary and sufficient for $\mathbf{R}_{xx}$ and $\widetilde{\mathbf{R}}_{xx}$ to be covariance and complementary covariance matrices of a complex random vector $\mathbf{x}$.

1. The covariance matrix $\mathbf{R}_{xx}$ is Hermitian and positive semidefinite.
2. The complementary covariance matrix is symmetric, $\widetilde{\mathbf{R}}_{xx} = \widetilde{\mathbf{R}}_{xx}^T$.
3. The Schur complement of the augmented covariance matrix, $\mathbf{R}_{xx} - \widetilde{\mathbf{R}}_{xx}\mathbf{R}_{xx}^{-*}\widetilde{\mathbf{R}}_{xx}^*$, is positive semidefinite.

If $\mathbf{R}_{xx}$ is singular, then condition 3 must be replaced with

3a. The generalized Schur complement of the augmented covariance matrix, $\mathbf{R}_{xx} - \widetilde{\mathbf{R}}_{xx}(\mathbf{R}_{xx}^*)^{\dagger}\widetilde{\mathbf{R}}_{xx}^*$, where $(\cdot)^{\dagger}$ denotes the pseudo-inverse, is positive semidefinite.
3b. The null space of $\mathbf{R}_{xx}$ is contained in the null space of $\widetilde{\mathbf{R}}_{xx}^*$.

This result says that (1) any given complex random vector $\mathbf{x}$ has covariance and complementary covariance matrices $\mathbf{R}_{xx}$ and $\widetilde{\mathbf{R}}_{xx}$ that satisfy conditions 1–3, and (2), given a pair of matrices $\mathbf{R}_{xx}$ and $\widetilde{\mathbf{R}}_{xx}$ that satisfies conditions 1–3, there exists a complex random vector $\mathbf{x}$ with covariance and complementary covariance matrices $\mathbf{R}_{xx}$ and $\widetilde{\mathbf{R}}_{xx}$. We will revisit the problem of characterizing augmented covariance matrices in Section 3.2.3, where we will develop an alternative point of view.
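The three conditions of Result 2.1 translate directly into a numerical test. The sketch below assumes NumPy and covers only the nonsingular case (conditions 1 to 3); the function name, the tolerance, and the scalar test values are illustrative assumptions.

```python
import numpy as np

def is_valid_covariance_pair(Rxx, Rxx_compl, tol=1e-10):
    """Check conditions 1-3 of Result 2.1 for nonsingular Rxx."""
    # Condition 1: Rxx is Hermitian and positive semidefinite.
    if not np.allclose(Rxx, Rxx.conj().T):
        return False
    if np.linalg.eigvalsh(Rxx).min() < -tol:
        return False
    # Condition 2: the complementary covariance is symmetric.
    if not np.allclose(Rxx_compl, Rxx_compl.T):
        return False
    # Condition 3: Schur complement Rxx - R~ Rxx^{-*} R~^* is positive semidefinite.
    P = Rxx - Rxx_compl @ np.linalg.solve(Rxx.conj(), Rxx_compl.conj())
    P = 0.5 * (P + P.conj().T)   # remove rounding asymmetry before eigvalsh
    return bool(np.linalg.eigvalsh(P).min() >= -tol)

def augmented(Rxx, Rxx_compl):
    """Assemble the augmented matrix with the block pattern (2.9)."""
    return np.block([[Rxx, Rxx_compl], [Rxx_compl.conj(), Rxx.conj()]])

# Scalar examples: |R~| <= Rxx is required for a valid pair.
Rxx = np.array([[1.0 + 0j]])
ok_pair = is_valid_covariance_pair(Rxx, np.array([[0.5j]]))
bad_pair = is_valid_covariance_pair(Rxx, np.array([[1.5 + 0j]]))
min_eig_ok = np.linalg.eigvalsh(augmented(Rxx, np.array([[0.5j]]))).min()
min_eig_bad = np.linalg.eigvalsh(augmented(Rxx, np.array([[1.5 + 0j]]))).min()
```

The outcome of the block-wise test agrees with checking positive semidefiniteness of the assembled augmented matrix, which is the equivalent condition (2) stated at the start of the section.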

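2.2.3 Power and entropy

The average power of complex $\mathbf{x}$ is defined as

\[ P_x = \frac{1}{n}\sum_{i=1}^{n} E|x_i|^2. \tag{2.31} \]

It can be calculated as

\[ P_x = \frac{1}{n}\operatorname{tr}\mathbf{R}_{xx} = \frac{1}{2n}\operatorname{tr}\underline{\mathbf{R}}_{xx} = \frac{1}{n}\operatorname{tr}\mathbf{R}_{zz}. \tag{2.32} \]

Hence, power is invariant under widely unitary transformation $\underline{\mathbf{U}}$, $\underline{\mathbf{U}}\,\underline{\mathbf{U}}^H = \underline{\mathbf{U}}^H\underline{\mathbf{U}} = \mathbf{I}$, and $\mathbf{y} = \mathbf{U}_1\mathbf{x} + \mathbf{U}_2\mathbf{x}^*$ has the same power as $\mathbf{x}$.

The entropy of a complex random vector $\mathbf{x}$ is defined to be the entropy of the composite vector of real and imaginary parts $[\mathbf{u}^T, \mathbf{v}^T]^T = \mathbf{z}$. If $\mathbf{u}$ and $\mathbf{v}$ are jointly Gaussian distributed, their differential entropy is

\[ H(\mathbf{z}) = \tfrac{1}{2}\log\left[(2\pi e)^{2n}\det\mathbf{R}_{zz}\right]. \tag{2.33} \]

Since $\det\mathbf{T} = (-2j)^n$ and

\[ \det\underline{\mathbf{R}}_{xx} = \det\mathbf{R}_{zz}\,|\det\mathbf{T}|^2 = 2^{2n}\det\mathbf{R}_{zz}, \tag{2.34} \]

we obtain the following result.

Result 2.2. The differential entropy of a complex Gaussian random vector $\mathbf{x}$ with augmented covariance matrix $\underline{\mathbf{R}}_{xx}$ is

\[ H(\mathbf{x}) = \tfrac{1}{2}\log\left[(\pi e)^{2n}\det\underline{\mathbf{R}}_{xx}\right]. \tag{2.35} \]

The Fischer determinant inequality

\[ \det\begin{bmatrix} \mathbf{R}_{xx} & \widetilde{\mathbf{R}}_{xx} \\ \widetilde{\mathbf{R}}_{xx}^* & \mathbf{R}_{xx}^* \end{bmatrix} \le \det\begin{bmatrix} \mathbf{R}_{xx} & \mathbf{0} \\ \mathbf{0} & \mathbf{R}_{xx}^* \end{bmatrix} \tag{2.36} \]

establishes the following classical result.

Result 2.3. If $\mathbf{x}$ is Gaussian with given covariance matrix $\mathbf{R}_{xx}$, its differential entropy is maximized if $\mathbf{x}$ is proper. The differential entropy of a proper complex Gaussian $\mathbf{x}$ is

\[ H(\mathbf{x}) = \log\left[(\pi e)^{n}\det\mathbf{R}_{xx}\right]. \tag{2.37} \]

This formula for $H(\mathbf{x})$ is owed to $\det\underline{\mathbf{R}}_{xx} = \det^2\mathbf{R}_{xx}$ for block-diagonal $\underline{\mathbf{R}}_{xx}$. Like power, entropy is invariant under widely unitary transformation.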
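A short numerical sketch of Results 2.2 and 2.3 for a scalar example follows. It assumes NumPy, natural logarithms (entropy in nats), and an illustrative real correlation coefficient $\rho = 0.6$ between $x$ and $x^*$; the function names are hypothetical.

```python
import numpy as np

def entropy_from_real(Rzz):
    """(2.33): H = 1/2 log[(2 pi e)^{2n} det Rzz], in nats."""
    two_n = Rzz.shape[0]
    return 0.5 * (two_n * np.log(2 * np.pi * np.e) + np.linalg.slogdet(Rzz)[1])

def entropy_from_augmented(R_aug):
    """(2.35): H = 1/2 log[(pi e)^{2n} det R_aug], in nats."""
    two_n = R_aug.shape[0]
    return 0.5 * (two_n * np.log(np.pi * np.e) + np.linalg.slogdet(R_aug)[1])

# Scalar improper example: unequal variances of real and imaginary parts.
Rxx, rho = 1.0, 0.6
Ruu, Rvv = Rxx * (1 + rho) / 2, Rxx * (1 - rho) / 2
Rzz = np.diag([Ruu, Rvv])
R_aug = np.array([[Rxx, rho * Rxx], [np.conj(rho) * Rxx, Rxx]])

H_real = entropy_from_real(Rzz)
H_improper = entropy_from_augmented(R_aug)
H_proper = np.log(np.pi * np.e * Rxx)       # (2.37) with the same Rxx
```

The real and augmented formulas give the same value, and the improper entropy falls below the proper one for the same $R_{xx}$, in line with Result 2.3.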


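2.3 Probability distributions and densities

Rather than defining a complex random variable from first principles (where we would start with a probability measure on a sample space), we simply define a complex random variable $\mathbf{x}\colon \Omega \to \mathbb{C}^n$ as $\mathbf{x} = \mathbf{u} + j\mathbf{v}$, where $\mathbf{u}\colon \Omega \to \mathbb{R}^n$ and $\mathbf{v}\colon \Omega \to \mathbb{R}^n$ are a pair of real random variables. This pair $(\mathbf{u}, \mathbf{v})$ has the joint probability distribution

\[ P(\mathbf{u}_0, \mathbf{v}_0) = \operatorname{Prob}(\mathbf{u} \le \mathbf{u}_0, \mathbf{v} \le \mathbf{v}_0) \tag{2.38} \]

and joint probability density function (pdf)

\[ p(\mathbf{u}, \mathbf{v}) = \frac{\partial}{\partial\mathbf{u}}\frac{\partial}{\partial\mathbf{v}} P(\mathbf{u}, \mathbf{v}). \tag{2.39} \]

We will allow the use of Dirac delta functions in the pdf. When we write $P(\mathbf{x})$ or $p(\mathbf{x})$, we shall define this to mean

\[ P(\mathbf{x}) = P(\mathbf{u} + j\mathbf{v}) \triangleq P(\mathbf{u}, \mathbf{v}), \tag{2.40} \]
\[ p(\mathbf{x}) = p(\mathbf{u} + j\mathbf{v}) \triangleq p(\mathbf{u}, \mathbf{v}). \tag{2.41} \]

Thus, the probability distribution of a complex random vector is interpreted as the $2n$-dimensional joint distribution of its real and imaginary parts. The probability of $\mathbf{x}$ taking a value in the region $A = \{\mathbf{u}_1 < \mathbf{u} \le \mathbf{u}_2;\ \mathbf{v}_1 < \mathbf{v} \le \mathbf{v}_2\}$ is thus

\[ \operatorname{Prob}(\mathbf{x} \in A) = \int_{\mathbf{v}_1}^{\mathbf{v}_2}\int_{\mathbf{u}_1}^{\mathbf{u}_2} p(\mathbf{x})\, d\mathbf{u}\, d\mathbf{v}. \tag{2.42} \]

For a function $\mathbf{g}\colon D \to \mathbb{C}^n$ whose domain $D$ includes the range of $\mathbf{x}$, the expectation operator is defined accordingly as

\[ E\{\mathbf{g}(\mathbf{x})\} = E\{\operatorname{Re}[\mathbf{g}(\mathbf{x})]\} + jE\{\operatorname{Im}[\mathbf{g}(\mathbf{x})]\} = \int_{\mathbb{R}^{2n}} \mathbf{g}(\mathbf{u} + j\mathbf{v})\, p(\mathbf{u} + j\mathbf{v})\, d\mathbf{u}\, d\mathbf{v}. \tag{2.43} \]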

In many cases, expressing P(u, v) or p(u, v) in terms of x requires the use of the complex conjugate x∗. This has prompted many researchers to write P(x, x∗) and p(x, x∗), which raises the question of whether these are now the joint distribution and density for x and x∗. This question is actually ill-posed since distributions and densities of complex random vectors are always interpreted in terms of (2.40) and (2.41); whether we write this as p(x) or p(x, x∗) makes no difference. Nevertheless, the notation p(x, x∗) does seem to carry potential for confusion since x perfectly determines x∗, and vice versa. It is not possible to assign densities to x and x∗ independently. The advantage of expressing a pdf in terms of complex x does not lie in making Prob(x ∈ A) easier to evaluate; that is obviously not the case. However, direct calculations of Prob(x ∈ A) via (2.42) are rare. In most practical cases, e.g., maximum-likelihood or minimum mean-squared error estimation, we can work directly with p(x) since it contains all relevant information, conveniently parameterized in terms of the statistical properties of complex x.

We will now take a look at two important complex distributions: the multivariate Gaussian distribution and its generalization, the multivariate elliptical distribution. We will be particularly interested in expressing these pdfs in terms of covariance and complementary covariance matrices.

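2.3.1 Complex Gaussian distribution

In order to derive the general complex multivariate Gaussian pdf (proper or improper), we begin with the Gaussian pdf of the composite vector of real and imaginary parts $[\mathbf{u}^T, \mathbf{v}^T]^T = \mathbf{z}\colon \Omega \to \mathbb{R}^{2n}$:

\[ p(\mathbf{z}) = \frac{1}{(2\pi)^{2n/2}\det^{1/2}\mathbf{R}_{zz}} \exp\left\{-\tfrac{1}{2}(\mathbf{z} - \boldsymbol{\mu}_z)^T\mathbf{R}_{zz}^{-1}(\mathbf{z} - \boldsymbol{\mu}_z)\right\}. \tag{2.44} \]

Using

\[ \mathbf{R}_{zz}^{-1} = \mathbf{T}^H\underline{\mathbf{R}}_{xx}^{-1}\mathbf{T}, \tag{2.45} \]
\[ \det\underline{\mathbf{R}}_{xx} = 2^{2n}\det\mathbf{R}_{zz} \tag{2.46} \]

in (2.44), we obtain

\[ p(\mathbf{z}) = \frac{1}{\pi^{n}\det^{1/2}\underline{\mathbf{R}}_{xx}} \exp\left\{-\tfrac{1}{2}(\mathbf{z} - \boldsymbol{\mu}_z)^T\mathbf{T}^H\underline{\mathbf{R}}_{xx}^{-1}\mathbf{T}(\mathbf{z} - \boldsymbol{\mu}_z)\right\}. \tag{2.47} \]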

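With $\underline{\mathbf{x}} = \mathbf{T}\mathbf{z}$, we are now in a position to state the following.

Result 2.4. The general pdf of a complex Gaussian random vector $\mathbf{x}\colon \Omega \to \mathbb{C}^n$ is

\[ p(\mathbf{x}) = \frac{1}{\pi^{n}\det^{1/2}\underline{\mathbf{R}}_{xx}} \exp\left\{-\tfrac{1}{2}(\underline{\mathbf{x}} - \underline{\boldsymbol{\mu}}_x)^H\underline{\mathbf{R}}_{xx}^{-1}(\underline{\mathbf{x}} - \underline{\boldsymbol{\mu}}_x)\right\}. \tag{2.48} \]

This pdf algebraically depends on $\underline{\mathbf{x}}$, i.e., $\mathbf{x}$ and $\mathbf{x}^*$, but is interpreted as the joint pdf of $\mathbf{u}$ and $\mathbf{v}$. It may be used for proper or improper $\mathbf{x}$. In the past, the term “complex Gaussian distribution” often implicitly assumed propriety. Therefore, some researchers call an improper complex Gaussian random vector “generalized complex Gaussian.”⁴ The simplification that occurs in the proper case, where $\widetilde{\mathbf{R}}_{xx} = \mathbf{0}$ and $\underline{\mathbf{R}}_{xx}$ is block-diagonal, is obvious and leads to the following classical result.

Result 2.5. The pdf of a complex proper Gaussian random vector $\mathbf{x}\colon \Omega \to \mathbb{C}^n$ is

\[ p(\mathbf{x}) = \frac{1}{\pi^{n}\det\mathbf{R}_{xx}} \exp\left\{-(\mathbf{x} - \boldsymbol{\mu}_x)^H\mathbf{R}_{xx}^{-1}(\mathbf{x} - \boldsymbol{\mu}_x)\right\}. \tag{2.49} \]

Let’s go back to the general Gaussian pdf in Result 2.4. As in (A1.38) of Appendix 1, $\underline{\mathbf{R}}_{xx}^{-1}$ may be factored as

\[ \underline{\mathbf{R}}_{xx}^{-1} = \begin{bmatrix} \mathbf{R}_{xx} & \widetilde{\mathbf{R}}_{xx} \\ \widetilde{\mathbf{R}}_{xx}^* & \mathbf{R}_{xx}^* \end{bmatrix}^{-1} = \begin{bmatrix} \mathbf{I} & \mathbf{0} \\ -\mathbf{W}^H & \mathbf{I} \end{bmatrix} \begin{bmatrix} \mathbf{P}^{-1} & \mathbf{0} \\ \mathbf{0} & \mathbf{R}_{xx}^{-*} \end{bmatrix} \begin{bmatrix} \mathbf{I} & -\mathbf{W} \\ \mathbf{0} & \mathbf{I} \end{bmatrix}, \tag{2.50} \]

where $\mathbf{P} = \mathbf{R}_{xx} - \widetilde{\mathbf{R}}_{xx}\mathbf{R}_{xx}^{-*}\widetilde{\mathbf{R}}_{xx}^*$ is the Schur complement of $\mathbf{R}_{xx}^*$ within $\underline{\mathbf{R}}_{xx}$. Furthermore, $\mathbf{W} = \widetilde{\mathbf{R}}_{xx}\mathbf{R}_{xx}^{-*}$ produces the linear minimum mean-squared error (LMMSE) estimate of $\mathbf{x}$ from $\mathbf{x}^*$ as

\[ \hat{\mathbf{x}} = \mathbf{W}(\mathbf{x} - \boldsymbol{\mu}_x)^* + \boldsymbol{\mu}_x, \tag{2.51} \]

and $\operatorname{tr}\mathbf{P} = E\|\mathbf{x} - \hat{\mathbf{x}}\|^2$ is the corresponding LMMSE. From (A1.3) we find $\det\underline{\mathbf{R}}_{xx} = \det\mathbf{R}_{xx}^*\det\mathbf{P} = \det\mathbf{R}_{xx}\det\mathbf{P}$, and, using (2.50), we may then factor the improper pdf $p(\mathbf{x})$ as

\[ p(\mathbf{x}) = \frac{1}{\pi^{n}\det^{1/2}\mathbf{R}_{xx}} \exp\left\{-\tfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_x)^H\mathbf{R}_{xx}^{-1}(\mathbf{x} - \boldsymbol{\mu}_x)\right\} \times \frac{1}{\det^{1/2}\mathbf{P}} \exp\left\{-\tfrac{1}{2}(\mathbf{x} - \hat{\mathbf{x}})^H\mathbf{P}^{-1}(\mathbf{x} - \hat{\mathbf{x}})\right\}. \tag{2.52} \]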

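This expresses the improper Gaussian pdf $p(\mathbf{x})$ in terms of two factors: the first factor involves only $\mathbf{x}$, its mean $\boldsymbol{\mu}_x$, and its covariance matrix $\mathbf{R}_{xx}$; and the second factor involves only the prediction error $\mathbf{x} - \hat{\mathbf{x}}$ and its covariance matrix $\mathbf{P}$. These two factors are “almost” proper Gaussian pdfs, albeit with incorrect normalization constants and a factor of 1/2 in the quadratic form. In Section 1.6.1, we found that the real bivariate Gaussian pdf $p(u, v)$ in (1.48) does indeed factor into a Gaussian pdf for the prediction error $u - \hat{u}$ and a Gaussian pdf for $v$. Importantly, in the real case, the error $u - \hat{u}$ and $v$ are independent. The difference in the complex case is that, although $\mathbf{x} - \hat{\mathbf{x}}$ and $\mathbf{x}^*$ are uncorrelated, they cannot be independent because $\mathbf{x}^*$ perfectly determines $\mathbf{x}$ (through complex conjugation). If $\mathbf{x}$ is proper, then $\hat{\mathbf{x}} = \boldsymbol{\mu}_x$, so that $\mathbf{W} = \mathbf{0}$ and $\mathbf{P} = \mathbf{R}_{xx}$, and the two factors in (2.52) are identical. This makes the factor of 1/2 in the quadratic form disappear. By employing the Woodbury identity (cf. (A1.43) in Appendix 1)

\[ \mathbf{P}^{-1} = \mathbf{R}_{xx}^{-1} + \mathbf{W}^T\mathbf{P}^{-*}\mathbf{W}^* \tag{2.53} \]

and

\[ \det\mathbf{P} = \det\mathbf{R}_{xx}\det(\mathbf{I} - \mathbf{W}\mathbf{W}^*), \tag{2.54} \]

we may find the following alternative expressions for $p(\mathbf{x})$:

\[ p(\mathbf{x}) = \frac{\det^{1/2}(\mathbf{I} - \mathbf{W}\mathbf{W}^*)}{\pi^{n}\det\mathbf{P}} \exp\left\{-(\mathbf{x} - \boldsymbol{\mu}_x)^H\mathbf{P}^{-1}(\mathbf{x} - \boldsymbol{\mu}_x) + \operatorname{Re}\left[(\mathbf{x} - \boldsymbol{\mu}_x)^T\mathbf{P}^{-*}\mathbf{W}^*(\mathbf{x} - \boldsymbol{\mu}_x)\right]\right\}, \tag{2.55} \]

\[ p(\mathbf{x}) = \frac{1}{\pi^{n}(\det\mathbf{R}_{xx}\det\mathbf{P})^{1/2}} \exp\left\{-(\mathbf{x} - \boldsymbol{\mu}_x)^H\mathbf{R}_{xx}^{-1}(\mathbf{x} - \boldsymbol{\mu}_x)\right\} \times \exp\left\{-(\mathbf{x} - \boldsymbol{\mu}_x)^H\mathbf{W}^T\mathbf{P}^{-*}\mathbf{W}^*(\mathbf{x} - \boldsymbol{\mu}_x) + \operatorname{Re}\left[(\mathbf{x} - \boldsymbol{\mu}_x)^T\mathbf{P}^{-*}\mathbf{W}^*(\mathbf{x} - \boldsymbol{\mu}_x)\right]\right\}. \tag{2.56} \]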
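The equivalence of the real pdf (2.44), the augmented form (2.48), the factored form (2.52), and the alternative form (2.55) can be verified at a test point. The following is a minimal sketch assuming NumPy; the function names, the random covariance, and the evaluation point are illustrative assumptions rather than anything prescribed by the text.

```python
import numpy as np

def pdf_real(x, mu, Rzz):
    """Real Gaussian pdf (2.44) of z = [Re x; Im x]."""
    n = x.size
    d = np.concatenate([(x - mu).real, (x - mu).imag])
    quad = d @ np.linalg.solve(Rzz, d)
    return np.exp(-0.5 * quad) / ((2 * np.pi) ** n * np.sqrt(np.linalg.det(Rzz)))

def pdf_augmented(x, mu, Rxx, Rc):
    """General complex Gaussian pdf (2.48)."""
    n = x.size
    e = np.concatenate([x - mu, (x - mu).conj()])
    R_aug = np.block([[Rxx, Rc], [Rc.conj(), Rxx.conj()]])
    quad = np.vdot(e, np.linalg.solve(R_aug, e)).real
    return np.exp(-0.5 * quad) / (np.pi ** n * np.sqrt(np.linalg.det(R_aug).real))

def _w_and_p(Rxx, Rc):
    W = Rc @ np.linalg.inv(Rxx.conj())      # W = R~ Rxx^{-*}
    P = Rxx - W @ Rc.conj()                 # Schur complement
    return W, P

def pdf_factored(x, mu, Rxx, Rc):
    """Factored form (2.52)."""
    n = x.size
    W, P = _w_and_p(Rxx, Rc)
    e = x - mu
    err = e - W @ e.conj()                  # prediction error x - xhat
    q1 = np.vdot(e, np.linalg.solve(Rxx, e)).real
    q2 = np.vdot(err, np.linalg.solve(P, err)).real
    norm = np.pi ** n * np.sqrt(np.linalg.det(Rxx).real * np.linalg.det(P).real)
    return np.exp(-0.5 * (q1 + q2)) / norm

def pdf_alternative(x, mu, Rxx, Rc):
    """Alternative form (2.55)."""
    n = x.size
    W, P = _w_and_p(Rxx, Rc)
    e = x - mu
    P_inv = np.linalg.inv(P)
    expo = -np.vdot(e, P_inv @ e).real + (e @ P_inv.conj() @ W.conj() @ e).real
    scale = np.sqrt(np.linalg.det(np.eye(n) - W @ W.conj()).real)
    return scale * np.exp(expo) / (np.pi ** n * np.linalg.det(P).real)

# Hypothetical improper test case built from a random real covariance matrix.
rng = np.random.default_rng(0)
n = 2
I = np.eye(n)
T = np.block([[I, 1j * I], [I, -1j * I]])
A = rng.standard_normal((2 * n, 2 * n))
Rzz = A @ A.T + np.eye(2 * n)
R_aug = T @ Rzz @ T.conj().T
Rxx, Rc = R_aug[:n, :n], R_aug[:n, n:]
mu = rng.standard_normal(n) + 1j * rng.standard_normal(n)
x = mu + 0.5 * (rng.standard_normal(n) + 1j * rng.standard_normal(n))

p_real = pdf_real(x, mu, Rzz)
p_aug = pdf_augmented(x, mu, Rxx, Rc)
p_fac = pdf_factored(x, mu, Rxx, Rc)
p_alt = pdf_alternative(x, mu, Rxx, Rc)
```

All four values agree to rounding error, which confirms that the complex expressions are reparameterizations of the joint pdf of real and imaginary parts.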

Since the complex Gaussian pdf is simply a convenient way of expressing the joint pdf of real and imaginary parts, many results valid for the real case translate straightforwardly to the complex case. In particular, a linear or widely linear transformation of a Gaussian random vector (proper or improper) is again Gaussian (proper or improper). We note, however, that a widely linear transformation of a proper Gaussian will generally produce an improper Gaussian, and a widely linear transformation of an improper Gaussian may produce a proper Gaussian.

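2.3.2 Conditional complex Gaussian distribution

If two real random vectors $\mathbf{z} = [\mathbf{u}^T, \mathbf{v}^T]^T\colon \Omega \to \mathbb{R}^{2n}$ and $\mathbf{w} = [\mathbf{a}^T, \mathbf{b}^T]^T\colon \Omega \to \mathbb{R}^{2m}$ are jointly Gaussian, then the conditional density for $\mathbf{z}$ given $\mathbf{w}$ is Gaussian,

\[ p(\mathbf{z}|\mathbf{w}) = \frac{1}{(2\pi)^{2n/2}\det^{1/2}\mathbf{R}_{zz|w}} \exp\left\{-\tfrac{1}{2}(\mathbf{z} - \boldsymbol{\mu}_{z|w})^T\mathbf{R}_{zz|w}^{-1}(\mathbf{z} - \boldsymbol{\mu}_{z|w})\right\} \tag{2.57} \]

with conditional mean vector

\[ \boldsymbol{\mu}_{z|w} = \boldsymbol{\mu}_z + \mathbf{R}_{zw}\mathbf{R}_{ww}^{-1}(\mathbf{w} - \boldsymbol{\mu}_w) \tag{2.58} \]

and conditional covariance matrix

\[ \mathbf{R}_{zz|w} = \mathbf{R}_{zz} - \mathbf{R}_{zw}\mathbf{R}_{ww}^{-1}\mathbf{R}_{zw}^T. \tag{2.59} \]

This result easily generalizes to the complex case. Let $\mathbf{x} = \mathbf{u} + j\mathbf{v}\colon \Omega \to \mathbb{C}^n$ and $\mathbf{y} = \mathbf{a} + j\mathbf{b}\colon \Omega \to \mathbb{C}^m$, and $\underline{\mathbf{y}} = \mathbf{T}\mathbf{w}$. Then the augmented conditional mean vector is

\[ \underline{\boldsymbol{\mu}}_{x|y} = \begin{bmatrix} \boldsymbol{\mu}_{x|y} \\ \boldsymbol{\mu}_{x|y}^* \end{bmatrix} = \mathbf{T}\boldsymbol{\mu}_{z|w} = \mathbf{T}\boldsymbol{\mu}_z + (\mathbf{T}\mathbf{R}_{zw}\mathbf{T}^H)(\mathbf{T}^{-H}\mathbf{R}_{ww}^{-1}\mathbf{T}^{-1})\mathbf{T}(\mathbf{w} - \boldsymbol{\mu}_w) = \underline{\boldsymbol{\mu}}_x + \underline{\mathbf{R}}_{xy}\underline{\mathbf{R}}_{yy}^{-1}(\underline{\mathbf{y}} - \underline{\boldsymbol{\mu}}_y). \tag{2.60} \]

The augmented conditional covariance matrix is

\[ \underline{\mathbf{R}}_{xx|y} = \mathbf{T}\mathbf{R}_{zz|w}\mathbf{T}^H = \mathbf{T}\mathbf{R}_{zz}\mathbf{T}^H - (\mathbf{T}\mathbf{R}_{zw}\mathbf{T}^H)(\mathbf{T}^{-H}\mathbf{R}_{ww}^{-1}\mathbf{T}^{-1})(\mathbf{T}\mathbf{R}_{zw}^T\mathbf{T}^H) = \underline{\mathbf{R}}_{xx} - \underline{\mathbf{R}}_{xy}\underline{\mathbf{R}}_{yy}^{-1}\underline{\mathbf{R}}_{xy}^H. \tag{2.61} \]

Therefore, the conditional pdf takes the general form

\[ p(\mathbf{x}|\mathbf{y}) = \frac{1}{\pi^{n}\det^{1/2}\underline{\mathbf{R}}_{xx|y}} \exp\left\{-\tfrac{1}{2}(\underline{\mathbf{x}} - \underline{\boldsymbol{\mu}}_{x|y})^H\underline{\mathbf{R}}_{xx|y}^{-1}(\underline{\mathbf{x}} - \underline{\boldsymbol{\mu}}_{x|y})\right\}. \tag{2.62} \]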

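Using the matrix inversion lemma for $\underline{\mathbf{R}}_{xx|y}^{-1}$, it is possible to derive an expression that explicitly shows the dependence of $p(\mathbf{x}|\mathbf{y})$ on $\mathbf{y}$ and $\mathbf{y}^*$. However, we shall postpone this until our discussion of widely linear estimation in Section 5.4.

Definition 2.2. Two complex random vectors $\mathbf{x}$ and $\mathbf{y}$ are called jointly proper if the composite vector $[\mathbf{x}^T, \mathbf{y}^T]^T$ is proper. This means they must be individually proper, $\widetilde{\mathbf{R}}_{xx} = \mathbf{0}$ and $\widetilde{\mathbf{R}}_{yy} = \mathbf{0}$, and also cross-proper, $\widetilde{\mathbf{R}}_{xy} = \mathbf{0}$.

If $\mathbf{x}$ and $\mathbf{y}$ are jointly proper, the conditional Gaussian density for $\mathbf{x}$ given $\mathbf{y}$ is

\[ p(\mathbf{x}|\mathbf{y}) = \frac{1}{\pi^{n}\det\mathbf{R}_{xx|y}} \exp\left\{-(\mathbf{x} - \boldsymbol{\mu}_{x|y})^H\mathbf{R}_{xx|y}^{-1}(\mathbf{x} - \boldsymbol{\mu}_{x|y})\right\} \tag{2.63} \]

with mean

\[ \boldsymbol{\mu}_{x|y} = \boldsymbol{\mu}_x + \mathbf{R}_{xy}\mathbf{R}_{yy}^{-1}(\mathbf{y} - \boldsymbol{\mu}_y) \tag{2.64} \]

and covariance matrix

\[ \mathbf{R}_{xx|y} = \mathbf{R}_{xx} - \mathbf{R}_{xy}\mathbf{R}_{yy}^{-1}\mathbf{R}_{xy}^H. \tag{2.65} \]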

2.3.3 Scalar complex Gaussian distribution

The scalar complex Gaussian distribution is important enough to revisit in detail. Consider a zero-mean scalar Gaussian random variable $x = u + jv$ with variance $R_{xx} = E|x|^2$ and complementary variance $\widetilde{R}_{xx} = Ex^2 = \rho R_{xx}$ with $|\rho| < 1$. The complex correlation coefficient $\rho$ between $x$ and $x^*$ is a measure for the degree of impropriety of $x$. From Result 2.4, the pdf of $x$ is

\[ p(x) = \frac{1}{\pi R_{xx}\sqrt{1 - |\rho|^2}} \exp\left\{-\frac{|x|^2 - \operatorname{Re}(\rho x^{*2})}{R_{xx}(1 - |\rho|^2)}\right\}. \tag{2.66} \]

Let $R_{uu}$ and $R_{vv}$ be the variances of the real part $u$ and imaginary part $v$, and $R_{uv}$ their cross-covariance. The correlation coefficient between $u$ and $v$ is

\[ \rho_{uv} = \frac{R_{uv}}{\sqrt{R_{uu}}\sqrt{R_{vv}}}. \tag{2.67} \]

From (2.21) and (2.22) we know that

\[ R_{uu} + R_{vv} = R_{xx}, \tag{2.68} \]
\[ R_{uu} - R_{vv} + 2j\sqrt{R_{uu}}\sqrt{R_{vv}}\,\rho_{uv} = \rho R_{xx}. \tag{2.69} \]

So the complementary variance $\rho R_{xx}$ carries information about the variance mismatch $R_{uu} - R_{vv}$ in its real part and about the correlation between $u$ and $v$ in its imaginary part. There are now four different cases.

1. If $u$ and $v$ have identical variances, $R_{uu} = R_{vv} = R_{xx}/2$, and are independent, $\rho_{uv} = 0$, then $x$ is proper, i.e., $\rho = 0$. Its pdf is

\[ p(x) = \frac{1}{\pi R_{xx}}\, e^{-|x|^2/R_{xx}}. \tag{2.70} \]

2. If $u$ and $v$ have different variances, $R_{uu} \neq R_{vv}$, but $u$ and $v$ are still independent, $\rho_{uv} = 0$, then $\rho$ is real, $\rho = (R_{uu} - R_{vv})/R_{xx}$, and $x$ is improper.
3. If $u$ and $v$ have identical variances, $R_{uu} = R_{vv} = R_{xx}/2$, but $u$ and $v$ are correlated, $\rho_{uv} \neq 0$, then $\rho$ is purely imaginary, $\rho = j\rho_{uv}$, and $x$ is improper.
4. We can combine these two possible sources of impropriety so that $u$ and $v$ have different variances, $R_{uu} \neq R_{vv}$, and are correlated, $\rho_{uv} \neq 0$. Then $\rho$ is generally complex.

With $x = re^{j\theta}$ and $\rho = |\rho|e^{j\psi}$, we see that the pdf $p(x)$ is constant on the contour (or level curve) $r^2[1 - |\rho|\cos(2\theta - \psi)] = K^2$. This contour is an ellipse, and $r$ is maximum when $\cos(2\theta - \psi)$ is maximum, i.e., when $1 - |\rho|\cos(2\theta - \psi)$ is minimum. This establishes that the ellipse orientation (the angle between the $u$-axis and the major ellipse axis) is $\theta = \psi/2$, which is half the angle of the complex correlation coefficient $\rho = |\rho|e^{j\psi}$. It is also not difficult to show (see Ollila (2008)) that $|\rho|$ is the square of the ellipse eccentricity. This is compelling evidence for the usefulness of the complex description. The real description, in terms of $R_{uu}$, $R_{vv}$, and the correlation coefficient $\rho_{uv}$ between $u$ and $v$, is not nearly as insightful.

Figure 2.1 Probability-density contours of complex Gaussian random variables with different ρ. Panels: (a) ρ = 0, (b) ρ = 0.5 exp(j0), (c) ρ = 0.8 exp(jπ), (d) ρ = 0.5 exp(jπ/2), (e) ρ = 0.8 exp(j3π/2), (f) ρ = 0.95 exp(jπ/4).

Example 2.3. Figure 2.1 shows contours of constant probability density for cases 1–4 listed above. In plot (a), we see the proper case with $\rho = 0$, which exhibits circular contour lines. All remaining plots are improper, with elliptical contour lines. We can make two observations. First, increasing the degree of impropriety of the signal by increasing $|\rho|$ leads to ellipses with greater eccentricity. Secondly, the angle of the ellipse orientation is half the angle of $\rho$, as proved above. In plots (b) and (c), we have case 2: $u$ and $v$ have different variances but are still independent. In this situation, the ellipse orientation is either 0° or 90°, depending on whether $u$ or $v$ has greater variance. Plots (d) and (e) show case 3: $u$ and $v$ have the same variance but are now correlated. In this situation, the ellipse orientation is either 45° or 135°. The general case, case 4, is depicted in plot (f). Now the ellipse can have an arbitrary orientation $\psi/2$, which is controlled by the angle of $\rho = |\rho|e^{j\psi}$.
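The relation between the real second-order description and the ellipse orientation can be checked numerically. The sketch below assumes NumPy; it computes $\rho$ from (2.68) and (2.69) and compares $\psi/2$ with the direction of the principal eigenvector of the real $2\times 2$ covariance matrix. The function names and the test values are illustrative assumptions.

```python
import numpy as np

def correlation_coefficient(Ruu, Rvv, Ruv):
    """Variance Rxx and complex correlation coefficient rho from (2.68), (2.69)."""
    Rxx = Ruu + Rvv
    rho = (Ruu - Rvv + 2j * Ruv) / Rxx      # note sqrt(Ruu) sqrt(Rvv) rho_uv = Ruv
    return Rxx, rho

def major_axis_angle(Ruu, Rvv, Ruv):
    """Angle (mod pi) of the principal eigenvector of the real covariance matrix."""
    Rzz = np.array([[Ruu, Ruv], [Ruv, Rvv]])
    _, vecs = np.linalg.eigh(Rzz)           # eigenvalues in ascending order
    v = vecs[:, -1]
    return np.arctan2(v[1], v[0]) % np.pi

# Case 3: equal variances, correlated parts; rho is purely imaginary.
Rxx3, rho3 = correlation_coefficient(1.0, 1.0, 0.5)
angle3 = major_axis_angle(1.0, 1.0, 0.5)

# Case 4: unequal variances and correlated parts; rho is complex.
Rxx4, rho4 = correlation_coefficient(2.0, 1.0, 0.3)
angle4 = major_axis_angle(2.0, 1.0, 0.3)
```

In both cases the eigenvector direction coincides with half the angle of $\rho$, so the orientation of the contours can be read off the complementary variance directly.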

With $u = r\cos\theta$, $v = r\sin\theta$, $du\,dv = r\,dr\,d\theta$, it is possible to change variables and obtain the pdf for the polar coordinates $(r, \theta)$

\[ p_{r\theta}(r, \theta) = \frac{r}{\pi R_{xx}\sqrt{1 - |\rho|^2}} \exp\left\{-\frac{r^2[1 - |\rho|\cos(2\theta - \psi)]}{R_{xx}(1 - |\rho|^2)}\right\}, \tag{2.71} \]

where $x = re^{j\theta}$ and $\rho = |\rho|e^{j\psi}$. The marginal pdf for $r$ is obtained by integrating over $\theta$,

\[ p_r(r) = \frac{2r}{R_{xx}\sqrt{1 - |\rho|^2}} \exp\left\{-\frac{r^2}{R_{xx}(1 - |\rho|^2)}\right\} I_0\!\left(\frac{r^2|\rho|}{R_{xx}(1 - |\rho|^2)}\right), \quad r > 0, \tag{2.72} \]

where $I_0$ is the modified Bessel function of the first kind of order 0:

\[ I_0(z) = \frac{1}{\pi}\int_0^{\pi} e^{z\cos\theta}\, d\theta. \tag{2.73} \]

This pdf is invariant with respect to $\psi$. It is plotted in Fig. 1.9 in Section 1.6 for several values of $|\rho|$. For $\rho = 0$, it is the Rayleigh pdf

\[ p_r(r) = \frac{r}{R_{xx}/2} \exp\left\{-\frac{r^2}{R_{xx}}\right\}. \tag{2.74} \]

This suggests that we call $p_r(r)$ in (2.72) the improper Rayleigh pdf.⁵ Integrating $p_{r\theta}(r, \theta)$ over $r$ yields the marginal pdf for $\theta$:

\[ p_\theta(\theta) = \frac{\sqrt{1 - |\rho|^2}}{2\pi[1 - |\rho|\cos(2\theta - \psi)]}, \quad -\pi < \theta \le \pi. \tag{2.75} \]

This pdf is shown in Fig. 1.9 for several values of $|\rho|$. For $\rho = 0$, the pdf is uniform. For larger $|\rho|$, the pdf develops two peaks at $\theta = \psi/2$ and $\theta = \psi/2 - \pi$.

If $|\rho| = 1$, $x$ is a singular random variable because the support of the pdf $p(x)$ collapses to a line in the complex plane and the pdf (2.66) must be expressed using a Dirac $\delta$-function. This case is called maximally improper (terms used by other researchers are rectilinear and strict-sense noncircular). If $x$ is maximally improper, we can express it as $x = ae^{j\psi/2} = a\cos(\psi/2) + ja\sin(\psi/2)$, where $a$ is a real Gaussian random variable with zero mean and variance $R_{xx}$. Hence, the radius-squared $r^2$ of $x/\sqrt{R_{xx}}$ is $\chi^2$-distributed with one degree of freedom, and the angle $\theta$ takes on values $\psi/2$ and $\psi/2 - \pi$, each with probability equal to 1/2.
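The marginal pdfs (2.72) and (2.75) can be sanity-checked by numerical integration. The sketch below assumes NumPy (whose `np.i0` evaluates the modified Bessel function $I_0$), $R_{xx} = 1$, and an arbitrary illustrative value of $\rho$; grid sizes and function names are assumptions made for the example.

```python
import numpy as np

def p_theta(theta, rho):
    """Marginal pdf of the angle, (2.75)."""
    a, psi = abs(rho), np.angle(rho)
    return np.sqrt(1 - a**2) / (2 * np.pi * (1 - a * np.cos(2 * theta - psi)))

def p_r(r, rho, Rxx=1.0):
    """Improper Rayleigh pdf of the radius, (2.72)."""
    a = abs(rho)
    s = Rxx * (1 - a**2)
    return 2 * r / (Rxx * np.sqrt(1 - a**2)) * np.exp(-r**2 / s) * np.i0(r**2 * a / s)

def trapezoid(f, t):
    """Plain trapezoidal rule on a grid t."""
    return np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(t))

rho = 0.6 * np.exp(1j * np.pi / 3)

# Angle: uniform grid over one period, so a Riemann sum is accurate.
theta = np.linspace(-np.pi, np.pi, 2000, endpoint=False)
total_theta = p_theta(theta, rho).sum() * (2 * np.pi / theta.size)
peak = theta[np.argmax(p_theta(theta, rho))]

# Radius: the pdf is negligible beyond r = 8 for Rxx = 1.
r = np.linspace(0.0, 8.0, 8001)
total_r = trapezoid(p_r(r, rho), r)
mean_power = trapezoid(r**2 * p_r(r, rho), r)   # should equal Rxx = E|x|^2
```

Both marginals integrate to one, the second moment of the radius recovers $R_{xx}$, and the angular pdf peaks at $\psi/2$ (or equivalently $\psi/2 - \pi$).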

2.3.4 Complex elliptical distribution

A generalization of the Gaussian distribution, which has found some interesting applications in communications, is the family of elliptical distributions. We could proceed as in the Gaussian case by starting with the pdf of an elliptical distribution for a composite real random vector $\mathbf{z}$ and then deriving an expression in terms of complex $\mathbf{x}$. Instead, we will directly modify the improper Gaussian pdf (2.48) by replacing the exponential function with a nonnegative function $g\colon [0, \infty) \to [0, \infty)$, called the pdf generator, that satisfies $\int_0^\infty t^{n-1} g(t)\,dt < \infty$. This necessitates two changes: First, since we do not yet know the second-order moments of $\mathbf{x}$, the matrix used in the expression for the pdf may no longer be the augmented covariance matrix of $\mathbf{x}$. Hence, instead of $\underline{\mathbf{R}}_{xx}$, we use the augmented generating matrix

\[ \underline{\mathbf{H}}_{xx} = \begin{bmatrix} \mathbf{H}_{xx} & \widetilde{\mathbf{H}}_{xx} \\ \widetilde{\mathbf{H}}_{xx}^* & \mathbf{H}_{xx}^* \end{bmatrix} \]

to denote an arbitrary augmented positive definite matrix of size $2n \times 2n$. Secondly, we need to introduce a normalizing constant $c_n$ to ensure that $p(\mathbf{x})$ is a valid pdf that integrates to 1. We now state the general form of the complex elliptical pdf, which is a straightforward generalization of the real elliptical pdf, due to Ollila and Koivunen (2004).

Definition 2.3. The pdf of a complex elliptical random vector $\mathbf{x}\colon \Omega \to \mathbb{C}^n$ is

\[ p(\mathbf{x}) = \frac{c_n}{\det^{1/2}\underline{\mathbf{H}}_{xx}}\, g\!\left((\underline{\mathbf{x}} - \underline{\boldsymbol{\mu}}_x)^H\underline{\mathbf{H}}_{xx}^{-1}(\underline{\mathbf{x}} - \underline{\boldsymbol{\mu}}_x)\right). \tag{2.76} \]

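The normalizing constant $c_n$ is given by

\[ c_n = \frac{2^{n-1}(n - 1)!}{\pi^{n}\int_0^\infty t^{2n-1} g(t^2)\,dt}, \tag{2.77} \]

where the factor $2^{n-1}$ is what makes (2.77) reproduce the Gaussian normalization $c_n = \pi^{-n}$ of (2.48) for the generator $g(t) = \exp(-t/2)$, since $\int_0^\infty t^{2n-1}e^{-t^2/2}\,dt = 2^{n-1}(n-1)!$.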

If the mean exists, the parameter $\underline{\boldsymbol{\mu}}_x$ in the pdf (2.76) is the augmented mean of $\mathbf{x}$. This indicates that the mean is independent of the choice of pdf generator. However, there are distributions for which some or all moments are undefined. For instance, none of the moments exist for the Cauchy distribution, which belongs to the family of elliptical distributions. In this case, $\underline{\boldsymbol{\mu}}_x$ should be treated simply as a parameter of the pdf but not its augmented mean.

Since the complex elliptical pdf (2.76) contains the same quadratic form as the complex Gaussian pdf (2.48), we can obtain straightforward analogs of (2.55) and (2.56), which are expressions in terms of $\mathbf{x}$ and $\mathbf{x}^*$. We can also write down the expression for the pdf of an elliptical random vector with zero complementary generating matrix.

Result 2.6. The pdf of a complex elliptical random vector $\mathbf{x}\colon \Omega \to \mathbb{C}^n$ with $\widetilde{\mathbf{H}}_{xx} = \mathbf{0}$ is

\[ p(\mathbf{x}) = \frac{c_n}{\det\mathbf{H}_{xx}}\, g\!\left(2(\mathbf{x} - \boldsymbol{\mu}_x)^H\mathbf{H}_{xx}^{-1}(\mathbf{x} - \boldsymbol{\mu}_x)\right), \tag{2.78} \]

with normalizing constant $c_n$ given by (2.77).

A good overview of real and complex elliptical distributions is given by Fang et al. (1990). However, Fang et al. consider only complex elliptical distributions with zero complementary generating matrix. Thus, Ollila and Koivunen (2004) refer to the general complex elliptical pdf in Definition 2.3 as a “generalized complex elliptical” pdf. The family of complex elliptical distributions contains some important subclasses of distributions.

Figure 2.2 Probability-density contours of complex Cauchy random variables with different ρ. Panels: (a) ρ = 0, (b) ρ = 0.5 exp(jπ/2), (c) ρ = 0.95 exp(jπ/4).

• The complex multivariate Gaussian distribution, for pdf generator $g(t) = \exp(-t/2)$.
• The complex multivariate $t$-distribution, with pdf given by

\[ p(\mathbf{x}) = \frac{2^{n}\,\Gamma(n + k/2)}{(\pi k)^{n}\,\Gamma(k/2)\det^{1/2}\underline{\mathbf{H}}_{xx}} \left[1 + k^{-1}(\underline{\mathbf{x}} - \underline{\boldsymbol{\mu}}_x)^H\underline{\mathbf{H}}_{xx}^{-1}(\underline{\mathbf{x}} - \underline{\boldsymbol{\mu}}_x)\right]^{-n-k/2}, \tag{2.79} \]

where $k$ is an integer. We note that the Gamma function satisfies $\Gamma(n) = (n - 1)!$ if $n$ is a positive integer.
• The complex multivariate Cauchy distribution, which is a special case of the complex multivariate $t$-distribution with $k = 1$. Its pdf is

\[ p(\mathbf{x}) = \frac{2^{n}\,\Gamma(n + 1/2)}{\pi^{n+1/2}\det^{1/2}\underline{\mathbf{H}}_{xx}} \left[1 + (\underline{\mathbf{x}} - \underline{\boldsymbol{\mu}}_x)^H\underline{\mathbf{H}}_{xx}^{-1}(\underline{\mathbf{x}} - \underline{\boldsymbol{\mu}}_x)\right]^{-n-1/2}. \tag{2.80} \]

None of the moments of the Cauchy distribution exist.

Example 2.4. Similarly to Example 2.3, consider a scalar complex Cauchy pdf with $\mu_x = 0$ (which is the median but not the mean, since the mean does not exist) and augmented generator matrix

\[ \underline{\mathbf{H}}_{xx} = \begin{bmatrix} 1 & \rho \\ \rho^* & 1 \end{bmatrix}, \quad |\rho| < 1. \tag{2.81} \]

Evaluating the quadratic form in (2.80) gives $\underline{\mathbf{x}}^H\underline{\mathbf{H}}_{xx}^{-1}\underline{\mathbf{x}} = 2[|x|^2 - \operatorname{Re}(\rho x^{*2})]/(1 - |\rho|^2)$, so its pdf is

\[ p(x) = \frac{1}{\pi\sqrt{1 - |\rho|^2}} \left[1 + \frac{2\left(|x|^2 - \operatorname{Re}(\rho x^{*2})\right)}{1 - |\rho|^2}\right]^{-3/2}. \tag{2.82} \]

If $\rho = 0$, then the pdf is $p(x) = (1/\pi)(1 + 2|x|^2)^{-3/2}$. Just as for the scalar complex Gaussian pdf, the scalar complex Cauchy pdf is constant on elliptical contours whose major axis is $\psi/2$, half the angle of $\rho = |\rho|e^{j\psi}$. Figure 2.2 shows contours of constant probability density for three Cauchy random variables. Plots (a), (b), and (c) in this figure should be compared with plots (a), (d), and (f) in Fig. 2.1, respectively, for the Gaussian case. The main difference is that Cauchy random variables have much heavier tails than Gaussian random variables. From the expression (2.82), it is straightforward to write down the joint pdf in polar coordinates $(r, \theta)$, and not quite as straightforward to integrate with respect to $r$ or $\theta$ to obtain the marginal pdfs.

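Complex elliptical distributions have a number of desirable properties.

• The mean of $\mathbf{x}$ (if it exists) is independent of the choice of pdf generator $g$.
• The augmented covariance matrix of $\mathbf{x}$ (if it exists) is proportional to the augmented generating matrix $\underline{\mathbf{H}}_{xx}$. The proportionality factor depends on $g$, and is most easily determined using the characteristic function of $\mathbf{x}$. Therefore, if the second-order moments exist, the pdf (2.76) is the pdf of a complex improper elliptical random vector, and the pdf (2.78) for $\widetilde{\mathbf{H}}_{xx} = \mathbf{0}$ is the pdf of a complex proper elliptical random vector with $\widetilde{\mathbf{R}}_{xx} = \mathbf{0}$. This is due to Result 2.8, to be discussed in Section 2.5.
• All marginal distributions are also elliptical. If $\mathbf{y}\colon \Omega \to \mathbb{C}^m$ contains components of $\mathbf{x}\colon \Omega \to \mathbb{C}^n$, $m < n$, then $\mathbf{y}$ is elliptical with the same $g$ as $\mathbf{x}$, and $\underline{\boldsymbol{\mu}}_y$ contains the corresponding components of $\underline{\boldsymbol{\mu}}_x$. The augmented generating matrix of $\mathbf{y}$, $\underline{\mathbf{H}}_{yy}$, is the sub-matrix of $\underline{\mathbf{H}}_{xx}$ that corresponds to the components that $\mathbf{y}$ extracts from $\mathbf{x}$.
• Let $\underline{\mathbf{y}} = \underline{\mathbf{M}}\,\underline{\mathbf{x}} + \underline{\mathbf{b}}$, where $\underline{\mathbf{M}}$ is a given $2m \times 2n$ augmented matrix and $\underline{\mathbf{b}}$ is a given $2m \times 1$ augmented vector. Then $\mathbf{y}$ is elliptical with the same $g$ as $\mathbf{x}$ and $\underline{\boldsymbol{\mu}}_y = \underline{\mathbf{M}}\,\underline{\boldsymbol{\mu}}_x + \underline{\mathbf{b}}$ and $\underline{\mathbf{H}}_{yy} = \underline{\mathbf{M}}\,\underline{\mathbf{H}}_{xx}\underline{\mathbf{M}}^H$.
• If $\mathbf{x}\colon \Omega \to \mathbb{C}^n$ and $\mathbf{y}\colon \Omega \to \mathbb{C}^m$ are jointly elliptically distributed with pdf generator $g$, then the conditional distribution of $\mathbf{x}$ given $\mathbf{y}$ is also elliptical with

\[ \underline{\boldsymbol{\mu}}_{x|y} = \underline{\boldsymbol{\mu}}_x + \underline{\mathbf{H}}_{xy}\underline{\mathbf{H}}_{yy}^{-1}(\underline{\mathbf{y}} - \underline{\boldsymbol{\mu}}_y), \tag{2.83} \]

which is analogous to the Gaussian case (2.60), and augmented conditional generating matrix

\[ \underline{\mathbf{H}}_{xx|y} = \underline{\mathbf{H}}_{xx} - \underline{\mathbf{H}}_{xy}\underline{\mathbf{H}}_{yy}^{-1}\underline{\mathbf{H}}_{xy}^H, \tag{2.84} \]

which is analogous to the Gaussian case (2.61). However, the conditional distribution will in general have a different pdf generator than $\mathbf{x}$ and $\mathbf{y}$.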

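2.4 Sufficient statistics and ML estimators for covariances: complex Wishart distribution

Let’s draw a sequence of $M$ independent and identically distributed (i.i.d.) samples $\{\mathbf{x}_i\}_{i=1}^{M}$ from a complex multivariate Gaussian distribution with mean $\boldsymbol{\mu}_x$ and augmented covariance matrix $\underline{\mathbf{R}}_{xx}$. We assemble these samples in a matrix $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_M]$, and let $\underline{\mathbf{X}} = [\underline{\mathbf{x}}_1, \underline{\mathbf{x}}_2, \ldots, \underline{\mathbf{x}}_M]$ denote the augmented sample matrix. Using the expression for the Gaussian pdf in Result 2.4, the joint pdf of the samples $\{\mathbf{x}_i\}_{i=1}^{M}$ is

\[ p(\mathbf{X}) = \pi^{-Mn}(\det\underline{\mathbf{R}}_{xx})^{-M/2} \exp\left\{-\frac{1}{2}\sum_{m=1}^{M}(\underline{\mathbf{x}}_m - \underline{\boldsymbol{\mu}}_x)^H\underline{\mathbf{R}}_{xx}^{-1}(\underline{\mathbf{x}}_m - \underline{\boldsymbol{\mu}}_x)\right\} \tag{2.85} \]
\[ \phantom{p(\mathbf{X})} = \pi^{-Mn}(\det\underline{\mathbf{R}}_{xx})^{-M/2} \exp\left\{-\frac{M}{2}\operatorname{tr}(\underline{\mathbf{R}}_{xx}^{-1}\underline{\mathbf{S}}_{xx})\right\}. \tag{2.86} \]

In this expression, $\underline{\mathbf{S}}_{xx}$ is the augmented sample covariance matrix

\[ \underline{\mathbf{S}}_{xx} = \begin{bmatrix} \mathbf{S}_{xx} & \widetilde{\mathbf{S}}_{xx} \\ \widetilde{\mathbf{S}}_{xx}^* & \mathbf{S}_{xx}^* \end{bmatrix} = \frac{1}{M}\sum_{m=1}^{M}(\underline{\mathbf{x}}_m - \underline{\mathbf{m}}_x)(\underline{\mathbf{x}}_m - \underline{\mathbf{m}}_x)^H = \frac{1}{M}\underline{\mathbf{X}}\,\underline{\mathbf{X}}^H - \underline{\mathbf{m}}_x\underline{\mathbf{m}}_x^H \tag{2.87} \]

and $\underline{\mathbf{m}}_x$ is the augmented sample mean vector

\[ \underline{\mathbf{m}}_x = \frac{1}{M}\sum_{m=1}^{M}\underline{\mathbf{x}}_m. \tag{2.88} \]

The augmented sample covariance matrix contains the sample Hermitian covariance matrix

\[ \mathbf{S}_{xx} = \frac{1}{M}\sum_{m=1}^{M}(\mathbf{x}_m - \mathbf{m}_x)(\mathbf{x}_m - \mathbf{m}_x)^H = \frac{1}{M}\mathbf{X}\mathbf{X}^H - \mathbf{m}_x\mathbf{m}_x^H \tag{2.89} \]

and the sample complementary covariance matrix

\[ \widetilde{\mathbf{S}}_{xx} = \frac{1}{M}\sum_{m=1}^{M}(\mathbf{x}_m - \mathbf{m}_x)(\mathbf{x}_m - \mathbf{m}_x)^T = \frac{1}{M}\mathbf{X}\mathbf{X}^T - \mathbf{m}_x\mathbf{m}_x^T. \tag{2.90} \]

We now appeal to the Fisher–Neyman factorization theorem to argue that $\underline{\mathbf{m}}_x$ and $\underline{\mathbf{S}}_{xx}$ are a pair of sufficient statistics for $\underline{\boldsymbol{\mu}}_x$ and $\underline{\mathbf{R}}_{xx}$, or, equivalently, $(\mathbf{m}_x, \mathbf{S}_{xx}, \widetilde{\mathbf{S}}_{xx})$ are a set of sufficient statistics for $(\boldsymbol{\mu}_x, \mathbf{R}_{xx}, \widetilde{\mathbf{R}}_{xx})$. It is a straightforward consequence of the corresponding result for the real case that the sample mean $\mathbf{m}_x$ is also a maximum-likelihood estimator of the mean $\boldsymbol{\mu}_x$, and the sample covariances $(\mathbf{S}_{xx}, \widetilde{\mathbf{S}}_{xx})$ are maximum-likelihood estimators of the covariances $(\mathbf{R}_{xx}, \widetilde{\mathbf{R}}_{xx})$. Note that the sample mean is independent of the sample covariance matrices, but the Hermitian and complementary sample covariances are not independent.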
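The sample statistics (2.87) to (2.90) are straightforward to compute from a data matrix whose columns are the samples. The sketch below assumes NumPy and synthetic improper Gaussian data with unequal variances of real and imaginary parts; the function name, sample size, and variances are illustrative assumptions.

```python
import numpy as np

def sample_statistics(X):
    """Sample mean, Hermitian covariance (2.89), complementary covariance (2.90).

    X is n-by-M with one complex sample per column.
    """
    M = X.shape[1]
    m = X.mean(axis=1)
    Xc = X - m[:, None]                    # centered samples
    S = Xc @ Xc.conj().T / M               # Hermitian transpose
    S_compl = Xc @ Xc.T / M                # regular transpose
    return m, S, S_compl

# Synthetic improper data: real part has variance 4, imaginary part 0.25.
rng = np.random.default_rng(1)
n, M = 2, 20000
X = 2.0 * rng.standard_normal((n, M)) + 1j * 0.5 * rng.standard_normal((n, M))
m, S, S_compl = sample_statistics(X)

# Augmented sample covariance (2.87) from the augmented sample matrix.
X_aug = np.vstack([X, X.conj()])
m_aug = X_aug.mean(axis=1)
S_aug = X_aug @ X_aug.conj().T / M - np.outer(m_aug, m_aug.conj())
```

The northwest and northeast blocks of `S_aug` coincide with `S` and `S_compl`, and the diagonal of `S_compl` estimates the variance mismatch $R_{uu} - R_{vv}$, which is far from zero for these data.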

Complex Wishart distribution How are the sample covariance matrices distributed? Let ui and vi be the real and imaginary parts of the sample vector xi = ui + jvi , and U and V be the real and imaginary parts of the sample matrix X = U + jV. Moreover, let Z = [UT , VT ]T . If the samples xi are drawn from a zero-mean Gaussian distribution, then Wzz = ZZT = MSzz is Wishart distributed: [det Wzz ](M−1)/2−n exp − 12 tr(R−1 zz Wzz ) . (2.91) pWzz (Wzz ) = 2 Mn 2n (M/2)[det Rzz ] M/2 In this expression, 2n (M/2) is the multivariate Gamma function 2n (M/2) = π n(2n−1)/2

2n "

[(M − i + 1)/2].

(2.92)

i=1

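By following a path similar to the derivation of the complex Gaussian pdf in Section 2.3.1, it is possible to rewrite the pdf of real $\mathbf{W}_{zz}$ in terms of the complex augmented matrix $\underline{\mathbf{W}}_{xx} = \mathbf{T}\mathbf{W}_{zz}\mathbf{T}^H = \underline{\mathbf{X}}\,\underline{\mathbf{X}}^H = M\underline{\mathbf{S}}_{xx}$ and the augmented covariance matrix $\underline{\mathbf{R}}_{xx}$ as

\[
p_{W_{zz}}(\mathbf{W}_{xx}, \tilde{\mathbf{W}}_{xx}) = \frac{[\det \underline{\mathbf{W}}_{xx}]^{(M-1)/2-n} \exp\!\left[-\tfrac{1}{2}\operatorname{tr}(\underline{\mathbf{R}}_{xx}^{-1}\underline{\mathbf{W}}_{xx})\right]}{2^{n(M-2n-1)}\,\Gamma_{2n}(M/2)\,[\det \underline{\mathbf{R}}_{xx}]^{M/2}} .
\tag{2.93}
\]

If $\mathbf{x}$ is proper, then $\tilde{\mathbf{R}}_{xx} = \mathbf{0}$ and $\underline{\mathbf{R}}_{xx} = \operatorname{Diag}(\mathbf{R}_{xx}, \mathbf{R}_{xx}^*)$. However, this does not imply that the sample complementary covariance matrix vanishes. In general, $\tilde{\mathbf{W}}_{xx} = \mathbf{X}\mathbf{X}^T \neq \mathbf{0}$. Thus, for proper $\mathbf{x}$ the pdf (2.93) simplifies, but it is still the pdf of real $\mathbf{W}_{zz}$, expressed as a function of both $\mathbf{W}_{xx} = \mathbf{X}\mathbf{X}^H$ and $\tilde{\mathbf{W}}_{xx} = \mathbf{X}\mathbf{X}^T$:

\[
p_{W_{zz}}(\mathbf{W}_{xx}, \tilde{\mathbf{W}}_{xx}) = \frac{[\det \mathbf{W}_{xx}\,\det(\mathbf{W}_{xx}^* - \tilde{\mathbf{W}}_{xx}^*\mathbf{W}_{xx}^{-1}\tilde{\mathbf{W}}_{xx})]^{(M-1)/2-n} \exp\!\left[-\operatorname{tr}(\mathbf{R}_{xx}^{-1}\mathbf{W}_{xx})\right]}{2^{n(M-2n-1)}\,\Gamma_{2n}(M/2)\,[\det \mathbf{R}_{xx}]^{M}} .
\tag{2.94}
\]

In order to obtain the marginal pdf of $\mathbf{W}_{xx}$, we would need to integrate $p_{W_{zz}}(\mathbf{W}_{xx}, \tilde{\mathbf{W}}_{xx})$ for given $\mathbf{W}_{xx}$ over all $\tilde{\mathbf{W}}_{xx}$ with $\mathbf{W}_{xx}^* - \tilde{\mathbf{W}}_{xx}^*\mathbf{W}_{xx}^{-1}\tilde{\mathbf{W}}_{xx} > 0$. Alternatively, we can compute the characteristic function of $\mathbf{W}_{zz}$, set all complementary terms equal to zero, and then compute the corresponding pdf of $\mathbf{W}_{xx}$. The last of these three steps is the tricky one. In essence, this is the approach followed by Goodman (1963), who showed that

\[
p(\mathbf{W}_{xx}) = \frac{[\det \mathbf{W}_{xx}]^{M-n} \exp\!\left[-\operatorname{tr}(\mathbf{R}_{xx}^{-1}\mathbf{W}_{xx})\right]}{\Gamma_n^{c}(M)\,[\det \mathbf{R}_{xx}]^{M}} ,
\tag{2.95}
\]

where $\Gamma_n^{c}(M)$ is

\[
\Gamma_n^{c}(M) = \pi^{n(n-1)/2} \prod_{i=1}^{n} (M-i)! .
\tag{2.96}
\]

The marginal pdf $p(\mathbf{W}_{xx})$ for proper $\mathbf{x}$ in (2.95) is what is commonly referred to as the complex Wishart distribution. Its support is $\mathbf{W}_{xx} > 0$, and it is interpreted as the joint pdf of real and imaginary parts of $\mathbf{W}_{xx}$. An alternative, simpler, derivation of the complex Wishart distribution, not based on the characteristic function, was given by Srivastava (1965). The derivation of the marginal pdf $p(\mathbf{W}_{xx})$ for improper $\mathbf{x}$ is an unresolved problem.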
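The complex Wishart density (2.95) is usually evaluated in the log domain to avoid overflow in the determinants and factorials. The following is a minimal sketch under that assumption; the function name `complex_wishart_logpdf` is hypothetical, and the inputs are assumed to be Hermitian positive-definite with $M \geq n$.

```python
import math
import numpy as np

def complex_wishart_logpdf(W, R, M):
    """Logarithm of the complex Wishart pdf (2.95).
    W, R: Hermitian positive-definite n x n arrays; M: number of samples, M >= n."""
    n = W.shape[0]
    _, logdet_W = np.linalg.slogdet(W)
    _, logdet_R = np.linalg.slogdet(R)
    trace_term = np.trace(np.linalg.solve(R, W)).real   # tr(R^{-1} W)
    # log of (2.96): pi^{n(n-1)/2} * prod_{i=1}^{n} (M - i)!, with lgamma(k + 1) = log(k!)
    log_norm = 0.5 * n * (n - 1) * math.log(math.pi) + sum(
        math.lgamma(M - i + 1) for i in range(1, n + 1))
    return (M - n) * logdet_W - trace_term - log_norm - M * logdet_R

# Scalar sanity check (n = 1): the density reduces to w^(M-1) exp(-w/R) / ((M-1)! R^M)
value = complex_wishart_logpdf(np.array([[4.0]]), np.array([[2.0]]), 3)
```

For $n = 1$ the expression collapses to a Gamma density in the scalar $w$, which gives a convenient closed-form check of the normalization.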

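2.5 Characteristic function and higher-order statistical description

The characteristic function is a characterization equivalent to the pdf of a random vector, yet working with the characteristic function can be more convenient. For the composite real random vector $\mathbf{z} = [\mathbf{u}^T, \mathbf{v}^T]^T$, the characteristic function is defined in terms of $\mathbf{s}_u \in \mathbb{R}^n$ and $\mathbf{s}_v \in \mathbb{R}^n$ (which are not random) as

\[
\psi(\mathbf{s}_u, \mathbf{s}_v) = E \exp\!\left[j(\mathbf{s}_u^T\mathbf{u} + \mathbf{s}_v^T\mathbf{v})\right] .
\tag{2.97}
\]

This is the inverse Fourier transform of the pdf $p(\mathbf{u}, \mathbf{v})$. If we let $\mathbf{x} = \mathbf{u} + j\mathbf{v}$ (random) and $\mathbf{s} = \mathbf{s}_u + j\mathbf{s}_v$ (not random), we have

\[
\mathbf{s}_u^T\mathbf{u} + \mathbf{s}_v^T\mathbf{v} = \tfrac{1}{2}\underline{\mathbf{s}}^H\underline{\mathbf{x}} = \operatorname{Re}(\mathbf{s}^H\mathbf{x}) .
\tag{2.98}
\]

Thus, we may define the characteristic function of $\mathbf{x}$ as

\[
\psi(\mathbf{s}) = E \exp\!\left[\tfrac{j}{2}\underline{\mathbf{s}}^H\underline{\mathbf{x}}\right] = E \exp\!\left[j\operatorname{Re}(\mathbf{s}^H\mathbf{x})\right] .
\tag{2.99}
\]

2.5.1 Characteristic functions of Gaussian and elliptical distributions

The characteristic function of a real elliptical random vector $\mathbf{z}$ (cf. Fang et al. (1990)) is

\[
\psi(\mathbf{s}_u, \mathbf{s}_v) = \exp\!\left(j\begin{bmatrix}\mathbf{s}_u^T & \mathbf{s}_v^T\end{bmatrix}\boldsymbol{\mu}_z\right) \phi\!\left(\begin{bmatrix}\mathbf{s}_u^T & \mathbf{s}_v^T\end{bmatrix}\mathbf{R}_{zz}\begin{bmatrix}\mathbf{s}_u \\ \mathbf{s}_v\end{bmatrix}\right) ,
\tag{2.100}
\]

where $\phi$ is a scalar function that determines the distribution. From this we obtain the following result, which was first published by Ollila and Koivunen (2004).

Result 2.7. The characteristic function of a complex elliptical random vector $\mathbf{x}$ is

\[
\psi(\mathbf{s}) = \exp\!\left(\tfrac{j}{2}\underline{\mathbf{s}}^H\underline{\boldsymbol{\mu}}_x\right) \phi\!\left(\tfrac{1}{4}\underline{\mathbf{s}}^H\underline{\mathbf{H}}_{xx}\underline{\mathbf{s}}\right)
= \exp\!\left[j\operatorname{Re}(\mathbf{s}^H\boldsymbol{\mu}_x)\right] \phi\!\left(\tfrac{1}{2}\mathbf{s}^H\mathbf{H}_{xx}\mathbf{s} + \tfrac{1}{2}\operatorname{Re}(\mathbf{s}^H\tilde{\mathbf{H}}_{xx}\mathbf{s}^*)\right) .
\tag{2.101}
\]

If $\mathbf{x}$ is complex Gaussian, then $\mathbf{H}_{xx} = \mathbf{R}_{xx}$, $\tilde{\mathbf{H}}_{xx} = \tilde{\mathbf{R}}_{xx}$, and $\phi(t) = \exp(-t/2)$.

Both expressions in (2.101), in terms of augmented vectors and in terms of unaugmented vectors, are useful. It is easy to see the simplification that occurs when $\tilde{\mathbf{H}}_{xx} = \mathbf{0}$ or $\tilde{\mathbf{R}}_{xx} = \mathbf{0}$, as when $\mathbf{x}$ is a proper Gaussian random vector.

The characteristic function has a number of useful properties. It exists even when the augmented covariance matrix $\underline{\mathbf{R}}_{xx}$ (or the covariance matrix $\mathbf{R}_{xx}$) is singular, a property not shared by the pdf. The characteristic function also allows us to state a simple connection between $\underline{\mathbf{R}}_{xx}$ and $\underline{\mathbf{H}}_{xx}$ for elliptical random vectors (consult Fang et al. (1990) for the expression in the real case).

Result 2.8. Let $\mathbf{x}$ be a complex elliptical random vector with characteristic function (2.101). If the second-order moments of $\mathbf{x}$ exist, then

\[
\underline{\mathbf{R}}_{xx} = c\,\underline{\mathbf{H}}_{xx} \quad\text{with}\quad c = -2\,\frac{d}{dt}\phi(t)\Big|_{t=0} .
\tag{2.102}
\]

It follows that $\mathbf{R}_{xx} = c\mathbf{H}_{xx}$ and $\tilde{\mathbf{R}}_{xx} = c\tilde{\mathbf{H}}_{xx}$. In particular, we see that $\mathbf{x}$ is proper if and only if $\tilde{\mathbf{H}}_{xx} = \mathbf{0}$. For Gaussian $\mathbf{x}$, $\phi(t) = \exp(-t/2)$ and therefore $c = 1$, as expected.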

2.5.2 Higher-order moments

First- and second-order moments suffice for the characterization of many probability distributions. They also suffice for the solution of mean-squared-error estimation problems and Gaussian detection problems. Nevertheless, higher-order moments and cumulants carry important information that can be exploited when the underlying probability distribution is unknown. For a complex random variable $x$, there are $N + 1$ different $N$th-order moments $E(x^{q}(x^{*})^{N-q})$, with $q = 0, 1, 2, \ldots, N$. It is immediately clear that there can be only a maximum of $\lfloor N/2 \rfloor + 1$ distinct moments, where $\lfloor\cdot\rfloor$ denotes the floor function. Since $E(x^{q}(x^{*})^{N-q}) = [E(x^{N-q}(x^{*})^{q})]^{*}$ we can restrict $q$ to $q = 0, 1, 2, \ldots, \lfloor N/2 \rfloor$. Because moments become increasingly difficult to estimate in practice for larger $N$, it is rare to consider moments with $N > 4$.

Summarizing all $N$th-order moments for a complex random vector $\mathbf{x}$ requires the use of tensor algebra, as developed in the complex case by Amblard et al. (1996a). We will avoid this by considering only moments of individual components of $\mathbf{x}$. These are of the form $E(x_{i_1}^{\diamond_1} x_{i_2}^{\diamond_2} \cdots x_{i_N}^{\diamond_N})$, where $1 \leq i_j \leq n$, $j = 1, \ldots, N$, and $\diamond_j$ indicates whether or not $x_{i_j}$ is conjugated. Again, there are only a maximum of $\lfloor N/2 \rfloor + 1$ distinct moments since $E(x_{i_1}^{\diamond_1} x_{i_2}^{\diamond_2} \cdots x_{i_N}^{\diamond_N}) = [E((x_{i_1}^{\diamond_1})^{*} (x_{i_2}^{\diamond_2})^{*} \cdots (x_{i_N}^{\diamond_N})^{*})]^{*}$.

As in the real case, the moments of $\mathbf{x}$ can be calculated from the characteristic function. To this end, we use two generalized complex differential operators for complex $s = s_u + js_v$, which are defined as

\[
\frac{\partial}{\partial s} \triangleq \frac{1}{2}\left(\frac{\partial}{\partial s_u} - j\frac{\partial}{\partial s_v}\right) ,
\tag{2.103}
\]

\[
\frac{\partial}{\partial s^{*}} \triangleq \frac{1}{2}\left(\frac{\partial}{\partial s_u} + j\frac{\partial}{\partial s_v}\right)
\tag{2.104}
\]

and discussed in Appendix 2. From the complex Taylor-series expansion of $\psi(\mathbf{s})$, we can obtain the $N$th-order moment from the characteristic function. The following result is due to Amblard et al. (1996a).

Result 2.9. For a random vector $\mathbf{x}$ with characteristic function $\psi(\mathbf{s})$, the $N$th-order moment can be computed as

\[
E(x_{i_1}^{\diamond_1} x_{i_2}^{\diamond_2} \cdots x_{i_N}^{\diamond_N}) = \frac{2^{N}}{j^{N}} \left.\frac{\partial^{N}\psi(\mathbf{s})}{(\partial s_{i_1}^{\diamond_1})^{*} (\partial s_{i_2}^{\diamond_2})^{*} \cdots (\partial s_{i_N}^{\diamond_N})^{*}}\right|_{\mathbf{s}=\mathbf{0}} .
\tag{2.105}
\]

Example 2.5. Consider a scalar Gaussian random variable with zero mean, variance $R_{xx}$, and complementary variance $\tilde{R}_{xx} = \rho R_{xx}$, whose pdf is

\[
p(x) = \frac{1}{\pi R_{xx}\sqrt{1 - |\rho|^{2}}} \exp\!\left[-\frac{|x|^{2} - \operatorname{Re}(\rho x^{*2})}{R_{xx}(1 - |\rho|^{2})}\right] .
\tag{2.106}
\]

Its characteristic function is

\[
\psi(s) = \exp\!\left\{-\tfrac{1}{4}R_{xx}\left[|s|^{2} + \operatorname{Re}(\rho s^{*2})\right]\right\} .
\tag{2.107}
\]

The pdf $p(x)$ is not defined for $|\rho| = 1$ because the support of the pdf collapses to a line in the complex plane. The characteristic function, on the other hand, is defined for all $|\rho| \leq 1$. Using the differentiation rules from Appendix 2, we compute the first-order partial derivative

\[
\frac{\partial\psi(s)}{\partial s^{*}} = -\tfrac{1}{4}R_{xx}\,\psi(s)\,(s + \rho s^{*})
\tag{2.108}
\]


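and the second-order partial derivatives

\[
\frac{\partial^{2}\psi(s)}{\partial s\,\partial s^{*}} = -\tfrac{1}{4}R_{xx}\left[\frac{\partial\psi(s)}{\partial s}(s + \rho s^{*}) + \psi(s)\right] ,
\tag{2.109}
\]

\[
\frac{\partial^{2}\psi(s)}{(\partial s^{*})^{2}} = -\tfrac{1}{4}R_{xx}\left[\frac{\partial\psi(s)}{\partial s^{*}}(s + \rho s^{*}) + \rho\,\psi(s)\right] .
\tag{2.110}
\]

According to Result 2.9, the variance is obtained as

\[
E(x^{*}x) = \frac{2^{2}}{j^{2}} \left.\frac{\partial^{2}\psi(s)}{\partial s\,\partial s^{*}}\right|_{s=0} = R_{xx}
\tag{2.111}
\]

and the complementary variance as

\[
E(xx) = \frac{2^{2}}{j^{2}} \left.\frac{\partial^{2}\psi(s)}{(\partial s^{*})^{2}}\right|_{s=0} = \rho R_{xx} .
\tag{2.112}
\]

2.5.3 Cumulant-generating function

The cumulant-generating function of $\mathbf{x}$ is defined as

\[
\Psi(\mathbf{s}) = \log\psi(\mathbf{s}) = \log E \exp\!\left[j\operatorname{Re}(\mathbf{s}^H\mathbf{x})\right] .
\tag{2.113}
\]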

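The cumulants are found from the cumulant-generating function just like the moments are from the characteristic function:

\[
\operatorname{Cum}(x_{i_1}^{\diamond_1} x_{i_2}^{\diamond_2} \cdots x_{i_N}^{\diamond_N}) = \frac{2^{N}}{j^{N}} \left.\frac{\partial^{N}\Psi(\mathbf{s})}{(\partial s_{i_1}^{\diamond_1})^{*} (\partial s_{i_2}^{\diamond_2})^{*} \cdots (\partial s_{i_N}^{\diamond_N})^{*}}\right|_{\mathbf{s}=\mathbf{0}} .
\tag{2.114}
\]

The first-order cumulant is the mean,

\[
\operatorname{Cum} x_i = E\,x_i .
\tag{2.115}
\]

For a zero-mean random vector $\mathbf{x}$, the cumulants of second, third, and fourth order are related to moments through

\[
\operatorname{Cum}(x_{i_1}^{\diamond_1} x_{i_2}^{\diamond_2}) = E(x_{i_1}^{\diamond_1} x_{i_2}^{\diamond_2}) ,
\tag{2.116}
\]

\[
\operatorname{Cum}(x_{i_1}^{\diamond_1} x_{i_2}^{\diamond_2} x_{i_3}^{\diamond_3}) = E(x_{i_1}^{\diamond_1} x_{i_2}^{\diamond_2} x_{i_3}^{\diamond_3}) ,
\tag{2.117}
\]

\[
\begin{aligned}
\operatorname{Cum}(x_{i_1}^{\diamond_1} x_{i_2}^{\diamond_2} x_{i_3}^{\diamond_3} x_{i_4}^{\diamond_4}) ={}& E(x_{i_1}^{\diamond_1} x_{i_2}^{\diamond_2} x_{i_3}^{\diamond_3} x_{i_4}^{\diamond_4}) - E(x_{i_1}^{\diamond_1} x_{i_2}^{\diamond_2})E(x_{i_3}^{\diamond_3} x_{i_4}^{\diamond_4}) \\
&- E(x_{i_1}^{\diamond_1} x_{i_3}^{\diamond_3})E(x_{i_2}^{\diamond_2} x_{i_4}^{\diamond_4}) - E(x_{i_1}^{\diamond_1} x_{i_4}^{\diamond_4})E(x_{i_2}^{\diamond_2} x_{i_3}^{\diamond_3}) .
\end{aligned}
\tag{2.118}
\]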

Thus, the second-order cumulant is the covariance or complementary covariance. Cumulants are sometimes preferred over moments because cumulants are additive for independent random variables. For Gaussian random vectors, all cumulants of order higher than two are zero. This leads to the following result for proper Gaussians, which is based on (2.118) and vanishing complementary covariance terms.


Result 2.10. For a proper complex Gaussian random vector $\mathbf{x}$, only fourth-order moments with two conjugated and two nonconjugated terms are nonzero. Without loss of generality, assume that $x_{i_3}$ and $x_{i_4}$ are the conjugated terms. We then have

\[
E(x_{i_1} x_{i_2} x_{i_3}^{*} x_{i_4}^{*}) = E(x_{i_1} x_{i_3}^{*})E(x_{i_2} x_{i_4}^{*}) + E(x_{i_1} x_{i_4}^{*})E(x_{i_2} x_{i_3}^{*}) .
\tag{2.119}
\]

2.5.4 Circularity

It is also possible to define a stronger version of propriety in terms of the probability distribution of a random vector. A vector is called circular if its probability distribution is rotationally invariant.

Definition 2.4. A random vector $\mathbf{x}$ is called circular if $\mathbf{x}$ and $\mathbf{x}' = e^{j\alpha}\mathbf{x}$ have the same probability distribution for any given real $\alpha$.

Therefore, $\mathbf{x}$ must have zero mean in order to be circular. But circularity does not imply any condition on the standard covariance matrix $\mathbf{R}_{xx}$ because

\[
\mathbf{R}_{x'x'} = E(\mathbf{x}'\mathbf{x}'^{H}) = E(e^{j\alpha}\mathbf{x}\mathbf{x}^{H}e^{-j\alpha}) = \mathbf{R}_{xx} .
\tag{2.120}
\]

On the other hand,

\[
\tilde{\mathbf{R}}_{xx} = \tilde{\mathbf{R}}_{x'x'} = E(\mathbf{x}'\mathbf{x}'^{T}) = E(e^{j\alpha}\mathbf{x}\mathbf{x}^{T}e^{j\alpha}) = e^{j2\alpha}\tilde{\mathbf{R}}_{xx}
\tag{2.121}
\]

can be true for arbitrary $\alpha$ only if $\tilde{\mathbf{R}}_{xx} = \mathbf{0}$. Because the Gaussian pdf is completely determined by $\boldsymbol{\mu}_x$, $\mathbf{R}_{xx}$, and $\tilde{\mathbf{R}}_{xx}$, we obtain the following result, due to Grettenberg (1965).

Result 2.11. A complex zero-mean Gaussian random vector $\mathbf{x}$ is proper if and only if it is circular.

This result generalizes nicely to the elliptical pdf.

Result 2.12. A complex elliptical random vector $\mathbf{x}$ is circular if and only if $\boldsymbol{\mu}_x = \mathbf{0}$ and the complementary generating matrix $\tilde{\mathbf{H}}_{xx} = \mathbf{0}$. If the first- and second-order moments exist, then we may equivalently say that $\mathbf{x}$ is circular if and only if it has zero mean and is proper.

The first half of this result is easily obtained by inspecting the elliptical pdf, and the second half is due to Result 2.8. Propriety requires that second-order moments be rotationally invariant, whereas circularity requires that the pdf, and thus all moments (if they exist), be rotationally invariant. Therefore, circularity implies propriety, but not vice versa, and impropriety implies noncircularity, but not vice versa. By extending the reasoning of (2.120) and (2.121) to higher-order moments, we see that the following result holds.

Result 2.13. If $\mathbf{x}$ is circular, an $N$th-order moment $E(x_{i_1}^{\diamond_1} x_{i_2}^{\diamond_2} \cdots x_{i_N}^{\diamond_N})$ can be nonzero only if it has the same number of conjugated and nonconjugated terms. In particular, all odd moments must be zero. This holds for arbitrary $N$.
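The rotation argument in (2.120) and (2.121) can be reproduced on sample moments. The sketch below rotates a set of hypothetical improper samples by $e^{j\alpha}$ and compares the second-order sample moments before and after; the data model, the helper name `second_moments`, and the value of `alpha` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, M = 2, 1000

# Hypothetical improper data: the imaginary part is correlated with the real part
u = rng.normal(size=(n, M))
v = 0.5 * u + 0.2 * rng.normal(size=(n, M))
X = u + 1j * v

alpha = 0.7
X_rot = np.exp(1j * alpha) * X             # x' = exp(j alpha) x

def second_moments(X):
    """Sample versions of E(x x^H) and E(x x^T) for zero-mean data (one sample per column)."""
    M = X.shape[1]
    return X @ X.conj().T / M, X @ X.T / M

R, R_tilde = second_moments(X)
R_rot, R_tilde_rot = second_moments(X_rot)

# R_rot agrees with R, while R_tilde_rot equals exp(2j * alpha) * R_tilde
```

Because the complementary moment picks up the factor $e^{j2\alpha}$, it can be rotation invariant only when it is zero, which is the propriety condition.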


We have already seen a forerunner of this fact in Result 2.10. We may also be interested in a term for a random vector that satisfies this condition up to some order N only. Definition 2.5. A vector x is called N th-order circular or N th-order proper if the only nonzero moments up to order N have the same number of conjugated and nonconjugated terms. In particular, all odd moments up to order N must be zero. Therefore, the terms proper and second-order circular are equivalent. We note that, while this terminology is most common, there is not uniform agreement in the literature. Some researchers use the terms “proper” and “circular” interchangeably. It is also possible to define stronger versions of circularity, as Picinbono (1994) has done.

Do circular random vectors have spherical pdf contours?

It is instructive to first take a closer look at a scalar circular random variable $x = Ae^{j\phi}$. If $x$ and $e^{j\alpha}x$ have the same probability distribution, then the phase $\phi$ of $x$ must be uniformly distributed over $[0, 2\pi)$ and independent of the amplitude $A$, which may have an arbitrary distribution. This means that the pdf of $x$ is complex elliptical with variance $R_{xx}$ and zero complementary generator,

\[
p(x) = \frac{c_1}{R_{xx}}\, g\!\left(\frac{2|x|^{2}}{R_{xx}}\right) ,
\tag{2.122}
\]

as developed in Section 2.3.4. The pdf (2.122) has circular contour lines of constant probability density. Does this result generalize to circular random vectors? Unfortunately, nothing can be said in general about the contours of a circular random vector $\mathbf{x}$, not even that they are elliptical. Elliptical random vectors have elliptical contours but circularity is not sufficient to make them spherical. In order to obtain spherical contours, $\mathbf{x}$ must be spherically distributed, which means that $\mathbf{x}$ is elliptical with $\boldsymbol{\mu}_x = \mathbf{0}$ and augmented generating matrix $\underline{\mathbf{H}}_{xx} = k\mathbf{I}$ for some positive constant $k$. Hence, only spherical circular random vectors indeed have spherical contours.

2.6 Complex random processes

In this section, we extend some of the ideas introduced so far to a continuous-time complex random process $x(t) = u(t) + jv(t)$, which is built from two real-valued random processes $u(t)$ and $v(t)$ defined on $\mathbb{R}$. We restrict our attention to a second-order description of wide-sense stationary processes. Higher-order statistical characterizations and concepts such as circularity of random processes are more difficult to treat than in the vector case, and we postpone a discussion of more advanced topics to Chapters 8 and 9 (see note 6). To simplify notation in this section, we will assume that $x(t)$ has zero mean. The covariance function of $x(t)$ is denoted by

\[
r_{xx}(t, \tau) = E[x(t + \tau)x^{*}(t)] ,
\tag{2.123}
\]


and the complementary covariance function of $x(t)$ is

\[
\tilde{r}_{xx}(t, \tau) = E[x(t + \tau)x(t)] .
\tag{2.124}
\]

We also introduce the augmented signal $\underline{\mathbf{x}}(t) = [x(t), x^{*}(t)]^{T}$, whose covariance matrix

\[
\underline{\mathbf{R}}_{xx}(t, \tau) = E[\underline{\mathbf{x}}(t + \tau)\underline{\mathbf{x}}^{H}(t)] = \begin{bmatrix} r_{xx}(t, \tau) & \tilde{r}_{xx}(t, \tau) \\ \tilde{r}_{xx}^{*}(t, \tau) & r_{xx}^{*}(t, \tau) \end{bmatrix}
\tag{2.125}
\]

is called the augmented covariance function of $x(t)$.

2.6.1 Wide-sense stationary processes

Definition 2.6. A signal $x(t)$ is wide-sense stationary (WSS) if and only if $\underline{\mathbf{R}}_{xx}(t, \tau)$ is independent of $t$. That is, both the covariance function $r_{xx}(t, \tau)$ and the complementary covariance function $\tilde{r}_{xx}(t, \tau)$ are independent of $t$.

This definition, in keeping with our general philosophy outlined in Section 2.2.1, calls $x(t)$ WSS if and only if its real and imaginary parts $u(t)$ and $v(t)$ are jointly WSS. We note that some researchers call a complex signal $x(t)$ WSS if $r_{xx}(t, \tau)$ alone is independent of $t$, and second-order stationary if both $r_{xx}(t, \tau)$ and $\tilde{r}_{xx}(t, \tau)$ are independent of $t$. If $x(t)$ is WSS, we drop the $t$-argument from the covariance functions. The covariance and complementary covariance functions then have the symmetries

\[
r_{xx}(\tau) = r_{xx}^{*}(-\tau) \quad\text{and}\quad \tilde{r}_{xx}(\tau) = \tilde{r}_{xx}(-\tau) .
\tag{2.126}
\]

The Fourier transform of $\underline{\mathbf{R}}_{xx}(\tau)$ is the augmented power spectral density (PSD) matrix

\[
\underline{\mathbf{P}}_{xx}(f) = \begin{bmatrix} P_{xx}(f) & \tilde{P}_{xx}(f) \\ \tilde{P}_{xx}^{*}(-f) & P_{xx}^{*}(-f) \end{bmatrix} .
\tag{2.127}
\]

The augmented PSD matrix contains the PSD $P_{xx}(f)$, which is the Fourier transform of $r_{xx}(\tau)$, and the complementary power spectral density (C-PSD) $\tilde{P}_{xx}(f)$, which is the Fourier transform of $\tilde{r}_{xx}(\tau)$. The augmented PSD matrix is positive semidefinite, which implies the following result.

Result 2.14. There exists a WSS random process $x(t)$ with PSD $P_{xx}(f)$ and C-PSD $\tilde{P}_{xx}(f)$ if and only if

(1) the PSD is real and nonnegative (but not necessarily even),

\[
P_{xx}(f) \geq 0 ;
\tag{2.128}
\]

(2) the C-PSD is even (but generally complex),

\[
\tilde{P}_{xx}(f) = \tilde{P}_{xx}(-f) ;
\tag{2.129}
\]

(3) the PSD provides a bound on the magnitude of the C-PSD,

\[
|\tilde{P}_{xx}(f)|^{2} \leq P_{xx}(f)P_{xx}(-f) .
\tag{2.130}
\]


Condition (3) is due to $\det \underline{\mathbf{P}}_{xx}(f) \geq 0$. The three conditions in this result correspond to the three conditions in Result 2.1 for a complex vector $\mathbf{x}$. Because of (2.128) and (2.129) the augmented PSD matrix simplifies to

\[
\underline{\mathbf{P}}_{xx}(f) = \begin{bmatrix} P_{xx}(f) & \tilde{P}_{xx}(f) \\ \tilde{P}_{xx}^{*}(f) & P_{xx}(-f) \end{bmatrix} .
\tag{2.131}
\]

The time-invariant power of $x(t)$ is

\[
P_x = r_{xx}(0) = \int_{-\infty}^{\infty} P_{xx}(f)\,df ,
\tag{2.132}
\]

regardless of whether or not $x(t)$ is proper.

We now connect the complex description of $x(t) = u(t) + jv(t)$ to the description in terms of its real and imaginary parts $u(t)$ and $v(t)$. Let $r_{uv}(\tau) = E[u(t + \tau)v(t)]$ denote the cross-covariance function between $u(t)$ and $v(t)$, and $P_{uv}(f)$ the Fourier transform of $r_{uv}(\tau)$, which is the cross-PSD between $u(t)$ and $v(t)$. Analogously to (2.20) for $n = 1$, there is the connection

\[
\underline{\mathbf{P}}_{xx}(f) = \mathbf{T}\begin{bmatrix} P_{uu}(f) & P_{uv}(f) \\ P_{uv}^{*}(f) & P_{vv}(f) \end{bmatrix}\mathbf{T}^{H} .
\tag{2.133}
\]

From it, we find that

\[
P_{xx}(f) = P_{uu}(f) + P_{vv}(f) + 2\operatorname{Im} P_{uv}(f) ,
\tag{2.134}
\]

\[
\tilde{P}_{xx}(f) = P_{uu}(f) - P_{vv}(f) + 2j\operatorname{Re} P_{uv}(f) ,
\tag{2.135}
\]

which are the analogs of (2.21) and (2.22). Note that, unlike the PSD of $x(t)$, the PSDs of $u(t)$ and $v(t)$ are even: $P_{uu}(f) = P_{uu}(-f)$ and $P_{vv}(f) = P_{vv}(-f)$. Propriety is now defined as the obvious extension from vectors to processes.

Definition 2.7. A complex WSS random process $x(t)$ is called proper if $\tilde{r}_{xx}(\tau) = 0$ for all $\tau$ or, equivalently, $\tilde{P}_{xx}(f) = 0$ for all $f$.

Equivalent conditions on real and imaginary parts for propriety are

\[
r_{uu}(\tau) = r_{vv}(\tau) \quad\text{and}\quad r_{uv}(\tau) = -r_{uv}(-\tau) \quad\text{for all } \tau
\tag{2.136}
\]

or

\[
P_{uu}(f) = P_{vv}(f) \quad\text{and}\quad \operatorname{Re} P_{uv}(f) = 0 \quad\text{for all } f .
\tag{2.137}
\]

Therefore, if WSS $x(t)$ is proper, its PSD is

\[
P_{xx}(f) = 2[P_{uu}(f) - jP_{uv}(f)] = 2[P_{vv}(f) + jP_{vu}(f)] .
\tag{2.138}
\]

The PSD of a proper $x(t)$ is even if and only if $P_{uv}(f) = 0$ because $P_{uv}(f)$ is purely imaginary and odd. From (2.130), we obtain the following important result for WSS analytic signals, which have $P_{xx}(f) = 0$ for $f < 0$, and WSS anti-analytic signals, which have $P_{xx}(f) = 0$ for $f > 0$.

Result 2.15. A WSS analytic (or anti-analytic) signal without a DC component, i.e., $P_{xx}(0) = 0$, is proper.

Because the equivalent complex baseband signal of a real bandpass signal is a down-modulated analytic signal, and modulation keeps propriety intact, we also have the following result.

Result 2.16. The equivalent complex baseband signal of a WSS real bandpass signal is proper.

2.6.2 Widely linear shift-invariant filtering

Widely linear (linear–conjugate-linear) shift-invariant filtering is described in the time domain as

\[
y(t) = \int_{-\infty}^{\infty} [h_1(t - \tau)x(\tau) + h_2(t - \tau)x^{*}(\tau)]\,d\tau .
\tag{2.139}
\]

There is a slight complication with a corresponding frequency-domain expression: as we will discuss in Section 8.1, the Fourier transform of a WSS random process $x(t)$ does not exist. The way to deal with WSS processes in the frequency domain is to utilize the Cramér spectral representation for $x(t)$ and $y(t)$,

\[
x(t) = \int_{-\infty}^{\infty} d\xi(f)\,e^{j2\pi ft} ,
\tag{2.140}
\]

\[
y(t) = \int_{-\infty}^{\infty} d\upsilon(f)\,e^{j2\pi ft} ,
\tag{2.141}
\]

where $\xi(f)$ and $\upsilon(f)$ are spectral processes with orthogonal increments $d\xi(f)$ and $d\upsilon(f)$, respectively. This will be discussed in detail in Section 8.1. For now, we content ourselves with stating that

\[
d\upsilon(f) = H_1(f)\,d\xi(f) + H_2(f)\,d\xi^{*}(-f) ,
\tag{2.142}
\]

which may be written in augmented notation as

\[
\begin{bmatrix} d\upsilon(f) \\ d\upsilon^{*}(-f) \end{bmatrix} = \underbrace{\begin{bmatrix} H_1(f) & H_2(f) \\ H_2^{*}(-f) & H_1^{*}(-f) \end{bmatrix}}_{\underline{\mathbf{H}}(f)} \begin{bmatrix} d\xi(f) \\ d\xi^{*}(-f) \end{bmatrix} .
\tag{2.143}
\]

The relationship between the PSDs of $x(t)$ and $y(t)$ is

\[
\underline{\mathbf{P}}_{yy}(f) = \underline{\mathbf{H}}(f)\,\underline{\mathbf{P}}_{xx}(f)\,\underline{\mathbf{H}}^{H}(f) .
\tag{2.144}
\]

Strictly linear filters have $h_2(t) = 0$ for all $t$ or, equivalently, $H_2(f) = 0$ for all $f$. It is clear that propriety is preserved under linear filtering, and it is also preserved under modulation.
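A discrete-time analog of (2.139) may help make the linear–conjugate-linear structure concrete. The sketch below is an assumption-laden illustration, not the continuous-time filter of the text: it replaces the integral by a finite convolution sum, and the function name `widely_linear_filter` and the impulse responses are hypothetical.

```python
import numpy as np

def widely_linear_filter(x, h1, h2):
    """Discrete-time analog of (2.139):
    y[k] = sum_l ( h1[k - l] x[l] + h2[k - l] conj(x[l]) ).
    h1 and h2 are assumed to have the same length."""
    return np.convolve(h1, x) + np.convolve(h2, np.conj(x))

x = np.array([1 + 1j, 2 - 1j, -1j])
h1 = np.array([1.0, 0.5j])      # linear branch
h2 = np.array([0.25, 0.0])      # conjugate-linear branch

y = widely_linear_filter(x, h1, h2)

# Strictly linear special case: h2 = 0
y_lin = widely_linear_filter(x, h1, np.zeros_like(h2))

# Scaling the input by j scales the two branches by j and -j, respectively,
# so the filter is linear over the reals but not over the complex numbers.
y_scaled = widely_linear_filter(1j * x, h1, h2)
```

The last line shows why the term "widely linear" is needed: multiplying the input by a complex constant does not simply multiply the output by the same constant unless the conjugate branch is absent.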

Notes

1. The first account of widely linear filtering for complex signals seems to have been that by Brown and Crane (1969), who used the term "conjugate linear filtering." Gardner and his co-authors have made extensive use of widely linear filtering in the context of cyclostationary signals, in particular for communications. See, for instance, Gardner (1993). The term "widely linear" was introduced by Picinbono and Chevalier (1995), who presented the widely linear minimum mean-squared error (WLMMSE) estimator for complex random vectors. Schreier and Scharf (2003a) revisited the WLMMSE problem using the augmented complex matrix algebra.

2. The terms "proper" and "pseudo-covariance" (for complementary covariance) were coined by Neeser and Massey (1993), who looked at applications of proper random vectors in communications and information theory. The term "complementary covariance" is used by Lee and Messerschmitt (1994), "relation matrix" by Picinbono and Bondon (1997), and "conjugate covariance" by Gardner. Both van den Bos (1995) and Picinbono (1996) utilize what we have called the augmented covariance matrix.

3. Conditions 1–3 in Result 2.1 for the nonsingular case were proved by Picinbono (1996). A rather technical proof of conditions 3a and 3b was presented by Wahlberg and Schreier (2008).

4. The complex proper multivariate Gaussian distribution was introduced by Wooding (1956). Goodman (1963) provided a more in-depth study, and also derived the proper complex Wishart distribution. The assumption of propriety when studying Gaussian random vectors was commonplace until van den Bos (1995) introduced the improper multivariate Gaussian distribution, which he called "generalized complex normal." Picinbono (1996) explicitly connected this distribution with the Hermitian and complementary covariance matrices.

5. There are other distributions thrown off by the scalar improper complex Gaussian pdf. For example, the conditional distribution $p_{\theta|r}(\theta|r) = p_{r\theta}(r, \theta)/p_r(r)$, with $p_{r\theta}(r, \theta)$ given by (2.71) and $p_r(r)$ by (2.72), is the von Mises distribution. The distribution of the radius-squared $r^2$ could be called the improper $\chi^2$-distribution with one degree of freedom, and the sum of the radii-squared of $k$ independent improper Gaussian random variables could be called the improper $\chi^2$-distribution with $k$ degrees of freedom. One could also derive improper extensions of the $\beta$- and exponential distributions.

6. Section 2.6 builds on results by Picinbono and Bondon (1997), who discussed second-order properties of complex signals, both stationary and nonstationary, and Amblard et al. (1996b), who developed higher-order properties of complex stationary signals. Rubin-Delanchy and Walden (2007) present an algorithm for the simulation of improper WSS processes having specified covariance and complementary covariance functions.

Part II

Complex random vectors

3

Second-order description of complex random vectors

In this chapter, we discuss in detail the second-order description of a complex random vector $\mathbf{x}$. We have seen in Chapter 2 that the second-order averages of $\mathbf{x}$ are completely described by the augmented covariance matrix $\underline{\mathbf{R}}_{xx}$. We shall now be interested in those second-order properties of $\mathbf{x}$ that are invariant under two types of transformations: widely unitary and nonsingular strictly linear.

The eigenvalues of the augmented covariance matrix $\underline{\mathbf{R}}_{xx}$ constitute a maximal invariant for $\underline{\mathbf{R}}_{xx}$ under widely unitary transformation. Hence, any function of $\underline{\mathbf{R}}_{xx}$ that is invariant under widely unitary transformation must be a function of these eigenvalues only. In Section 3.1, we consider the augmented eigenvalue decomposition (EVD) of $\underline{\mathbf{R}}_{xx}$ for a complex random vector $\mathbf{x}$. Since we are working with an augmented matrix algebra, this EVD looks somewhat different from what one might expect. In fact, because all factors in the EVD must be augmented matrices, widely unitary diagonalization of $\underline{\mathbf{R}}_{xx}$ is generally not possible. As an application for the augmented EVD, we discuss rank reduction and transform coding.

In Section 3.2, we introduce the canonical correlations between $\mathbf{x}$ and $\mathbf{x}^*$, which have been called the circularity coefficients. These constitute a maximal invariant for $\underline{\mathbf{R}}_{xx}$ under nonsingular strictly linear transformation. They are interesting and useful for a number of reasons.

• They determine the loss in entropy that an improper Gaussian random vector incurs compared with its proper version (see Section 3.2.1).
• They enable an easy characterization of the set of complementary covariance matrices for a fixed covariance matrix (Section 3.2.3) and can be used to quantify the degree of impropriety (Section 3.3).
• They define the test statistic in a generalized likelihood-ratio test for impropriety (Section 3.4).
• They play a key role in blind source separation of complex signals using independent component analysis (Section 3.5).

The connection between the circularity coefficients and the eigenvalues of $\underline{\mathbf{R}}_{xx}$ is explored in Section 3.3.1. We find that certain functions of the circularity coefficients (in particular, the degree of impropriety) are upper- and lower-bounded by functions of the eigenvalues (see note 1).


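3.1 Eigenvalue decomposition

In this section, we develop the augmented eigenvalue decomposition (EVD), or spectral representation, of the augmented covariance matrix of a complex zero-mean random vector $\mathbf{x}\colon \Omega \to \mathbb{C}^n$. Proceeding along the lines of Schreier and Scharf (2003a), the idea is to start with the EVD for the composite vector of real and imaginary parts of $\mathbf{x}$. Following the notation introduced in Chapter 2, let $\mathbf{x} = \mathbf{u} + j\mathbf{v}$ and $\mathbf{z} = [\mathbf{u}^T, \mathbf{v}^T]^T$. The covariance matrix of the composite vector $\mathbf{z}$ is

\[
\mathbf{R}_{zz} = E\,\mathbf{z}\mathbf{z}^T = \begin{bmatrix} E\,\mathbf{u}\mathbf{u}^T & E\,\mathbf{u}\mathbf{v}^T \\ E\,\mathbf{v}\mathbf{u}^T & E\,\mathbf{v}\mathbf{v}^T \end{bmatrix} = \begin{bmatrix} \mathbf{R}_{uu} & \mathbf{R}_{uv} \\ \mathbf{R}_{uv}^T & \mathbf{R}_{vv} \end{bmatrix} .
\tag{3.1}
\]

Using the matrix

\[
\mathbf{T} = \begin{bmatrix} \mathbf{I} & j\mathbf{I} \\ \mathbf{I} & -j\mathbf{I} \end{bmatrix}
\tag{3.2}
\]

we may transform $\mathbf{z}$ to $\underline{\mathbf{x}} = [\mathbf{x}^T, \mathbf{x}^H]^T = \mathbf{T}\mathbf{z}$. The augmented covariance matrix of $\mathbf{x}$ is

\[
\underline{\mathbf{R}}_{xx} = E\,\underline{\mathbf{x}}\,\underline{\mathbf{x}}^H = \begin{bmatrix} E\,\mathbf{x}\mathbf{x}^H & E\,\mathbf{x}\mathbf{x}^T \\ E\,\mathbf{x}^*\mathbf{x}^H & E\,\mathbf{x}^*\mathbf{x}^T \end{bmatrix} = \begin{bmatrix} \mathbf{R}_{xx} & \tilde{\mathbf{R}}_{xx} \\ \tilde{\mathbf{R}}_{xx}^* & \mathbf{R}_{xx}^* \end{bmatrix} ,
\tag{3.3}
\]

which can be connected with $\mathbf{R}_{zz}$ as

\[
\underline{\mathbf{R}}_{xx} = E\,\underline{\mathbf{x}}\,\underline{\mathbf{x}}^H = E[(\mathbf{T}\mathbf{z})(\mathbf{T}\mathbf{z})^H] = \mathbf{T}\mathbf{R}_{zz}\mathbf{T}^H .
\tag{3.4}
\]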

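Since $\mathbf{T}$ is unitary up to a factor of 2, the eigenvalues of $\underline{\mathbf{R}}_{xx}$ are the eigenvalues of $\mathbf{R}_{zz}$ multiplied by 2,

\[
[\lambda_1, \lambda_2, \ldots, \lambda_{2n}]^T = \operatorname{ev}(\underline{\mathbf{R}}_{xx}) = 2\operatorname{ev}(\mathbf{R}_{zz}) .
\tag{3.5}
\]

We write the EVD of $\mathbf{R}_{zz}$ as

\[
\mathbf{R}_{zz} = \mathbf{U}\begin{bmatrix} \tfrac{1}{2}\boldsymbol{\Lambda}^{(1)} & \mathbf{0} \\ \mathbf{0} & \tfrac{1}{2}\boldsymbol{\Lambda}^{(2)} \end{bmatrix}\mathbf{U}^T ,
\tag{3.6}
\]

where the eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_{2n} \geq 0$ are contained on the diagonals of the diagonal matrices

\[
\boldsymbol{\Lambda}^{(1)} = \operatorname{Diag}(\lambda_1, \lambda_3, \ldots, \lambda_{2n-1}) ,
\tag{3.7}
\]

\[
\boldsymbol{\Lambda}^{(2)} = \operatorname{Diag}(\lambda_2, \lambda_4, \ldots, \lambda_{2n}) .
\tag{3.8}
\]

We will see shortly why we are distributing the eigenvalues over $\boldsymbol{\Lambda}^{(1)}$ and $\boldsymbol{\Lambda}^{(2)}$ in this fashion. The orthogonal transformation $\mathbf{U}$ in (3.6) is sometimes called the Karhunen–Loève transform. On inserting (3.6) into (3.4), we obtain

\[
\underline{\mathbf{R}}_{xx} = \left(\tfrac{1}{2}\mathbf{T}\mathbf{U}\mathbf{T}^H\right)\left(\mathbf{T}\begin{bmatrix} \tfrac{1}{2}\boldsymbol{\Lambda}^{(1)} & \mathbf{0} \\ \mathbf{0} & \tfrac{1}{2}\boldsymbol{\Lambda}^{(2)} \end{bmatrix}\mathbf{T}^H\right)\left(\tfrac{1}{2}\mathbf{T}\mathbf{U}\mathbf{T}^H\right)^{H} .
\tag{3.9}
\]

From this, we obtain the following expression for the EVD of $\underline{\mathbf{R}}_{xx}$.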


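Result 3.1. The augmented EVD of the augmented covariance matrix $\underline{\mathbf{R}}_{xx}$ is

\[
\underline{\mathbf{R}}_{xx} = \underline{\mathbf{U}}\,\underline{\boldsymbol{\Lambda}}\,\underline{\mathbf{U}}^H
\tag{3.10}
\]

with

\[
\underline{\mathbf{U}} = \tfrac{1}{2}\mathbf{T}\mathbf{U}\mathbf{T}^H ,
\tag{3.11}
\]

\[
\underline{\boldsymbol{\Lambda}} = \frac{1}{2}\begin{bmatrix} \boldsymbol{\Lambda}^{(1)} + \boldsymbol{\Lambda}^{(2)} & \boldsymbol{\Lambda}^{(1)} - \boldsymbol{\Lambda}^{(2)} \\ \boldsymbol{\Lambda}^{(1)} - \boldsymbol{\Lambda}^{(2)} & \boldsymbol{\Lambda}^{(1)} + \boldsymbol{\Lambda}^{(2)} \end{bmatrix} .
\tag{3.12}
\]

A crucial remark regarding this EVD is the following. When factoring augmented matrices, we need to make sure that all factors again have the block pattern required of an augmented matrix. In the EVD, $\underline{\mathbf{U}}$ is a widely unitary transformation, satisfying

\[
\underline{\mathbf{U}} = \begin{bmatrix} \mathbf{U}_1 & \mathbf{U}_2 \\ \mathbf{U}_2^* & \mathbf{U}_1^* \end{bmatrix} ,
\tag{3.13}
\]

and $\underline{\mathbf{U}}\,\underline{\mathbf{U}}^H = \underline{\mathbf{U}}^H\underline{\mathbf{U}} = \mathbf{I}$. The pattern of $\underline{\mathbf{U}}$ is necessary to ensure that, in the augmented internal description

\[
\underline{\boldsymbol{\xi}} = \underline{\mathbf{U}}^H\underline{\mathbf{x}} \;\Leftrightarrow\; \boldsymbol{\xi} = \mathbf{U}_1^H\mathbf{x} + \mathbf{U}_2^T\mathbf{x}^* ,
\tag{3.14}
\]

the bottom half of $\underline{\boldsymbol{\xi}}$ is indeed the conjugate of the top half. This leads to the maybe unexpected result that the augmented eigenvalue matrix $\underline{\boldsymbol{\Lambda}}$ is generally not diagonal, but has diagonal blocks instead. It becomes diagonal if and only if all eigenvalues have even multiplicity. In general, the internal description $\boldsymbol{\xi}$ has uncorrelated components:

\[
E\,\xi_i\xi_j^* = 0 \quad\text{and}\quad E\,\xi_i\xi_j = 0 \quad\text{for } i \neq j .
\tag{3.15}
\]

The $i$th component has variance

\[
E|\xi_i|^2 = \tfrac{1}{2}(\lambda_{2i-1} + \lambda_{2i})
\tag{3.16}
\]

and complementary variance

\[
E\,\xi_i^2 = \tfrac{1}{2}(\lambda_{2i-1} - \lambda_{2i}) .
\tag{3.17}
\]

Thus, if the eigenvalues do not have even multiplicity, $\boldsymbol{\xi}$ is improper. If $\mathbf{x}$ is proper, the augmented covariance matrix is $\underline{\mathbf{R}}_{xx} = \underline{\mathbf{R}}_0 = \operatorname{Diag}(\mathbf{R}_{xx}, \mathbf{R}_{xx}^*)$. If we denote $\operatorname{ev}(\mathbf{R}_{xx}) = [\mu_1, \mu_2, \ldots, \mu_n]^T$, then $\operatorname{ev}(\underline{\mathbf{R}}_0) = [\mu_1, \mu_1, \mu_2, \mu_2, \ldots, \mu_n, \mu_n]^T$. Propriety of $\mathbf{x}$ is a sufficient but not necessary condition for the internal description $\boldsymbol{\xi}$ to be proper.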
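The construction in (3.2)–(3.13) can be traced numerically. The sketch below builds a hypothetical real covariance matrix for $\mathbf{z}$, forms the augmented covariance via $\mathbf{T}$, and assembles the augmented EVD factors. All variable names and the random test matrix are assumptions for illustration; NumPy's `eigh` stands in for whatever EVD routine one prefers.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3

# Hypothetical covariance of z = [u; v]: a real symmetric positive-definite matrix
A = rng.normal(size=(2 * n, 2 * n))
Rzz = A @ A.T

I = np.eye(n)
T = np.block([[I, 1j * I], [I, -1j * I]])       # (3.2), T T^H = 2 I
R_aug = T @ Rzz @ T.conj().T                    # (3.4)

# EVD of Rzz; eigenvalues of R_aug are twice those of Rzz, see (3.5)
w, V = np.linalg.eigh(Rzz)
order = np.argsort(w)[::-1]                     # decreasing order
lam = 2.0 * w[order]                            # lambda_1 >= ... >= lambda_2n
V = V[:, order]

# Column order for (3.6): eigenvectors of lambda_1, lambda_3, ... first, then lambda_2, lambda_4, ...
perm = np.r_[0:2 * n:2, 1:2 * n:2]
U = V[:, perm]
lam1, lam2 = lam[0::2], lam[1::2]               # diagonals of Lambda^(1) and Lambda^(2)

U_aug = 0.5 * T @ U @ T.conj().T                # (3.11)
Lam_aug = 0.5 * np.block([                      # (3.12)
    [np.diag(lam1 + lam2), np.diag(lam1 - lam2)],
    [np.diag(lam1 - lam2), np.diag(lam1 + lam2)],
])

R_rebuilt = U_aug @ Lam_aug @ U_aug.conj().T    # (3.10)
```

The eigenvalue matrix `Lam_aug` has diagonal blocks rather than being diagonal, while `U_aug` keeps the block pattern (3.13) required of an augmented matrix.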

3.1.1 Principal components

Principal-component analysis (PCA) is a classical statistical tool for data analysis, compression, and prediction. PCA determines a unitary transformation into an internal coordinate system, where the $i$th component makes the $i$th largest possible contribution to the overall variance of $\mathbf{x}$. It is easy to show that the EVD achieves this objective, using results from majorization theory (see Appendix 3).

We shall begin with strictly unitary PCA, which computes an internal description as

\[
\boldsymbol{\xi} = \mathbf{U}^H\mathbf{x} ,
\tag{3.18}
\]

restricted to strictly unitary transformations. The internal representation $\boldsymbol{\xi}$ has covariance matrix $\mathbf{R}_{\xi\xi} = \mathbf{U}^H\mathbf{R}_{xx}\mathbf{U}$. The variance of component $\xi_i$ is $d_i = E|\xi_i|^2 = (\mathbf{R}_{\xi\xi})_{ii}$. For simplicity and without loss of generality, we assume that we order the components $\xi_i$ such that $d_1 \geq d_2 \geq \cdots \geq d_n$. PCA aims to maximize the sum over the $r$ largest variances $d_i$, for each $r$:

\[
\max_{\mathbf{U}} \sum_{i=1}^{r} d_i , \quad r = 1, \ldots, n .
\tag{3.19}
\]

An immediate consequence of the majorization relation (cf. Result A3.5)

\[
[d_1, d_2, \ldots, d_n]^T = \operatorname{diag}(\mathbf{R}_{\xi\xi}) \prec \operatorname{ev}(\mathbf{R}_{\xi\xi}) = \operatorname{ev}(\mathbf{R}_{xx})
\tag{3.20}
\]

is that (3.19) is maximized if $\mathbf{U}$ is chosen from the EVD of $\mathbf{R}_{xx} = \mathbf{U}\mathbf{M}\mathbf{U}^H$, where $\mathbf{M} = \operatorname{Diag}(\mu_1, \mu_2, \ldots, \mu_n)$ contains the eigenvalues of $\mathbf{R}_{xx}$. Therefore, for each $r$,

\[
\max_{\mathbf{U}} \sum_{i=1}^{r} d_i = \sum_{i=1}^{r} \mu_i , \quad r = 1, \ldots, n .
\tag{3.21}
\]

The components of $\boldsymbol{\xi}$ are called the principal components and are Hermitian-uncorrelated. The approach (3.18) is generally suboptimal because it ignores the complementary covariances. For improper complex vectors, the principal components $\boldsymbol{\xi}$ must instead be determined by (3.14), using the widely unitary transformation $\underline{\mathbf{U}}$ from the EVD $\underline{\mathbf{R}}_{xx} = \underline{\mathbf{U}}\,\underline{\boldsymbol{\Lambda}}\,\underline{\mathbf{U}}^H$. This leads to an improved result

\[
\max_{\underline{\mathbf{U}}} \sum_{i=1}^{r} d_i = \sum_{i=1}^{r} \tfrac{1}{2}(\lambda_{2i-1} + \lambda_{2i}) \geq \sum_{i=1}^{r} \mu_i , \quad r = 1, \ldots, n .
\tag{3.22}
\]

This maximization requires the arrangement of eigenvalues in (3.7) and (3.8). The inequality in (3.22) follows from the majorization

\[
[\mu_1, \mu_1, \mu_2, \mu_2, \ldots, \mu_n, \mu_n]^T = \operatorname{ev}(\underline{\mathbf{R}}_0) \prec \operatorname{ev}(\underline{\mathbf{R}}_{xx}) = [\lambda_1, \lambda_2, \ldots, \lambda_{2n}]^T ,
\tag{3.23}
\]

which in turn is a consequence of Result A3.7. The last result shows that the eigenvalues of the augmented covariance matrix of an improper vector are more spread out than those of a proper vector with augmented covariance matrix $\underline{\mathbf{R}}_0 = \operatorname{Diag}(\mathbf{R}_{xx}, \mathbf{R}_{xx}^*)$. We will revisit this finding later on.
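The inequality in (3.22) can be observed on a numerical example. The following sketch compares the partial sums of captured variance for strictly unitary and widely unitary PCA; the random covariance matrix and all variable names are hypothetical and serve only as an illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3

# Hypothetical covariance of z = [u; v], giving a generally improper x = u + jv
A = rng.normal(size=(2 * n, 2 * n))
Rzz = A @ A.T

I = np.eye(n)
T = np.block([[I, 1j * I], [I, -1j * I]])
R_aug = T @ Rzz @ T.conj().T                    # augmented covariance matrix
Rxx = R_aug[:n, :n]                             # Hermitian covariance (top-left block)

mu = np.sort(np.linalg.eigvalsh(Rxx))[::-1]     # eigenvalues of Rxx, decreasing
lam = np.sort(np.linalg.eigvalsh(R_aug))[::-1]  # eigenvalues of R_aug, decreasing

strict = np.cumsum(mu)                              # right-hand side of (3.21), r = 1..n
wide = np.cumsum(0.5 * (lam[0::2] + lam[1::2]))     # middle term of (3.22), r = 1..n

gain = wide - strict                            # nonnegative for every r
```

For $r = n$ both approaches capture the full variance, so the last entry of `gain` is zero up to rounding; any advantage of the widely unitary transform shows up for $r < n$.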

3.1.2 Rank reduction and transform coding

The maximization property (3.19) makes PCA an ideal candidate for rank reduction and transform coding, as shown in Fig. 3.1. Example A3.5 in Appendix 3 considers the equivalent structure for real-valued vectors.

[Figure 3.1: block diagram. Input $\mathbf{x}$, coder $\underline{\mathbf{U}}^H$, internal description $\boldsymbol{\xi}$, quantizer, quantized $\hat{\boldsymbol{\xi}}$, decoder $\underline{\mathbf{U}}$, output $\hat{\mathbf{x}}$.]

Figure 3.1 A widely unitary rank-reduction/transform coder.

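The complex random vector $\mathbf x = [x_1, x_2, \ldots, x_n]^{\mathrm T}$, which is assumed to be zero-mean Gaussian, is passed through a widely unitary coder $\underline{\mathbf U}^{\mathrm H}$. The output of the coder is $\boldsymbol\xi = \mathbf U_1^{\mathrm H}\mathbf x + \mathbf U_2^{\mathrm T}\mathbf x^*$, which is subsequently processed by a bank of $n$ scalar complex quantizers. The quantizer output $\hat{\boldsymbol\xi}$ is then decoded as $\hat{\mathbf x} = \mathbf U_1\hat{\boldsymbol\xi} + \mathbf U_2\hat{\boldsymbol\xi}^*$. From (3.20) we know that any Schur-concave function of the variances $\{d_i\}_{i=1}^n$ will be minimized if the coder $\underline{\mathbf U}^{\mathrm H}$ decorrelates the quantizer input $\boldsymbol\xi$. In particular, consider the Schur-concave mean-squared error

$$\mathrm{MSE} = \tfrac12 E\|\underline{\mathbf x} - \hat{\underline{\mathbf x}}\|^2 = \tfrac12 E\|\underline{\mathbf U}^{\mathrm H}(\underline{\boldsymbol\xi} - \hat{\underline{\boldsymbol\xi}})\|^2 = \tfrac12 E\|\underline{\boldsymbol\xi} - \hat{\underline{\boldsymbol\xi}}\|^2 = E\|\boldsymbol\xi - \hat{\boldsymbol\xi}\|^2. \tag{3.24}$$

A good model for the quantization error in the internal coordinate system is

$$E\|\boldsymbol\xi - \hat{\boldsymbol\xi}\|^2 = \sum_{i=1}^{n} d_i f(b_i), \tag{3.25}$$

where $f(b_i)$ is a decreasing function of the number of bits $b_i$ spent on quantizing $\xi_i$. We will spend more bits on components with higher variance, i.e., $b_1 \geq b_2 \geq \cdots \geq b_n$, and, therefore, $f(b_1) \leq f(b_2) \leq \cdots \leq f(b_n)$. For a fixed but arbitrary bit assignment, Result A3.4 in Appendix 3 shows that the linear function (3.25) is indeed a Schur-concave function of $\{d_i\}_{i=1}^n$. The minimum MSE is therefore

$$\mathrm{MMSE} = \tfrac12 \sum_{i=1}^{n} (\lambda_{2i-1} + \lambda_{2i}) f(b_i). \tag{3.26}$$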

There are two common choices for $f(b_i)$ that deserve explicit mention.

• To model rank-$r$ reduction (sometimes called zonal sampling), we set $f(b_i) = 0$ for $i = 1, \ldots, r$ and $f(b_i) = 1$ for $i = r+1, \ldots, n$. Gaussianity need not be assumed for rank reduction.
• For fine quantizers with a large number of bits, we may employ the high-resolution assumption, as explained by Gersho and Gray (1992), and set $f(b_i) = c\,2^{-b_i}$. This assumes that, of the $b_i$ bits spent on quantizing $\xi_i$, $b_i/2$ bits each go toward quantizing real and imaginary parts. The constant $c$ is dependent on the distribution of $\xi_i$.
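As a rough illustration of (3.26) under the two choices of $f(b_i)$ above, the following Python sketch evaluates the minimum MSE for a given eigenvalue spectrum. The function names, the eigenvalues, the bit allocation, and the constant `c = 1` are assumptions made for this example only; they are not taken from the text.

```python
import numpy as np

def mmse(lam, f_vals):
    """Evaluate (3.26): 0.5 * sum_i (lam_{2i-1} + lam_{2i}) * f(b_i)."""
    lam = np.sort(np.asarray(lam, dtype=float))[::-1]  # lam_1 >= ... >= lam_2n
    pair_sums = lam[0::2] + lam[1::2]                   # lam_{2i-1} + lam_{2i}
    return 0.5 * float(np.dot(pair_sums, f_vals))

def f_rank_reduction(n, r):
    """f(b_i) = 0 for the r retained components, 1 for the discarded ones."""
    f = np.ones(n)
    f[:r] = 0.0
    return f

def f_high_resolution(bits, c=1.0):
    """f(b_i) = c * 2**(-b_i); c = 1 is a placeholder, not a derived constant."""
    return c * 2.0 ** (-np.asarray(bits, dtype=float))

lam = [100.0, 50.0, 50.0, 2.0]                    # illustrative eigenvalues, n = 2
mse_rank1 = mmse(lam, f_rank_reduction(2, 1))      # keep one of two components
mse_fine = mmse(lam, f_high_resolution([4, 2]))    # hypothetical allocation of 4 and 2 bits
print(mse_rank1, mse_fine)
```

With rank-1 reduction only the second eigenvalue pair contributes to the error, whereas the high-resolution model weights every pair by $2^{-b_i}$.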

3.2

Circularity coefficients

We now turn to finding a maximal invariant (a complete set of invariants) for $\underline{\mathbf R}_{xx}$ under nonsingular strictly linear transformation. Such a set is given by the canonical correlations between $\mathbf x$ and its conjugate $\mathbf x^*$. Canonical correlations in the general case will be discussed in much more detail in Chapter 4. Assuming $\mathbf R_{xx}$ has full rank, the canonical correlations between $\mathbf x$ and $\mathbf x^*$ are determined by starting with the coherence matrix

$$\mathbf C = \mathbf R_{xx}^{-1/2}\tilde{\mathbf R}_{xx}(\mathbf R_{xx}^*)^{-\mathrm H/2} = \mathbf R_{xx}^{-1/2}\tilde{\mathbf R}_{xx}\mathbf R_{xx}^{-\mathrm T/2}. \tag{3.27}$$

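Since $\mathbf C$ is complex symmetric, $\mathbf C = \mathbf C^{\mathrm T}$, yet not Hermitian symmetric, $\mathbf C \neq \mathbf C^{\mathrm H}$, there exists a special singular value decomposition (SVD), called the Takagi factorization, which is

$$\mathbf C = \mathbf F\mathbf K\mathbf F^{\mathrm T}. \tag{3.28}$$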

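The Takagi factorization is discussed more thoroughly in Section 3.2.2. The complex matrix $\mathbf F$ is unitary, and $\mathbf K = \operatorname{Diag}(k_1, k_2, \ldots, k_n)$ contains the canonical correlations $1 \geq k_1 \geq k_2 \geq \cdots \geq k_n \geq 0$ on its diagonal. The squared canonical correlations $k_i^2$ are the eigenvalues of the squared coherence matrix $\mathbf C\mathbf C^{\mathrm H} = \mathbf R_{xx}^{-1/2}\tilde{\mathbf R}_{xx}\mathbf R_{xx}^{-*}\tilde{\mathbf R}_{xx}^*\mathbf R_{xx}^{-\mathrm H/2}$, or equivalently, of the matrix $\mathbf R_{xx}^{-1}\tilde{\mathbf R}_{xx}\mathbf R_{xx}^{-*}\tilde{\mathbf R}_{xx}^*$, because

$$\mathbf K\mathbf K^{\mathrm H} = \mathbf F^{\mathrm H}\mathbf R_{xx}^{-1/2}\tilde{\mathbf R}_{xx}\mathbf R_{xx}^{-*}\tilde{\mathbf R}_{xx}^*\mathbf R_{xx}^{-\mathrm H/2}\mathbf F. \tag{3.29}$$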

The canonical correlations between $\mathbf x$ and $\mathbf x^*$ are invariant to the choice of a square root for $\mathbf R_{xx}$, and they are a maximal invariant for $\underline{\mathbf R}_{xx}$ under nonsingular strictly linear transformation of $\mathbf x$. Therefore, any function of $\underline{\mathbf R}_{xx}$ that is invariant under nonsingular strictly linear transformation must be a function of these canonical correlations only. The internal description

$$\boldsymbol\xi = \mathbf F^{\mathrm H}\mathbf R_{xx}^{-1/2}\mathbf x = \mathbf A\mathbf x \tag{3.30}$$

is said to be given in canonical coordinates. We will adopt the following terminology by Eriksson and Koivunen (2006).

Definition 3.1. Vectors that are uncorrelated with unit variance, but possibly improper, are called strongly uncorrelated. The transformation $\mathbf A = \mathbf F^{\mathrm H}\mathbf R_{xx}^{-1/2}$, which transforms $\mathbf x$ into canonical coordinates $\boldsymbol\xi$, is called the strong uncorrelating transform (SUT). The canonical correlations $k_i$ are referred to as circularity coefficients, and the set $\{k_i\}_{i=1}^n$ as the circularity spectrum of $\mathbf x$.

The term “circularity coefficient” is not entirely accurate insofar as the circularity coefficients only characterize second-order circularity, or (im-)propriety. Thus, the name impropriety coefficients would have been more suitable. Moreover, the insight that the circularity coefficients are canonical correlations is critical because it enables us to utilize a wealth of results on canonical correlations in the literature.²

The canonical coordinates are strongly uncorrelated, i.e., they are uncorrelated,

$$E\,\xi_i\xi_j^* = E\,\xi_i\xi_j = 0 \quad \text{for } i \neq j, \tag{3.31}$$

and have unit variance, $E|\xi_i|^2 = 1$. However, they are generally improper as

$$E\,\xi_i^2 = k_i. \tag{3.32}$$

The circularity coefficients $k_i$ measure the correlations between the white, unit-norm canonical coordinates $\boldsymbol\xi$ and their conjugates $\boldsymbol\xi^*$. More precisely, the circularity coefficients $k_i$ are the cosines of the canonical angles between the linear subspaces spanned by $\boldsymbol\xi$ and the complex conjugate $\boldsymbol\xi^*$. If these angles are small, then $\mathbf x^*$ may be linearly estimated from $\mathbf x$, indicating that $\mathbf x$ is improper (obviously, $\mathbf x^*$ can always be perfectly estimated from $\mathbf x$ if widely linear operations are allowed). If these angles are large, then $\mathbf x^*$ may not be linearly estimated from $\mathbf x$, indicating a proper $\mathbf x$. These angles are invariant with respect to nonsingular linear transformation.

3.2.1

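Entropy

Combining our results so far, we may factor $\underline{\mathbf R}_{xx}$ as

$$\underline{\mathbf R}_{xx} = \begin{bmatrix}\mathbf R_{xx} & \tilde{\mathbf R}_{xx}\\ \tilde{\mathbf R}_{xx}^* & \mathbf R_{xx}^*\end{bmatrix} = \begin{bmatrix}\mathbf R_{xx}^{1/2} & \mathbf 0\\ \mathbf 0 & \mathbf R_{xx}^{*/2}\end{bmatrix}\begin{bmatrix}\mathbf F & \mathbf 0\\ \mathbf 0 & \mathbf F^*\end{bmatrix}\begin{bmatrix}\mathbf I & \mathbf K\\ \mathbf K & \mathbf I\end{bmatrix}\begin{bmatrix}\mathbf F^{\mathrm H} & \mathbf 0\\ \mathbf 0 & \mathbf F^{\mathrm T}\end{bmatrix}\begin{bmatrix}\mathbf R_{xx}^{\mathrm H/2} & \mathbf 0\\ \mathbf 0 & \mathbf R_{xx}^{\mathrm T/2}\end{bmatrix}. \tag{3.33}$$

Note that each factor is an augmented matrix. This factorization establishes

$$\det\underline{\mathbf R}_{xx} = \det{}^2\mathbf R_{xx}\,\det(\mathbf I - \mathbf K\mathbf K^{\mathrm H}) = \det{}^2\mathbf R_{xx}\prod_{i=1}^{n}(1 - k_i^2). \tag{3.34}$$

This allows us to derive the following connection between the entropy of an improper Gaussian random vector with augmented covariance matrix $\underline{\mathbf R}_{xx}$ and the corresponding proper Gaussian random vector with covariance matrix $\mathbf R_{xx}$.

Result 3.2. The entropy of a complex improper Gaussian random vector $\mathbf x$ is

$$H_{\text{improper}} = \tfrac12\log[(\pi e)^{2n}\det\underline{\mathbf R}_{xx}] = \underbrace{\log[(\pi e)^{n}\det\mathbf R_{xx}]}_{H_{\text{proper}}} + \underbrace{\tfrac12\log\prod_{i=1}^{n}(1 - k_i^2)}_{-I(\mathbf x;\,\mathbf x^*)}, \tag{3.35}$$

where $H_{\text{proper}}$ is the entropy of a proper Gaussian random vector with the same Hermitian covariance matrix $\mathbf R_{xx}$ (but $\tilde{\mathbf R}_{xx} = \mathbf 0$), and $I(\mathbf x; \mathbf x^*)$ is the (nonnegative) mutual information between $\mathbf x$ and $\mathbf x^*$. Therefore, the entropy is maximized if and only if $\mathbf x$ is proper. If $\mathbf x$ is improper, the loss in entropy compared with the proper case is the mutual information between $\mathbf x$ and $\mathbf x^*$, which is a function of the circularity spectrum.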
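The determinant identity (3.34) and the entropy decomposition (3.35) can be checked numerically. The sketch below is only an illustration: it builds a valid covariance pair from a hypothetical widely linear mixture $\mathbf x = \mathbf A\mathbf w + \mathbf B\mathbf w^*$ of proper white noise (an assumption of this example), computes the circularity coefficients as singular values of the coherence matrix (3.27), and compares both sides.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3

# Assumed test model: x = A w + B w^*, with w proper and white (E ww^H = I, E ww^T = 0).
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
B = 0.5 * (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)))
Rxx = A @ A.conj().T + B @ B.conj().T       # Hermitian covariance E xx^H
Rt = A @ B.T + B @ A.T                      # complementary covariance E xx^T
Ra = np.block([[Rxx, Rt], [Rt.conj(), Rxx.conj()]])  # augmented covariance

def circularity_coefficients(Rxx, Rt):
    """Singular values of the coherence matrix C = Rxx^{-1/2} Rt Rxx^{-T/2} (3.27)."""
    w, V = np.linalg.eigh(Rxx)
    R_inv_sqrt = (V * w ** -0.5) @ V.conj().T   # Hermitian inverse square root
    C = R_inv_sqrt @ Rt @ R_inv_sqrt.T
    return np.linalg.svd(C, compute_uv=False)

k = circularity_coefficients(Rxx, Rt)

# (3.34) in logarithmic form: log det Ra = 2 log det Rxx + sum log(1 - k_i^2)
logdet_aug = np.linalg.slogdet(Ra)[1]
logdet_R = np.linalg.slogdet(Rxx)[1]
rhs = 2 * logdet_R + np.sum(np.log1p(-k ** 2))

# (3.35): H_improper = H_proper - I(x; x^*)
H_improper = 0.5 * (2 * n * np.log(np.pi * np.e) + logdet_aug)
H_proper = n * np.log(np.pi * np.e) + logdet_R
mutual_info = -0.5 * np.sum(np.log1p(-k ** 2))
print(k, H_improper, H_proper - mutual_info)
```

Working with log-determinants avoids overflow for larger $n$; the Hermitian square root is one convenient choice, and by the invariance noted above any other square root should give the same coefficients.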

3.2.2

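Strong uncorrelating transform (SUT)

The Takagi factorization is a special SVD for a complex symmetric matrix $\mathbf C = \mathbf C^{\mathrm T}$. But why do we need the Takagi factorization $\mathbf C = \mathbf F\mathbf K\mathbf F^{\mathrm T}$ rather than simply a regular SVD $\mathbf C = \mathbf U\mathbf K\mathbf V^{\mathrm H}$? The reason is that the canonical coordinates $\boldsymbol\xi$ and $\boldsymbol\xi^*$ are supposed to be complex conjugates of each other. If we used a regular SVD $\mathbf C = \mathbf U\mathbf K\mathbf V^{\mathrm H}$, then $\mathbf V^{\mathrm H}\mathbf R_{xx}^{-*/2}\mathbf x^*$ would not generally be the complex conjugate of $\boldsymbol\xi = \mathbf U^{\mathrm H}\mathbf R_{xx}^{-1/2}\mathbf x$. For this we need $\mathbf V = \mathbf U^*$. So how exactly is the Takagi factorization determined from the SVD? Because the matrix $\mathbf C$ is symmetric, it has two SVDs:

$$\mathbf C = \mathbf U\mathbf K\mathbf V^{\mathrm H} = \mathbf V^*\mathbf K\mathbf U^{\mathrm T} = \mathbf C^{\mathrm T}. \tag{3.36}$$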

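Now, since

$$\mathbf C\mathbf C^{\mathrm H} = \mathbf U\mathbf K^2\mathbf U^{\mathrm H} = \mathbf V^*\mathbf K^2\mathbf V^{\mathrm T} \;\Leftrightarrow\; \mathbf K^2\mathbf U^{\mathrm H}\mathbf V^* = \mathbf U^{\mathrm H}\mathbf V^*\mathbf K^2 \;\Leftrightarrow\; \mathbf K\mathbf U^{\mathrm H}\mathbf V^* = \mathbf U^{\mathrm H}\mathbf V^*\mathbf K, \tag{3.37}$$

the unitary matrix $\mathbf U^{\mathrm H}\mathbf V^*$ commutes with $\mathbf K$. This is possible only if the $(i, j)$th element of $\mathbf U^{\mathrm H}\mathbf V^*$ is zero whenever $k_i \neq k_j$. Thus, every $\mathbf C$ can be expressed as

$$\mathbf C = \mathbf U\mathbf K\mathbf D\mathbf U^{\mathrm T} \tag{3.38}$$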

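with $\mathbf D = \mathbf V^{\mathrm H}\mathbf U^*$. Assume that among the circularity coefficients $\{k_i\}_{i=1}^n$ there are $N$ distinct coefficients, denoted by $\sigma_1, \ldots, \sigma_N$. Their respective multiplicities are $m_1, \ldots, m_N$. Then we may write

$$\mathbf D = \operatorname{Diag}(\mathbf D_1, \mathbf D_2, \ldots, \mathbf D_N), \tag{3.39}$$

where $\mathbf D_i$ is an $m_i \times m_i$ unitary and symmetric matrix, and

$$\mathbf K\mathbf D = \operatorname{Diag}(\sigma_1\mathbf D_1, \sigma_2\mathbf D_2, \ldots, \sigma_N\mathbf D_N). \tag{3.40}$$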

In the special case in which all circularity coefficients are distinct, we have $\sigma_i = k_i$ and $\mathbf D_i$ is a scalar with unit magnitude. Therefore,

$$\mathbf D = \operatorname{Diag}(\mathrm e^{\mathrm j\theta_1}, \mathrm e^{\mathrm j\theta_2}, \ldots, \mathrm e^{\mathrm j\theta_n}), \tag{3.41}$$

$$\mathbf K\mathbf D = \operatorname{Diag}(k_1\mathrm e^{\mathrm j\theta_1}, k_2\mathrm e^{\mathrm j\theta_2}, \ldots, k_n\mathrm e^{\mathrm j\theta_n}). \tag{3.42}$$

Then we have

$$\mathbf C = \mathbf U\mathbf D^{1/2}\mathbf K\mathbf D^{\mathrm T/2}\mathbf U^{\mathrm T} \tag{3.43}$$

with

$$\mathbf D^{1/2} = \operatorname{Diag}(\pm\mathrm e^{\mathrm j\theta_1/2}, \pm\mathrm e^{\mathrm j\theta_2/2}, \ldots, \pm\mathrm e^{\mathrm j\theta_n/2}) \tag{3.44}$$

with arbitrary signs, so that $\mathbf F = \mathbf U\mathbf D^{1/2}$ achieves the Takagi factorization $\mathbf C = \mathbf F\mathbf K\mathbf F^{\mathrm T}$. This shows how to obtain the Takagi factorization from the SVD. We note that, if only the circularity coefficients $k_i$ need to be computed and the canonical coordinates $\boldsymbol\xi$ are not required, a regular SVD will suffice.

The SVD $\mathbf C = \mathbf U\mathbf K\mathbf V^{\mathrm H}$ of any matrix $\mathbf C$ with distinct singular values is unique up to multiplication of $\mathbf U$ from the right by a unitary diagonal matrix. If $\mathbf C = \mathbf C^{\mathrm T}$, the Takagi factorization determines this unitary diagonal factor in such a way that the matrix of left singular vectors is the conjugate of the matrix of right singular vectors. Hence, if all circularity coefficients are distinct, the Takagi factorization is unique up to the sign of the diagonal elements in $\mathbf D^{1/2}$. The following result then follows.

Result 3.3. For distinct circularity coefficients, the strong uncorrelating transform $\mathbf A = \mathbf F^{\mathrm H}\mathbf R_{xx}^{-1/2}$ is unique up to the sign of its rows, regardless of the choice of the inverse square root of $\mathbf R_{xx}$.
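The construction in (3.38) to (3.44) translates almost directly into code. The following is a minimal sketch, assuming all singular values are distinct and nonzero (the generic case for a random matrix); the function name `takagi_from_svd` and the random test matrix are illustrative choices, not part of the text.

```python
import numpy as np

def takagi_from_svd(C):
    """Takagi factorization C = F K F^T of a complex symmetric C via a regular SVD.

    Assumes distinct, nonzero singular values, so that D = V^H U^* is diagonal
    with unit-modulus entries, cf. (3.41).
    """
    U, k, Vh = np.linalg.svd(C)         # C = U diag(k) Vh, with Vh = V^H
    D = Vh @ U.conj()                   # D = V^H U^*
    theta = np.angle(np.diag(D))        # phases theta_i of the diagonal of D
    F = U * np.exp(0.5j * theta)        # F = U D^{1/2}: scales the columns of U
    return F, k

rng = np.random.default_rng(1)
G = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
C = G + G.T                             # complex symmetric, not Hermitian
F, k = takagi_from_svd(C)
err = np.linalg.norm(F @ np.diag(k) @ F.T - C)
print(k, err)
```

Choosing the other sign for any entry of $\mathbf D^{1/2}$ flips the sign of the corresponding column of $\mathbf F$ and leaves $\mathbf F\mathbf K\mathbf F^{\mathrm T}$ unchanged, which is the sign ambiguity described above. Repeated singular values would need the block treatment of (3.39), which this sketch does not attempt.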

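[Figure 3.2: sketch of the convex set $\mathcal Q$, showing the proper point $\tilde{\mathbf R}_{xx} = \mathbf 0$, the interior region $\mathbf R_{xx} - \tilde{\mathbf R}_{xx}\mathbf R_{xx}^{-*}\tilde{\mathbf R}_{xx}^* > \mathbf 0$, and the labeled points $\tilde{\mathbf R}_1$, $\tilde{\mathbf R}_2$, $\tilde{\mathbf R}_3$, and $\tilde{\mathbf R}_4$.]

Figure 3.2 Geometry of the set $\mathcal Q$.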

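To see this, consider any inverse square root $\mathbf G\mathbf R_{xx}^{-1/2}$ for arbitrary choice of a unitary matrix $\mathbf G$. The Takagi factorization of the corresponding coherence matrix is $\mathbf C = \mathbf G\mathbf R_{xx}^{-1/2}\tilde{\mathbf R}_{xx}\mathbf R_{xx}^{-\mathrm T/2}\mathbf G^{\mathrm T} = \mathbf G\mathbf F\mathbf K\mathbf F^{\mathrm T}\mathbf G^{\mathrm T}$. Hence, the strong uncorrelating transform $\mathbf A = \mathbf F^{\mathrm H}\mathbf G^{\mathrm H}\mathbf G\mathbf R_{xx}^{-1/2} = \mathbf F^{\mathrm H}\mathbf R_{xx}^{-1/2}$ is independent of $\mathbf G$.

On the other hand, if there are repeated circularity coefficients, the strong uncorrelating transform is no longer unique. If $\mathbf D_i$ in (3.39) is $m_i \times m_i$ with $m_i \geq 2$, there is an infinite number of ways of decomposing $\mathbf D_i = \mathbf D_i^{1/2}\mathbf D_i^{\mathrm T/2}$. For instance, any given decomposition of $\mathbf D_i$ can be modified by a real orthogonal $m_i \times m_i$ matrix $\mathbf G$ as $\mathbf D_i = (\mathbf D_i^{1/2}\mathbf G)(\mathbf G^{\mathrm T}\mathbf D_i^{\mathrm T/2})$.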

3.2.3

Characterization of complementary covariance matrices

We may now give two different characterizations of the set $\mathcal Q$ of complementary covariance matrices for a fixed covariance matrix $\mathbf R_{xx}$. When we say “characterization,” we mean necessary and sufficient conditions for a matrix $\tilde{\mathbf R}_{xx}$ to be a valid complementary covariance matrix for given covariance matrix $\mathbf R_{xx}$. The first characterization follows directly from the positive semidefinite property of the augmented covariance matrix $\underline{\mathbf R}_{xx}$, which implies a positive semidefinite Schur complement:

$$\mathbf Q(\tilde{\mathbf R}_{xx}) = \mathbf R_{xx} - \tilde{\mathbf R}_{xx}\mathbf R_{xx}^{-*}\tilde{\mathbf R}_{xx}^* \geq \mathbf 0. \tag{3.45}$$

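This shows that $\mathcal Q$ is convex and compact. Alternatively, $\mathcal Q$ can be characterized by

$$\tilde{\mathbf R}_{xx} = \mathbf R_{xx}^{1/2}\mathbf F\mathbf K\mathbf F^{\mathrm T}\mathbf R_{xx}^{\mathrm T/2}, \tag{3.46}$$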

which follows from (3.27) and (3.28). Here, $\mathbf F$ is a unitary matrix, and $\mathbf K$ is a diagonal matrix of circularity coefficients $0 \leq k_i \leq 1$. We will use the characterizations (3.45) and (3.46) to illuminate the structure of the convex and compact set $\mathcal Q$. This was first presented by Schreier et al. (2005).

The geometry of $\mathcal Q$ is depicted in Fig. 3.2. The interior of $\mathcal Q$ contains complementary covariance matrices $\tilde{\mathbf R}_{xx}$ for which the Schur complement is positive definite: $\mathbf Q(\tilde{\mathbf R}_{xx}) > \mathbf 0$. Boundary points are rank-deficient points. However, as we will show now, only boundary points where the Schur complement is identically zero, $\mathbf Q(\tilde{\mathbf R}_{xx}) = \mathbf 0$, are also extreme points.

Definition 3.2. An extreme point $\tilde{\mathbf R}_{\#}$ of the closed convex set $\mathcal Q$ may be written as a convex combination of points in $\mathcal Q$ in only a trivial way. That is, if $\tilde{\mathbf R}_{\#} = \alpha\tilde{\mathbf R}_1 + (1 - \alpha)\tilde{\mathbf R}_2$ with $0 < \alpha < 1$, $\tilde{\mathbf R}_1, \tilde{\mathbf R}_2 \in \mathcal Q$, then $\tilde{\mathbf R}_{\#} = \tilde{\mathbf R}_1 = \tilde{\mathbf R}_2$.

In Fig. 3.2, only $\tilde{\mathbf R}_4$ is an extreme point. The boundary point $\tilde{\mathbf R}_3$, for instance, can be obtained as a convex combination of the points $\tilde{\mathbf R}_1$ and $\tilde{\mathbf R}_2$.

Result 3.4. A complementary covariance matrix $\tilde{\mathbf R}_{\#}$ is an extreme point of $\mathcal Q$ if and only if the Schur complement of $\underline{\mathbf R}_{xx}$ vanishes: $\mathbf Q(\tilde{\mathbf R}_{\#}) = \mathbf 0$. In other words, $\tilde{\mathbf R}_{\#}$ is an extreme point if and only if all circularity coefficients have unit value: $k_i = 1$, $i = 1, \ldots, n$.

We prove this result by first showing that a point $\tilde{\mathbf R}_{xx} \in \mathcal Q$ with nonzero Schur complement cannot be an extreme point. There exists at least one circularity coefficient, say $k_j$, with $k_j < 1$. Now choose $\varepsilon > 0$ such that $k_j + \varepsilon \leq 1$. Define $\mathbf K_1$ such that it agrees with $\mathbf K$ except that the $j$th entry, $k_j$, is replaced with $k_j + \varepsilon$. Similarly, define $\mathbf K_2$ such that it agrees with $\mathbf K$ except that the $j$th entry, $k_j$, is replaced with $k_j - \varepsilon$. Then, using (3.46), write

$$\tilde{\mathbf R}_{xx} = \tfrac12\mathbf R_{xx}^{1/2}\mathbf F(\mathbf K_1 + \mathbf K_2)\mathbf F^{\mathrm T}\mathbf R_{xx}^{\mathrm T/2} \tag{3.47}$$

$$\phantom{\tilde{\mathbf R}_{xx}} = \tfrac12\tilde{\mathbf R}_1 + \tfrac12\tilde{\mathbf R}_2. \tag{3.48}$$

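This expresses $\tilde{\mathbf R}_{xx}$ as a nontrivial convex combination of $\tilde{\mathbf R}_1 = \mathbf R_{xx}^{1/2}\mathbf F\mathbf K_1\mathbf F^{\mathrm T}\mathbf R_{xx}^{\mathrm T/2}$ and $\tilde{\mathbf R}_2 = \mathbf R_{xx}^{1/2}\mathbf F\mathbf K_2\mathbf F^{\mathrm T}\mathbf R_{xx}^{\mathrm T/2}$, with $\alpha = 1/2$. Hence, $\tilde{\mathbf R}_{xx}$ is not an extreme point.

We now assume that the Schur complement $\mathbf Q(\tilde{\mathbf R}_{xx}) = \mathbf R_{xx} - \tilde{\mathbf R}_{xx}\mathbf R_{xx}^{-*}\tilde{\mathbf R}_{xx}^* = \mathbf 0$ for some $\tilde{\mathbf R}_{xx}$. We first note that

$$\alpha(1 - \alpha)(\tilde{\mathbf R}_1 - \tilde{\mathbf R}_2)\mathbf R_{xx}^{-*}(\tilde{\mathbf R}_1 - \tilde{\mathbf R}_2)^{\mathrm H} \geq \mathbf 0 \tag{3.49}$$

with equality if and only if $\tilde{\mathbf R}_1 = \tilde{\mathbf R}_2$. This implies

$$(\alpha\tilde{\mathbf R}_1 + (1 - \alpha)\tilde{\mathbf R}_2)\mathbf R_{xx}^{-*}(\alpha\tilde{\mathbf R}_1 + (1 - \alpha)\tilde{\mathbf R}_2)^* \leq \alpha\tilde{\mathbf R}_1\mathbf R_{xx}^{-*}\tilde{\mathbf R}_1^* + (1 - \alpha)\tilde{\mathbf R}_2\mathbf R_{xx}^{-*}\tilde{\mathbf R}_2^*. \tag{3.50}$$

Thus, we find

$$\alpha\mathbf Q(\tilde{\mathbf R}_1) + (1 - \alpha)\mathbf Q(\tilde{\mathbf R}_2) \leq \mathbf Q(\tilde{\mathbf R}_{xx}) \tag{3.51}$$

for $\tilde{\mathbf R}_{xx} = \alpha\tilde{\mathbf R}_1 + (1 - \alpha)\tilde{\mathbf R}_2$, with equality if and only if $\tilde{\mathbf R}_1 = \tilde{\mathbf R}_2$. This makes $\mathbf Q(\tilde{\mathbf R}_{xx})$ a strictly matrix-concave function. Since $\mathbf Q(\tilde{\mathbf R}_1) \geq \mathbf 0$ and $\mathbf Q(\tilde{\mathbf R}_2) \geq \mathbf 0$, a vanishing $\mathbf Q(\tilde{\mathbf R}_{xx}) = \mathbf 0$ requires $\tilde{\mathbf R}_1 = \tilde{\mathbf R}_2 = \tilde{\mathbf R}_{xx}$. Thus, $\tilde{\mathbf R}_{xx}$ is indeed an extreme point of $\mathcal Q$, which concludes the proof of the result.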
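The two characterizations (3.45) and (3.46) can be exercised together in a small numerical experiment. The sketch below is illustrative only: it generates candidate complementary covariance matrices from a chosen diagonal of $\mathbf K$ via (3.46) and tests membership in $\mathcal Q$ through the Schur complement (3.45). The helper names, the tolerance, and the random test matrices are assumptions of this example.

```python
import numpy as np

def schur_complement(Rxx, Rt):
    """Q(Rt) = Rxx - Rt Rxx^{-*} Rt^*, cf. (3.45)."""
    return Rxx - Rt @ np.linalg.solve(Rxx.conj(), Rt.conj())

def in_Q(Rxx, Rt, tol=1e-9):
    """True if the Schur complement is positive semidefinite up to tol."""
    Qm = schur_complement(Rxx, Rt)
    Qm = 0.5 * (Qm + Qm.conj().T)           # symmetrize against rounding
    return bool(np.linalg.eigvalsh(Qm).min() >= -tol)

def from_spectrum(Rxx, F, k):
    """Rt = Rxx^{1/2} F K F^T Rxx^{T/2}, cf. (3.46), with the Hermitian square root."""
    w, V = np.linalg.eigh(Rxx)
    R_sqrt = (V * np.sqrt(w)) @ V.conj().T
    return R_sqrt @ F @ np.diag(k) @ F.T @ R_sqrt.T

def random_unitary(n, rng):
    Z = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    return np.linalg.qr(Z)[0]

rng = np.random.default_rng(2)
n = 3
G = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
Rxx = G @ G.conj().T + np.eye(n)
F1, F2 = random_unitary(n, rng), random_unitary(n, rng)

R_interior = from_spectrum(Rxx, F1, [0.9, 0.5, 0.1])   # all k_i < 1
R_extreme = from_spectrum(Rxx, F2, [1.0, 1.0, 1.0])    # K = I
R_invalid = from_spectrum(Rxx, F1, [1.2, 0.5, 0.1])    # some "k_i" > 1
R_mid = 0.5 * (R_interior + R_extreme)                 # convex combination

print(in_Q(Rxx, R_interior), in_Q(Rxx, R_extreme), in_Q(Rxx, R_invalid), in_Q(Rxx, R_mid))
```

For $\mathbf K = \mathbf I$ the Schur complement should vanish to rounding error, consistent with Result 3.4, and the midpoint of two members should again lie in $\mathcal Q$, consistent with convexity.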

3.3

Degree of impropriety

Building upon our insights from the previous section, we may now ask how to quantify the degree of impropriety of $\mathbf x$. Propriety is preserved by strictly linear (but not widely linear) transformation. This suggests that a measure for the degree of impropriety should also be invariant under linear transformation. Such a measure of impropriety must then be a function of the circularity coefficients $k_i$ because the circularity coefficients constitute a maximal invariant for $\underline{\mathbf R}_{xx}$ under nonsingular linear transformation.³

There are several functions that are used to measure the multivariate association between two vectors, as discussed in more detail in Chapter 4. Applied to $\mathbf x$ and $\mathbf x^*$, some examples are

$$\rho_1 = 1 - \prod_{i=1}^{r}(1 - k_i^2), \tag{3.52}$$

$$\rho_2 = \prod_{i=1}^{r} k_i^2, \tag{3.53}$$

$$\rho_3 = \frac{1}{n}\sum_{i=1}^{r} k_i^2. \tag{3.54}$$

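These functions are defined for $r = 1, \ldots, n$. For full rank $r = n$, they can also be written in terms of $\mathbf R_{xx}$ and $\tilde{\mathbf R}_{xx}$. Then, from (3.34), we obtain

$$\rho_1 = 1 - \frac{\det\underline{\mathbf R}_{xx}}{\det{}^2\mathbf R_{xx}} = 1 - \frac{\det\mathbf Q}{\det\mathbf R_{xx}} \tag{3.55}$$

and from (3.29) we obtain

$$\rho_2 = \frac{\det(\tilde{\mathbf R}_{xx}\mathbf R_{xx}^{-*}\tilde{\mathbf R}_{xx}^*)}{\det\mathbf R_{xx}}, \tag{3.56}$$

$$\rho_3 = \frac{1}{n}\operatorname{tr}(\mathbf R_{xx}^{-1}\tilde{\mathbf R}_{xx}\mathbf R_{xx}^{-*}\tilde{\mathbf R}_{xx}^*). \tag{3.57}$$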

These measures all satisfy $0 \leq \rho_i \leq 1$. However, only $\rho_3$ has the two properties that $\rho_3 = 0$ indicates the proper case, i.e., $k_i = 0$ for all $i = 1, \ldots, n$, and $\rho_3 = 1$ the maximally improper case, in the sense that $k_i = 1$ for all $i = 1, \ldots, n$. Measure $\rho_2$ is 0 if at least one $k_i$ is 0, and $\rho_1$ is 1 if at least one $k_i$ is 1. While a case can be made for any of these measures, or many other functions of $k_i$, $\rho_1$ seems to be most compelling since it relates the entropy of an improper Gaussian random vector to that of the corresponding proper version through (3.35). Moreover, as we will see in Section 3.4, $\rho_1$ is also used as the test statistic in a generalized likelihood-ratio test for impropriety. For this reason, we will focus on $\rho_1$ in the remainder of this section.

Example 3.1. Figure 3.3 depicts a QPSK signalling constellation with I/Q imbalance characterized by gain imbalance (factor) $G > 0$ and quadrature skew $\phi$. The four equally likely signal points are $\{\pm\mathrm j, \pm G\mathrm e^{\mathrm j\phi}\}$. We find

$$\tilde R_{xx} = E\,x^2 = \tfrac14\big(\mathrm j^2 + (-\mathrm j)^2 + G^2\mathrm e^{2\mathrm j\phi} + (-G)^2\mathrm e^{2\mathrm j\phi}\big) = \tfrac12(G^2\mathrm e^{2\mathrm j\phi} - 1), \tag{3.58}$$

$$R_{xx} = E|x|^2 = \tfrac12(1 + G^2). \tag{3.59}$$

[Figure 3.3: constellation in the I/Q plane with the labeled points $\mathrm j$, $-\mathrm j$, and $G\mathrm e^{\mathrm j\phi}$ and the angle $\phi$.]

Figure 3.3 QPSK with I/Q imbalance.

Since $x$ is scalar, with variance $R_{xx}$ and complementary variance $\tilde R_{xx}$, the degree of impropriety $\rho_1$ becomes particularly simple:

$$\rho_1 = \frac{|\tilde R_{xx}|^2}{R_{xx}^2} = \frac{G^4 - 2G^2\cos(2\phi) + 1}{(1 + G^2)^2} = \begin{cases}\dfrac{(G^2 - 1)^2}{(G^2 + 1)^2}, & \phi = 0,\\[1ex] 1, & \phi = \pi/2,\\[1ex] \tfrac12(1 - \cos(2\phi)), & G = 1.\end{cases} \tag{3.60}$$

Perfect I/Q balance is obtained with $G = 1$ and $\phi = 0$. QPSK with perfect I/Q balance is proper, i.e., $\rho_1 = 0$. The worst possible I/Q imbalance $\phi = \pi/2$ results in a maximally improper random variable, i.e., $\rho_1 = 1$, irrespective of $G$.
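The closed form (3.60) can be cross-checked by averaging directly over the four signal points of Example 3.1. The following sketch does this for a few values of $G$ and $\phi$; the function names and the chosen test values are assumptions of this illustration.

```python
import numpy as np

def rho1_from_constellation(G, phi):
    """Degree of impropriety from moments of the four equiprobable signal points."""
    pts = np.array([1j, -1j, G * np.exp(1j * phi), -G * np.exp(1j * phi)])
    R = np.mean(np.abs(pts) ** 2)        # variance, cf. (3.59)
    R_tilde = np.mean(pts ** 2)          # complementary variance, cf. (3.58)
    return np.abs(R_tilde) ** 2 / R ** 2

def rho1_closed_form(G, phi):
    """Middle expression of (3.60)."""
    return (G ** 4 - 2 * G ** 2 * np.cos(2 * phi) + 1) / (1 + G ** 2) ** 2

for G, phi in [(1.0, 0.0), (1.3, 0.0), (0.7, np.pi / 2), (1.0, 0.4)]:
    print(G, phi, rho1_from_constellation(G, phi), rho1_closed_form(G, phi))
```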

3.3.1

Upper and lower bounds

So far, we have developed two internal descriptions $\boldsymbol\xi$ for $\mathbf x$: the principal components (3.14), which are found by a widely unitary transformation of $\mathbf x$, and the canonical coordinates (3.30), which are found by a strictly linear transformation of $\mathbf x$. Both principal components and canonical coordinates are uncorrelated, i.e.,

$$E\,\xi_i\xi_j^* = E\,\xi_i\xi_j = 0 \quad \text{for } i \neq j, \tag{3.61}$$

but only the canonical coordinates have unit variance. Both principal components and canonical coordinates are generally improper with

$$E\,\xi_i^2 = \begin{cases}\tfrac12(\lambda_{2i-1} - \lambda_{2i}), & \text{if } \boldsymbol\xi \text{ are principal components},\\ k_i, & \text{if } \boldsymbol\xi \text{ are canonical coordinates}.\end{cases} \tag{3.62}$$

It is natural to ask whether there is a connection between the eigenvalues $\lambda_i$ and the circularity coefficients $k_i$. There is indeed. The eigenvalue spectrum $\{\lambda_i\}_{i=1}^{2n}$ restricts the possibilities for the circularity spectrum $\{k_i\}_{i=1}^{n}$, albeit in a fairly intricate way. In the general setup (not restricted to the conjugate pair $\mathbf x$ and $\mathbf x^*$) this has been explored by Drury (2002), who characterizes admissible $k_i$s for given eigenvalues $\{\lambda_i\}$. The results are very involved, which is due to the fact that the singular values of the sum of two matrices are not easily characterized. It is much easier to develop bounds on certain functions of $\{k_i\}$ in terms of the eigenvalues $\{\lambda_i\}$. In particular, we are interested in bounds on the degree of impropriety $\rho_1$ if $\{\lambda_i\}$ are known. We first state the upper bound on $\rho_1$.

Result 3.5. The degree of impropriety $\rho_1$ of a vector $\mathbf x$ with prescribed eigenvalues $\{\lambda_i\}$ of the augmented covariance matrix $\underline{\mathbf R}_{xx}$ is upper-bounded by

$$\rho_1 = 1 - \prod_{i=1}^{r}(1 - k_i^2) \leq 1 - \prod_{i=1}^{r}\frac{4\lambda_i\lambda_{2n+1-i}}{(\lambda_i + \lambda_{2n+1-i})^2}, \quad r = 1, \ldots, n. \tag{3.63}$$

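This upper bound is attained when

$$\mathbf R_{xx} = \tfrac12\operatorname{Diag}(\lambda_1 + \lambda_{2n}, \lambda_2 + \lambda_{2n-1}, \ldots, \lambda_n + \lambda_{n+1}), \tag{3.64}$$

$$\tilde{\mathbf R}_{xx} = \tfrac12\operatorname{Diag}(\lambda_1 - \lambda_{2n}, \lambda_2 - \lambda_{2n-1}, \ldots, \lambda_n - \lambda_{n+1}). \tag{3.65}$$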

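This bound has been derived by Bartmann and Bloomfield (1981) for the canonical correlations between arbitrary pairs of real vectors $(\mathbf u, \mathbf v)$, and it holds a fortiori for the canonical correlations between $\mathbf x$ and $\mathbf x^*$. It is easy to see that $\underline{\mathbf R}_{xx}$, with $\mathbf R_{xx}$ and $\tilde{\mathbf R}_{xx}$ as specified in (3.64) and (3.65), is a valid augmented covariance matrix with eigenvalues $\{\lambda_i\}$ and attains the bound.

There is no nontrivial lower bound on the canonical correlations between arbitrary pairs of random vectors $(\mathbf x, \mathbf y)$. It is always possible to choose, for instance, $E\,\mathbf x\mathbf x^{\mathrm H} = \operatorname{Diag}(\lambda_1, \ldots, \lambda_n)$, $E\,\mathbf y\mathbf y^{\mathrm H} = \operatorname{Diag}(\lambda_{n+1}, \ldots, \lambda_{2n})$, and $E\,\mathbf x\mathbf y^{\mathrm H} = \mathbf 0$, which has the required eigenvalues $\{\lambda_i\}$ and zero canonical correlation matrix $\mathbf K = \mathbf 0$. That there is a lower bound on $\rho_1$ stems from the special structure of the augmented covariance matrix $\underline{\mathbf R}_{xx}$, where the northwest and southeast blocks must be complex conjugates. We will now derive this lower bound for $r = n$. Let $\mathbf x = \mathbf u + \mathrm j\mathbf v$ and $\mathbf z = [\mathbf u^{\mathrm T}, \mathbf v^{\mathrm T}]^{\mathrm T}$. From (2.21) and (2.22), we know that

$$\mathbf R_{xx} = \mathbf R_{uu} + \mathbf R_{vv} + \mathrm j(\mathbf R_{uv}^{\mathrm T} - \mathbf R_{uv}), \tag{3.66}$$

$$\tilde{\mathbf R}_{xx} = \mathbf R_{uu} - \mathbf R_{vv} + \mathrm j(\mathbf R_{uv}^{\mathrm T} + \mathbf R_{uv}). \tag{3.67}$$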

Since the eigenvalues are given,

$$\det\underline{\mathbf R}_{xx} = \prod_{i=1}^{2n}\lambda_i \tag{3.68}$$

is fixed. Hence, it follows from (3.34) that the minimum $\rho_1$ is achieved when $\det\mathbf R_{xx}$ is minimized. We can assume without loss of generality that $\mathbf R_{xx}$ is diagonal. If it is not, it can be made diagonal with a strictly unitary transform that leaves $\det\mathbf R_{xx}$ and the eigenvalues $\{\lambda_i\}$ unchanged. Thus, we have

$$\min\det\mathbf R_{xx} = \min\prod_{i=1}^{n}(\mathbf R_{xx})_{ii} = \min\prod_{i=1}^{n}\big[(\mathbf R_{uu})_{ii} + (\mathbf R_{vv})_{ii}\big]. \tag{3.69}$$

Now let $q_i$ be the $i$th largest diagonal element of $\mathbf R_{uu} + \mathbf R_{vv}$, and $r_i$ the $i$th largest diagonal element of $\mathbf R_{zz}$. Then,

$$\sum_{i=1}^{r} q_i \leq \sum_{i=1}^{r}(r_{2i-1} + r_{2i}) \leq \sum_{i=1}^{r}\frac{\lambda_{2i-1} + \lambda_{2i}}{2}, \quad r = 1, \ldots, n, \tag{3.70}$$

with equality for $r = n$. We have the second inequality because the diagonal elements of $\mathbf R_{zz}$ are majorized by the eigenvalues of $\mathbf R_{zz}$ (cf. Result A3.5). Since $\prod q_i$ is Schur-concave, a consequence of (3.70) is the following variant of Hadamard’s inequality:

$$\prod_{i=1}^{n}\big[(\mathbf R_{uu})_{ii} + (\mathbf R_{vv})_{ii}\big] \geq \prod_{i=1}^{n}\frac{\lambda_{2i-1} + \lambda_{2i}}{2}. \tag{3.71}$$

This implies

$$\min\det\mathbf R_{xx} = \prod_{i=1}^{n}\frac{\lambda_{2i-1} + \lambda_{2i}}{2}. \tag{3.72}$$

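Using this result in (3.34), we get

$$\prod_{i=1}^{n}(1 - k_i^2) \leq \frac{\prod_{i=1}^{2n}\lambda_i}{\prod_{i=1}^{n}\dfrac{(\lambda_{2i-1} + \lambda_{2i})^2}{4}}, \tag{3.73}$$

from which we obtain the following lower bound on $\rho_1$.

Result 3.6. The degree of impropriety $\rho_1$ of a vector $\mathbf x$ with prescribed eigenvalues $\{\lambda_i\}$ of the augmented covariance matrix $\underline{\mathbf R}_{xx}$ is lower-bounded by

$$\rho_1 = 1 - \prod_{i=1}^{n}(1 - k_i^2) \geq 1 - \prod_{i=1}^{n}\frac{4\lambda_{2i-1}\lambda_{2i}}{(\lambda_{2i-1} + \lambda_{2i})^2}. \tag{3.74}$$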

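The lower bound is attained if $\mathbf R_{uu} = \tfrac12\boldsymbol\Lambda^{(1)}$, $\mathbf R_{vv} = \tfrac12\boldsymbol\Lambda^{(2)}$, and $\mathbf R_{uv} = \mathbf 0$, or, equivalently,

$$\mathbf R_{xx} = \tfrac12(\boldsymbol\Lambda^{(1)} + \boldsymbol\Lambda^{(2)}) = \tfrac12\operatorname{Diag}(\lambda_1 + \lambda_2, \lambda_3 + \lambda_4, \ldots, \lambda_{2n-1} + \lambda_{2n}), \tag{3.75}$$

$$\tilde{\mathbf R}_{xx} = \tfrac12(\boldsymbol\Lambda^{(1)} - \boldsymbol\Lambda^{(2)}) = \tfrac12\operatorname{Diag}(\lambda_1 - \lambda_2, \lambda_3 - \lambda_4, \ldots, \lambda_{2n-1} - \lambda_{2n}), \tag{3.76}$$

where $\boldsymbol\Lambda^{(1)} = \operatorname{Diag}(\lambda_1, \lambda_3, \ldots, \lambda_{2n-1})$ and $\boldsymbol\Lambda^{(2)} = \operatorname{Diag}(\lambda_2, \lambda_4, \ldots, \lambda_{2n})$.

Example 3.2. In the scalar case $n = 1$, the upper bound equals the lower bound, which leads to the following expression for the degree of impropriety:

$$\rho_1 = k_1^2 = 1 - \frac{4\lambda_1\lambda_2}{(\lambda_1 + \lambda_2)^2} = \frac{(\lambda_1 - \lambda_2)^2}{(\lambda_1 + \lambda_2)^2}.$$

Thus we also have a simple expression for the circularity coefficient $k_1$ in terms of the eigenvalues $\lambda_1$ and $\lambda_2$:

$$k_1 = \frac{\lambda_1 - \lambda_2}{\lambda_1 + \lambda_2}.$$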
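For a given eigenvalue spectrum, the full-rank bounds (3.63) and (3.74) are simple to evaluate. The sketch below pairs the eigenvalues as the two results prescribe and returns both bounds for $r = n$; the function name and the example spectra are illustrative assumptions.

```python
import numpy as np

def rho1_bounds(lam):
    """Lower bound (3.74) and upper bound (3.63) on rho_1 for r = n.

    lam holds the 2n eigenvalues of the augmented covariance matrix.
    """
    lam = np.sort(np.asarray(lam, dtype=float))[::-1]   # lam_1 >= ... >= lam_2n
    n = lam.size // 2
    a, b = lam[:n], lam[::-1][:n]         # pairs (lam_i, lam_{2n+1-i}): maximum spread
    upper = 1.0 - np.prod(4 * a * b / (a + b) ** 2)
    c, d = lam[0::2], lam[1::2]           # pairs (lam_{2i-1}, lam_{2i}): minimum spread
    lower = 1.0 - np.prod(4 * c * d / (c + d) ** 2)
    return lower, upper

lower, upper = rho1_bounds([100.0, 50.0, 50.0, 2.0])    # illustrative spectrum, n = 2
lo1, up1 = rho1_bounds([3.0, 1.0])                       # scalar case n = 1
print(lower, upper, lo1, up1)
```

In the scalar case the two bounds coincide with $(\lambda_1 - \lambda_2)^2/(\lambda_1 + \lambda_2)^2$, as stated in Example 3.2.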

While the upper bound (3.63) holds for $r = 1, \ldots, n$, we have been able to establish the lower bound (3.74) for $r = n$ only. A natural conjecture is to assume that the diagonal $\mathbf R_{xx}$ and $\tilde{\mathbf R}_{xx}$ given by (3.75) and (3.76) also attain the lower bound for $r < n$. Let $c_i$ be the $i$th largest of the factors

$$\frac{4\lambda_{2j-1}\lambda_{2j}}{(\lambda_{2j-1} + \lambda_{2j})^2}, \quad j = 1, \ldots, n.$$

Then the conjecture may be written as

$$\rho_1 = 1 - \prod_{i=1}^{r}(1 - k_i^2) \geq 1 - \prod_{i=n-r+1}^{n} c_i, \quad r = 1, \ldots, n. \tag{3.77}$$

We point out that the diagonal matrices $\mathbf R_{xx}$ and $\tilde{\mathbf R}_{xx}$ that achieve the upper bound do not necessarily give upper bounds for other functions of $\{k_i\}$, such as $\rho_2$ and $\rho_3$. Drury et al. (2002) prove an upper bound for $\rho_2$ and conjecture an upper bound for $\rho_3$. Lower bounds for $\rho_2$ and $\rho_3$ are still unresolved problems.

Example 3.3. We have seen that the principal components minimize the degree of impropriety $\rho_1$ under widely unitary transformation. In this example, we show that they do not necessarily minimize other measures of impropriety such as $\rho_2$ and $\rho_3$. Consider an augmented covariance matrix $\underline{\mathbf R}_{xx}$ with eigenvalues 100, 50, 50, and 2. The principal components $\boldsymbol\xi$ have covariance matrix $\mathbf R_{\xi\xi} = \operatorname{Diag}(75, 26)$, complementary covariance matrix $\tilde{\mathbf R}_{\xi\xi} = \operatorname{Diag}(25, 24)$, and circularity coefficients $k_1 = 24/26$ and $k_2 = 25/75$. With the definitions (3.53) and (3.54) for $r = n = 2$, we compute $\rho_2 = k_1^2k_2^2 \approx 0.095$ and $\rho_3 = \tfrac12(k_1^2 + k_2^2) \approx 0.482$. On the other hand, there obviously exists a widely unitary transformation into coordinates $\mathbf x'$ with covariance matrix $\mathbf R_{x'x'} = \operatorname{Diag}(51, 50)$ and complementary covariance matrix $\tilde{\mathbf R}_{x'x'} = \operatorname{Diag}(49, 0)$, and circularity coefficients $k_1 = 49/51$ and $k_2 = 0/50$. The description $\mathbf x'$ is less improper than the principal components $\boldsymbol\xi$ when measured in terms of $\rho_2 = 0$ and $\rho_3 \approx 0.462$.
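The numbers in Example 3.3 can be reproduced by evaluating the definitions (3.52) to (3.54) directly. The sketch below assumes diagonal covariance and complementary covariance matrices, so that $k_i = |\tilde R_{ii}|/R_{ii}$; the function name is an illustrative choice.

```python
import numpy as np

def impropriety_measures(R_diag, Rt_diag):
    """rho_1, rho_2, rho_3 of (3.52)-(3.54) with r = n, for diagonal R and R-tilde."""
    k = np.abs(np.asarray(Rt_diag, dtype=float)) / np.asarray(R_diag, dtype=float)
    rho1 = 1.0 - np.prod(1.0 - k ** 2)
    rho2 = np.prod(k ** 2)
    rho3 = np.mean(k ** 2)
    return rho1, rho2, rho3

pc = impropriety_measures([75, 26], [25, 24])    # principal components
alt = impropriety_measures([51, 50], [49, 0])    # alternative description
print("principal components:", pc)
print("alternative:         ", alt)
```

The principal components give the smaller $\rho_1$, while the alternative description gives the smaller product and mean of the squared circularity coefficients.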

Least improper analog

Given a random vector $\mathbf x$, we can produce a least improper analog $\underline{\boldsymbol\xi} = \underline{\mathbf U}^{\mathrm H}\underline{\mathbf x}$, using a widely unitary transformation $\underline{\mathbf U}$. It is clear from (3.75) and (3.76) that the principal components obtained from (3.14) are such an analog, with $\underline{\mathbf U}$ determined by the EVD $\underline{\mathbf R}_{xx} = \underline{\mathbf U}\boldsymbol\Lambda\underline{\mathbf U}^{\mathrm H}$. The principal components $\boldsymbol\xi$ have the same eigenvalues and thus the same power as $\mathbf x$. They minimize $\rho_1$ and thus maximize entropy under widely unitary transformation. We note that a least improper analog is not unique, since any strictly unitary transform will leave both the eigenvalues $\{\lambda_i\}$ and the canonical correlations $\{k_i\}$ unchanged.

3.3.2

Eigenvalue spread of the augmented covariance matrix

Let us try to further illuminate the upper and lower bounds. Both the upper and the lower bounds are attained when $\mathbf R_{xx}$ and $\tilde{\mathbf R}_{xx}$ are diagonal matrices. For $\mathbf R_{xx} = \operatorname{Diag}(R_{11}, \ldots, R_{nn})$ and $\tilde{\mathbf R}_{xx} = \operatorname{Diag}(\tilde R_{11}, \ldots, \tilde R_{nn})$, we find that

$$\prod_{i=1}^{n}(1 - k_i^2) = \prod_{i=1}^{n}\left(1 - \frac{|\tilde R_{ii}|^2}{R_{ii}^2}\right). \tag{3.78}$$

If $\underline{\mathbf R}_{xx}$ has diagonal blocks $\mathbf R_{xx}$ and $\tilde{\mathbf R}_{xx}$, it has eigenvalues $\{R_{ii} \pm |\tilde R_{ii}|\}_{i=1}^{n}$. This gives

$$\prod_{i=1}^{n}(1 - k_i^2) = \prod_{i=1}^{n}\left(1 - \frac{(a_i - b_i)^2}{(a_i + b_i)^2}\right) = \prod_{i=1}^{n}\frac{4a_ib_i}{(a_i + b_i)^2}, \tag{3.79}$$

where $\{a_i\}_{i=1}^{n}$ and $\{b_i\}_{i=1}^{n}$ are two disjoint subsets of $\{\lambda_i\}_{i=1}^{2n}$. Each factor

$$\frac{4a_ib_i}{(a_i + b_i)^2}$$

is the squared ratio of the geometric and arithmetic means of $a_i$ and $b_i$. Hence, it is 1 if $a_i = b_i$, and 0 if $a_i$ or $b_i$ are 0, and thus measures the spread between $a_i$ and $b_i$. Minimizing or maximizing (3.79) is a matter of choosing the subsets $\{a_i\}$ and $\{b_i\}$ from the eigenvalues $\{\lambda_i\}$ using a combinatorial argument presented by Bloomfield and Watson (1975). In order to minimize (3.79), we need maximum spread between the two sets $\{a_i\}$ and $\{b_i\}$, which is achieved by choosing $a_i = \lambda_i$ and $b_i = \lambda_{2n+1-i}$, in agreement with (3.63). In order to maximize (3.79), we need minimum spread between $\{a_i\}$ and $\{b_i\}$, which is achieved by $a_i = \lambda_{2i-1}$ and $b_i = \lambda_{2i}$. Hence, the degree of impropriety is related to the eigenvalue spread of $\underline{\mathbf R}_{xx}$.

3.3.3

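Maximally improper vectors

Following this line of thought, one might expect a vector that is maximally improper, in the sense that $\mathbf K = \mathbf I$, to correspond to an augmented covariance matrix with maximum possible eigenvalue spread. This was in fact claimed by Schreier et al. (2005) but, unfortunately, it is only partially true. Let $\operatorname{ev}(\mathbf R_{xx}) = [\mu_1, \mu_2, \ldots, \mu_n]^{\mathrm T}$ and $\operatorname{ev}(\underline{\mathbf R}_{xx}) = [\lambda_1, \lambda_2, \ldots, \lambda_{2n}]^{\mathrm T}$. Let $\underline{\mathbf R}_{\#}$ be the augmented covariance matrix of a maximally improper vector with $\mathbf K = \mathbf I$. Using (3.46), we may write

$$\underline{\mathbf R}_{\#} = \begin{bmatrix}\mathbf R_{xx} & \mathbf R_{xx}^{1/2}\mathbf F\mathbf F^{\mathrm T}\mathbf R_{xx}^{\mathrm T/2}\\ \mathbf R_{xx}^{*/2}\mathbf F^*\mathbf F^{\mathrm H}\mathbf R_{xx}^{\mathrm H/2} & \mathbf R_{xx}^*\end{bmatrix} \tag{3.80}$$

for some unitary matrix $\mathbf F$. The matrix $\underline{\mathbf R}_{\#}$ has a vanishing Schur complement and thus $\tilde{\mathbf R}_{\#} = \mathbf R_{xx}^{1/2}\mathbf F\mathbf F^{\mathrm T}\mathbf R_{xx}^{\mathrm T/2}$ is an extreme point in the set $\mathcal Q$. Schreier et al. (2005) incorrectly stated that $\operatorname{ev}(\underline{\mathbf R}_{\#}) = [2\mu_1, 2\mu_2, \ldots, 2\mu_n, \mathbf 0_n^{\mathrm T}]^{\mathrm T}$ for any extreme point $\tilde{\mathbf R}_{\#}$. While this is not true, there is indeed at least one extreme point $\tilde{\mathbf R}_{\#\#}$ such that the augmented covariance matrix $\underline{\mathbf R}_{\#\#}$ has these eigenvalues. Let $\mathbf R_{xx} = \mathbf U\mathbf M\mathbf U^{\mathrm H}$ be the EVD of $\mathbf R_{xx}$. Choosing

$$\tilde{\mathbf R}_{\#\#} = \mathbf U\mathbf M\mathbf U^{\mathrm T} = \mathbf U\mathbf M^{1/2}\mathbf U^{\mathrm H}\,\mathbf U\mathbf U^{\mathrm T}\,\mathbf U^*\mathbf M^{1/2}\mathbf U^{\mathrm T} = \mathbf R_{xx}^{1/2}\mathbf F\mathbf F^{\mathrm T}\mathbf R_{xx}^{\mathrm T/2} \tag{3.81}$$

with $\mathbf F = \mathbf U$ means that

$$\underline{\mathbf R}_{\#\#} = \begin{bmatrix}\mathbf R_{xx} & \tilde{\mathbf R}_{\#\#}\\ \tilde{\mathbf R}_{\#\#}^* & \mathbf R_{xx}^*\end{bmatrix} \tag{3.82}$$

has eigenvalues $\operatorname{ev}(\underline{\mathbf R}_{\#\#}) = [2\mu_1, 2\mu_2, \ldots, 2\mu_n, \mathbf 0_n^{\mathrm T}]^{\mathrm T}$. Let us now establish the following.

Result 3.7. There is the majorization preordering

$$\operatorname{ev}(\underline{\mathbf R}_0) \prec \operatorname{ev}(\underline{\mathbf R}_{xx}) \prec \operatorname{ev}(\underline{\mathbf R}_{\#\#}). \tag{3.83}$$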

This says that, for given Hermitian covariance matrix $\mathbf R_{xx}$, the vector whose augmented covariance matrix $\underline{\mathbf R}_{xx}$ has least eigenvalue spread must be proper: $\tilde{\mathbf R}_{xx} = \mathbf 0$ and $\underline{\mathbf R}_0 = \operatorname{Diag}(\mathbf R_{xx}, \mathbf R_{xx}^*)$. The vector whose augmented covariance matrix $\underline{\mathbf R}_{xx}$ has maximum eigenvalue spread must be maximally improper, i.e., $\mathbf K = \mathbf I$.

In order to show this result we note that the left inequality is (3.23), and the right inequality is a consequence of Result A3.8. Applied to the matrix $\underline{\mathbf R}_{xx}$, Result A3.8 says that

$$\sum_{i=1}^{k}(\lambda_i + \lambda_{2n-k+i}) \leq \sum_{i=1}^{k}2\mu_i, \quad k = 1, \ldots, n, \tag{3.84}$$

and, since $\lambda_{2n-k+i} \geq 0$,

$$\sum_{i=1}^{k}\lambda_i \leq \sum_{i=1}^{k}2\mu_i, \quad k = 1, \ldots, n. \tag{3.85}$$

Moreover, the trace constraint $\operatorname{tr}\underline{\mathbf R}_{xx} = 2\operatorname{tr}\mathbf R_{xx}$ and $\lambda_i \geq 0$ imply

$$\sum_{i=1}^{k}\lambda_i \leq \sum_{i=1}^{n}2\mu_i, \quad k = n+1, \ldots, 2n. \tag{3.86}$$

Together, (3.85) and (3.86) prove the result. Again, $\mathbf K = \mathbf I$ is a necessary condition only for maximum eigenvalue spread of $\underline{\mathbf R}_{xx}$. It is not sufficient, as Schreier et al. (2005) incorrectly claimed.

In Section 5.4.2, we will use (3.83) to maximize/minimize Schur-convex/concave functions of $\operatorname{ev}(\underline{\mathbf R}_{xx})$ for fixed $\mathbf R_{xx}$. It follows from (3.83) that these maxima/minima will be achieved for the extreme point $\tilde{\mathbf R}_{\#\#}$, but not necessarily for all extreme points.
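Result 3.7 lends itself to a quick numerical sanity check. The sketch below is an illustration under stated assumptions: it draws a random Hermitian covariance matrix, forms one admissible complementary covariance via (3.46) with an arbitrarily chosen $\mathbf K$, forms the particular extreme point of (3.81), and tests both majorizations in (3.83). The helper names, the tolerance, and the random test data are hypothetical.

```python
import numpy as np

def is_majorized_by(b, a, tol=1e-8):
    """True if b ≺ a: partial sums of sorted b never exceed those of a, totals equal."""
    a = np.sort(a)[::-1]
    b = np.sort(b)[::-1]
    partial_ok = np.all(np.cumsum(b) <= np.cumsum(a) + tol)
    return bool(partial_ok and abs(a.sum() - b.sum()) <= tol)

def augmented(Rxx, Rt):
    return np.block([[Rxx, Rt], [Rt.conj(), Rxx.conj()]])

rng = np.random.default_rng(3)
n = 3
G = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
Rxx = G @ G.conj().T
mu, U = np.linalg.eigh(Rxx)                      # Rxx = U diag(mu) U^H

# A generic member of Q via (3.46), with an arbitrary unitary F and K = Diag(0.8, 0.4, 0.1).
R_sqrt = (U * np.sqrt(mu)) @ U.conj().T
Z = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
F = np.linalg.qr(Z)[0]
Rt = R_sqrt @ F @ np.diag([0.8, 0.4, 0.1]) @ F.T @ R_sqrt.T

Rt_pp = (U * mu) @ U.T                           # extreme point U M U^T of (3.81)

ev_0 = np.concatenate([mu, mu])                  # proper case: each mu_i twice
ev_x = np.linalg.eigvalsh(augmented(Rxx, Rt))
ev_pp = np.linalg.eigvalsh(augmented(Rxx, Rt_pp))
print(is_majorized_by(ev_0, ev_x), is_majorized_by(ev_x, ev_pp))
```

A single random draw is of course no proof; the check only confirms that the chain of partial-sum inequalities holds for this instance and that the extreme point of (3.81) has eigenvalues $2\mu_i$ together with $n$ zeros.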

3.4

Testing for impropriety

In practice, the complementary covariance matrix must often be estimated from the data available. Such an estimate will in general be nonzero even if the source is actually proper. So how do we classify a problem as proper or improper? In this section, we present a hypothesis test for impropriety that is based on a generalized likelihood-ratio test (GLRT), which is a special case of a more general class of tests presented in Section 4.5. A general introduction to likelihood-ratio tests is provided in Section 7.1.

In a GLR, the unknown parameters ($\mathbf R_{xx}$ and $\tilde{\mathbf R}_{xx}$ in our case) are replaced by maximum-likelihood estimates. The GLR is always invariant with respect to transformations for which the hypothesis-testing problem itself is invariant. Since propriety is preserved by strictly linear, but not widely linear, transformations, the hypothesis test must be invariant with respect to strictly linear, but not widely linear, transformations. A maximal invariant statistic under linear transformation is given by the circularity coefficients. Since the GLR must be a function of a maximal invariant statistic, the GLR is a function of the circularity coefficients.

Let $\mathbf x$ be a complex Gaussian random vector with probability density function

$$p(\mathbf x) = \pi^{-n}(\det\underline{\mathbf R}_{xx})^{-1/2}\exp\left\{-\tfrac12(\underline{\mathbf x} - \underline{\boldsymbol\mu}_x)^{\mathrm H}\underline{\mathbf R}_{xx}^{-1}(\underline{\mathbf x} - \underline{\boldsymbol\mu}_x)\right\} \tag{3.87}$$

with augmented mean vector $\underline{\boldsymbol\mu}_x = E\,\underline{\mathbf x}$ and augmented covariance matrix $\underline{\mathbf R}_{xx} = E[(\underline{\mathbf x} - \underline{\boldsymbol\mu}_x)(\underline{\mathbf x} - \underline{\boldsymbol\mu}_x)^{\mathrm H}]$. Consider $M$ independent and identically distributed (i.i.d.) random samples $\mathbf X = [\mathbf x_1, \mathbf x_2, \ldots, \mathbf x_M]$ drawn from this distribution, and let $\underline{\mathbf X} = [\underline{\mathbf x}_1, \underline{\mathbf x}_2, \ldots, \underline{\mathbf x}_M]$ denote the augmented sample matrix. As shown in Section 2.4, the joint probability density function of these samples is

$$p(\mathbf X) = \pi^{-Mn}(\det\underline{\mathbf R}_{xx})^{-M/2}\exp\left\{-\frac{M}{2}\operatorname{tr}(\underline{\mathbf R}_{xx}^{-1}\underline{\mathbf S}_{xx})\right\}, \tag{3.88}$$

where $\underline{\mathbf S}_{xx}$ is the augmented sample covariance matrix

$$\underline{\mathbf S}_{xx} = \begin{bmatrix}\mathbf S_{xx} & \tilde{\mathbf S}_{xx}\\ \tilde{\mathbf S}_{xx}^* & \mathbf S_{xx}^*\end{bmatrix} = \frac{1}{M}\sum_{m=1}^{M}(\underline{\mathbf x}_m - \underline{\mathbf m}_x)(\underline{\mathbf x}_m - \underline{\mathbf m}_x)^{\mathrm H} = \frac{1}{M}\underline{\mathbf X}\,\underline{\mathbf X}^{\mathrm H} - \underline{\mathbf m}_x\underline{\mathbf m}_x^{\mathrm H} \tag{3.89}$$

and $\underline{\mathbf m}_x$ is the augmented sample mean vector

$$\underline{\mathbf m}_x = \frac{1}{M}\sum_{m=1}^{M}\underline{\mathbf x}_m. \tag{3.90}$$

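We will now develop the GLR test of the hypotheses

$H_0$: $\mathbf x$ is proper ($\tilde{\mathbf R}_{xx} = \mathbf 0$),
$H_1$: $\mathbf x$ is improper ($\tilde{\mathbf R}_{xx} \neq \mathbf 0$).

The GLRT statistic is

$$\lambda = \frac{\displaystyle\max_{\underline{\mathbf R}_{xx}:\ \tilde{\mathbf R}_{xx} = \mathbf 0} p(\mathbf X)}{\displaystyle\max_{\underline{\mathbf R}_{xx}} p(\mathbf X)}. \tag{3.91}$$

This is the ratio of likelihood with $\underline{\mathbf R}_{xx}$ constrained to have zero off-diagonal blocks, $\tilde{\mathbf R}_{xx} = \mathbf 0$, to likelihood with $\underline{\mathbf R}_{xx}$ unconstrained. We are thus testing whether or not $\underline{\mathbf R}_{xx}$ is block-diagonal.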

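As discussed in Section 2.4, the unconstrained maximum-likelihood (ML) estimate of $\underline{\mathbf R}_{xx}$ is the augmented sample covariance matrix $\underline{\mathbf S}_{xx}$. The ML estimate of $\underline{\mathbf R}_{xx}$ under the constraint $\tilde{\mathbf R}_{xx} = \mathbf 0$ is

$$\underline{\mathbf S}_0 = \begin{bmatrix}\mathbf S_{xx} & \mathbf 0\\ \mathbf 0 & \mathbf S_{xx}^*\end{bmatrix}. \tag{3.92}$$

Hence, the GLR (3.91) can be expressed as

$$\ell = \lambda^{2/M} = \det(\underline{\mathbf S}_0^{-1}\underline{\mathbf S}_{xx})\left[\exp\left\{-\frac{M}{2}\operatorname{tr}(\underline{\mathbf S}_0^{-1}\underline{\mathbf S}_{xx} - \mathbf I)\right\}\right]^{2/M} \tag{3.93}$$

$$\phantom{\ell} = \det\begin{bmatrix}\mathbf I & \mathbf S_{xx}^{-1}\tilde{\mathbf S}_{xx}\\ \mathbf S_{xx}^{-*}\tilde{\mathbf S}_{xx}^* & \mathbf I\end{bmatrix} \tag{3.94}$$

$$\phantom{\ell} = \det{}^{-2}\mathbf S_{xx}\,\det\underline{\mathbf S}_{xx} \tag{3.95}$$

$$\phantom{\ell} = \prod_{i=1}^{n}(1 - \hat k_i^2) = 1 - \hat\rho_1. \tag{3.96}$$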

In the last line, $\{\hat k_i\}$ denotes the estimated circularity coefficients, which are computed from the augmented sample covariance matrix $\underline{\mathbf S}_{xx}$. The estimated circularity coefficients are then used to estimate the degree of impropriety $\rho_1$. Our main finding follows.

Result 3.8. The estimated degree of impropriety $\hat\rho_1$ is a test statistic for a GLRT for impropriety.

Equations (3.95) and (3.96) are equivalent formulations of this GLR. A full-rank implementation of this test relies on (3.95) since it does not require computation of $\mathbf S_{xx}^{-1}$. However, a reduced-rank implementation considers only the $r$ largest estimated circularity coefficients in the product (3.96).

This test was first proposed by Andersson and Perlman (1984), then in complex notation by Ollila and Koivunen (2004), and the connection with canonical correlations was established by Schreier et al. (2006). Andersson and Perlman (1984) also show that the estimated degree of impropriety

$$\hat\rho_3 = \frac{1}{n}\sum_{i=1}^{n}\hat k_i^2 \tag{3.97}$$

rather than $\hat\rho_1$ is the locally most powerful (LMP) test for impropriety. An LMP test has the highest possible power (i.e., probability of detection) for $H_1$ close to $H_0$, where all circularity coefficients are small. Intuitively, it is clear that $\hat\rho_1$ and $\hat\rho_3$ behave quite differently. While $\hat\rho_1$ is close to 1 if at least one circularity coefficient is close to 1, $\hat\rho_3$ is close to 1 only if all circularity coefficients are close to 1. Walden and Rubin-Delanchy (2009) reexamined testing for impropriety, studying the null distributions of $\hat\rho_1$ and $\hat\rho_3$, and deriving a distributional approximation for $\hat\rho_1$. They also point out that no uniformly most powerful (UMP) test exists for this problem, simply because the GLR and the LMP tests are different for dimensions $n \geq 2$. (In the scalar case $n = 1$, the GLR and LMP tests are identical.)

[Figure 3.4: two scatter plots of received symbols in the complex plane, panels (a) and (b).]

Figure 3.4 BPSK symbols transmitted over (a) a noncoherent and (b) a partially coherent AWGN channel.

Example 3.4. In the scalar case $n = 1$, the GLR becomes

$$\ell = 1 - \frac{|\tilde S_{xx}|^2}{S_{xx}^2}, \tag{3.98}$$

with sample variance $S_{xx}$ and sample complementary variance $\tilde{S}_{xx}$. As a simple example, we consider the transmission of BPSK symbols (that is, equiprobable binary data $b_m \in \{\pm 1\}$) over an AWGN channel that also rotates the phase of the transmitted bits by $\phi_m$. The received statistic is

$$x_m = b_m e^{j\phi_m} + n_m, \qquad (3.99)$$

where $n_m$ are samples of white Gaussian noise and $\phi_m$ are samples of the channel phase. We are interested in classifying this channel as either noncoherent or partially coherent. We will see that, even though $x_m$ is not Gaussian, the GLR (3.98) is well suited for this hypothesis test.

We evaluate the performance of the GLRT detector by Monte Carlo simulations. Under H0, we assume that the phase samples $\phi_m$ are i.i.d. and uniformly distributed. Under H1, we assume that the phase samples are i.i.d. and drawn from a Gaussian distribution. This means that, under H0, no useful phase information can be extracted, whereas under H1 a phase estimate is available, albeit with a tracking error. Figure 3.4 plots BPSK symbols that have been transmitted over a noncoherent additive white Gaussian noise channel with uniformly distributed phase in (a), and over a partially coherent channel in (b). Figure 3.5 shows experimentally estimated receiver operating characteristics (ROC) for this detector for various signal-to-noise ratios (SNRs) and variances of the phase tracking error. In an ROC curve, the probability of detection $P_D$, which is the probability of correctly accepting H1, is plotted versus the probability of false alarm $P_{FA}$, which is the probability of incorrectly rejecting H0. The ROC curve will be taken up more thoroughly in Section 7.3.
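The Monte Carlo experiment of Example 3.4 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the simulation used for Figure 3.5: the function names are hypothetical, the SNR is taken to be unit signal power divided by the noise variance, the Gaussian phase under H1 is assumed to be zero-mean, and H1 is decided when the statistic (3.98) falls below a threshold (the statistic is close to 1 for proper data and decreases with impropriety).

```python
import numpy as np

def glr_scalar(x):
    """Scalar GLR statistic (3.98): 1 - |S~xx|^2 / Sxx^2."""
    s_xx = np.mean(np.abs(x) ** 2)        # sample variance
    s_tilde = np.mean(x ** 2)             # sample complementary variance
    return 1.0 - np.abs(s_tilde) ** 2 / s_xx ** 2

def received(num_samples, snr_db, phase_var, rng):
    """Draw x_m = b_m exp(j phi_m) + n_m as in (3.99).

    phase_var=None gives uniform phase (H0); otherwise the phase is
    zero-mean Gaussian with the given variance (H1, an assumption).
    """
    b = rng.choice([-1.0, 1.0], size=num_samples)
    if phase_var is None:
        phi = rng.uniform(-np.pi, np.pi, size=num_samples)
    else:
        phi = rng.normal(0.0, np.sqrt(phase_var), size=num_samples)
    noise_var = 10.0 ** (-snr_db / 10.0)  # assumes unit signal power
    n = np.sqrt(noise_var / 2) * (rng.standard_normal(num_samples)
                                  + 1j * rng.standard_normal(num_samples))
    return b * np.exp(1j * phi) + n

rng = np.random.default_rng(0)
trials, M = 200, 1000
stat_h0 = np.array([glr_scalar(received(M, 0.0, None, rng)) for _ in range(trials)])
stat_h1 = np.array([glr_scalar(received(M, 0.0, 1.0, rng)) for _ in range(trials)])

# Decide H1 when the statistic is at or below a threshold; sweeping the
# threshold over all observed values traces out an empirical ROC curve.
thresholds = np.sort(np.concatenate([stat_h0, stat_h1]))
p_fa = np.array([(stat_h0 <= t).mean() for t in thresholds])
p_d = np.array([(stat_h1 <= t).mean() for t in thresholds])
```

Plotting `p_d` against `p_fa` gives one empirical ROC curve; repeating the run for other values of `snr_db` and `phase_var` would give a family of curves in the spirit of Figure 3.5.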

Figure 3.5 Receiver operating characteristics of the GLRT detector, plotting $P_D$ versus $P_{FA}$. In (a), the SNR is fixed at 0 dB. From northwest to southeast, the curves correspond to phase tracking-error variance of 0.7, 0.95, 1.2, and 1.5. In (b), the phase tracking-error variance is fixed at 1. From northwest to southeast, the curves correspond to SNR of 5 dB, 0 dB, and −5 dB. In all cases, the number of samples was M = 1000.

3.5 Independent component analysis

An interesting application of the invariance property of the circularity coefficients is independent component analysis (ICA). In ICA, we observe a linear mixture y of independent complex components (sources) x, as described by

$$y = Mx. \qquad (3.100)$$

We will make a few simplifying assumptions in this section. The dimensions of y and x are assumed to be equal, and the mixing matrix M is assumed to be nonsingular. The objective is to blindly recover the sources x from the observations y, without knowledge of M, using a linear transformation $M^{\#}$. This transformation $M^{\#}$ can be regarded as a blind inverse of M, which is usually called a separating matrix. Note that, since the model (3.100) is linear, it is unnecessary to consider widely linear transformations.

ICA seeks to determine independent components. Arbitrary scaling of x, i.e., multiplication by a diagonal matrix, and reordering the components of x, i.e., multiplication by a permutation matrix, preserves the independence of its components. The product of a diagonal and a permutation matrix is a monomial matrix, which has exactly one nonzero entry in each column and row. Hence, we can determine $M^{\#}$ up to multiplication with a monomial matrix.

Standard ICA requires the use of higher-order statistical information, and the blind recovery of x cannot work if more than one source $x_i$ is Gaussian. If only second-order information is available, the best possible solution is to decorrelate the components, rather than to make them independent. This is done by determining the principal components $U^H y$ using the EVD $R_{yy} = E\,yy^H = U\Lambda U^H$. However, the restriction to unitary rather than general linear transformations wastes a considerable degree of freedom in designing the blind inverse $M^{\#}$. In this section, we demonstrate that, in the complex case, it can be possible to determine $M^{\#}$ using second-order information only. This was first shown by De Lathauwer and De Moor (2002) and independently discovered by Eriksson and Koivunen (2006).

Figure 3.6 Two-channel model for complex ICA. The vertical arrows are labeled with the cross-covariance matrix between the upper and lower lines (i.e., the complementary covariance).

The key insight in our demonstration is that the independence of the components of x means that, up to simple scaling and permutation, x is already given in canonical coordinates. The idea is then to exploit the invariance of circularity coefficients of x under the linear mixing transformation M.

The assumption of independent components x implies that the covariance matrix $R_{xx}$ and the complementary covariance matrix $\tilde{R}_{xx}$ are both diagonal. It is therefore easy to compute canonical coordinates between x and $x^*$, denoted by $\xi = A_{xx}x$. In the strong uncorrelating transform $A_{xx} = F_{xx}^H R_{xx}^{-1/2}$, $R_{xx}^{-1/2}$ is a diagonal scaling matrix, and $F_{xx}^H$ is a permutation matrix that rearranges the canonical coordinates ξ such that $\xi_1$ corresponds to the largest circularity coefficient $k_1$, $\xi_2$ to the second largest coefficient $k_2$, and so on. This makes the strong uncorrelating transform $A_{xx}$ monomial. As a consequence, ξ also has independent components.

The mixture y has covariance matrix $R_{yy} = MR_{xx}M^H$ and complementary covariance matrix $\tilde{R}_{yy} = M\tilde{R}_{xx}M^T$. The canonical coordinates of y and $y^*$ are computed as $\omega = A_{yy}y = F_{yy}^H R_{yy}^{-1/2}y$ and $\omega^* = A_{yy}^* y^*$. The strong uncorrelating transform $A_{yy}$ is determined as explained in Section 3.2.2. Figure 3.6 shows the connection between the different coordinate systems. The important observation is that ξ and ω are both in canonical coordinates with the same circularity coefficients $k_i$. In the next paragraph, we will show that ξ and ω are related as ω = Dξ by a diagonal matrix D with diagonal entries ±1, provided that all circularity coefficients are distinct.
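Since ξ has independent components, so does ω. Hence, we have a solution to the ICA problem.

Result 3.9. The strong uncorrelating transform $A_{yy}$ is a separating matrix for the complex linear ICA problem if all circularity coefficients are distinct.

The only thing left to show is that $D = A_{yy}MA_{xx}^{-1}$ is indeed diagonal with diagonal elements ±1. Since ξ and ω are both in canonical coordinates with the same diagonal canonical correlation matrix K, we find

$$\begin{bmatrix} I & K \\ K & I \end{bmatrix} = E\,\underline{\omega}\,\underline{\omega}^H = \begin{bmatrix} D & 0 \\ 0 & D^* \end{bmatrix} E\,\underline{\xi}\,\underline{\xi}^H \begin{bmatrix} D^H & 0 \\ 0 & D^T \end{bmatrix} = \begin{bmatrix} D & 0 \\ 0 & D^* \end{bmatrix} \begin{bmatrix} I & K \\ K & I \end{bmatrix} \begin{bmatrix} D^H & 0 \\ 0 & D^T \end{bmatrix} \qquad (3.101)$$

$$= \begin{bmatrix} DD^H & DKD^T \\ D^*KD^H & D^*D^T \end{bmatrix}. \qquad (3.102)$$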

This shows that D is unitary and $DKD^T = K$. The latter can be true only if $D_{ij} = 0$ whenever $k_i \neq k_j$. Therefore, D is diagonal and unitary if all circularity coefficients are distinct. Since K is real, the corresponding diagonal entries of all nonzero circularity coefficients are actually ±1. On the other hand, components with identical circularity coefficient cannot be separated.

Example 3.5. Consider a source $x = [x_1, x_2]^T$. The first component $x_1$ is the signal-space representation of a QPSK signal with amplitude 2 and phase offset π/8, i.e., $x_1 \in \{\pm 2e^{j\pi/8}, \pm 2je^{j\pi/8}\}$. The second component $x_2$, independent of $x_1$, is the signal-space representation of a BPSK signal with amplitude 1 and phase offset π/4, i.e., $x_2 \in \{\pm e^{j\pi/4}\}$. Hence,

$$R_{xx} = \begin{bmatrix} 4 & 0 \\ 0 & 1 \end{bmatrix} \quad\text{and}\quad \tilde{R}_{xx} = \begin{bmatrix} 0 & 0 \\ 0 & j \end{bmatrix}.$$

In order to take x into canonical coordinates, we use the strong uncorrelating transform

$$A_{xx} = F_{xx}^H R_{xx}^{-1/2} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} \tfrac{1}{2} & 0 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ \tfrac{1}{2} & 0 \end{bmatrix}.$$

We see that $F_{xx}^H$ is a permutation matrix, $R_{xx}^{-1/2}$ is diagonal, and the product of the two is monomial. The circularity coefficients are $k_1 = 1$ and $k_2 = 0$. Note that the circularity coefficients carry no information about the amplitude or phase of the two signals. Now consider the linear mixture y = Mx with

$$M = \begin{bmatrix} -j & 1 \\ 2-j & 1+j \end{bmatrix}.$$

With $R_{yy} = MR_{xx}M^H$ and $\tilde{R}_{yy} = M\tilde{R}_{xx}M^T$, we compute the SVD of the coherence matrix $C = R_{yy}^{-1/2}\tilde{R}_{yy}R_{yy}^{-T/2} = UKV^H$. The unitary Takagi factor $F_{yy}$ is then obtained as $F_{yy} = U(V^H U^*)^{1/2}$. The circularity coefficients of y are the same as those of x. In order to take y into canonical coordinates, we use the strong uncorrelating transform (rounded to four decimals)

$$A_{yy} = F_{yy}^H R_{yy}^{-1/2} = \begin{bmatrix} 0.7071 - 2.1213j & 0.7071 + 0.7071j \\ -0.7045 - 0.0605j & 0.3825 - 0.3220j \end{bmatrix}.$$

We see that

$$A_{yy}M = \begin{bmatrix} 0 & 0.7071 - 0.7071j \\ 0.3825 - 0.3220j & 0 \end{bmatrix}$$

is monomial, and

$$A_{yy}MA_{xx}^{-1} = \begin{bmatrix} 0.7071 - 0.7071j & 0 \\ 0 & 0.7650 - 0.6440j \end{bmatrix}$$

is diagonal and unitary.

One final comment is in order. The technique presented in this section enables the blind separation of mixtures using second-order statistics only, under two crucial assumptions. First, the sources must be complex and uncorrelated with distinct circularity coefficients. Second, y must be a linear mixture. If y does not satisfy the linear model (3.100), the objective of ICA is to find components that are as independent as possible. The degree of independence is measured by a contrast function such as mutual information or negentropy. It is important to realize that the strong uncorrelating transform $A_{yy}$ is not guaranteed to optimize any contrast function. Finding maximally independent components in the nonlinear case requires the use of higher-order statistics.⁴
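The computation in Example 3.5 can be retraced numerically. The sketch below is one possible way to do it, assuming NumPy and a Hermitian inverse square root; the helper names are hypothetical. Because a Takagi factor is unique only up to signs (and up to an arbitrary phase for a zero circularity coefficient), the entries of the computed $A_{yy}$ may differ in phase from the values quoted above, so the sketch only inspects magnitudes.

```python
import numpy as np

def inv_sqrt_hermitian(R):
    """Inverse Hermitian square root of a positive-definite matrix."""
    w, V = np.linalg.eigh(R)
    return (V / np.sqrt(w)) @ V.conj().T

def strong_uncorrelating_transform(R, R_tilde):
    """Return (A, k) with A R A^H = I and A R_tilde A^T = diag(k)."""
    R_inv_sqrt = inv_sqrt_hermitian(R)
    C = R_inv_sqrt @ R_tilde @ R_inv_sqrt.T   # coherence matrix
    U, k, Vh = np.linalg.svd(C)                # C = U diag(k) V^H
    # V^H U^* is diagonal and unitary when the k_i are distinct; its
    # square root turns U into the Takagi factor F = U (V^H U^*)^{1/2}.
    phases = np.diag(Vh @ U.conj())
    F = U * np.sqrt(phases)                    # scales column i by sqrt(phases[i])
    return F.conj().T @ R_inv_sqrt, k

# Second-order description of the sources and the mixing matrix of Example 3.5.
R_xx = np.diag([4.0, 1.0]).astype(complex)
R_xx_tilde = np.diag([0.0, 1j])
M = np.array([[-1j, 1.0], [2 - 1j, 1 + 1j]])

R_yy = M @ R_xx @ M.conj().T
R_yy_tilde = M @ R_xx_tilde @ M.T
A_yy, k = strong_uncorrelating_transform(R_yy, R_yy_tilde)

print(np.round(k, 4))                  # circularity coefficients of y
print(np.round(np.abs(A_yy @ M), 4))   # magnitudes reveal the monomial pattern
```

Only the second-order quantities $R_{yy}$ and $\tilde{R}_{yy}$ enter the computation of `A_yy`; the matrix `M` is used afterwards purely to check that `A_yy @ M` has one nonzero entry per row and column.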

Notes

1 Much of the material presented in this chapter has been drawn from Schreier and Scharf (2003a) and Schreier et al. (2005).

2 The circularity coefficients and the strong uncorrelating transform were introduced by Eriksson and Koivunen (2006). The facts that the circularity coefficients are canonical correlations between x and x∗, and that the strong uncorrelating transform takes x into canonical coordinates, were shown by Schreier et al. (2006) and Schreier (2008a). These two papers are © IEEE, and portions of them are reused with permission. More mathematical background on the Takagi factorization can be found in Horn and Johnson (1985).

3 The degree of impropriety and bounds in terms of eigenvalues were developed by Schreier (2008a).

4 Our discussion of independent component analysis barely scratches the surface of this rich topic. A readable introductory paper to ICA is Comon (1994). The proof of complex second-order ICA in Section 3.5 was first presented by Schreier et al. (2009). This paper is © IEEE, and portions are reused with permission.

4 Correlation analysis

Assessing multivariate association between two random vectors x and y is an important problem in many research areas, ranging from the natural sciences (e.g., oceanography and geophysics) to the social sciences (in particular psychometrics and behaviormetrics) and to engineering. While “multivariate association” is often simply visualized as “similarity” between two random vectors, there are many different ways of measuring it. In this chapter, we provide a unifying treatment of three popular correlation analysis techniques: canonical correlation analysis (CCA), multivariate linear regression (MLR), and partial least squares (PLS).¹ Each of these techniques transforms x and y into its respective internal representation ξ and ω. Different correlation coefficients may then be defined as functions of the diagonal cross-correlations $\{k_i\}$ between the internal representations $\xi_i$ and $\omega_i$.

The key differences among CCA, MLR, and PLS are revealed in their invariance properties. CCA is invariant under nonsingular linear transformation of x and y, MLR is invariant under nonsingular linear transformation of y but only unitary transformation of x, and PLS is invariant under unitary transformation of x and y. Correlation coefficients then share the invariance properties of the correlation analysis technique on which they are based.

Analyzing multivariate association of complex data is further complicated by the fact that there are different types of correlation.² Two scalar complex random variables x and y are called rotationally dependent if $x = ky$ for some complex constant k. This term is motivated by the observation that, in the complex plane, sample pairs of x and y rotate in the same direction (counterclockwise or clockwise), by the same angle. They are called reflectionally dependent if $x = \tilde{k}y^*$ for some complex constant $\tilde{k}$. This means that sample pairs of x and y rotate by the same angle, but in opposite directions: one rotates clockwise and the other counterclockwise. Rotational and reflectional correlations measure the degree of rotational and reflectional dependence. The combined effect of rotational and reflectional correlation is assessed by a total correlation.

Thus, there are two fundamental choices that must be made when analyzing multivariate association between two complex random vectors: the desired invariance properties of the analysis (linear/linear, unitary/linear, or unitary/unitary) and the type of correlation (rotational, reflectional, or total). These choices will have to be motivated by the problem under consideration.


The techniques in this chapter are developed for random vectors using ensemble averages. This assumes that the necessary second-order information is available, namely, the correlation matrices of x and y and their cross-correlation matrix are known. However, it is straightforward to apply these techniques to sample data, using sample correlation matrices. Then M independent snapshots of (x, y) would be assembled into matrices $X = [x_1, \ldots, x_M]$ and $Y = [y_1, \ldots, y_M]$, and the correlation matrices would be estimated as $S_{xx} = M^{-1}XX^H$, $S_{xy} = M^{-1}XY^H$, and $S_{yy} = M^{-1}YY^H$.

The structure of this chapter is as follows. In Section 4.1, we look at the foundations for measuring multivariate association between a pair of complex vectors, which will lead to the introduction of the three correlation analysis techniques CCA, MLR, and PLS. In Section 4.2, we discuss their invariance properties. In particular, we show that the diagonal cross-correlations $\{k_i\}$ produced by CCA, MLR, and PLS are maximal invariants under linear/linear, unitary/linear, and unitary/unitary transformation, respectively, of x and y. In Section 4.3, we introduce a few scalar-valued correlation coefficients as different functions of the diagonal cross-correlations $\{k_i\}$, and show how these coefficients can be interpreted. An important feature of CCA, MLR, and PLS is that they all produce diagonal cross-correlations that have maximum spread in the sense of majorization (see Appendix 3 for background on majorization). Therefore, any correlation coefficient that is an increasing and Schur-convex function of $\{k_i\}$ is maximized, for arbitrary rank r. This allows assessment of correlation in a lower-dimensional subspace of dimension r. In Section 4.4, we introduce the correlation spread as a measure that indicates how much of the overall correlation can be compressed into a lower-dimensional subspace. Finally, in Section 4.5, we present several generalized likelihood-ratio tests for the correlation structure of complex Gaussian data, such as sphericity, independence within one data set, and independence between two data sets. All these tests have natural invariance properties, and the generalized likelihood ratio is a function of an appropriate maximal invariant.
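As a small illustration of the sample-based estimates mentioned above, the following sketch forms the three sample correlation matrices from snapshot matrices. The data model (circular Gaussian x and a noisy linear function y, with a hypothetical matrix `G`) is an assumption made only to have something to estimate; `num` plays the role of the number of snapshots M.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, num = 3, 2, 5000     # dimensions of x and y, number of snapshots

def cgauss(rows, cols):
    """Unit-variance circular complex Gaussian samples."""
    return (rng.standard_normal((rows, cols))
            + 1j * rng.standard_normal((rows, cols))) / np.sqrt(2)

# Hypothetical data: snapshots are stored as columns of X and Y.
X = cgauss(n, num)
G = np.sqrt(2) * cgauss(m, n)
Y = G @ X + 0.5 * cgauss(m, num)

S_xx = X @ X.conj().T / num    # n x n sample correlation matrix of x
S_xy = X @ Y.conj().T / num    # n x m sample cross-correlation matrix
S_yy = Y @ Y.conj().T / num    # m x m sample correlation matrix of y
```

For this particular model the ensemble quantities are $R_{xx} = I$ and $R_{xy} = G^H$, so the estimates can be compared against them directly.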

4.1 Foundations for measuring multivariate association between two complex random vectors

The correlation coefficient between two scalar real zero-mean random variables u and v is defined as

$$\rho_{uv} = \frac{E\,uv}{\sqrt{E\,u^2}\sqrt{E\,v^2}} = \frac{R_{uv}}{\sqrt{R_{uu}}\sqrt{R_{vv}}}. \qquad (4.1)$$

The correlation coefficient is a convenient measure for how closely u and v are related. It satisfies $-1 \le \rho_{uv} \le 1$. If $\rho_{uv} = 0$, then u and v are uncorrelated. If $|\rho_{uv}| = 1$, then u is a linear function of v, or vice versa, with probability 1:

$$u = \frac{R_{uv}}{R_{vv}}\,v. \qquad (4.2)$$

Figure 4.1 Scatter plots of 100 sample pairs of u and v with different $\rho_{uv}$, shown in panels (a) and (b).

In general, for $|\rho_{uv}| \le 1$,

$$\hat{u} = \frac{R_{uv}}{R_{vv}}\,v \qquad (4.3)$$

is a linear minimum mean-squared error (LMMSE) estimate of u from v, and the sign of $\rho_{uv}$ is the sign of the slope of the LMMSE estimate. An expression for the MSE is

$$E|\hat{u} - u|^2 = R_{uu} - \frac{R_{uv}^2}{R_{vv}} = R_{uu}(1 - \rho_{uv}^2). \qquad (4.4)$$

Example 4.1. Consider a zero-mean, unit-variance, real random variable u and a zero-mean, unit-variance, real random variable n, uncorrelated with u. Now let v = au + bn for given real a and b. We find

$$\rho_{uv} = \frac{R_{uv}}{\sqrt{R_{uu}}\sqrt{R_{vv}}} = \frac{a}{\sqrt{a^2 + b^2}}.$$

Figure 4.1 depicts 100 sample pairs of u and v, for uniform u and Gaussian n. Plot (a) shows the case a = 0.8, b = 0.1, which results in $\rho_{uv} = 0.9923$. Plot (b) shows a = 0.8, b = 0.4, which results in $\rho_{uv} = 0.8944$. The line is the LMMSE estimate $\hat{v} = au$ (or, equivalently, $\hat{u} = a^{-1}v$).

How may we define a correlation coefficient between a pair of complex random vectors to measure their multivariate association? We shall consider the extension from real to complex quantities and the extension from scalars to vectors separately, and then combine our findings.
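The numbers in Example 4.1 are easy to reproduce. The sketch below compares the closed-form value $a/\sqrt{a^2+b^2}$ with a sample estimate of (4.1); the function name and the sample size are arbitrary choices made for this illustration.

```python
import numpy as np

def corr_coeff(u, v):
    """Sample version of (4.1) for zero-mean real data."""
    return np.mean(u * v) / np.sqrt(np.mean(u ** 2) * np.mean(v ** 2))

rng = np.random.default_rng(2)
num = 100_000
u = rng.uniform(-np.sqrt(3), np.sqrt(3), num)   # uniform, zero mean, unit variance
n = rng.standard_normal(num)                    # Gaussian, zero mean, unit variance

for a, b in [(0.8, 0.1), (0.8, 0.4)]:
    v = a * u + b * n
    rho_theory = a / np.hypot(a, b)             # a / sqrt(a^2 + b^2)
    print(a, b, round(rho_theory, 4), round(corr_coeff(u, v), 4))
```

With only 100 sample pairs, as in Figure 4.1, the sample estimate would scatter more visibly around the theoretical value.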

4.1.1 Rotational, reflectional, and total correlations for complex scalars

Consider a pair of scalar complex zero-mean random variables x and y. As a straightforward extension of the real case, let us define the complex correlation coefficient

$$\rho_{xy} = \frac{E\,xy^*}{\sqrt{E|x|^2}\sqrt{E|y|^2}} = \frac{R_{xy}}{\sqrt{R_{xx}}\sqrt{R_{yy}}}, \qquad (4.5)$$

which satisfies $0 \le |\rho_{xy}| \le 1$. The LMMSE estimate of x from y is

$$\hat{x}(y) = \frac{R_{xy}}{R_{yy}}\,y = \frac{|R_{xy}|}{R_{yy}}\,e^{j\angle R_{xy}}\,y, \qquad (4.6)$$

which achieves the minimum error

$$E|\hat{x}(y) - x|^2 = R_{xx} - \frac{|R_{xy}|^2}{R_{yy}} = R_{xx}(1 - |\rho_{xy}|^2). \qquad (4.7)$$

Figure 4.2 Sample pairs of two complex random variables x and y with $\rho_{xy} = \exp(j\pi/2)$ and $|R_{xy}|/R_{yy} = 1.2$. Plot (a) depicts samples of x and (b) samples of y, in the complex plane. For corresponding samples we use the same symbol.

Hence, if $|\rho_{xy}| = 1$, $\hat{x}(y)$ is a perfect estimate of x from y. Figure 4.2 depicts five sample pairs of two complex random variables x and y with $\rho_{xy} = \exp(j\pi/2)$, in the complex plane. Plot (a) shows samples of x and (b) the corresponding samples of y. We observe that (b) is simply a scaled and rotated version of (a). The amplitude is scaled by the factor $|R_{xy}|/R_{yy}$, preserving the aspect ratio, and the rotation angle is $\angle R_{xy} = \angle\rho_{xy}$.

Now what about complementary correlations? Instead of estimating x as a linear function of y, we may employ the conjugate linear minimum mean-squared error (CLMMSE) estimator

$$\hat{x}(y^*) = \frac{E\,xy}{E|y|^2}\,y^* = \frac{\tilde{R}_{xy}}{R_{yy}}\,y^* = \frac{|\tilde{R}_{xy}|}{R_{yy}}\,e^{j\angle\tilde{R}_{xy}}\,y^*. \qquad (4.8)$$

The corresponding correlation coefficient is

$$\tilde{\rho}_{xy} = \frac{\tilde{R}_{xy}}{\sqrt{R_{xx}}\sqrt{R_{yy}}}, \qquad (4.9)$$

with $0 \le |\tilde{\rho}_{xy}| \le 1$, and the CLMMSE is

$$E|\hat{x}(y^*) - x|^2 = R_{xx} - \frac{|\tilde{R}_{xy}|^2}{R_{yy}} = R_{xx}(1 - |\tilde{\rho}_{xy}|^2). \qquad (4.10)$$

Hence, if $|\tilde{\rho}_{xy}| = 1$, $\hat{x}(y^*)$ is a perfect estimate of x from $y^*$. Figure 4.3 depicts five sample pairs of two complex random variables x and y with $\tilde{\rho}_{xy} = \exp(j\pi/2)$, in the complex plane. Plot (a) shows samples of x and (b) the corresponding samples of y.

Figure 4.3 Sample pairs of two complex random variables x and y with $\tilde{\rho}_{xy} = \exp(j\pi/2)$ and $|\tilde{R}_{xy}|/R_{yy} = 1.2$. Plot (a) depicts samples of x and (b) samples of y, in the complex plane. Samples of y correspond to amplified and reflected samples of x. The reflection axis is the dashed line, which is given by $\angle\tilde{R}_{xy}/2 = \angle\tilde{\rho}_{xy}/2 = \pi/4$.

We observe that (b) is a scaled and reflected version of (a). The amplitude is scaled by the factor $|\tilde{R}_{xy}|/R_{yy}$, preserving the aspect ratio. Since $\angle x = \angle\tilde{R}_{xy} - \angle y$, we have, with probability 1:

$$\angle x - \tfrac{1}{2}\angle\tilde{R}_{xy} = -\left(\angle y - \tfrac{1}{2}\angle\tilde{R}_{xy}\right). \qquad (4.11)$$

Thus, the reflection axis is $\angle\tilde{R}_{xy}/2 = \angle\tilde{\rho}_{xy}/2$, which is the dashed line in Fig. 4.3. Depending on whether rotation or reflection better models the relationship between x and y, $|\rho_{xy}|$ or $|\tilde{\rho}_{xy}|$ will be greater. We note the ease with which the best possible reflection axis is determined as half the angle of the complementary correlation $\tilde{R}_{xy}$ (or half the angle of the correlation coefficient $\tilde{\rho}_{xy}$). This would be significantly more cumbersome in real-valued notation.

Of course, data might exhibit a combination of rotational and reflectional correlation, motivating use of a widely linear minimum mean-squared error (WLMMSE) estimator

$$\hat{x}(y, y^*) = \alpha y + \beta y^*, \qquad (4.12)$$

where α and β are chosen to minimize $E|\hat{x}(y, y^*) - x|^2$. We will be discussing WLMMSE estimation in much detail in Section 5.4. At this point, we content ourselves with stating that the solution is

$$\hat{x}(y, y^*) = \frac{\hat{x}(y) - \hat{x}[\hat{y}^*(y)] + \hat{x}(y^*) - \hat{x}[\hat{y}(y^*)]}{1 - |\tilde{\rho}_{yy}|^2} = \frac{(R_{xy}R_{yy} - \tilde{R}_{xy}\tilde{R}_{yy}^*)\,y + (\tilde{R}_{xy}R_{yy} - R_{xy}\tilde{R}_{yy})\,y^*}{R_{yy}^2 - |\tilde{R}_{yy}|^2}, \qquad (4.13)$$

where $\tilde{R}_{yy} = E\,y^2$ is the complementary variance of y, $|\tilde{\rho}_{yy}|^2 = |\tilde{R}_{yy}|^2/R_{yy}^2$ is the degree of impropriety of y, $\hat{y}^*(y) = [\tilde{R}_{yy}^*/R_{yy}]\,y$ is the CLMMSE estimate of $y^*$ from y, and $\hat{y}(y^*) = [\tilde{R}_{yy}/R_{yy}]\,y^*$ is the CLMMSE estimate of y from $y^*$.
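The coefficients in (4.13) can be checked with a short calculation. The following is a sketch based on the orthogonality principle (the error must be uncorrelated with both y and $y^*$), written in the notation of this section; it is meant as a plausibility check rather than the development given in Section 5.4.

```latex
\begin{aligned}
&E\big[(x - \alpha y - \beta y^*)\,y^*\big] = 0, \qquad
 E\big[(x - \alpha y - \beta y^*)\,y\big] = 0 \\[2pt]
&\Longrightarrow\quad
\begin{bmatrix} \alpha & \beta \end{bmatrix}
\begin{bmatrix} R_{yy} & \tilde{R}_{yy} \\ \tilde{R}_{yy}^* & R_{yy} \end{bmatrix}
= \begin{bmatrix} R_{xy} & \tilde{R}_{xy} \end{bmatrix} \\[2pt]
&\Longrightarrow\quad
\begin{bmatrix} \alpha & \beta \end{bmatrix}
= \frac{1}{R_{yy}^2 - |\tilde{R}_{yy}|^2}
\begin{bmatrix} R_{xy} & \tilde{R}_{xy} \end{bmatrix}
\begin{bmatrix} R_{yy} & -\tilde{R}_{yy} \\ -\tilde{R}_{yy}^* & R_{yy} \end{bmatrix} \\[2pt]
&\Longrightarrow\quad
\alpha = \frac{R_{xy}R_{yy} - \tilde{R}_{xy}\tilde{R}_{yy}^*}{R_{yy}^2 - |\tilde{R}_{yy}|^2},
\qquad
\beta = \frac{\tilde{R}_{xy}R_{yy} - R_{xy}\tilde{R}_{yy}}{R_{yy}^2 - |\tilde{R}_{yy}|^2}.
\end{aligned}
```

Dividing numerator and denominator by $R_{yy}^2$ and using the definitions of $\hat{x}(\cdot)$, $\hat{y}(y^*)$, and $\hat{y}^*(y)$ recovers the first form in (4.13).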

Through the connection

$$E|\hat{x}(y, y^*) - x|^2 = R_{xx}(1 - \bar{\rho}_{xy}^2), \qquad (4.14)$$

we obtain the corresponding squared correlation coefficient

$$\bar{\rho}_{xy}^2 = \frac{|\rho_{xy}|^2 + |\tilde{\rho}_{xy}|^2 - 2\,\mathrm{Re}[\rho_{xy}\tilde{\rho}_{xy}^*\tilde{\rho}_{yy}]}{1 - |\tilde{\rho}_{yy}|^2} = \frac{(|R_{xy}|^2 + |\tilde{R}_{xy}|^2)R_{yy} - 2\,\mathrm{Re}(R_{xy}\tilde{R}_{xy}^*\tilde{R}_{yy})}{R_{xx}(R_{yy}^2 - |\tilde{R}_{yy}|^2)}, \qquad (4.15)$$

with $0 \le \bar{\rho}_{xy}^2 \le 1$. We note that the correlation coefficient $\bar{\rho}_{xy}$, unlike $\rho_{xy} = \rho_{yx}^*$ and $\tilde{\rho}_{xy} = \tilde{\rho}_{yx}$, is not symmetric in x and y: in general, $\bar{\rho}_{xy}^2 \neq \bar{\rho}_{yx}^2$. The correlation coefficient $\bar{\rho}_{xy}$ is bounded in terms of the coefficients $\rho_{xy}$ and $\tilde{\rho}_{xy}$ as

$$\max(|\rho_{xy}|^2, |\tilde{\rho}_{xy}|^2) \le \bar{\rho}_{xy}^2 \le \min(|\rho_{xy}|^2 + |\tilde{\rho}_{xy}|^2, 1). \qquad (4.16)$$

The lower bound holds because a WLMMSE estimator subsumes both the LMMSE and CLMMSE estimators, so we must have WLMMSE ≤ LMMSE and WLMMSE ≤ CLMMSE. However, there is no general ordering of LMMSE and CLMMSE, which we write as LMMSE ≶ CLMMSE. A common scenario in which the lower bound is attained is when y is maximally improper, i.e., $R_{yy} = |\tilde{R}_{yy}| \Leftrightarrow |\tilde{\rho}_{yy}|^2 = 1$, which yields a zero denominator in (4.15). This means that, with probability 1, $y^* = e^{j\alpha}y$ for some constant α, and $R_{xy} = e^{j\alpha}\tilde{R}_{xy}$. In this case, y and $y^*$ carry exactly the same information about x. Therefore, WLMMSE estimation is unnecessary, and can be replaced with either LMMSE or CLMMSE estimation. In the maximally improper case, $\bar{\rho}_{xy}^2 = |\rho_{xy}|^2 = |\tilde{\rho}_{xy}|^2$. Two other examples of attaining the lower bound in (4.16) are either $\tilde{R}_{xy} = 0$ and $\tilde{R}_{yy} = 0$ (i.e., $\tilde{\rho}_{xy} = 0$ and $\tilde{\rho}_{yy} = 0$), which leads to $\bar{\rho}_{xy}^2 = |\rho_{xy}|^2$, or $R_{xy} = 0$ and $\tilde{R}_{yy} = 0$ (i.e., $\rho_{xy} = 0$ and $\tilde{\rho}_{yy} = 0$), which yields $\bar{\rho}_{xy}^2 = |\tilde{\rho}_{xy}|^2$; cf. (4.15).

The upper bound $\bar{\rho}_{xy}^2 = |\rho_{xy}|^2 + |\tilde{\rho}_{xy}|^2$ is attained when the WLMMSE estimator is the sum of the LMMSE and CLMMSE estimators: $\hat{x}(y, y^*) = \hat{x}(y) + \hat{x}(y^*)$. In this case, y and $y^*$ carry completely complementary information about x. This is possible only for uncorrelated y and $y^*$, that is, a proper y. It is easy to see that $\tilde{R}_{yy} = 0 \Leftrightarrow |\tilde{\rho}_{yy}|^2 = 0$ in (4.15) leads to $\bar{\rho}_{xy}^2 = |\rho_{xy}|^2 + |\tilde{\rho}_{xy}|^2$. The following example gives two scenarios in which the lower and upper bounds are attained.

Example 4.2. Attaining the lower bound. Consider a complex random variable $y = e^{j\alpha}(u + n)$, where u is a real random variable and α is a fixed constant. Further, assume that n is a real random variable, uncorrelated with u, and $R_{nn} = R_{uu}$. Let $x = \mathrm{Re}(e^{j\alpha}u) = \cos(\alpha)\,u$. The LMMSE estimator $\hat{x}(y) = \tfrac{1}{2}\cos(\alpha)e^{-j\alpha}y$ and the CLMMSE estimator $\hat{x}(y^*) = \tfrac{1}{2}\cos(\alpha)e^{j\alpha}y^*$ both perform equally well. However, they both extract the same information from y because y is maximally improper. Hence, a WLMMSE estimator has no performance advantage over an LMMSE or CLMMSE estimator: $|\rho_{xy}|^2 = |\tilde{\rho}_{xy}|^2 = \bar{\rho}_{xy}^2 = \tfrac{1}{2}$.

Attaining the upper bound. Consider a proper complex random variable y and let x be its real part. The LMMSE estimator of x from y is $\hat{x}(y) = \tfrac{1}{2}y$, the CLMMSE estimator is $\hat{x}(y^*) = \tfrac{1}{2}y^*$, and the WLMMSE estimator $\hat{x}(y, y^*) = \tfrac{1}{2}y + \tfrac{1}{2}y^*$ produces a perfect estimate of x. Here, rotational and reflectional models are equally appropriate. Each tells only half the story, but they complement each other perfectly. It is easy to see that $|\rho_{xy}|^2 = |\tilde{\rho}_{xy}|^2 = \tfrac{1}{2}$ and $\bar{\rho}_{xy}^2 = |\rho_{xy}|^2 + |\tilde{\rho}_{xy}|^2 = 1$. (This also shows that $\bar{\rho}_{xy}^2 \neq \bar{\rho}_{yx}^2$, since it is obviously impossible to perfectly reconstruct y as a widely linear function of x.)

The following definition sums up the main findings of this section, and will be used to classify correlation coefficients throughout this chapter.

Definition 4.1. A correlation coefficient that measures how well a (1) linear function, (2) conjugate linear function, (3) widely linear function models the relationship between two complex random variables is respectively called a (1) rotational correlation coefficient, (2) reflectional correlation coefficient, (3) total correlation coefficient.
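The second scenario of Example 4.2 (proper y, x = Re y) can be checked with sample estimates of the three squared coefficients from (4.5), (4.9), and (4.15). The sketch below is illustrative only; the function name is hypothetical and the choice of a Gaussian y is an assumption, since the example requires only that y be proper.

```python
import numpy as np

def scalar_correlations(x, y):
    """Sample estimates of |rho|^2, |rho~|^2 and rho-bar^2, cf. (4.5), (4.9), (4.15)."""
    R_xx = np.mean(np.abs(x) ** 2)
    R_yy = np.mean(np.abs(y) ** 2)
    R_xy = np.mean(x * np.conj(y))     # cross-correlation E x y*
    Rt_xy = np.mean(x * y)             # complementary cross-correlation E x y
    Rt_yy = np.mean(y ** 2)            # complementary variance of y
    rot = np.abs(R_xy) ** 2 / (R_xx * R_yy)
    refl = np.abs(Rt_xy) ** 2 / (R_xx * R_yy)
    num = ((np.abs(R_xy) ** 2 + np.abs(Rt_xy) ** 2) * R_yy
           - 2 * np.real(R_xy * np.conj(Rt_xy) * Rt_yy))
    total = num / (R_xx * (R_yy ** 2 - np.abs(Rt_yy) ** 2))
    return rot, refl, total

rng = np.random.default_rng(3)
num_samples = 50_000
# Proper (circular) complex Gaussian y, and x equal to its real part.
y = (rng.standard_normal(num_samples) + 1j * rng.standard_normal(num_samples)) / np.sqrt(2)
x = y.real

rot, refl, total = scalar_correlations(x, y)
print(rot, refl, total)
```

The rotational and reflectional coefficients should each come out near 1/2, while the total coefficient equals 1 up to rounding because x is an exact widely linear function of y. The maximally improper scenario is not covered by this function, since the denominator of (4.15) vanishes there.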

4.1.2 Principle of multivariate correlation analysis

We would now like to define a scalar-valued correlation coefficient that gives an overall measure of the association between two zero-mean random vectors. The following definition sets out the minimum requirements for such a correlation coefficient.

Definition 4.2. A correlation coefficient $\rho_{xy}$ between two random vectors x and y must satisfy the following conditions for all nonzero scalars α and β, provided that x and y are not both zero.

$$0 \le \rho_{xy} \le 1, \qquad (4.17)$$

$$\rho_{xy} = \rho_{x'y} = \rho_{xy'} \quad\text{for } x' = \alpha x,\; y' = \beta y, \qquad (4.18)$$

$$\rho_{xy} = 1 \quad\text{if } y = \beta x, \qquad (4.19)$$

$$\rho_{xy} = 0 \quad\text{if } x \text{ and } y \text{ are uncorrelated.} \qquad (4.20)$$

Note that we do not require the symmetry $\rho_{xy} = \rho_{yx}$.³ If a correlation coefficient is allowed to be negative or complex-valued (as we have seen in the previous section), these conditions apply to its absolute value. However, the correlation coefficients considered hereafter are all real and nonnegative. For simplicity, we consider only rotational correlations and strictly linear transforms in this section. Reflectional and total correlations will be discussed in the following section.

Figure 4.4 The principles of multivariate correlation analysis: (a) rotational correlations, (b) reflectional correlations, and (c) total correlations.

A correlation coefficient that only satisfies (4.17)–(4.20) will probably not be very useful. It is usually required to have further cases that result in a unit correlation coefficient $\rho_{xy} = 1$, such as when y = Mx, where M is any nonsingular matrix, or y = Ux, where U is any unitary matrix. There are further desirable properties. Chief among them are invariance under specified classes of transformations on x and y, and the ability to assess correlation in a lower-dimensional subspace. What exactly this means will become clearer as we move along in our development.

The cross-correlation properties between x and y are described by the cross-correlation matrix $R_{xy} = E\,xy^H$, but this matrix is generally difficult to interpret. In order to illuminate the underlying cross-correlation structure, we shall transform n-dimensional x and m-dimensional y into p-dimensional internal (latent) representations ξ = Ax and ω = By, with p = min(m, n), as shown in Fig. 4.4(a). The way in which the full-rank matrices $A \in C^{p \times n}$ and $B \in C^{p \times m}$ are chosen will determine the type of correlation analysis. In the statistical literature, the latent vectors ξ and ω are usually called score vectors, and the matrices A and B are called the matrices of loadings.

Our goal is to define different correlation coefficients as different functions of the correlations $k_i = E\,\xi_i\omega_i^*$, i = 1, . . ., p, which are the diagonal elements of the cross-correlation matrix $K = E\,\xi\omega^H$ in the internal coordinate system of (ξ, ω). We would like as much correlation as possible concentrated in the first r coefficients $\{k_1, k_2, \ldots, k_r\}$, for any r ≤ p, because this will allow us to assess correlation in a lower-dimensional subspace of dimension r. Hence, our aim is to choose A and B such that all partial sums over the absolute values of the diagonal cross-correlations $k_i$ are maximized:

$$\max_{A,B}\;\sum_{i=1}^{r}|k_i|, \qquad r = 1, \ldots, p. \qquad (4.21)$$

In order to make this a well-defined maximization problem, we need to impose some constraints on A and B. The following three choices are most compelling.

• Require that the internal representations ξ and ω each have identity correlation matrix (we avoid using the term “white” because ξ and ω may have non-identity complementary correlation matrices): $R_{\xi\xi} = AR_{xx}A^H = I$ and $R_{\omega\omega} = BR_{yy}B^H = I$. This choice leads to canonical correlation analysis (CCA). The corresponding diagonal cross-correlations $k_i$ are called the canonical correlations, and the latent vectors ξ and ω are given in canonical coordinates.

• Require that A have unitary rows (which we will simply call row-unitary) and ω have identity correlation matrix: $AA^H = I$ and $R_{\omega\omega} = BR_{yy}B^H = I$. This choice leads to multivariate linear regression (MLR), also known as half-canonical correlation analysis. The corresponding diagonal cross-correlations $k_i$ are called the half-canonical correlations, and the latent vectors ξ and ω are given in half-canonical coordinates.

• Require that A and B be row-unitary: $AA^H = I$ and $BB^H = I$. This choice leads to partial least-squares (PLS) analysis. The corresponding diagonal cross-correlations $k_i$ are called the PLS correlations, and the latent vectors ξ and ω are given in PLS coordinates.

Sometimes, when there is a risk of confusion, we will use the subscript C, M, or P to emphasize that quantities were derived using CCA, MLR, or PLS.

Example 4.3. If x and y are scalars, then the CCA constraints are $|A|^2 = R_{xx}^{-1}$ and $|B|^2 = R_{yy}^{-1}$. Thus, the latent variables are $\xi = R_{xx}^{-1/2}e^{j\phi_1}x$ and $\omega = R_{yy}^{-1/2}e^{j\phi_2}y$ for arbitrary $\phi_1$ and $\phi_2$. We then obtain

$$|k| = |E\,\xi\omega^*| = \frac{|R_{xy}|}{\sqrt{R_{xx}}\sqrt{R_{yy}}} = |\rho_{xy}|,$$

with $\rho_{xy}$ defined in (4.5).

We will find in Section 4.1.4 that the solution to the maximization problem (4.21) for CCA, MLR, or PLS results in a diagonal cross-correlation matrix

$$K = E\,\xi\omega^H = \mathrm{Diag}(k_1, \ldots, k_p), \qquad (4.22)$$

with $k_1 \ge k_2 \ge \cdots \ge k_p \ge 0$. Of course, the $k_i$ depend on which of the principles CCA, MLR, and PLS is employed, so they are canonical correlations, half-canonical correlations, or PLS correlations. Furthermore, we will show that CCA, MLR, and PLS each produce a set $\{k_i\}$ that has maximum spread in the sense of majorization (see Appendix 3). Therefore, any correlation coefficient that is an increasing, Schur-convex function of $\{k_i\}$ is maximized, for arbitrary rank r.
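To make the three sets of constraints concrete, the sketch below computes rotational diagonal cross-correlations numerically. It anticipates the standard solution (derived in Section 4.1.4, which is not reproduced here) and should be read as an assumption of this sketch: the $k_i$ are taken as the singular values of $R_{xx}^{-1/2}R_{xy}R_{yy}^{-1/2}$ for CCA, of $R_{xy}R_{yy}^{-1/2}$ for MLR, and of $R_{xy}$ for PLS, with Hermitian square roots. The function names and the test matrices are hypothetical, and the loading matrices A and B themselves are not computed.

```python
import numpy as np

def inv_sqrt_hermitian(R):
    """Inverse Hermitian square root of a positive-definite matrix."""
    w, V = np.linalg.eigh(R)
    return (V / np.sqrt(w)) @ V.conj().T

def diagonal_cross_correlations(R_xx, R_xy, R_yy, method="CCA"):
    """Singular values k_1 >= ... >= k_p for rotational CCA, MLR, or PLS."""
    n, m = R_xy.shape
    # CCA whitens both sides, MLR whitens only y, PLS whitens neither.
    W_x = inv_sqrt_hermitian(R_xx) if method == "CCA" else np.eye(n)
    W_y = inv_sqrt_hermitian(R_yy) if method in ("CCA", "MLR") else np.eye(m)
    return np.linalg.svd(W_x @ R_xy @ W_y, compute_uv=False)

# Hypothetical test case: y = G x with nonsingular G, so CCA should report
# perfect correlation in every coordinate.
rng = np.random.default_rng(4)
B0 = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))
R_xx = B0 @ B0.conj().T + np.eye(3)
G = np.eye(3) + 0.1 * (rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3)))
R_xy = R_xx @ G.conj().T          # E x y^H for y = G x
R_yy = G @ R_xx @ G.conj().T      # E y y^H for y = G x

for method in ("CCA", "MLR", "PLS"):
    print(method, np.round(diagonal_cross_correlations(R_xx, R_xy, R_yy, method), 4))
```

In this test case only the canonical correlations are all equal to one; the half-canonical and PLS correlations still depend on the scaling of x and y, which reflects the different constraints placed on A and B.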


The key difference among CCA, MLR, and PLS lies in their invariance properties. We will see that CCA is invariant under nonsingular linear transformation of both x and y, MLR is invariant under nonsingular linear transformation of y but only unitary transformation of x, and PLS is invariant under unitary transformation of both x and y. Therefore, CCA and PLS provide a symmetric assessment of correlation since the roles of x and y are interchangeable. MLR, on the other hand, distinguishes between the message (or predictor/explanatory variables) x and the measurement (or criterion/response variables) y. The correlation analysis technique must be chosen to match the invariance properties of the problem at hand.

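4.1.3 Rotational, reflectional, and total correlations for complex vectors

It is easy to see how to apply these principles to reflectional and total correlations, as shown in Figs. 4.4(b) and (c). For reflectional correlations, the internal representation is $\xi = \tilde{A}x$ and $\omega^* = \tilde{B}y^*$, whose complementary cross-correlation matrix is $\tilde{K} = E\,\xi\omega^T$, and $\tilde{k}_i = E\,\xi_i\omega_i$. The maximization problem is then to maximize all partial sums over the absolute value of the complementary cross-correlations

$$\max_{\tilde{A},\tilde{B}}\;\sum_{i=1}^{r}|\tilde{k}_i|, \qquad r = 1, \ldots, p, \qquad (4.23)$$

with the following constraints on $\tilde{A}$ and $\tilde{B}$.

• For CCA, $\tilde{A}R_{xx}\tilde{A}^H = I$ and $\tilde{B}R_{yy}^*\tilde{B}^H = I$.
• For MLR, $\tilde{A}\tilde{A}^H = I$ and $\tilde{B}R_{yy}^*\tilde{B}^H = I$.
• For PLS, $\tilde{A}\tilde{A}^H = I$ and $\tilde{B}\tilde{B}^H = I$.

For total correlations, the internal representation is computed as a widely linear function: $\underline{\xi} = \underline{A}\,\underline{x}$ (i.e., $\xi = A_1x + A_2x^*$) and $\underline{\omega} = \underline{B}\,\underline{y}$ (i.e., $\omega = B_1y + B_2y^*$). Here, our goal is to maximize the diagonal cross-correlations between the vectors of real and imaginary parts of ξ and ω. That is, for i = 1, . . ., p, we let

$$\bar{k}_{2i-1} = 2E(\mathrm{Re}\,\xi_i\,\mathrm{Re}\,\omega_i) = \mathrm{Re}(E\,\xi_i\omega_i^* + E\,\xi_i\omega_i), \qquad (4.24)$$

$$\bar{k}_{2i} = 2E(\mathrm{Im}\,\xi_i\,\mathrm{Im}\,\omega_i) = \mathrm{Re}(E\,\xi_i\omega_i^* - E\,\xi_i\omega_i), \qquad (4.25)$$

and maximize

$$\max_{\underline{A},\underline{B}}\;\sum_{i=1}^{r}|\bar{k}_i|, \qquad r = 1, \ldots, 2p, \qquad (4.26)$$

with the following constraints placed on $\underline{A}$ and $\underline{B}$.

• For CCA, $\underline{A}\,\underline{R}_{xx}\underline{A}^H = I$ and $\underline{B}\,\underline{R}_{yy}\underline{B}^H = I$. Hence, ξ and ω are each white and proper.
• For MLR, $\underline{A}\,\underline{A}^H = I$ and $\underline{B}\,\underline{R}_{yy}\underline{B}^H = I$. While ω is white and proper, ξ is generally improper.
• For PLS, $\underline{A}\,\underline{A}^H = I$ and $\underline{B}\,\underline{B}^H = I$. Both ξ and ω are generally improper.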


Hence, we have a total of nine possible combinations between the three correlation analysis techniques (CCA, MLR, PLS) and the three different correlation types (rotational, reflectional, total). Each of these nine cases leads to different latent vectors $(\boldsymbol{\xi}, \boldsymbol{\omega})$, and different diagonal cross-correlations $k_i$, $\tilde{k}_i$, or $\bar{k}_i$. We thus speak of rotational canonical correlations, reflectional half-canonical correlations, total canonical correlations, and so on.

4.1.4 Transformations into latent variables

We will now derive the transformations that solve the maximization problems for rotational, reflectional, and total correlations using CCA, MLR, or PLS. In doing so, we determine the internal (latent) coordinate system for $(\boldsymbol{\xi}, \boldsymbol{\omega})$. The approach is the same in all cases, and is based on majorization theory. The background on majorization necessary to understand this material can be found in Appendix 3.

Result 4.1. The solutions to the maximization problems for rotational correlations (4.21), reflectional correlations (4.23), and total correlations (4.26), using CCA, MLR, or PLS, all yield an internal (latent) coordinate system with mutually uncorrelated components $\xi_i$ and $\omega_j$ for all $i \neq j$.

Consider first the maximization problem (4.21) subject to the CCA constraints $\mathbf{A}\mathbf{R}_{xx}\mathbf{A}^H = \mathbf{I}$ and $\mathbf{B}\mathbf{R}_{yy}\mathbf{B}^H = \mathbf{I}$. These two constraints determine $\mathbf{A}$ and $\mathbf{B}$ up to multiplication from the left by $\mathbf{F}^H$ and $\mathbf{G}^H$, respectively, where $\mathbf{F} \in \mathbb{C}^{n\times p}$ and $\mathbf{G} \in \mathbb{C}^{m\times p}$ are both column-unitary:

$$\mathbf{A} = \mathbf{F}^H\mathbf{R}_{xx}^{-1/2}, \tag{4.27}$$

$$\mathbf{B} = \mathbf{G}^H\mathbf{R}_{yy}^{-1/2}. \tag{4.28}$$

The cross-correlation matrix between $\boldsymbol{\xi} = \mathbf{A}\mathbf{x}$ and $\boldsymbol{\omega} = \mathbf{B}\mathbf{y}$ is therefore

$$\mathbf{K} = \mathbf{F}^H\mathbf{R}_{xx}^{-1/2}\mathbf{R}_{xy}\mathbf{R}_{yy}^{-H/2}\mathbf{G}, \tag{4.29}$$

and the singular values of $\mathbf{K}$ are invariant to the choice of $\mathbf{F}$ and $\mathbf{G}$. Maximizing all partial sums (4.21) requires maximum spread (in the sense of majorization) among the diagonal elements of $\mathbf{K}$. As discussed in Appendix 3 in Result A3.6, the absolute values of the diagonal elements of an $n \times m$ matrix are weakly majorized by its singular values:

$$|\mathrm{diag}(\mathbf{K})| \prec_w \mathrm{sv}(\mathbf{K}). \tag{4.30}$$

Maximum spread is therefore achieved by making $\mathbf{K}$ diagonal, which means that $\mathbf{F}$ and $\mathbf{G}$ are determined by the singular value decomposition (SVD) of

$$\mathbf{C} = \mathbf{R}_{xx}^{-1/2}\mathbf{R}_{xy}\mathbf{R}_{yy}^{-H/2} = \mathbf{F}\mathbf{K}\mathbf{G}^H. \tag{4.31}$$

We call the matrix $\mathbf{C}$ the rotational coherence matrix. The diagonal elements of the diagonal matrix $\mathbf{K} = \mathrm{Diag}(k_1, k_2, \ldots, k_p)$ are nonnegative, and arranged in decreasing order $k_1 \ge k_2 \ge \cdots \ge k_p \ge 0$.
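The construction in (4.27)–(4.31), together with its MLR and PLS counterparts, can be sketched numerically. The following is a minimal illustration rather than a reference implementation: the function names `inv_sqrt` and `rotational_correlations` and the synthetic test data are assumptions made here, and the Hermitian square root is used for $\mathbf{R}^{1/2}$, so that $\mathbf{R}^{-H/2} = \mathbf{R}^{-1/2}$.

```python
import numpy as np

def inv_sqrt(R):
    """Hermitian inverse square root R^{-1/2} of a positive definite matrix."""
    w, V = np.linalg.eigh(R)
    return (V / np.sqrt(w)) @ V.conj().T

def rotational_correlations(Rxx, Rxy, Ryy, technique="CCA"):
    """Return (k, A, B) for rotational CCA, MLR, or PLS.

    CCA whitens both x and y, MLR whitens only y, PLS whitens neither.
    """
    n, m = Rxy.shape
    Wx = inv_sqrt(Rxx) if technique == "CCA" else np.eye(n)
    Wy = np.eye(m) if technique == "PLS" else inv_sqrt(Ryy)
    C = Wx @ Rxy @ Wy.conj().T                  # matrix whose SVD is C = F K G^H
    F, k, Gh = np.linalg.svd(C, full_matrices=False)
    A = F.conj().T @ Wx                         # A = F^H Wx
    B = Gh @ Wy                                 # B = G^H Wy
    return k, A, B

# Synthetic (hypothetical) correlation matrices built from random complex data.
rng = np.random.default_rng(0)
n, m, T = 3, 4, 500
mix = rng.standard_normal((n + m, n + m)) + 1j * rng.standard_normal((n + m, n + m))
Z = mix @ (rng.standard_normal((n + m, T)) + 1j * rng.standard_normal((n + m, T)))
R = Z @ Z.conj().T / T
Rxx, Rxy, Ryy = R[:n, :n], R[:n, n:], R[n:, n:]

k_cca, A, B = rotational_correlations(Rxx, Rxy, Ryy, "CCA")
K = A @ Rxy @ B.conj().T                        # cross-correlation of xi = A x and omega = B y
k_pls, A_pls, B_pls = rotational_correlations(Rxx, Rxy, Ryy, "PLS")
```

With these transformations, `K` comes out diagonal with the singular values `k_cca` on its diagonal, which is the maximum-spread configuration that (4.30) calls for.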


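Table 4.1 SVDs and optimum transformations for various correlation types and correlation analysis techniques. This table has been adapted from Schreier (2008c) © IEEE, and is used with permission.

| Correlation type | SVD | CCA | MLR | PLS |
|---|---|---|---|---|
| Rotational $(\mathbf{x}, \mathbf{y})$ | $\mathbf{C} = \mathbf{F}\mathbf{K}\mathbf{G}^H$ | $\mathbf{C} = \mathbf{R}_{xx}^{-1/2}\mathbf{R}_{xy}\mathbf{R}_{yy}^{-H/2}$; $\mathbf{A} = \mathbf{F}^H\mathbf{R}_{xx}^{-1/2}$; $\mathbf{B} = \mathbf{G}^H\mathbf{R}_{yy}^{-1/2}$ | $\mathbf{C} = \mathbf{R}_{xy}\mathbf{R}_{yy}^{-H/2}$; $\mathbf{A} = \mathbf{F}^H$; $\mathbf{B} = \mathbf{G}^H\mathbf{R}_{yy}^{-1/2}$ | $\mathbf{C} = \mathbf{R}_{xy}$; $\mathbf{A} = \mathbf{F}^H$; $\mathbf{B} = \mathbf{G}^H$ |
| Reflectional $(\mathbf{x}, \mathbf{y}^*)$ | $\tilde{\mathbf{C}} = \tilde{\mathbf{F}}\tilde{\mathbf{K}}\tilde{\mathbf{G}}^T$ | $\tilde{\mathbf{C}} = \mathbf{R}_{xx}^{-1/2}\tilde{\mathbf{R}}_{xy}\mathbf{R}_{yy}^{-T/2}$; $\tilde{\mathbf{A}} = \tilde{\mathbf{F}}^H\mathbf{R}_{xx}^{-1/2}$; $\tilde{\mathbf{B}} = \tilde{\mathbf{G}}^T\mathbf{R}_{yy}^{-*/2}$ | $\tilde{\mathbf{C}} = \tilde{\mathbf{R}}_{xy}\mathbf{R}_{yy}^{-T/2}$; $\tilde{\mathbf{A}} = \tilde{\mathbf{F}}^H$; $\tilde{\mathbf{B}} = \tilde{\mathbf{G}}^T\mathbf{R}_{yy}^{-*/2}$ | $\tilde{\mathbf{C}} = \tilde{\mathbf{R}}_{xy}$; $\tilde{\mathbf{A}} = \tilde{\mathbf{F}}^H$; $\tilde{\mathbf{B}} = \tilde{\mathbf{G}}^T$ |
| Total $(\underline{\mathbf{x}}, \underline{\mathbf{y}})$ | $\underline{\mathbf{C}} = \underline{\mathbf{F}}\,\underline{\mathbf{K}}\,\underline{\mathbf{G}}^H$ | $\underline{\mathbf{C}} = \underline{\mathbf{R}}_{xx}^{-1/2}\underline{\mathbf{R}}_{xy}\underline{\mathbf{R}}_{yy}^{-H/2}$; $\underline{\mathbf{A}} = \underline{\mathbf{F}}^H\underline{\mathbf{R}}_{xx}^{-1/2}$; $\underline{\mathbf{B}} = \underline{\mathbf{G}}^H\underline{\mathbf{R}}_{yy}^{-1/2}$ | $\underline{\mathbf{C}} = \underline{\mathbf{R}}_{xy}\underline{\mathbf{R}}_{yy}^{-H/2}$; $\underline{\mathbf{A}} = \underline{\mathbf{F}}^H$; $\underline{\mathbf{B}} = \underline{\mathbf{G}}^H\underline{\mathbf{R}}_{yy}^{-1/2}$ | $\underline{\mathbf{C}} = \underline{\mathbf{R}}_{xy}$; $\underline{\mathbf{A}} = \underline{\mathbf{F}}^H$; $\underline{\mathbf{B}} = \underline{\mathbf{G}}^H$ |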

It is straightforward to extend this result to the remaining eight combinations of correlation analysis technique (CCA, MLR, or PLS) and correlation type (rotational, reflectional, or total). This is shown in Table 4.1. In each of these cases, the solution involves the SVD of a rotational matrix $\mathbf{C}$, a reflectional matrix $\tilde{\mathbf{C}}$, or a total matrix $\underline{\mathbf{C}}$, whose definition depends on the correlation analysis technique.

For reflectional correlations, the expression $\tilde{\mathbf{C}} = \tilde{\mathbf{F}}\tilde{\mathbf{K}}\tilde{\mathbf{G}}^T$ in Table 4.1 is the usual SVD but with column-unitary $\tilde{\mathbf{G}}^*$. Using the regular transpose rather than the Hermitian transpose matches the structure of $\tilde{\mathbf{C}}$.

For total correlations, the augmented SVD of the augmented matrix $\underline{\mathbf{C}} = \underline{\mathbf{F}}\,\underline{\mathbf{K}}\,\underline{\mathbf{G}}^H$ is obtained completely analogously to the augmented EVD of an augmented covariance matrix, which was given in Result 3.1. The singular values of $\underline{\mathbf{C}}$ are $\bar{k}_1 \ge \bar{k}_2 \ge \cdots \ge \bar{k}_{2p} \ge 0$ and the matrix

$$\underline{\mathbf{K}} = \begin{bmatrix} \mathbf{K}_1 & \mathbf{K}_2 \\ \mathbf{K}_2 & \mathbf{K}_1 \end{bmatrix} \tag{4.32}$$

consists of a diagonal block $\mathbf{K}_1$ with diagonal elements $\tfrac{1}{2}(\bar{k}_{2i-1} + \bar{k}_{2i})$ and a diagonal block $\mathbf{K}_2$ with diagonal elements $\tfrac{1}{2}(\bar{k}_{2i-1} - \bar{k}_{2i})$. Therefore, the internal description $(\boldsymbol{\xi}, \boldsymbol{\omega})$ is mutually uncorrelated,

$$E\,\xi_i\omega_j^* = 0 \quad\text{and}\quad E\,\xi_i\omega_j = 0 \quad\text{for } i \neq j. \tag{4.33}$$

However, besides the real Hermitian cross-correlation

$$E\,\xi_i\omega_i^* = \tfrac{1}{2}(\bar{k}_{2i-1} + \bar{k}_{2i}), \tag{4.34}$$

there is also a generally nonzero real complementary cross-correlation

$$E\,\xi_i\omega_i = \tfrac{1}{2}(\bar{k}_{2i-1} - \bar{k}_{2i}), \tag{4.35}$$


unless all singular values of $\underline{\mathbf{C}}$ have even multiplicity. Thus, unlike in the rotational and reflectional case, the solution to the maximization problem (4.26) does not lead to a diagonal matrix, but a matrix $\underline{\mathbf{K}} = E\,\underline{\boldsymbol{\xi}}\,\underline{\boldsymbol{\omega}}^H$ with diagonal blocks instead. This means that generally the internal description $(\boldsymbol{\xi}, \boldsymbol{\omega})$ cannot be made cross-proper because this would require a transformation jointly operating on $\mathbf{x}$ and $\mathbf{y}$, yet all we have at our disposal are widely linear transformations separately operating on $\mathbf{x}$ and $\mathbf{y}$.

4.2 Invariance properties

In this section, we take a closer look at the properties of CCA, MLR, and PLS. Our focus will be on the invariance properties that characterize these correlation analysis techniques.

4.2.1 Canonical correlations

Canonical correlation analysis (CCA), which was introduced by Hotelling (1936), is an extremely popular classical tool for assessing multivariate association. Canonical correlations have a number of important properties. The following is an immediate consequence of the fact that the canonical vectors have identity correlation matrix, i.e., $\mathbf{R}_{\xi\xi} = \mathbf{I}$ and $\mathbf{R}_{\omega\omega} = \mathbf{I}$ in the rotational and reflectional case, and $\underline{\mathbf{R}}_{\xi\xi} = \mathbf{I}$ and $\underline{\mathbf{R}}_{\omega\omega} = \mathbf{I}$ in the total case.

Result 4.2. Canonical correlations (rotational, reflectional, and total) satisfy $0 \le k_i \le 1$.

A key property of canonical correlations is their invariance under nonsingular linear transformation, which can be more precisely stated as follows.

Result 4.3. Rotational and reflectional canonical correlations are invariant under nonsingular linear transformation of $\mathbf{x}$ and $\mathbf{y}$, i.e., $(\mathbf{x}, \mathbf{y})$ and $(\mathbf{N}\mathbf{x}, \mathbf{M}\mathbf{y})$ have the same rotational and reflectional canonical correlations for all nonsingular $\mathbf{N} \in \mathbb{C}^{n\times n}$ and $\mathbf{M} \in \mathbb{C}^{m\times m}$. Total canonical correlations are invariant under nonsingular widely linear transformation of $\mathbf{x}$ and $\mathbf{y}$, i.e., $(\underline{\mathbf{x}}, \underline{\mathbf{y}})$ and $(\underline{\mathbf{N}}\,\underline{\mathbf{x}}, \underline{\mathbf{M}}\,\underline{\mathbf{y}})$ have the same total canonical correlations for all nonsingular $\underline{\mathbf{N}} \in \mathcal{W}^{n\times n}$ and $\underline{\mathbf{M}} \in \mathcal{W}^{m\times m}$. Moreover, the canonical correlations of $(\mathbf{x}, \mathbf{y})$ and $(\mathbf{y}, \mathbf{x})$ are identical.

We will show this for rotational correlations. The rotational canonical correlations $k_i$ are the singular values of $\mathbf{C}$, or, equivalently, the nonnegative roots of the eigenvalues of $\mathbf{C}\mathbf{C}^H$. Keeping in mind that the nonzero eigenvalues of $\mathbf{X}\mathbf{Y}$ are the nonzero eigenvalues of $\mathbf{Y}\mathbf{X}$, we obtain

$$\mathrm{ev}(\mathbf{C}\mathbf{C}^H) = \mathrm{ev}(\mathbf{R}_{xx}^{-1/2}\mathbf{R}_{xy}\mathbf{R}_{yy}^{-1}\mathbf{R}_{xy}^H\mathbf{R}_{xx}^{-H/2}) = \mathrm{ev}(\mathbf{R}_{xx}^{-1}\mathbf{R}_{xy}\mathbf{R}_{yy}^{-1}\mathbf{R}_{xy}^H). \tag{4.36}$$


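Now consider the coherence matrix $\mathbf{C}'$ between $\mathbf{x}' = \mathbf{N}\mathbf{x}$ and $\mathbf{y}' = \mathbf{M}\mathbf{y}$. We find

$$\begin{aligned}
\mathrm{ev}(\mathbf{C}'\mathbf{C}'^H) &= \mathrm{ev}(\mathbf{R}_{x'x'}^{-1}\mathbf{R}_{x'y'}\mathbf{R}_{y'y'}^{-1}\mathbf{R}_{x'y'}^H) \\
&= \mathrm{ev}(\mathbf{N}^{-H}\mathbf{R}_{xx}^{-1}\mathbf{N}^{-1}\,\mathbf{N}\mathbf{R}_{xy}\mathbf{M}^H\,\mathbf{M}^{-H}\mathbf{R}_{yy}^{-1}\mathbf{M}^{-1}\,\mathbf{M}\mathbf{R}_{xy}^H\mathbf{N}^H) \\
&= \mathrm{ev}(\mathbf{N}^{-H}\mathbf{R}_{xx}^{-1}\mathbf{R}_{xy}\mathbf{R}_{yy}^{-1}\mathbf{R}_{xy}^H\mathbf{N}^H) \\
&= \mathrm{ev}(\mathbf{C}\mathbf{C}^H).
\end{aligned} \tag{4.37}$$

Finally, the canonical correlations of $(\mathbf{x}, \mathbf{y})$ and $(\mathbf{y}, \mathbf{x})$ are identical because of

$$\mathrm{ev}(\mathbf{R}_{xx}^{-1}\mathbf{R}_{xy}\mathbf{R}_{yy}^{-1}\mathbf{R}_{xy}^H) = \mathrm{ev}(\mathbf{R}_{yy}^{-1}\mathbf{R}_{xy}^H\mathbf{R}_{xx}^{-1}\mathbf{R}_{xy}). \tag{4.38}$$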
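The invariance argument in (4.36)–(4.38) can be checked numerically. The sketch below is an illustration under assumptions made here: `canonical_correlations` and `crandn` are hypothetical helper names, the correlation matrices are synthetic, and $\mathbf{N}$ and $\mathbf{M}$ are random complex matrices, which are nonsingular with probability one.

```python
import numpy as np

def canonical_correlations(Rxx, Rxy, Ryy):
    """Rotational canonical correlations as roots of ev(Rxx^{-1} Rxy Ryy^{-1} Rxy^H)."""
    prod = np.linalg.solve(Rxx, Rxy) @ np.linalg.solve(Ryy, Rxy.conj().T)
    ev = np.sort(np.linalg.eigvals(prod).real)[::-1]   # eigenvalues are real and nonnegative
    p = min(Rxy.shape)
    return np.sqrt(np.clip(ev[:p], 0.0, None))

rng = np.random.default_rng(1)

def crandn(*shape):
    """Complex matrix with i.i.d. standard normal real and imaginary parts."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

n, m, T = 3, 4, 500
Z = crandn(n + m, n + m) @ crandn(n + m, T)
R = Z @ Z.conj().T / T
Rxx, Rxy, Ryy = R[:n, :n], R[:n, n:], R[n:, n:]

N, M = crandn(n, n), crandn(m, m)          # nonsingular with probability one

k = canonical_correlations(Rxx, Rxy, Ryy)
# (x, y) -> (N x, M y): Rxx -> N Rxx N^H, Rxy -> N Rxy M^H, Ryy -> M Ryy M^H
k_transformed = canonical_correlations(N @ Rxx @ N.conj().T,
                                       N @ Rxy @ M.conj().T,
                                       M @ Ryy @ M.conj().T)
# Swapping the roles of x and y, as in (4.38)
k_swapped = canonical_correlations(Ryy, Rxy.conj().T, Rxx)
```

Up to rounding error, `k`, `k_transformed`, and `k_swapped` agree, as Result 4.3 states for the rotational case.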

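A similar proof works for reflectional and total correlations. Canonical correlations are actually not just invariant under nonsingular linear transformation, they are maximal invariant. The following is the formal definition of a maximal invariant function (see Eaton (1983)).

Definition 4.3. Let $G$ be a group acting on a set $P$. A function $f$ defined on $P$ is invariant if $f(\mathbf{R}) = f(g(\mathbf{R}))$ for all $\mathbf{R} \in P$ and $g \in G$. A function $f$ is maximal invariant if $f$ is invariant and $f(\mathbf{R}) = f(\mathbf{R}')$ implies that $\mathbf{R} = g(\mathbf{R}')$ for some $g \in G$.

Let us first discuss how this definition applies to rotational canonical correlations. Here, the set $P$ is the set of all positive definite composite correlation matrices

$$\mathsf{R}_{xy} = \begin{bmatrix} \mathbf{R}_{xx} & \mathbf{R}_{xy} \\ \mathbf{R}_{xy}^H & \mathbf{R}_{yy} \end{bmatrix}. \tag{4.39}$$

The group $G$ consists of all nonsingular linear transformations applied to $\mathbf{x}$ and to $\mathbf{y}$. We write $g = (\mathbf{N}, \mathbf{M}) \in G$ to describe the nonsingular linear transformation

$$\mathbf{x} \longrightarrow \mathbf{N}\mathbf{x}, \tag{4.40}$$

$$\mathbf{y} \longrightarrow \mathbf{M}\mathbf{y}, \tag{4.41}$$

which acts on $\mathsf{R}_{xy}$ as

$$g(\mathsf{R}_{xy}) = \begin{bmatrix} \mathbf{N}\mathbf{R}_{xx}\mathbf{N}^H & \mathbf{N}\mathbf{R}_{xy}\mathbf{M}^H \\ \mathbf{M}\mathbf{R}_{xy}^H\mathbf{N}^H & \mathbf{M}\mathbf{R}_{yy}\mathbf{M}^H \end{bmatrix}. \tag{4.42}$$

The function $f$ produces the set of rotational canonical correlations $\{k_i\}$ from $\mathsf{R}_{xy}$. We have already proved above that $f$ is invariant because $f(\mathsf{R}_{xy}) = f(g(\mathsf{R}_{xy}))$. What is left to show is that if two composite correlation matrices $\mathsf{R}_{xy}$ and $\mathsf{R}'_{xy}$ have the same rotational canonical correlation matrix $\mathbf{K}$, then there exists $g \in G$ that relates the two matrices as $\mathsf{R}_{xy} = g(\mathsf{R}'_{xy})$. The fact that $\mathsf{R}_{xy}$ and $\mathsf{R}'_{xy}$ have the same $\mathbf{K}$ means that $g = (\mathbf{A}, \mathbf{B}) \in G$, with $\mathbf{A}$ and $\mathbf{B}$ determined by (4.27) and (4.28), and $g' = (\mathbf{A}', \mathbf{B}') \in G$ act on $\mathsf{R}_{xy}$ and $\mathsf{R}'_{xy}$ as

$$g(\mathsf{R}_{xy}) = \begin{bmatrix} \mathbf{I} & \mathbf{K} \\ \mathbf{K} & \mathbf{I} \end{bmatrix} = g'(\mathsf{R}'_{xy}). \tag{4.43}$$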


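Therefore, $\mathsf{R}_{xy} = (g^{-1} \circ g')(\mathsf{R}'_{xy})$, which makes $f$ maximal invariant. In other words, the rotational canonical correlations are a complete, or maximal, set of invariants for $\mathsf{R}_{xy}$ under nonsingular linear transformation of $\mathbf{x}$ and $\mathbf{y}$. This maximal invariance property accounts for the name “canonical correlations.”

For reflectional canonical correlations, the set to be considered contains all positive definite composite correlation matrices

$$\tilde{\mathsf{R}}_{xy} = \begin{bmatrix} \mathbf{R}_{xx} & \tilde{\mathbf{R}}_{xy} \\ \tilde{\mathbf{R}}_{xy}^H & \mathbf{R}_{yy}^* \end{bmatrix}, \tag{4.44}$$

on which $g$ acts as

$$g(\tilde{\mathsf{R}}_{xy}) = \begin{bmatrix} \mathbf{N}\mathbf{R}_{xx}\mathbf{N}^H & \mathbf{N}\tilde{\mathbf{R}}_{xy}\mathbf{M}^T \\ \mathbf{M}^*\tilde{\mathbf{R}}_{xy}^H\mathbf{N}^H & \mathbf{M}^*\mathbf{R}_{yy}^*\mathbf{M}^T \end{bmatrix}. \tag{4.45}$$

It is obvious that the reflectional canonical correlations are a maximal set of invariants for $\tilde{\mathsf{R}}_{xy}$ under nonsingular linear transformation of $\mathbf{x}$ and $\mathbf{y}$.

For total canonical correlations, we deal with the set of all positive definite composite augmented correlation matrices

$$\underline{\mathsf{R}}_{xy} = \begin{bmatrix} \underline{\mathbf{R}}_{xx} & \underline{\mathbf{R}}_{xy} \\ \underline{\mathbf{R}}_{xy}^H & \underline{\mathbf{R}}_{yy} \end{bmatrix}. \tag{4.46}$$

Now let the group $G$ consist of all nonsingular widely linear transformations applied to $\mathbf{x}$ and $\mathbf{y}$. That is, $g = (\underline{\mathbf{N}}, \underline{\mathbf{M}}) \in G$ describes

$$\underline{\mathbf{x}} \longrightarrow \underline{\mathbf{N}}\,\underline{\mathbf{x}}, \tag{4.47}$$

$$\underline{\mathbf{y}} \longrightarrow \underline{\mathbf{M}}\,\underline{\mathbf{y}}, \tag{4.48}$$

and acts on $\underline{\mathsf{R}}_{xy}$ as

$$g(\underline{\mathsf{R}}_{xy}) = \begin{bmatrix} \underline{\mathbf{N}}\,\underline{\mathbf{R}}_{xx}\underline{\mathbf{N}}^H & \underline{\mathbf{N}}\,\underline{\mathbf{R}}_{xy}\underline{\mathbf{M}}^H \\ \underline{\mathbf{M}}\,\underline{\mathbf{R}}_{xy}^H\underline{\mathbf{N}}^H & \underline{\mathbf{M}}\,\underline{\mathbf{R}}_{yy}\underline{\mathbf{M}}^H \end{bmatrix}. \tag{4.49}$$

A maximal set of invariants for $\underline{\mathsf{R}}_{xy}$ under widely linear transformation of $\mathbf{x}$ and $\mathbf{y}$ is given by the total canonical correlations $\{\bar{k}_i\}_{i=1}^{2p}$ or, alternatively, by the Hermitian cross-correlations

$$\left\{E\,\xi_i\omega_i^* = \tfrac{1}{2}(\bar{k}_{2i-1} + \bar{k}_{2i})\right\}_{i=1}^{p}$$

together with the complementary cross-correlations

$$\left\{E\,\xi_i\omega_i = \tfrac{1}{2}(\bar{k}_{2i-1} - \bar{k}_{2i})\right\}_{i=1}^{p}.$$

The importance of a maximal invariant function is that a function is invariant if and only if it is (only) a function of a maximal invariant. For our purposes, this yields the following key result whose importance is difficult to overstate.

Result 4.4. Any function of $\mathbf{R}_{xx}$, $\mathbf{R}_{xy}$, and $\mathbf{R}_{yy}$ that is invariant under nonsingular linear transformation of $\mathbf{x}$ and $\mathbf{y}$ is a function of the rotational canonical correlations only.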


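Any function of $\mathbf{R}_{xx}$, $\tilde{\mathbf{R}}_{xy}$, and $\mathbf{R}_{yy}^*$ that is invariant under nonsingular linear transformation of $\mathbf{x}$ and $\mathbf{y}$ is a function of the reflectional canonical correlations only. Any function of $\underline{\mathbf{R}}_{xx}$, $\underline{\mathbf{R}}_{xy}$, and $\underline{\mathbf{R}}_{yy}$ that is invariant under nonsingular widely linear transformation of $\mathbf{x}$ and $\mathbf{y}$ is a function of the total canonical correlations only.

This result is the basis for testing for independence between two Gaussian random vectors $\mathbf{x}$ and $\mathbf{y}$ (see Section 4.5), and testing for propriety, i.e., uncorrelatedness of $\mathbf{x}$ and $\mathbf{x}^*$ (see Section 3.4).

4.2.2 Multivariate linear regression (half-canonical correlations)

We now consider multivariate linear regression (MLR), which is also called half-canonical correlation analysis. A key difference between MLR and CCA is that MLR is not symmetric because it distinguishes between the message (or predictor/explanatory variables) $\mathbf{x}$ and the measurement (or criterion/response variables) $\mathbf{y}$. The invariance properties of MLR are as follows.

Result 4.5. Rotational and reflectional half-canonical correlations are invariant under unitary transformation of $\mathbf{x}$ and nonsingular linear transformation of $\mathbf{y}$, i.e., $(\mathbf{x}, \mathbf{y})$ and $(\mathbf{U}\mathbf{x}, \mathbf{M}\mathbf{y})$ have the same rotational and reflectional half-canonical correlations for all unitary $\mathbf{U} \in \mathbb{C}^{n\times n}$ and nonsingular $\mathbf{M} \in \mathbb{C}^{m\times m}$. Total half-canonical correlations are invariant under widely unitary transformation of $\mathbf{x}$ and nonsingular widely linear transformation of $\mathbf{y}$, i.e., $(\underline{\mathbf{x}}, \underline{\mathbf{y}})$ and $(\underline{\mathbf{U}}\,\underline{\mathbf{x}}, \underline{\mathbf{M}}\,\underline{\mathbf{y}})$ have the same total half-canonical correlations for all unitary $\underline{\mathbf{U}} \in \mathcal{W}^{n\times n}$, $\underline{\mathbf{U}}\,\underline{\mathbf{U}}^H = \underline{\mathbf{U}}^H\underline{\mathbf{U}} = \mathbf{I}$, and nonsingular $\underline{\mathbf{M}} \in \mathcal{W}^{m\times m}$.

This can be shown similarly to the invariance properties of CCA. The rotational half-canonical correlations $k_i$ are the singular values of $\mathbf{C}$, or equivalently, the nonnegative roots of the eigenvalues of $\mathbf{C}\mathbf{C}^H$. Let $\mathbf{C}'$ be the half-coherence matrix between $\mathbf{x}' = \mathbf{U}\mathbf{x}$ and $\mathbf{y}' = \mathbf{M}\mathbf{y}$. We obtain

$$\begin{aligned}
\mathrm{ev}(\mathbf{C}'\mathbf{C}'^H) &= \mathrm{ev}(\mathbf{R}_{x'y'}\mathbf{R}_{y'y'}^{-1}\mathbf{R}_{x'y'}^H) \\
&= \mathrm{ev}(\mathbf{U}\mathbf{R}_{xy}\mathbf{M}^H\,\mathbf{M}^{-H}\mathbf{R}_{yy}^{-1}\mathbf{M}^{-1}\,\mathbf{M}\mathbf{R}_{xy}^H\mathbf{U}^H) \\
&= \mathrm{ev}(\mathbf{C}\mathbf{C}^H).
\end{aligned} \tag{4.50}$$

Similar proofs work for reflectional and total half-canonical correlations.

Rotational half-canonical correlations are a maximal set of invariants for $\mathbf{R}_{xy}$ and $\mathbf{R}_{yy}$, and reflectional half-canonical correlations are a maximal set of invariants for $\tilde{\mathbf{R}}_{xy}$ and $\mathbf{R}_{yy}^*$, under the transformation

$$\mathbf{x} \longrightarrow \mathbf{U}\mathbf{x}, \tag{4.51}$$

$$\mathbf{y} \longrightarrow \mathbf{M}\mathbf{y} \tag{4.52}$$

for unitary $\mathbf{U} \in \mathbb{C}^{n\times n}$ and nonsingular $\mathbf{M} \in \mathbb{C}^{m\times m}$.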


Total half-canonical correlations are a maximal set of invariants for $\underline{\mathbf{R}}_{xy}$ and $\underline{\mathbf{R}}_{yy}$ under the transformation

$$\underline{\mathbf{x}} \longrightarrow \underline{\mathbf{U}}\,\underline{\mathbf{x}}, \tag{4.53}$$

$$\underline{\mathbf{y}} \longrightarrow \underline{\mathbf{M}}\,\underline{\mathbf{y}} \tag{4.54}$$

for unitary $\underline{\mathbf{U}} \in \mathcal{W}^{n\times n}$ and nonsingular $\underline{\mathbf{M}} \in \mathcal{W}^{m\times m}$. This maximal invariance property is proved along the same lines as the maximal invariance property of canonical correlations in the previous section. Thus, we have the following important result.

Result 4.6. Any function of $\mathbf{R}_{xy}$ and $\mathbf{R}_{yy}$ that is invariant under unitary transformation of $\mathbf{x}$ and nonsingular linear transformation of $\mathbf{y}$ is a function of the rotational half-canonical correlations only. Any function of $\tilde{\mathbf{R}}_{xy}$ and $\mathbf{R}_{yy}^*$ that is invariant under unitary transformation of $\mathbf{x}$ and nonsingular linear transformation of $\mathbf{y}$ is a function of the reflectional half-canonical correlations only. Any function of $\underline{\mathbf{R}}_{xy}$ and $\underline{\mathbf{R}}_{yy}$ that is invariant under widely unitary transformation of $\mathbf{x}$ and nonsingular widely linear transformation of $\mathbf{y}$ is a function of the total half-canonical correlations only.

One may have been tempted to assume that half-canonical correlations are maximal invariants for the composite correlation matrices in (4.39), (4.44), or (4.46). But it is clear that half-canonical correlations do not characterize $\mathbf{R}_{xx}$ or $\underline{\mathbf{R}}_{xx}$. A maximal set of invariants for $\mathbf{R}_{xx}$ under unitary transformation of $\mathbf{x}$ is the set of eigenvalues of $\mathbf{R}_{xx}$, and a maximal set of invariants for $\underline{\mathbf{R}}_{xx}$ under widely unitary transformation of $\mathbf{x}$ is the set of eigenvalues of $\underline{\mathbf{R}}_{xx}$. For instance, $\mathrm{tr}\,\mathbf{R}_{xx}$, which is invariant under unitary transformation, is the sum of the eigenvalues of $\mathbf{R}_{xx}$.

Weighted MLR

A generalization of MLR replaces the condition that $\mathbf{A}$ have unitary rows, $\mathbf{A}\mathbf{A}^H = \mathbf{I}$, with $\mathbf{A}\mathbf{W}\mathbf{A}^H = \mathbf{I}$, where $\mathbf{W} \in \mathbb{C}^{n\times n}$ is a Hermitian, positive definite weighting matrix. In the rotational case, the optimum transformations and corresponding SVD are then given by

$$\mathbf{A} = \mathbf{F}^H\mathbf{W}^{-1/2}, \tag{4.55}$$

$$\mathbf{B} = \mathbf{G}^H\mathbf{R}_{yy}^{-1/2}, \tag{4.56}$$

$$\mathbf{C} = \mathbf{W}^{-1/2}\mathbf{R}_{xy}\mathbf{R}_{yy}^{-H/2} = \mathbf{F}\mathbf{K}\mathbf{G}^H. \tag{4.57}$$

Extensions to the reflectional and total case work analogously. The invariance properties of weighted half-canonical correlations differ from those of half-canonical correlations and depend on the choice of $\mathbf{W}$. It is interesting to note that canonical correlations are actually weighted half-canonical correlations with weighting matrix $\mathbf{W} = \mathbf{R}_{xx}$.
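The remark that CCA is weighted MLR with $\mathbf{W} = \mathbf{R}_{xx}$, and that $\mathbf{W} = \mathbf{I}$ recovers ordinary MLR, can be illustrated with a short sketch of (4.57). The function names and the synthetic data are assumptions made for this illustration; Hermitian square roots are used throughout.

```python
import numpy as np

def inv_sqrt(R):
    """Hermitian inverse square root of a Hermitian positive definite matrix."""
    w, V = np.linalg.eigh(R)
    return (V / np.sqrt(w)) @ V.conj().T

def weighted_half_canonical(Rxy, Ryy, W):
    """Singular values of C = W^{-1/2} Rxy Ryy^{-H/2}, cf. (4.57)."""
    C = inv_sqrt(W) @ Rxy @ inv_sqrt(Ryy)
    return np.linalg.svd(C, compute_uv=False)

# Synthetic (hypothetical) correlation matrices.
rng = np.random.default_rng(3)
n, m, T = 3, 4, 500
mix = rng.standard_normal((n + m, n + m)) + 1j * rng.standard_normal((n + m, n + m))
Z = mix @ (rng.standard_normal((n + m, T)) + 1j * rng.standard_normal((n + m, T)))
R = Z @ Z.conj().T / T
Rxx, Rxy, Ryy = R[:n, :n], R[:n, n:], R[n:, n:]

k_w_cca = weighted_half_canonical(Rxy, Ryy, Rxx)        # W = Rxx: canonical correlations
k_w_mlr = weighted_half_canonical(Rxy, Ryy, np.eye(n))  # W = I: half-canonical correlations

# Reference values from the eigenvalue characterizations.
ev_cca = np.sort(np.linalg.eigvals(
    np.linalg.solve(Rxx, Rxy) @ np.linalg.solve(Ryy, Rxy.conj().T)).real)[::-1]
ev_mlr = np.sort(np.linalg.eigvalsh(Rxy @ np.linalg.solve(Ryy, Rxy.conj().T)))[::-1]
```

Here `k_w_cca**2` matches `ev_cca` and `k_w_mlr**2` matches `ev_mlr` up to rounding, so the single routine covers both techniques by changing only the weighting matrix.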

4.2.3 Partial least squares

Finally, we turn to partial least squares (PLS). We emphasize that there are many variants of PLS. Our description of PLS follows Sampson et al. (1989), which differs from the


original PLS algorithm presented by Wold (1975, 1985). In the original PLS algorithm, both $\mathbf{R}_{\xi\xi}$ and $\mathbf{R}_{\omega\omega}$ are diagonal matrices (albeit not identity matrices) and $\mathbf{K}$ is not generally diagonal. Moreover, $\mathbf{A}$ and $\mathbf{B}$ are not unitary. In the version of Sampson et al. (1989), which we follow, $\mathbf{K}$ is diagonal but $\mathbf{R}_{\xi\xi}$ and $\mathbf{R}_{\omega\omega}$ are generally not, and $\mathbf{A}$ and $\mathbf{B}$ are both unitary. The maximal invariance properties of PLS should be obvious and straightforward to prove because PLS is obtained directly by the SVD of the cross-correlation matrix between $\mathbf{x}$ and $\mathbf{y}$. The following two results are stated for completeness.

Result 4.7. Rotational and reflectional PLS correlations are invariant under unitary transformation of $\mathbf{x}$ and $\mathbf{y}$, i.e., $(\mathbf{x}, \mathbf{y})$ and $(\mathbf{U}\mathbf{x}, \mathbf{V}\mathbf{y})$ have the same rotational and reflectional PLS correlations for all unitary $\mathbf{U} \in \mathbb{C}^{n\times n}$ and $\mathbf{V} \in \mathbb{C}^{m\times m}$. Total PLS correlations are invariant under widely unitary transformation of $\mathbf{x}$ and $\mathbf{y}$, i.e., $(\underline{\mathbf{x}}, \underline{\mathbf{y}})$ and $(\underline{\mathbf{U}}\,\underline{\mathbf{x}}, \underline{\mathbf{V}}\,\underline{\mathbf{y}})$ have the same total PLS correlations for all unitary $\underline{\mathbf{U}} \in \mathcal{W}^{n\times n}$ and $\underline{\mathbf{V}} \in \mathcal{W}^{m\times m}$. Moreover, the PLS correlations of $(\mathbf{x}, \mathbf{y})$ and $(\mathbf{y}, \mathbf{x})$ are identical.

Result 4.8. Any function of $\mathbf{R}_{xy}$ that is invariant under unitary transformation of $\mathbf{x}$ and $\mathbf{y}$ is a function of the rotational PLS correlations only. Any function of $\tilde{\mathbf{R}}_{xy}$ that is invariant under unitary transformation of $\mathbf{x}$ and $\mathbf{y}$ is a function of the reflectional PLS correlations only. Any function of $\underline{\mathbf{R}}_{xy}$ that is invariant under widely unitary transformation of $\mathbf{x}$ and $\mathbf{y}$ is a function of the total PLS correlations only.

Both PLS and CCA provide a symmetric measure of multivariate association since the roles of $\mathbf{x}$ and $\mathbf{y}$ are interchangeable. PLS has an advantage over CCA if it is applied to sample correlation matrices. Since the computation of the sample coherence matrix $\mathbf{S}_{xx}^{-1/2}\mathbf{S}_{xy}\mathbf{S}_{yy}^{-H/2}$ requires the computation of inverses $\mathbf{S}_{xx}^{-1}$ and $\mathbf{S}_{yy}^{-1}$, we run into stability problems if $\mathbf{S}_{xx}$ or $\mathbf{S}_{yy}$ are close to singular. That is, the SVD of the coherence matrix can change significantly after recomputing sample correlation matrices with added samples that are nearly collinear with previous samples. These problems do not arise with PLS, which is why PLS has been called “robust canonical analysis” by Tishler and Lipovetsky (2000). However, it is possible to alleviate some of the numerical problems of CCA by applying appropriate penalties to the sample correlation matrices $\mathbf{S}_{xx}$ and $\mathbf{S}_{yy}$.

4.3 Correlation coefficients for complex vectors

In order to summarize the degree of multivariate association between $\mathbf{x}$ and $\mathbf{y}$, we now define different scalar-valued overall correlation coefficients $\rho$ as functions of the diagonal cross-correlations $\{k_i\}$, which may be computed as rotational, reflectional, or total CCA, MLR, or PLS coefficients. Therefore, these correlation coefficients inherit their invariance properties from $\{k_i\}$. It is easy to show that all correlation coefficients presented in this chapter are increasing, Schur-convex functions. (Refer to Appendix 3,


Section A3.1.2, for background on increasing, Schur-convex functions.) This means they are maximized by CCA, MLR, or PLS, for all ranks r .

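4.3.1 Canonical correlations

There are a number of commonly used correlation coefficients that are defined on the basis of the first $r$ canonical correlations $\{k_i\}_{i=1}^{r}$. Three particularly compelling coefficients are⁴

$$\rho_{C_1}^2 = \frac{1}{p}\sum_{i=1}^{r} k_i^2, \tag{4.58}$$

$$\rho_{C_2}^2 = 1 - \prod_{i=1}^{r}(1 - k_i^2), \tag{4.59}$$

$$\rho_{C_3}^2 = \frac{\displaystyle\sum_{i=1}^{r}\frac{k_i^2}{1 - k_i^2}}{\displaystyle\sum_{i=1}^{r}\frac{1}{1 - k_i^2} + (p - r)}. \tag{4.60}$$

These are rotational coefficients, and their reflectional versions $\tilde{\rho}$ are obtained simply by replacing $k_i$ with $\tilde{k}_i$. Their total versions use $\bar{k}_i$ but require slightly different normalizations because there are $2r$ rather than $r$ coefficients:

$$\bar{\rho}_{C_1}^2 = \frac{1}{2p}\sum_{i=1}^{2r}\bar{k}_i^2, \tag{4.61}$$

$$\bar{\rho}_{C_2}^2 = 1 - \prod_{i=1}^{2r}(1 - \bar{k}_i^2)^{1/2}, \tag{4.62}$$

$$\bar{\rho}_{C_3}^2 = \frac{\displaystyle\sum_{i=1}^{2r}\frac{\bar{k}_i^2}{1 - \bar{k}_i^2}}{\displaystyle\sum_{i=1}^{2r}\frac{1}{1 - \bar{k}_i^2} + 2(p - r)}. \tag{4.63}$$

In the full-rank case $r = p = \min(m, n)$, these coefficients can also be expressed directly in terms of the correlation and cross-correlation matrices. For the rotational coefficients, we find the following expressions:

$$\rho_{C_1}^2 = \frac{1}{p}\,\mathrm{tr}\,\mathbf{K}^2 = \frac{1}{p}\,\mathrm{tr}(\mathbf{R}_{xx}^{-1}\mathbf{R}_{xy}\mathbf{R}_{yy}^{-1}\mathbf{R}_{xy}^H), \tag{4.64}$$

$$\rho_{C_2}^2 = 1 - \det(\mathbf{I} - \mathbf{K}^2) = 1 - \det(\mathbf{I} - \mathbf{R}_{xx}^{-1}\mathbf{R}_{xy}\mathbf{R}_{yy}^{-1}\mathbf{R}_{xy}^H), \tag{4.65}$$

$$\rho_{C_3}^2 = \frac{\mathrm{tr}\left[\mathbf{K}^2(\mathbf{I} - \mathbf{K}^2)^{-1}\right]}{\mathrm{tr}\left[(\mathbf{I} - \mathbf{K}^2)^{-1}\right]} = \frac{\mathrm{tr}\left[\mathbf{R}_{xy}\mathbf{R}_{yy}^{-1}\mathbf{R}_{xy}^H(\mathbf{R}_{xx} - \mathbf{R}_{xy}\mathbf{R}_{yy}^{-1}\mathbf{R}_{xy}^H)^{-1}\right]}{\mathrm{tr}\left[\mathbf{R}_{xx}(\mathbf{R}_{xx} - \mathbf{R}_{xy}\mathbf{R}_{yy}^{-1}\mathbf{R}_{xy}^H)^{-1}\right]}. \tag{4.66}$$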


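The formulae for the reflectional versions are obtained by replacing $\mathbf{R}_{xy}$ with $\tilde{\mathbf{R}}_{xy}$, and $\mathbf{R}_{yy}^{-1}$ with $\mathbf{R}_{yy}^{-*}$. The expressions for the total full-rank coefficients are

$$\bar{\rho}_{C_1}^2 = \frac{1}{2p}\,\mathrm{tr}\,\underline{\mathbf{K}}^2 = \frac{1}{2p}\,\mathrm{tr}(\underline{\mathbf{R}}_{xx}^{-1}\underline{\mathbf{R}}_{xy}\underline{\mathbf{R}}_{yy}^{-1}\underline{\mathbf{R}}_{xy}^H), \tag{4.67}$$

$$\bar{\rho}_{C_2}^2 = 1 - \det\nolimits^{1/2}(\mathbf{I} - \underline{\mathbf{K}}^2) = 1 - \det\nolimits^{1/2}(\mathbf{I} - \underline{\mathbf{R}}_{xx}^{-1}\underline{\mathbf{R}}_{xy}\underline{\mathbf{R}}_{yy}^{-1}\underline{\mathbf{R}}_{xy}^H), \tag{4.68}$$

$$\bar{\rho}_{C_3}^2 = \frac{\mathrm{tr}\left[\underline{\mathbf{K}}^2(\mathbf{I} - \underline{\mathbf{K}}^2)^{-1}\right]}{\mathrm{tr}\left[(\mathbf{I} - \underline{\mathbf{K}}^2)^{-1}\right]} = \frac{\mathrm{tr}\left[\underline{\mathbf{R}}_{xy}\underline{\mathbf{R}}_{yy}^{-1}\underline{\mathbf{R}}_{xy}^H(\underline{\mathbf{R}}_{xx} - \underline{\mathbf{R}}_{xy}\underline{\mathbf{R}}_{yy}^{-1}\underline{\mathbf{R}}_{xy}^H)^{-1}\right]}{\mathrm{tr}\left[\underline{\mathbf{R}}_{xx}(\underline{\mathbf{R}}_{xx} - \underline{\mathbf{R}}_{xy}\underline{\mathbf{R}}_{yy}^{-1}\underline{\mathbf{R}}_{xy}^H)^{-1}\right]}. \tag{4.69}$$

These coefficients inherit the invariance properties of the canonical correlations. That is, the rotational and reflectional coefficients are invariant under nonsingular linear transformation of $\mathbf{x}$ and $\mathbf{y}$, and the total coefficients are invariant under nonsingular widely linear transformation.

How should we interpret these coefficients? The rotational version of $\rho_{C_1}$ characterizes the MMSE when constructing a linear estimate of the canonical vector $\boldsymbol{\xi}$ from $\mathbf{y}$. This estimate is

$$\hat{\boldsymbol{\xi}}(\mathbf{y}) = \mathbf{R}_{\xi y}\mathbf{R}_{yy}^{-1}\mathbf{y} = \mathbf{F}^H\mathbf{R}_{xx}^{-1/2}\mathbf{R}_{xy}\mathbf{R}_{yy}^{-1}\mathbf{y} = \mathbf{F}^H\mathbf{C}\mathbf{R}_{yy}^{-1/2}\mathbf{y} = \mathbf{K}\mathbf{B}\mathbf{y} = \mathbf{K}\boldsymbol{\omega}, \tag{4.70}$$

and the resulting MMSE is

$$E\|\hat{\boldsymbol{\xi}}(\mathbf{y}) - \boldsymbol{\xi}\|^2 = \mathrm{tr}(\mathbf{R}_{\xi\xi} - \mathbf{R}_{\xi\omega}\mathbf{R}_{\omega\omega}^{-1}\mathbf{R}_{\xi\omega}^H) = \mathrm{tr}(\mathbf{I} - \mathbf{K}^2) = p(1 - \rho_{C_1}^2). \tag{4.71}$$

Since CCA is symmetric in $\mathbf{x}$ and $\mathbf{y}$, the same MMSE is obtained when estimating $\boldsymbol{\omega}$ from $\mathbf{x}$. In a similar fashion, the reflectional version $\tilde{\rho}_{C_1}$ is related to the MMSE of the conjugate linear estimator

$$\hat{\boldsymbol{\xi}}(\mathbf{y}^*) = \tilde{\mathbf{K}}\tilde{\boldsymbol{\omega}}^*, \tag{4.72}$$

and the total version $\bar{\rho}_{C_1}$ is related to the MMSE of the widely linear estimator

$$\hat{\boldsymbol{\xi}}(\mathbf{y}, \mathbf{y}^*) = \mathbf{K}_1\boldsymbol{\omega} + \mathbf{K}_2\boldsymbol{\omega}^*. \tag{4.73}$$

For jointly Gaussian $\mathbf{x}$ and $\mathbf{y}$, the second coefficient, $\rho_{C_2}$, determines the mutual information between $\mathbf{x}$ and $\mathbf{y}$,

$$I(\mathbf{x}; \mathbf{y}) = -\log\left(\frac{\det(\mathbf{R}_{xx} - \mathbf{R}_{xy}\mathbf{R}_{yy}^{-1}\mathbf{R}_{xy}^H)}{\det\mathbf{R}_{xx}}\right) = -\log\det(\mathbf{I} - \mathbf{K}^2) = -\log(1 - \rho_{C_2}^2). \tag{4.74}$$

The rotational and reflectional versions only take rotational and reflectional dependences into account, respectively, whereas the total version characterizes the total mutual information between $\mathbf{x}$ and $\mathbf{y}$.

Finally, $\rho_{C_3}$ has an interesting interpretation in the signal-plus-uncorrelated-noise case $\mathbf{y} = \mathbf{x} + \mathbf{n}$. Let $\mathbf{R}_{xx} = \mathbf{R}_{xy} = \mathbf{S}$ and $\mathbf{R}_{yy} = \mathbf{S} + \mathbf{N}$. It is easy to show that the eigenvalues of the signal-to-noise-ratio (SNR) matrix $\mathbf{S}\mathbf{N}^{-1}$ are $\{k_i^2/(1 - k_i^2)\}$. Hence, they are invariant under nonsingular linear transformation of the signal $\mathbf{x}$, and the numerator of $\rho_{C_3}^2$ in (4.66) is $\mathrm{tr}(\mathbf{S}\mathbf{N}^{-1})$. Correlation coefficient $\rho_{C_3}$ can thus be interpreted as a normalized SNR.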
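The rank-$r$ rotational coefficients (4.58)–(4.60) and the mutual information (4.74) are simple functions of the canonical correlations, as the following sketch shows. The function names and the example values of $k_i$ are assumptions chosen for illustration; all $k_i$ are taken to be strictly less than 1 so that the ratios are finite.

```python
import numpy as np

def cca_coefficients(k, r=None):
    """Rank-r rotational coefficients (rho_C1^2, rho_C2^2, rho_C3^2) from canonical correlations k."""
    k = np.asarray(k, dtype=float)
    p = k.size
    r = p if r is None else r
    k2 = k[:r] ** 2
    rho1 = k2.sum() / p                                   # (4.58)
    rho2 = 1.0 - np.prod(1.0 - k2)                        # (4.59)
    rho3 = np.sum(k2 / (1.0 - k2)) / (np.sum(1.0 / (1.0 - k2)) + (p - r))  # (4.60)
    return rho1, rho2, rho3

def gaussian_mutual_information(k):
    """I(x; y) = -log det(I - K^2) in nats, following (4.74)."""
    k = np.asarray(k, dtype=float)
    return -np.sum(np.log1p(-k ** 2))

k = [0.9, 0.5, 0.1]                    # hypothetical canonical correlations, p = 3
full_rank = cca_coefficients(k)
by_rank = [cca_coefficients(k, r) for r in (1, 2, 3)]
mutual_info = gaussian_mutual_information(k)
```

Evaluating `by_rank` shows each coefficient growing with $r$, and `mutual_info` equals `-log(1 - full_rank[1])`, which is the last equality in (4.74).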

Figure 4.5 Correlation coefficients $\rho_{C_1}^2$ (a) and $\rho_{C_2}^2$ (b) for various ranks $r$. [Plot not reproduced: each panel shows rotational, reflectional, and total curves for $r = 1, \ldots, 5$.]

Example 4.4. Consider the widely linear model

$$\mathbf{y} = \mathbf{H}_1\mathbf{x} + \mathbf{H}_2\mathbf{x}^* + \mathbf{n}, \tag{4.75}$$

where $\mathbf{x}$ has dimension $n = 5$ and $\mathbf{y}$ and $\mathbf{n}$ have dimension $m = 6$. Furthermore, $\mathbf{x}$ and $\mathbf{n}$ are independent Gaussians, both proper, with zero mean and correlation matrices $\mathbf{R}_{xx} = \mathbf{I}$ and $\mathbf{R}_{nn} = \mathbf{I}$. The matrix describing the linear part of the transformation

$$\mathbf{H}_1 = \begin{bmatrix}
1 & 1+j & 1-j & 2 & 1-2j \\
1 & 0 & 0 & j & -j \\
1-j & -2j & 1+j & 0 & j \\
-1 & 1+j & 1-j & 2-2j & 1 \\
0 & 1+j & 1-j & 2-j & 1-j \\
j & 2 & 1-3j & 3 & 1-j
\end{bmatrix}$$

has rank 3, the matrix describing the conjugate linear part

$$\mathbf{H}_2 = \begin{bmatrix}
1 & -j & 0 & 1-j & 0 \\
1+j & 2-j & 2-j & 1 & 1 \\
1+j & 1+2j & 1 & j & j \\
3-2j & -2-5j & -2-4j & 1-3j & -2j \\
3+3j & 5-2j & 5 & 4-j & 2+j \\
0 & -1+j & -1+3j & 1-j & -1+j
\end{bmatrix}$$

has rank 2, and $[\mathbf{H}_1, \mathbf{H}_2]$, which models the widely linear transformation, has rank 5. From 1000 sample pairs of $(\mathbf{x}, \mathbf{y})$ we computed estimates of the correlation and cross-correlation matrices of $\mathbf{x}$ and $\mathbf{y}$. From these, we estimated the correlation coefficients $\rho_{C_1}^2$ and $\rho_{C_2}^2$ for $r = 1, \ldots, 5$, as depicted in Fig. 4.5. We emphasize that this computation does not assume any knowledge whatsoever about how $\mathbf{x}$ and $\mathbf{y}$ are generated (i.e.,


from what distribution they are drawn). The way in which x and y are generated in this example is simply illustrative. We can make the following observations in Fig. 4.5. Correlation coefficient ρC1 is a good means of estimating the ranks of the linear, conjugate linear, or widely linear components of the relationship between x and y. Even though the widely linear transformation has rank 5, the first four dimensions capture most of the total correlation between x and y. Correlation coefficient ρC2 is sensitive to the presence of a canonical correlation ki close to 1. This happens whenever there is a strong linear relationship between x and y. As a consequence, we will find in Section 4.5 that ρC2 is the test statistic in a generalized likelihood-ratio test for whether x and y are independent. In this example, x and y are clearly not independent, and ρC2 bears witness to this fact.

4.3.2 Multivariate linear regression (half-canonical correlations)

There are many problems in which the roles of $\mathbf{x}$ and $\mathbf{y}$ are not interchangeable. For these, the symmetric assessment of correlation by CCA is unsatisfactory. The most obvious example is multivariate linear regression, where a message $\mathbf{x}$ is estimated from a measurement $\mathbf{y}$. The resulting MMSE is invariant under nonsingular transformation of $\mathbf{y}$, but only under unitary transformation of $\mathbf{x}$. Half-canonical correlations have exactly these invariance properties. While there are many conceivable correlation coefficients based on the first $r$ half-canonical correlations $\{k_i\}_{i=1}^{r}$, the most common correlation coefficient is

$$\rho_M^2 = \frac{1}{\mathrm{tr}\,\mathbf{R}_{xx}}\sum_{i=1}^{r} k_i^2. \tag{4.76}$$

For full rank $r = p$, it can also be written as

$$\rho_M^2 = \frac{\mathrm{tr}(\mathbf{R}_{xy}\mathbf{R}_{yy}^{-1}\mathbf{R}_{xy}^H)}{\mathrm{tr}\,\mathbf{R}_{xx}} = \frac{\mathrm{tr}\,\mathbf{K}^2}{\mathrm{tr}\,\mathbf{R}_{xx}}. \tag{4.77}$$

The reflectional version $\tilde{\rho}_M$ is obtained by replacing $k_i$ with $\tilde{k}_i$, $\mathbf{R}_{xy}$ with $\tilde{\mathbf{R}}_{xy}$, and $\mathbf{R}_{yy}^{-1}$ with $\mathbf{R}_{yy}^{-*}$. The total version is obtained by including a normalizing factor of 1/2,

$$\bar{\rho}_M^2 = \frac{1}{2\,\mathrm{tr}\,\mathbf{R}_{xx}}\sum_{i=1}^{2r}\bar{k}_i^2, \tag{4.78}$$

which, for $r = p$, yields

$$\bar{\rho}_M^2 = \frac{\mathrm{tr}(\underline{\mathbf{R}}_{xy}\underline{\mathbf{R}}_{yy}^{-1}\underline{\mathbf{R}}_{xy}^H)}{2\,\mathrm{tr}\,\mathbf{R}_{xx}} = \frac{\mathrm{tr}\,\underline{\mathbf{K}}^2}{\mathrm{tr}\,\underline{\mathbf{R}}_{xx}}. \tag{4.79}$$

The coefficient $\rho_M^2$ has been called the redundancy index⁵ because it is the fraction of the total variance of $\mathbf{x}$ that is explained by a linear estimate from the measurement $\mathbf{y}$.

Figure 4.6 Correlation coefficients $\rho_{C_1}^2 = \rho_M^2$ for $\mathbf{R}_{xx} = \mathbf{I}$ (shown as ×), $\rho_{C_1}^2$ for $\mathbf{R}_{xx} = \mathrm{Diag}(4, \tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4})$ (shown with a separate marker), and $\rho_M^2$ for $\mathbf{R}_{xx} = \mathrm{Diag}(4, \tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4})$ (shown as ◦). [Plot not reproduced: correlation coefficient versus rank $r = 1, \ldots, 5$.]

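This estimate is

$$\hat{\mathbf{x}}(\mathbf{y}) = \mathbf{R}_{xy}\mathbf{R}_{yy}^{-1}\mathbf{y} = \mathbf{C}\mathbf{R}_{yy}^{-1/2}\mathbf{y} = \mathbf{F}\mathbf{K}\mathbf{G}^H\mathbf{R}_{yy}^{-1/2}\mathbf{y} = \mathbf{A}^H\mathbf{K}\mathbf{B}\mathbf{y}, \tag{4.80}$$

where the last step uses $\mathbf{A} = \mathbf{F}^H$ and $\mathbf{B} = \mathbf{G}^H\mathbf{R}_{yy}^{-1/2}$ from Table 4.1, and the MMSE is

$$E\|\hat{\mathbf{x}}(\mathbf{y}) - \mathbf{x}\|^2 = \mathrm{tr}(\mathbf{R}_{xx} - \mathbf{R}_{xy}\mathbf{R}_{yy}^{-1}\mathbf{R}_{xy}^H) = \mathrm{tr}\,\mathbf{R}_{xx} - \mathrm{tr}\,\mathbf{K}^2 = \mathrm{tr}\,\mathbf{R}_{xx}\,(1 - \rho_M^2). \tag{4.81}$$

If $\rho_M^2 = 1$, then $\mathbf{x}$ is perfectly linearly estimable from $\mathbf{y}$. Similarly, the reflectional version $\tilde{\rho}_M$ is related to the MMSE of the conjugate linear estimate

$$\hat{\mathbf{x}}(\mathbf{y}^*) = \tilde{\mathbf{A}}^H\tilde{\mathbf{K}}\tilde{\mathbf{B}}\mathbf{y}^*, \tag{4.82}$$

and the total version $\bar{\rho}_M$ characterizes the MMSE of the widely linear estimate

$$\hat{\underline{\mathbf{x}}}(\mathbf{y}, \mathbf{y}^*) = \underline{\mathbf{A}}^H\underline{\mathbf{K}}\,\underline{\mathbf{B}}\,\underline{\mathbf{y}}. \tag{4.83}$$

If there is a perfect linear relationship between $\mathbf{x}$ and $\mathbf{y}$, then both the redundancy index $\rho_M^2 = 1$ and the corresponding CCA correlation coefficient $\rho_{C_1}^2 = 1$. In general, however, $\rho_M^2$ and $\rho_{C_1}^2$ can behave quite differently, as the following example shows.

Example 4.5. Consider again the setup in the previous example, but we will only look at rotational correlations here. Since $\mathbf{R}_{xx} = \mathbf{I}$, $\rho_{C_1}^2 = \rho_M^2$ for all ranks $r$. Now consider $\mathbf{R}_{xx} = \mathrm{Diag}(4, \tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4})$, which has the same trace as before but now most of its variance is concentrated in one dimension. This change affects the coefficients $\rho_M^2$ and $\rho_{C_1}^2$ quite differently, as shown in Fig. 4.6. While $\rho_{C_1}^2$ decreases for each $r$, $\rho_M^2$ increases. This indicates that it is easier to linearly estimate $\mathbf{x}$ from $\mathbf{y}$ if most of the variance of $\mathbf{x}$ is concentrated in a one-dimensional subspace. Moreover, while CCA still spreads the correlation over the first three dimensions, MLR concentrates most of the correlation in the first half-canonical correlation $k_1^2$.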
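The relation between the redundancy index (4.76) and the MMSE (4.81) can be verified numerically. The following sketch uses synthetic correlation matrices and hypothetical helper names; it does not reproduce the data of Examples 4.4 and 4.5.

```python
import numpy as np

def inv_sqrt(R):
    """Hermitian inverse square root of a Hermitian positive definite matrix."""
    w, V = np.linalg.eigh(R)
    return (V / np.sqrt(w)) @ V.conj().T

rng = np.random.default_rng(2)

def crandn(*shape):
    """Complex matrix with i.i.d. standard normal real and imaginary parts."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

n, m, T = 3, 4, 500
Z = crandn(n + m, n + m) @ crandn(n + m, T)     # synthetic correlated samples of [x; y]
R = Z @ Z.conj().T / T
Rxx, Rxy, Ryy = R[:n, :n], R[:n, n:], R[n:, n:]

# Rotational half-canonical correlations: singular values of Rxy Ryy^{-H/2}.
k = np.linalg.svd(Rxy @ inv_sqrt(Ryy), compute_uv=False)

tr_Rxx = np.trace(Rxx).real
rho_M2_by_rank = np.cumsum(k ** 2) / tr_Rxx     # (4.76) for r = 1, ..., p

# MMSE of the linear estimate of x from y, first expression in (4.81).
mmse = np.trace(Rxx - Rxy @ np.linalg.solve(Ryy, Rxy.conj().T)).real
```

At full rank, `mmse` agrees with `tr_Rxx * (1 - rho_M2_by_rank[-1])`, and the entries of `rho_M2_by_rank` show how much of the variance of $\mathbf{x}$ each additional half-canonical coordinate explains.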


4.3.3 Partial least squares

For problems that are invariant under unitary transformation of $\mathbf{x}$ and $\mathbf{y}$, the most common correlation coefficient is defined in terms of the PLS correlations $\{k_i\}_{i=1}^{r}$ as

$$\rho_P^2 = \frac{\displaystyle\sum_{i=1}^{r} k_i^2}{\sqrt{\mathrm{tr}\,\mathbf{R}_{xx}^2\;\mathrm{tr}\,\mathbf{R}_{yy}^2}}. \tag{4.84}$$

This coefficient was proposed by Robert and Escoufier (1976). For $r = p$, it can be expressed as

$$\rho_P^2 = \frac{\mathrm{tr}(\mathbf{R}_{xy}\mathbf{R}_{xy}^H)}{\sqrt{\mathrm{tr}\,\mathbf{R}_{xx}^2\;\mathrm{tr}\,\mathbf{R}_{yy}^2}} = \frac{\mathrm{tr}\,\mathbf{K}^2}{\sqrt{\mathrm{tr}\,\mathbf{R}_{xx}^2\;\mathrm{tr}\,\mathbf{R}_{yy}^2}}. \tag{4.85}$$

This correlation coefficient measures how closely $\mathbf{x}$ and $\mathbf{y}$ are related under unitary transformation. For $\mathbf{y} = \mathbf{U}\mathbf{x}$, with $\mathbf{U}^H\mathbf{U} = \mathbf{I}$ but not necessarily $\mathbf{U}\mathbf{U}^H = \mathbf{I}$, we have perfect correlation $\rho_P^2 = 1$.

The reflectional version $\tilde{\rho}_P$ is obtained by replacing $k_i$ with $\tilde{k}_i$ and $\mathbf{R}_{xy}$ with $\tilde{\mathbf{R}}_{xy}$. The total version, which is invariant under widely unitary transformation of $\mathbf{x}$ and $\mathbf{y}$, is

$$\bar{\rho}_P^2 = \frac{\displaystyle\sum_{i=1}^{2r}\bar{k}_i^2}{\sqrt{\mathrm{tr}\,\underline{\mathbf{R}}_{xx}^2\;\mathrm{tr}\,\underline{\mathbf{R}}_{yy}^2}}. \tag{4.86}$$

In the full-rank case $r = p$, this yields

$$\bar{\rho}_P^2 = \frac{\mathrm{tr}(\underline{\mathbf{R}}_{xy}\underline{\mathbf{R}}_{xy}^H)}{\sqrt{\mathrm{tr}\,\underline{\mathbf{R}}_{xx}^2\;\mathrm{tr}\,\underline{\mathbf{R}}_{yy}^2}} = \frac{\mathrm{tr}\,\underline{\mathbf{K}}^2}{\sqrt{\mathrm{tr}\,\underline{\mathbf{R}}_{xx}^2\;\mathrm{tr}\,\underline{\mathbf{R}}_{yy}^2}}. \tag{4.87}$$

4.4 Correlation spread

Using the results from Appendix 3, Section A3.2, it is not difficult to show that all correlation coefficients presented in this chapter are Schur-convex and increasing functions of the diagonal cross-correlations $\{k_i\}$. By maximizing all partial sums over their absolute values, subject to the constraints imposed by the correlation analysis technique, the correlation coefficients are then also maximized for all ranks $r$. The importance of this result is that we may assess correlation in a low(er)-dimensional subspace of dimension $r < p$. In fact, the development in this chapter anticipates some closely related results on reduced-rank estimation, which we will study in Section 5.5. There we will find that canonical coordinates are the optimum coordinate system for reduced-rank estimation and transform coding if the aim is to maximize information rate, whereas half-canonical coordinates are the optimum coordinate system for


reduced-rank estimation and transform coding if we want to minimize mean-squared error.

An interesting question in this context is how much of the overall correlation is captured by $r$ coefficients $\{k_i\}_{i=1}^{r}$, which can be rotational, reflectional, or total coefficients defined either through CCA, MLR, or PLS. One could, of course, compute the fraction $\rho^2(r)/\rho^2(p)$ for all $1 \le r < p$. The following definition, however, provides a more convenient approach.

Definition 4.4. The rotational correlation spread is defined as

$$\sigma^2 = \frac{1}{p}\,\frac{\mathrm{var}(\{k_i^2\})}{\mu^2(\{k_i^2\})} = \frac{p}{p-1}\left(\frac{\displaystyle\sum_{i=1}^{p} k_i^4}{\left(\displaystyle\sum_{i=1}^{p} k_i^2\right)^2} - \frac{1}{p}\right), \tag{4.88}$$

where $\mathrm{var}(\{k_i^2\})$ denotes the variance of the correlations $\{k_i^2\}$ and $\mu^2(\{k_i^2\})$ denotes their squared mean.

The correlation spread provides a single, normalized measure of how concentrated the overall correlation is. If there is only one nonzero coefficient $k_1$, then $\sigma^2 = 1$. If all coefficients are equal, $k_1 = k_2 = \cdots = k_p$, then $\sigma^2 = 0$. In essence, the correlation spread gives an indication of how compressible the cross-correlation between $\mathbf{x}$ and $\mathbf{y}$ is. The definition (4.88) is inspired by the definition of the degree of polarization of a random vector $\mathbf{x}$. The degree of polarization measures the spread among the eigenvalues of $\mathbf{R}_{xx}$. A random vector $\mathbf{x}$ is said to be completely polarized if all of its energy is concentrated in one direction, i.e., if there is only one nonzero eigenvalue. On the other hand, $\mathbf{x}$ is unpolarized if its energy is equally distributed among all dimensions, i.e., if all eigenvalues are equal. The correlation spread $\sigma^2$ generalizes this idea to the correlation between two random vectors $\mathbf{x}$ and $\mathbf{y}$.

Example 4.6. Continuing Example 4.5 for $\mathbf{R}_{xx} = \mathbf{I}$, we find $\sigma_C^2 = \sigma_M^2 = 0.167$ for both CCA and MLR. For $\mathbf{R}_{xx} = \operatorname{Diag}(4, \tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4})$, we find $\sigma_C^2 = 0.178$ for CCA and $\sigma_M^2 = 0.797$ for MLR. A $\sigma^2$-value close to 1 indicates that the correlation is highly concentrated. Indeed, as we have found in Example 4.5, most of the MLR correlation is concentrated in a one-dimensional subspace.

The reflectional correlation spread is defined as a straightforward extension by replacing $k_i^2$ with $\tilde{k}_i^2$ in Definition 4.4, but the total correlation spread is defined in a slightly different manner.


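Definition 4.5. The total correlation spread is defined as

$$
\begin{align}
\sigma^2 &= \frac{1}{p-1}\,\frac{\operatorname{var}\!\left(\left\{\tfrac{1}{2}(\bar{k}_{2i-1}^2 + \bar{k}_{2i}^2)\right\}\right)}{\mu^2\!\left(\left\{\tfrac{1}{2}(\bar{k}_{2i-1}^2 + \bar{k}_{2i}^2)\right\}\right)} \tag{4.89} \\
&= \frac{p}{p-1}\left(\frac{\sum_{i=1}^{p} (\bar{k}_{2i-1}^2 + \bar{k}_{2i}^2)^2}{\left(\sum_{i=1}^{p} (\bar{k}_{2i-1}^2 + \bar{k}_{2i}^2)\right)^{2}} - \frac{1}{p}\right). \tag{4.90}
\end{align}
$$

The motivation for defining $\sigma^2$ in this way is twofold. First, since $\bar{k}_{2i-1}$ is the cross-correlation between the real parts and $\bar{k}_{2i}$ is the cross-correlation between the imaginary parts of $\xi_i$ and $\omega_i$, they belong to the same complex dimension. Secondly, the expression (4.90) becomes (4.88) if $\bar{k}_{2i-1} = \bar{k}_{2i} = k_i$ for all $i$.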

4.5

Testing for correlation structure

In practice, correlation matrices will be estimated from measurements. Given these estimates, how can we make informed decisions such as whether $\mathbf{x}$ is white, or whether $\mathbf{x}$ and $\mathbf{y}$ are independent? Questions such as these must be answered by employing statistical tests for correlation structure. In this section, we develop generalized likelihood-ratio tests (GLRTs) for complex Gaussian data (see Note 6). That is, we assume that the data can be modeled as a complex Gaussian random vector $\mathbf{x}: \Omega \to \mathbb{C}^n$ with probability density function

$$
p(\mathbf{x}) = \frac{1}{\pi^n \det \mathbf{R}_{xx}} \exp\left\{-(\mathbf{x} - \boldsymbol{\mu}_x)^H \mathbf{R}_{xx}^{-1} (\mathbf{x} - \boldsymbol{\mu}_x)\right\}, \tag{4.91}
$$

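mean $\boldsymbol{\mu}_x$, and covariance matrix $\mathbf{R}_{xx}$. Later we will also replace $\mathbf{x}$ with other vectors such as $[\mathbf{x}^T, \mathbf{y}^T]^T$. We note that (4.91) is a proper Gaussian pdf, and the story we will be telling here is strictly linear, for simplicity. Appropriate extensions exist for augmented vectors, where the pdf (4.91) is adapted by replacing $\det$ with $\det^{1/2}$ and including a factor of 1/2 in the quadratic form of the exponential (cf. Result 2.4). These two changes to the pdf have no significant consequences for the following development.

Consider $M$ independent and identically distributed (i.i.d.) random samples $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_M]$ drawn from this distribution. The joint probability density function of these samples is given by

$$
\begin{align}
p(\mathbf{X}) &= \pi^{-Mn} (\det \mathbf{R}_{xx})^{-M} \exp\left\{-\sum_{m=1}^{M} (\mathbf{x}_m - \boldsymbol{\mu}_x)^H \mathbf{R}_{xx}^{-1} (\mathbf{x}_m - \boldsymbol{\mu}_x)\right\} \tag{4.92} \\
&= \pi^{-Mn} (\det \mathbf{R}_{xx})^{-M} \exp\left\{-M \operatorname{tr}(\mathbf{R}_{xx}^{-1} \mathbf{S}_{xx})\right\}, \tag{4.93}
\end{align}
$$

where, once the mean $\boldsymbol{\mu}_x$ has been replaced by its ML estimate, the sample mean $\mathbf{m}_x$, $\mathbf{S}_{xx}$ is the sample covariance matrix

$$
\mathbf{S}_{xx} = \frac{1}{M} \sum_{m=1}^{M} (\mathbf{x}_m - \mathbf{m}_x)(\mathbf{x}_m - \mathbf{m}_x)^H = \frac{1}{M} \mathbf{X}\mathbf{X}^H - \mathbf{m}_x \mathbf{m}_x^H \tag{4.94}
$$

and $\mathbf{m}_x$ is the sample mean vector

$$
\mathbf{m}_x = \frac{1}{M} \sum_{m=1}^{M} \mathbf{x}_m. \tag{4.95}
$$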
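The two expressions for the sample covariance matrix in (4.94) can be compared numerically. This is a minimal sketch, assuming NumPy and samples stored as the columns of an $n \times M$ array; the helper name `sample_mean_and_covariance` and the simulated data are illustrative assumptions.

```python
import numpy as np

def sample_mean_and_covariance(X):
    """X is n x M with one complex sample per column, X = [x_1, ..., x_M]."""
    M = X.shape[1]
    m = X.mean(axis=1, keepdims=True)      # sample mean vector (4.95), shape n x 1
    Xc = X - m                             # centered samples x_m - m_x
    S = Xc @ Xc.conj().T / M               # first form of (4.94)
    return m, S

rng = np.random.default_rng(0)
n, M = 3, 500
# Simulated complex samples with a nonzero mean (illustrative data only)
X = rng.standard_normal((n, M)) + 1j * rng.standard_normal((n, M)) + (1 + 2j)

m, S = sample_mean_and_covariance(X)
S_alt = X @ X.conj().T / M - m @ m.conj().T    # second form of (4.94)
```

Both forms should agree to numerical precision, and the result is Hermitian by construction.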

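We would now like to test whether $\mathbf{R}_{xx}$ has structure $\mathcal{R}_0$ or the alternative structure $\mathcal{R}_1$. We write this hypothesis-testing problem as

$$
H_0: \mathbf{R}_{xx} \in \mathcal{R}_0, \qquad H_1: \mathbf{R}_{xx} \in \mathcal{R}_1.
$$

The GLRT statistic is

$$
\lambda = \frac{\max\limits_{\mathbf{R}_{xx} \in \mathcal{R}_0} p(\mathbf{X})}{\max\limits_{\mathbf{R}_{xx} \in \mathcal{R}_1} p(\mathbf{X})}. \tag{4.96}
$$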

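This ratio compares the likelihood of drawing the samples $\mathbf{x}_1, \ldots, \mathbf{x}_M$ from a distribution whose covariance matrix has structure $\mathcal{R}_0$ with the likelihood of drawing them from a distribution whose covariance matrix has structure $\mathcal{R}_1$. Since the actual covariance matrices are not known, they are replaced with their maximum-likelihood (ML) estimates computed from the samples. If we denote by $\hat{\mathbf{R}}_0$ the ML estimate of $\mathbf{R}_{xx}$ under $H_0$ and by $\hat{\mathbf{R}}_1$ the ML estimate of $\mathbf{R}_{xx}$ under $H_1$, we find

$$
\lambda = \det{}^{M}\!\left(\hat{\mathbf{R}}_0^{-1} \hat{\mathbf{R}}_1\right) \exp\left\{-M \operatorname{tr}\left(\hat{\mathbf{R}}_0^{-1} \mathbf{S}_{xx} - \hat{\mathbf{R}}_1^{-1} \mathbf{S}_{xx}\right)\right\}. \tag{4.97}
$$

If we further assume that $\mathcal{R}_1$ is the set of positive semidefinite matrices (i.e., no special constraints are imposed), then $\hat{\mathbf{R}}_1 = \mathbf{S}_{xx}$, and

$$
\lambda = \det{}^{M}\!\left(\hat{\mathbf{R}}_0^{-1} \mathbf{S}_{xx}\right) \exp\left\{Mn - M \operatorname{tr}\left(\hat{\mathbf{R}}_0^{-1} \mathbf{S}_{xx}\right)\right\}. \tag{4.98}
$$

This may be expressed in the following result, which is Theorem 5.3.2 in Mardia et al. (1979).

Result 4.9. The generalized likelihood ratio for testing whether $\mathbf{R}_{xx}$ has structure $\mathcal{R}_0$ is

$$
\ell = \lambda^{1/(Mn)} = g\, e^{1-a}, \tag{4.99}
$$

where $a$ and $g$ are the arithmetic and geometric means of the eigenvalues of $\hat{\mathbf{R}}_0^{-1} \mathbf{S}_{xx}$:

$$
a = \frac{1}{n} \operatorname{tr}\left(\hat{\mathbf{R}}_0^{-1} \mathbf{S}_{xx}\right), \tag{4.100}
$$

$$
g = \left[\det\left(\hat{\mathbf{R}}_0^{-1} \mathbf{S}_{xx}\right)\right]^{1/n}. \tag{4.101}
$$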


In the hypothesis-testing problem, a threshold $\ell_0$ is chosen to achieve a desired probability of false alarm or probability of detection. Then, if $\ell \ge \ell_0$, we accept hypothesis $H_0$, and if $\ell < \ell_0$, we reject it. The GLRT is always invariant with respect to transformations for which the hypothesis-testing problem itself is invariant. The GLR will therefore turn out to be a function of a maximal invariant, which is determined by the hypothesis-testing problem. Let's consider a few interesting special cases.
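Result 4.9 translates almost directly into a few lines of code. The sketch below assumes NumPy, an unconstrained alternative $\mathcal{R}_1$, and Hermitian positive definite inputs; the function name `glr`, the simulated data, and the threshold value `ell0` are hypothetical choices for illustration only.

```python
import numpy as np

def glr(R0_hat, S):
    """GLR statistic ell = g * exp(1 - a) of Result 4.9 (hypothetical helper)."""
    # Eigenvalues of R0_hat^{-1} S; real and positive for Hermitian positive definite inputs
    eig = np.linalg.eigvals(np.linalg.solve(R0_hat, S)).real
    a = eig.mean()                        # arithmetic mean, (4.100)
    g = np.exp(np.mean(np.log(eig)))      # geometric mean, (4.101)
    return g * np.exp(1.0 - a)

rng = np.random.default_rng(1)
n, M = 4, 200
X = (rng.standard_normal((n, M)) + 1j * rng.standard_normal((n, M))) / np.sqrt(2)
S = X @ X.conj().T / M                    # sample covariance of zero-mean data

ell_same = glr(S, S)                      # R0_hat equal to S gives a = g = 1
ell_white = glr(np.eye(n), S)             # test against the fixed structure R0 = I

ell0 = 0.9                                # hypothetical threshold, not from the text
accept_H0 = ell_white >= ell0
```

Because the geometric mean never exceeds the arithmetic mean and $a e^{1-a} \le 1$, the statistic always lies in $(0, 1]$, with the value 1 attained when $\hat{\mathbf{R}}_0 = \mathbf{S}_{xx}$.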

4.5.1

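Sphericity

We would like to test whether $\mathbf{x}$ has a spherical distribution, i.e.,

$$
\mathcal{R}_0 = \{\mathbf{R}_{xx} = \sigma_x^2 \mathbf{I}\}, \tag{4.102}
$$

where $\sigma_x^2$ is the variance of each component of $\mathbf{x}$. The ML estimate of $\mathbf{R}_{xx}$ under $H_0$ is $\hat{\mathbf{R}}_0 = \hat{\sigma}_x^2 \mathbf{I}$, where the variance is estimated as $\hat{\sigma}_x^2 = n^{-1} \operatorname{tr} \mathbf{S}_{xx}$. The GLR in Result 4.9 is therefore

$$
\ell = \frac{[\det \mathbf{S}_{xx}]^{1/n}}{\frac{1}{n} \operatorname{tr} \mathbf{S}_{xx}}. \tag{4.103}
$$

This test is invariant with respect to scale and unitary transformation. That is, $\mathbf{x} \to \alpha \mathbf{U} \mathbf{x}$ with $\alpha \in \mathbb{C}$ and unitary $\mathbf{U} \in \mathbb{C}^{n \times n}$ leaves the GLR unchanged. The GLR is therefore a function of a maximal invariant under scale and unitary transformation. Such a maximal invariant is the set of estimated eigenvalues of $\mathbf{R}_{xx}$ (i.e., the eigenvalues of $\mathbf{S}_{xx}$), denoted by $\{\hat{\lambda}_i\}_{i=1}^{n}$, normalized by $\operatorname{tr} \mathbf{S}_{xx}$:

$$
\left\{\frac{\hat{\lambda}_i}{\operatorname{tr} \mathbf{S}_{xx}}\right\}_{i=1}^{n}.
$$

Indeed, we can express the GLR as

$$
\left(\frac{\ell}{n}\right)^{n} = \prod_{i=1}^{n} \frac{\hat{\lambda}_i}{\operatorname{tr} \mathbf{S}_{xx}}, \tag{4.104}
$$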

which also allows a reduced-rank implementation of the GLRT by considering only the r largest estimated eigenvalues.
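The sphericity statistic (4.103), its invariance under $\mathbf{x} \to \alpha \mathbf{U} \mathbf{x}$, and the eigenvalue form (4.104) can be illustrated as follows. This is a sketch under the assumption that NumPy is available and $M > n$; the helper names and the particular $\alpha$ and $\mathbf{U}$ are arbitrary illustrative choices.

```python
import numpy as np

def sample_cov(X):
    """Sample covariance (4.94) of the columns of X."""
    Xc = X - X.mean(axis=1, keepdims=True)
    return Xc @ Xc.conj().T / X.shape[1]

def sphericity_glr(S):
    """ell = [det S]^(1/n) / (tr S / n), computed from the eigenvalues of S."""
    lam = np.linalg.eigvalsh(S)                 # real eigenvalues of Hermitian S
    return np.exp(np.mean(np.log(lam))) / lam.mean()

rng = np.random.default_rng(2)
n, M = 4, 300
X = rng.standard_normal((n, M)) + 1j * rng.standard_normal((n, M))
S = sample_cov(X)
ell = sphericity_glr(S)

# Scale and unitary transformation x -> alpha U x
U, _ = np.linalg.qr(rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)))
alpha = 0.3 - 2.0j
ell_transformed = sphericity_glr(sample_cov(alpha * U @ X))

# Eigenvalue form (4.104): (ell / n)^n = prod_i lambda_i / tr S
lam = np.linalg.eigvalsh(S)
lhs = (ell / n) ** n
rhs = np.prod(lam / lam.sum())
```

Working with eigenvalues rather than the determinant also makes the reduced-rank variant straightforward: one would simply keep the largest entries of `lam`.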

4.5.2

Independence within one data set

Now we would like to test whether $\mathbf{x}$ has independent components, i.e., whether $\mathbf{R}_{xx}$ is diagonal:

$$
\mathcal{R}_0 = \{\mathbf{R}_{xx} = \operatorname{Diag}(R_{11}, R_{22}, \ldots, R_{nn})\}. \tag{4.105}
$$

The ML estimate of $\mathbf{R}_{xx}$ under $H_0$ is $\hat{\mathbf{R}}_0 = \operatorname{Diag}(S_{11}, S_{22}, \ldots, S_{nn})$, where $S_{ii}$ is the $i$th diagonal element of $\mathbf{S}_{xx}$. The GLR in Result 4.9 is therefore the Hadamard ratio

$$
\ell^{n} = \frac{\det \mathbf{S}_{xx}}{\prod_{i=1}^{n} S_{ii}}. \tag{4.106}
$$

This test is invariant with respect to multiplication with a nonsingular diagonal matrix. That is, $\mathbf{x} \to \mathbf{D}\mathbf{x}$ with nonsingular diagonal $\mathbf{D} \in \mathbb{C}^{n \times n}$ leaves the GLR unchanged. The GLR is therefore a function of a maximal invariant under multiplication with a diagonal matrix. Such a maximal invariant is obtained as follows. Let $\mathbf{Z}_{ii}$ denote the submatrix obtained from rows $i+1$ through $n$ and columns $i+1$ through $n$ of the sample covariance matrix $\mathbf{S}_{xx}$. Let $\mathbf{z}_i$ be the column-vector obtained from rows $i+1$ through $n$ in column $i$ of the sample covariance matrix $\mathbf{S}_{xx}$. For $i = 1, \ldots, n-1$, they can be combined in

$$
\mathbf{Z}_{i-1,i-1} = \begin{bmatrix} S_{ii} & \mathbf{z}_i^H \\ \mathbf{z}_i & \mathbf{Z}_{ii} \end{bmatrix} \tag{4.107}
$$

and $\mathbf{Z}_{00} = \mathbf{S}_{xx}$. Now define

$$
\hat{\beta}_i^2 = \frac{\mathbf{z}_i^H \mathbf{Z}_{ii}^{-1} \mathbf{z}_i}{S_{ii}}, \quad i = 1, \ldots, n-1, \tag{4.108}
$$

$$
\hat{\beta}_n^2 = 0. \tag{4.109}
$$

We notice that $\hat{\beta}_i^2$ is the estimated squared canonical correlation between the $i$th component of $\mathbf{x}$ and components $i+1$ through $n$ of $\mathbf{x}$, and thus $0 \le \hat{\beta}_i^2 \le 1$. It can be shown that the set $\{\hat{\beta}_i^2\}_{i=1}^{n-1}$ is the maximal invariant we are looking for. Indeed, we can express the GLR as

$$
\ell^{n} = \frac{\prod_{i=1}^{n} S_{ii}(1 - \hat{\beta}_i^2)}{\prod_{i=1}^{n} S_{ii}} = \prod_{i=1}^{n-1} (1 - \hat{\beta}_i^2), \tag{4.110}
$$

which also allows a reduced-rank implementation of the GLRT by considering only the $r$ largest estimated canonical correlations $\hat{\beta}_i^2$.
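The equivalence between the Hadamard ratio (4.106) and the product form (4.110) follows from applying the Schur-complement determinant identity to each partition (4.107). A short numerical sketch, assuming NumPy and a positive definite sample covariance, is given below; the function names and the simulated correlated data are hypothetical.

```python
import numpy as np

def hadamard_ratio(S):
    """ell^n = det S / prod_i S_ii, as in (4.106)."""
    return (np.linalg.det(S) / np.prod(np.diag(S))).real

def beta_squared(S):
    """Estimated squared canonical correlations (4.108) for i = 1, ..., n-1."""
    n = S.shape[0]
    betas = []
    for i in range(n - 1):                 # 0-based i here is index i+1 in the text
        z = S[i + 1:, i]                   # z_i: rows below the diagonal in column i
        Z = S[i + 1:, i + 1:]              # Z_ii: trailing principal submatrix
        betas.append((z.conj() @ np.linalg.solve(Z, z)).real / S[i, i].real)
    return np.array(betas)

rng = np.random.default_rng(5)
n, M = 4, 300
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
X = A @ (rng.standard_normal((n, M)) + 1j * rng.standard_normal((n, M)))
Xc = X - X.mean(axis=1, keepdims=True)
S = Xc @ Xc.conj().T / M                   # sample covariance of correlated data

ratio = hadamard_ratio(S)
betas = beta_squared(S)

# Invariance under x -> D x with a nonsingular diagonal D (random, illustrative)
D = np.diag(rng.standard_normal(n) + 1j * rng.standard_normal(n))
ratio_D = hadamard_ratio(D @ S @ D.conj().T)
```

Sorting `betas` and keeping only the largest entries would give the reduced-rank version mentioned above.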

4.5.3

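Independence between two data sets

Now replace $\mathbf{x}$ with the composite vector $[\mathbf{x}^T, \mathbf{y}^T]^T$, where $\mathbf{x}: \Omega \to \mathbb{C}^n$ and $\mathbf{y}: \Omega \to \mathbb{C}^m$, and replace $\mathbf{R}_{xx}$ with the composite covariance matrix of $\mathbf{x}$ and $\mathbf{y}$ (written here with a bold subscript to distinguish it from the cross-covariance block $\mathbf{R}_{xy}$):

$$
\mathbf{R}_{\mathbf{xy}} = \begin{bmatrix} \mathbf{R}_{xx} & \mathbf{R}_{xy} \\ \mathbf{R}_{xy}^H & \mathbf{R}_{yy} \end{bmatrix}. \tag{4.111}
$$

We would like to test whether $\mathbf{x}$ and $\mathbf{y}$ are independent, i.e., whether $\mathbf{x}$ and $\mathbf{y}$ have a block-diagonal composite covariance matrix $\mathbf{R}_{\mathbf{xy}}$ with $\mathbf{R}_{xy} = \mathbf{0}$:

$$
\mathcal{R}_0 = \left\{\mathbf{R}_{\mathbf{xy}} = \begin{bmatrix} \mathbf{R}_{xx} & \mathbf{0} \\ \mathbf{0} & \mathbf{R}_{yy} \end{bmatrix}\right\}. \tag{4.112}
$$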


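The ML estimate of the composite covariance matrix under $H_0$ is

$$
\hat{\mathbf{R}}_0 = \begin{bmatrix} \mathbf{S}_{xx} & \mathbf{0} \\ \mathbf{0} & \mathbf{S}_{yy} \end{bmatrix}. \tag{4.113}
$$

The GLR in Result 4.9 is therefore

$$
\ell^{n} = \det \begin{bmatrix} \mathbf{I} & \mathbf{S}_{xx}^{-1} \mathbf{S}_{xy} \\ \mathbf{S}_{yy}^{-1} \mathbf{S}_{xy}^H & \mathbf{I} \end{bmatrix} = \det\left(\mathbf{I} - \mathbf{S}_{xx}^{-1} \mathbf{S}_{xy} \mathbf{S}_{yy}^{-1} \mathbf{S}_{xy}^H\right). \tag{4.114}
$$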

This test is invariant with respect to multiplication of $\mathbf{x}$ and $\mathbf{y}$, each with a nonsingular matrix. That is, $\mathbf{x} \to \mathbf{N}\mathbf{x}$ and $\mathbf{y} \to \mathbf{M}\mathbf{y}$ for nonsingular $\mathbf{N} \in \mathbb{C}^{n \times n}$ and $\mathbf{M} \in \mathbb{C}^{m \times m}$ leaves the GLR unchanged. A maximal invariant for these invariances is, of course, the set of canonical correlations. Hence, the GLR can also be written as

$$
\ell^{n} = \prod_{i=1}^{p} (1 - \hat{k}_i^2) = 1 - \hat{\rho}_{C_2}^2, \tag{4.115}
$$

where $p = \min(m, n)$, $\{\hat{k}_i\}_{i=1}^{p}$ are the estimated canonical correlations between $\mathbf{x}$ and $\mathbf{y}$, computed from the sample covariances, and $\hat{\rho}_{C_2}^2$ is the estimated correlation coefficient defined in (4.59). This expression allows a reduced-rank implementation of the GLRT by considering only the $r$ largest estimated canonical correlations. The test for impropriety in Section 3.4 is a special case of the test presented here, where $\mathbf{y} = \mathbf{x}^*$. However, the test for impropriety tests for uncorrelatedness between $\mathbf{x}$ and $\mathbf{x}^*$ rather than independence: $\mathbf{x}$ and $\mathbf{x}^*$ cannot be independent because $\mathbf{x}^*$ is perfectly determined by $\mathbf{x}$ through complex conjugation (see Note 7).
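The determinant in (4.114) and the product over estimated canonical correlations in (4.115) can be compared numerically. The sketch below assumes NumPy and uses Cholesky factors as the matrix square roots, which is one admissible choice among several; the helper names and the simulated linear relation between $\mathbf{x}$ and $\mathbf{y}$ are illustrative assumptions.

```python
import numpy as np

def canonical_correlations(Sxx, Sxy, Syy):
    """Singular values of Sxx^{-1/2} Sxy Syy^{-H/2}, with Cholesky square roots."""
    Lx = np.linalg.cholesky(Sxx)               # Sxx = Lx Lx^H
    Ly = np.linalg.cholesky(Syy)               # Syy = Ly Ly^H
    C = np.linalg.solve(Lx, Sxy) @ np.linalg.inv(Ly).conj().T   # Lx^{-1} Sxy Ly^{-H}
    return np.linalg.svd(C, compute_uv=False)

def complex_noise(rng, rows, cols):
    return rng.standard_normal((rows, cols)) + 1j * rng.standard_normal((rows, cols))

rng = np.random.default_rng(7)
n, m, M = 3, 2, 400
x = complex_noise(rng, n, M)
y = complex_noise(rng, m, n) @ x + 0.5 * complex_noise(rng, m, M)   # y depends on x

Z = np.vstack([x, y])
Z = Z - Z.mean(axis=1, keepdims=True)
S = Z @ Z.conj().T / M                         # composite sample covariance
Sxx, Sxy, Syy = S[:n, :n], S[:n, n:], S[n:, n:]

k = canonical_correlations(Sxx, Sxy, Syy)
stat_det = np.linalg.det(
    np.eye(n) - np.linalg.solve(Sxx, Sxy) @ np.linalg.solve(Syy, Sxy.conj().T)
).real                                         # right-hand side of (4.114)
stat_prod = np.prod(1.0 - k ** 2)              # product form in (4.115)
```

Since `np.linalg.svd` returns the singular values in decreasing order, a reduced-rank statistic would use `k[:r]`.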

Notes

1 Since the assessment of multivariate association between random vectors is of importance in many research areas, there is a rich literature on this topic. Canonical correlation analysis is the oldest technique, which was invented by Hotelling (1936). Partial least squares was introduced by Wold (1975) in the field of chemometrics but our description of it follows a different variant suggested by Sampson et al. (1989). Much of this chapter, in particular the discussion of majorization and correlation spread, closely follows Schreier (2008c). This paper is © IEEE, and portions are used with permission. An excellent unifying discussion of correlation analysis techniques with emphasis on their invariance properties is given by Ramsay et al. (1984). Different research areas often use quite different jargon, and we have attempted to use terms that are as generic as possible.

2 To the best of our knowledge, there are not many papers on correlation analysis of complex data. Usually, the analysis of rotational and reflectional correlations proceeds in terms of real data of double dimension rather than complex data, see, for instance, Jupp and Mardia (1980). Hanson et al. (1992) proposed working directly with complex data when analyzing rotational and reflectional dependences.

3 Many definitions of correlation coefficients (see, e.g., Renyi (1959)) require the symmetry $\rho_{xy} = \rho_{yx}$. However, this does not seem like a fundamental requirement. In fact, if the correlation analysis technique itself is not symmetric (as for instance is the case for MLR) it would not be reasonable to ask for symmetry in correlation coefficients based on it.


4 Correlation coefficients based on CCA were discussed by Rozeboom (1965), Coxhead (1974), Yanai (1974), and Cramer and Nicewander (1979).

5 The redundancy index, which is based on MLR, was analyzed by Stewart and Love (1968) and Gleason (1976).

6 Section 4.5 draws on material from Mardia et al. (1979). A general introduction to likelihood-ratio tests is provided in Section 7.1.

7 Much more can be done in Section 4.5. There are the obvious extensions to augmented vectors and widely linear transformations. The results of Sections 4.5.2 and 4.5.3 can be combined to test for block-diagonal structure with more than two diagonal blocks. Then hypothesis tests that have the invariance properties of half-canonical correlations and partial least squares can be developed, for which half-canonical correlations and partial least-squares correlations will be the maximal invariants. Finally, we have completely ignored the distributional properties of the GLRT statistics. The moments of the statistic in (4.114) are well known, cf. Mardia et al. (1979) and Lehmann and Romano (2005). The latter reference provides more background on the rich topic of hypothesis testing.

5

Estimation

One of the most important applications of probability in science and engineering is to the theory of statistical inference, wherein the problem is to draw defensible conclusions from experimental evidence. The three main branches of statistical inference are parameter estimation, hypothesis testing, and time-series analysis. Or, as we say in the engineering sciences, the three main branches of statistical signal processing are estimation, detection, and signal analysis. A common problem is to estimate the value of a parameter, or vector of parameters, from a sequence of measurements. The underlying probability law that governs the generation of the measurements depends on the parameter. Engineering language would say that a source of information, loosely speaking, generates a signal x and a channel carries this information in a measurement y, whose probability law p(y|x) depends on the signal. There is usually little controversy over this aspect of the problem because the measurement scheme generally determines the probability law. There is, however, a philosophical divide about the modeling of the signal x. Frequentists adopt the point of view that to assign a probability law to the signal assumes too much. They argue that the signal should be treated as an unknown constant and the data should be allowed to speak for itself. Bayesians argue that the signal should be treated as a random variable whose prior probability distribution is to be updated to a posterior distribution as measurements are made. In the realm of exploratory data analysis and in the absence of a physical model for the experiment or measurement scheme, the difference between these points of view is profound. But, for most problems of parameter estimation and hypothesis testing in the engineering and applied sciences, the question resolves itself on the basis of physical reasoning, either theoretical or empirical. 
For example, in radar, sonar, and geophysical imaging, the question of estimating a signal, or resolving a hypothesis about the signal, usually proceeds without the assignment of a probability law or the assignment of a prior probability to the hypothesis that a nonzero signal will be returned. The point of view is decidedly frequentist. On the other hand, in data communication, where the problem is to determine which signal from a set of signals was transmitted, it is quite appropriate to assign prior probabilities to elements of the set. The appropriate view is Bayesian. Finally, in most problems of engineering and applied science, measurements are plentiful by statistical standards, and the weight of experimental evidence overwhelms the prior model, making the practical distinction between the frequentist and the Bayesian a philosophical one of marginal practical effect.


In this chapter on estimation, we shall be interested, by and large, in linear and widely linear least-squares problems, wherein parameter estimators are constrained to be linear or widely linear in the measurement and the performance criterion is mean-squared error or squared error under a constraint. The estimators we compute are then linear or widely linear minimum mean-squared error (LMMSE or WLMMSE) estimators. These estimators strike a balance between frequentism and Bayesianism by using only second-order statistical models for signal and noise. When the underlying joint distribution between the measurement y and the signal x is multivariate Gaussian, these estimators are also conditional mean and maximum a-posteriori likelihood estimators that would appeal to a Bayesian. No matter what your point of view is, the resulting estimators exploit all of the Hermitian and complementary correlation that can be exploited, within the measurements, within the signals, and between the signals and measurements. As a prelude to our development of estimators, Section 5.1 establishes a Hilbert-space geometry for augmented random variables that will be central to our reasoning about WLMMSE estimators. We then review a few fundamental results for MMSE estimation in Section 5.2, without the constraint that the estimator be linear or widely linear. From there we proceed to the development of linear channel and filtering models, since these form the foundation for LMMSE estimation, which is discussed in Section 5.3. Most of these results translate to widely linear estimation in a straightforward fashion, but it is still worthwhile to point out a few peculiarities of WLMMSE estimation and compare it with LMMSE estimation, which is done in Section 5.4. Section 5.5 considers reduced-rank widely linear estimators, which either minimize mean-squared error or maximize information rate. 
We derive linear and widely linear minimum-variance distortionless response (MVDR) receivers in Section 5.6. These estimators assign no statistical model to the signal, and therefore appeal to the frequentist. Finally, Section 5.7 presents widely linear-quadratic estimators.

5.1

Hilbert-space geometry of second-order random variables

The Hilbert-space geometry of random variables is based on second-order moment properties only. Why should such a geometry be important in applications? The obvious answer is that the dimension of signals to be estimated is generally small and the dimension of measurements large, so that quality estimates can be obtained from sample means and generalizations of these that are based on correlation structure. The subtle answer is that the geometry of second-order random variables enables us to exploit a comfortable imagery and intuition based on our familiarity with Euclidean geometry. An algebraic approach to Euclidean geometry, often called analytic geometry, is founded on the concept of an inner product between two vectors. To exploit this point of view, we will have to think of a random variable as a vector. Then norm, distance, and orthogonality can be defined in terms of inner products.

A random experiment is modeled by the sample space $\Omega$, where each point $\omega \in \Omega$ corresponds to a possible outcome, and a probability measure $P$ on $\Omega$. The probability measure associates with each event $A$ (a measurable subset of $\Omega$) a probability $P(A)$. A random variable $x$ is a measurable function $x: \Omega \to \mathbb{C}$, which assigns each outcome $\omega \in \Omega$ a complex number $x$. The set of second-order random variables (for which $E|x|^2 < \infty$) forms a Hilbert space with respect to the inner product defined as

$$
\langle x, y \rangle = E(x^* y) = \int x^*(\omega)\, y(\omega)\, dP(\omega), \tag{5.1}
$$

which is the correlation between $x$ and $y$. We may thus identify the space of second-order random variables with the space of square-integrable functions $L^2(\Omega, P)$. This space is closed under addition and multiplication by complex scalars. From this inner product, we obtain the following geometrical quantities:

• the norm (length) of a random variable $x$: $\|x\| = \sqrt{\langle x, x \rangle} = \sqrt{E|x|^2}$,
• the distance between two random variables $x$ and $y$: $\|x - y\| = \sqrt{\langle x - y, x - y \rangle} = \sqrt{E|x - y|^2}$,
• the angle between two random variables $x$ and $y$: $\cos^2 \alpha = \dfrac{|\langle x, y \rangle|^2}{\|x\|^2 \|y\|^2} = \dfrac{|E(x^* y)|^2}{E|x|^2\, E|y|^2}$,
• orthogonality: $x$ and $y$ are orthogonal if $\langle x, y \rangle = E(x^* y) = 0$,
• the Cauchy–Schwarz inequality: $|\langle x, y \rangle| \le \|x\| \|y\|$, i.e., $|E(x^* y)| \le \sqrt{E|x|^2\, E|y|^2}$.

The random variable $x$ may be replaced by the random variable $x^*$, in which case the inner product is the complementary correlation

$$
\langle x^*, y \rangle = E(x y) = \int x(\omega)\, y(\omega)\, dP(\omega). \tag{5.2}
$$

This inner product is nonzero if and only if $x$ and $y$ are cross-improper. For $y = x$ the complementary correlation is $\langle x^*, x \rangle = E x^2$, which is nonzero if and only if $x$ is improper. It follows from the Cauchy–Schwarz inequality that $|E x^2| \le E|x|^2$.

In the preceding chapters we have worked with augmented random variables $\underline{\mathbf{x}} = [x, x^*]^T$, which are measurable functions $\underline{\mathbf{x}}: \Omega \to \mathbb{C}^2_*$. The vector space $\mathbb{C}^2_*$ contains all complex pairs of the form $[x, x^*]^T$. It is a real linear (or complex widely linear) subspace of $\mathbb{C}^2$ because it is closed under addition and multiplication with a real scalar, but not with a complex scalar. The inner product for augmented random variables is

$$
\langle \underline{\mathbf{x}}, \underline{\mathbf{y}} \rangle = E(\underline{\mathbf{x}}^H \underline{\mathbf{y}}) = 2 \operatorname{Re} E(x^* y) = 2 \operatorname{Re} \langle x, y \rangle, \tag{5.3}
$$

which amounts to taking (twice) the real part of the inner product between $x$ and $y$. In this book, we use all three inner products, $\langle x, y \rangle$, $\langle x^*, y \rangle$, and $\langle \underline{\mathbf{x}}, \underline{\mathbf{y}} \rangle$.

Example 5.1. Let $x$ be a real random variable and $y$ a purely imaginary random variable. The inner product $\langle x, y \rangle$ is generally nonzero, but it must be purely imaginary. Therefore, $\langle \underline{\mathbf{x}}, \underline{\mathbf{y}} \rangle = 2 \operatorname{Re} \langle x, y \rangle = 0$. This means that $\underline{\mathbf{x}}$ and $\underline{\mathbf{y}}$ are orthogonal even though $x$ and $y$ might not be. If $z = x + y$, we may perfectly reconstruct $x$ from $z$. This, however, requires the widely linear operation $x = \operatorname{Re} z$, which assumes that we know the phase of $z$. It is not generally possible to perfectly reconstruct $x$ from $z$ using strictly linear operations.

The preceding generalizes to random vectors $\mathbf{x} = [x_1, x_2, \ldots, x_n]^T: \Omega \to \mathbb{C}^n$ and $\underline{\mathbf{x}} = [\mathbf{x}^T, \mathbf{x}^H]^T: \Omega \to \mathbb{C}^{2n}_*$. Let $\mathcal{H}(\underline{\mathbf{x}})$ denote the Hilbert space of second-order augmented random vectors $\underline{\mathbf{x}}$. The Euclidean view characterizes the angle between spanning vectors in terms of a Grammian matrix, consisting of all inner products between vectors. In the Hilbert space $\mathcal{H}(\underline{\mathbf{x}})$, this Grammian matrix consists of all inner products of the form $E(x_i^* x_j)$ and $E(x_i x_j)$. We have come to know this Grammian matrix as the augmented correlation matrix of $\mathbf{x}$:

$$
\underline{\mathbf{R}}_{xx} = E(\underline{\mathbf{x}}\, \underline{\mathbf{x}}^H) = E\left(\begin{bmatrix} \mathbf{x} \\ \mathbf{x}^* \end{bmatrix} \begin{bmatrix} \mathbf{x}^H & \mathbf{x}^T \end{bmatrix}\right) = \begin{bmatrix} \mathbf{R}_{xx} & \tilde{\mathbf{R}}_{xx} \\ \tilde{\mathbf{R}}_{xx}^* & \mathbf{R}_{xx}^* \end{bmatrix}. \tag{5.4}
$$

It is commonplace in adaptive systems to use a vector of $N$ realizations $\mathbf{s}_i \in \mathbb{C}^N$ of a random variable $x_i$ to estimate the Hilbert-space inner products $E(x_i^* x_j)$ and $E(x_i x_j)$ by the Euclidean inner products $N^{-1} \mathbf{s}_i^H \mathbf{s}_j$ and $N^{-1} \mathbf{s}_i^T \mathbf{s}_j$.

There is one more connection to be made with Euclidean spaces. Given two sets of (deterministic) spanning vectors $\mathbf{S} = [\mathbf{s}_1, \mathbf{s}_2, \ldots, \mathbf{s}_n]$ and $\mathbf{R} = [\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_n]$, we may define the cosines of the principal angles between their linear subspaces $\langle \mathbf{S} \rangle$ and $\langle \mathbf{R} \rangle$ as the singular values of the matrix $(\mathbf{S}^H \mathbf{S})^{-1/2} \mathbf{S}^H \mathbf{R} (\mathbf{R}^H \mathbf{R})^{-H/2}$. In a similar way we define the cosines of the principal angles between the subspaces $\mathcal{H}(\mathbf{x})$ and $\mathcal{H}(\mathbf{y})$ as the singular values of the matrix $\mathbf{R}_{xx}^{-1/2} \mathbf{R}_{xy} \mathbf{R}_{yy}^{-H/2}$, which are called the canonical correlations between $\mathbf{x}$ and $\mathbf{y}$. They were introduced in Section 4.1. We will use them again in Section 5.5 to develop reduced-rank linear estimators.
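Example 5.1 and the sample estimates of the inner products can be illustrated with simulated realizations. The following sketch assumes NumPy; the function names `inner`, `inner_compl`, and `inner_aug`, as well as the particular correlated pair $x$, $y$, are hypothetical choices made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 100_000
u = rng.standard_normal(N)
x = u.astype(complex)                                  # real random variable
y = 1j * (0.8 * u + 0.6 * rng.standard_normal(N))      # purely imaginary, correlated with x

def inner(a, b):
    """<a, b> = E(a^* b), estimated by N^{-1} a^H b."""
    return np.vdot(a, b) / a.size                      # np.vdot conjugates its first argument

def inner_compl(a, b):
    """<a^*, b> = E(a b), estimated by N^{-1} a^T b."""
    return np.dot(a, b) / a.size

def inner_aug(a, b):
    """Augmented inner product (5.3): 2 Re <a, b>."""
    return 2.0 * inner(a, b).real

ip = inner(x, y)            # purely imaginary and nonzero (close to 0.8j here)
ip_aug = inner_aug(x, y)    # zero: the augmented variables are orthogonal

z = x + y
x_rec = z.real              # widely linear reconstruction x = Re z
```

The estimate `ip` is purely imaginary, so `ip_aug` vanishes, and taking the real part of `z` recovers `x` exactly, as claimed in the example.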

5.2

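Minimum mean-squared error estimation

The general problem we address here is the estimation of a complex random vector $\mathbf{x}$ from a complex random vector $\mathbf{y}$. As usual we think of $\mathbf{x}$ as the signal or source, and $\mathbf{y}$ as the measurement or observation. Presumably, $\mathbf{y}$ carries information about $\mathbf{x}$ that can be extracted with an estimation algorithm. We begin with the consideration of an arbitrary estimator $\hat{\mathbf{x}}$, which is a function of $\mathbf{y}$, and its mean-squared error matrix $\mathbf{Q} = E[(\hat{\mathbf{x}} - \mathbf{x})(\hat{\mathbf{x}} - \mathbf{x})^H]$. We first expand the expression for $\mathbf{Q}$ using the conditional mean $E[\mathbf{x}|\mathbf{y}]$, which is itself a random vector since it depends on $\mathbf{y}$, as

$$
\mathbf{Q} = E[(\hat{\mathbf{x}} - E[\mathbf{x}|\mathbf{y}] + E[\mathbf{x}|\mathbf{y}] - \mathbf{x})(\hat{\mathbf{x}} - E[\mathbf{x}|\mathbf{y}] + E[\mathbf{x}|\mathbf{y}] - \mathbf{x})^H], \tag{5.5}
$$

where the outer $E$ stands for expectation over the joint distribution of $\mathbf{x}$ and $\mathbf{y}$. We may then write

$$
\begin{align}
\mathbf{Q} = {} & E[(E[\mathbf{x}|\mathbf{y}] - \mathbf{x})(E[\mathbf{x}|\mathbf{y}] - \mathbf{x})^H] + E[(\hat{\mathbf{x}} - E[\mathbf{x}|\mathbf{y}])(E[\mathbf{x}|\mathbf{y}] - \mathbf{x})^H] \notag \\
& + E[(E[\mathbf{x}|\mathbf{y}] - \mathbf{x})(\hat{\mathbf{x}} - E[\mathbf{x}|\mathbf{y}])^H] + E[(\hat{\mathbf{x}} - E[\mathbf{x}|\mathbf{y}])(\hat{\mathbf{x}} - E[\mathbf{x}|\mathbf{y}])^H]. \tag{5.6}
\end{align}
$$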


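The third term is of the form $E[(E[\mathbf{x}|\mathbf{y}] - \mathbf{x})\mathbf{g}^H(\mathbf{y})]$ with $\mathbf{g}(\mathbf{y}) = \hat{\mathbf{x}} - E[\mathbf{x}|\mathbf{y}]$. Using the law of total expectation, we see that this term vanishes, because, conditioned on $\mathbf{y}$, the factor $\mathbf{g}(\mathbf{y})$ is fixed and $E[\mathbf{x}|\mathbf{y}] - \mathbf{x}$ has zero conditional mean:

$$
E\{E[(E[\mathbf{x}|\mathbf{y}] - \mathbf{x})\mathbf{g}^H(\mathbf{y}) \,|\, \mathbf{y}]\} = \mathbf{0}. \tag{5.7}
$$

The same reasoning can be applied to conclude that the second term in (5.6) is also zero. Therefore, the optimum estimator, obtained by making the fourth term in (5.6) equal to zero, turns out to be the conditional mean estimator $\hat{\mathbf{x}} = E[\mathbf{x}|\mathbf{y}]$. Let

$$
\mathbf{e} = \hat{\mathbf{x}} - \mathbf{x} = E[\mathbf{x}|\mathbf{y}] - \mathbf{x} \tag{5.8}
$$

be the error vector. Its mean $E[\mathbf{e}] = \mathbf{0}$, and thus $E[\hat{\mathbf{x}}] = E[\mathbf{x}]$. This says that $\hat{\mathbf{x}}$ is an unbiased estimator of $\mathbf{x}$. The covariance matrix of the error vector is

$$
\mathbf{Q} = E[\mathbf{e}\mathbf{e}^H] = E[(E[\mathbf{x}|\mathbf{y}] - \mathbf{x})(E[\mathbf{x}|\mathbf{y}] - \mathbf{x})^H]. \tag{5.9}
$$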

Any competing estimator $\hat{\mathbf{x}}'$ with mean-squared error matrix $\mathbf{Q}' = E[\mathbf{e}'\mathbf{e}'^H] = E[(\hat{\mathbf{x}}' - \mathbf{x})(\hat{\mathbf{x}}' - \mathbf{x})^H]$ will be suboptimum in the sense that $\mathbf{Q}' \ge \mathbf{Q}$, meaning that $\mathbf{Q}' - \mathbf{Q}$ is positive semidefinite. As a consequence, the conditional mean estimator is a minimum mean-squared error (MMSE) estimator:

$$
E\|\mathbf{e}\|^2 = \operatorname{tr} \mathbf{Q} \le \operatorname{tr} \mathbf{Q}' = E\|\mathbf{e}'\|^2. \tag{5.10}
$$

But we can say more. For the class of real-valued increasing functions of matrices, $\mathbf{Q} \le \mathbf{Q}'$ implies $f(\mathbf{Q}) \le f(\mathbf{Q}')$. Thus we have the following result.

Result 5.1. The conditional mean estimator $E[\mathbf{x}|\mathbf{y}]$ minimizes (maximizes) any increasing (decreasing) function of the mean-squared error matrix $\mathbf{Q}$.

Besides the trace, the determinant is another example of an increasing function. Thus, the volume of the error covariance ellipsoid is also minimized:

$$
\det \mathbf{Q} \le \det \mathbf{Q}'. \tag{5.11}
$$

A very important property of the MMSE estimator is the so-called orthogonality principle, a special case of which we have already encountered in the derivation (5.7). However, it holds much more generally.

Result 5.2. The error vector $\mathbf{e}$ is orthogonal to every measurable function of $\mathbf{y}$, $\mathbf{g}(\mathbf{y})$. That is, $E[\mathbf{e}\mathbf{g}^H(\mathbf{y})] = E\{(E[\mathbf{x}|\mathbf{y}] - \mathbf{x})\mathbf{g}^H(\mathbf{y})\} = \mathbf{0}$.

For $\mathbf{g}(\mathbf{y}) = E[\mathbf{x}|\mathbf{y}]$, this orthogonality condition says that the estimator error $\mathbf{e}$ is orthogonal to the estimator $E[\mathbf{x}|\mathbf{y}]$. Moreover, since the conditional mean estimator is an idempotent operator $\mathbf{P}\mathbf{x} = E[\mathbf{x}|\mathbf{y}]$, which is to say that $\mathbf{P}^2\mathbf{x} = E\{E[\mathbf{x}|\mathbf{y}]|\mathbf{y}\} = E[\mathbf{x}|\mathbf{y}] = \mathbf{P}\mathbf{x}$, we may think of the conditional mean estimator as a projection operator that orthogonally resolves $\mathbf{x}$ into its estimator $E[\mathbf{x}|\mathbf{y}]$ minus its estimator error $\mathbf{e}$:

$$
\mathbf{x} = E[\mathbf{x}|\mathbf{y}] - (E[\mathbf{x}|\mathbf{y}] - \mathbf{x}). \tag{5.12}
$$

The conditional mean estimator summarizes everything of use for MMSE estimation of $\mathbf{x}$ from $\mathbf{y}$. This means conditioning $\mathbf{x}$ on $\mathbf{y}$ and $\mathbf{y}^*$ would change nothing, since this has already been done, so to speak, in the conditional mean $E[\mathbf{x}|\mathbf{y}]$. Similarly the conditional mean estimator of $\mathbf{x}^*$ is just the complex conjugate of the conditional mean estimator of $\mathbf{x}$. Thus it makes no sense to consider the estimation of $\mathbf{x}$ from $\mathbf{y}$ and $\mathbf{y}^*$, since this brings no refinement to the conditional mean estimator. As we shall see, this statement is decidedly false when the estimator is constrained to be a (widely) linear or (widely) linear-quadratic estimator.
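A small Monte Carlo experiment illustrates Results 5.1 and 5.2. The sketch below assumes NumPy and a deliberately simple real-valued model that is not taken from the text: a binary signal $x \in \{-1, +1\}$ with equal probabilities observed in additive Gaussian noise of variance `sigma2`, for which the conditional mean is $E[x|y] = \tanh(y/\sigma^2)$. It is compared with the best linear estimator $y/(1 + \sigma^2)$.

```python
import numpy as np

rng = np.random.default_rng(4)
N, sigma2 = 200_000, 1.0

# Assumed toy model: equiprobable binary signal in real Gaussian noise
x = rng.choice([-1.0, 1.0], size=N)
y = x + np.sqrt(sigma2) * rng.standard_normal(N)

x_cm = np.tanh(y / sigma2)         # conditional mean E[x|y] for this model
x_lin = y / (1.0 + sigma2)         # linear MMSE estimator, W = Rxy / Ryy

mse_cm = np.mean((x_cm - x) ** 2)
mse_lin = np.mean((x_lin - x) ** 2)

e = x_cm - x                       # error of the conditional mean estimator
bias = np.mean(e)                  # sample estimate of E[e]
orth = np.mean(e * np.sin(y))      # sample estimate of E[e g(y)] with g(y) = sin(y)
```

For this model the conditional mean is nonlinear in the measurement, so the linear estimator is strictly worse, while the error `e` is (up to sampling fluctuations) uncorrelated with the arbitrary function `sin(y)` of the measurement.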

5.3

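Linear MMSE estimation

We shall treat the problem of linearly or widely linearly estimating a signal from a measurement as a virtual two-channel estimation problem, wherein the underlying experiment consists of generating a composite vector of signal and measurement, only one of which is observed (see Note 1). Our point of view is that the augmented covariance matrix for this composite vector then encapsulates all of the second-order information that can be extracted. In order to introduce our methods, we shall begin here with the Hermitian case, wherein complementary covariances are ignored, and then extend our methods to include Hermitian and complementary covariances in the next section.

Let us begin with a signal (or source) $\mathbf{x}: \Omega \to \mathbb{C}^n$ and a measurement vector $\mathbf{y}: \Omega \to \mathbb{C}^m$. There is no requirement that the signal dimension $n$ be smaller than the measurement dimension $m$. We first assume that the signal and measurement have zero mean, but remove this restriction later on. Their composite covariance matrix (written here with a bold subscript to distinguish it from the cross-covariance block $\mathbf{R}_{xy}$) is the matrix

$$
\mathbf{R}_{\mathbf{xy}} = E\left(\begin{bmatrix} \mathbf{x} \\ \mathbf{y} \end{bmatrix} \begin{bmatrix} \mathbf{x}^H & \mathbf{y}^H \end{bmatrix}\right) = \begin{bmatrix} \mathbf{R}_{xx} & \mathbf{R}_{xy} \\ \mathbf{R}_{xy}^H & \mathbf{R}_{yy} \end{bmatrix}. \tag{5.13}
$$

The error between the signal $\mathbf{x}$ and the linear estimator $\hat{\mathbf{x}} = \mathbf{W}\mathbf{y}$ is $\mathbf{e} = \hat{\mathbf{x}} - \mathbf{x}$ and the error covariance matrix is $\mathbf{Q} = E[(\hat{\mathbf{x}} - \mathbf{x})(\hat{\mathbf{x}} - \mathbf{x})^H]$. This error covariance is

$$
\mathbf{Q} = E[(\mathbf{W}\mathbf{y} - \mathbf{x})(\mathbf{W}\mathbf{y} - \mathbf{x})^H] = \mathbf{W}\mathbf{R}_{yy}\mathbf{W}^H - \mathbf{R}_{xy}\mathbf{W}^H - \mathbf{W}\mathbf{R}_{xy}^H + \mathbf{R}_{xx}. \tag{5.14}
$$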

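After completing the square, this may be written

$$
\mathbf{Q} = \mathbf{R}_{xx} - \mathbf{R}_{xy}\mathbf{R}_{yy}^{-1}\mathbf{R}_{xy}^H + (\mathbf{W} - \mathbf{R}_{xy}\mathbf{R}_{yy}^{-1})\mathbf{R}_{yy}(\mathbf{W} - \mathbf{R}_{xy}\mathbf{R}_{yy}^{-1})^H. \tag{5.15}
$$

This quadratic form in $\mathbf{W}$ is positive semidefinite, so $\mathbf{Q} \ge \mathbf{R}_{xx} - \mathbf{R}_{xy}\mathbf{R}_{yy}^{-1}\mathbf{R}_{xy}^H$ with equality for

$$
\mathbf{W} = \mathbf{R}_{xy}\mathbf{R}_{yy}^{-1} \quad \text{and} \quad \mathbf{Q} = \mathbf{R}_{xx} - \mathbf{R}_{xy}\mathbf{R}_{yy}^{-1}\mathbf{R}_{xy}^H. \tag{5.16}
$$

The solution for $\mathbf{W}$ may be written as the solution to the normal equations $\mathbf{W}\mathbf{R}_{yy} - \mathbf{R}_{xy} = \mathbf{0}$, or more insightfully as (see Note 2)

$$
E[(\mathbf{W}\mathbf{y} - \mathbf{x})\mathbf{y}^H] = \mathbf{0}. \tag{5.17}
$$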

Thus the orthogonality principle is at work: the estimator error $\mathbf{e} = \mathbf{W}\mathbf{y} - \mathbf{x}$ is orthogonal to the measurement $\mathbf{y}$, as illustrated in Fig. 5.1. There $\mathcal{H}(\mathbf{y})$ denotes the Hilbert space of measurement vectors $\mathbf{y}$ and orthogonality means $E[(\mathbf{W}\mathbf{y} - \mathbf{x})_i\, y_j^*] = 0$ for all pairs $(i, j)$. Moreover, the linear operator $\mathbf{W}$ is a projection operator, since the LMMSE estimator of $\mathbf{x}$ from $\mathbf{W}\mathbf{y}$ remains $\mathbf{W}\mathbf{y}$. The LMMSE estimator is sometimes referred to as the (discrete) Wiener filter, which explains our choice of notation $\mathbf{W}$. However, the LMMSE estimator predates Wiener's work on causal LMMSE prediction and smoothing of time series.

Figure 5.1 Orthogonality between the error $\mathbf{e} = \mathbf{W}\mathbf{y} - \mathbf{x}$ and $\mathcal{H}(\mathbf{y})$ in LMMSE estimation.

Figure 5.2 The signal-plus-noise channel model: (a) channel, (b) synthesis, and (c) analysis.
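The LMMSE solution (5.16), the normal equations, and the matrix inequality implied by (5.15) can be verified on a randomly generated composite covariance matrix. This is a sketch assuming NumPy; the helper names `lmmse` and `error_cov` and the random test matrix are illustrative assumptions.

```python
import numpy as np

def lmmse(Rxx, Rxy, Ryy):
    """W = Rxy Ryy^{-1} and Q = Rxx - Rxy Ryy^{-1} Rxy^H, as in (5.16)."""
    # Solve W Ryy = Rxy through its conjugate transpose Ryy^H W^H = Rxy^H
    W = np.linalg.solve(Ryy.conj().T, Rxy.conj().T).conj().T
    Q = Rxx - W @ Rxy.conj().T
    return W, Q

rng = np.random.default_rng(6)
n, m = 2, 3
A = rng.standard_normal((n + m, n + m)) + 1j * rng.standard_normal((n + m, n + m))
R = A @ A.conj().T + np.eye(n + m)         # random positive definite composite covariance
Rxx, Rxy, Ryy = R[:n, :n], R[:n, n:], R[n:, n:]

W, Q = lmmse(Rxx, Rxy, Ryy)

def error_cov(Wp):
    """Error covariance (5.14) of an arbitrary linear estimator Wp y."""
    return Wp @ Ryy @ Wp.conj().T - Rxy @ Wp.conj().T - Wp @ Rxy.conj().T + Rxx

# Any other filter has an error covariance that exceeds Q by a positive semidefinite matrix
W_other = W + 0.1 * (rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m)))
excess = error_cov(W_other) - Q
min_excess_eig = np.linalg.eigvalsh(excess).min()
```

Using `np.linalg.solve` rather than forming the explicit inverse of `Ryy` is a common numerical choice; it is not something prescribed by the text.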

5.3.1

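The signal-plus-noise channel model

We intend to argue that it is as if the source and measurement vectors were drawn from a linear channel model $\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{n}$ that draws a source vector $\mathbf{x}$, linearly filters the source vector with a channel filter $\mathbf{H}$, and adds an uncorrelated noise vector $\mathbf{n}$ to produce the measurement vector. This scheme is illustrated with the signal-plus-noise channel model of Fig. 5.2(a). According to this channel model, the source and measurement have the synthesis and analysis representations of Figs. 5.2(b) and (c):

$$
\begin{bmatrix} \mathbf{x} \\ \mathbf{y} \end{bmatrix} = \begin{bmatrix} \mathbf{I} & \mathbf{0} \\ \mathbf{H} & \mathbf{I} \end{bmatrix} \begin{bmatrix} \mathbf{x} \\ \mathbf{n} \end{bmatrix}, \tag{5.18}
$$

$$
\begin{bmatrix} \mathbf{x} \\ \mathbf{n} \end{bmatrix} = \begin{bmatrix} \mathbf{I} & \mathbf{0} \\ -\mathbf{H} & \mathbf{I} \end{bmatrix} \begin{bmatrix} \mathbf{x} \\ \mathbf{y} \end{bmatrix}. \tag{5.19}
$$

The synthesis model of Fig. 5.2(b) and the analysis model of Fig. 5.2(c) produce these block LDU (Lower triangular–Diagonal–Upper triangular) Cholesky factorizations:

$$
\mathbf{R}_{\mathbf{xy}} = \begin{bmatrix} \mathbf{R}_{xx} & \mathbf{R}_{xy} \\ \mathbf{R}_{xy}^H & \mathbf{R}_{yy} \end{bmatrix} = \begin{bmatrix} \mathbf{I} & \mathbf{0} \\ \mathbf{H} & \mathbf{I} \end{bmatrix} \begin{bmatrix} \mathbf{R}_{xx} & \mathbf{0} \\ \mathbf{0} & \mathbf{R}_{nn} \end{bmatrix} \begin{bmatrix} \mathbf{I} & \mathbf{H}^H \\ \mathbf{0} & \mathbf{I} \end{bmatrix}, \tag{5.20}
$$

$$
\begin{bmatrix} \mathbf{R}_{xx} & \mathbf{0} \\ \mathbf{0} & \mathbf{R}_{nn} \end{bmatrix} = \begin{bmatrix} \mathbf{I} & \mathbf{0} \\ -\mathbf{H} & \mathbf{I} \end{bmatrix} \begin{bmatrix} \mathbf{R}_{xx} & \mathbf{R}_{xy} \\ \mathbf{R}_{xy}^H & \mathbf{R}_{yy} \end{bmatrix} \begin{bmatrix} \mathbf{I} & -\mathbf{H}^H \\ \mathbf{0} & \mathbf{I} \end{bmatrix}. \tag{5.21}
$$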


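For this channel model and its corresponding analysis and synthesis models to work, we must choose

$$
\mathbf{H} = \mathbf{R}_{xy}^H \mathbf{R}_{xx}^{-1} \quad \text{and} \quad \mathbf{R}_{nn} = \mathbf{R}_{yy} - \mathbf{R}_{xy}^H \mathbf{R}_{xx}^{-1} \mathbf{R}_{xy}. \tag{5.22}
$$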

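The noise covariance matrix $\mathbf{R}_{nn}$ is the Schur complement of $\mathbf{R}_{xx}$ within the composite covariance matrix $\mathbf{R}_{\mathbf{xy}}$. Thus, up to second order, every virtual two-channel estimation problem is a problem of representing the measurement as a noisy and linearly filtered version of the signal, which is to say that the LDU Cholesky factorization of the composite covariance matrix $\mathbf{R}_{\mathbf{xy}}$ has actually produced a channel model, a synthesis model, and an analysis model for the source and measurement. From these block Cholesky factorizations we can also extract block UDL (Upper triangular–Diagonal–Lower triangular) Cholesky factorizations of inverses:

$$
\mathbf{R}_{\mathbf{xy}}^{-1} = \begin{bmatrix} \mathbf{R}_{xx} & \mathbf{R}_{xy} \\ \mathbf{R}_{xy}^H & \mathbf{R}_{yy} \end{bmatrix}^{-1} = \begin{bmatrix} \mathbf{I} & -\mathbf{H}^H \\ \mathbf{0} & \mathbf{I} \end{bmatrix} \begin{bmatrix} \mathbf{R}_{xx}^{-1} & \mathbf{0} \\ \mathbf{0} & \mathbf{R}_{nn}^{-1} \end{bmatrix} \begin{bmatrix} \mathbf{I} & \mathbf{0} \\ -\mathbf{H} & \mathbf{I} \end{bmatrix}, \tag{5.23}
$$

$$
\begin{bmatrix} \mathbf{R}_{xx}^{-1} & \mathbf{0} \\ \mathbf{0} & \mathbf{R}_{nn}^{-1} \end{bmatrix} = \begin{bmatrix} \mathbf{I} & \mathbf{H}^H \\ \mathbf{0} & \mathbf{I} \end{bmatrix} \begin{bmatrix} \mathbf{R}_{xx} & \mathbf{R}_{xy} \\ \mathbf{R}_{xy}^H & \mathbf{R}_{yy} \end{bmatrix}^{-1} \begin{bmatrix} \mathbf{I} & \mathbf{0} \\ \mathbf{H} & \mathbf{I} \end{bmatrix}. \tag{5.24}
$$

It follows that $\mathbf{R}_{nn}^{-1}$ is the southeast block of $\mathbf{R}_{\mathbf{xy}}^{-1}$. The block Cholesky factors, in turn, produce these factorizations of determinants:

$$
\det \mathbf{R}_{\mathbf{xy}} = \det \mathbf{R}_{xx} \det \mathbf{R}_{nn}, \tag{5.25}
$$

$$
\det \mathbf{R}_{\mathbf{xy}}^{-1} = \det \mathbf{R}_{xx}^{-1} \det \mathbf{R}_{nn}^{-1}. \tag{5.26}
$$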

Let us summarize. From the synthesis model we say that every composite source and measurement vector $[\mathbf{x}^T, \mathbf{y}^T]^T$ may be modeled as if it were synthesized in a virtual two-channel experiment, wherein the source is the unobserved channel and the measurement is the observed channel $\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{n}$. Of course, the filter and the noise covariance must be chosen just right. This model then decomposes the composite covariance matrix for the source and measurement, and its inverse, into block Cholesky factors.
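The choices in (5.22) and the resulting factorizations can be checked numerically for any positive definite composite covariance matrix. The sketch below assumes NumPy; the function name `signal_plus_noise_model` and the random test matrix are hypothetical.

```python
import numpy as np

def signal_plus_noise_model(Rxx, Rxy, Ryy):
    """Channel filter H and noise covariance Rnn of (5.22)."""
    H = np.linalg.solve(Rxx, Rxy).conj().T     # H = Rxy^H Rxx^{-1} (Rxx is Hermitian)
    Rnn = Ryy - H @ Rxy                        # Schur complement of Rxx
    return H, Rnn

rng = np.random.default_rng(8)
n, m = 2, 3
A = rng.standard_normal((n + m, n + m)) + 1j * rng.standard_normal((n + m, n + m))
R = A @ A.conj().T + np.eye(n + m)             # composite covariance (illustrative)
Rxx, Rxy, Ryy = R[:n, :n], R[:n, n:], R[n:, n:]

H, Rnn = signal_plus_noise_model(Rxx, Rxy, Ryy)

# Block LDU factorization (5.20): R = L D L^H
L = np.block([[np.eye(n), np.zeros((n, m))], [H, np.eye(m)]])
D = np.block([[Rxx, np.zeros((n, m))], [np.zeros((m, n)), Rnn]])
R_rebuilt = L @ D @ L.conj().T

R_inv = np.linalg.inv(R)
southeast = R_inv[n:, n:]                      # should equal Rnn^{-1}
det_R = np.linalg.det(R)
det_factored = np.linalg.det(Rxx) * np.linalg.det(Rnn)   # (5.25)
```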

5.3.2

The measurement-plus-error channel model We now interchange the roles of the signal and measurement in order to obtain the measurement-plus-error channel model, wherein the source x is produced as a noisy measurement of a linearly filtered version of the measurement y. That is, x = Wy − e, as illustrated in Fig. 5.3(a). It will soon become clear that our choice of W to denote this filter is not accidental. It will turn out to be the LMMSE filter. According to this channel model, the source and measurement have the representations of Figs. 5.3(b) and (c), x −I W e = , (5.27) y 0 I y e −I W x = , (5.28) y 0 I y

Figure 5.3 The measurement-plus-error channel model: (a) channel, (b) synthesis, and (c) analysis.

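where the error $e$ and the measurement $y$ are uncorrelated. The synthesis model of Fig. 5.3(b) and the analysis model of Fig. 5.3(c) produce these block UDL Cholesky factorizations:

\[
\begin{bmatrix} R_{xx} & R_{xy} \\ R_{xy}^H & R_{yy} \end{bmatrix}
= \begin{bmatrix} -I & W \\ 0 & I \end{bmatrix}
\begin{bmatrix} Q & 0 \\ 0 & R_{yy} \end{bmatrix}
\begin{bmatrix} -I & 0 \\ W^H & I \end{bmatrix}, \tag{5.29}
\]
\[
\begin{bmatrix} Q & 0 \\ 0 & R_{yy} \end{bmatrix}
= \begin{bmatrix} -I & W \\ 0 & I \end{bmatrix}
\begin{bmatrix} R_{xx} & R_{xy} \\ R_{xy}^H & R_{yy} \end{bmatrix}
\begin{bmatrix} -I & 0 \\ W^H & I \end{bmatrix}. \tag{5.30}
\]

For this channel model and its corresponding analysis and synthesis models to work, we must choose $W$ as the LMMSE filter and $Q = E[ee^H]$ as the corresponding error covariance matrix:

\[
W = R_{xy} R_{yy}^{-1} \quad \text{and} \quad Q = R_{xx} - R_{xy} R_{yy}^{-1} R_{xy}^H. \tag{5.31}
\]

The error covariance matrix $Q$ is the Schur complement of $R_{yy}$ within the composite covariance matrix. Thus, up to second order, every virtual two-channel estimation problem is a problem of representing the signal as a noisy and linearly filtered version of the measurement, which is to say that the block UDL Cholesky factorization of the composite covariance matrix has actually produced a channel model, a synthesis model, and an analysis model for the source and measurement. The orthogonality between the estimator error and the measurement is expressed by the northeast and southwest zeros of the composite covariance matrix for the error $e$ and the measurement $y$. From these block Cholesky factorizations of the composite covariance matrix we can also extract block LDU Cholesky factorizations of inverses:

\[
\begin{bmatrix} R_{xx} & R_{xy} \\ R_{xy}^H & R_{yy} \end{bmatrix}^{-1}
= \begin{bmatrix} -I & 0 \\ W^H & I \end{bmatrix}
\begin{bmatrix} Q^{-1} & 0 \\ 0 & R_{yy}^{-1} \end{bmatrix}
\begin{bmatrix} -I & W \\ 0 & I \end{bmatrix}, \tag{5.32}
\]
\[
\begin{bmatrix} Q^{-1} & 0 \\ 0 & R_{yy}^{-1} \end{bmatrix}
= \begin{bmatrix} -I & 0 \\ W^H & I \end{bmatrix}
\begin{bmatrix} R_{xx} & R_{xy} \\ R_{xy}^H & R_{yy} \end{bmatrix}^{-1}
\begin{bmatrix} -I & W \\ 0 & I \end{bmatrix}. \tag{5.33}
\]

It follows that $Q^{-1}$ is the northwest block of the inverse of the composite covariance matrix. The block Cholesky factors produce the following factorizations of determinants:

\[
\det \begin{bmatrix} R_{xx} & R_{xy} \\ R_{xy}^H & R_{yy} \end{bmatrix} = \det Q \, \det R_{yy}, \tag{5.34}
\]
\[
\det \begin{bmatrix} R_{xx} & R_{xy} \\ R_{xy}^H & R_{yy} \end{bmatrix}^{-1} = \det Q^{-1} \, \det R_{yy}^{-1}. \tag{5.35}
\]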


Let us summarize. From the analysis model we say that every composite signal and measurement vector [xT , yT ]T may be modeled as if it were a virtual two-channel experiment, wherein the signal is subtracted from a linearly filtered measurement to produce an error that is orthogonal to the measurement. Of course, the filter and the error covariance must be chosen just right. This model then decomposes the composite covariance matrix for the signal and measurement, and its inverse, into block Cholesky factors.
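The measurement-plus-error model can be verified in the same way. This is a minimal sketch, assuming NumPy and a random positive-definite composite covariance (dimensions and seed are arbitrary): the analysis map applied to the composite covariance should give a block-diagonal matrix, which is the orthogonality of error and measurement.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 2, 3
A = rng.standard_normal((n + m, n + m)) + 1j * rng.standard_normal((n + m, n + m))
R = A @ A.conj().T + np.eye(n + m)           # composite covariance of [x; y]
Rxx, Rxy, Ryy = R[:n, :n], R[:n, n:], R[n:, n:]

W = Rxy @ np.linalg.inv(Ryy)                 # LMMSE filter, cf. (5.31)
Q = Rxx - W @ Rxy.conj().T                   # error covariance, cf. (5.31)

# Analysis map [e; y] = T [x; y] with T = [[-I, W], [0, I]].
T = np.block([[-np.eye(n), W], [np.zeros((m, n)), np.eye(m)]])
R_ey = T @ R @ T.conj().T                    # covariance of [e; y], cf. (5.30)

# T is its own inverse, so the synthesis direction reads R = T diag(Q, Ryy) T^H.
D = np.block([[Q, np.zeros((n, m))], [np.zeros((m, n)), Ryy]])
R_udl = T @ D @ T.conj().T                   # cf. (5.29)

Q_inv_from_R = np.linalg.inv(R)[:n, :n]      # northwest block of the inverse
```

The off-diagonal blocks of `R_ey` vanish, `R_udl` reproduces `R`, and `Q_inv_from_R` matches the inverse of `Q`.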

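5.3.3 Filtering models

We might say that the models we have developed give us two alternative parameterizations:

• the signal-plus-noise channel model $(R_{xx}, H, R_{nn})$ with $H = R_{xy}^H R_{xx}^{-1}$ and $R_{nn} = R_{yy} - R_{xy}^H R_{xx}^{-1} R_{xy}$; and
• the measurement-plus-error channel model $(R_{yy}, W, Q)$ with $W = R_{xy} R_{yy}^{-1}$ and $Q = R_{xx} - R_{xy} R_{yy}^{-1} R_{xy}^H$.

These correspond to the two factorizations (A1.2) and (A1.1) of the composite covariance matrix, and $R_{nn}$ and $Q$ are the Schur complements of $R_{xx}$ and $R_{yy}$, respectively, within the composite covariance matrix. Let’s mix the synthesis equation for the signal-plus-noise channel model with the analysis equation of the measurement-plus-error channel model to solve for the filter $W$ and error covariance matrix $Q$ in terms of the channel parameters $H$ and $R_{nn}$:

\[
\begin{bmatrix} e \\ y \end{bmatrix}
= \begin{bmatrix} -I & W \\ 0 & I \end{bmatrix}
\begin{bmatrix} I & 0 \\ H & I \end{bmatrix}
\begin{bmatrix} x \\ n \end{bmatrix}. \tag{5.36}
\]

This composition of maps produces these factorizations:

\[
\begin{bmatrix} Q & 0 \\ 0 & R_{yy} \end{bmatrix}
= \begin{bmatrix} -I & W \\ 0 & I \end{bmatrix}
\begin{bmatrix} I & 0 \\ H & I \end{bmatrix}
\begin{bmatrix} R_{xx} & 0 \\ 0 & R_{nn} \end{bmatrix}
\begin{bmatrix} I & H^H \\ 0 & I \end{bmatrix}
\begin{bmatrix} -I & 0 \\ W^H & I \end{bmatrix}, \tag{5.37}
\]
\[
\begin{bmatrix} Q^{-1} & 0 \\ 0 & R_{yy}^{-1} \end{bmatrix}
= \begin{bmatrix} -I & 0 \\ W^H & I \end{bmatrix}
\begin{bmatrix} I & -H^H \\ 0 & I \end{bmatrix}
\begin{bmatrix} R_{xx}^{-1} & 0 \\ 0 & R_{nn}^{-1} \end{bmatrix}
\begin{bmatrix} I & 0 \\ -H & I \end{bmatrix}
\begin{bmatrix} -I & W \\ 0 & I \end{bmatrix}. \tag{5.38}
\]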

We now evaluate the northeast block of (5.37) and the southwest block of (5.38) to obtain two formulae for the filter $W$:

\[
W = R_{xx} H^H (H R_{xx} H^H + R_{nn})^{-1} = (R_{xx}^{-1} + H^H R_{nn}^{-1} H)^{-1} H^H R_{nn}^{-1}. \tag{5.39}
\]

In a similar fashion, we evaluate the northwest blocks of both (5.37) and (5.38) to get two formulae for the error covariance $Q$:

\[
Q = R_{xx} - R_{xx} H^H (H R_{xx} H^H + R_{nn})^{-1} H R_{xx} = (R_{xx}^{-1} + H^H R_{nn}^{-1} H)^{-1}. \tag{5.40}
\]

These equations are Woodbury identities, (A1.43) and (A1.44). They determine the LMMSE inversion of the measurement $y$ for the signal $x$. In the absence of noise, if the signal and measurement were of the same dimension, then $W$ would be $H^{-1}$, assuming that this inverse existed. However, generally $WH \neq I$, but is approximately so if $R_{nn}$ is small compared with $H R_{xx} H^H$. The LMMSE estimator may be written as

\[
Wy = W(Hx + n) = WHx + Wn. \tag{5.41}
\]

So, the LMMSE estimator $W$, sometimes called a deconvolution filter, decidedly does not equalize $H$ to produce $WH = I$. Rather, it approximates $I$ so that the error $e = Wy - x = (WH - I)x + Wn$, with covariance matrix $Q = (WH - I) R_{xx} (WH - I)^H + W R_{nn} W^H$, provides the best tradeoff between model-bias-squared $(WH - I) R_{xx} (WH - I)^H$ and filtered noise variance $W R_{nn} W^H$ to minimize the error covariance $Q$. The importance of these results cannot be overstated.

Let us summarize. Every problem of LMMSE estimation requiring only first- and second-order moments can be phrased as a problem of estimating one unobserved channel from another observed channel in a virtual two-channel experiment. In this virtual two-channel experiment, there are two different channel models. The first channel model says it is as if the measurement were a linear combination of filtered source and uncorrelated additive noise. The second channel model says it is as if the source vector were a linear combination of the filtered measurement and the estimator error. The first channel model produces a block LDU Cholesky factorization for the composite covariance matrix and a block UDL Cholesky factorization for its inverse. The second channel model produces a block UDL Cholesky factorization for the composite covariance matrix and a block LDU Cholesky factorization for its inverse. By mixing these two factorizations, two different solutions are found for the LMMSE estimator and its error covariance.

Example 5.2. In many problems of signal estimation in communication, radar, and sonar, or imaging in geophysics and radio astronomy, the measurement model may be taken to be the rank-one linear model

\[
y = \psi x + n, \tag{5.42}
\]

where $\psi \in \mathbb{C}^m$ is the channel vector that carries the unknown complex signal amplitude $x$ to the measurement, $x$ is a zero-mean random variable with variance $\sigma_x^2$, and $R_{nn}$ is the covariance matrix of the noise vector $n$. Using the results of the previous subsections, we may write down the following formulae for the LMMSE estimator of $x$ and its mean-squared error:

\[
\hat{x} = \sigma_x^2 \psi^H (\sigma_x^2 \psi \psi^H + R_{nn})^{-1} y = (\sigma_x^{-2} + \psi^H R_{nn}^{-1} \psi)^{-1} \psi^H R_{nn}^{-1} y, \tag{5.43}
\]
\[
Q = \sigma_x^2 - \sigma_x^4 \psi^H (\sigma_x^2 \psi \psi^H + R_{nn})^{-1} \psi = \frac{1}{\sigma_x^{-2} + \psi^H R_{nn}^{-1} \psi}. \tag{5.44}
\]

In both forms the estimator consists of a matrix inverse operator, followed by a correlator, followed by a scalar multiplication. When the channel filter is a vector’s worth of a geometric sequence, $\psi = [1, e^{j\theta}, \ldots, e^{j(m-1)\theta}]^T$, as in a uniformly sampled complex exponential time series or a uniformly sampled complex exponential plane wave, then the correlation step is a correlation with a discrete-time Fourier-transform (DTFT) vector.
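The two forms of the rank-one estimator can be compared with a short numerical sketch. It assumes NumPy, a DTFT-type channel vector of length $m$, and a random noise covariance; the values of `m`, `theta` and `sigma2` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
m, theta, sigma2 = 8, 0.7, 2.0                   # illustrative values
psi = np.exp(1j * theta * np.arange(m))          # geometric-sequence channel vector
A = rng.standard_normal((m, m)) + 1j * rng.standard_normal((m, m))
Rnn = A @ A.conj().T + np.eye(m)
Rnn_inv = np.linalg.inv(Rnn)

# Row vectors w^H such that x_hat = w^H y, in the two forms of (5.43).
Ryy = sigma2 * np.outer(psi, psi.conj()) + Rnn
w1_h = sigma2 * psi.conj() @ np.linalg.inv(Ryy)
g = np.real(psi.conj() @ Rnn_inv @ psi)          # psi^H Rnn^{-1} psi
w2_h = (psi.conj() @ Rnn_inv) / (1 / sigma2 + g)

# Mean-squared error in the two forms of (5.44).
Q1 = sigma2 - sigma2**2 * np.real(psi.conj() @ np.linalg.inv(Ryy) @ psi)
Q2 = 1 / (1 / sigma2 + g)
```

The second form needs only the inverse noise covariance, which is convenient when the same noise model is reused for several channel vectors.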

5.3.4 Nonzero means

We have assumed until now that the signal and measurement have zero means. What if the signal has known mean $\mu_x$ and the measurement has known mean $\mu_y$? How do these filtering formulae change? The centered signal and measurement $x - \mu_x$ and $y - \mu_y$ then share the composite covariance matrix. So the LMMSE estimator of $x - \mu_x$ from $y - \mu_y$ should obey all of the equations already derived. That is,

\[
\hat{x} - \mu_x = W(y - \mu_y) \iff \hat{x} = W(y - \mu_y) + \mu_x. \tag{5.45}
\]

But what about the orthogonality principle which says that the error between the estimator and the signal is orthogonal to the measurement? This is still so due to

\[
E[(\hat{x} - x) y^H] = E\{[\hat{x} - \mu_x - (x - \mu_x)](y - \mu_y)^H\} + E\{[\hat{x} - \mu_x - (x - \mu_x)] \mu_y^H\} = 0. \tag{5.46}
\]

The first term on the right is zero due to the orthogonality principle already established for zero-mean LMMSE estimators. The second term is zero because $\hat{x}$ is an unbiased estimator of $x$, i.e., $E[\hat{x}] = \mu_x$.

5.3.5 Concentration ellipsoids

Let’s call $B_{xx} = \{x: x^H R_{xx}^{-1} x = 1\}$ the concentration ellipsoid for the signal vector $x$. For scalar $x$, this is the circle $B_{xx} = \{x: |x|^2 = R_{xx}\}$. The concentration ellipsoid for the error vector $e$ is $B_{ee} = \{e: e^H Q^{-1} e = 1\}$. From the equation $Q = R_{xx} - R_{xy} R_{yy}^{-1} R_{xy}^H$ we know that $R_{xx} \geq Q$ and hence $R_{xx}^{-1} \leq Q^{-1}$. The posterior concentration ellipsoid $B_{ee}$ is therefore smaller than, and completely embedded within, the prior concentration ellipsoid $B_{xx}$. When the signal and measurement are jointly Gaussian, these ellipsoids are probability-density contour lines (level curves).

Among measures of effectiveness for LMMSE are relative values for the trace and determinant of $Q$ and $R_{xx}$. Various forms of these may be derived from the matrix identities obtained from the channel models. They may be given insightful forms in canonical coordinate systems, as we will see in Section 5.5. One particularly illuminating formula is the gain of LMMSE filtering, which is closely related to mutual information (cf. Section 5.5.2):

\[
\frac{\det R_{xx}}{\det Q} = \frac{\det R_{yy}}{\det R_{nn}}. \tag{5.47}
\]

So the gain of LMMSE estimation is the ratio of the volume of the measurement concentration ellipsoid to the volume of the noise concentration ellipsoid.


Example 5.3. Let’s consider the problem of linearly estimating the zero-mean signal $x$ from the zero-mean measurement $y$ when the measurement is in fact the complex conjugate of the signal, namely $y = x^*$. The MMSE estimator is obviously the widely linear estimator $\hat{x} = y^*$ and its error covariance matrix is $Q = 0$. The composite covariance matrix for $x$ and $x^*$ is the augmented covariance matrix $\underline{R}_{xx}$. From the structure of this matrix, namely

\[
\underline{R}_{xx} = \begin{bmatrix} R_{xx} & \tilde{R}_{xx} \\ \tilde{R}_{xx}^* & R_{xx}^* \end{bmatrix},
\]

we know that the LMMSE estimator of $x$ from $x^*$ is $\hat{x} = \tilde{R}_{xx} R_{xx}^{-*} x^*$, with error covariance matrix $Q = R_{xx} - \tilde{R}_{xx} R_{xx}^{-*} \tilde{R}_{xx}^*$. This error covariance matrix satisfies the inequality $0 \leq Q \leq R_{xx}$. Of course, if $x$ is proper, $\tilde{R}_{xx} = 0$, then linear MMSE estimation is no good at all. But, for improper signals, the error concentration ellipsoid for the error $e = \hat{x} - x$ lies inside the concentration ellipsoid for $x$.
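The inequality $0 \leq Q \leq R_{xx}$ of this example can be illustrated numerically. The sketch below assumes NumPy and a hypothetical improper signal $x = M_1 s + M_2 s^*$ driven by white proper $s$; the mixing matrices `M1` and `M2` are illustrative and do not come from the text.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 3

def crandn(*shape):
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

# Hypothetical improper signal x = M1 s + M2 conj(s), with s white and proper.
M1, M2 = crandn(n, n), crandn(n, n)
Rxx = M1 @ M1.conj().T + M2 @ M2.conj().T    # Hermitian covariance E[x x^H]
Rtxx = M1 @ M2.T + M2 @ M1.T                 # complementary covariance E[x x^T]

W = Rtxx @ np.linalg.inv(Rxx.conj())         # x_hat = W conj(x)
Q = Rxx - W @ Rtxx.conj()                    # error covariance of the example

eig_Q = np.linalg.eigvalsh(Q)                # should be nonnegative
eig_gap = np.linalg.eigvalsh(Rxx - Q)        # should be nonnegative
```

Setting `M2` to zero makes the signal proper, in which case `W` vanishes and `Q` equals `Rxx`.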

5.3.6 Special cases

Two special cases of LMMSE estimation warrant particular attention: the pure signal-plus-noise case and the Gaussian case.

Signal plus noise

The pure signal-plus-noise problem is $y = x + n$, with $H = I$ and $x$ and $n$ uncorrelated. As a consequence, $R_{xy} = R_{xx}$, $R_{yy} = R_{xx} + R_{nn}$, and the composite covariance matrix is

\[
\begin{bmatrix} R_{xx} & R_{xx} \\ R_{xx} & R_{xx} + R_{nn} \end{bmatrix}. \tag{5.48}
\]

The prior signal-plus-noise covariance is the series formula $R_{yy} = R_{xx} + R_{nn}$, but the posterior error covariance is the parallel formula $Q = (R_{xx}^{-1} + R_{nn}^{-1})^{-1}$. The gain of LMMSE filtering is

\[
\frac{\det R_{xx}}{\det Q} = \frac{\det(R_{xx} + R_{nn})}{\det R_{nn}} = \det(I + R_{nn}^{-1/2} R_{xx} R_{nn}^{-H/2}). \tag{5.49}
\]

The matrix $R_{nn}^{-1/2} R_{xx} R_{nn}^{-H/2}$ can reasonably be called a signal-to-noise-ratio matrix. These formulae are characteristic of LMMSE problems.
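A quick numerical sketch of the series and parallel formulas and of the gain (5.49) follows. It assumes NumPy, and it takes the matrix square root $R_{nn}^{1/2}$ to be the lower-triangular Cholesky factor, which is one admissible choice and not necessarily the one used elsewhere in the text.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 3

def rand_cov(k):
    """Random Hermitian positive-definite matrix (illustrative helper)."""
    A = rng.standard_normal((k, k)) + 1j * rng.standard_normal((k, k))
    return A @ A.conj().T + np.eye(k)

Rxx, Rnn = rand_cov(n), rand_cov(n)
inv, det = np.linalg.inv, np.linalg.det

Ryy = Rxx + Rnn                                  # series formula
Q_schur = Rxx - Rxx @ inv(Ryy) @ Rxx             # Schur complement of Ryy
Q_par = inv(inv(Rxx) + inv(Rnn))                 # parallel formula

# SNR matrix Rnn^{-1/2} Rxx Rnn^{-H/2}, with Rnn = C C^H (Cholesky factor, assumed).
C_inv = inv(np.linalg.cholesky(Rnn))
snr = C_inv @ Rxx @ C_inv.conj().T

# Three expressions for the gain of LMMSE filtering, cf. (5.49).
gain = [det(Rxx) / det(Q_par), det(Ryy) / det(Rnn), det(np.eye(n) + snr)]
```

All three entries of `gain` should coincide up to rounding, and the value does not depend on which square root of `Rnn` is used.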

The Gaussian case

Suppose the signal and measurement are multivariate Gaussian, meaning that the composite vector $[x^T, y^T]^T$ is multivariate normal with zero mean and the composite covariance matrix above. Then the conditional pdf for $x$, given $y$, is

\[
p(x|y) = \frac{p(x, y)}{p(y)} = \frac{\det R_{yy}}{\pi^n \det \begin{bmatrix} R_{xx} & R_{xy} \\ R_{xy}^H & R_{yy} \end{bmatrix}} \exp\left\{ -\begin{bmatrix} x^H & y^H \end{bmatrix} \begin{bmatrix} R_{xx} & R_{xy} \\ R_{xy}^H & R_{yy} \end{bmatrix}^{-1} \begin{bmatrix} x \\ y \end{bmatrix} + y^H R_{yy}^{-1} y \right\}. \tag{5.50}
\]

Using the identity (5.34), which states that the determinant of the composite covariance matrix is $\det R_{yy} \det Q$, and one of the Cholesky factors for its inverse, this pdf may be written

\[
p(x|y) = \frac{1}{\pi^n \det Q} \exp\left\{ -(x - Wy)^H Q^{-1} (x - Wy) \right\}. \tag{5.51}
\]

Thus the posterior pdf for $x$, given $y$, is Gaussian with conditional mean $Wy$ and conditional covariance $Q$. This means that the MMSE estimator, without the qualifier that it be linear, is this conditional mean $Wy$, and its error covariance is $Q$.

5.4 Widely linear MMSE estimation

We have already established that widely linear transformations are required in order to access the information contained in the complementary covariance matrix. In this section, we consider widely linear minimum mean-squared error (WLMMSE) estimation of the zero-mean signal $x\colon \Omega \to \mathbb{C}^n$ from the zero-mean measurement $y\colon \Omega \to \mathbb{C}^m$. To extend our results for LMMSE estimation to WLMMSE estimation we need only replace the signal $x$ by the augmented signal $\underline{x}$ and the measurement $y$ by the augmented measurement $\underline{y}$, specify the composite augmented covariance matrix for these two vectors, and proceed as before. Nothing else changes. Thus, most of the results from Section 5.3 apply straightforwardly to WLMMSE estimation. It is, however, still worthwhile to summarize some of these results and to compare LMMSE with WLMMSE estimation.³

The widely linear (or linear–conjugate linear) estimator is

\[
\underline{\hat{x}} = \underline{W} \, \underline{y} \iff \hat{x} = W_1 y + W_2 y^*, \tag{5.52}
\]

where

\[
\underline{W} = \begin{bmatrix} W_1 & W_2 \\ W_2^* & W_1^* \end{bmatrix} \tag{5.53}
\]

is determined such that the mean-squared error

\[
E\|\hat{x} - x\|^2 = \tfrac{1}{2} E\|\underline{\hat{x}} - \underline{x}\|^2 \tag{5.54}
\]

is minimized. This can be done by applying the orthogonality principle,

\[
(\hat{x} - x) \perp y \quad \text{and} \quad (\hat{x} - x) \perp y^*, \tag{5.55}
\]

or $(\underline{\hat{x}} - \underline{x}) \perp \underline{y}$. This says that the error between the augmented estimator and the augmented signal must be orthogonal to the augmented measurement. This leads to $E(\underline{\hat{x}} \, \underline{y}^H) - E(\underline{x} \, \underline{y}^H) = 0$,

\[
\underline{W} \, \underline{R}_{yy} - \underline{R}_{xy} = 0 \iff \underline{W} = \underline{R}_{xy} \underline{R}_{yy}^{-1}. \tag{5.56}
\]

Using the matrix-inversion lemma (A1.42) for $\underline{R}_{yy}^{-1}$, we obtain the following result.

Result 5.3. Given an observation $y\colon \Omega \to \mathbb{C}^m$, the WLMMSE estimator of the signal $x\colon \Omega \to \mathbb{C}^n$ in augmented notation is

\[
\underline{\hat{x}} = \underline{R}_{xy} \underline{R}_{yy}^{-1} \underline{y}. \tag{5.57}
\]


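Equivalently,

\[
\hat{x} = (R_{xy} - \tilde{R}_{xy} R_{yy}^{-*} \tilde{R}_{yy}^*) P_{yy}^{-1} y + (\tilde{R}_{xy} - R_{xy} R_{yy}^{-1} \tilde{R}_{yy}) P_{yy}^{-*} y^*. \tag{5.58}
\]

In this equation, the Schur complement $P_{yy} = R_{yy} - \tilde{R}_{yy} R_{yy}^{-*} \tilde{R}_{yy}^*$ is the error covariance matrix for linearly estimating $y$ from $y^*$. The augmented error covariance matrix $\underline{Q}$ of the error vector $\underline{e} = \underline{\hat{x}} - \underline{x}$ is

\[
\underline{Q} = E[\underline{e} \, \underline{e}^H] = \underline{R}_{xx} - \underline{R}_{xy} \underline{R}_{yy}^{-1} \underline{R}_{xy}^H. \tag{5.59}
\]

A competing estimator $\underline{\hat{x}}' = \underline{W}' \underline{y}$ will produce an augmented error $\underline{e}' = \underline{\hat{x}}' - \underline{x}$ with covariance matrix

\[
\underline{Q}' = E[\underline{e}' \underline{e}'^H] = \underline{Q} + (\underline{W} - \underline{W}') \underline{R}_{yy} (\underline{W} - \underline{W}')^H, \tag{5.60}
\]

which shows that $\underline{Q} \leq \underline{Q}'$. As a consequence, all real-valued increasing functions of $\underline{Q}$ are minimized, in particular,

\[
E\|\underline{e}\|^2 = \operatorname{tr} \underline{Q} \leq \operatorname{tr} \underline{Q}' = E\|\underline{e}'\|^2, \tag{5.61}
\]
\[
\det \underline{Q} \leq \det \underline{Q}'. \tag{5.62}
\]

These statements hold for the error vector $e$ as well as the augmented error vector $\underline{e}$ because

\[
\underline{Q} \leq \underline{Q}' \implies Q \leq Q'. \tag{5.63}
\]

The error covariance matrix $Q$ of the error vector $e = \hat{x} - x$ is the northwest block of the augmented error covariance matrix $\underline{Q}$, which can be evaluated as

\[
Q = E[ee^H] = R_{xx} - (R_{xy} - \tilde{R}_{xy} R_{yy}^{-*} \tilde{R}_{yy}^*) P_{yy}^{-1} R_{xy}^H - (\tilde{R}_{xy} - R_{xy} R_{yy}^{-1} \tilde{R}_{yy}) P_{yy}^{-*} \tilde{R}_{xy}^H. \tag{5.64}
\]

A particular choice for a generally suboptimum filter is the LMMSE filter

\[
\underline{W}' = \begin{bmatrix} R_{xy} R_{yy}^{-1} & 0 \\ 0 & R_{xy}^* R_{yy}^{-*} \end{bmatrix} \iff W' = R_{xy} R_{yy}^{-1}, \tag{5.65}
\]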

which ignores complementary covariance matrices. We will examine the relation between LMMSE and WLMMSE filters in the following subsections.
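The augmented solution (5.56) and the closed form (5.58) can be cross-checked numerically, along with the mean-squared errors of the two filters. This is a minimal sketch, assuming NumPy and a hypothetical improper model $z = [x; y] = M_1 s + M_2 s^*$ with white proper $s$; the helper names and the mixing matrices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 2, 3
N = n + m

def crandn(*shape):
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

# Hypothetical improper model: z = [x; y] = M1 s + M2 conj(s), s white and proper.
M1, M2 = crandn(N, N), crandn(N, N)
R = M1 @ M1.conj().T + M2 @ M2.conj().T      # Hermitian covariance E[z z^H]
Rt = M1 @ M2.T + M2 @ M1.T                   # complementary covariance E[z z^T]
Rxx, Rxy, Ryy = R[:n, :n], R[:n, n:], R[n:, n:]
Rtxx, Rtxy, Rtyy = Rt[:n, :n], Rt[:n, n:], Rt[n:, n:]

def aug(A, At):
    """Augmented matrix [[A, At], [conj(At), conj(A)]]."""
    return np.block([[A, At], [At.conj(), A.conj()]])

inv = np.linalg.inv
W_aug = aug(Rxy, Rtxy) @ inv(aug(Ryy, Rtyy))             # cf. (5.56)
W1, W2 = W_aug[:n, :m], W_aug[:n, m:]

# Closed form (5.58) via the Schur complement Pyy.
Pyy = Ryy - Rtyy @ inv(Ryy.conj()) @ Rtyy.conj()
W1_cf = (Rxy - Rtxy @ inv(Ryy.conj()) @ Rtyy.conj()) @ inv(Pyy)
W2_cf = (Rtxy - Rxy @ inv(Ryy) @ Rtyy) @ inv(Pyy.conj())

# Error covariances: WLMMSE (northwest block of (5.59)) versus LMMSE.
Q_wl = (aug(Rxx, Rtxx) - W_aug @ aug(Rxy, Rtxy).conj().T)[:n, :n]
Q_l = Rxx - Rxy @ inv(Ryy) @ Rxy.conj().T
mse_wl, mse_l = np.real(np.trace(Q_wl)), np.real(np.trace(Q_l))
```

For this strongly improper model `mse_wl` comes out below `mse_l`; with `M2` set to zero the two coincide and `W2` vanishes.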

5.4.1 Special cases

If the signal $x$ is real, we have $\tilde{R}_{xy} = R_{xy}^*$. This leads to the simplified expression

\[
\hat{x} = 2 \operatorname{Re}\left\{ (R_{xy} - \tilde{R}_{xy} R_{yy}^{-*} \tilde{R}_{yy}^*) P_{yy}^{-1} y \right\}. \tag{5.66}
\]

While the WLMMSE estimate of a real signal is always real, the LMMSE estimate is generally complex.

Result 5.4. The WLMMSE and LMMSE estimates are identical if and only if the error of the LMMSE estimate is orthogonal to $y^*$, i.e.,

\[
(W'y - x) \perp y^* \iff R_{xy} R_{yy}^{-1} \tilde{R}_{yy} - \tilde{R}_{xy} = 0. \tag{5.67}
\]

There are two important special cases in which (5.67) holds.

• The signal and measurement are cross-proper, $\tilde{R}_{xy} = 0$, and the measurement is proper, $\tilde{R}_{yy} = 0$. Joint propriety of $x$ and $y$ will suffice but it is not necessary that $x$ be proper.
• The measurement is maximally improper, i.e., $y = \alpha y^*$ with probability 1 for constant $\alpha$ with $|\alpha| = 1$. In this case, $\tilde{R}_{xy} = \alpha R_{xy}$ and $\tilde{R}_{yy} = \alpha R_{yy}$, so that $R_{xy} R_{yy}^{-1} \tilde{R}_{yy} - \tilde{R}_{xy} = R_{xy} R_{yy}^{-1} R_{yy} \alpha - R_{xy} \alpha = 0$. WL estimation is unnecessary since $y$ and $y^*$ both carry exactly the same information about $x$. This is irrespective of whether or not $x$ is proper.

It may be surprising that (5.67) puts no restrictions on whether or not $x$ be proper. This is true for the performance criterion of MMSE. Other criteria such as weighted MMSE or mutual information do depend on the complementary covariance matrix of $x$.

Finally, the following result for Gaussian random vectors is now obvious by comparing $\hat{x}$ with $\mu_{x|y}$ in (2.60) and $\underline{Q}$ with $\underline{R}_{xx|y}$ in (2.61).

Result 5.5. Let $x$ and $y$ be two jointly Gaussian random vectors. The mean of the conditional distribution for $x$ given $y$ is the WLMMSE estimator of $x$ from $y$, and the covariance matrix of the conditional distribution is the error covariance matrix of the WLMMSE estimator.

5.4.2 Performance comparison between LMMSE and WLMMSE estimation

It is interesting to compare the performance of LMMSE and WLMMSE estimation. We will consider the signal-plus-uncorrelated-noise case $y = x + n$ with white and proper noise $n$, i.e., $\underline{R}_{nn} = N_0 I$, $\underline{R}_{xy} = \underline{R}_{xx}$, and $\underline{R}_{yy} = \underline{R}_{xx} + N_0 I$. The LMMSE is

\[
\mathrm{LMMSE} = E\|e'\|^2 = \operatorname{tr}\left(R_{xx} - R_{xx} (R_{xx} + N_0 I)^{-1} R_{xx}^H\right) = \sum_{i=1}^{n} \left( \mu_i - \frac{\mu_i^2}{\mu_i + N_0} \right) = N_0 \sum_{i=1}^{n} \frac{\mu_i}{\mu_i + N_0}, \tag{5.68}
\]

where $\{\mu_i\}_{i=1}^{n}$ are the eigenvalues of $R_{xx}$. On the other hand, the WLMMSE is

\[
\mathrm{WLMMSE} = E\|e\|^2 = \tfrac{1}{2} \operatorname{tr}\left(\underline{R}_{xx} - \underline{R}_{xx} (\underline{R}_{xx} + N_0 I)^{-1} \underline{R}_{xx}^H\right) = \frac{N_0}{2} \sum_{i=1}^{2n} \frac{\lambda_i}{\lambda_i + N_0}, \tag{5.69}
\]

where $\{\lambda_i\}_{i=1}^{2n}$ are the eigenvalues of $\underline{R}_{xx}$. In order to evaluate the maximum performance advantage of WLMMSE over LMMSE processing, we need to minimize the WLMMSE for fixed $R_{xx}$ and varying $\tilde{R}_{xx}$. Using (A3.13) from Appendix 3, we see that the WLMMSE is a Schur-concave function of the eigenvalues $\{\lambda_i\}$. Therefore, minimizing the WLMMSE requires maximum spread of the $\{\lambda_i\}$ in the sense of majorization. According to Result 3.7, this is achieved for

\[
\lambda_i = 2\mu_i, \quad i = 1, \ldots, n, \quad \text{and} \quad \lambda_i = 0, \quad i = n + 1, \ldots, 2n. \tag{5.70}
\]

On plugging this into (5.69), we obtain

\[
\min_{\tilde{R}_{xx}} \mathrm{WLMMSE} = \frac{N_0}{2} \sum_{i=1}^{n} \frac{2\mu_i}{2\mu_i + N_0}, \tag{5.71}
\]

and thus

\[
\min_{N_0} \frac{\min_{\tilde{R}_{xx}} \mathrm{WLMMSE}}{\mathrm{LMMSE}} = \frac{1}{2}, \tag{5.72}
\]

which is attained for $N_0 = 0$. This gives the following result.

Result 5.6. When estimating an improper complex signal in uncorrelated additive white (i.e., proper) noise, the maximum performance advantage of WLMMSE estimation over LMMSE estimation is a factor of 2. This is achieved as the noise level $N_0$ approaches zero. The advantage diminishes for larger $N_0$ and disappears completely for $N_0 \to \infty$.

We note that, if (5.70) is not satisfied, the maximum performance advantage, which is then less than a factor of 2, occurs for some noise level $N_0 > 0$. The factor of 2 is a very conservative performance-advantage bound because it assumes the worst-case scenario of white additive noise. If the noise is improper or colored, the performance difference between WLMMSE and LMMSE processing can be much larger. Consider for instance the scenario in which the signal $x$ is real and the noise $n$ is purely imaginary. It is clear that the WL operation $\operatorname{Re} y$ will yield a perfect estimate of $x$. The LMMSE estimator, on the other hand, is not real-valued and therefore incurs a nonzero estimation error. In this special case, the performance advantage of WLMMSE over LMMSE estimation would be infinite.
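The factor-of-2 bound can be illustrated with a real-valued signal, for which the complementary covariance equals the Hermitian covariance and the augmented eigenvalues satisfy (5.70). The sketch below assumes NumPy; the eigenvalues `mu` and the noise levels are arbitrary illustrative values.

```python
import numpy as np

def lmmse(mu, N0):
    """LMMSE from the eigenvalues of Rxx, cf. (5.68)."""
    return N0 * np.sum(mu / (mu + N0))

def wlmmse(lam, N0):
    """WLMMSE from the eigenvalues of the augmented covariance, cf. (5.69)."""
    return 0.5 * N0 * np.sum(lam / (lam + N0))

mu = np.array([3.0, 1.0, 0.5])                   # illustrative eigenvalues of Rxx
Rxx = np.diag(mu)
# Real signal: the complementary covariance equals Rxx, so the augmented
# covariance is [[Rxx, Rxx], [Rxx, Rxx]] with eigenvalues 2*mu_i and 0.
R_aug = np.block([[Rxx, Rxx], [Rxx, Rxx]])
lam = np.clip(np.linalg.eigvalsh(R_aug), 0.0, None)  # clip rounding noise at zero

ratios = {N0: wlmmse(lam, N0) / lmmse(mu, N0) for N0 in (1e-6, 1.0, 1e6)}
```

The ratio should be close to 1/2 at the smallest noise level, lie between 1/2 and 1 at the intermediate level, and approach 1 at the largest, in line with Result 5.6.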

5.5 Reduced-rank widely linear estimation

We now consider reduced-rank widely linear estimators. There are two different aims for rank reduction that we pursue: either keep the mean-squared error as small as possible, or keep the concentration ellipsoid for the error as small as possible. In the Gaussian case, the latter keeps the mutual information between the reduced-rank estimator and the signal as large as possible. The first goal requires the minimization of the trace of the error covariance matrix and will be referred to as the min-trace problem. The second goal requires the minimization of the determinant of the error covariance matrix and correspondingly will be called the min-det problem.

The mean-squared estimation error is invariant under nonsingular widely linear transformation of the measurement and widely unitary transformation of the signal, and mutual information is invariant under nonsingular widely linear transformation of both measurement and signal. In Chapter 4, we have already found that half-canonical correlations and canonical correlations are maximal invariants for these transformation groups. Therefore, it should come as no surprise that the min-trace problem is solved by performing rank reduction in half-canonical coordinates, and the min-det problem by performing rank reduction in canonical coordinates.⁴

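5.5.1 Minimize mean-squared error (min-trace problem)

We first consider the min-trace problem, in which we minimize the reduced-rank MSE. The full-rank WLMMSE is

\[
E\|\hat{x} - x\|^2 = \operatorname{tr} Q = \tfrac{1}{2} E\|\underline{\hat{x}} - \underline{x}\|^2 = \tfrac{1}{2} \operatorname{tr} \underline{Q} = \tfrac{1}{2} \operatorname{tr}\left(\underline{R}_{xx} - \underline{R}_{xy} \underline{R}_{yy}^{-1} \underline{R}_{xy}^H\right), \tag{5.73}
\]

and it is easy to see that it is invariant under widely linear transformation of the measurement $y$ and widely unitary transformation of the signal $x$. Therefore, the full-rank WLMMSE must be a function of the augmented covariance matrix $\underline{R}_{xx}$ and the total half-canonical correlations $k_1 \geq k_2 \geq \cdots \geq k_{2p} \geq 0$ between $x$ and $y$. As discussed in Section 4.1.4, the total half-canonical correlations are obtained by the SVD

\[
\underline{C} = \underline{R}_{xy} \underline{R}_{yy}^{-H/2} = \underline{F} \, \underline{K} \, \underline{G}^H, \tag{5.74}
\]

where $\underline{F}^H \underline{F} = I$, $\underline{F} \in \mathcal{W}^{n \times p}$, and $\underline{G}^H \underline{G} = I$, $\underline{G} \in \mathcal{W}^{m \times p}$. The matrix

\[
\underline{K} = \begin{bmatrix} K_1 & K_2 \\ K_2 & K_1 \end{bmatrix}
\]

consists of a diagonal block $K_1$ with diagonal elements $\tfrac{1}{2}(k_{2i-1} + k_{2i})$ and a diagonal block $K_2$ with diagonal elements $\tfrac{1}{2}(k_{2i-1} - k_{2i})$. (In Chapter 4 we used a bar over $\bar{k}_i$ to differentiate total correlations from rotational correlations, but this is not necessary here.) The full-rank WLMMSE is

\[
E\|\hat{x} - x\|^2 = \tfrac{1}{2} \operatorname{tr}\left(\underline{R}_{xx} - \underline{K}^2\right) = \operatorname{tr} R_{xx} - \tfrac{1}{2} \sum_{i=1}^{2p} k_i^2. \tag{5.75}
\]

We may also rewrite the widely linear estimator as

\[
\underline{\hat{x}} = \underline{W} \, \underline{y} = \underline{R}_{xy} \underline{R}_{yy}^{-1} \underline{y} = \underline{R}_{xy} \underline{R}_{yy}^{-H/2} \underline{R}_{yy}^{-1/2} \underline{y} = \underline{F} \, \underline{K} \, \underline{G}^H \underline{R}_{yy}^{-1/2} \underline{y}. \tag{5.76}
\]

We thus break up the estimator into three blocks, as shown in Fig. 5.4. We first transform the measurement $\underline{y}$ into white and proper half-canonical measurement coordinates $\underline{\omega} = \underline{B} \, \underline{y}$ using the coder $\underline{B} = \underline{G}^H \underline{R}_{yy}^{-1/2}$. We then estimate the signal in half-canonical coordinates as $\underline{\hat{\xi}} = \underline{K} \, \underline{\omega}$. A key observation is that $\underline{K}$ has diagonal blocks. Thus, for a given $i$, every component $\hat{\xi}_i$ is a widely linear function of $\omega_i$ only and does not depend on $\omega_j$, $j \neq i$. The estimator $\underline{\hat{\xi}}$ still has mutually uncorrelated components but is generally no longer white or proper. Finally, the estimate $\underline{\hat{x}}$ is produced by passing $\underline{\hat{\xi}}$ through the decoder $\underline{A}^{-1} = \underline{F}$ (where the inverse is a right inverse).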

Figure 5.4 WLMMSE estimation in half-canonical coordinates.

So how do we obtain the optimum rank-$r$ approximation of this estimator? Looking at (5.75), it is tempting to assume that we should discard the $2(p - r)$ smallest half-canonical correlations. That is, we would build the rank-$r$ estimator

\[
\underline{\hat{x}}_r = \underline{W}_r \underline{y} = \underline{F} \, \underline{K}_r \underline{G}^H \underline{R}_{yy}^{-1/2} \underline{y}, \tag{5.77}
\]

where $\underline{K}_r$ is obtained from $\underline{K}$ by replacing the $2(p - r)$ smallest half-canonical correlations with zeros. This is indeed the optimum solution. In order to turn this into a rigorous proof, we need to establish that the reduced-rank estimator uses the same half-canonical coordinate system as the full-rank estimator. This can be done by proceeding from (5.60). In order to determine the rank-$2r$ matrix $\underline{W}_r$ that minimizes the trace of the extra covariance

\[
\operatorname{tr}\left[(\underline{W} - \underline{W}_r) \underline{R}_{yy} (\underline{W} - \underline{W}_r)^H\right] = \operatorname{tr}\left[(\underline{W} \underline{R}_{yy}^{1/2} - \underline{W}_r \underline{R}_{yy}^{1/2})(\underline{W} \underline{R}_{yy}^{1/2} - \underline{W}_r \underline{R}_{yy}^{1/2})^H\right], \tag{5.78}
\]

we need to find the rank-$2r$ matrix $\underline{W}_r \underline{R}_{yy}^{1/2}$ that is closest to $\underline{W} \underline{R}_{yy}^{1/2} = \underline{R}_{xy} \underline{R}_{yy}^{-H/2}$, where closeness is measured by the trace norm $\|X\|^2 = \operatorname{tr}(XX^H)$. It is a classical result that the reduced-rank matrix closest to a given matrix, as measured by any unitarily invariant norm including the trace norm, is obtained via the SVD (see, e.g., Marshall and Olkin (1979)). That is, with the SVD $\underline{C} = \underline{R}_{xy} \underline{R}_{yy}^{-H/2} = \underline{F} \, \underline{K} \, \underline{G}^H$ the best rank-$2r$ approximation is

\[
\underline{W}_r \underline{R}_{yy}^{1/2} = \underline{F} \, \underline{K}_r \underline{G}^H \tag{5.79}
\]

and therefore

\[
\underline{W}_r = \underline{F} \, \underline{K}_r \underline{G}^H \underline{R}_{yy}^{-1/2}. \tag{5.80}
\]

This confirms our proposition (5.77).

Result 5.7. Optimum rank reduction for WLMMSE estimation is performed in total half-canonical coordinates. The reduced-rank WLMMSE can be expressed as the sum of the full-rank WLMMSE and the WLMMSE due to rank reduction:

\[
E\|\hat{x}_r - x\|^2 = \tfrac{1}{2} \operatorname{tr} \underline{Q}_r = \left( \operatorname{tr} R_{xx} - \tfrac{1}{2} \sum_{i=1}^{2p} k_i^2 \right) + \tfrac{1}{2} \sum_{i=2r+1}^{2p} k_i^2. \tag{5.81}
\]

Rank reduction, or data compaction, is therefore most effective if much of the total correlation is concentrated in a few half-canonical correlations $\{k_i\}_{i=1}^{2r}$. In Section 4.4, we introduced the correlation spread as a single, normalized measure of the effectiveness of rank reduction.

There are analogous suboptimum stories for strictly linear estimation and conjugate linear estimation, which utilize rotational and reflectional half-canonical correlations, respectively, in place of total half-canonical correlations. For any given rank $r$, strictly linear and conjugate linear estimators can never outperform widely linear estimators.
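The mechanics of min-trace rank reduction can be sketched numerically for the strictly linear analogue mentioned above, which avoids building the structured widely linear SVD. The sketch assumes NumPy, uses the ordinary SVD of $R_{xy} R_{yy}^{-H/2}$, and takes the Cholesky factor as the square root of $R_{yy}$; these are simplifying assumptions, and the singular values here are therefore not the total half-canonical correlations of the widely linear problem.

```python
import numpy as np

rng = np.random.default_rng(6)
n, m, r = 3, 4, 2
A = rng.standard_normal((n + m, n + m)) + 1j * rng.standard_normal((n + m, n + m))
R = A @ A.conj().T + np.eye(n + m)
Rxx, Rxy, Ryy = R[:n, :n], R[:n, n:], R[n:, n:]

Lc = np.linalg.cholesky(Ryy)                 # Ryy = Lc Lc^H, used as Ryy^{1/2}
Lc_inv = np.linalg.inv(Lc)
C = Rxy @ Lc_inv.conj().T                    # Rxy Ryy^{-H/2}
F, k, Gh = np.linalg.svd(C, full_matrices=False)

def mse(W):
    """tr E[(Wy - x)(Wy - x)^H] for a linear estimator x_hat = W y."""
    Qw = Rxx - W @ Rxy.conj().T - Rxy @ W.conj().T + W @ Ryy @ W.conj().T
    return np.real(np.trace(Qw))

W_full = F @ np.diag(k) @ Gh @ Lc_inv        # equals Rxy Ryy^{-1}
k_r = np.where(np.arange(k.size) < r, k, 0.0)  # zero the smallest correlations
W_r = F @ np.diag(k_r) @ Gh @ Lc_inv         # rank-r estimator

extra_mse = np.sum(k[r:] ** 2)               # MSE added by rank reduction
```

The reduced-rank MSE equals the full-rank MSE plus the sum of the discarded squared singular values, which is the strictly linear counterpart of (5.81).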

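5.5.2 Maximize mutual information (min-det problem)

Mutual information has different invariance properties than the mean-squared error. The full-rank mutual information between jointly Gaussian $x$ and $y$ is $I(x; y) = H(x) - H(x|y)$. The differential entropy of the signal is

\[
H(x) = \tfrac{1}{2} \log\left[(\pi e)^{2n} \det \underline{R}_{xx}\right] \tag{5.82}
\]

and the conditional differential entropy of the signal given the measurement is

\[
H(x|y) = H(e) = \tfrac{1}{2} \log\left[(\pi e)^{2n} \det \underline{Q}\right]. \tag{5.83}
\]

Thus, the mutual information is

\[
I(x; y) = H(x) - H(x|y) = -\tfrac{1}{2} \log \frac{\det \underline{Q}}{\det \underline{R}_{xx}}. \tag{5.84}
\]

It is easy to see that $I(x; y)$ is invariant under widely linear transformation of the signal $x$ and the measurement $y$. Hence, it must be a function of the total canonical correlations $\{k_i\}$ between $x$ and $y$. As discussed in Section 4.1.4, the total canonical correlations are obtained by the SVD of the augmented coherence matrix $\underline{C}$,

\[
\underline{C} = \underline{R}_{xx}^{-1/2} \underline{R}_{xy} \underline{R}_{yy}^{-H/2} = \underline{F} \, \underline{K} \, \underline{G}^H, \tag{5.85}
\]

with $\underline{F}^H \underline{F} = I$, $\underline{F} \in \mathcal{W}^{n \times p}$, and $\underline{G}^H \underline{G} = I$, $\underline{G} \in \mathcal{W}^{m \times p}$, and $\underline{K}$ defined as in the min-trace problem above. We assume the ordering $1 \geq k_1 \geq k_2 \geq \cdots \geq k_{2p} \geq 0$. The mutual information can thus be written in terms of canonical correlations as

\[
I(x; y) = -\tfrac{1}{2} \log \frac{\det \underline{R}_{xx} \det\left(I - \underline{R}_{xx}^{-1/2} \underline{R}_{xy} \underline{R}_{yy}^{-1} \underline{R}_{xy}^H \underline{R}_{xx}^{-H/2}\right)}{\det \underline{R}_{xx}}
= -\tfrac{1}{2} \log \det(I - \underline{K}^2) = -\tfrac{1}{2} \log \prod_{i=1}^{2p} (1 - k_i^2) = -\tfrac{1}{2} \sum_{i=1}^{2p} \log(1 - k_i^2). \tag{5.86}
\]

We may also express the WLMMSE estimator in canonical coordinates as

\[
\underline{\hat{x}} = \underline{W} \, \underline{y} = \underline{R}_{xy} \underline{R}_{yy}^{-1} \underline{y} = \underline{R}_{xx}^{1/2} \underline{R}_{xx}^{-1/2} \underline{R}_{xy} \underline{R}_{yy}^{-H/2} \underline{R}_{yy}^{-1/2} \underline{y} = \underline{R}_{xx}^{1/2} \underline{F} \, \underline{K} \, \underline{G}^H \underline{R}_{yy}^{-1/2} \underline{y}. \tag{5.87}
\]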


As in the previous section, this breaks up the estimator into three blocks. However, this time we use a canonical coordinate system rather than a half-canonical coordinate system. We first transform the measurement $\underline{y}$ into white and proper canonical measurement coordinates $\underline{\omega} = \underline{B} \, \underline{y}$ using the coder $\underline{B} = \underline{G}^H \underline{R}_{yy}^{-1/2}$. We then estimate the signal in canonical coordinates as $\underline{\hat{\xi}} = \underline{K} \, \underline{\omega}$. While the estimator $\underline{\hat{\xi}}$ has mutually uncorrelated components but is generally not white or proper, the canonical signal $\underline{\xi}$ is white and proper. Finally, the estimate $\underline{\hat{x}}$ is produced by passing $\underline{\hat{\xi}}$ through the decoder $\underline{A}^{-1} = \underline{R}_{xx}^{1/2} \underline{F}$ (where the inverse is a right inverse).

We now ask how to find a rank-$r$ widely linear estimator that provides as much mutual information about the signal as possible. Looking at (5.86), we are inclined to discard the $2(p - r)$ smallest canonical correlations. We would thus construct the rank-$r$ estimator

\[
\underline{\hat{x}}_r = \underline{W}_r \underline{y} = \underline{R}_{xx}^{1/2} \underline{F} \, \underline{K}_r \underline{G}^H \underline{R}_{yy}^{-1/2} \underline{y}, \tag{5.88}
\]

which is similar to the solution of the min-trace problem (5.77) except that it uses a canonical coordinate system in place of a half-canonical coordinate system. The rank-$r$ matrix $\underline{K}_r$ is obtained from $\underline{K}$ by replacing the $2(p - r)$ smallest canonical correlations with zeros. As in the min-trace case, our intuition is correct but it still requires a proof that the reduced-rank maximum mutual information estimator does indeed use the same canonical coordinate system as the full-rank estimator.

The proof given below follows Hua et al. (2001). Starting from (5.60), maximizing mutual information means minimizing

\[
\det \underline{Q}_r = \det\left(\underline{Q} + (\underline{W} - \underline{W}_r) \underline{R}_{yy} (\underline{W} - \underline{W}_r)^H\right)
= \det \underline{R}_{xx} \det\left[ I - \underline{C} \, \underline{C}^H + (\underline{C} - \underline{R}_{xx}^{-1/2} \underline{W}_r \underline{R}_{yy}^{1/2})(\underline{C} - \underline{R}_{xx}^{-1/2} \underline{W}_r \underline{R}_{yy}^{1/2})^H \right]. \tag{5.89}
\]

For notational convenience, let

\[
\underline{X}_r = \underline{C} - \underline{R}_{xx}^{-1/2} \underline{W}_r \underline{R}_{yy}^{1/2}. \tag{5.90}
\]

The minimum $\det \underline{Q}_r$ is zero if there is at least one canonical correlation $k_i$ equal to 1. We may thus assume that all canonical correlations are strictly less than 1, which allows us to write

\[
\det \underline{R}_{xx} \det\left(I - \underline{C} \, \underline{C}^H + \underline{X}_r \underline{X}_r^H\right)
= \det \underline{R}_{xx} \det(I - \underline{C} \, \underline{C}^H) \det\left[ I + (I - \underline{C} \, \underline{C}^H)^{-1/2} \underline{X}_r \underline{X}_r^H (I - \underline{C} \, \underline{C}^H)^{-H/2} \right]. \tag{5.91}
\]

Since $\det \underline{R}_{xx}$ and $\det(I - \underline{C} \, \underline{C}^H)$ are independent of $\underline{W}_r$, they can be disregarded in the minimization. The third determinant in (5.91) is of the form

\[
\det(I + \underline{Y} \, \underline{Y}^H) = \prod_{i=1}^{2p} (1 + \sigma_i^2), \tag{5.92}
\]

where $\{\sigma_i\}_{i=1}^{2p}$ are the singular values of $\underline{Y} = (I - \underline{C} \, \underline{C}^H)^{-1/2} \underline{X}_r$. This amounts to minimizing $\underline{Y}$ with respect to an arbitrary unitarily invariant norm. In other words, we need to find the rank-$2r$ matrix $(I - \underline{C} \, \underline{C}^H)^{-1/2} \underline{R}_{xx}^{-1/2} \underline{W}_r \underline{R}_{yy}^{1/2}$ closest to $(I - \underline{C} \, \underline{C}^H)^{-1/2} \underline{C}$. This is achieved by a rank-$2r$ SVD of $(I - \underline{C} \, \underline{C}^H)^{-1/2} \underline{C}$. However, because the eigenvectors of $(I - \underline{C} \, \underline{C}^H)^{-1/2}$ are the left singular vectors of $\underline{C}$, a simpler solution is to use directly the SVD of the augmented coherence matrix $\underline{C} = \underline{R}_{xx}^{-1/2} \underline{R}_{xy} \underline{R}_{yy}^{-H/2} = \underline{F} \, \underline{K} \, \underline{G}^H$. The best rank-$2r$ approximation is then

\[
\underline{R}_{xx}^{-1/2} \underline{W}_r \underline{R}_{yy}^{1/2} = \underline{F} \, \underline{K}_r \underline{G}^H \tag{5.93}
\]

and thus

\[
\underline{W}_r = \underline{R}_{xx}^{1/2} \underline{F} \, \underline{K}_r \underline{G}^H \underline{R}_{yy}^{-1/2}. \tag{5.94}
\]

This confirms our initial claim.

Result 5.8. For Gaussian data, optimum rank reduction for widely linear maximum information rate estimation is performed in total canonical coordinates. The reduced-rank information rate is

\[
I_r(x; y) = -\tfrac{1}{2} \sum_{i=1}^{2r} \log(1 - k_i^2), \tag{5.95}
\]

and the loss of information rate due to rank reduction is

\[
I(x; y) - I_r(x; y) = -\tfrac{1}{2} \sum_{i=2r+1}^{2p} \log(1 - k_i^2). \tag{5.96}
\]

Two remarks made in the min-trace case hold for the min-det case as well. Firstly, rank reduction is most effective if much of the total correlation is concentrated in a few canonical correlations $\{k_i\}_{i=1}^{2r}$. A good measure for the degree of concentration is the correlation spread introduced in Section 4.4. Secondly, there are completely analogous, suboptimum, developments for strictly linear min-det estimation and conjugate linear min-det estimation. These utilize rotational and reflectional canonical correlations, respectively, in place of total canonical correlations. For any given rank, widely linear min-det estimation is at least as good as strictly linear or conjugate linear min-det estimation, but superior if the signals are improper.
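The relation between mutual information and canonical correlations in (5.84) and (5.86), and the split into retained and lost information in (5.95) and (5.96), can be checked numerically. This is a sketch under stated assumptions: NumPy, a hypothetical improper model $z = [x; y] = M_1 s + M_2 s^*$ with white proper $s$, Cholesky factors as matrix square roots, and the ordinary SVD of the augmented coherence matrix, whose singular values are taken to be the total canonical correlations.

```python
import numpy as np

rng = np.random.default_rng(7)
n, m, r = 2, 3, 1
N = n + m

def crandn(*shape):
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

# Hypothetical improper model for the composite vector z = [x; y].
M1, M2 = crandn(N, N), crandn(N, N)
R = M1 @ M1.conj().T + M2 @ M2.conj().T      # E[z z^H]
Rt = M1 @ M2.T + M2 @ M1.T                   # E[z z^T]

def aug(A, At):
    """Augmented matrix [[A, At], [conj(At), conj(A)]]."""
    return np.block([[A, At], [At.conj(), A.conj()]])

Rxx_a = aug(R[:n, :n], Rt[:n, :n])
Rxy_a = aug(R[:n, n:], Rt[:n, n:])
Ryy_a = aug(R[n:, n:], Rt[n:, n:])

# Augmented coherence matrix with Cholesky factors as square roots (assumed).
Lx, Ly = np.linalg.cholesky(Rxx_a), np.linalg.cholesky(Ryy_a)
C = np.linalg.inv(Lx) @ Rxy_a @ np.linalg.inv(Ly).conj().T
k = np.linalg.svd(C, compute_uv=False)       # sorted in descending order

# Mutual information from determinants (5.84) and from correlations (5.86).
Q_a = Rxx_a - Rxy_a @ np.linalg.inv(Ryy_a) @ Rxy_a.conj().T
I_det = -0.5 * (np.linalg.slogdet(Q_a)[1] - np.linalg.slogdet(Rxx_a)[1])
I_k = -0.5 * np.sum(np.log(1 - k ** 2))

# Retained and lost information when keeping the 2r largest correlations.
I_r = -0.5 * np.sum(np.log(1 - k[: 2 * r] ** 2))
loss = -0.5 * np.sum(np.log(1 - k[2 * r:] ** 2))
```

The values are in nats because natural logarithms are used; `I_r + loss` reproduces the full-rank mutual information.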

5.6 Linear and widely linear minimum-variance distortionless response estimators

In the engineering literature, especially in the fields of radar, sonar, and wireless communication, it is commonplace to design a linear minimum-variance unbiased or linear minimum-variance distortionless response (LMVDR) estimator (or receiver). Actually, it would be more accurate to call these receivers best linear unbiased estimators (BLUEs), which conforms to standard language in the statistics literature. In this section, we will first review the design and performance of the LMVDR receiver using Hermitian covariances only, and then we shall extend our results to widely linear minimum-variance distortionless response (WLMVDR) receivers by accounting for complementary covariances.


5.6.1 Rank-one LMVDR receiver

We shall begin by assuming the signal component of the received signal to lie in a known one-dimensional subspace of $\mathbb{C}^m$, denoted $\boldsymbol{\psi}$. We will later generalize to $p$-dimensional subspaces $\boldsymbol{\Psi}$. Without loss of generality we may assume $\boldsymbol{\psi}^H\boldsymbol{\psi} = 1$, which is to say that $\boldsymbol{\psi}$ is a unit vector. Thus the measurement model for the received signal $\mathbf{y}\colon \Omega \to \mathbb{C}^m$ is

$$\mathbf{y} = \boldsymbol{\psi}x + \mathbf{n}, \tag{5.97}$$

where the complex scalar $x$ is unknown and to be estimated, but modeled as deterministic, i.e., $x$ is not assigned a probability distribution. The additive noise $\mathbf{n}$ is zero-mean with Hermitian covariance matrix $\mathbf{R}_{nn} = E[\mathbf{n}\mathbf{n}^H]$. The problem is to design a linear estimator of $x$ that is unbiased with variance smaller than any unbiased competitor. A first guess would be the matched filter

$$\hat{x} = \boldsymbol{\psi}^H\mathbf{y}, \tag{5.98}$$

which has mean $E[\hat{x}] = x$ and variance

$$E[(\hat{x} - x)(\hat{x} - x)^*] = \boldsymbol{\psi}^H\mathbf{R}_{nn}\boldsymbol{\psi}. \tag{5.99}$$

This solution is linear and unbiased, but, as we now show, not minimum variance. To design the LMVDR receiver $\hat{x} = \mathbf{w}^H\mathbf{y}$, let's minimize its variance, under the constraint that its mean be $x$:

$$\min_{\mathbf{w}}\ \mathbf{w}^H\mathbf{R}_{nn}\mathbf{w} \quad \text{under constraint} \quad \mathbf{w}^H\boldsymbol{\psi} = 1. \tag{5.100}$$

It is a simple matter to set up a Lagrangian, solve for the set of candidate solutions, and enforce the constraint. This leads to the LMVDR receiver

$$\hat{x} = \frac{\boldsymbol{\psi}^H\mathbf{R}_{nn}^{-1}\mathbf{y}}{\boldsymbol{\psi}^H\mathbf{R}_{nn}^{-1}\boldsymbol{\psi}}, \tag{5.101}$$

which, for obvious reasons, is sometimes also called a whitened matched filter. This solution too has mean $E[\hat{x}] = x$, but its variance is smaller:

$$Q = E[(\hat{x} - x)(\hat{x} - x)^*] = \frac{1}{\boldsymbol{\psi}^H\mathbf{R}_{nn}^{-1}\boldsymbol{\psi}}. \tag{5.102}$$

The inequality $(\boldsymbol{\psi}^H\mathbf{R}_{nn}^{-1}\boldsymbol{\psi})^{-1} \le \boldsymbol{\psi}^H\mathbf{R}_{nn}\boldsymbol{\psi}$ is a special case of the Kantorovich inequality. The LMVDR or BLUE estimator is also maximum likelihood when the measurement noise $\mathbf{n}$ is zero-mean Gaussian with covariance $\mathbf{R}_{nn}$, in which case the measurement $\mathbf{y}$ is Gaussian with mean $\boldsymbol{\psi}x$ and covariance $\mathbf{R}_{nn}$.
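The comparison between the matched filter (5.98) and the LMVDR receiver (5.101) can be checked numerically. The following is a minimal sketch, assuming NumPy; the dimension `m`, the random seed, and the randomly generated `psi` and `R_nn` are illustrative assumptions and not part of the text.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 6  # hypothetical number of measurement channels

# Unit-norm signal vector psi (arbitrary choice for illustration).
psi = rng.standard_normal(m) + 1j * rng.standard_normal(m)
psi /= np.linalg.norm(psi)

# Random Hermitian positive-definite noise covariance R_nn.
A = rng.standard_normal((m, m)) + 1j * rng.standard_normal((m, m))
R_nn = A @ A.conj().T + 0.1 * np.eye(m)

# Matched filter (5.98) and its variance (5.99).
w_mf = psi
var_mf = np.real(psi.conj() @ R_nn @ psi)

# LMVDR weights from (5.101): w = R^{-1} psi / (psi^H R^{-1} psi).
Rinv_psi = np.linalg.solve(R_nn, psi)
denom = np.real(psi.conj() @ Rinv_psi)
w_lmvdr = Rinv_psi / denom
var_lmvdr = 1.0 / denom  # variance (5.102)

# Both receivers satisfy the distortionless constraint w^H psi = 1.
constraint_mf = w_mf.conj() @ psi
constraint_lmvdr = w_lmvdr.conj() @ psi

print("matched-filter variance:", var_mf)
print("LMVDR variance:         ", var_lmvdr)
```

Solving the linear system with `np.linalg.solve` rather than forming the inverse of `R_nn` explicitly is a numerical design choice, not something the derivation requires.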

Relation to LMMSE estimator

If the unknown variable $x$ is modeled as in Example 5.2 as a random variable that is uncorrelated with $\mathbf{n}$ and has zero mean and variance $\sigma_x^2$, then its LMMSE estimator is

$$\hat{x} = \frac{\boldsymbol{\psi}^H\mathbf{R}_{nn}^{-1}\mathbf{y}}{\sigma_x^{-2} + \boldsymbol{\psi}^H\mathbf{R}_{nn}^{-1}\boldsymbol{\psi}}. \tag{5.103}$$

This estimator is conditionally biased, but unconditionally unbiased. Its mean-squared error is $(\sigma_x^{-2} + \boldsymbol{\psi}^H\mathbf{R}_{nn}^{-1}\boldsymbol{\psi})^{-1}$. If nothing is known a priori about the scale of $x$ (i.e., its variance $\sigma_x^2$), then the LMVDR estimator

$$\hat{x} = \frac{\boldsymbol{\psi}^H\mathbf{R}_{nn}^{-1}\mathbf{y}}{\boldsymbol{\psi}^H\mathbf{R}_{nn}^{-1}\boldsymbol{\psi}} \tag{5.104}$$

might be used as an approximation to the LMMSE estimator. It is conditionally unbiased and unconditionally unbiased, so its mean-squared error is the average of its conditional variance, which, when averaged over the distribution of $x$, remains unchanged at $(\boldsymbol{\psi}^H\mathbf{R}_{nn}^{-1}\boldsymbol{\psi})^{-1}$. Obviously the LMVDR approximation to the LMMSE estimator has worse mean-squared error than the LMMSE estimator. So there is a tradeoff between minimum mean-squared error and conditional unbiasedness. As a practical matter, in many problems of communication, radar, sonar, geophysics, and radio astronomy, the scale of $x$ is unknown and the LMVDR approximation is routinely used, with the virtue that it is conditionally and unconditionally unbiased and its mean-squared error is near the minimum achievable at high SNRs.

[Figure 5.5: Rank-one generalized sidelobe canceler. Block diagram: the top branch applies $\boldsymbol{\psi}^H$ to $\mathbf{y}$ and outputs $x + u$; the bottom branch applies $\mathbf{G}^H$ to $\mathbf{y}$ to give $\mathbf{v}$, which is filtered by $\boldsymbol{\psi}^H\mathbf{R}_{nn}\mathbf{G}(\mathbf{G}^H\mathbf{R}_{nn}\mathbf{G})^{-1}$ to form $\hat{u}$; the output is $(x + u) - \hat{u}$.]
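The tradeoff between the two mean-squared errors can be made concrete with a short calculation. The sketch below, assuming NumPy, evaluates both expressions for a few prior variances; the helper name `mse_pair` and the example values of `psi`, `R_nn`, and `sigma_x2` are hypothetical.

```python
import numpy as np

def mse_pair(psi, R_nn, sigma_x2):
    """Return (LMMSE MSE, LMVDR MSE) for the model y = psi * x + n."""
    q = np.real(psi.conj() @ np.linalg.solve(R_nn, psi))  # psi^H R^{-1} psi
    return 1.0 / (1.0 / sigma_x2 + q), 1.0 / q

# Illustrative unit-norm signal vector and diagonal noise covariance.
psi = np.array([1.0, 1.0j, -1.0, -1.0j]) / 2.0
R_nn = np.diag([1.0, 2.0, 0.5, 4.0]).astype(complex)

for sigma_x2 in (0.1, 1.0, 100.0):
    mse_lmmse, mse_lmvdr = mse_pair(psi, R_nn, sigma_x2)
    print(f"sigma_x^2 = {sigma_x2:6.1f}  LMMSE = {mse_lmmse:.4f}  LMVDR = {mse_lmvdr:.4f}")
```

As the prior variance grows (high SNR), the term $\sigma_x^{-2}$ becomes negligible and the two values approach each other, in line with the remark above.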

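5.6.2 Generalized sidelobe canceler

The LMVDR estimator can also be derived by considering the generalized sidelobe canceler (GSC) structure in Fig. 5.5. The top branch of the figure is matched to the signal subspace $\boldsymbol{\psi}$ and the bottom branch is mismatched, which is to say that $\mathbf{G}^H\boldsymbol{\psi} = \mathbf{0}$ for $\mathbf{G} \in \mathbb{C}^{m\times(m-1)}$. Moreover, the matrix $\mathbf{U} = [\boldsymbol{\psi}, \mathbf{G}]$ is a unitary matrix, meaning that $\boldsymbol{\psi}^H\boldsymbol{\psi} = 1$, $\mathbf{G}^H\boldsymbol{\psi} = \mathbf{0}$, and $\mathbf{G}^H\mathbf{G} = \mathbf{I}$. When the Hermitian transpose of this matrix is applied to the measurement $\mathbf{y}$, the result is

$$\mathbf{U}^H\mathbf{y} = \begin{bmatrix}\boldsymbol{\psi}^H \\ \mathbf{G}^H\end{bmatrix}\mathbf{y} = \begin{bmatrix} x + \boldsymbol{\psi}^H\mathbf{n} \\ \mathbf{G}^H\mathbf{n}\end{bmatrix} = \begin{bmatrix} x + u \\ \mathbf{v}\end{bmatrix}. \tag{5.105}$$

The output of the top branch is $x + u$, the unbiased matched filter output, with variance $\boldsymbol{\psi}^H\mathbf{R}_{nn}\boldsymbol{\psi}$, and the output of the GSC in the bottom branch is $\mathbf{v} = \mathbf{G}^H\mathbf{y}$. It contains no signal, but is correlated with the noise variable $u = \boldsymbol{\psi}^H\mathbf{n}$ in the output of the top branch. Therefore it may be used to estimate $u$, and this estimate may be subtracted from the top branch to reduce variance. This is the essence of the LMVDR receiver.

Let us implement this program, using the formalism we have developed in our discussion of LMMSE estimators. We organize $u$ and $\mathbf{v}$ into a composite vector and compute the covariance of this composite vector as

$$E\left\{\begin{bmatrix} u \\ \mathbf{v}\end{bmatrix}\begin{bmatrix} u^* & \mathbf{v}^H\end{bmatrix}\right\} = \begin{bmatrix}\boldsymbol{\psi}^H \\ \mathbf{G}^H\end{bmatrix}\mathbf{R}_{nn}\begin{bmatrix}\boldsymbol{\psi} & \mathbf{G}\end{bmatrix} = \begin{bmatrix}\boldsymbol{\psi}^H\mathbf{R}_{nn}\boldsymbol{\psi} & \boldsymbol{\psi}^H\mathbf{R}_{nn}\mathbf{G} \\ \mathbf{G}^H\mathbf{R}_{nn}\boldsymbol{\psi} & \mathbf{G}^H\mathbf{R}_{nn}\mathbf{G}\end{bmatrix}. \tag{5.106}$$

From the unitarity of the matrix $[\boldsymbol{\psi}, \mathbf{G}]$, it follows that the inverse of this composite covariance matrix is

$$\begin{bmatrix}\boldsymbol{\psi}^H \\ \mathbf{G}^H\end{bmatrix}\mathbf{R}_{nn}^{-1}\begin{bmatrix}\boldsymbol{\psi} & \mathbf{G}\end{bmatrix} = \begin{bmatrix}\boldsymbol{\psi}^H\mathbf{R}_{nn}^{-1}\boldsymbol{\psi} & \boldsymbol{\psi}^H\mathbf{R}_{nn}^{-1}\mathbf{G} \\ \mathbf{G}^H\mathbf{R}_{nn}^{-1}\boldsymbol{\psi} & \mathbf{G}^H\mathbf{R}_{nn}^{-1}\mathbf{G}\end{bmatrix}. \tag{5.107}$$

Recall that the northwest entry of this matrix is the inverse of the error variance for estimating $u$ from $\mathbf{v}$:

$$E[(\hat{u} - u)(\hat{u} - u)^*] = \frac{1}{\boldsymbol{\psi}^H\mathbf{R}_{nn}^{-1}\boldsymbol{\psi}}. \tag{5.108}$$

This result establishes that the variance in estimating $u$ from $\mathbf{v}$ is the same as the variance in estimating $x$ from $\mathbf{y}$. Let us now show that it is also the variance in estimating $x$ from $\mathbf{v}$, thus establishing that the GSC is indeed an implementation of the LMVDR or BLUE estimator. The LMMSE estimator of $u$ from $\mathbf{v}$ may be read out of the composite covariance matrix for $u$ and $\mathbf{v}$:

$$\hat{u} = \boldsymbol{\psi}^H\mathbf{R}_{nn}\mathbf{G}(\mathbf{G}^H\mathbf{R}_{nn}\mathbf{G})^{-1}\mathbf{v}. \tag{5.109}$$

The resulting LMVDR estimator of $x$ from $\mathbf{y}$, illustrated in Fig. 5.5, is the difference between $x + u$ and $\hat{u}$, namely $\hat{x} = x + (u - \hat{u}) = (x + u) - \hat{u}$:

$$\hat{x} = \boldsymbol{\psi}^H[\mathbf{I} - \mathbf{R}_{nn}\mathbf{G}(\mathbf{G}^H\mathbf{R}_{nn}\mathbf{G})^{-1}\mathbf{G}^H]\mathbf{y}. \tag{5.110}$$

From the orthogonality of $\boldsymbol{\psi}$ and $\mathbf{G}$, it follows that this estimator is unbiased. Moreover, its variance is the mean-squared error between $\hat{u}$ and $u$:

$$Q = E[(\hat{x} - x)(\hat{x} - x)^*] = E[(\hat{u} - u)(\hat{u} - u)^*] = \boldsymbol{\psi}^H\mathbf{R}_{nn}\boldsymbol{\psi} - \boldsymbol{\psi}^H\mathbf{R}_{nn}\mathbf{G}(\mathbf{G}^H\mathbf{R}_{nn}\mathbf{G})^{-1}\mathbf{G}^H\mathbf{R}_{nn}\boldsymbol{\psi} = \mathbf{h}^H(\mathbf{I} - \mathbf{P}_F)\mathbf{h}. \tag{5.111}$$

Here, $\mathbf{h} = \mathbf{R}_{nn}^{H/2}\boldsymbol{\psi}$, $\mathbf{F} = \mathbf{R}_{nn}^{H/2}\mathbf{G}$, and $\mathbf{P}_F = \mathbf{F}(\mathbf{F}^H\mathbf{F})^{-1}\mathbf{F}^H$ is the orthogonal projection onto the subspace $\mathbf{F}$. This result produces the following identity for the variance of the LMVDR receiver:

$$Q = \frac{1}{\boldsymbol{\psi}^H\mathbf{R}_{nn}^{-1}\boldsymbol{\psi}} = \mathbf{h}^H(\mathbf{I} - \mathbf{P}_F)\mathbf{h} = \mathbf{h}^H\mathbf{h} - \mathbf{h}^H\mathbf{P}_F\mathbf{h}. \tag{5.112}$$

Actually, this equality is a decomposition of the variance of $u$, namely $\mathbf{h}^H\mathbf{h} = \boldsymbol{\psi}^H\mathbf{R}_{nn}\boldsymbol{\psi}$, into the sum of the variance for $(\hat{u} - u)$, namely $(\boldsymbol{\psi}^H\mathbf{R}_{nn}^{-1}\boldsymbol{\psi})^{-1}$, plus the variance of $\hat{u}$, namely $\mathbf{h}^H\mathbf{P}_F\mathbf{h} = \boldsymbol{\psi}^H\mathbf{R}_{nn}\mathbf{G}(\mathbf{G}^H\mathbf{R}_{nn}\mathbf{G})^{-1}\mathbf{G}^H\mathbf{R}_{nn}\boldsymbol{\psi}$. This decomposition comes from the orthogonal decomposition of $u$ into $\hat{u} - (\hat{u} - u)$.

The virtue of the identity (5.112) for the variance of the LMVDR receiver is that it shows the variance to depend on the angle between the subspaces $\mathbf{h}$ and $\mathbf{F}$, as illustrated in Fig. 5.6. In fact, the gain of the LMVDR receiver over the matched filter receiver is

$$\frac{\boldsymbol{\psi}^H\mathbf{R}_{nn}\boldsymbol{\psi}}{1/(\boldsymbol{\psi}^H\mathbf{R}_{nn}^{-1}\boldsymbol{\psi})} = \frac{\mathbf{h}^H\mathbf{h}}{\mathbf{h}^H(\mathbf{I} - \mathbf{P}_F)\mathbf{h}} = \frac{1}{\sin^2\theta}, \tag{5.113}$$

where $\theta$ is the angle between the colored subspaces $\mathbf{h}$ and $\mathbf{F}$. When these two subspaces are close in angle, then $\sin^2\theta$ is small and the gain is large, meaning that the bottom branch of the GSC can estimate and cancel out the noise component $u$ in the top branch, thus reducing the variance of the error $u - \hat{u} = \hat{x} - x$.

[Figure 5.6: The performance gain of the LMVDR estimator over a matched filter depends on the angle $\theta$ between $\mathbf{h}$ and $\mathbf{F}$. The sketch shows $\mathbf{h}$, its projection $\mathbf{P}_F\mathbf{h}$ onto $\mathbf{F}$, and the angle $\theta$ between them.]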
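The equivalence between the GSC form (5.110) and the direct LMVDR solution (5.101), and the gain formula (5.113), can be verified numerically. The following is a sketch assuming NumPy; the blocking matrix `G` is built here from a QR factorization and the square root of the covariance is taken to be its Cholesky factor, both of which are implementation choices made for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 5  # hypothetical dimension

psi = rng.standard_normal(m) + 1j * rng.standard_normal(m)
psi /= np.linalg.norm(psi)
A = rng.standard_normal((m, m)) + 1j * rng.standard_normal((m, m))
R = A @ A.conj().T + 0.1 * np.eye(m)

# Orthonormal basis G of the orthogonal complement of psi: the first QR
# column is psi up to a phase, the remaining m-1 columns are orthogonal to it.
X = np.column_stack([psi, rng.standard_normal((m, m - 1))])
Q_full, _ = np.linalg.qr(X)
G = Q_full[:, 1:]

# GSC row vector psi^H [I - R G (G^H R G)^{-1} G^H], cf. (5.110).
RG = R @ G
C = np.linalg.solve(G.conj().T @ RG, G.conj().T)  # (G^H R G)^{-1} G^H
wH_gsc = psi.conj() @ (np.eye(m) - RG @ C)

# Direct LMVDR row vector psi^H R^{-1} / (psi^H R^{-1} psi), cf. (5.101).
Rinv_psi = np.linalg.solve(R, psi)
q = np.real(psi.conj() @ Rinv_psi)
wH_direct = Rinv_psi.conj() / q

# Angle between h = R^{H/2} psi and F = R^{H/2} G, with R = L L^H.
L = np.linalg.cholesky(R)
h = L.conj().T @ psi
F = L.conj().T @ G
coef, *_ = np.linalg.lstsq(F, h, rcond=None)
resid = h - F @ coef                      # (I - P_F) h
sin2_theta = np.real(resid.conj() @ resid) / np.real(h.conj() @ h)

gain = np.real(psi.conj() @ R @ psi) * q  # left-hand side of (5.113)
print("gain:", gain, " 1/sin^2(theta):", 1.0 / sin2_theta)
```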

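5.6.3 Multi-rank LMVDR receiver

These results are easily extended to the multi-rank LMVDR receiver, wherein the measurement model is

$$\mathbf{y} = \boldsymbol{\Psi}\mathbf{x} + \mathbf{n}. \tag{5.114}$$

In this model, the matrix $\boldsymbol{\Psi} = [\boldsymbol{\psi}_1, \boldsymbol{\psi}_2, \ldots, \boldsymbol{\psi}_p] \in \mathbb{C}^{m\times p}$ consists of $p$ modes and the vector $\mathbf{x} = [x_1, x_2, \ldots, x_p]^T$ consists of $p$ complex amplitudes. Without loss of generality we assume $\boldsymbol{\Psi}^H\boldsymbol{\Psi} = \mathbf{I}$. The multi-rank matched filter estimator of $\mathbf{x}$ is

$$\hat{\mathbf{x}} = \boldsymbol{\Psi}^H\mathbf{y} \tag{5.115}$$

with mean $\mathbf{x}$ and error covariance matrix $E[(\hat{\mathbf{x}} - \mathbf{x})(\hat{\mathbf{x}} - \mathbf{x})^H] = \boldsymbol{\Psi}^H\mathbf{R}_{nn}\boldsymbol{\Psi}$. The LMVDR estimator $\hat{\mathbf{x}} = \mathbf{W}^H\mathbf{y}$, with $\mathbf{W} \in \mathbb{C}^{m\times p}$, is derived by minimizing the trace of the error covariance under an unbiasedness constraint:

$$\min_{\mathbf{W}}\ \operatorname{tr}[\mathbf{W}^H\mathbf{R}_{nn}\mathbf{W}] \quad \text{under constraint} \quad \mathbf{W}^H\boldsymbol{\Psi} = \mathbf{I}. \tag{5.116}$$

The solution is

$$\hat{\mathbf{x}} = (\boldsymbol{\Psi}^H\mathbf{R}_{nn}^{-1}\boldsymbol{\Psi})^{-1}\boldsymbol{\Psi}^H\mathbf{R}_{nn}^{-1}\mathbf{y} \tag{5.117}$$

with mean $\mathbf{x}$ and error covariance matrix

$$\mathbf{Q} = E[(\hat{\mathbf{x}} - \mathbf{x})(\hat{\mathbf{x}} - \mathbf{x})^H] = (\boldsymbol{\Psi}^H\mathbf{R}_{nn}^{-1}\boldsymbol{\Psi})^{-1}. \tag{5.118}$$

The inequality $(\boldsymbol{\Psi}^H\mathbf{R}_{nn}^{-1}\boldsymbol{\Psi})^{-1} \le \boldsymbol{\Psi}^H\mathbf{R}_{nn}\boldsymbol{\Psi}$ is again a special case of the Kantorovich inequality.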
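A minimal numerical sketch of the multi-rank solution (5.117) and its error covariance (5.118) follows, assuming NumPy. The dimensions `m` and `p`, the random modes, and the simulated measurement are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
m, p = 8, 3  # hypothetical dimensions

# Orthonormal modes, so that Psi^H Psi = I.
Psi, _ = np.linalg.qr(rng.standard_normal((m, p)) + 1j * rng.standard_normal((m, p)))

# Hermitian positive-definite noise covariance and its Cholesky factor.
A = rng.standard_normal((m, m)) + 1j * rng.standard_normal((m, m))
R = A @ A.conj().T + 0.1 * np.eye(m)
L = np.linalg.cholesky(R)

Rinv_Psi = np.linalg.solve(R, Psi)            # R^{-1} Psi
Q = np.linalg.inv(Psi.conj().T @ Rinv_Psi)    # error covariance (5.118)
W = Rinv_Psi @ Q                              # x_hat = W^H y reproduces (5.117)
Q_mf = Psi.conj().T @ R @ Psi                 # matched-filter error covariance

# One simulated measurement y = Psi x + n, where n = L z has covariance R.
x = rng.standard_normal(p) + 1j * rng.standard_normal(p)
z = (rng.standard_normal(m) + 1j * rng.standard_normal(m)) / np.sqrt(2)
y = Psi @ x + L @ z
x_hat = W.conj().T @ y

print("tr(Q_mf) =", np.trace(Q_mf).real, " tr(Q) =", np.trace(Q).real)
```

The ordering of the two error covariances can be inspected through the eigenvalues of `Q_mf - Q`, which should be nonnegative up to rounding.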


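Generalized sidelobe canceler

As in the one-dimensional case, there is a generalized sidelobe-canceler (GSC) structure. Figure 5.5 still holds, with the one-dimensional matched filter of the top branch relabeled with the $p$-dimensional matched filter $\boldsymbol{\Psi}^H$ and the bottom GSC branch labeled with the $(m - p)$-dimensional sidelobe canceler $\mathbf{G}^H$, $\mathbf{G} \in \mathbb{C}^{m\times(m-p)}$, where the matrix $\mathbf{U} = [\boldsymbol{\Psi}, \mathbf{G}]$ is unitary. The filter connecting the bottom branch to the top branch is $\boldsymbol{\Psi}^H\mathbf{R}_{nn}\mathbf{G}(\mathbf{G}^H\mathbf{R}_{nn}\mathbf{G})^{-1}$, applied to $\mathbf{v}$, and the resulting LMVDR estimator of $\mathbf{x}$ is the difference between $\mathbf{x} + \mathbf{u}$ and $\hat{\mathbf{u}}$, namely $\hat{\mathbf{x}} = \mathbf{x} + (\mathbf{u} - \hat{\mathbf{u}}) = (\mathbf{x} + \mathbf{u}) - \hat{\mathbf{u}}$:

$$\hat{\mathbf{x}} = \boldsymbol{\Psi}^H[\mathbf{I} - \mathbf{R}_{nn}\mathbf{G}(\mathbf{G}^H\mathbf{R}_{nn}\mathbf{G})^{-1}\mathbf{G}^H]\mathbf{y}. \tag{5.119}$$

From the orthogonality of $\boldsymbol{\Psi}$ and $\mathbf{G}$, it follows that this estimator is unbiased. Moreover, its error covariance is the mean-squared error matrix between $\mathbf{u}$ and $\hat{\mathbf{u}}$:

$$\mathbf{Q} = E[(\hat{\mathbf{x}} - \mathbf{x})(\hat{\mathbf{x}} - \mathbf{x})^H] = E[(\hat{\mathbf{u}} - \mathbf{u})(\hat{\mathbf{u}} - \mathbf{u})^H] = \boldsymbol{\Psi}^H\mathbf{R}_{nn}\boldsymbol{\Psi} - \boldsymbol{\Psi}^H\mathbf{R}_{nn}\mathbf{G}(\mathbf{G}^H\mathbf{R}_{nn}\mathbf{G})^{-1}\mathbf{G}^H\mathbf{R}_{nn}\boldsymbol{\Psi} = \mathbf{H}^H(\mathbf{I} - \mathbf{P}_F)\mathbf{H}. \tag{5.120}$$

Here, $\mathbf{H} = \mathbf{R}_{nn}^{H/2}\boldsymbol{\Psi}$ and $\mathbf{F} = \mathbf{R}_{nn}^{H/2}\mathbf{G}$. As in the one-dimensional case, this result produces the following identity for the error covariance matrix of the LMVDR receiver:

$$\mathbf{Q} = (\boldsymbol{\Psi}^H\mathbf{R}_{nn}^{-1}\boldsymbol{\Psi})^{-1} = \mathbf{H}^H(\mathbf{I} - \mathbf{P}_F)\mathbf{H}. \tag{5.121}$$

The virtue of this identity is that it shows the error covariance to depend on the principal angles between the subspaces $\mathbf{H}$ and $\mathbf{F}$. Figure 5.6 applies correspondingly, if one accounts for the fact that $\mathbf{H}$ is now multidimensional. The gain of the multi-rank LMVDR receiver over the multi-rank matched filter receiver may be written as

$$\frac{\det(\mathbf{H}^H\mathbf{H})}{\det[\mathbf{H}^H(\mathbf{I} - \mathbf{P}_F)\mathbf{H}]} = \frac{1}{\prod_{i=1}^{p}\sin^2\theta_i}, \tag{5.122}$$

where the $\theta_i$ are the principal angles between the two subspaces $\mathbf{H}$ and $\mathbf{F}$. When these angles are small, the gain is large, meaning that the bottom branch of the GSC can estimate and cancel out the noise component $\mathbf{u}$ in the top branch, thus reducing the variance of the error $\hat{\mathbf{u}} - \mathbf{u}$.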

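5.6.4 Subspace identification for beamforming and spectrum analysis

Often the measurement model $\mathbf{y} = \boldsymbol{\psi}x + \mathbf{n}$ is hypothesized, but the channel vector $\boldsymbol{\psi}$ that carries the unknown complex parameter $x$ to the measurement is unknown. This kind of model is called a separable linear model. Call the true unknown vector $\boldsymbol{\psi}_0$. A standard ploy in subspace identification, spectrum analysis, and beamforming is to scan the LMVDR estimator $\hat{x}$ through candidates for $\boldsymbol{\psi}_0$ and choose the candidate that best models the measurement $\mathbf{y}$. That is to say, the weighted squared error between the measurement $\mathbf{y}$ and its prediction $\boldsymbol{\psi}\hat{x}$ is evaluated for each candidate $\boldsymbol{\psi}$ and the minimizing candidate is taken to be the best estimate of $\boldsymbol{\psi}_0$. Such scanners are called goniometers. This minimizing candidate is actually the maximum-likelihood estimate in a multivariate Gaussian model for the noise and it is also the weighted least-squared error estimate, as we now demonstrate.

Recall that the LMVDR estimate of the complex amplitude $x$ in the measurement model $\mathbf{y} = \boldsymbol{\psi}x + \mathbf{n}$ is $\hat{x} = (\boldsymbol{\psi}^H\mathbf{R}_{nn}^{-1}\boldsymbol{\psi})^{-1}\boldsymbol{\psi}^H\mathbf{R}_{nn}^{-1}\mathbf{y}$. The corresponding LMVDR estimator of the whitened noise $\mathbf{R}_{nn}^{-1/2}\mathbf{n}$ is $\mathbf{R}_{nn}^{-1/2}\hat{\mathbf{n}} = \mathbf{R}_{nn}^{-1/2}(\mathbf{y} - \boldsymbol{\psi}\hat{x})$. It is just a few steps of algebra to write this estimator as

$$\mathbf{R}_{nn}^{-1/2}\hat{\mathbf{n}} = (\mathbf{I} - \mathbf{P}_{\omega})\mathbf{w}, \tag{5.123}$$

where $\mathbf{w} = \mathbf{R}_{nn}^{-1/2}\mathbf{y}$ is a whitened version of the measurement, $\boldsymbol{\omega} = \mathbf{R}_{nn}^{-1/2}\boldsymbol{\psi}$ is a whitened version of $\boldsymbol{\psi}$, and $\mathbf{P}_{\omega} = (\boldsymbol{\omega}^H\boldsymbol{\omega})^{-1}\boldsymbol{\omega}\boldsymbol{\omega}^H$ is the orthogonal projector onto the subspace $\boldsymbol{\omega}$. The mean value of $\mathbf{R}_{nn}^{-1/2}\hat{\mathbf{n}}$ is the bias

$$\mathbf{b} = E[\mathbf{R}_{nn}^{-1/2}\hat{\mathbf{n}}] = (\mathbf{I} - \mathbf{P}_{\omega})\boldsymbol{\omega}_0 x, \tag{5.124}$$

with $\boldsymbol{\omega}_0 = \mathbf{R}_{nn}^{-1/2}\boldsymbol{\psi}_0$, and its covariance is $E[(\mathbf{R}_{nn}^{-1/2}\hat{\mathbf{n}} - \mathbf{b})(\mathbf{R}_{nn}^{-1/2}\hat{\mathbf{n}} - \mathbf{b})^H] = (\mathbf{I} - \mathbf{P}_{\omega})$. Thus the bias is zero when $\boldsymbol{\omega} = \boldsymbol{\omega}_0$, as expected. The output power of the goniometer is the quadratic form

$$\hat{\mathbf{n}}^H\mathbf{R}_{nn}^{-1}\hat{\mathbf{n}} = \mathbf{w}^H(\mathbf{I} - \mathbf{P}_{\omega})\mathbf{w} = \mathbf{y}^H\mathbf{R}_{nn}^{-1}\mathbf{y} - \boldsymbol{\omega}^H\boldsymbol{\omega}\,|\hat{x}|^2 = \mathbf{y}^H\mathbf{R}_{nn}^{-1}\mathbf{y} - \frac{|\boldsymbol{\psi}^H\mathbf{R}_{nn}^{-1}\mathbf{y}|^2}{\boldsymbol{\psi}^H\mathbf{R}_{nn}^{-1}\boldsymbol{\psi}}. \tag{5.125}$$

This is a biased estimator of the noise-free squared bias $\mathbf{b}^H\mathbf{b}$:

$$E(\hat{\mathbf{n}}^H\mathbf{R}_{nn}^{-1}\hat{\mathbf{n}}) = \operatorname{tr}(\mathbf{I} - \mathbf{P}_{\omega}) + \mathbf{b}^H\mathbf{b} = (m - 1) + \mathbf{b}^H\mathbf{b} = (m - 1) + |x|^2\,\boldsymbol{\omega}_0^H(\mathbf{I} - \mathbf{P}_{\omega})\boldsymbol{\omega}_0. \tag{5.126}$$

At the true value of the parameter $\boldsymbol{\psi}_0$ the mean value of the goniometer, or maximum-likelihood function, is $m - 1$. At other values of $\boldsymbol{\psi}$ the mean value is $(m - 1) + \mathbf{b}^H\mathbf{b}$. So the output power of the goniometer is scanned through candidate values of $\boldsymbol{\psi}$ to search for a minimum. On the right-hand side of (5.125), only the second term depends on $\boldsymbol{\psi}$. So a reasonable strategy for approximating the goniometer is to compute this second term, which is sometimes called the noncoherent adaptive matched filter. When $\boldsymbol{\psi}$ is a DFT vector, the adaptive matched filter is a spectrum analyzer or beamformer.$^{5}$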
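The scanning procedure can be sketched in a few lines. The example below, assuming NumPy, scans the goniometer output (5.125) over the $m$ DFT vectors for a single simulated measurement. The array size, the true bin `k0`, the amplitude `x`, and the AR(1)-type Toeplitz noise covariance are hypothetical values chosen so that the signal is well above the noise; `dft_vector` is a helper defined here, not a library function.

```python
import numpy as np

rng = np.random.default_rng(3)
m, k0 = 16, 5                      # array size and true DFT bin (hypothetical)
idx = np.arange(m)

def dft_vector(k):
    """Unit-norm DFT (steering) vector for bin k."""
    return np.exp(2j * np.pi * k * idx / m) / np.sqrt(m)

# Known colored-noise covariance: scaled Toeplitz matrix with entries 0.5^|i-j|.
R = 0.01 * 0.5 ** np.abs(idx[:, None] - idx[None, :])
L = np.linalg.cholesky(R)

x = 3.0 * np.exp(0.7j)             # unknown complex amplitude
z = (rng.standard_normal(m) + 1j * rng.standard_normal(m)) / np.sqrt(2)
y = dft_vector(k0) * x + L @ z     # the noise L z has covariance R

Rinv_y = np.linalg.solve(R, y)
first_term = np.real(y.conj() @ Rinv_y)          # y^H R^{-1} y

amf = np.empty(m)                  # second term of (5.125) for each candidate
for k in range(m):
    psi = dft_vector(k)
    q = np.real(psi.conj() @ np.linalg.solve(R, psi))
    amf[k] = np.abs(psi.conj() @ Rinv_y) ** 2 / q

power = first_term - amf           # goniometer output power (5.125)
k_hat = int(np.argmin(power))      # minimizing candidate
print("estimated bin:", k_hat)
```

Because the first term does not depend on the candidate, minimizing `power` is the same as maximizing `amf`, which is the adaptive matched filter statistic mentioned above.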

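5.6.5 Extension to WLMVDR receiver

Let us now extend the results for multi-rank LMVDR estimators to multi-rank WLMVDR estimators that account for complementary covariance.$^{6}$ The multi-rank measurement model (5.114) in augmented form is

$$\underline{\mathbf{y}} = \underline{\boldsymbol{\Psi}}\,\underline{\mathbf{x}} + \underline{\mathbf{n}}, \tag{5.127}$$

where

$$\underline{\boldsymbol{\Psi}} = \begin{bmatrix}\boldsymbol{\Psi} & \mathbf{0} \\ \mathbf{0} & \boldsymbol{\Psi}^*\end{bmatrix} \tag{5.128}$$

and the noise $\mathbf{n}$ is generally improper with augmented covariance matrix $\underline{\mathbf{R}}_{nn}$. The matched filter estimator of $\mathbf{x}$, however, does not take into account noise, and thus the widely linear matched filter solution is still the linear solution (5.115)

$$\hat{\underline{\mathbf{x}}} = \underline{\boldsymbol{\Psi}}^H\underline{\mathbf{y}} \iff \hat{\mathbf{x}} = \boldsymbol{\Psi}^H\mathbf{y}. \tag{5.129}$$

The whitened matched filter or WLMVDR estimator, on the other hand, is obtained as the solution to

$$\min_{\underline{\mathbf{W}}}\ \operatorname{tr}[\underline{\mathbf{W}}^H\underline{\mathbf{R}}_{nn}\underline{\mathbf{W}}] \quad \text{under constraint} \quad \underline{\mathbf{W}}^H\underline{\boldsymbol{\Psi}} = \mathbf{I} \tag{5.130}$$

with

$$\underline{\mathbf{W}} = \begin{bmatrix}\mathbf{W}_1 & \mathbf{W}_2 \\ \mathbf{W}_2^* & \mathbf{W}_1^*\end{bmatrix}. \tag{5.131}$$

This solution is widely linear,

$$\hat{\underline{\mathbf{x}}} = (\underline{\boldsymbol{\Psi}}^H\underline{\mathbf{R}}_{nn}^{-1}\underline{\boldsymbol{\Psi}})^{-1}\underline{\boldsymbol{\Psi}}^H\underline{\mathbf{R}}_{nn}^{-1}\underline{\mathbf{y}}, \tag{5.132}$$

with mean $\underline{\mathbf{x}}$ and augmented error covariance matrix

$$\underline{\mathbf{Q}} = E[(\hat{\underline{\mathbf{x}}} - \underline{\mathbf{x}})(\hat{\underline{\mathbf{x}}} - \underline{\mathbf{x}})^H] = (\underline{\boldsymbol{\Psi}}^H\underline{\mathbf{R}}_{nn}^{-1}\underline{\boldsymbol{\Psi}})^{-1}. \tag{5.133}$$

It requires a few steps of algebra to show that the variance of the WLMVDR estimator is indeed less than or equal to the variance of the LMVDR estimator. Alternatively, we can argue as follows. The optimization (5.130) is performed under the constraint $\underline{\mathbf{W}}^H\underline{\boldsymbol{\Psi}} = \mathbf{I}$, or, equivalently, $\mathbf{W}_1^H\boldsymbol{\Psi} = \mathbf{I}$ and $\mathbf{W}_2^H\boldsymbol{\Psi} = \mathbf{0}$. Thus, the WLMVDR optimization problem contains the LMVDR optimization problem as a special case in which $\mathbf{W}_2 = \mathbf{0}$ is enforced. Hence, WLMVDR estimation cannot be worse than LMVDR estimation. However, the additional degree of freedom of being able to choose $\mathbf{W}_2 \neq \mathbf{0}$ can reduce the variance if the noise $\mathbf{n}$ is improper.

The results from the preceding subsections – for instance, the generalized sidelobe canceler – apply in a straightforward manner to WLMVDR estimation. The reduction in variance of the WLMVDR compared with the LMVDR estimator is entirely due to exploiting the complementary correlation of the noise $\mathbf{n}$. Indeed, it is easy to see that, for proper noise $\mathbf{n}$, the WLMVDR solution (5.132) simplifies to the LMVDR solution (5.117). Since $\mathbf{x}$ is not assigned statistical properties, the solution is independent of whether or not $\mathbf{x}$ is improper. This stands in marked contrast to the case of WLMMSE estimation discussed in Section 5.4.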
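The claim that WLMVDR estimation cannot be worse than LMVDR estimation, and that the two coincide for proper noise, can be illustrated numerically. The sketch below assumes NumPy. The improper noise is generated by the hypothetical model $\mathbf{n} = \mathbf{A}\mathbf{z} + \mathbf{B}\mathbf{z}^* + \text{proper white noise}$ with proper white $\mathbf{z}$, which gives Hermitian covariance $\mathbf{A}\mathbf{A}^H + \mathbf{B}\mathbf{B}^H + 0.1\,\mathbf{I}$ and complementary covariance $\mathbf{A}\mathbf{B}^T + \mathbf{B}\mathbf{A}^T$; the function names are introduced here for illustration only.

```python
import numpy as np

def augmented_cov(R, Rc):
    """Augmented covariance [[R, Rc], [Rc*, R*]] from Hermitian R and complementary Rc."""
    return np.block([[R, Rc], [Rc.conj(), R.conj()]])

def lmvdr_cov(Psi, R):
    """Error covariance (Psi^H R^{-1} Psi)^{-1}, cf. (5.118)."""
    return np.linalg.inv(Psi.conj().T @ np.linalg.solve(R, Psi))

def wlmvdr_cov(Psi, R, Rc):
    """Northwest p x p block of the augmented error covariance (5.133)."""
    p = Psi.shape[1]
    Z = np.zeros_like(Psi)
    Psi_aug = np.block([[Psi, Z], [Z, Psi.conj()]])   # (5.128)
    return lmvdr_cov(Psi_aug, augmented_cov(R, Rc))[:p, :p]

rng = np.random.default_rng(4)
m, p = 6, 2  # hypothetical dimensions
Psi, _ = np.linalg.qr(rng.standard_normal((m, p)) + 1j * rng.standard_normal((m, p)))

A = rng.standard_normal((m, m)) + 1j * rng.standard_normal((m, m))
B = 0.8 * (rng.standard_normal((m, m)) + 1j * rng.standard_normal((m, m)))
R = A @ A.conj().T + B @ B.conj().T + 0.1 * np.eye(m)   # Hermitian covariance
Rc = A @ B.T + B @ A.T                                  # complementary covariance

Q_lin = lmvdr_cov(Psi, R)                               # ignores Rc
Q_wl = wlmvdr_cov(Psi, R, Rc)                           # exploits Rc
Q_wl_proper = wlmvdr_cov(Psi, R, np.zeros((m, m)))      # proper-noise special case

print("tr Q (LMVDR)  =", np.trace(Q_lin).real)
print("tr Q (WLMVDR) =", np.trace(Q_wl).real)
```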

5.7 Widely linear-quadratic estimation

While widely linear estimation is optimum in the Gaussian case, it may be improved upon for non-Gaussian data if we have access to higher-order statistics. The next logical extension of widely linear processing is to widely linear-quadratic (WLQ) processing, which requires statistical information up to fourth order.$^{7}$ Because we would like to keep the notation simple, we shall restrict our attention to systems with vector-valued input but scalar output.

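5.7.1 Connection between real and complex quadratic forms

What does the most general widely quadratic transformation look like? To answer this question, we begin with the real-valued quadratic form

$$u = \mathbf{w}^T\mathbf{M}\mathbf{w}, \tag{5.134}$$

$$\mathbf{w}^T = [\mathbf{a}^T, \mathbf{b}^T], \tag{5.135}$$

$$\mathbf{M} = \begin{bmatrix}\mathbf{M}_{11} & \mathbf{M}_{12} \\ \mathbf{M}_{21} & \mathbf{M}_{22}\end{bmatrix} = \begin{bmatrix}\mathbf{M}_{11} & \mathbf{M}_{12} \\ \mathbf{M}_{12}^T & \mathbf{M}_{22}\end{bmatrix}, \tag{5.136}$$

with $u \in \mathbb{R}$, $\mathbf{a}, \mathbf{b} \in \mathbb{R}^m$, and $\mathbf{M}_{11}, \mathbf{M}_{12}, \mathbf{M}_{21}, \mathbf{M}_{22} \in \mathbb{R}^{m\times m}$. Without loss of generality, we may assume that $\mathbf{M}$ is symmetric, and therefore $\mathbf{M}_{11} = \mathbf{M}_{11}^T$, $\mathbf{M}_{21} = \mathbf{M}_{12}^T$, and $\mathbf{M}_{22} = \mathbf{M}_{22}^T$. Using

$$\mathbf{T} = \begin{bmatrix}\mathbf{I} & j\mathbf{I} \\ \mathbf{I} & -j\mathbf{I}\end{bmatrix}, \qquad \mathbf{T}\mathbf{T}^H = \mathbf{T}^H\mathbf{T} = 2\mathbf{I}, \tag{5.137}$$

the complex version of this real-valued quadratic form is

$$u = \tfrac{1}{2}(\mathbf{w}^T\mathbf{T}^H)\left[\tfrac{1}{2}\mathbf{T}\mathbf{M}\mathbf{T}^H\right](\mathbf{T}\mathbf{w}) = \tfrac{1}{2}\,\underline{\mathbf{y}}^H\underline{\mathbf{N}}\,\underline{\mathbf{y}} \tag{5.138}$$

with $\mathbf{y} = \mathbf{a} + j\mathbf{b}$ and $\underline{\mathbf{y}} = \mathbf{T}\mathbf{w}$. There is the familiar connection

$$\underline{\mathbf{N}} = \tfrac{1}{2}\mathbf{T}\mathbf{M}\mathbf{T}^H = \begin{bmatrix}\mathbf{N}_1 & \mathbf{N}_2 \\ \mathbf{N}_2^* & \mathbf{N}_1^*\end{bmatrix}, \tag{5.139}$$

$$\mathbf{N}_1 = \tfrac{1}{2}[\mathbf{M}_{11} + \mathbf{M}_{22} + j(\mathbf{M}_{12}^T - \mathbf{M}_{12})], \tag{5.140}$$

$$\mathbf{N}_2 = \tfrac{1}{2}[\mathbf{M}_{11} - \mathbf{M}_{22} + j(\mathbf{M}_{12}^T + \mathbf{M}_{12})], \tag{5.141}$$

and $\underline{\mathbf{N}} \in \mathcal{W}^{m\times m}$ is an augmented Hermitian matrix, i.e., $\mathbf{N}_1^H = \mathbf{N}_1$ and $\mathbf{N}_2^T = \mathbf{N}_2$. In general, $\underline{\mathbf{N}}$ does not have to be positive or negative (semi)definite. However, if it is, $\underline{\mathbf{N}}$ will have the same definiteness property as $\mathbf{M}$. Because of the special structure of $\underline{\mathbf{N}}$, the quadratic form can be expressed as

$$u = \tfrac{1}{2}[\mathbf{y}^H\mathbf{N}_1\mathbf{y} + \mathbf{y}^T\mathbf{N}_1^*\mathbf{y}^* + \mathbf{y}^T\mathbf{N}_2^*\mathbf{y} + \mathbf{y}^H\mathbf{N}_2\mathbf{y}^*] = \mathbf{y}^H\mathbf{N}_1\mathbf{y} + \operatorname{Re}(\mathbf{y}^H\mathbf{N}_2\mathbf{y}^*). \tag{5.142}$$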
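The correspondence between the real quadratic form (5.134) and its complex versions (5.138) and (5.142) is easy to confirm numerically. A minimal sketch, assuming NumPy, with a randomly drawn symmetric $\mathbf{M}$ and random $\mathbf{a}$, $\mathbf{b}$ as illustrative inputs:

```python
import numpy as np

rng = np.random.default_rng(5)
m = 3  # hypothetical dimension

M = rng.standard_normal((2 * m, 2 * m))
M = (M + M.T) / 2                                # symmetric, as assumed in the text
M11, M12, M22 = M[:m, :m], M[:m, m:], M[m:, m:]

I = np.eye(m)
T = np.block([[I, 1j * I], [I, -1j * I]])        # (5.137)
N_aug = 0.5 * T @ M @ T.conj().T                 # (5.139)
N1 = 0.5 * (M11 + M22 + 1j * (M12.T - M12))      # (5.140)
N2 = 0.5 * (M11 - M22 + 1j * (M12.T + M12))      # (5.141)

a, b = rng.standard_normal(m), rng.standard_normal(m)
w = np.concatenate([a, b])
y = a + 1j * b
y_aug = T @ w                                    # equals [y, y*]

u_real = w @ M @ w                               # (5.134)
u_aug = 0.5 * (y_aug.conj() @ N_aug @ y_aug)     # (5.138)
u_split = (y.conj() @ N1 @ y).real + (y.conj() @ N2 @ y.conj()).real  # (5.142)

print(u_real, u_aug.real, u_split)
```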

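So far, we have considered only a quadratic system with real output $u$. We may combine two quadratic systems $\underline{\mathbf{N}}_r$ and $\underline{\mathbf{N}}_i$ to produce the real and imaginary parts, respectively, of a complex-valued quadratic form:

$$x = u + jv = \tfrac{1}{2}[\underline{\mathbf{y}}^H\underline{\mathbf{N}}_r\underline{\mathbf{y}} + j\,\underline{\mathbf{y}}^H\underline{\mathbf{N}}_i\underline{\mathbf{y}}] = \tfrac{1}{2}\,\underline{\mathbf{y}}^H(\underline{\mathbf{N}}_r + j\underline{\mathbf{N}}_i)\underline{\mathbf{y}}. \tag{5.143}$$

The resulting

$$\underline{\mathbf{H}} = \begin{bmatrix}\mathbf{H}_{11} & \mathbf{H}_{12} \\ \mathbf{H}_{21} & \mathbf{H}_{22}\end{bmatrix} = \underline{\mathbf{N}}_r + j\underline{\mathbf{N}}_i = \begin{bmatrix}\mathbf{N}_{r,1} + j\mathbf{N}_{i,1} & \mathbf{N}_{r,2} + j\mathbf{N}_{i,2} \\ \mathbf{N}_{r,2}^* + j\mathbf{N}_{i,2}^* & \mathbf{N}_{r,1}^* + j\mathbf{N}_{i,1}^*\end{bmatrix} \tag{5.144}$$

no longer has the block pattern of an augmented matrix, i.e., $\underline{\mathbf{H}} \notin \mathcal{W}^{m\times m}$, nor is it generally Hermitian. It does, however, satisfy $\mathbf{H}_{22} = \mathbf{H}_{11}^T$, $\mathbf{H}_{12} = \mathbf{H}_{12}^T$, and $\mathbf{H}_{21} = \mathbf{H}_{21}^T$. Thus, a general complex-valued quadratic form can be written as

$$x = \tfrac{1}{2}\,\underline{\mathbf{y}}^H\underline{\mathbf{H}}\,\underline{\mathbf{y}} = \tfrac{1}{2}[\mathbf{y}^H\mathbf{H}_{11}\mathbf{y} + \mathbf{y}^T\mathbf{H}_{11}^T\mathbf{y}^* + \mathbf{y}^T\mathbf{H}_{21}\mathbf{y} + \mathbf{y}^H\mathbf{H}_{12}\mathbf{y}^*] = \mathbf{y}^H\mathbf{H}_{11}\mathbf{y} + \tfrac{1}{2}[\mathbf{y}^T\mathbf{H}_{21}\mathbf{y} + \mathbf{y}^H\mathbf{H}_{12}\mathbf{y}^*]. \tag{5.145}$$

Combining our results so far, the complex-valued output of a WLQ system can be expressed as

$$x = c + \tfrac{1}{2}\,\underline{\mathbf{g}}^H\underline{\mathbf{y}} + \tfrac{1}{2}\,\underline{\mathbf{y}}^H\underline{\mathbf{H}}\,\underline{\mathbf{y}} = c + \tfrac{1}{2}[\mathbf{g}_1^H\mathbf{y} + \mathbf{g}_2^H\mathbf{y}^*] + \mathbf{y}^H\mathbf{H}_{11}\mathbf{y} + \tfrac{1}{2}[\mathbf{y}^T\mathbf{H}_{21}\mathbf{y} + \mathbf{y}^H\mathbf{H}_{12}\mathbf{y}^*], \tag{5.146}$$

where $c$ is a complex constant, and $\underline{\mathbf{g}}^H\underline{\mathbf{y}}$ with $\underline{\mathbf{g}}^H = [\mathbf{g}_1^H, \mathbf{g}_2^H] \in \mathbb{C}^{2m}$ constitutes the widely linear part and $\underline{\mathbf{y}}^H\underline{\mathbf{H}}\,\underline{\mathbf{y}}$ the widely quadratic part.

If $x$ is real, then $\mathbf{g}_1 = \mathbf{g}_2^*$, i.e., $\underline{\mathbf{g}}^H = [\mathbf{g}_1^H, \mathbf{g}_1^T] \in \mathbb{C}_*^{2m}$ has the structure of an augmented vector, and $\mathbf{H}_{11} = \mathbf{H}_{11}^H$, $\mathbf{H}_{21} = \mathbf{H}_{12}^*$, i.e., $\underline{\mathbf{H}}$ has the structure of an augmented matrix (where $\mathbf{N}_1 = \mathbf{H}_{11}$ and $\mathbf{N}_2 = \mathbf{H}_{12}$). Thus, for real $x$,

$$x = u = c + \tfrac{1}{2}\,\underline{\mathbf{g}}^H\underline{\mathbf{y}} + \tfrac{1}{2}\,\underline{\mathbf{y}}^H\underline{\mathbf{N}}\,\underline{\mathbf{y}} = c + \operatorname{Re}(\mathbf{g}_1^H\mathbf{y}) + \mathbf{y}^H\mathbf{N}_1\mathbf{y} + \operatorname{Re}(\mathbf{y}^H\mathbf{N}_2\mathbf{y}^*). \tag{5.147}$$

We may consider a pair $[\underline{\mathbf{g}}, \underline{\mathbf{H}}]$ to be a vector in a linear space,$^{8}$ with addition defined by

$$[\underline{\mathbf{g}}, \underline{\mathbf{H}}] + [\underline{\mathbf{g}}', \underline{\mathbf{H}}'] = [\underline{\mathbf{g}} + \underline{\mathbf{g}}', \underline{\mathbf{H}} + \underline{\mathbf{H}}'] \tag{5.148}$$

and multiplication by a complex scalar $a$ by

$$a[\underline{\mathbf{g}}, \underline{\mathbf{H}}] = [a\underline{\mathbf{g}}, a\underline{\mathbf{H}}]. \tag{5.149}$$

The inner product in this space is defined by

$$\langle[\underline{\mathbf{g}}, \underline{\mathbf{H}}], [\underline{\mathbf{g}}', \underline{\mathbf{H}}']\rangle = \underline{\mathbf{g}}^H\underline{\mathbf{g}}' + \operatorname{tr}(\underline{\mathbf{H}}^H\underline{\mathbf{H}}'). \tag{5.150}$$

5.7.2 WLQMMSE estimation

We would now like to estimate a scalar complex signal $x\colon \Omega \to \mathbb{C}$ from a measurement $\mathbf{y}\colon \Omega \to \mathbb{C}^m$, using the WLQ estimator

$$\hat{x} = c + \underline{\mathbf{g}}^H\underline{\mathbf{y}} + \underline{\mathbf{y}}^H\underline{\mathbf{H}}\,\underline{\mathbf{y}}. \tag{5.151}$$

For convenience, we have absorbed the factors of 1/2 in (5.146) into $\underline{\mathbf{g}} \in \mathbb{C}^{2m}$ and $\underline{\mathbf{H}} \in \mathbb{C}^{2m\times 2m}$. We derive the WLQ estimator as a fairly straightforward extension of Picinbono and Duvaut (1988) to the complex noncircular case, which avoids the use of tensor notation. We assume that both $x$ and $\mathbf{y}$ have zero mean. In order to ensure that $\hat{x}$ has zero mean as well, we need to choose $c = -\operatorname{tr}(\underline{\mathbf{R}}_{yy}\underline{\mathbf{H}})$ so that

$$\hat{x} = \underline{\mathbf{g}}^H\underline{\mathbf{y}} + \underline{\mathbf{y}}^H\underline{\mathbf{H}}\,\underline{\mathbf{y}} - \operatorname{tr}(\underline{\mathbf{R}}_{yy}\underline{\mathbf{H}}). \tag{5.152}$$

Using the definition of the inner product (5.150), the estimator is

$$\hat{x} = \langle[\underline{\mathbf{g}}, \underline{\mathbf{H}}^H], [\underline{\mathbf{y}}, \underline{\mathbf{y}}\,\underline{\mathbf{y}}^H - \underline{\mathbf{R}}_{yy}]\rangle. \tag{5.153}$$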

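In order to make $\hat{x}$ a WLQ minimum mean-squared error (WLQMMSE) estimator that minimizes $E|\hat{x} - x|^2$, we apply the orthogonality principle of Result 5.2. That is,

$$(\hat{x} - x) \perp \hat{x}', \tag{5.154}$$

where

$$\hat{x}' = \underline{\mathbf{g}}'^H\underline{\mathbf{y}} + \underline{\mathbf{y}}^H\underline{\mathbf{H}}'\underline{\mathbf{y}} - \operatorname{tr}(\underline{\mathbf{R}}_{yy}\underline{\mathbf{H}}') \tag{5.155}$$

is any WLQ estimator with arbitrary $[\underline{\mathbf{g}}', \underline{\mathbf{H}}']$. The orthogonality in (5.154) can be expressed as

$$E\{(\hat{x} - x)^*\hat{x}'\} = 0, \quad \text{for all } \underline{\mathbf{g}}' \text{ and } \underline{\mathbf{H}}'. \tag{5.156}$$

We now obtain

$$E(x^*\hat{x}') = E(x^*\underline{\mathbf{g}}'^H\underline{\mathbf{y}}) + E(x^*\underline{\mathbf{y}}^H\underline{\mathbf{H}}'\underline{\mathbf{y}}) - E(x^*)\operatorname{tr}(\underline{\mathbf{R}}_{yy}\underline{\mathbf{H}}'), \tag{5.157}$$

and, because $x$ has zero mean,

$$E(x^*\hat{x}') = \underline{\mathbf{g}}'^H E(x^*\underline{\mathbf{y}}) + \operatorname{tr}[\underline{\mathbf{H}}' E(x^*\underline{\mathbf{y}}\,\underline{\mathbf{y}}^H)] = \langle[\underline{\mathbf{g}}', \underline{\mathbf{H}}'^H], [E(x^*\underline{\mathbf{y}}), E(x^*\underline{\mathbf{y}}\,\underline{\mathbf{y}}^H)]\rangle. \tag{5.158}$$

Because $\mathbf{y}$ has zero mean, we find

$$E(\hat{x}'\hat{x}^*) = \underline{\mathbf{g}}'^H E[\underline{\mathbf{y}}\,\underline{\mathbf{g}}^T\underline{\mathbf{y}}^* + \underline{\mathbf{y}}\,\underline{\mathbf{y}}^T\underline{\mathbf{H}}^*\underline{\mathbf{y}}^*] + E[\underline{\mathbf{y}}^H\underline{\mathbf{H}}'\underline{\mathbf{y}}\,\underline{\mathbf{g}}^T\underline{\mathbf{y}}^* + \underline{\mathbf{y}}^H\underline{\mathbf{H}}'\underline{\mathbf{y}}\,\underline{\mathbf{y}}^T\underline{\mathbf{H}}^*\underline{\mathbf{y}}^* - \underline{\mathbf{y}}^H\underline{\mathbf{H}}'\underline{\mathbf{y}}\operatorname{tr}(\underline{\mathbf{R}}_{yy}^*\underline{\mathbf{H}}^*)]. \tag{5.159}$$

By repeatedly using the permutation property of the trace, $\operatorname{tr}(\mathbf{A}\mathbf{B}) = \operatorname{tr}(\mathbf{B}\mathbf{A})$, we obtain

$$\begin{aligned} E(\hat{x}'\hat{x}^*) &= \underline{\mathbf{g}}'^H E[\underline{\mathbf{y}}\,\underline{\mathbf{g}}^T\underline{\mathbf{y}}^* + \underline{\mathbf{y}}\,\underline{\mathbf{y}}^T\underline{\mathbf{H}}^*\underline{\mathbf{y}}^*] + \operatorname{tr}\{\underline{\mathbf{H}}' E(\underline{\mathbf{y}}\,\underline{\mathbf{g}}^T\underline{\mathbf{y}}^*\underline{\mathbf{y}}^H + \underline{\mathbf{y}}\,\underline{\mathbf{y}}^T\underline{\mathbf{H}}^*\underline{\mathbf{y}}^*\underline{\mathbf{y}}^H - \operatorname{tr}(\underline{\mathbf{R}}_{yy}^*\underline{\mathbf{H}}^*)\underline{\mathbf{y}}\,\underline{\mathbf{y}}^H)\} \\ &= \langle[\underline{\mathbf{g}}', \underline{\mathbf{H}}'^H], [\underline{\mathbf{R}}_{yy}\underline{\mathbf{g}} + E(\underline{\mathbf{y}}\,\underline{\mathbf{y}}^T\underline{\mathbf{H}}^*\underline{\mathbf{y}}^*),\ E(\underline{\mathbf{y}}\,\underline{\mathbf{y}}^H\underline{\mathbf{y}}^H\underline{\mathbf{g}}) + E(\underline{\mathbf{y}}\,\underline{\mathbf{y}}^H\underline{\mathbf{y}}^T\underline{\mathbf{H}}^*\underline{\mathbf{y}}^*) - \operatorname{tr}(\underline{\mathbf{R}}_{yy}^*\underline{\mathbf{H}}^*)\underline{\mathbf{R}}_{yy}]\rangle \qquad (5.160) \\ &= \langle[\underline{\mathbf{g}}', \underline{\mathbf{H}}'^H], \mathcal{K}[\underline{\mathbf{g}}, \underline{\mathbf{H}}]\rangle. \qquad (5.161) \end{aligned}$$

In the last equation, we have introduced the positive definite operator $\mathcal{K}$, which is defined in terms of the second-, third-, and fourth-order moments of $\mathbf{y}$. It acts on $[\underline{\mathbf{g}}, \underline{\mathbf{H}}]$ as in (5.160). We can find an explicit expression for this. Let $\underline{y}_i$, $i = 1, \ldots, 2m$, denote the $i$th element of $\underline{\mathbf{y}}$, i.e., $\underline{y}_i = y_i$, $i = 1, \ldots, m$, and $\underline{y}_i = y_{i-m}^*$, $i = m + 1, \ldots, 2m$. Also let $(\underline{\mathbf{R}}_{yy})_{ij}$ denote the $(i, j)$th element of $\underline{\mathbf{R}}_{yy}$ and $(\underline{\mathbf{R}}_{yy})_i$ the $i$th row of $\underline{\mathbf{R}}_{yy}$. Then

$$\mathcal{K}[\underline{\mathbf{g}}, \underline{\mathbf{H}}] = [\underline{\mathbf{p}}, \underline{\mathbf{N}}] \tag{5.162}$$

with

$$\begin{aligned} p_i &= (\underline{\mathbf{R}}_{yy})_i\,\underline{\mathbf{g}} + \sum_{k=1}^{2m}\sum_{l=1}^{2m} E(\underline{y}_i\underline{y}_k\underline{y}_l^*)H_{kl}^*, \quad i = 1, \ldots, 2m, \\ N_{ij} &= \sum_{k=1}^{2m} E(\underline{y}_i\underline{y}_j^*\underline{y}_k^*)g_k + \sum_{k=1}^{2m}\sum_{l=1}^{2m} E(\underline{y}_i\underline{y}_j^*\underline{y}_k\underline{y}_l^*)H_{kl}^* - \operatorname{tr}(\underline{\mathbf{R}}_{yy}^*\underline{\mathbf{H}}^*)(\underline{\mathbf{R}}_{yy})_{ij} \\ &= \sum_{k=1}^{2m} E(\underline{y}_i\underline{y}_j^*\underline{y}_k^*)g_k + \sum_{k=1}^{2m}\sum_{l=1}^{2m}[E(\underline{y}_i\underline{y}_j^*\underline{y}_k\underline{y}_l^*) - E(\underline{y}_i\underline{y}_j^*)E(\underline{y}_k\underline{y}_l^*)]H_{kl}^*, \quad i, j = 1, \ldots, 2m. \end{aligned} \tag{5.163}$$

Putting all the pieces together, (5.156) becomes

$$\langle[\underline{\mathbf{g}}', \underline{\mathbf{H}}'^H], \mathcal{K}[\underline{\mathbf{g}}, \underline{\mathbf{H}}] - [E(x^*\underline{\mathbf{y}}), E(x^*\underline{\mathbf{y}}\,\underline{\mathbf{y}}^H)]\rangle = 0. \tag{5.164}$$

Since this must hold for all $\underline{\mathbf{g}}'$ and $\underline{\mathbf{H}}'$, this implies

$$\mathcal{K}[\underline{\mathbf{g}}, \underline{\mathbf{H}}] - [E(x^*\underline{\mathbf{y}}), E(x^*\underline{\mathbf{y}}\,\underline{\mathbf{y}}^H)] = 0, \tag{5.165}$$

$$[\underline{\mathbf{g}}, \underline{\mathbf{H}}] = \mathcal{K}^{-1}[E(x^*\underline{\mathbf{y}}), E(x^*\underline{\mathbf{y}}\,\underline{\mathbf{y}}^H)]. \tag{5.166}$$

The MSE achieved by this WLQMMSE estimator is

$$E|\hat{x} - x|^2 = E|x|^2 - \langle[E(x^*\underline{\mathbf{y}}), E(x^*\underline{\mathbf{y}}\,\underline{\mathbf{y}}^H)], [\underline{\mathbf{g}}, \underline{\mathbf{H}}]\rangle = E|x|^2 - \langle[E(x^*\underline{\mathbf{y}}), E(x^*\underline{\mathbf{y}}\,\underline{\mathbf{y}}^H)], \mathcal{K}^{-1}[E(x^*\underline{\mathbf{y}}), E(x^*\underline{\mathbf{y}}\,\underline{\mathbf{y}}^H)]\rangle. \tag{5.167}$$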

Two remarks are in order. First, the widely linear part $\underline{\mathbf{g}}$ of a WLQMMSE filter is not the WLMMSE filter determined in Result 5.3. Second, it is worth pointing out that the WLQMMSE filter depends on the complete statistical information up to fourth order. That is, all conjugation patterns of the second-, third-, and fourth-order moments are required.

Example 5.4. Consider, for instance, the third-order moments. The formula (5.163) implicitly utilizes all eight conjugation patterns $E(y_iy_jy_k)$, $E(y_i^*y_jy_k)$, $E(y_iy_j^*y_k)$, $E(y_iy_jy_k^*)$, $E(y_i^*y_j^*y_k)$, $E(y_i^*y_jy_k^*)$, $E(y_iy_j^*y_k^*)$, and $E(y_i^*y_j^*y_k^*)$. Of course, only two of these eight conjugation patterns are distinct, due to permutation, e.g., $E(y_i^*y_jy_k) = E(y_jy_ky_i^*)$, or complex conjugation, e.g., $E(y_iy_j^*y_k) = [E(y_i^*y_jy_k^*)]^*$. We may consider $E(\underline{y}_i\underline{y}_j\underline{y}_k)$ with $i, j, k = 1, \ldots, 2m$ as a $2m \times 2m \times 2m$ array, more precisely called a tensor of order 3. This tensor is depicted in Fig. 5.7, and it has many symmetry relationships. Each $m \times m \times m$ cube is the complex conjugate of the cube with which it does not share a side or edge.

[Figure 5.7: Tensor of third-order moments. The $2m \times 2m \times 2m$ array is drawn along axes $i$, $j$, $k$ as eight $m \times m \times m$ cubes, each labeled with its conjugation pattern ($y_iy_jy_k$, $y_iy_jy_k^*$, $y_iy_j^*y_k$, $y_i^*y_jy_k$, $y_i^*y_j^*y_k$, $y_iy_j^*y_k^*$, $y_i^*y_j^*y_k^*$, ...).]

A common linear estimation scenario is the estimation of a signal in second-order white (and proper) noise $\mathbf{n}$ whose augmented covariance matrix is $\underline{\mathbf{R}}_{nn} = \mathbf{I}$. For WLQMMSE estimation, we need to know statistical properties of the noise up to fourth order. There exists a wide variety of higher-order noise models but the simplest model is fourth-order white (and circular) noise, which is defined as follows.

Definition 5.1. A random vector $\mathbf{n}$ is called fourth-order white (and circular) if the only nonzero moments up to fourth order are

$$E(n_in_j^*) = \delta_{ij}, \tag{5.168}$$

$$E(n_in_jn_k^*n_l^*) = \delta_{ik}\delta_{jl} + \delta_{il}\delta_{jk} + K\delta_{ijkl}, \tag{5.169}$$

where $\delta_{ijkl} = 1$ if $i = j = k = l$ and zero otherwise, and $K$ is the kurtosis of the noise. This means that all moments must have the same number of conjugated and nonconjugated terms. This type of noise is "Gaussian-like" because it shares some of the properties of proper Gaussian random vectors (compare this with Result 2.10). However, Gaussian noise always has zero kurtosis, $K = 0$, whereas fourth-order white noise is allowed to have $K \neq 0$.
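As a small illustration of Definition 5.1, consider a noise component drawn uniformly from a unit-power 8-PSK alphabet. This alphabet is an assumption introduced here and does not appear in the text. The sketch below, assuming NumPy, computes the exact scalar moments $E[n^a (n^*)^b]$ for $1 \le a + b \le 4$ by averaging over the eight symbols. For a vector with independent components of this kind, the cross-moments factor into products of these scalar moments.

```python
import numpy as np
from itertools import product

# Unit-power 8-PSK alphabet (hypothetical noise model used for illustration).
symbols = np.exp(2j * np.pi * np.arange(8) / 8)

def moment(num_plain, num_conj):
    """Exact E[n^a (n*)^b] for a symbol drawn uniformly from the alphabet."""
    return np.mean(symbols ** num_plain * symbols.conj() ** num_conj)

# All moments of total order 1 to 4, indexed by (a, b).
moments = {
    (a, b): moment(a, b)
    for a, b in product(range(5), repeat=2)
    if 1 <= a + b <= 4
}

# With i = j = k = l, (5.169) reads E|n|^4 = 2 + K.
kurtosis = moments[(2, 2)].real - 2.0

for (a, b), value in sorted(moments.items()):
    print(f"E[n^{a} (n*)^{b}] = {value:.3f}")
print("kurtosis K =", kurtosis)
```

Only the moments with equal numbers of conjugated and nonconjugated factors are nonzero here, and the kurtosis is negative, so this component is "Gaussian-like" in its conjugation pattern but not Gaussian.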

Notes

1 The introduction of channel, analysis, and synthesis models for the representation of composite covariance matrices of signal and measurement is fairly standard, but we have been guided by Mullis and Scharf (1996).

2 The solution of LMMSE problems requires the solution of normal equations $\mathbf{W}\mathbf{R}_{yy} = \mathbf{R}_{xy}$. There has been a flurry of interest in "multistage" approximations to these equations, beginning with the work of Goldstein et al. (1998). The connection between multistage and conjugate gradient algorithms has been clarified by Dietl et al. (2001), Weippert et al. (2002), and Scharf et al. (2008). The attraction of these algorithms is that they converge rapidly (in $N$ steps) if the matrix $\mathbf{R}_{yy}$ has only a small number $N$ of distinct eigenvalues.

3 WLMMSE estimation was introduced by Picinbono and Chevalier (1995), but widely linear (or linear–linear-conjugate) filtering had been considered already by Brown and Crane (1969). The performance comparison between LMMSE and WLMMSE estimation is due to Schreier et al. (2005).

4 Our discussion of reduced-rank estimators follows Scharf (1991), Hua et al. (2001), and Schreier and Scharf (2003a). Canonical coordinate systems are optimum even for the more general problem of transform coding for noisy sources (under some restrictive assumptions), which has been shown by Schreier and Scharf (2006a). This, however, is not considered in this chapter.

5 There are many variations on the goniometer that aim to adapt to unknown noise covariance. They go by the name "adaptive beamforming" or "Capon beamforming," after Jack Capon, who first advocated alternatives to conventional beamforming. Capon applied his methods to the processing of multi-sensor array measurements.

6 The WLMVDR receiver was introduced by McWhorter and Schreier (2003), where it was applied to widely linear beamforming, and then analyzed in more detail by Chevalier and Blin (2007).

7 Widely linear-quadratic systems for detection and array processing were introduced by Chevalier and Picinbono (1996).

8 The linear space consisting of vectors $[\underline{\mathbf{g}}, \underline{\mathbf{H}}]$ is the direct sum of the Hilbert spaces $\mathbb{C}^{2m}$ and $\mathbb{C}^{2m\times 2m}$, usually denoted as $\mathbb{C}^{2m} \oplus \mathbb{C}^{2m\times 2m}$.

6

Performance bounds for parameter estimation

All parameter estimation begins with a measurement and an algorithm for extracting a parameter estimate from the measurement. The algorithm is the estimator. There are two ways to think about performance analysis. One way is to begin with a particular estimator and then to compute its performance. Typically this would amount to computing the bias of the estimator and its error covariance matrix. The practitioner then draws or analyzes concentration ellipsoids to decide whether or not the estimator meets specifications. But the other, more general, way is to establish a limit on the accuracy of any estimator of the parameter. We might call this a uniform limit, uniform over an entire class of estimators. Such a limit would speak to the information that the measurement carries about the underlying parameter, independently of how the information is extracted. Performance bounds are fundamental to signal processing because they tell us when the number and quality of spatial, temporal, or spatial–temporal measurements is sufficient to meet performance specifications. That is, these general bounds speak to the quality of the experiment or the sensing schema itself, rather than to the subsequent signal processing. If the sensing scheme carries insufficient information about the underlying parameter, then no amount of sophisticated signal processing can extract information that is not there. In other words, if the bound says that the error covariance is larger than specifications require, then the experiment or measurement scheme must be redesigned. But there is a cautionary note here: the performance bounds we will derive are lower bounds on variance, covariance, and mean-squared error. They are not necessarily tight. So, even if the bound is smaller than specifications require, there is still no guarantee that you will be able to find an estimator that achieves this bound. So the bound may meet variance specifications, but your estimator might not. 
Peter Schultheiss has called performance bounds spoiled-sport results because they establish unwelcome limits on what can be achieved and because they can be used to bring into question performance claims that violate these fundamental bounds. In this book, measurements are complex-valued and parameters are complex-valued. So the problem is to extract an estimate of a complex parameter from a complex measurement. As we have seen in previous chapters, we need to consider the correlation structure for errors and measurements and their complex conjugates, since this enables us to account for complementary correlation.


In this chapter, we shall consider quadratic performance bounds of the Weiss–Weinstein class, which were introduced by Weiss and Weinstein (1985). The idea behind this class of bounds is to establish a lower bound on the error covariance for a parameter estimator by considering an approximation of the estimator error from a linear function of a score function. This is very sophisticated reasoning: "If I use a linear function of a score function to approximate the estimator error, then maybe I can use the error covariance of this approximation to derive a bound on the error covariance for any estimator." Of course, the bound will depend on the score function that is chosen for the approximation. The idea is then to choose good scores, and among these are the Fisher, Bhattacharyya, Ziv–Zakai, Barankin, and Bobrovsky–Zakai scores. In our extension of this reasoning, we will approximate the error using a widely linear function of the score function.

We shall develop a frequentist theory of quadratic performance bounding and a Bayesian theory. Frequentist bounds are sometimes called deterministic bounds and Bayesian bounds are sometimes called stochastic bounds. For a frequentist the underlying assumption is that an experiment produces a measurement $\mathbf{y}$ that is drawn from a distribution $P_{\boldsymbol{\theta}}(\mathbf{y})$. The parameter $\boldsymbol{\theta}$ of this distribution is unknown, and to be estimated. Repeated trials of the experiment would consist of repeated draws of the measurements $\{\mathbf{y}_i\}$ from the same distribution, with $\boldsymbol{\theta}$ fixed. For a Bayesian the underlying assumption is that a random parameter and a random measurement $(\mathbf{y}, \boldsymbol{\theta})$ are drawn from a joint distribution $P(\mathbf{y}, \boldsymbol{\theta})$. Repeated trials of the experiment would consist of repeated draws of the measurement and parameter pairs $(\{\mathbf{y}_i\}, \boldsymbol{\theta})$, with the parameter $\boldsymbol{\theta}$ a single, fixed, random draw. In some cases the repeated draws are $(\{\mathbf{y}_i, \boldsymbol{\theta}_i\})$.

For the frequentist and the Bayesian there is a density for the measurement, indexed by the parameter, but for the Bayesian there is also a prior distribution that may concentrate the probability of the parameter. For the frequentist there is none.$^{1}$

The organization of this chapter is as follows. Section 6.1 introduces the frequentist and Bayesian concepts and establishes notation. Section 6.2 discusses quadratic frequentist bounds in general, and Section 6.3 presents the most common frequentist bound, the Cramér–Rao bound. Bayesian bounds are investigated in Section 6.4, and the best-known Bayesian bound, the Fisher–Bayes bound (or "stochastic Cramér–Rao bound"), is derived in Section 6.5. When these bounds are specialized to the multivariate Gaussian model, then the formulae of this chapter reproduce or generalize those of a great number of authors who have computed quadratic bounds for spectrum analysis and beamforming. Finally, Section 6.6 establishes connections and orderings among the bounds.

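6.1 Frequentists and Bayesians

Frequentists model the complex measurement as a random vector $\mathbf{y}\colon \Omega \to \mathbb{C}^n$ with probability distribution $P_{\boldsymbol{\theta}}(\mathbf{y})$ and probability density function (pdf) $p_{\boldsymbol{\theta}}(\mathbf{y})$. These depend on the parameter $\boldsymbol{\theta}$, which is unknown but known to belong to the set $\Theta \subseteq \mathbb{C}^p$. As explained in Section 2.3, the probability distribution and pdf of a complex random vector are interpreted as the joint distribution and pdf of its real and imaginary parts. Typically, and throughout this chapter, the random vector $\mathbf{y}$ is continuous. If $\mathbf{y}$ contains discrete components, the pdf can be expressed using Dirac $\delta$-functions.

Bayesians regard both the measurement $\mathbf{y}\colon \Omega \to \mathbb{C}^n$ and the parameter $\boldsymbol{\theta}\colon \Omega \to \mathbb{C}^p$ as random vectors defined on the same sample space. They are given a joint probability distribution $P(\mathbf{y}, \boldsymbol{\theta})$, and pdf $p(\mathbf{y}, \boldsymbol{\theta})$. The joint pdf may be expressed as

$$p(\mathbf{y}, \boldsymbol{\theta}) = p_{\boldsymbol{\theta}}(\mathbf{y})\,p(\boldsymbol{\theta}), \tag{6.1}$$

where $p_{\boldsymbol{\theta}}(\mathbf{y}) = p(\mathbf{y}|\boldsymbol{\theta})$ is the conditional measurement pdf and $p(\boldsymbol{\theta})$ is the prior pdf on the random parameter $\boldsymbol{\theta}$. So, when $\boldsymbol{\theta}$ is deterministic, $p_{\boldsymbol{\theta}}(\mathbf{y})$ denotes a parameterized pdf. When it is random, $p_{\boldsymbol{\theta}}(\mathbf{y})$ denotes a conditional pdf. We do not make any distinction in our notation, since context clarifies which meaning is to be taken.

In our development of performance bounds on estimators of the parameter $\boldsymbol{\theta}$ we will construct error scores and measurement scores. These are functions of the measurement or of the measurement and parameter. Moreover, later on we shall augment these with their complex conjugates so that we can construct bounds based on widely linear estimates of error scores from measurement scores. We work with the space of square-integrable measurable functions (i.e., their Hermitian second-order moments are finite) of the complex random vector $\mathbf{y}$, or of the complex random vector $(\mathbf{y}, \boldsymbol{\theta})$, and assume all errors and scores to be elements of this space. Of course, a finite Hermitian second-order moment ensures a finite complementary second-order moment.

Definition 6.1. Throughout this chapter we define

$$\int_{\mathbb{C}^n} \mathbf{f}(\mathbf{y})\,d\mathbf{y} \triangleq \int_{\mathbb{R}^{2n}} \operatorname{Re}[\mathbf{f}(\mathbf{y})]\,d[\operatorname{Re}\mathbf{y}]\,d[\operatorname{Im}\mathbf{y}] + j\int_{\mathbb{R}^{2n}} \operatorname{Im}[\mathbf{f}(\mathbf{y})]\,d[\operatorname{Re}\mathbf{y}]\,d[\operatorname{Im}\mathbf{y}]. \tag{6.2}$$

That is, an integral with respect to a complex variable is the integral with respect to its real and imaginary parts. The mean value of any function $\mathbf{g}(\mathbf{y})$, computed with respect to the parameterized pdf $p_{\boldsymbol{\theta}}(\mathbf{y})$, is

$$E_{\boldsymbol{\theta}}[\mathbf{g}(\mathbf{y})] = \int_{\mathbb{C}^n} \mathbf{g}(\mathbf{y})\,p_{\boldsymbol{\theta}}(\mathbf{y})\,d\mathbf{y}. \tag{6.3}$$

The mean value of any function $\mathbf{g}(\mathbf{y}, \boldsymbol{\theta})$, computed with respect to the joint pdf $p(\mathbf{y}, \boldsymbol{\theta})$, is

$$E[\mathbf{g}(\mathbf{y}, \boldsymbol{\theta})] = \int_{\mathbb{C}^p\times\mathbb{C}^n} \mathbf{g}(\mathbf{y}, \boldsymbol{\theta})\,p(\mathbf{y}, \boldsymbol{\theta})\,d\mathbf{y}\,d\boldsymbol{\theta} = \int_{\mathbb{C}^p}\left[\int_{\mathbb{C}^n} \mathbf{g}(\mathbf{y}, \boldsymbol{\theta})\,p_{\boldsymbol{\theta}}(\mathbf{y})\,d\mathbf{y}\right]p(\boldsymbol{\theta})\,d\boldsymbol{\theta} = E\{E_{\boldsymbol{\theta}}[\mathbf{g}(\mathbf{y}, \boldsymbol{\theta})]\}. \tag{6.4}$$

In this notation, $E_{\boldsymbol{\theta}}[\mathbf{g}(\mathbf{y}, \boldsymbol{\theta})]$ is a conditional expectation with respect to the conditional pdf $p_{\boldsymbol{\theta}}(\mathbf{y})$, and the outer $E$ is the expectation over the distribution of $\boldsymbol{\theta}$.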

Figure 6.1 Illustration of bias $\mathbf{b}$, error $\boldsymbol{\varepsilon}$, and centered error $\mathbf{e}$. The ellipsoid is the locus of centered errors for which $\mathbf{e}^H \mathbf{Q}^{-1} \mathbf{e}$ is constant.

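6.1.1 Bias, error covariance, and mean-squared error

We wish to estimate the parameter $\boldsymbol{\theta}$ with the estimator $\hat{\boldsymbol{\theta}}(\mathbf{y})$. From the frequentist point of view, the problem is to determine the pdf $p_{\boldsymbol{\theta}}(\mathbf{y})$, from the set of pdfs $\{p_{\boldsymbol{\theta}}(\mathbf{y}),\ \boldsymbol{\theta} \in \Theta \subseteq \mathbb{C}^p\}$, that best describes the pdf from which the random measurement $\mathbf{y}$ was drawn. The error of the estimator is taken to be

$$\boldsymbol{\varepsilon}_{\boldsymbol{\theta}}(\mathbf{y}) = \hat{\boldsymbol{\theta}}(\mathbf{y}) - \boldsymbol{\theta} \tag{6.5}$$

and the parameter-dependent frequentist bias of the estimator error is defined to be

$$\mathbf{b}(\boldsymbol{\theta}) = E_{\boldsymbol{\theta}}[\boldsymbol{\varepsilon}_{\boldsymbol{\theta}}(\mathbf{y})] = E_{\boldsymbol{\theta}}[\hat{\boldsymbol{\theta}}(\mathbf{y})] - \boldsymbol{\theta}. \tag{6.6}$$

The error then has representation $\boldsymbol{\varepsilon}_{\boldsymbol{\theta}}(\mathbf{y}) = \mathbf{e}_{\boldsymbol{\theta}}(\mathbf{y}) + \mathbf{b}(\boldsymbol{\theta})$, where

$$\mathbf{e}_{\boldsymbol{\theta}}(\mathbf{y}) = \boldsymbol{\varepsilon}_{\boldsymbol{\theta}}(\mathbf{y}) - \mathbf{b}(\boldsymbol{\theta}) = \hat{\boldsymbol{\theta}}(\mathbf{y}) - E_{\boldsymbol{\theta}}[\hat{\boldsymbol{\theta}}(\mathbf{y})] \tag{6.7}$$

is called the centered error score. We shall define the function $\boldsymbol{\sigma}_{\boldsymbol{\theta}}(\mathbf{y}) = [\sigma_1(\mathbf{y}), \sigma_2(\mathbf{y}), \ldots, \sigma_m(\mathbf{y})]^T$ to be an $m$-dimensional vector of complex scores. The mean of the score is $E_{\boldsymbol{\theta}}[\boldsymbol{\sigma}_{\boldsymbol{\theta}}(\mathbf{y})]$ and

$$\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{y}) = \boldsymbol{\sigma}_{\boldsymbol{\theta}}(\mathbf{y}) - E_{\boldsymbol{\theta}}[\boldsymbol{\sigma}_{\boldsymbol{\theta}}(\mathbf{y})] \tag{6.8}$$

is called the centered measurement score. Generally we think of the centered measurement score as a judiciously chosen function of the measurement that brings information about the centered error score $\mathbf{e}_{\boldsymbol{\theta}}(\mathbf{y})$. The frequentist defines the parameter-dependent mean-squared error matrix to be

$$\mathbf{M}(\boldsymbol{\theta}) = E_{\boldsymbol{\theta}}[\boldsymbol{\varepsilon}_{\boldsymbol{\theta}}(\mathbf{y})\boldsymbol{\varepsilon}_{\boldsymbol{\theta}}^H(\mathbf{y})] = E_{\boldsymbol{\theta}}[\mathbf{e}_{\boldsymbol{\theta}}(\mathbf{y})\mathbf{e}_{\boldsymbol{\theta}}^H(\mathbf{y})] + \mathbf{b}(\boldsymbol{\theta})\mathbf{b}^H(\boldsymbol{\theta}) = \mathbf{Q}(\boldsymbol{\theta}) + \mathbf{b}(\boldsymbol{\theta})\mathbf{b}^H(\boldsymbol{\theta}), \tag{6.9}$$

where $\mathbf{Q}(\boldsymbol{\theta})$ is the frequentist error covariance

$$\mathbf{Q}(\boldsymbol{\theta}) = E_{\boldsymbol{\theta}}[\mathbf{e}_{\boldsymbol{\theta}}(\mathbf{y})\mathbf{e}_{\boldsymbol{\theta}}^H(\mathbf{y})]. \tag{6.10}$$

Here we have exploited the fact that the cross-correlation between $\mathbf{e}_{\boldsymbol{\theta}}(\mathbf{y})$ and $\mathbf{b}(\boldsymbol{\theta})$ is zero. If the frequentist bias of the estimator is zero, then the mean-squared error matrix is the error covariance matrix: $\mathbf{M}(\boldsymbol{\theta}) = \mathbf{Q}(\boldsymbol{\theta})$. The error $\boldsymbol{\varepsilon}_{\boldsymbol{\theta}}(\mathbf{y})$, centered error score $\mathbf{e}_{\boldsymbol{\theta}}(\mathbf{y})$, bias $\mathbf{b}(\boldsymbol{\theta})$, and error covariance $\mathbf{Q}(\boldsymbol{\theta})$ are illustrated in Fig. 6.1.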
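As a numerical illustration of (6.5)–(6.10), the following Python sketch forms sample versions of the bias, centered error, error covariance, and mean-squared error matrix for a fixed parameter. The measurement model y = θ + n, the shrinkage factor 0.8, the parameter values, and all variable names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
p, N = 2, 50_000                                  # parameter dimension, number of trials
theta = np.array([1.0 + 2.0j, -0.5 + 0.3j])       # fixed parameter (hypothetical values)

# Proper complex Gaussian noise with unit variance per component.
noise = (rng.standard_normal((N, p)) + 1j * rng.standard_normal((N, p))) / np.sqrt(2)
y = theta + noise                                 # assumed measurement model y = theta + n
theta_hat = 0.8 * y                               # shrinkage estimator, biased on purpose

eps = theta_hat - theta                           # error, eq. (6.5); one row per trial
b = eps.mean(axis=0)                              # sample bias, eq. (6.6)
e = eps - b                                       # centered error, eq. (6.7)

M = eps.T @ eps.conj() / N                        # sample mean-squared error matrix
Q = e.T @ e.conj() / N                            # sample error covariance, eq. (6.10)
M_from_parts = Q + np.outer(b, b.conj())          # right-hand side of eq. (6.9)
```

With sample moments the decomposition in (6.9) holds to machine precision, because the sample cross-term between the centered error and the sample bias is exactly zero. For this estimator the bias is -0.2 θ, which the sample bias approximates.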


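From the Bayesian point of view, we wish to determine the joint pdf $p(\mathbf{y}, \boldsymbol{\theta})$ from the set of pdfs $\{p(\mathbf{y}, \boldsymbol{\theta}),\ \boldsymbol{\theta} \in \Theta \subseteq \mathbb{C}^p\}$ that best describes the joint pdf from which the random measurement–parameter pair $(\mathbf{y}, \boldsymbol{\theta})$ was drawn. The estimator error is

$$\boldsymbol{\varepsilon}(\mathbf{y}, \boldsymbol{\theta}) = \hat{\boldsymbol{\theta}}(\mathbf{y}) - \boldsymbol{\theta} \tag{6.11}$$

and the Bayes bias of the estimator error is

$$\mathbf{b} = E[\boldsymbol{\varepsilon}(\mathbf{y}, \boldsymbol{\theta})] = E\{E_{\boldsymbol{\theta}}[\boldsymbol{\varepsilon}_{\boldsymbol{\theta}}(\mathbf{y})]\} = E[\mathbf{b}(\boldsymbol{\theta})] = E[\hat{\boldsymbol{\theta}}(\mathbf{y})] - E[\boldsymbol{\theta}]. \tag{6.12}$$

The Bayes bias is the average of the frequentist bias because the Bayesian and the frequentist define their errors in the same way: $\boldsymbol{\varepsilon}(\mathbf{y}, \boldsymbol{\theta}) = \boldsymbol{\varepsilon}_{\boldsymbol{\theta}}(\mathbf{y})$. The error then has the representation $\boldsymbol{\varepsilon}(\mathbf{y}, \boldsymbol{\theta}) = \mathbf{e}(\mathbf{y}, \boldsymbol{\theta}) + \mathbf{b}$, where the centered error score is defined to be

$$\mathbf{e}(\mathbf{y}, \boldsymbol{\theta}) = \boldsymbol{\varepsilon}(\mathbf{y}, \boldsymbol{\theta}) - \mathbf{b} = \hat{\boldsymbol{\theta}}(\mathbf{y}) - E[\hat{\boldsymbol{\theta}}(\mathbf{y})] - (\boldsymbol{\theta} - E[\boldsymbol{\theta}]). \tag{6.13}$$

The measurement score is defined to be the $m$-dimensional vector of complex scores $\boldsymbol{\sigma}(\mathbf{y}, \boldsymbol{\theta}) = [\sigma_1(\mathbf{y}, \boldsymbol{\theta}), \sigma_2(\mathbf{y}, \boldsymbol{\theta}), \ldots, \sigma_m(\mathbf{y}, \boldsymbol{\theta})]^T$. The mean value of the measurement score is $E[\boldsymbol{\sigma}(\mathbf{y}, \boldsymbol{\theta})] = E\{E_{\boldsymbol{\theta}}[\boldsymbol{\sigma}(\mathbf{y}, \boldsymbol{\theta})]\}$ and the centered measurement score is

$$\mathbf{s}(\mathbf{y}, \boldsymbol{\theta}) = \boldsymbol{\sigma}(\mathbf{y}, \boldsymbol{\theta}) - E[\boldsymbol{\sigma}(\mathbf{y}, \boldsymbol{\theta})]. \tag{6.14}$$

The mean $E[\boldsymbol{\sigma}(\mathbf{y}, \boldsymbol{\theta})]$ is generally not $E\{E_{\boldsymbol{\theta}}[\boldsymbol{\sigma}_{\boldsymbol{\theta}}(\mathbf{y})]\}$ because the Bayesian will define a score $\boldsymbol{\sigma}(\mathbf{y}, \boldsymbol{\theta})$ that differs from the frequentist score $\boldsymbol{\sigma}_{\boldsymbol{\theta}}(\mathbf{y})$, in order to account for prior information about the parameter $\boldsymbol{\theta}$. Again we think of scores as judiciously chosen functions of the measurement that bring information about the true parameter.

The Bayesian defines the mean-squared error matrix for the estimator $\hat{\boldsymbol{\theta}}(\mathbf{y})$ to be $\mathbf{M} = E[\boldsymbol{\varepsilon}(\mathbf{y}, \boldsymbol{\theta})\boldsymbol{\varepsilon}^H(\mathbf{y}, \boldsymbol{\theta})] = E\{E_{\boldsymbol{\theta}}[\boldsymbol{\varepsilon}_{\boldsymbol{\theta}}(\mathbf{y})\boldsymbol{\varepsilon}_{\boldsymbol{\theta}}^H(\mathbf{y})]\} = E[\mathbf{M}(\boldsymbol{\theta})]$, which is the average over the prior distribution on $\boldsymbol{\theta}$ of the frequentist mean-squared error matrix $\mathbf{M}(\boldsymbol{\theta})$. This average may be written

$$\mathbf{M} = E[\boldsymbol{\varepsilon}(\mathbf{y}, \boldsymbol{\theta})\boldsymbol{\varepsilon}^H(\mathbf{y}, \boldsymbol{\theta})] = E[\mathbf{e}(\mathbf{y}, \boldsymbol{\theta})\mathbf{e}^H(\mathbf{y}, \boldsymbol{\theta})] + \mathbf{b}\mathbf{b}^H = \mathbf{Q} + \mathbf{b}\mathbf{b}^H, \tag{6.15}$$

where $\mathbf{Q}$ is the Bayes error covariance

$$\mathbf{Q} = E[\mathbf{e}(\mathbf{y}, \boldsymbol{\theta})\mathbf{e}^H(\mathbf{y}, \boldsymbol{\theta})]. \tag{6.16}$$

Note that the Bayes bias and error covariance matrix are independent of the parameter $\boldsymbol{\theta}$. If the Bayes bias of the estimator is zero, then the mean-squared error matrix is the error covariance matrix: $\mathbf{M} = \mathbf{Q}$.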

6.1.2 Connection between frequentist and Bayesian approaches

We have established that the Bayes bias $\mathbf{b} = E[\mathbf{b}(\boldsymbol{\theta})]$ is the average of the frequentist bias and the Bayes mean-squared error matrix $\mathbf{M} = E[\mathbf{M}(\boldsymbol{\theta})]$ is the average of the frequentist mean-squared error matrix. But what about the Bayes error covariance $\mathbf{Q}$? Is it the average over the distribution of the parameter $\boldsymbol{\theta}$ of the frequentist error covariance $\mathbf{Q}(\boldsymbol{\theta})$? The answer (surprisingly) is no, as the following argument shows.

The Bayesian's centered error score may be written in terms of the frequentist's centered error score as $\mathbf{e}(\mathbf{y}, \boldsymbol{\theta}) = \mathbf{e}_{\boldsymbol{\theta}}(\mathbf{y}) + (\mathbf{b}(\boldsymbol{\theta}) - \mathbf{b})$. In this representation, $\mathbf{e}_{\boldsymbol{\theta}}(\mathbf{y})$ and $\mathbf{b}(\boldsymbol{\theta}) - \mathbf{b}$ are conditionally and unconditionally uncorrelated, so as a consequence the Bayesian's error covariance may be written in terms of the frequentist's parameter-dependent error covariance as

$$\mathbf{Q} = E[\mathbf{Q}(\boldsymbol{\theta})] + \mathbf{R}_{bb}, \tag{6.17}$$

where $\mathbf{R}_{bb} = E[(\mathbf{b}(\boldsymbol{\theta}) - \mathbf{b})(\mathbf{b}(\boldsymbol{\theta}) - \mathbf{b})^H]$ is the covariance of the parameter-dependent bias $\mathbf{b}(\boldsymbol{\theta})$. If the frequentist bias of the estimator is constant at $\mathbf{b}(\boldsymbol{\theta}) = \mathbf{b}$, then $\mathbf{R}_{bb} = \mathbf{0}$ and the Bayes error covariance is the average over the distribution of the parameter $\boldsymbol{\theta}$ of the frequentist error covariance. In other words, we have the following result.

Result 6.1. The frequentist error covariance can be averaged for the Bayes error covariance to produce the identity

$$\mathbf{Q} = E[\mathbf{Q}(\boldsymbol{\theta})] \tag{6.18}$$

if and only if the frequentist bias of the estimator is constant.

This generally is a property of linear minimum-variance distortionless response (LMVDR) estimators, also called best linear unbiased estimators (BLUEs), for which $\mathbf{b}(\boldsymbol{\theta}) = \mathbf{0}$, but not of MMSE estimators. The following example is illuminating.

Example 6.1. Consider the linear signal-plus-noise model

$$\mathbf{y} = \mathbf{H}\boldsymbol{\theta} + \mathbf{n}, \tag{6.19}$$

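where $\mathbf{H} \in \mathbb{C}^{n \times p}$, $\boldsymbol{\theta} \in \mathbb{C}^p$ is zero-mean proper Gaussian with covariance matrix $\mathbf{R}_{\theta\theta}$, $\mathbf{n}$ is zero-mean proper Gaussian with covariance matrix $\mathbf{R}_{nn}$, and $\boldsymbol{\theta}$ and $\mathbf{n}$ are independent. Following Section 5.6, the LMVDR estimator (also called BLUE), which in this case is also the maximum-likelihood (ML) estimator, is

$$\hat{\boldsymbol{\theta}}_{\mathrm{LMVDR}}(\mathbf{y}) = (\mathbf{H}^H \mathbf{R}_{nn}^{-1} \mathbf{H})^{-1} \mathbf{H}^H \mathbf{R}_{nn}^{-1} \mathbf{y} = \boldsymbol{\theta} + (\mathbf{H}^H \mathbf{R}_{nn}^{-1} \mathbf{H})^{-1} \mathbf{H}^H \mathbf{R}_{nn}^{-1} \mathbf{n}, \tag{6.20}$$

which ignores the prior distribution on $\boldsymbol{\theta}$. Consequently the frequentist bias is $\mathbf{b}(\boldsymbol{\theta}) = \mathbf{0}$ and the Bayes bias is $\mathbf{b} = \mathbf{0}$. The centered frequentist error is $\mathbf{e}_{\boldsymbol{\theta}}(\mathbf{y}) = (\mathbf{H}^H \mathbf{R}_{nn}^{-1} \mathbf{H})^{-1} \mathbf{H}^H \mathbf{R}_{nn}^{-1} \mathbf{n}$. The frequentist error covariance is then $\mathbf{Q}(\boldsymbol{\theta}) = (\mathbf{H}^H \mathbf{R}_{nn}^{-1} \mathbf{H})^{-1}$ and this is also the Bayes error covariance $\mathbf{Q}$:

$$\mathbf{Q}_{\mathrm{LMVDR}} = (\mathbf{H}^H \mathbf{R}_{nn}^{-1} \mathbf{H})^{-1} = E[\mathbf{Q}_{\mathrm{LMVDR}}(\boldsymbol{\theta})] = \mathbf{Q}_{\mathrm{LMVDR}}(\boldsymbol{\theta}). \tag{6.21}$$

The MMSE estimator accounts for the distribution of $\boldsymbol{\theta}$:

$$\begin{aligned} \hat{\boldsymbol{\theta}}_{\mathrm{MMSE}}(\mathbf{y}) &= (\mathbf{H}^H \mathbf{R}_{nn}^{-1} \mathbf{H} + \mathbf{R}_{\theta\theta}^{-1})^{-1} \mathbf{H}^H \mathbf{R}_{nn}^{-1} \mathbf{y} \\ &= (\mathbf{H}^H \mathbf{R}_{nn}^{-1} \mathbf{H} + \mathbf{R}_{\theta\theta}^{-1})^{-1} \mathbf{H}^H \mathbf{R}_{nn}^{-1} \mathbf{H}\boldsymbol{\theta} + (\mathbf{H}^H \mathbf{R}_{nn}^{-1} \mathbf{H} + \mathbf{R}_{\theta\theta}^{-1})^{-1} \mathbf{H}^H \mathbf{R}_{nn}^{-1} \mathbf{n}. \end{aligned} \tag{6.22}$$

The frequentist bias of this estimator is $\mathbf{b}(\boldsymbol{\theta}) = (\mathbf{H}^H \mathbf{R}_{nn}^{-1} \mathbf{H} + \mathbf{R}_{\theta\theta}^{-1})^{-1} \mathbf{H}^H \mathbf{R}_{nn}^{-1} \mathbf{H}\boldsymbol{\theta} - \boldsymbol{\theta}$ and the Bayes bias is $\mathbf{b} = \mathbf{0}$. The frequentist error covariance of this estimator is $\mathbf{Q}(\boldsymbol{\theta}) = (\mathbf{H}^H \mathbf{R}_{nn}^{-1} \mathbf{H} + \mathbf{R}_{\theta\theta}^{-1})^{-1} \mathbf{H}^H \mathbf{R}_{nn}^{-1} \mathbf{H} (\mathbf{H}^H \mathbf{R}_{nn}^{-1} \mathbf{H} + \mathbf{R}_{\theta\theta}^{-1})^{-1}$. The Bayes error covariance is the sum of the expectation of the frequentist error covariance $\mathbf{Q}(\boldsymbol{\theta})$ and the covariance of the parameter-dependent bias $\mathbf{R}_{bb} = E[\mathbf{b}(\boldsymbol{\theta})\mathbf{b}^H(\boldsymbol{\theta})]$. But, since the frequentist error covariance is independent of $\boldsymbol{\theta}$, we may write

$$\mathbf{Q}_{\mathrm{MMSE}} = \mathbf{Q}_{\mathrm{MMSE}}(\boldsymbol{\theta}) + \mathbf{R}_{bb} = (\mathbf{H}^H \mathbf{R}_{nn}^{-1} \mathbf{H} + \mathbf{R}_{\theta\theta}^{-1})^{-1} \leq (\mathbf{H}^H \mathbf{R}_{nn}^{-1} \mathbf{H})^{-1} = \mathbf{Q}_{\mathrm{LMVDR}}(\boldsymbol{\theta}) = \mathbf{Q}_{\mathrm{LMVDR}}. \tag{6.23}$$

There are lessons here: the MMSE estimator has smaller Bayes error covariance than the LMVDR estimator; it also has smaller frequentist error covariance. Its only vice is that it is not frequentist unbiased. So, from the point of view of Bayes error covariance, frequentist unbiasedness is no virtue. In fact, it is uncommon for the frequentist bias of an estimator of a random parameter to be zero, but common for the Bayes bias to be zero.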
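The relations in Example 6.1 can be checked in closed form. The Python sketch below builds the LMVDR and MMSE error covariances for randomly generated matrices and confirms that the frequentist covariance of the MMSE estimator plus the covariance of its bias reproduces the Bayes covariance in (6.23). The dimensions, the random test matrices, and the helper names `random_covariance` and `min_eig` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 6, 3

def random_covariance(dim):
    """Random Hermitian positive-definite matrix (illustrative only)."""
    A = rng.standard_normal((dim, dim)) + 1j * rng.standard_normal((dim, dim))
    return A @ A.conj().T + dim * np.eye(dim)

def min_eig(X):
    """Smallest eigenvalue of the Hermitian part of X."""
    return np.linalg.eigvalsh((X + X.conj().T) / 2).min()

H = rng.standard_normal((n, p)) + 1j * rng.standard_normal((n, p))
R_nn = random_covariance(n)                     # noise covariance
R_tt = random_covariance(p)                     # prior covariance R_theta_theta

G = H.conj().T @ np.linalg.solve(R_nn, H)       # H^H R_nn^{-1} H
Q_lmvdr = np.linalg.inv(G)                      # eq. (6.21)

A = np.linalg.inv(G + np.linalg.inv(R_tt))      # (H^H R_nn^{-1} H + R_tt^{-1})^{-1}
W = A @ H.conj().T @ np.linalg.inv(R_nn)        # MMSE estimator matrix, eq. (6.22)
Q_freq = W @ R_nn @ W.conj().T                  # frequentist covariance Q(theta)
B = W @ H - np.eye(p)                           # frequentist bias is b(theta) = B theta
R_bb = B @ R_tt @ B.conj().T                    # covariance of the bias (Bayes bias is 0)
Q_mmse = Q_freq + R_bb                          # eq. (6.17) applied to the MMSE estimator
```

Here `Q_mmse` agrees with `A`, and both `Q_lmvdr - Q_mmse` and `Q_lmvdr - Q_freq` are positive semidefinite, in line with the lessons drawn from the example.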

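6.1.3 Extension to augmented errors

These results extend to the characterization of augmented error scores $\underline{\mathbf{e}}^T = [\mathbf{e}^T, \mathbf{e}^H]$. The augmented Bayes bias remains the average of the augmented frequentist bias, which is to say that $\underline{\mathbf{b}} = E[\underline{\mathbf{b}}(\boldsymbol{\theta})]$, the augmented Bayes mean-squared error matrix remains the average of the augmented frequentist mean-squared error matrix, $\underline{\mathbf{M}} = E[\underline{\mathbf{M}}(\boldsymbol{\theta})]$, and the Bayes and frequentist augmented error covariances remain related as

$$\underline{\mathbf{Q}} = E[\underline{\mathbf{Q}}(\boldsymbol{\theta})] + \underline{\mathbf{R}}_{bb}, \tag{6.24}$$

where $\underline{\mathbf{Q}}(\boldsymbol{\theta}) = E_{\boldsymbol{\theta}}[\underline{\mathbf{e}}_{\boldsymbol{\theta}}(\mathbf{y})\underline{\mathbf{e}}_{\boldsymbol{\theta}}^H(\mathbf{y})]$ is the augmented frequentist error covariance matrix and $\underline{\mathbf{R}}_{bb} = E[(\underline{\mathbf{b}}(\boldsymbol{\theta}) - \underline{\mathbf{b}})(\underline{\mathbf{b}}(\boldsymbol{\theta}) - \underline{\mathbf{b}})^H]$ is the augmented covariance matrix for the frequentist bias.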

6.2 Quadratic frequentist bounds

Let us begin our discussion of quadratic frequentist covariance bounds by considering the Hermitian version of the story, where complementary covariances are ignored in the bounding procedure. Then we will extend these bounds by accounting for complementary covariances and find that this tightens the bounds.

Here is the idea: from the measurement $\mathbf{y}$ is computed the estimator $\hat{\boldsymbol{\theta}}(\mathbf{y})$, with frequentist error $\boldsymbol{\varepsilon}_{\boldsymbol{\theta}}(\mathbf{y}) = \hat{\boldsymbol{\theta}}(\mathbf{y}) - \boldsymbol{\theta}$ and centered error score $\mathbf{e}_{\boldsymbol{\theta}}(\mathbf{y}) = \boldsymbol{\varepsilon}_{\boldsymbol{\theta}}(\mathbf{y}) - \mathbf{b}(\boldsymbol{\theta})$. We would like to approximate this centered error score by using a linear function of the centered measurement score $\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{y})$.

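6.2.1 The virtual two-channel experiment and the quadratic frequentist bound

Consider the virtual two-channel estimation problem of Fig. 6.2, which is reminiscent of the virtual two-channel problems developed in Section 5.3. In Fig. 6.2, the centered error score $\mathbf{e}_{\boldsymbol{\theta}}(\mathbf{y})$ is to be estimated from the centered measurement score $\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{y})$. The first step in the derivation of quadratic performance bounds is to define the composite vector of the error score and the measurement score, $[\mathbf{e}_{\boldsymbol{\theta}}^T(\mathbf{y}), \mathbf{s}_{\boldsymbol{\theta}}^T(\mathbf{y})]^T$. The covariance matrix for this composite vector is

$$\mathbf{R}_{es}(\boldsymbol{\theta}) = E_{\boldsymbol{\theta}}\left\{ \begin{bmatrix} \mathbf{e}_{\boldsymbol{\theta}}(\mathbf{y}) \\ \mathbf{s}_{\boldsymbol{\theta}}(\mathbf{y}) \end{bmatrix} \begin{bmatrix} \mathbf{e}_{\boldsymbol{\theta}}^H(\mathbf{y}) & \mathbf{s}_{\boldsymbol{\theta}}^H(\mathbf{y}) \end{bmatrix} \right\} = \begin{bmatrix} \mathbf{Q}(\boldsymbol{\theta}) & \mathbf{T}^H(\boldsymbol{\theta}) \\ \mathbf{T}(\boldsymbol{\theta}) & \mathbf{J}(\boldsymbol{\theta}) \end{bmatrix}. \tag{6.25}$$

Figure 6.2 Two-channel representation of centered error $\mathbf{e}_{\boldsymbol{\theta}}(\mathbf{y})$ and centered score $\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{y})$.

In this equation the covariance matrices $\mathbf{Q}(\boldsymbol{\theta})$, $\mathbf{T}(\boldsymbol{\theta})$, and $\mathbf{J}(\boldsymbol{\theta})$ are defined as

$$\begin{aligned} \mathbf{Q}(\boldsymbol{\theta}) &= E_{\boldsymbol{\theta}}[\mathbf{e}_{\boldsymbol{\theta}}(\mathbf{y})\mathbf{e}_{\boldsymbol{\theta}}^H(\mathbf{y})], \\ \mathbf{T}(\boldsymbol{\theta}) &= E_{\boldsymbol{\theta}}[\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{y})\mathbf{e}_{\boldsymbol{\theta}}^H(\mathbf{y})], \\ \mathbf{J}(\boldsymbol{\theta}) &= E_{\boldsymbol{\theta}}[\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{y})\mathbf{s}_{\boldsymbol{\theta}}^H(\mathbf{y})]. \end{aligned} \tag{6.26}$$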

The covariance matrix $\mathbf{Q}(\boldsymbol{\theta})$ is the frequentist covariance for the centered error score, $\mathbf{T}(\boldsymbol{\theta})$ is the cross-correlation between the centered error score and the centered measurement score, which is often called the sensitivity or expansion-coefficient matrix, and $\mathbf{J}(\boldsymbol{\theta})$ is the covariance matrix for the centered measurement score, which is always called the information matrix. We shall assume it to be nonsingular, since there is no good reason to carry around linearly dependent scores. In the case of a Fisher score, to be discussed in Section 6.3, the information is Fisher information.

The LMMSE estimator of the centered error score from the centered measurement score is $\hat{\mathbf{e}}_{\boldsymbol{\theta}}(\mathbf{y}) = \mathbf{T}^H(\boldsymbol{\theta})\mathbf{J}^{-1}(\boldsymbol{\theta})\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{y})$. The nonnegative error covariance matrix for this estimator is the Schur complement $\mathbf{Q}(\boldsymbol{\theta}) - \mathbf{T}^H(\boldsymbol{\theta})\mathbf{J}^{-1}(\boldsymbol{\theta})\mathbf{T}(\boldsymbol{\theta})$. From this follows the frequentist bound:

Result 6.2. The error covariance matrix is bounded by the general quadratic frequentist bound

$$\mathbf{Q}(\boldsymbol{\theta}) \geq \mathbf{T}^H(\boldsymbol{\theta})\mathbf{J}^{-1}(\boldsymbol{\theta})\mathbf{T}(\boldsymbol{\theta}), \tag{6.27}$$

and the mean-squared error is bounded as

$$\mathbf{M}(\boldsymbol{\theta}) \geq \mathbf{T}^H(\boldsymbol{\theta})\mathbf{J}^{-1}(\boldsymbol{\theta})\mathbf{T}(\boldsymbol{\theta}) + \mathbf{b}(\boldsymbol{\theta})\mathbf{b}^H(\boldsymbol{\theta}). \tag{6.28}$$

This argument could have been simplified by saying that the Hermitian covariance matrix $\mathbf{R}_{es}(\boldsymbol{\theta})$ for the composite error score and measurement score is positive semidefinite, so the Schur complement must be positive semidefinite. But this argument would not have revealed the essence of quadratic performance bounding, which is to linearly estimate centered error scores from centered measurement scores.
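The two-channel argument can be mimicked with sample moments. In the Python sketch below, the linear model, the least-squares estimator, and the score s = H^H R_nn^{-1} n are assumptions chosen for illustration (that score is the Fisher score of a Gaussian model with unknown mean, which Section 6.3 introduces). The sketch estimates the centered error from the centered score and compares the residual covariance with the Schur complement.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, N = 5, 2, 20_000

def crandn(*shape):
    """Proper complex Gaussian samples with unit variance."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

H = crandn(n, p)
C = crandn(n, n) + 2 * np.eye(n)            # noise colouring, so that R_nn = C C^H
R_nn = C @ C.conj().T
noise = crandn(N, n) @ C.T                  # rows are noise samples n_k^T

# Error of ordinary least squares: unbiased, but not matched to coloured noise.
E = noise @ np.linalg.pinv(H).T             # rows are e_k^T = (pinv(H) n_k)^T
# Assumed measurement score s_k = H^H R_nn^{-1} n_k.
S = noise @ (H.conj().T @ np.linalg.inv(R_nn)).T

E = E - E.mean(axis=0)                      # centre with sample means
S = S - S.mean(axis=0)

Q = E.T @ E.conj() / N                      # sample E[e e^H]
T = S.T @ E.conj() / N                      # sample E[s e^H]
J = S.T @ S.conj() / N                      # sample E[s s^H]

W = T.conj().T @ np.linalg.inv(J)           # LMMSE filter T^H J^{-1}
bound = W @ T                               # T^H J^{-1} T, eq. (6.27)
residual = E - S @ W.T                      # rows are e_k - W s_k
Q_res = residual.T @ residual.conj() / N    # covariance of the estimation residual
gap_eigs = np.linalg.eigvalsh(Q - bound)    # eigenvalues of the Schur complement
```

The residual covariance `Q_res` coincides with `Q - bound`, and its eigenvalues are nonnegative, which is the content of Result 6.2 for this sample-based composite covariance.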


As noted by Weinstein and Weiss (1988), the Bhattacharyya, Ziv–Zakai, Barankin, and Bobrovsky–Zakai bounds all have this quadratic structure. Only the choice of score distinguishes one bound from another. Moreover, each quadratic bound is invariant with respect to nonsingular linear transformation of the score function $\mathbf{s}$. That is, the transformed score $\mathbf{L}\mathbf{s}$ transforms the sensitivity matrix $\mathbf{T}$ to $\mathbf{L}\mathbf{T}$ and the information matrix $\mathbf{J}$ to $\mathbf{L}\mathbf{J}\mathbf{L}^H$, leaving the bound invariant.

Definition 6.2. An estimator is said to be efficient with respect to a given score if the bound in (6.27) is achieved with equality: $\mathbf{Q}(\boldsymbol{\theta}) = \mathbf{T}^H(\boldsymbol{\theta})\mathbf{J}^{-1}(\boldsymbol{\theta})\mathbf{T}(\boldsymbol{\theta})$. That is, the estimated error score equals the actual error score:

$$\mathbf{e}_{\boldsymbol{\theta}}(\mathbf{y}) = \mathbf{T}^H(\boldsymbol{\theta})\mathbf{J}^{-1}(\boldsymbol{\theta})\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{y}). \tag{6.29}$$

This is also invariant with respect to linear transformation $\mathbf{L}$. That is, an estimator is efficient if, element-by-element, its centered error score lies in the subspace spanned by the elements of the centered measurement score. The corresponding representation for the estimator itself is $\hat{\boldsymbol{\theta}}(\mathbf{y}) - \boldsymbol{\theta} = \mathbf{T}^H(\boldsymbol{\theta})\mathbf{J}^{-1}(\boldsymbol{\theta})\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{y})$.

One score is better than another if its quadratic bound is larger than the other's. So the search goes on for a better bound, or in fact a better score.
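The invariance of the quadratic bound under a nonsingular transformation of the score is a one-line identity, and the following Python sketch checks it numerically. The matrices are random stand-ins for a sensitivity matrix, an information matrix, and a transformation; none of them is taken from the text.

```python
import numpy as np

rng = np.random.default_rng(3)
m, p = 4, 2

def crandn(*shape):
    """Complex Gaussian test matrices (illustrative only)."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

T = crandn(m, p)                         # stand-in sensitivity matrix
A = crandn(m, m)
J = A @ A.conj().T + np.eye(m)           # stand-in information matrix, Hermitian positive definite
L = crandn(m, m) + 3 * np.eye(m)         # transformation of the score, assumed nonsingular

bound = T.conj().T @ np.linalg.solve(J, T)              # T^H J^{-1} T

T_L = L @ T                                             # s -> L s maps T to L T
J_L = L @ J @ L.conj().T                                # and J to L J L^H
bound_L = T_L.conj().T @ np.linalg.solve(J_L, T_L)      # bound for the transformed score
```

The two bounds agree because $(\mathbf{L}\mathbf{T})^H(\mathbf{L}\mathbf{J}\mathbf{L}^H)^{-1}(\mathbf{L}\mathbf{T}) = \mathbf{T}^H\mathbf{J}^{-1}\mathbf{T}$ whenever $\mathbf{L}$ is invertible.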

Good and bad scores

We shall not give the details here, but McWhorter and Scharf (1993c) showed that a given measurement score can be improved, in the sense that it improves the quadratic frequentist bound, by “Rao–Blackwellizing” it. To “Rao–Blackwellize” is to compute the expected value of the centered measurement score, conditioned on the frequentist sufficient statistic for the parameter. In other words, to be admissible a score must be a function of the sufficient statistic for the parameter to be estimated. Moreover, additional score functions never hurt, which is to say that the quadratic performance bound is never loosened by adding further scores.

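6.2.2 Projection-operator and integral-operator representations of quadratic frequentist bounds

Let us consider the variance bound for one component of the parameter vector $\boldsymbol{\theta}$. Call it $\theta$. The corresponding component of the estimator $\hat{\boldsymbol{\theta}}(\mathbf{y})$ is called $\hat{\theta}(\mathbf{y})$, and the corresponding component of the centered error score is $e_\theta(\mathbf{y}) = \hat{\theta}(\mathbf{y}) - E_{\boldsymbol{\theta}}[\hat{\theta}(\mathbf{y})]$. Without loss of generality, assume that scores have been transformed into a coordinate system where the information matrix is $\mathbf{J}(\theta) = \mathbf{I}$. The composite covariance matrix for $[e_\theta(\mathbf{y}), \mathbf{s}_\theta^T(\mathbf{y})]^T$ is

$$\mathbf{R}_{es}(\theta) = E_{\boldsymbol{\theta}}\left\{ \begin{bmatrix} e_\theta(\mathbf{y}) \\ \mathbf{s}_\theta(\mathbf{y}) \end{bmatrix} \begin{bmatrix} e_\theta^*(\mathbf{y}) & \mathbf{s}_\theta^H(\mathbf{y}) \end{bmatrix} \right\} = \begin{bmatrix} Q(\theta) & \mathbf{t}^H(\theta) \\ \mathbf{t}(\theta) & \mathbf{I} \end{bmatrix}, \tag{6.30}$$

where $\mathbf{t}(\theta) = E_{\boldsymbol{\theta}}[\mathbf{s}_\theta(\mathbf{y}) e_\theta^*(\mathbf{y})]$ is the $m$-dimensional column vector of expansion coefficients in this transformed coordinate system. Then the invariant quadratic frequentist bound may be written

$$Q(\theta) \geq \mathbf{t}^H(\theta)\mathbf{t}(\theta) = \sum_{i=1}^{m} |t_i(\theta)|^2, \tag{6.31}$$

where $t_i(\theta)$ is the $i$th coordinate of the $m$-dimensional expansion-coefficient vector $\mathbf{t}(\theta)$. The score random variables are uncorrelated with variance unity, so we may write the previous equation as

$$Q(\theta) \geq E_{\boldsymbol{\theta}}\left[ \sum_{i=1}^{m} t_i s_i^* \sum_{j=1}^{m} t_j^* s_j \right] = E_{\boldsymbol{\theta}}\,|P_{\mathbf{s}}\, e_\theta(\mathbf{y})|^2, \tag{6.32}$$

Figure 6.3 Projection of the error $e_\theta(\mathbf{y})$ onto the Hilbert space of centered measurement scores $s_1, s_2, \ldots, s_m$.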

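where the operator $P_{\mathbf{s}}\, e_\theta(\mathbf{y})$ acts like this:

$$P_{\mathbf{s}}\, e_\theta(\mathbf{y}) = \sum_{i=1}^{m} t_i^* s_i = \sum_{i=1}^{m} E_{\boldsymbol{\theta}}[s_i^*(\mathbf{y}) e_\theta(\mathbf{y})]\, s_i(\mathbf{y}). \tag{6.33}$$

This projection operator projects the centered error score $e_\theta(\mathbf{y})$ onto each element of the centered measurement score (or onto the Hilbert space of centered measurement scores), as illustrated in Fig. 6.3. Thus, for the bound (6.31) to be tight, the actual error score $e_\theta(\mathbf{y})$ must lie close to the subspace spanned by the $m$ centered measurement scores. This interpretation makes it easy to see that the elements of the vector $\mathbf{t}(\theta)$ are just the coefficients of the true error $e_\theta(\mathbf{y})$ in the basis $\mathbf{s}_\theta(\mathbf{y})$. That is, $t_i(\theta) = E_{\boldsymbol{\theta}}[s_i(\mathbf{y}) e_\theta^*(\mathbf{y})]$.

There is also a kernel representation of this bound due to McWhorter and Scharf (1993c). Write the projection as

$$P_{\mathbf{s}}\, e_\theta(\mathbf{y}) = \sum_{i=1}^{m} t_i^* s_i = \int e_\theta(\mathbf{y}') \sum_{i=1}^{m} s_i(\mathbf{y}) s_i^*(\mathbf{y}')\, p_\theta(\mathbf{y}')\,d\mathbf{y}' = \int e_\theta(\mathbf{y}') K_\theta(\mathbf{y}, \mathbf{y}')\, p_\theta(\mathbf{y}')\,d\mathbf{y}'. \tag{6.34}$$

So, for a tight bound, $P_{\mathbf{s}}\, e_\theta(\mathbf{y})$ must be nearly the actual error $e_\theta(\mathbf{y})$, which is to say that the actual error $e_\theta(\mathbf{y})$ must nearly be an element of the reproducing kernel Hilbert space defined by the kernel

$$K_\theta(\mathbf{y}, \mathbf{y}') = \sum_{i=1}^{m} s_i(\mathbf{y}) s_i^*(\mathbf{y}'). \tag{6.35}$$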

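6.2.3 Extension of the quadratic frequentist bound to improper errors and scores

The essential step in the extension of quadratic frequentist bounds to improper error and measurement scores is to define the composite vector of augmented errors and scores, $[\underline{\mathbf{e}}^T, \underline{\mathbf{s}}^T]^T$. The covariance matrix for this composite vector is

$$E_{\boldsymbol{\theta}}\left\{ \begin{bmatrix} \underline{\mathbf{e}}_{\boldsymbol{\theta}}(\mathbf{y}) \\ \underline{\mathbf{s}}_{\boldsymbol{\theta}}(\mathbf{y}) \end{bmatrix} \begin{bmatrix} \underline{\mathbf{e}}_{\boldsymbol{\theta}}^H(\mathbf{y}) & \underline{\mathbf{s}}_{\boldsymbol{\theta}}^H(\mathbf{y}) \end{bmatrix} \right\} = \begin{bmatrix} \underline{\mathbf{Q}}(\boldsymbol{\theta}) & \underline{\mathbf{T}}^H(\boldsymbol{\theta}) \\ \underline{\mathbf{T}}(\boldsymbol{\theta}) & \underline{\mathbf{J}}(\boldsymbol{\theta}) \end{bmatrix}. \tag{6.36}$$

In this equation the constituent matrices are defined as

$$\begin{aligned} \underline{\mathbf{Q}}(\boldsymbol{\theta}) &= E_{\boldsymbol{\theta}}[\underline{\mathbf{e}}_{\boldsymbol{\theta}}(\mathbf{y})\underline{\mathbf{e}}_{\boldsymbol{\theta}}^H(\mathbf{y})] = \begin{bmatrix} \mathbf{Q}(\boldsymbol{\theta}) & \tilde{\mathbf{Q}}(\boldsymbol{\theta}) \\ \tilde{\mathbf{Q}}^*(\boldsymbol{\theta}) & \mathbf{Q}^*(\boldsymbol{\theta}) \end{bmatrix}, \\ \underline{\mathbf{T}}(\boldsymbol{\theta}) &= E_{\boldsymbol{\theta}}[\underline{\mathbf{s}}_{\boldsymbol{\theta}}(\mathbf{y})\underline{\mathbf{e}}_{\boldsymbol{\theta}}^H(\mathbf{y})] = \begin{bmatrix} \mathbf{T}(\boldsymbol{\theta}) & \tilde{\mathbf{T}}(\boldsymbol{\theta}) \\ \tilde{\mathbf{T}}^*(\boldsymbol{\theta}) & \mathbf{T}^*(\boldsymbol{\theta}) \end{bmatrix}, \\ \underline{\mathbf{J}}(\boldsymbol{\theta}) &= E_{\boldsymbol{\theta}}[\underline{\mathbf{s}}_{\boldsymbol{\theta}}(\mathbf{y})\underline{\mathbf{s}}_{\boldsymbol{\theta}}^H(\mathbf{y})] = \begin{bmatrix} \mathbf{J}(\boldsymbol{\theta}) & \tilde{\mathbf{J}}(\boldsymbol{\theta}) \\ \tilde{\mathbf{J}}^*(\boldsymbol{\theta}) & \mathbf{J}^*(\boldsymbol{\theta}) \end{bmatrix}. \end{aligned} \tag{6.37}$$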

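The covariance matrix $\tilde{\mathbf{T}}(\boldsymbol{\theta}) = E_{\boldsymbol{\theta}}[\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{y})\mathbf{e}_{\boldsymbol{\theta}}^T(\mathbf{y})]$ is the complementary sensitivity or expansion-coefficient matrix, and $\tilde{\mathbf{J}}(\boldsymbol{\theta}) = E_{\boldsymbol{\theta}}[\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{y})\mathbf{s}_{\boldsymbol{\theta}}^T(\mathbf{y})]$ is the complementary covariance matrix for the centered measurement score, which is called the complementary information matrix. Furthermore, $\underline{\mathbf{Q}}(\boldsymbol{\theta})$ is the augmented error covariance matrix, $\underline{\mathbf{T}}(\boldsymbol{\theta})$ is the augmented sensitivity or expansion-coefficient matrix, and $\underline{\mathbf{J}}(\boldsymbol{\theta})$ is the augmented information matrix. Obviously, we are interested in bounding $\mathbf{Q}(\boldsymbol{\theta})$ rather than $\underline{\mathbf{Q}}(\boldsymbol{\theta})$, but it is convenient to derive the bounds on $\underline{\mathbf{Q}}(\boldsymbol{\theta})$ first, from which one can immediately obtain the bound on $\mathbf{Q}(\boldsymbol{\theta})$.

The covariance matrix for the composite vector of augmented error score and augmented measurement score is positive semidefinite, so, assuming the augmented information matrix to be nonsingular, we obtain the bound $\underline{\mathbf{Q}}(\boldsymbol{\theta}) \geq \underline{\mathbf{T}}^H(\boldsymbol{\theta})\underline{\mathbf{J}}^{-1}(\boldsymbol{\theta})\underline{\mathbf{T}}(\boldsymbol{\theta})$. From this, we can read out the northwest block of $\underline{\mathbf{Q}}(\boldsymbol{\theta})$ to obtain the following result.

Result 6.3. The frequentist error covariance is bounded as

$$\mathbf{Q}(\boldsymbol{\theta}) \geq \begin{bmatrix} \mathbf{T}^H(\boldsymbol{\theta}) & \tilde{\mathbf{T}}^T(\boldsymbol{\theta}) \end{bmatrix} \underline{\mathbf{J}}^{-1}(\boldsymbol{\theta}) \begin{bmatrix} \mathbf{T}(\boldsymbol{\theta}) \\ \tilde{\mathbf{T}}^*(\boldsymbol{\theta}) \end{bmatrix}. \tag{6.38}$$

The quadratic form on the right-hand side is the general widely linear-quadratic frequentist bound on the error covariance matrix for improper error and measurement scores. Of course the underlying idea is that $\underline{\mathbf{Q}}(\boldsymbol{\theta}) - \underline{\mathbf{T}}^H(\boldsymbol{\theta})\underline{\mathbf{J}}^{-1}(\boldsymbol{\theta})\underline{\mathbf{T}}(\boldsymbol{\theta})$ is the augmented error covariance matrix of the widely linear estimator of the error score from the measurement score:

$$\hat{\mathbf{e}}_{\boldsymbol{\theta}}(\mathbf{y}) = \begin{bmatrix} \mathbf{T}^H(\boldsymbol{\theta}) & \tilde{\mathbf{T}}^T(\boldsymbol{\theta}) \end{bmatrix} \underline{\mathbf{J}}^{-1}(\boldsymbol{\theta})\, \underline{\mathbf{s}}_{\boldsymbol{\theta}}(\mathbf{y}). \tag{6.39}$$

The bound in Result 6.3 is tighter than the bound in Result 6.2, but for proper measurement scores (i.e., $\tilde{\mathbf{J}}(\boldsymbol{\theta}) = \mathbf{0}$) and cross-proper measurement and error scores (i.e., $\tilde{\mathbf{T}}(\boldsymbol{\theta}) = \mathbf{0}$) the bounds are identical.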


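A widely linear estimator is efficient if equality holds in (6.38). This is the case if and only if the estimator of the centered error score equals the actual error, meaning that the estimator has the representation

$$\hat{\boldsymbol{\theta}}(\mathbf{y}) - \boldsymbol{\theta} = \begin{bmatrix} \mathbf{T}^H(\boldsymbol{\theta}) & \tilde{\mathbf{T}}^T(\boldsymbol{\theta}) \end{bmatrix} \underline{\mathbf{J}}^{-1}(\boldsymbol{\theta})\, \underline{\mathbf{s}}_{\boldsymbol{\theta}}(\mathbf{y}). \tag{6.40}$$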
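To see how the widely linear bound (6.38) compares with the linear bound (6.27), the following Python sketch builds sample Hermitian and complementary moments from jointly improper stand-in errors and scores. The mixing model z = A w + B w*, the dimensions, and the names `T_c` and `J_c` for the complementary matrices are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
p, m, N = 2, 3, 10_000

def crandn(*shape):
    """Proper complex Gaussian samples with unit variance."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

def min_eig(X):
    """Smallest eigenvalue of the Hermitian part of X."""
    return np.linalg.eigvalsh((X + X.conj().T) / 2).min()

w = crandn(N, p + m)
A = crandn(p + m, p + m)
B = 0.5 * crandn(p + m, p + m)
z = w @ A.T + w.conj() @ B.T            # improper samples: z_k = A w_k + B w_k^*
z = z - z.mean(axis=0)
E, S = z[:, :p], z[:, p:]               # stand-ins for centered errors and scores

Q = E.T @ E.conj() / N                  # E[e e^H]
T = S.T @ E.conj() / N                  # E[s e^H]
J = S.T @ S.conj() / N                  # E[s s^H]
T_c = S.T @ E / N                       # complementary sensitivity E[s e^T]
J_c = S.T @ S / N                       # complementary information E[s s^T]

J_aug = np.block([[J, J_c], [J_c.conj(), J.conj()]])          # augmented information, eq. (6.37)
T_col = np.vstack([T, T_c.conj()])                            # first block column of augmented T
bound_lin = T.conj().T @ np.linalg.solve(J, T)                # eq. (6.27)
bound_wl = T_col.conj().T @ np.linalg.solve(J_aug, T_col)     # eq. (6.38)
```

In this sketch `bound_wl - bound_lin` and `Q - bound_wl` are both positive semidefinite: regressing the error on the score and its conjugate can only explain more of the error covariance than regressing on the score alone.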

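6.3 Fisher score and the Cramér–Rao bound

Our results for quadratic frequentist bounds so far are general. To make them applicable we need to consider a concrete score for which $\mathbf{T}(\boldsymbol{\theta})$ and $\mathbf{J}(\boldsymbol{\theta})$ can be computed. For this we choose the Fisher score and compute its associated Cramér–Rao bound.

Definition 6.3. The Fisher score is defined as

$$\boldsymbol{\sigma}_{\boldsymbol{\theta}}(\mathbf{y}) = \left[ \frac{\partial}{\partial \boldsymbol{\theta}} \log p_{\boldsymbol{\theta}}(\mathbf{y}) \right]^H = \left[ \frac{\partial}{\partial \theta_1^*} \log p_{\boldsymbol{\theta}}(\mathbf{y}), \ldots, \frac{\partial}{\partial \theta_p^*} \log p_{\boldsymbol{\theta}}(\mathbf{y}) \right]^T, \tag{6.41}$$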

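where the partial derivatives are Wirtinger derivatives as discussed in Appendix 2. With the Wirtinger derivative $\partial/\partial\theta_i^* = \tfrac{1}{2}[\partial/\partial(\mathrm{Re}\,\theta_i) + \mathrm{j}\,\partial/\partial(\mathrm{Im}\,\theta_i)]$, the $i$th component of the Fisher score is

$$\sigma_i(\mathbf{y}) = \frac{1}{2}\left[ \frac{\partial}{\partial(\mathrm{Re}\,\theta_i)} \log p_{\boldsymbol{\theta}}(\mathbf{y}) + \mathrm{j}\,\frac{\partial}{\partial(\mathrm{Im}\,\theta_i)} \log p_{\boldsymbol{\theta}}(\mathbf{y}) \right]. \tag{6.42}$$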

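Thus, the Fisher score is a $p$-dimensional complex-valued column vector. The notation $(\partial/\partial\boldsymbol{\theta})\, p_{\boldsymbol{\theta}}(\mathbf{y})$ means $(\partial/\partial\boldsymbol{\phi})\, p_{\boldsymbol{\phi}}(\mathbf{y})$, evaluated at $\boldsymbol{\phi} = \boldsymbol{\theta}$. We shall demonstrate shortly that the expected value of the Fisher score is $E_{\boldsymbol{\theta}}[\boldsymbol{\sigma}_{\boldsymbol{\theta}}(\mathbf{y})] = \mathbf{0}$, so that Definition 6.3 also defines the centered measurement score $\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{y})$. The Fisher score has a number of properties that make it a compelling statistic for inferring the value of a parameter and for bounding the error covariance of any estimator for that parameter. We list these properties and annotate them here.

1. We may write the partial derivative as

$$\frac{\partial}{\partial \theta_i^*} \log p_{\boldsymbol{\theta}}(\mathbf{y}) = \frac{1}{p_{\boldsymbol{\theta}}(\mathbf{y})} \frac{\partial}{\partial \theta_i^*} p_{\boldsymbol{\theta}}(\mathbf{y}), \tag{6.43}$$

which is a normalized measure of the sensitivity of the pdf $p_{\boldsymbol{\theta}}(\mathbf{y})$ to variations in the parameter $\theta_i$. Large sensitivity is valued, and this will be measured by the variance of the score.

2. The Fisher score is a zero-mean random variable. This statistical property is consistent with maximum-likelihood procedures that search for maxima of the log-likelihood function $\log p_{\boldsymbol{\theta}}(\mathbf{y})$ by searching for the zeros of $(\partial/\partial\boldsymbol{\theta}) \log p_{\boldsymbol{\theta}}(\mathbf{y})$. Functions with steep slopes are valued, so we value a zero-mean score with large variance.

3. The cross-correlation between the centered Fisher score and the centered error score is the expansion-coefficient matrix

$$\mathbf{T}(\boldsymbol{\theta}) = \mathbf{I} + \left[ \frac{\partial}{\partial \boldsymbol{\theta}} \mathbf{b}(\boldsymbol{\theta}) \right]^H, \tag{6.44}$$

where as before $\mathbf{b}(\boldsymbol{\theta})$ is the frequentist bias of the estimator. The $(i, j)$th element of the matrix $(\partial/\partial\boldsymbol{\theta})\mathbf{b}(\boldsymbol{\theta})$ is $(\partial/\partial\theta_j) b_i(\boldsymbol{\theta})$. When the estimator is unbiased, then $\mathbf{T} = \mathbf{I}$, which is to say that the measurement score is perfectly aligned with the error score. The effect of bias is to misalign them.

4. The Fisher information matrix is also the expected Hessian of the score function (up to a minus sign):

$$\mathbf{J}_F(\boldsymbol{\theta}) = E_{\boldsymbol{\theta}}\left\{ \left[ \frac{\partial}{\partial \boldsymbol{\theta}} \log p_{\boldsymbol{\theta}}(\mathbf{y}) \right]^H \frac{\partial}{\partial \boldsymbol{\theta}} \log p_{\boldsymbol{\theta}}(\mathbf{y}) \right\} = -E_{\boldsymbol{\theta}}\left\{ \left( \frac{\partial}{\partial \boldsymbol{\theta}} \right)^H \frac{\partial}{\partial \boldsymbol{\theta}} \log p_{\boldsymbol{\theta}}(\mathbf{y}) \right\}. \tag{6.45}$$

The first of these four properties is interpretive only. The next three are interpretive and analytical. Let's prove them.

2. The expected value of the Fisher score is

$$E_{\boldsymbol{\theta}}[\boldsymbol{\sigma}_{\boldsymbol{\theta}}(\mathbf{y})] = \int \left[ \frac{\partial}{\partial \boldsymbol{\theta}} \log p_{\boldsymbol{\theta}}(\mathbf{y}) \right]^H p_{\boldsymbol{\theta}}(\mathbf{y})\,d\mathbf{y} = \int \left[ \frac{\partial}{\partial \boldsymbol{\theta}} p_{\boldsymbol{\theta}}(\mathbf{y}) \right]^H d\mathbf{y} = \mathbf{0}. \tag{6.46}$$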

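3. The frequentist bias $\mathbf{b}(\boldsymbol{\theta})$ of the parameter estimator $\hat{\boldsymbol{\theta}}(\mathbf{y})$ may be differentiated with respect to $\boldsymbol{\theta}$ as follows:

$$\begin{aligned} \frac{\partial}{\partial \boldsymbol{\theta}} \mathbf{b}(\boldsymbol{\theta}) &= \frac{\partial}{\partial \boldsymbol{\theta}} \int [\hat{\boldsymbol{\theta}}(\mathbf{y}) - \boldsymbol{\theta}]\, p_{\boldsymbol{\theta}}(\mathbf{y})\,d\mathbf{y} \\ &= \int [\hat{\boldsymbol{\theta}}(\mathbf{y}) - \boldsymbol{\theta}] \left[ \frac{\partial}{\partial \boldsymbol{\theta}} \log p_{\boldsymbol{\theta}}(\mathbf{y}) \right] p_{\boldsymbol{\theta}}(\mathbf{y})\,d\mathbf{y} - \mathbf{I} \\ &= E_{\boldsymbol{\theta}}[\boldsymbol{\varepsilon}_{\boldsymbol{\theta}}(\mathbf{y})\boldsymbol{\sigma}_{\boldsymbol{\theta}}^H(\mathbf{y})] - \mathbf{I} = \mathbf{T}^H(\boldsymbol{\theta}) - \mathbf{I}. \end{aligned} \tag{6.47}$$

In the last line, we have used $E_{\boldsymbol{\theta}}[\boldsymbol{\sigma}_{\boldsymbol{\theta}}(\mathbf{y})] = \mathbf{0}$.

4. Consider the $p \times p$ Hessian matrix

$$\begin{aligned} \left( \frac{\partial}{\partial \boldsymbol{\theta}} \right)^H \frac{\partial}{\partial \boldsymbol{\theta}} \log p_{\boldsymbol{\theta}}(\mathbf{y}) &= \left( \frac{\partial}{\partial \boldsymbol{\theta}} \right)^H \left[ \frac{1}{p_{\boldsymbol{\theta}}(\mathbf{y})} \frac{\partial}{\partial \boldsymbol{\theta}} p_{\boldsymbol{\theta}}(\mathbf{y}) \right] \\ &= \frac{1}{p_{\boldsymbol{\theta}}(\mathbf{y})} \left( \frac{\partial}{\partial \boldsymbol{\theta}} \right)^H \frac{\partial}{\partial \boldsymbol{\theta}} p_{\boldsymbol{\theta}}(\mathbf{y}) - \left[ \frac{\partial}{\partial \boldsymbol{\theta}} \log p_{\boldsymbol{\theta}}(\mathbf{y}) \right]^H \frac{\partial}{\partial \boldsymbol{\theta}} \log p_{\boldsymbol{\theta}}(\mathbf{y}). \end{aligned} \tag{6.48}$$

The expectation of this matrix is

$$E_{\boldsymbol{\theta}}\left\{ \left( \frac{\partial}{\partial \boldsymbol{\theta}} \right)^H \frac{\partial}{\partial \boldsymbol{\theta}} \log p_{\boldsymbol{\theta}}(\mathbf{y}) \right\} = \mathbf{0} - E_{\boldsymbol{\theta}}[\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{y})\mathbf{s}_{\boldsymbol{\theta}}^H(\mathbf{y})] = -\mathbf{J}_F(\boldsymbol{\theta}). \tag{6.49}$$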

Result 6.4. The quadratic frequentist bound for the Fisher score is the Cramér–Rao bound:

$$\mathbf{Q}(\boldsymbol{\theta}) \geq \left[ \mathbf{I} + \frac{\partial}{\partial \boldsymbol{\theta}} \mathbf{b}(\boldsymbol{\theta}) \right] \mathbf{J}_F^{-1}(\boldsymbol{\theta}) \left[ \mathbf{I} + \frac{\partial}{\partial \boldsymbol{\theta}} \mathbf{b}(\boldsymbol{\theta}) \right]^H. \tag{6.50}$$

When the frequentist bias is zero, then the Cramér–Rao bound is

$$\mathbf{Q}(\boldsymbol{\theta}) \geq \mathbf{J}_F^{-1}(\boldsymbol{\theta}), \tag{6.51}$$

where $\mathbf{J}_F(\boldsymbol{\theta})$ is the Fisher information.

This bound is sometimes also called the deterministic Cramér–Rao bound, in order to differentiate it from the corresponding Bayesian (i.e., stochastic) bound in Result 6.7. It is the most celebrated of all quadratic frequentist bounds.

If repeated measurements carry information about one fixed and deterministic parameter $\boldsymbol{\theta}$ through the product pdf $\prod_{i=1}^{M} p_{\boldsymbol{\theta}}(\mathbf{y}_i)$, then the sensitivity matrix remains fixed and the Fisher information matrix scales with $M$. Consequently the Cramér–Rao bound decreases as $M^{-1}$.

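6.3.1 Nuisance parameters

There are many ways to show the effect of nuisance parameters on error bounds, and some of these are given by Scharf (1991, pp. 231–233). But perhaps the easiest, if not most general, way to establish the effect is this. Begin with the Fisher information matrix $\mathbf{J}(\boldsymbol{\theta})$ and the corresponding Cramér–Rao bound for frequentist unbiased estimators, $\mathbf{Q}(\boldsymbol{\theta}) \geq \mathbf{J}^{-1}(\boldsymbol{\theta})$. (In order to simplify the notation we do not subscript $\mathbf{J}$ as $\mathbf{J}_F$.) The $(i, i)$th element of $\mathbf{J}(\boldsymbol{\theta})$ is $J_{ii}(\boldsymbol{\theta})$ and the $(i, i)$th element of $\mathbf{J}^{-1}(\boldsymbol{\theta})$ is denoted by $(J^{-1})_{ii}(\boldsymbol{\theta})$. From the definition of a score we see that $J_{ii}(\boldsymbol{\theta})$ is the Fisher information for the $i$th element of $\boldsymbol{\theta}$, when only the $i$th parameter in the parameter vector $\boldsymbol{\theta}$ is unknown. The Cauchy–Schwarz inequality says that $(\mathbf{y}^H \mathbf{J}\mathbf{y})(\mathbf{x}^H \mathbf{J}\mathbf{x}) \geq |\mathbf{y}^H \mathbf{J}\mathbf{x}|^2$. Choose $\mathbf{x} = \mathbf{u}_i$ and $\mathbf{y} = \mathbf{J}^{-1}\mathbf{u}_i$, with $\mathbf{u}_i$ the $i$th Euclidean basis vector. Then $\mathbf{y}^H \mathbf{J}\mathbf{y} = (J^{-1})_{ii}(\boldsymbol{\theta})$, $\mathbf{x}^H \mathbf{J}\mathbf{x} = J_{ii}(\boldsymbol{\theta})$, and $\mathbf{y}^H \mathbf{J}\mathbf{x} = 1$, so $(J^{-1})_{ii}(\boldsymbol{\theta}) J_{ii}(\boldsymbol{\theta}) \geq 1$, or $(J^{-1})_{ii}(\boldsymbol{\theta}) \geq 1/J_{ii}(\boldsymbol{\theta})$. These results actually generalize to show that any $r$-by-$r$ principal submatrix of the $p$-by-$p$ inverse $\mathbf{J}^{-1}(\boldsymbol{\theta})$ is more positive definite than the inverse of the corresponding $r$-by-$r$ submatrix of the Fisher matrix $\mathbf{J}(\boldsymbol{\theta})$. So nuisance parameters increase the Cramér–Rao bound.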
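The effect of nuisance parameters can be checked numerically on any Hermitian positive-definite matrix. In the Python sketch below, a random matrix stands in for the Fisher information (an assumption made for illustration); the sketch compares the diagonal of the inverse with the reciprocal of the diagonal, and a principal block of the inverse with the inverse of the corresponding block.

```python
import numpy as np

rng = np.random.default_rng(5)
p, r = 4, 2

A = rng.standard_normal((p, p)) + 1j * rng.standard_normal((p, p))
J = A @ A.conj().T + np.eye(p)                  # stand-in Fisher matrix, Hermitian positive definite
J_inv = np.linalg.inv(J)

crb_with_nuisance = np.diag(J_inv).real         # (J^{-1})_ii: all parameters unknown
crb_known_others = 1.0 / np.diag(J).real        # 1 / J_ii: only the i-th parameter unknown

idx = np.arange(r)                              # first r parameters are the ones of interest
sub_of_inv = J_inv[np.ix_(idx, idx)]            # r-by-r principal block of J^{-1}
inv_of_sub = np.linalg.inv(J[np.ix_(idx, idx)]) # inverse of the r-by-r block of J
gap = sub_of_inv - inv_of_sub
gap_min_eig = np.linalg.eigvalsh((gap + gap.conj().T) / 2).min()
```

Every entry of `crb_with_nuisance` is at least as large as the matching entry of `crb_known_others`, and `gap` is positive semidefinite, so the bound never decreases when additional parameters are unknown.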

6.3.2 The Cramér–Rao bound in the proper multivariate Gaussian model

In Result 2.5, the pdf for a proper complex Gaussian random variable $\mathbf{y}\colon \Omega \to \mathbb{C}^n$ was shown to be

$$p_{\boldsymbol{\theta}}(\mathbf{y}) = \frac{1}{\pi^n \det \mathbf{R}_{yy}} \exp\{-(\mathbf{y} - \boldsymbol{\mu}_y)^H \mathbf{R}_{yy}^{-1} (\mathbf{y} - \boldsymbol{\mu}_y)\}, \tag{6.52}$$

where $\boldsymbol{\mu}_y$ is the mean and $\mathbf{R}_{yy}$ the Hermitian covariance matrix of $\mathbf{y}$. For our purposes we assume that the mean value and the Hermitian covariance matrix are both functions of the unknown parameter vector $\boldsymbol{\theta}$, even though we have not made this dependence explicit in the notation. We shall assume that we have drawn $M$ independent copies of the random vector $\mathbf{y}$ from the pdf $p_{\boldsymbol{\theta}}(\mathbf{y})$, so that the logarithm of the joint pdf of $\mathbf{Y} = [\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_M]$ is

$$\log\left( \prod_{i=1}^{M} p_{\boldsymbol{\theta}}(\mathbf{y}_i) \right) = -Mn \log \pi - M \log \det \mathbf{R}_{yy} - M\, \mathrm{tr}(\mathbf{R}_{yy}^{-1} \mathbf{S}_{yy}), \tag{6.53}$$

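where $\mathbf{S}_{yy} = M^{-1} \sum_{i=1}^{M} (\mathbf{y}_i - \boldsymbol{\mu}_y)(\mathbf{y}_i - \boldsymbol{\mu}_y)^H$ is the sample Hermitian covariance matrix. Using the results of Appendix 2 for Wirtinger derivatives, in particular the results for differentiating logarithms and traces in Section A2.2, we may express the $j$th element of the centered measurement score $\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{Y})$ as

$$\begin{aligned} s_j(\mathbf{Y}) &= -M \frac{\partial}{\partial \theta_j^*} \log \det \mathbf{R}_{yy} - M \frac{\partial}{\partial \theta_j^*} \mathrm{tr}(\mathbf{R}_{yy}^{-1} \mathbf{S}_{yy}) \\ &= -M\, \mathrm{tr}\left( \mathbf{R}_{yy}^{-1} \frac{\partial \mathbf{R}_{yy}}{\partial \theta_j^*} \right) + M\, \mathrm{tr}\left( \mathbf{R}_{yy}^{-1} \frac{\partial \mathbf{R}_{yy}}{\partial \theta_j^*} \mathbf{R}_{yy}^{-1} \mathbf{S}_{yy} \right) - M\, \mathrm{tr}\left( \mathbf{R}_{yy}^{-1} \frac{\partial \mathbf{S}_{yy}}{\partial \theta_j^*} \right). \end{aligned} \tag{6.54}$$

It is a simple matter to show that $\partial \mathbf{S}_{yy}/\partial \theta_j^*$ has mean-value zero. So to compute the Hessian term $-E_{\boldsymbol{\theta}}[(\partial/\partial\theta_i) s_j(\mathbf{Y})]$ we can ignore any terms that involve a first partial derivative of $\mathbf{S}_{yy}$. The net result, after a few lines of algebra, is

$$\begin{aligned} J_{F,ij} &= -E_{\boldsymbol{\theta}}\left[ \frac{\partial}{\partial \theta_i} s_j(\mathbf{Y}) \right] \\ &= M\, \mathrm{tr}\left( \mathbf{R}_{yy}^{-1} \frac{\partial \mathbf{R}_{yy}}{\partial \theta_i} \mathbf{R}_{yy}^{-1} \frac{\partial \mathbf{R}_{yy}}{\partial \theta_j^*} \right) + M\, \mathrm{tr}\left( \mathbf{R}_{yy}^{-1} E_{\boldsymbol{\theta}}\left[ \frac{\partial}{\partial \theta_i} \frac{\partial}{\partial \theta_j^*} \mathbf{S}_{yy} \right] \right) \\ &= M\, \mathrm{tr}\left( \mathbf{R}_{yy}^{-1} \frac{\partial \mathbf{R}_{yy}}{\partial \theta_i} \mathbf{R}_{yy}^{-1} \frac{\partial \mathbf{R}_{yy}}{\partial \theta_j^*} \right) + M \left( \frac{\partial \boldsymbol{\mu}_y^H}{\partial \theta_i} \right) \mathbf{R}_{yy}^{-1} \left( \frac{\partial \boldsymbol{\mu}_y}{\partial \theta_j^*} \right) + M \left( \frac{\partial \boldsymbol{\mu}_y^H}{\partial \theta_j^*} \right) \mathbf{R}_{yy}^{-1} \left( \frac{\partial \boldsymbol{\mu}_y}{\partial \theta_i} \right) = J_{F,ji}^*. \end{aligned} \tag{6.55}$$

This is the general formula for the $(i, j)$th element of the Fisher information matrix in the proper multivariate Gaussian experiment that brings information about $\boldsymbol{\theta}$ in its mean and covariance. The real version of this result dates at least to Slepian (1954) and the complex version to Bangs (1971).

There are a few special cases. If the covariance matrix $\mathbf{R}_{yy}$ is independent of $\boldsymbol{\theta}$ then the first term vanishes. If the mean $\boldsymbol{\mu}_y$ is independent of $\boldsymbol{\theta}^*$ then the second term vanishes, and if it is independent of $\boldsymbol{\theta}$, the third term vanishes.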
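The three-term formula (6.55) translates directly into code. The Python sketch below is a minimal transcription under stated assumptions: the function name `fisher_element` and its argument names are hypothetical, and the test model is a mean that is linear in θ (μ = Hθ) with a covariance that does not depend on θ. For that model only the third term survives and the $(i, j)$th element reduces to $M \mathbf{h}_j^H \mathbf{R}_{yy}^{-1} \mathbf{h}_i$.

```python
import numpy as np

def fisher_element(M, R_inv, dR_i, dR_jc, dmu_ic, dmu_jc, dmu_i, dmu_j):
    """(i, j) element of the Fisher matrix following eq. (6.55).

    dR_i   : dR/dtheta_i          dR_jc  : dR/dtheta_j^*
    dmu_ic : dmu/dtheta_i^*       dmu_jc : dmu/dtheta_j^*
    dmu_i  : dmu/dtheta_i         dmu_j  : dmu/dtheta_j
    """
    cov_term = np.trace(R_inv @ dR_i @ R_inv @ dR_jc)
    # (d mu^H / d theta_i) equals (d mu / d theta_i^*)^H
    second = dmu_ic.conj() @ R_inv @ dmu_jc
    # (d mu^H / d theta_j^*) equals (d mu / d theta_j)^H
    third = dmu_j.conj() @ R_inv @ dmu_i
    return M * (cov_term + second + third)

rng = np.random.default_rng(7)
n, p, M = 5, 3, 8
H = rng.standard_normal((n, p)) + 1j * rng.standard_normal((n, p))
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
R = A @ A.conj().T + np.eye(n)            # covariance, independent of theta
R_inv = np.linalg.inv(R)

zero_mat = np.zeros((n, n))               # dR/dtheta = 0 in this model
zero_vec = np.zeros(n)                    # dmu/dtheta^* = 0 since mu = H theta

J_F = np.empty((p, p), dtype=complex)
for i in range(p):
    for j in range(p):
        J_F[i, j] = fisher_element(M, R_inv, zero_mat, zero_mat,
                                   zero_vec, zero_vec, H[:, i], H[:, j])

G = H.conj().T @ R_inv @ H                # H^H R^{-1} H
```

In this special case the assembled matrix is Hermitian and its diagonal equals that of $M \mathbf{H}^H \mathbf{R}_{yy}^{-1} \mathbf{H}$, which is the quantity that determines the variance bounds on the individual parameters.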

6.3.3 The separable linear statistical model and the geometry of the Cramér–Rao bound

To gain geometrical insight into the Fisher matrix and the Cramér–Rao bound, we now apply the results of the previous subsection to parameter estimation in the linear model $\mathbf{y} = \mathbf{H}\boldsymbol{\theta} + \mathbf{n}$, where $\mathbf{n}$ is a zero-mean proper Gaussian with covariance $\mathbf{R}_{nn}$ independent of $\boldsymbol{\theta}$. We shall find that it is the sensitivity of noise-free measurements to small variations in parameters that determines the performance of an estimator.

Figure 6.4 Geometry of the Cramér–Rao bound in the separable statistical model with multivariate Gaussian errors; the variance is large when mode $\mathbf{g}_i$ lies near the subspace $\langle\mathbf{G}_i\rangle$ of other modes.

The partial derivatives are $\partial \boldsymbol{\mu}_y/\partial \theta_i = \mathbf{h}_i$, where $\mathbf{h}_i$ is the $i$th column of $\mathbf{H}$ and the Fisher matrix is $\mathbf{J}_F = M \mathbf{H}^H \mathbf{R}_{nn}^{-1} \mathbf{H}$. The Cramér–Rao bound for frequentist-unbiased estimators is thus

$$\mathbf{Q}(\boldsymbol{\theta}) \geq \frac{1}{M} (\mathbf{H}^H \mathbf{R}_{nn}^{-1} \mathbf{H})^{-1}. \tag{6.56}$$

With the definition $\mathbf{G} = \mathbf{R}_{nn}^{-1/2} \mathbf{H}$, the $(i, i)$th element may be written as

$$Q_{ii}(\boldsymbol{\theta}) \geq \frac{1}{M} \frac{1}{\mathbf{g}_i^H \mathbf{g}_i} \frac{\mathbf{g}_i^H \mathbf{g}_i}{\mathbf{g}_i^H (\mathbf{I} - \mathbf{P}_{G_i}) \mathbf{g}_i} = \frac{1}{M} \frac{1}{\mathbf{g}_i^H \mathbf{g}_i} \frac{1}{\sin^2 \rho_i}, \tag{6.57}$$

where $\mathbf{P}_{G_i}$ is the projection onto the subspace $\langle\mathbf{G}_i\rangle$ spanned by all but the $i$th mode in $\mathbf{G}$, $\rho_i$ is the angle that the mode vector $\mathbf{g}_i$ makes with this subspace, and $\mathbf{g}_i^H (\mathbf{I} - \mathbf{P}_{G_i}) \mathbf{g}_i / (\mathbf{g}_i^H \mathbf{g}_i)$ is the sine-squared of this angle. Thus, as illustrated in Fig. 6.4, the lower bound on the variance in estimating $\theta_i$ is a large multiple of $(M \mathbf{g}_i^H \mathbf{g}_i)^{-1}$ when the $i$th mode can be linearly approximated with the other modes in $\langle\mathbf{G}_i\rangle$. For closely spaced modes, only a large number of independent samples or a large value of $\mathbf{g}_i^H \mathbf{g}_i$, producing a large output signal-to-noise ratio $M \mathbf{g}_i^H \mathbf{g}_i$, can produce a small lower bound. With low output signal-to-noise ratio and closely spaced modes, any estimator of $\theta_i$ will be poor, meaning that the resolution in amplitude of the $i$th mode will be poor. This result generalizes to mean-value vectors more general than $\mathbf{H}\boldsymbol{\theta}$, on replacing $\mathbf{h}_i$ with $\partial \boldsymbol{\mu}/\partial \theta_i$.

Example 6.2. Let the noise covariance matrix in the proper multivariate Gaussian model be $\mathbf{R}_{nn} = \sigma^2 \mathbf{I}$ and the matrix $\mathbf{H} = [A\boldsymbol{\psi}_1, A\boldsymbol{\psi}_2]$ with $\boldsymbol{\psi}_k = [1, \mathrm{e}^{\mathrm{j}\phi_k}, \ldots, \mathrm{e}^{\mathrm{j}(n-1)\phi_k}]^T$. We call $\boldsymbol{\psi}_k$ a complex exponential mode with mode angle $\phi_k$ and $A$ a complex mode amplitude. The Cramér–Rao bound is

$$Q_{ii}(\boldsymbol{\theta}) \geq \frac{1}{M} \frac{1}{n|A|^2/\sigma^2} \frac{1}{1 - l_n^2(\phi_1 - \phi_2)}, \tag{6.58}$$

where

$$0 \leq l_n^2(\phi) = \frac{1}{n^2} \frac{\sin^2(n\phi/2)}{\sin^2(\phi/2)} \leq 1 \tag{6.59}$$

is the Lanczos kernel. In this bound, $\mathrm{snr} = |A|^2/\sigma^2$ is the per-sample or input signal-to-noise ratio, $\mathrm{SNR} = n\,\mathrm{snr}$ is the output signal-to-noise ratio and $1 - l_n^2(\phi_1 - \phi_2)$ is the sine-squared of the angle between the subspaces spanned by $\boldsymbol{\psi}_1$ and $\boldsymbol{\psi}_2$. This bound is $(M\,\mathrm{SNR})^{-1}$ at $\phi_1 - \phi_2 = 2\pi/n$, and this angle difference is called the Rayleigh limit to resolution. It governs much of optics, radar, sonar, and geophysics, even though it is quite conservative in the sense that many independent samples or large input SNR can override the aperture effects of the Lanczos kernel.²

This example bounds the error covariance matrix for estimating the linear parameters in the separable linear model, not the nonlinear parameters that would determine the matrix $\mathbf{H}(\boldsymbol{\theta})$. Typically these parameters would be frequency, wavenumber, delay, and so on. The Cramér–Rao bound for these parameters then depends on terms like $\partial \mathbf{h}_i/\partial \theta_j^*$.³
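The geometry in (6.57) and the closed form in Example 6.2 can be compared numerically. The Python sketch below evaluates the bound for the first mode three ways: from the inverse in (6.56), from the sine-squared of the angle in (6.57), and from the kernel expression (6.58). The numerical values of n, M, A, σ², and the mode angles are hypothetical, as are the function names.

```python
import numpy as np

n, M = 16, 10
A, sigma2 = 0.7 * np.exp(1j * 0.3), 0.5        # hypothetical amplitude and noise variance
phi1 = 0.4                                     # hypothetical mode angle

def mode(phi):
    """Complex exponential mode psi = [1, e^{j phi}, ..., e^{j (n-1) phi}]^T."""
    return np.exp(1j * phi * np.arange(n))

def lanczos_sq(phi):
    """l_n^2(phi) as written in eq. (6.59)."""
    return (np.sin(n * phi / 2) / (n * np.sin(phi / 2))) ** 2

def crb_first_mode(phi2):
    H = A * np.column_stack([mode(phi1), mode(phi2)])
    G = H / np.sqrt(sigma2)                    # R_nn^{-1/2} H with R_nn = sigma2 * I
    direct = np.diag(np.linalg.inv(G.conj().T @ G)).real / M        # eq. (6.56)
    g1, g2 = G[:, 0], G[:, 1]
    P = np.outer(g2, g2.conj()) / (g2.conj() @ g2)                   # projector onto the other mode
    energy = (g1.conj() @ g1).real
    sin2 = (g1.conj() @ (g1 - P @ g1)).real / energy                 # sine-squared of the angle
    geometric = 1.0 / (M * energy * sin2)                            # eq. (6.57)
    snr_out = n * abs(A) ** 2 / sigma2                               # output SNR
    closed = 1.0 / (M * snr_out * (1.0 - lanczos_sq(phi1 - phi2)))   # eq. (6.58)
    return direct, geometric, closed

direct, geometric, closed = crb_first_mode(phi2=0.55)
_, _, at_rayleigh = crb_first_mode(phi2=phi1 + 2 * np.pi / n)
```

All three expressions agree for the closely spaced pair, and at the Rayleigh spacing the kernel vanishes so that the bound reduces to $(M\,\mathrm{SNR})^{-1}$.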

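6.3.4 Extension of Fisher score and the Cramér–Rao bound to improper errors and scores

In order to extend the Cramér–Rao bound to improper errors and scores, we need only compute the complementary expansion-coefficient matrix $\tilde{\mathbf{T}}(\boldsymbol{\theta}) = E[\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{y})\mathbf{e}_{\boldsymbol{\theta}}^T(\mathbf{y})]$ and complementary Fisher information matrix $\tilde{\mathbf{J}}(\boldsymbol{\theta}) = E[\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{y})\mathbf{s}_{\boldsymbol{\theta}}^T(\mathbf{y})]$. To this end, consider the following conjugate partial derivative of the bias:

$$\begin{aligned} \frac{\partial}{\partial \boldsymbol{\theta}^*} \mathbf{b}(\boldsymbol{\theta}) &= \frac{\partial}{\partial \boldsymbol{\theta}^*} \int [\hat{\boldsymbol{\theta}}(\mathbf{y}) - \boldsymbol{\theta}]\, p_{\boldsymbol{\theta}}(\mathbf{y})\,d\mathbf{y} \\ &= \int [\hat{\boldsymbol{\theta}}(\mathbf{y}) - \boldsymbol{\theta}] \left[ \frac{\partial}{\partial \boldsymbol{\theta}^*} \log p_{\boldsymbol{\theta}}(\mathbf{y}) \right] p_{\boldsymbol{\theta}}(\mathbf{y})\,d\mathbf{y} - \mathbf{0} = \tilde{\mathbf{T}}^T(\boldsymbol{\theta}). \end{aligned} \tag{6.60}$$

This produces the identity

$$\tilde{\mathbf{T}}(\boldsymbol{\theta}) = \left[ \frac{\partial}{\partial \boldsymbol{\theta}} \mathbf{b}^*(\boldsymbol{\theta}) \right]^H = \left[ \frac{\partial}{\partial \boldsymbol{\theta}^*} \mathbf{b}(\boldsymbol{\theta}) \right]^T. \tag{6.61}$$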

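The $(i, j)$th element of the matrix $(\partial/\partial\boldsymbol{\theta}^*)\mathbf{b}(\boldsymbol{\theta})$ is $(\partial/\partial\theta_j^*) b_i(\boldsymbol{\theta})$. Importantly, for frequentist unbiased estimators, the complementary expansion-coefficient matrix is zero, and the augmented expansion-coefficient matrix is $\underline{\mathbf{T}}(\boldsymbol{\theta}) = \mathbf{I}$.

For the complementary Fisher information, consider the $p \times p$ Hessian

$$\begin{aligned} \left( \frac{\partial}{\partial \boldsymbol{\theta}} \right)^T \frac{\partial}{\partial \boldsymbol{\theta}} \log p_{\boldsymbol{\theta}}(\mathbf{y}) &= \left( \frac{\partial}{\partial \boldsymbol{\theta}} \right)^T \left[ \frac{1}{p_{\boldsymbol{\theta}}(\mathbf{y})} \frac{\partial}{\partial \boldsymbol{\theta}} p_{\boldsymbol{\theta}}(\mathbf{y}) \right] \\ &= \frac{1}{p_{\boldsymbol{\theta}}(\mathbf{y})} \left( \frac{\partial}{\partial \boldsymbol{\theta}} \right)^T \frac{\partial}{\partial \boldsymbol{\theta}} p_{\boldsymbol{\theta}}(\mathbf{y}) - \left[ \frac{\partial}{\partial \boldsymbol{\theta}} \log p_{\boldsymbol{\theta}}(\mathbf{y}) \right]^T \frac{\partial}{\partial \boldsymbol{\theta}} \log p_{\boldsymbol{\theta}}(\mathbf{y}). \end{aligned} \tag{6.62}$$

Taking expectations, we find the identity

$$\tilde{\mathbf{J}}_F(\boldsymbol{\theta}) = E_{\boldsymbol{\theta}}[\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{y})\mathbf{s}_{\boldsymbol{\theta}}^T(\mathbf{y})] = -E_{\boldsymbol{\theta}}\left\{ \left( \frac{\partial}{\partial \boldsymbol{\theta}} \right)^T \frac{\partial}{\partial \boldsymbol{\theta}} \log p_{\boldsymbol{\theta}}(\mathbf{y}) \right\}, \tag{6.63}$$

which is the complementary dual to (6.45). The augmented Cramér–Rao bound for improper error and measurement scores is the bound of Result 6.3, applied to the Fisher score.

Result 6.5. For frequentist-unbiased estimators, the widely linear Cramér–Rao bound is

$$\mathbf{Q}(\boldsymbol{\theta}) \geq [\mathbf{J}_F(\boldsymbol{\theta}) - \tilde{\mathbf{J}}_F(\boldsymbol{\theta}) \mathbf{J}_F^{-*}(\boldsymbol{\theta}) \tilde{\mathbf{J}}_F^*(\boldsymbol{\theta})]^{-1}. \tag{6.64}$$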

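Since $[\mathbf{J}_F(\boldsymbol{\theta}) - \tilde{\mathbf{J}}_F(\boldsymbol{\theta}) \mathbf{J}_F^{-*}(\boldsymbol{\theta}) \tilde{\mathbf{J}}_F^*(\boldsymbol{\theta})]^{-1} \geq \mathbf{J}_F^{-1}(\boldsymbol{\theta})$, the widely linear bound is tighter than the standard Cramér–Rao bound in Result 6.4, which ignores complementary Fisher information. Result 6.5 was essentially given by van den Bos (1994b).

Note that Results 6.3 and 6.5 assume that the augmented matrix $\underline{\mathbf{J}}_F(\boldsymbol{\theta})$ is nonsingular. An important case in which $\underline{\mathbf{J}}_F(\boldsymbol{\theta})$ is always singular is when the parameter $\boldsymbol{\theta}$ is real. For real $\boldsymbol{\theta}$, we have $\mathbf{J}_F(\boldsymbol{\theta}) = \tilde{\mathbf{J}}_F(\boldsymbol{\theta})$, so there is no need to compute the complementary Fisher matrix. In this case, widely linear estimation of the error from the measurement score is equivalent to linear estimation, and the widely linear Cramér–Rao bound is the standard Cramér–Rao bound: $\mathbf{Q}(\boldsymbol{\theta}) \geq \mathbf{J}_F^{-1}(\boldsymbol{\theta})$. However, this bound still depends on whether or not $\mathbf{y}$ is proper because even the Hermitian Fisher matrix $\mathbf{J}_F(\boldsymbol{\theta})$ depends on the complementary correlation of $\mathbf{y}$, as the following subsection illustrates.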
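The expression in (6.64) is the northwest block of the inverse of the augmented information matrix, and the following Python sketch checks this together with the ordering against the standard bound. The improper samples used here are stand-ins for Fisher scores (an assumption made for illustration, as are the names `J_c`, `crb_wl`, and `crb_std`); only the Hermitian and complementary second-order moments matter for the identity.

```python
import numpy as np

rng = np.random.default_rng(6)
p, N = 3, 5_000

def crandn(*shape):
    """Proper complex Gaussian samples with unit variance."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

w = crandn(N, p)
A = crandn(p, p) + 2 * np.eye(p)
B = 0.6 * crandn(p, p)
s = w @ A.T + w.conj() @ B.T                  # improper stand-in scores s_k = A w_k + B w_k^*
s = s - s.mean(axis=0)

J = s.T @ s.conj() / N                        # Hermitian information E[s s^H]
J_c = s.T @ s / N                             # complementary information E[s s^T]
J_aug = np.block([[J, J_c], [J_c.conj(), J.conj()]])

# Widely linear bound, eq. (6.64): [J - J_c J^{-*} J_c^*]^{-1}
crb_wl = np.linalg.inv(J - J_c @ np.linalg.inv(J.conj()) @ J_c.conj())
crb_wl_block = np.linalg.inv(J_aug)[:p, :p]   # northwest block of the augmented inverse
crb_std = np.linalg.inv(J)                    # standard bound, eq. (6.51)

diff = crb_wl - crb_std
diff_min_eig = np.linalg.eigvalsh((diff + diff.conj().T) / 2).min()
```

The two routes to the widely linear bound coincide, and `crb_wl - crb_std` is positive semidefinite, which is the sense in which accounting for complementary information tightens the bound.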

6.3.5

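The Cramér–Rao bound in the improper multivariate Gaussian model

It is straightforward to extend the results of Section 6.3.2 to improper Gaussians. Let $\mathbf{y}\colon \Omega \to \mathbb{C}^n$ be Gaussian with augmented mean $\underline{\boldsymbol{\mu}}_y$ and augmented covariance matrix $\underline{\mathbf{R}}_{yy}$. Assuming as before that we have drawn $M$ independent copies of $\mathbf{y}$, the $(i, j)$th element of the Hermitian Fisher matrix is

$$J_{F,ij} = \frac{M}{2}\,\mathrm{tr}\left(\underline{\mathbf{R}}_{yy}^{-1}\frac{\partial}{\partial\theta_i}\underline{\mathbf{R}}_{yy}\,\underline{\mathbf{R}}_{yy}^{-1}\frac{\partial}{\partial\theta_j^{*}}\underline{\mathbf{R}}_{yy}\right) + M\left(\frac{\partial}{\partial\theta_i}\underline{\boldsymbol{\mu}}_y\right)^{H}\underline{\mathbf{R}}_{yy}^{-1}\frac{\partial}{\partial\theta_j^{*}}\underline{\boldsymbol{\mu}}_y = J_{F,ji}^{*}, \tag{6.65}$$

where the second summand is real. The $(i, j)$th element of the complementary Fisher matrix is

$$\tilde{J}_{F,ij} = \frac{M}{2}\,\mathrm{tr}\left(\underline{\mathbf{R}}_{yy}^{-1}\frac{\partial}{\partial\theta_i}\underline{\mathbf{R}}_{yy}\,\underline{\mathbf{R}}_{yy}^{-1}\frac{\partial}{\partial\theta_j}\underline{\mathbf{R}}_{yy}\right) + M\left(\frac{\partial}{\partial\theta_i}\underline{\boldsymbol{\mu}}_y\right)^{H}\underline{\mathbf{R}}_{yy}^{-1}\frac{\partial}{\partial\theta_j}\underline{\boldsymbol{\mu}}_y = \tilde{J}_{F,ji}, \tag{6.66}$$

and again the second summand is real. The case of real $\boldsymbol{\theta}$, $\mathbf{J}_F(\boldsymbol{\theta}) = \tilde{\mathbf{J}}_F(\boldsymbol{\theta})$, has been treated by Delmas and Abeida (2004). They have shown that, when $\boldsymbol{\theta}$ is real, the Cramér–Rao bound $\mathbf{Q}(\boldsymbol{\theta}) \geq \mathbf{J}_F^{-1}(\boldsymbol{\theta})$ is actually lowered if $\mathbf{y}$ is improper rather than proper. That is, for a given Hermitian covariance matrix $\mathbf{R}_{yy}$, the Hermitian Fisher matrix $\mathbf{J}_F(\boldsymbol{\theta})$ is more positive definite if $\tilde{\mathbf{R}}_{yy} \neq \mathbf{0}$ than if $\tilde{\mathbf{R}}_{yy} = \mathbf{0}$. At first, this may seem like a contradiction of the statement made after Result 6.5 that improper errors and scores tighten (i.e., increase) the Cramér–Rao bound. So what is going on?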


There are two different effects at work. On the one hand, the actual mean-squared error in widely linearly estimating $\boldsymbol{\theta}$ (real or complex) from $\mathbf{y}$ is reduced when the degree of impropriety of $\mathbf{y}$ is increased while the Hermitian covariance matrix $\mathbf{R}_{yy}$ is kept constant. This is reflected by the lowered Cramér–Rao bound. On the other hand, if $\boldsymbol{\theta}$ is complex, a widely linear estimate of the error score from the measurement score is a more accurate predictor of the actual mean-squared error for estimating $\boldsymbol{\theta}$ from $\mathbf{y}$ than is a linear estimate of the error score from the measurement score. This is reflected by a tightening Cramér–Rao bound. Which of these two effects is stronger depends on the problem. Finally, we note that (6.65) and (6.66) do not cover the maximally improper (also called strictly noncircular or rectilinear) case in which the augmented covariance matrix $\underline{\mathbf{R}}_{yy}$ is singular. This was addressed by Römer and Haardt (2007).

6.3.6

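Fisher score and Cramér–Rao bounds for functions of parameters

Sometimes it is a function of the parameter $\boldsymbol{\theta}$ that is to be estimated rather than the parameter itself. What can we say about the Fisher matrix in this case? Let $\mathbf{w} = \mathbf{g}(\boldsymbol{\theta})$ be a continuous bijection from $\mathbb{C}^p$ to $\mathbb{C}^p$, with inverse $\boldsymbol{\theta} = \mathbf{g}^{-1}(\mathbf{w})$. In the linear case, $\mathbf{w} = \mathbf{H}\boldsymbol{\theta}$ and $\boldsymbol{\theta} = \mathbf{H}^{-1}\mathbf{w}$, and in the widely linear case, $\underline{\mathbf{w}} = \underline{\mathbf{H}}\,\underline{\boldsymbol{\theta}}$ and $\underline{\boldsymbol{\theta}} = \underline{\mathbf{H}}^{-1}\underline{\mathbf{w}}$. Define the Jacobian matrix

$$\underline{\mathbf{F}}(\mathbf{w}) = \frac{\partial\underline{\boldsymbol{\theta}}}{\partial\underline{\mathbf{w}}} = \begin{bmatrix} \dfrac{\partial\boldsymbol{\theta}}{\partial\mathbf{w}} & \dfrac{\partial\boldsymbol{\theta}}{\partial\mathbf{w}^{*}} \\[2mm] \dfrac{\partial\boldsymbol{\theta}^{*}}{\partial\mathbf{w}} & \dfrac{\partial\boldsymbol{\theta}^{*}}{\partial\mathbf{w}^{*}} \end{bmatrix} = \begin{bmatrix} \mathbf{F}(\mathbf{w}) & \tilde{\mathbf{F}}(\mathbf{w}) \\ \tilde{\mathbf{F}}^{*}(\mathbf{w}) & \mathbf{F}^{*}(\mathbf{w}) \end{bmatrix}. \tag{6.67}$$

In the linear case, $\mathbf{F}(\mathbf{w}) = \mathbf{H}^{-1}$ and $\tilde{\mathbf{F}} = \mathbf{0}$. In the widely linear case, $\underline{\mathbf{F}}(\mathbf{w}) = \underline{\mathbf{H}}^{-1}$. The pdf parameterized by $\mathbf{w}$ is taken to be $p_{\boldsymbol{\theta}}(\mathbf{y})$ at $\boldsymbol{\theta} = \mathbf{g}^{-1}(\mathbf{w})$. Using the chain rule for non-holomorphic functions (A2.23), the Fisher score for $\mathbf{w}$ is then

$$\mathbf{s}_{\mathbf{w}}(\mathbf{y}) = \left[\frac{\partial}{\partial\mathbf{w}}\log p_{\mathbf{w}}(\mathbf{y})\right]^{H} = \left[\frac{\partial}{\partial\boldsymbol{\theta}}\log p_{\boldsymbol{\theta}}(\mathbf{y})\,\mathbf{F}(\mathbf{w}) + \frac{\partial}{\partial\boldsymbol{\theta}^{*}}\log p_{\boldsymbol{\theta}}(\mathbf{y})\,\tilde{\mathbf{F}}^{*}(\mathbf{w})\right]^{H} = \mathbf{F}^{H}(\mathbf{w})\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{y}) + \tilde{\mathbf{F}}^{T}(\mathbf{w})\mathbf{s}_{\boldsymbol{\theta}}^{*}(\mathbf{y}) \quad\text{at } \boldsymbol{\theta} = \mathbf{g}^{-1}(\mathbf{w}),$$

or, equivalently,

$$\underline{\mathbf{s}}_{\mathbf{w}}(\mathbf{y}) = \underline{\mathbf{F}}^{H}(\mathbf{w})\,\underline{\mathbf{s}}_{\boldsymbol{\theta}}(\mathbf{y}) \quad\text{at } \boldsymbol{\theta} = \mathbf{g}^{-1}(\mathbf{w}), \tag{6.68}$$

where $\underline{\mathbf{s}}_{\boldsymbol{\theta}}(\mathbf{y})$ is the augmented Fisher score for $\boldsymbol{\theta}$. If $\mathbf{g}$ is widely linear, then $\underline{\mathbf{s}}_{\mathbf{w}}(\mathbf{y}) = \underline{\mathbf{H}}^{-H}\underline{\mathbf{s}}_{\boldsymbol{\theta}}(\mathbf{y})$. The centered error in estimating $\mathbf{w}$ is $\mathbf{e}_{\mathbf{w}}(\mathbf{y}) = \mathbf{g}(\mathbf{e}_{\boldsymbol{\theta}}(\mathbf{y}))$. In the widely linear case, we have $\underline{\mathbf{e}}_{\mathbf{w}}(\mathbf{y}) = \underline{\mathbf{H}}\,\underline{\mathbf{e}}_{\boldsymbol{\theta}}(\mathbf{y})$. The expansion-coefficient matrix cannot be modeled generally, but for widely linear $\mathbf{g}$ the augmented version is

$$\underline{\mathbf{T}}(\mathbf{w}) = E[\underline{\mathbf{s}}_{\mathbf{w}}(\mathbf{y})\underline{\mathbf{e}}_{\mathbf{w}}^{H}(\mathbf{y})] = \underline{\mathbf{H}}^{-H}\underline{\mathbf{T}}(\boldsymbol{\theta})\underline{\mathbf{H}}^{H} \quad\text{at } \underline{\boldsymbol{\theta}} = \underline{\mathbf{H}}^{-1}\underline{\mathbf{w}}. \tag{6.69}$$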

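The augmented Fisher matrix for $\mathbf{w}$ is

$$\underline{\mathbf{J}}_F(\mathbf{w}) = \underline{\mathbf{F}}^{H}(\mathbf{w})\,\underline{\mathbf{J}}_F(\boldsymbol{\theta})\,\underline{\mathbf{F}}(\mathbf{w}) \quad\text{at } \boldsymbol{\theta} = \mathbf{g}^{-1}(\mathbf{w}). \tag{6.70}$$

If $\mathbf{g}$ is widely linear, this is $\underline{\mathbf{J}}_F(\mathbf{w}) = \underline{\mathbf{H}}^{-H}\underline{\mathbf{J}}_F(\boldsymbol{\theta})\underline{\mathbf{H}}^{-1}$ at $\underline{\boldsymbol{\theta}} = \underline{\mathbf{H}}^{-1}\underline{\mathbf{w}}$. So, in the general case we can characterize the Fisher matrix and in the widely linear case we can characterize the Cramér–Rao bound:

$$\underline{\mathbf{Q}}(\mathbf{w}) \geq \underline{\mathbf{T}}^{H}(\mathbf{w})\underline{\mathbf{J}}_F^{-1}(\mathbf{w})\underline{\mathbf{T}}(\mathbf{w}) = \underline{\mathbf{H}}\,\underline{\mathbf{T}}^{H}(\boldsymbol{\theta})\underline{\mathbf{J}}_F^{-1}(\boldsymbol{\theta})\underline{\mathbf{T}}(\boldsymbol{\theta})\underline{\mathbf{H}}^{H} \quad\text{at } \underline{\boldsymbol{\theta}} = \underline{\mathbf{H}}^{-1}\underline{\mathbf{w}}. \tag{6.71}$$

If the original estimator was frequentist unbiased, then $\underline{\mathbf{T}}(\boldsymbol{\theta}) = \mathbf{I}$ and the bound is

$$\underline{\mathbf{Q}}(\mathbf{w}) \geq \underline{\mathbf{H}}\,\underline{\mathbf{J}}_F^{-1}(\boldsymbol{\theta})\underline{\mathbf{H}}^{H} \quad\text{at } \underline{\boldsymbol{\theta}} = \underline{\mathbf{H}}^{-1}\underline{\mathbf{w}}. \tag{6.72}$$

To obtain a bound on $\mathbf{Q}(\mathbf{w})$, we read out the northwest block of $\underline{\mathbf{Q}}(\mathbf{w})$. The simplifications in the strictly linear case should be obvious.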

6.4

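Quadratic Bayesian bounds

As in the development of quadratic frequentist bounds, the idea of quadratic Bayesian bounds is this: from the measurement $\mathbf{y}$ is computed the estimator $\hat{\boldsymbol{\theta}}(\mathbf{y})$, with error $\boldsymbol{\varepsilon}(\mathbf{y}, \boldsymbol{\theta}) = \hat{\boldsymbol{\theta}}(\mathbf{y}) - \boldsymbol{\theta}$ and centered error score $\mathbf{e}(\mathbf{y}, \boldsymbol{\theta}) = \boldsymbol{\varepsilon}(\mathbf{y}, \boldsymbol{\theta}) - \mathbf{b}$. We would like to approximate this centered error score by using a linear or widely linear function of the centered measurement score $\mathbf{s}(\mathbf{y}, \boldsymbol{\theta}) = \boldsymbol{\sigma}(\mathbf{y}, \boldsymbol{\theta}) - E[\boldsymbol{\sigma}(\mathbf{y}, \boldsymbol{\theta})]$. Thus, the virtual two-channel estimation problem of Fig. 6.2 still holds. The centered error score $\mathbf{e}(\mathbf{y}, \boldsymbol{\theta})$ is to be estimated from the centered measurement score $\mathbf{s}(\mathbf{y}, \boldsymbol{\theta})$. We will from the outset discuss the widely linear story, and the simplifications in the linear case should be obvious. The composite vector of augmented centered errors and scores $[\underline{\mathbf{e}}^{T}, \underline{\mathbf{s}}^{T}]^{T}$ has covariance matrix

$$E\left\{\begin{bmatrix}\underline{\mathbf{e}}(\mathbf{y}, \boldsymbol{\theta}) \\ \underline{\mathbf{s}}(\mathbf{y}, \boldsymbol{\theta})\end{bmatrix}\begin{bmatrix}\underline{\mathbf{e}}^{H}(\mathbf{y}, \boldsymbol{\theta}) & \underline{\mathbf{s}}^{H}(\mathbf{y}, \boldsymbol{\theta})\end{bmatrix}\right\} = \begin{bmatrix}\underline{\mathbf{Q}} & \underline{\mathbf{T}}^{H} \\ \underline{\mathbf{T}} & \underline{\mathbf{J}}\end{bmatrix}. \tag{6.73}$$

In this equation the covariance matrices $\underline{\mathbf{Q}}$, $\underline{\mathbf{T}}$, and $\underline{\mathbf{J}}$ are defined as

$$\underline{\mathbf{Q}} = E[\underline{\mathbf{e}}(\mathbf{y}, \boldsymbol{\theta})\underline{\mathbf{e}}^{H}(\mathbf{y}, \boldsymbol{\theta})], \quad \underline{\mathbf{T}} = E[\underline{\mathbf{s}}(\mathbf{y}, \boldsymbol{\theta})\underline{\mathbf{e}}^{H}(\mathbf{y}, \boldsymbol{\theta})], \quad \underline{\mathbf{J}} = E[\underline{\mathbf{s}}(\mathbf{y}, \boldsymbol{\theta})\underline{\mathbf{s}}^{H}(\mathbf{y}, \boldsymbol{\theta})], \tag{6.74}$$

where $\underline{\mathbf{Q}}$ is the augmented error covariance for the estimator $\hat{\boldsymbol{\theta}}(\mathbf{y})$, $\underline{\mathbf{T}}$ is the augmented sensitivity or expansion-coefficient matrix, and $\underline{\mathbf{J}}$ is the augmented information matrix associated with the centered measurement score $\mathbf{s}(\mathbf{y}, \boldsymbol{\theta})$. In the case of the Fisher–Bayes score, to be discussed in Section 6.5, the information is Fisher–Bayes information. The covariance matrix for the composite augmented error and score is positive semidefinite, so, assuming the augmented information matrix to be nonsingular, we obtain the following bound.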


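Result 6.6. The augmented error covariance is bounded by the general augmented quadratic Bayesian bound

$$\underline{\mathbf{Q}} \geq \underline{\mathbf{T}}^{H}\underline{\mathbf{J}}^{-1}\underline{\mathbf{T}}. \tag{6.75}$$

The augmented mean-squared error matrix is bounded by

$$\underline{\mathbf{M}} \geq \underline{\mathbf{T}}^{H}\underline{\mathbf{J}}^{-1}\underline{\mathbf{T}} + \underline{\mathbf{b}}\,\underline{\mathbf{b}}^{H}. \tag{6.76}$$

The bounds on $\mathbf{Q}$ and $\mathbf{M}$ are obtained by reading out the northwest block of $\underline{\mathbf{Q}}$ and $\underline{\mathbf{M}}$. Only the choice of score distinguishes one bound from another. Moreover, all quadratic Bayes bounds are invariant with respect to nonsingular widely linear transformation of the score function $\mathbf{s}$. That is, the transformed score $\underline{\mathbf{L}}\,\underline{\mathbf{s}}$ transforms the augmented sensitivity matrix $\underline{\mathbf{T}}$ to $\underline{\mathbf{L}}\,\underline{\mathbf{T}}$ and the augmented information matrix $\underline{\mathbf{J}}$ to $\underline{\mathbf{L}}\,\underline{\mathbf{J}}\,\underline{\mathbf{L}}^{H}$, leaving the bound invariant.

Definition 6.4. A widely linear estimator will be said to be Bayes-efficient with respect to a given score if the bound in (6.75) is achieved with equality.

A necessary and sufficient condition for efficiency is

$$\hat{\underline{\boldsymbol{\theta}}}(\mathbf{y}) - \underline{\boldsymbol{\theta}} - (E[\hat{\underline{\boldsymbol{\theta}}}] - E[\underline{\boldsymbol{\theta}}]) = \underline{\mathbf{T}}^{H}\underline{\mathbf{J}}^{-1}\underline{\mathbf{s}}(\mathbf{y}, \boldsymbol{\theta}), \tag{6.77}$$

which is also invariant with respect to widely linear transformation $\underline{\mathbf{L}}$. That is, an estimator is efficient if, element-by-element, the centered error scores lie in the widely linear subspace spanned by the elements of the centered measurement score. The minimum achievable error covariance will be the covariance of the conditional mean estimator, which is unbiased. Thus, the Fisher–Bayes bound will be the greatest achievable quadratic lower bound if and only if the conditional mean estimator can be written as $E[\underline{\boldsymbol{\theta}}|\mathbf{y}] = \underline{\mathbf{T}}^{H}\underline{\mathbf{J}}^{-1}\underline{\mathbf{s}}(\mathbf{y}, \boldsymbol{\theta}) + \underline{\boldsymbol{\theta}}$. This result follows from the fact that the error of the MMSE estimator has zero mean, which is to say that $E[\hat{\boldsymbol{\theta}}(\mathbf{y})] = E[\boldsymbol{\theta}]$. The comments in Section 6.2.1, regarding good and bad scores, and Section 6.2.2 apply almost directly to Bayesian bounds, with one cautionary note: the Bayes expansion coefficient $t_i = E[s_i(\mathbf{y}, \theta)e^{*}(\mathbf{y}, \theta)]$ is not generally $E[t_i(\theta)] = E\{E_{\theta}[t_i(\theta)]\}$, since the frequentist's score $s_{\theta}(\mathbf{y})$ is not generally the Bayesian's score $s(\mathbf{y}, \theta)$, which will include prior information about the parameter $\theta$.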
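The invariance of the quadratic bound under a nonsingular transformation of the score can be illustrated with a few lines of linear algebra. The sketch below is an assumption-laden illustration rather than anything from the book: the matrices are generic random complex matrices (their augmented block structure is ignored), and the helper names `crandn` and `quadratic_bound` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4  # size of the score and error vectors, chosen arbitrarily

def crandn(*shape):
    """Complex standard-normal array (illustrative helper)."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

T = crandn(n, n)                       # stand-in for the sensitivity matrix
A = crandn(n, n)
J = A @ A.conj().T + n * np.eye(n)     # Hermitian positive definite information matrix
L = crandn(n, n) + 3 * np.eye(n)       # transformation applied to the score

def quadratic_bound(T, J):
    """Return T^H J^{-1} T, the right-hand side of (6.75)."""
    return T.conj().T @ np.linalg.solve(J, T)

bound = quadratic_bound(T, J)
# Transformed score L s: T becomes L T and J becomes L J L^H.
bound_transformed = quadratic_bound(L @ T, L @ J @ L.conj().T)
print(np.allclose(bound, bound_transformed))
```

The factors of L cancel inside the product, which is why only the choice of score, and not its scaling or mixing, distinguishes one quadratic bound from another.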

6.5

Fisher–Bayes score and Fisher–Bayes bound

Our results for quadratic Bayes bounds so far are general. To make them applicable, we need to consider a concrete score for which the augmented covariances $\underline{\mathbf{T}}$ and $\underline{\mathbf{J}}$ can be computed. Following the development of the Fisher score, we choose the Fisher–Bayes score and compute its associated Fisher–Bayes bound.


6.5.1

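Fisher–Bayes score and information

Definition 6.5. The Fisher–Bayes measurement score is defined as

$$\boldsymbol{\sigma}(\mathbf{y}, \boldsymbol{\theta}) = \left[\frac{\partial}{\partial\boldsymbol{\theta}}\log p(\mathbf{y}, \boldsymbol{\theta})\right]^{H} = \left[\frac{\partial}{\partial\theta_1^{*}}\log p(\mathbf{y}, \boldsymbol{\theta}), \ldots, \frac{\partial}{\partial\theta_p^{*}}\log p(\mathbf{y}, \boldsymbol{\theta})\right]^{T}. \tag{6.78}$$

As throughout this chapter, all partial derivatives are Wirtinger derivatives as defined in Appendix 2. Thus, the Fisher–Bayes score is a $p$-dimensional complex column vector. The Fisher–Bayes score is the sum of the Fisher score and the prior score,

$$\boldsymbol{\sigma}(\mathbf{y}, \boldsymbol{\theta}) = \left[\frac{\partial}{\partial\boldsymbol{\theta}}\log p(\mathbf{y}, \boldsymbol{\theta})\right]^{H} = \left[\frac{\partial}{\partial\boldsymbol{\theta}}\log p_{\boldsymbol{\theta}}(\mathbf{y})\right]^{H} + \left[\frac{\partial}{\partial\boldsymbol{\theta}}\log p(\boldsymbol{\theta})\right]^{H} = \boldsymbol{\sigma}_{\boldsymbol{\theta}}(\mathbf{y}) + \boldsymbol{\sigma}(\boldsymbol{\theta}), \tag{6.79}$$

where $\boldsymbol{\sigma}_{\boldsymbol{\theta}}(\mathbf{y})$ is the frequentist score and $\boldsymbol{\sigma}(\boldsymbol{\theta})$ is a new score that scores the prior density. The pdf $p(\mathbf{y}, \boldsymbol{\theta})$ is the joint pdf for the measurement $\mathbf{y}$ and parameter $\boldsymbol{\theta}$. Since the expected value of the Fisher–Bayes score is $\mathbf{0}$, Definition 6.5 also defines the centered measurement score $\mathbf{s}(\mathbf{y}, \boldsymbol{\theta})$.

There are several properties of the Fisher–Bayes score that make it a compelling statistic for inferring the value of a parameter and for bounding the mean-squared error of any estimator for that parameter. In the following list, properties 1, 2, and 4 are analogous to the properties of the Fisher score listed in Section 6.3, but property 3 is different.

1. We may write the partial derivative as

$$\frac{\partial}{\partial\theta_i^{*}}\log p(\mathbf{y}, \boldsymbol{\theta}) = \frac{1}{p(\mathbf{y}, \boldsymbol{\theta})}\frac{\partial}{\partial\theta_i^{*}}p(\mathbf{y}, \boldsymbol{\theta}), \tag{6.80}$$

which is a normalized measure of the sensitivity of the pdf $p(\mathbf{y}, \boldsymbol{\theta})$ to variations in the parameter $\theta_i$. Large sensitivity is valued, and this will be measured by the variance of the score.

2. The Fisher–Bayes score is a zero-mean random variable. This statistical property is consistent with maximum-likelihood procedures that search for maxima of the log-likelihood function $\log p(\mathbf{y}, \boldsymbol{\theta})$ by searching for zeros of $(\partial/\partial\boldsymbol{\theta})\log p(\mathbf{y}, \boldsymbol{\theta})$. Functions with steep slopes are valued, so we value a zero-mean score with large variance.

3. The expansion-coefficient matrix is $\mathbf{T} = \mathbf{I}$, and the complementary expansion-coefficient matrix is $\tilde{\mathbf{T}} = \mathbf{0}$, thus $\underline{\mathbf{T}} = \mathbf{I}$. So, for the Fisher–Bayes measurement score, the centered measurement score is perfectly aligned with the centered error score.

4. The Fisher–Bayes information matrix is also the expected Hessian of the score function (up to a minus sign):

$$\mathbf{J}_{FB} = E[\mathbf{s}(\mathbf{y}, \boldsymbol{\theta})\mathbf{s}^{H}(\mathbf{y}, \boldsymbol{\theta})] = E\left\{\left[\frac{\partial}{\partial\boldsymbol{\theta}}\log p(\mathbf{y}, \boldsymbol{\theta})\right]^{H}\frac{\partial}{\partial\boldsymbol{\theta}}\log p(\mathbf{y}, \boldsymbol{\theta})\right\} = -E\left\{\frac{\partial}{\partial\boldsymbol{\theta}}\left[\frac{\partial}{\partial\boldsymbol{\theta}}\log p(\mathbf{y}, \boldsymbol{\theta})\right]^{H}\right\}. \tag{6.81}$$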

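Analogously, the complementary Fisher–Bayes information matrix is

$$\tilde{\mathbf{J}}_{FB} = E[\mathbf{s}(\mathbf{y}, \boldsymbol{\theta})\mathbf{s}^{T}(\mathbf{y}, \boldsymbol{\theta})] = -E\left\{\frac{\partial}{\partial\boldsymbol{\theta}}\left[\frac{\partial}{\partial\boldsymbol{\theta}}\log p(\mathbf{y}, \boldsymbol{\theta})\right]^{T}\right\}. \tag{6.82}$$

The proofs of properties 2 and 4 for the Fisher–Bayes score are not substantially different from the corresponding proofs for the Fisher score, so we omit them. Property 3 is proved as follows. The bias $\mathbf{b}$ of the parameter estimator $\hat{\boldsymbol{\theta}}$ is differentiated with respect to $\boldsymbol{\theta}$ as

$$\mathbf{0} = \left[\frac{\partial}{\partial\boldsymbol{\theta}}\mathbf{b}\right]^{H} = \iint\left[\frac{\partial}{\partial\boldsymbol{\theta}}\big((\hat{\boldsymbol{\theta}}(\mathbf{y}) - \boldsymbol{\theta})\,p(\mathbf{y}, \boldsymbol{\theta})\big)\right]^{H}d\mathbf{y}\,d\boldsymbol{\theta} = \iint\left[\frac{\partial}{\partial\boldsymbol{\theta}}\log p(\mathbf{y}, \boldsymbol{\theta})\right]^{H}[\hat{\boldsymbol{\theta}}(\mathbf{y}) - \boldsymbol{\theta}]^{H}\,p(\mathbf{y}, \boldsymbol{\theta})\,d\mathbf{y}\,d\boldsymbol{\theta} - \mathbf{I} = \mathbf{T} - \mathbf{I}, \tag{6.83}$$

which produces the identity $\mathbf{T} = \mathbf{I}$. On the other hand, the complementary expansion-coefficient matrix is derived analogously by considering

$$\mathbf{0} = \left[\frac{\partial}{\partial\boldsymbol{\theta}}\mathbf{b}^{*}\right]^{H} = \iint\left[\frac{\partial}{\partial\boldsymbol{\theta}}\big((\hat{\boldsymbol{\theta}}(\mathbf{y}) - \boldsymbol{\theta})^{*}\,p(\mathbf{y}, \boldsymbol{\theta})\big)\right]^{H}d\mathbf{y}\,d\boldsymbol{\theta} = \iint\left[\frac{\partial}{\partial\boldsymbol{\theta}}\log p(\mathbf{y}, \boldsymbol{\theta})\right]^{H}[\hat{\boldsymbol{\theta}}(\mathbf{y}) - \boldsymbol{\theta}]^{T}\,p(\mathbf{y}, \boldsymbol{\theta})\,d\mathbf{y}\,d\boldsymbol{\theta} - \mathbf{0} = \tilde{\mathbf{T}}. \tag{6.84}$$

This shows that $\tilde{\mathbf{T}} = \mathbf{0}$, which means that the error and Fisher–Bayes score are always cross-proper. This is a consequence of the way the Fisher–Bayes score is defined. In summary, the augmented expansion-coefficient matrix is $\underline{\mathbf{T}} = \mathbf{I}$. The Fisher–Bayes matrix $\mathbf{J}_{FB}$ is the sum of the expected Fisher matrix and the expected prior information matrix, since the frequentist score and prior score are uncorrelated:

$$\mathbf{J}_{FB} = E[\mathbf{J}_F(\boldsymbol{\theta})] + E[\mathbf{J}_P(\boldsymbol{\theta})]. \tag{6.85}$$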

Thus, the Fisher–Bayes information is not simply the average of the Fisher information, although, for diffuse priors or a plenitude of measurements, it is approximately so.

6.5.2

Fisher–Bayes bound

The Fisher–Bayes bound on the augmented error covariance matrix is Result 6.6 with Fisher–Bayes score: $\underline{\mathbf{Q}} \geq \underline{\mathbf{J}}_{FB}^{-1}$. From it, we obtain the bound on the error covariance matrix.

Result 6.7. The widely linear Fisher–Bayes bound on the error covariance matrix is

$$\mathbf{Q} \geq (\mathbf{J}_{FB} - \tilde{\mathbf{J}}_{FB}\mathbf{J}_{FB}^{-*}\tilde{\mathbf{J}}_{FB}^{*})^{-1}. \tag{6.86}$$

The Fisher–Bayes bound is sometimes also called the stochastic Cramér–Rao bound. In a number of aspects it behaves essentially like the “deterministic” Cramér–Rao bound in Result 6.4.


• The widely linear Fisher–Bayes bound is tighter than the linear Fisher–Bayes bound ignoring complementary Fisher–Bayes information because $(\mathbf{J}_{FB} - \tilde{\mathbf{J}}_{FB}\mathbf{J}_{FB}^{-*}\tilde{\mathbf{J}}_{FB}^{*})^{-1} \geq \mathbf{J}_{FB}^{-1}$. However, $\mathbf{J}_{FB}$ itself depends on whether or not $\mathbf{y}$ is proper.
• If $M$ repeated measurements carry information about one fixed parameter $\boldsymbol{\theta}$ through the product pdf $\prod_{i=1}^{M}p(\mathbf{y}_i, \boldsymbol{\theta})$, the Fisher–Bayes bound decreases as $M^{-1}$.
• The influence of nuisance parameters on the Fisher–Bayes bound remains unchanged from Section 6.3.1: nuisance parameters increase the Fisher–Bayes bound.
• A tractable formula for Fisher–Bayes bounds for functions of parameters is obtained for the widely linear case, in which the development of Section 6.3.6 proceeds essentially unchanged.

Suppose that measurement and parameter have a jointly Gaussian distribution $p(\mathbf{y}, \boldsymbol{\theta}) = p_{\boldsymbol{\theta}}(\mathbf{y})\,p(\boldsymbol{\theta})$, and $\boldsymbol{\theta}$ has augmented covariance matrix $\underline{\mathbf{R}}_{\theta\theta}$. The new term added to the Fisher information matrix to produce the Fisher–Bayes information matrix is the prior information $\underline{\mathbf{J}}_P = \underline{\mathbf{R}}_{\theta\theta}^{-1}$, which is independent of the mean of $\boldsymbol{\theta}$. Consequently the augmented Fisher–Bayes bound is

$$\underline{\mathbf{Q}} \geq (\underline{\mathbf{J}}_F + \underline{\mathbf{R}}_{\theta\theta}^{-1})^{-1}, \tag{6.87}$$

where $\underline{\mathbf{J}}_F = E[\underline{\mathbf{J}}_F(\boldsymbol{\theta})]$ and the elements of the augmented Fisher matrix $\underline{\mathbf{J}}_F(\boldsymbol{\theta})$ have been derived in Section 6.3.5.
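As an illustration of the Gaussian Fisher–Bayes bound, consider a hypothetical linear model y = Hθ + n with proper Gaussian noise and a proper Gaussian prior. This model, the matrix sizes, and the helper names are assumptions made for the sketch and do not appear in the text. In the proper case the Fisher matrix is H^H R_nn^{-1} H and the bound has the form of (6.87) without augmentation; the sketch checks that it coincides with the error covariance of the linear MMSE estimator.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 6, 2  # measurement and parameter dimensions (arbitrary)

def crandn(*shape):
    """Complex standard-normal array (illustrative helper)."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

def random_cov(k):
    """Random Hermitian positive definite matrix (illustrative helper)."""
    A = crandn(k, k)
    return A @ A.conj().T + k * np.eye(k)

H = crandn(n, p)        # hypothetical model matrix
R_nn = random_cov(n)    # noise covariance
R_tt = random_cov(p)    # prior covariance of theta

# Fisher matrix for the linear Gaussian model; it does not depend on theta here.
J_F = H.conj().T @ np.linalg.solve(R_nn, H)
# Fisher-Bayes bound: inverse of Fisher information plus prior information.
fb_bound = np.linalg.inv(J_F + np.linalg.inv(R_tt))

# Error covariance of the linear MMSE estimator of theta from y.
R_yy = H @ R_tt @ H.conj().T + R_nn
mmse_cov = R_tt - R_tt @ H.conj().T @ np.linalg.solve(R_yy, H @ R_tt)
print(np.allclose(fb_bound, mmse_cov))
```

The agreement is the matrix inversion lemma at work, and it is consistent with the remark later in the chapter that the MMSE estimator achieves the Fisher–Bayes bound in such examples.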

6.6

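Connections and orderings among bounds

Generally, we would like to connect Bayesian bounds to frequentist bounds, and particularly the Fisher–Bayes bound to the Cramér–Rao bound. In order to do so, we appeal to the fundamental identity

$$\mathbf{Q} = E[\mathbf{Q}(\boldsymbol{\theta})] + \mathbf{R}_{bb}. \tag{6.88}$$

Thus, there is the Bayesian bound

$$\mathbf{Q} \geq E[\mathbf{T}^{H}(\boldsymbol{\theta})\mathbf{J}^{-1}(\boldsymbol{\theta})\mathbf{T}(\boldsymbol{\theta})] + E[(\mathbf{b}(\boldsymbol{\theta}) - \mathbf{b})(\mathbf{b}(\boldsymbol{\theta}) - \mathbf{b})^{H}]. \tag{6.89}$$

This is a covariance bound on any estimator whose frequentist bias is $\mathbf{b}(\boldsymbol{\theta})$ and Bayes bias is $\mathbf{b}$. If the frequentist bias is constant at $\mathbf{b}$ (as for LMVDR (BLUE) estimators, for which $\mathbf{b}(\boldsymbol{\theta}) = \mathbf{0}$) then $\mathbf{R}_{bb} = \mathbf{0}$, and the bound is $\mathbf{Q} \geq E[\mathbf{T}^{H}(\boldsymbol{\theta})\mathbf{J}^{-1}(\boldsymbol{\theta})\mathbf{T}(\boldsymbol{\theta})]$. If the score is the Fisher score, and the bias is constant, then $\mathbf{T}(\boldsymbol{\theta}) = \mathbf{I}$ and the bound is

$$\mathbf{Q} \geq E[\mathbf{J}_F^{-1}(\boldsymbol{\theta})]. \tag{6.90}$$

This is a bound on any estimator of a random parameter with the property that its frequentist and Bayes bias are constant, as in LMVDR estimators. We know this bound is larger than the Fisher–Bayes bound, because $\mathbf{J}_{FB} = E[\mathbf{J}_F(\boldsymbol{\theta})] + E[\mathbf{J}_P(\boldsymbol{\theta})]$, so we have the ordering

$$E[\mathbf{J}_F^{-1}(\boldsymbol{\theta})] \geq \mathbf{J}_{FB}^{-1}. \tag{6.91}$$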


This ordering makes sense because the right-hand side is a bound on estimators for which there is no constraint on the frequentist bias, as in MMSE estimation, and the left-hand side is a bound on estimators for which there is the constraint that the frequentist bias be constant. The constraint increases error covariance, as we have seen when comparing the variance of LMVDR and MMSE estimators. When there is a plenitude of data or when the prior is so diffuse that $E[\mathbf{J}_P(\boldsymbol{\theta})] \ll E[\mathbf{J}_F(\boldsymbol{\theta})]$, then $E[\mathbf{J}_F(\boldsymbol{\theta})]$ approximates $\mathbf{J}_{FB}$ from below and consequently $(E[\mathbf{J}_F(\boldsymbol{\theta})])^{-1} \geq \mathbf{J}_{FB}^{-1}$. Then, by Jensen's inequality, $E[\mathbf{J}_F^{-1}(\boldsymbol{\theta})] \geq (E[\mathbf{J}_F(\boldsymbol{\theta})])^{-1}$, so we may squeeze $(E[\mathbf{J}_F(\boldsymbol{\theta})])^{-1}$ between the two bounds to obtain the ordering (cf. Van Trees and Bell (2007))

$$E[\mathbf{J}_F^{-1}(\boldsymbol{\theta})] \geq (E[\mathbf{J}_F(\boldsymbol{\theta})])^{-1} \geq \mathbf{J}_{FB}^{-1}. \tag{6.92}$$

To reiterate, the rightmost inequality follows from the fact that the Fisher–Bayes information matrix is more positive semidefinite than the expected value of the Fisher information matrix. The leftmost follows from Jensen's inequality. Why should this ordering hold? The leftmost bound is a bound for estimators that are frequentist and Bayes unbiased, and frequentist unbiasedness is not generally a desirable property for an estimator of a random parameter. So this extra property tends to increase mean-squared error. The middle term ignores prior information about the random parameter, so it is only a bound in the limit as the Fisher–Bayes information matrix converges to the expected Fisher information matrix. The rightmost bound enforces no constraints on the frequentist or Bayes bias. The leftmost bound we tend to associate with LMVDR and ML estimators, the middle term we associate with the limiting performance of MMSE estimators when the prior is diffuse or there is a plenitude of measurements, and the rightmost bound we tend to associate with MMSE estimators. In fact, in our examples we have seen that the LMVDR and ML estimators achieve the leftmost bound and the MMSE estimator achieves the rightmost. Of course, it is not possible or meaningful to fit the frequentist Cramér–Rao bound into this ordering, except when $\mathbf{J}_F(\boldsymbol{\theta})$ is independent of $\boldsymbol{\theta}$, in which case $\mathbf{J}_F^{-1} = E[\mathbf{J}_F^{-1}] = (E[\mathbf{J}_F])^{-1}$. Then $\mathbf{J}_F^{-1} \geq \mathbf{J}_{FB}^{-1}$.
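The chain of inequalities in (6.92) can be observed on synthetic data. In the sketch below, which is illustrative only, random positive definite matrices stand in for the Fisher matrix evaluated at draws of the random parameter, and another random positive definite matrix stands in for the expected prior information; none of these come from a real estimation problem. Expectations are replaced by sample averages, for which the matrix form of Jensen's inequality still holds.

```python
import numpy as np

rng = np.random.default_rng(3)
p, K = 3, 500  # matrix size and number of parameter draws (arbitrary)

def random_pd(k):
    """Random Hermitian positive definite matrix (illustrative helper)."""
    A = rng.standard_normal((k, k)) + 1j * rng.standard_normal((k, k))
    return A @ A.conj().T + 0.1 * np.eye(k)

J_samples = [random_pd(p) for _ in range(K)]  # stand-ins for J_F(theta_k)
J_P = random_pd(p)                            # stand-in for E[J_P(theta)]

mean_J = sum(J_samples) / K
mean_of_inv = sum(np.linalg.inv(J) for J in J_samples) / K  # E[J_F^{-1}]
inv_of_mean = np.linalg.inv(mean_J)                         # (E[J_F])^{-1}
inv_J_FB = np.linalg.inv(mean_J + J_P)                      # J_FB^{-1}

def min_eig(A):
    """Smallest eigenvalue of the Hermitian part of A."""
    return np.linalg.eigvalsh((A + A.conj().T) / 2).min()

left_gap = min_eig(mean_of_inv - inv_of_mean)   # Jensen's inequality
right_gap = min_eig(inv_of_mean - inv_J_FB)     # extra prior information
print(left_gap, right_gap)
```

Both gaps should be nonnegative up to rounding, matching the leftmost and rightmost inequalities of the ordering.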

Notes

1 There is a vast literature on frequentist and Bayesian bounds. This literature is comprehensively represented, with original results by the editors, in Van Trees and Bell (2007). We make no attempt in this chapter to cite a representative subset of theoretical and applied papers in this edited volume, or in the published literature at large, since to do so would risk offense to the authors of the papers excluded. This chapter addresses performance bounds for the entire class of quadratic frequentist and Bayesian bounds, for proper and improper complex parameters and measurements. Our method has been to cast the problem of quadratic performance bounding in the framework of error score estimation from measurement scores in a virtual two-channel experiment. This point of view is consistent with the point of view used in Chapter 5 and it allows us to extend the reasoning of Weinstein and Weiss (1988) to widely linear estimators.


We have not covered the derivation of performance bounds under parameter constraints. Representative published works are those by Gorman and Hero (1990), Marzetta (1993), and Stoica and Ng (1998).

2 The examples we give in this chapter are not representative of a great number of problems in signal processing where mode parameters like frequency, wavenumber, direction of arrival, and so on are to be estimated. In these problems the Cramér–Rao and Fisher–Bayes bounds are tight at high SNR and very loose at low SNR. Consequently they do not predict performance breakdown at a threshold SNR. In an effort to analyze the onset of breakdown, many alternative scores have been proposed. Even the Cramér–Rao and Fisher–Bayes bounds can be adapted to this problem by using a method of intervals, wherein a mixture of the Cramér–Rao or Fisher–Bayes bound and a constant bound representative of total breakdown is averaged. The averaging probability is the probability of error lying outside the interval wherein the likelihood is quadratic, and it may be computed for many problems. This method, which has been used to effect by Richmond (2006), is remarkably accurate. So we might say that quadratic performance bounding, augmented with a method of intervals, is a powerful way to model error covariances from high SNR, through the threshold region of low SNR where the quadratic bound by itself is not tight, to the region of total breakdown.

3 The Cramér–Rao bounds for these nonlinear parameters have been treated extensively in the literature, and are well represented by Van Trees and Bell (2007). McWhorter and Scharf (1993b) discuss concentration ellipsoids for error covariances when estimating the sum and difference frequencies of closely spaced modes within the matrix $\mathbf{H}(\boldsymbol{\theta})$. See also the discussion of the geometry of the Cramér–Rao bound by McWhorter and Scharf (1993a).

7

Detection

Detection is the electrical engineer’s term for the statistician’s hypothesis testing. The problem is to determine which of two or more competing models best describes experimental measurements. If the competition is between two models, then the detection problem is a binary detection problem. Such problems apply widely to communication, radar, and sonar. But even a binary problem can be composite, which is to say that one or both of the hypotheses may consist of a set of models. We shall denote by H0 the hypothesis that the underlying model, or set of models, is M0 and by H1 the hypothesis that it is M1 . There are two main lines of development for detection theory: Neyman–Pearson and Bayes. The Neyman–Pearson theory is a frequentist theory that assigns no prior probability of occurrence to the competing models. Bayesian theory does. Moreover, the measure of optimality is different. To a frequentist the game is to maximize the detection probability under the constraint that the false-alarm probability is not greater than a prespecified value. To a Bayesian the game is to assign costs to incorrect decisions, and then to minimize the average (or Bayes) cost. The solution in any case is to evaluate the likelihood of the measurement under each hypothesis, and to choose the model whose likelihood is higher. Well – not quite. It is the likelihood ratio that is evaluated, and when this ratio exceeds a threshold, determined either by the false-alarm rate or by the Bayes cost, one or other of the hypotheses is accepted. It is commonplace for each of the competing models to contain unknown nuisance parameters and unknown decision parameters. Only the decision parameters are tested, but the nuisance parameters complicate the computation of likelihoods and defeat attempts at optimality. 
There are generally two strategies: enforce a condition of invariance, which limits competition to detectors that share these invariances, or invoke a principle of likelihood that replaces nuisance parameters by their maximum-likelihood estimates. These lead to generalized likelihood ratios. In many cases the two approaches produce identical tests. So, in summary, the likelihood-ratio statistic is the gold standard. It plays the same role for binary hypothesis testing as the conditional mean estimator plays in MMSE estimation. The theory of Neyman–Pearson and Bayes detection we shall review is quite general. 1 It reveals functions of the measurement that may include arbitrary functions of the complex measurement y, including its complex conjugate y∗ . However, in order to demonstrate the role of widely linear and widely quadratic forms in the theory of


hypothesis testing, we shall concentrate on hypothesis testing within the proper and improper Gaussian measurement models. Throughout this chapter we assume that measurements have pdfs, which are allowed to contain Dirac δ-functions. Thus, all of our results hold for mixed random variables with discrete and continuous, and even singular, components.

7.1

Binary hypothesis testing

A hypothesis is a statement about the distribution of a measurement $\mathbf{y} \in \mathcal{Y}$. Typically the hypothesis is coded by a parameter $\boldsymbol{\theta} \in \Theta$ that indexes the pdf $p_{\boldsymbol{\theta}}(\mathbf{y})$ for the measurement random vector $\mathbf{y}\colon \Omega \to \mathbb{C}^n$. We might say the parameter modulates the measurement in the sense that the value of $\boldsymbol{\theta}$ modulates the distribution of the measurement. Often the parameter modulates the mean vector, as in communications, but sometimes it modulates the covariance matrix, as in oceanography and optics, where a frequency–wavenumber spectrum determines a covariance matrix. We shall be interested in two competing hypotheses, $H_0$ and $H_1$, and therefore we address ourselves to binary hypothesis testing, as opposed to multiple hypothesis testing. The hypothesis $H_0$ is sometimes called the hypothesis and the hypothesis $H_1$ is sometimes called the alternative, so we are testing a hypothesis against its alternative. Each of the hypotheses may be simple, in which case the pdf of the measurement is $p_{\boldsymbol{\theta}_i}(\mathbf{y})$ under hypothesis $H_i$. Or each of them may be composite, in which case the distribution of the measurement is $p_{\boldsymbol{\theta}}(\mathbf{y})$, $\boldsymbol{\theta} \in \Theta_i$, under hypothesis $H_i$. The parameter sets $\Theta_i$ and the measurement space $\mathcal{Y}$ may be quite general.

Definition 7.1. A binary test of $H_1$ versus $H_0$ is a test statistic, or threshold detector $\phi\colon \mathcal{Y} \to \{0, 1\}$ of the form

$$\phi(\mathbf{y}) = \begin{cases} 1, & \mathbf{y} \in \mathcal{Y}_1, \\ 0, & \mathbf{y} \in \mathcal{Y}_0. \end{cases} \tag{7.1}$$

We might say that $\phi$ is a one-bit quantizer of the random vector $\mathbf{y}$. It sorts measurements into two sets: the detection region $\mathcal{Y}_1$ and its complement region $\mathcal{Y}_0$. Such a test will make two types of errors: missed detections and false alarms. That is, it will sometimes classify a measurement drawn from the pdf $p_{\boldsymbol{\theta}}(\mathbf{y})$, $\boldsymbol{\theta} \in \Theta_1$, as a measurement drawn from the pdf $p_{\boldsymbol{\theta}}(\mathbf{y})$, $\boldsymbol{\theta} \in \Theta_0$. This is a missed detection. Or it might classify a measurement drawn from the pdf $p_{\boldsymbol{\theta}}(\mathbf{y})$, $\boldsymbol{\theta} \in \Theta_0$, as a measurement drawn from the pdf $p_{\boldsymbol{\theta}}(\mathbf{y})$, $\boldsymbol{\theta} \in \Theta_1$. This is a false alarm.
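For a simple hypothesis and simple alternative the detection probability $P_D$ and the false-alarm probability $P_F$ of a test are defined as follows:

$$P_D = P_{\boldsymbol{\theta}_1}(\phi = 1) = E_{\boldsymbol{\theta}_1}[\phi], \tag{7.2}$$

$$P_F = P_{\boldsymbol{\theta}_0}(\phi = 1) = E_{\boldsymbol{\theta}_0}[\phi]. \tag{7.3}$$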


The subscripts on the probability and expectation operators denote probability and expectation computed under the pdf of the measurement $p_{\boldsymbol{\theta}_i}(\mathbf{y})$. We shall generalize these definitions for testing composite hypotheses in due course. Detection probability and false-alarm probability are engineering terms for the statistician's power and size. Both terms are evocative, with the statistician advocating for a test that is small but powerful, like a compact drill motor, and the engineer advocating for a test that has high detection probability and low false-alarm probability, like a good radar detector on one of the authors' Mazda RX-7.

7.1.1

The Neyman–Pearson lemma

In binary hypothesis testing of a simple alternative versus a simple hypothesis, the test statistic $\phi$ is a Bernoulli random variable $B[p]$, with $p = P_D$ under $H_1$ and $p = P_F$ under $H_0$. Ideally, it would have the property $P_D = 1$ and $P_F = 0$, but these are unattainable for non-degenerate tests. So we aim to maximize the detection probability under the constraint that the false-alarm probability be no greater than the design value of $P_F = \alpha$. In other words, it should have maximum power for its small size. We first state the Neyman–Pearson (NP) lemma without proof.²

Result 7.1. (Neyman–Pearson) Among all competing binary tests $\phi$ of a simple alternative versus a simple hypothesis, with probability of false alarm less than or equal to $\alpha$, none has larger detection probability than the likelihood-ratio test

$$\phi(\mathbf{y}) = \begin{cases} 1, & l(\mathbf{y}) > \eta, \\ \gamma, & l(\mathbf{y}) = \eta, \\ 0, & l(\mathbf{y}) < \eta. \end{cases} \tag{7.4}$$

In this equation, $0 \leq l(\mathbf{y}) < \infty$ is the real-valued likelihood ratio

$$l(\mathbf{y}) = \frac{p_{\boldsymbol{\theta}_1}(\mathbf{y})}{p_{\boldsymbol{\theta}_0}(\mathbf{y})} \tag{7.5}$$

and the threshold parameters $(\eta, \gamma)$ are chosen to satisfy the following constraint on the false-alarm probability:

$$P_{\boldsymbol{\theta}_0}[l(\mathbf{y}) > \eta] + \gamma P_{\boldsymbol{\theta}_0}[l(\mathbf{y}) = \eta] = \alpha. \tag{7.6}$$

When $\phi(\mathbf{y}) = \gamma$, we select $H_1$ with probability $\gamma$, and $H_0$ with probability $1 - \gamma$. The likelihood ratio $l$ is a highly compressed function of the measurement, returning a nonnegative number, and the test function $\phi$ returns what might be called a one-bit quantized version of the likelihood ratio. Any monotonic function of the likelihood ratio $l$ will do, so in multivariate Gaussian theory it is the logarithm of $l$, or the log-likelihood ratio $L$, that is used in place of the likelihood ratio. There is a cautionary note about the reading of these equations. The test statistic $\phi$ and the likelihood ratio $l$ are functions of the measurement random variable, making $\phi$ and $l$ random variables. When evaluated at a particular realization or measurement, they are test values and likelihood-ratio values.
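To make the lemma concrete, the following sketch designs a likelihood-ratio test for a deliberately simple case that is not taken from the text: a real scalar measurement that is N(0, 1) under H0 and N(d, 1) under H1 with d > 0. The function names are hypothetical. Because the likelihood ratio is continuous here, the event l(y) = η has probability zero and the randomization parameter γ plays no role.

```python
from math import exp
from statistics import NormalDist

def np_test_design(d, alpha):
    """Design the NP test for H0: N(0,1) versus H1: N(d,1), d > 0.

    Returns the threshold t on y, the equivalent threshold eta on the
    likelihood ratio, and the resulting detection probability.
    """
    std = NormalDist()
    t = std.inv_cdf(1.0 - alpha)        # P_F = P0(y > t) = alpha
    eta = exp(d * t - 0.5 * d * d)      # l(y) = exp(d*y - d^2/2) evaluated at y = t
    p_d = 1.0 - std.cdf(t - d)          # P_D = P1(y > t)
    return t, eta, p_d

def likelihood_ratio(y, d):
    """Likelihood ratio p1(y)/p0(y) for the Gaussian mean-shift model."""
    return exp(d * y - 0.5 * d * d)

def phi(y, d, eta):
    """One-bit test: 1 if the likelihood ratio exceeds eta, else 0."""
    return 1 if likelihood_ratio(y, d) > eta else 0

t, eta, p_d = np_test_design(d=1.5, alpha=0.05)
print(f"threshold on y: {t:.4f}, eta: {eta:.4f}, P_D: {p_d:.4f}")
```

Since the likelihood ratio is monotonically increasing in y for d > 0, thresholding l(y) at η is the same as thresholding y at t, which is an instance of replacing the likelihood ratio by a monotonic function of it.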

7.1.2

Bayes detectors

The only thing that distinguishes a Bayes detector from a Neyman–Pearson (NP) detector is the choice of threshold $\eta$. It is chosen to minimize the Bayes risk, which is defined to be $\pi_1 C_{10}(1 - P_D) + (1 - \pi_1)C_{01}P_F$, where $\pi_1$ is the prior probability that hypothesis $H_1$ will be in force, $C_{10}$ is the cost of choosing $H_0$ when $H_1$ is in force, and $1 - P_D$ is the probability of doing so. Similarly, $C_{01}$ is the cost of choosing $H_1$ when $H_0$ is in force, and $P_F$ is the probability of doing so. Assume that the Bayes test sorts measurements into the set $S \subset \mathbb{C}^n$, where $H_1$ is accepted, and the complement set, where $H_0$ is accepted. Then the Bayes risk may be written as

$$\pi_1 C_{10}(1 - P_D) + (1 - \pi_1)C_{01}P_F = \pi_1 C_{10} + \int_S[(1 - \pi_1)C_{01}\,p_{\boldsymbol{\theta}_0}(\mathbf{y}) - \pi_1 C_{10}\,p_{\boldsymbol{\theta}_1}(\mathbf{y})]\,d\mathbf{y}. \tag{7.7}$$

As in Chapter 6, this integral with respect to a complex variable is interpreted as the integral with respect to its real and imaginary parts:

$$\int f(\mathbf{y})\,d\mathbf{y} \triangleq \iint f(\mathbf{y})\,d[\mathrm{Re}\,\mathbf{y}]\,d[\mathrm{Im}\,\mathbf{y}]. \tag{7.8}$$

The Bayes risk is minimized by choosing $S$ in (7.7) to be the set

$$S = \{\mathbf{y}\colon \pi_1 C_{10}\,p_{\boldsymbol{\theta}_1}(\mathbf{y}) > (1 - \pi_1)C_{01}\,p_{\boldsymbol{\theta}_0}(\mathbf{y})\} = \left\{\mathbf{y}\colon l(\mathbf{y}) > \frac{(1 - \pi_1)C_{01}}{\pi_1 C_{10}} = \eta\right\}. \tag{7.9}$$

This is just the likelihood-ratio test with a particular threshold $\eta$. The only difference between a Bayes test and an NP test is that their thresholds are different.
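The threshold in (7.9) can be checked against a brute-force search. The sketch below reuses the same hypothetical real scalar model as an illustration (N(0, 1) under H0, N(d, 1) under H1, d > 0); the prior, the costs, and the function names are arbitrary assumptions. It evaluates the Bayes risk on a grid of thresholds and compares with the risk at the Bayes threshold.

```python
from math import log
from statistics import NormalDist

def bayes_threshold(pi1, c10, c01):
    """Likelihood-ratio threshold eta from (7.9)."""
    return (1.0 - pi1) * c01 / (pi1 * c10)

def bayes_risk(t, d, pi1, c10, c01):
    """Bayes risk of the test 'y > t' for H0: N(0,1) versus H1: N(d,1)."""
    std = NormalDist()
    p_d = 1.0 - std.cdf(t - d)
    p_f = 1.0 - std.cdf(t)
    return pi1 * c10 * (1.0 - p_d) + (1.0 - pi1) * c01 * p_f

d, pi1, c10, c01 = 1.0, 0.3, 2.0, 1.0     # illustrative values only
eta = bayes_threshold(pi1, c10, c01)
# For d > 0, l(y) = exp(d*y - d^2/2) > eta is equivalent to y > t_bayes.
t_bayes = (log(eta) + 0.5 * d * d) / d
risk_at_bayes = bayes_risk(t_bayes, d, pi1, c10, c01)

# Brute-force comparison over a grid of alternative thresholds on y.
grid = [t_bayes + 0.05 * k for k in range(-40, 41)]
risk_on_grid = [bayes_risk(t, d, pi1, c10, c01) for t in grid]
print(eta, t_bayes, risk_at_bayes, min(risk_on_grid))
```

No threshold on the grid gives a smaller risk than the one obtained from η, in line with the statement that the Bayes test is a likelihood-ratio test with a particular threshold.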

7.1.3

Adaptive Neyman–Pearson and empirical Bayes detectors

It is often the case that only a subset of the parameters is to be tested, leaving the others as nuisance parameters. In these cases the principles of NP or Bayes detection may be expanded to say that unknown nuisance parameters are replaced by their maximum-likelihood estimators or their Bayes estimators, under each of the two hypotheses. The resulting detectors are called adaptive, empirical, or generalized likelihood-ratio detectors. The literature on them is comprehensive, and we make no attempt to review it here. Some examples are discussed in the context of testing for correlation structure in Section 4.5. Later, in Section 7.6, we outline a different story, which is based on invariance with respect to nuisance parameters. This story also leads to detectors that are adaptive.

7.2

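Sufficiency and invariance

The test function $\phi$ is a Bernoulli random variable. Its probability distribution depends on the parameter $\boldsymbol{\theta}$ according to $p = P_D$ under $H_1$ and $p = P_F$ under $H_0$. However, the value of $\phi$, given its likelihood ratio $l$, is 0 if $l < \eta$, $\gamma$ if $l = \eta$, and 1 if $l > \eta$, independently of $\boldsymbol{\theta}$. Thus, according to the Fisher–Neyman factorization theorem, the likelihood-ratio statistic is a sufficient statistic for testing $H_1$ versus $H_0$.

Definition 7.2. We call a statistic $\mathbf{t}(\mathbf{y})$ sufficient if the pdf $p_{\boldsymbol{\theta}}(\mathbf{y})$ may be factored as $p_{\boldsymbol{\theta}}(\mathbf{y}) = a(\mathbf{y})q_{\boldsymbol{\theta}}(\mathbf{t})$, where $q_{\boldsymbol{\theta}}(\mathbf{t})$ is the pdf for the sufficient statistic.

In this case the likelihood ratio equals the likelihood ratio for the sufficient statistic $\mathbf{t}$:

$$l(\mathbf{y}) = \frac{p_{\boldsymbol{\theta}_1}(\mathbf{y})}{p_{\boldsymbol{\theta}_0}(\mathbf{y})} = \frac{q_{\boldsymbol{\theta}_1}(\mathbf{t})}{q_{\boldsymbol{\theta}_0}(\mathbf{t})} = l(\mathbf{t}). \tag{7.10}$$

We may thus rewrite the NP lemma in terms of the sufficient statistic:

$$\phi(\mathbf{t}) = \begin{cases} 1, & l(\mathbf{t}) > \eta, \\ \gamma, & l(\mathbf{t}) = \eta, \\ 0, & l(\mathbf{t}) < \eta. \end{cases} \tag{7.11}$$

In this equation, $(\eta, \gamma)$ are chosen to satisfy the false-alarm probability constraint:

$$P_{\boldsymbol{\theta}_0}[l(\mathbf{t}) > \eta] + \gamma P_{\boldsymbol{\theta}_0}[l(\mathbf{t}) = \eta] = \alpha. \tag{7.12}$$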

The statistic t is often a scalar function of the original measurement, and its likelihood can often be inverted with a monotonic function for the statistic itself. Then the test statistic is a threshold test that simply compares the sufficient statistic against a threshold. Definition 7.3. A sufficient statistic is minimal, meaning that it is most memory efficient, if it is a function of every other sufficient statistic. A sufficient statistic is an invariant statistic with respect to the transformation group G if φ(g(t)) = φ(t) for all g ∈ G. The composition rule for any two transformations within G is g1 ◦ g2 (t) = g1 (g2 (t)).

7.3

Receiver operating characteristic

It is obvious that no NP detector will have a worse $(P_D, P_F)$ pair than $(\alpha, \alpha)$, since the test $\phi = \alpha$ achieves this performance and, according to the NP lemma, cannot be better than the NP likelihood-ratio test. Moreover, from the definitions of $(P_D, P_F)$, an increase in the threshold $\eta$ will decrease the false-alarm probability and decrease the detection probability. So a plot of achievable $(P_D, P_F)$ pairs, called an ROC curve for receiver operating characteristic, appears as the convex curve in Fig. 7.1. Why is it convex? Consider two points on the ROC curve. Draw a straight line between them, to find an operating point for some suboptimum detector. At the $P_F$ for this detector there is a point on the ROC curve for the NP detector that cannot lie below it. Thus the ROC curve for the NP detector is convex.

Figure 7.1 Geometry of the ROC for a Neyman–Pearson detector. [Plot of $P_D$ (vertical axis, 0 to 1) against $P_F$ (horizontal axis, 0 to 1); the tangent to the curve at an operating point has slope $\eta$.]

Every detector will have an operating point somewhere within the convex set bounded by the ROC curve for the NP detector and the $(\alpha, \alpha)$ line. One such operating point is

illustrated with a cross in Fig. 7.1. Any suboptimum test with this operating point may be improved with an NP test whose operating point lies on the arc subtended by the wedge that lies to the northwest of the cross. To improve a test is to decrease its PF for a fixed PD along the horizontal line of the wedge, increase its PD for a fixed PF along the vertical line of the wedge, or decrease PF and increase PD by heading northwest to the ROC line. Recall the identity (7.10) for a sufficient statistic t. If the sufficient statistic is the likelihood itself, that is t = l(y), (7.10) shows that the pdf for likelihood under the two hypotheses is connected as follows: q␪1 (l) = lq␪0 (l). The probability of detection and probability of false alarm may be written η η 1 − PD = q␪1 (l)dl = lq␪0 (l)dl, 0

1 − PF =

(7.13)

(7.14)

0 η

q␪0(l)dl.

(7.15)

0

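Differentiate each of these formulae with respect to $\eta$ to find

$$\frac{\partial P_D}{\partial \eta} = -\eta\, q_{\boldsymbol{\theta}_0}(\eta), \tag{7.16}$$

$$\frac{\partial P_F}{\partial \eta} = -q_{\boldsymbol{\theta}_0}(\eta), \tag{7.17}$$

and therefore

$$\frac{dP_D}{dP_F} = \eta. \tag{7.18}$$

Thus, at any operating point on the ROC curve, the derivative of the curve is the threshold $\eta$ required to operate there.³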
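The slope identity (7.18) can be checked numerically. The sketch below assumes one particular model that is not part of the derivation above: the log-likelihood ratio is taken to be real Gaussian with mean ∓d/2 under H0/H1 and variance d. The function names `roc_point` and `roc_slope` are illustrative only.

```python
from math import log, sqrt
from statistics import NormalDist

PHI = NormalDist().cdf  # distribution function of a standard real Gaussian


def roc_point(eta, d):
    """Return (PF, PD) when the likelihood ratio l is compared against eta.

    Assumed model: log l is real Gaussian with mean -d/2 under H0,
    +d/2 under H1, and variance d under both hypotheses.
    """
    log_eta = log(eta)
    p_f = 1.0 - PHI((log_eta + d / 2) / sqrt(d))
    p_d = 1.0 - PHI((log_eta - d / 2) / sqrt(d))
    return p_f, p_d


def roc_slope(eta, d, rel_step=1e-4):
    """Central finite-difference estimate of dPD/dPF at threshold eta."""
    pf_lo, pd_lo = roc_point(eta * (1 - rel_step), d)
    pf_hi, pd_hi = roc_point(eta * (1 + rel_step), d)
    return (pd_hi - pd_lo) / (pf_hi - pf_lo)


# The estimated slope should reproduce the threshold itself.
for eta in (0.5, 1.0, 2.0):
    print(eta, roc_slope(eta, d=4.0))
```

Under this assumed model the printed slope should agree with the threshold up to finite-difference error, which is what (7.18) predicts.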

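7.4 Simple hypothesis testing in the improper Gaussian model

In the context of this book, where an augmented second-order theory that includes Hermitian and complementary covariances forms the basis for inference, the most important example is the problem of testing in the multivariate Gaussian model. We may test for mean-value and/or covariance structure.

7.4.1 Uncommon means and common covariance

The first problem is to test the alternative $H_1$ that an improper Gaussian measurement $\mathbf{y}$ has mean value $\boldsymbol{\mu}_1$ versus the hypothesis $H_0$ that it has mean value $\boldsymbol{\mu}_0$. Under both hypotheses the measurement $\mathbf{y}$ has Hermitian covariance $\mathbf{R} = E_{\boldsymbol{\theta}_0}[(\mathbf{y} - \boldsymbol{\mu}_0)(\mathbf{y} - \boldsymbol{\mu}_0)^H] = E_{\boldsymbol{\theta}_1}[(\mathbf{y} - \boldsymbol{\mu}_1)(\mathbf{y} - \boldsymbol{\mu}_1)^H]$ and symmetric complementary covariance $\tilde{\mathbf{R}} = E_{\boldsymbol{\theta}_0}[(\mathbf{y} - \boldsymbol{\mu}_0)(\mathbf{y} - \boldsymbol{\mu}_0)^T] = E_{\boldsymbol{\theta}_1}[(\mathbf{y} - \boldsymbol{\mu}_1)(\mathbf{y} - \boldsymbol{\mu}_1)^T]$. These covariances are coded into the augmented covariance matrix

$$\underline{\mathbf{R}} = \begin{bmatrix} \mathbf{R} & \tilde{\mathbf{R}} \\ \tilde{\mathbf{R}}^* & \mathbf{R}^* \end{bmatrix}. \tag{7.19}$$

For convenience, we drop the subscripts on covariance matrices ($\mathbf{R}_{yy}$ etc.) in this chapter. Following Result 2.4, we may write down the pdf of the improper Gaussian vector $\mathbf{y}$ under hypothesis $H_i$:

$$p_{\boldsymbol{\theta}_i}(\mathbf{y}) = \frac{1}{\pi^n \det^{1/2} \underline{\mathbf{R}}} \exp\left\{ -\tfrac{1}{2} (\underline{\mathbf{y}} - \underline{\boldsymbol{\mu}}_i)^H \underline{\mathbf{R}}^{-1} (\underline{\mathbf{y}} - \underline{\boldsymbol{\mu}}_i) \right\}. \tag{7.20}$$

Of course, this pdf also covers the proper case where $\underline{\mathbf{R}}$ is block-diagonal. After some algebra, the real-valued log-likelihood ratio may be expressed as the following widely linear function of $\mathbf{y}$:

$$L = \log \frac{p_{\boldsymbol{\theta}_1}(\mathbf{y})}{p_{\boldsymbol{\theta}_0}(\mathbf{y})} = -\tfrac{1}{2} (\underline{\mathbf{y}} - \underline{\boldsymbol{\mu}}_1)^H \underline{\mathbf{R}}^{-1} (\underline{\mathbf{y}} - \underline{\boldsymbol{\mu}}_1) + \tfrac{1}{2} (\underline{\mathbf{y}} - \underline{\boldsymbol{\mu}}_0)^H \underline{\mathbf{R}}^{-1} (\underline{\mathbf{y}} - \underline{\boldsymbol{\mu}}_0) \tag{7.21}$$

$$\phantom{L} = (\underline{\boldsymbol{\mu}}_1 - \underline{\boldsymbol{\mu}}_0)^H \underline{\mathbf{R}}^{-1} (\underline{\mathbf{y}} - \underline{\mathbf{y}}_0), \tag{7.22}$$

with $\mathbf{y}_0 = \tfrac{1}{2}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_0)$. The log-likelihood ratio may be written as the inner product

$$L = \left[ \underline{\mathbf{R}}^{-1} (\underline{\boldsymbol{\mu}}_1 - \underline{\boldsymbol{\mu}}_0) \right]^H (\underline{\mathbf{y}} - \underline{\mathbf{y}}_0) = 2\,\mathrm{Re}\{ \mathbf{w}^H (\mathbf{y} - \mathbf{y}_0) \}, \tag{7.23}$$

with

$$\underline{\mathbf{w}} = \begin{bmatrix} \mathbf{w} \\ \mathbf{w}^* \end{bmatrix} = \underline{\mathbf{R}}^{-1} (\underline{\boldsymbol{\mu}}_1 - \underline{\boldsymbol{\mu}}_0). \tag{7.24}$$

Taking the inner product in (7.23) can be interpreted as coherent matched filtering. In the proper case, the matching vector $\mathbf{w}$ simplifies to $\mathbf{w} = \mathbf{R}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0)$.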
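For a scalar measurement (n = 1) the augmented covariance is the 2 × 2 matrix [[r, ρ], [ρ*, r]], which can be inverted by hand, so the coherent matched filter can be sketched without any linear-algebra library. The sketch below is an illustration under that scalar assumption; the function names and the test values are hypothetical. It evaluates the inner-product form (7.23) and cross-checks it against the difference of the two Gaussian exponents.

```python
def matched_filter_weight(r, rho, mu0, mu1):
    """Top entry w of R_aug^{-1} (mu1 - mu0)_aug for a scalar measurement.

    r   : Hermitian variance E|y - mu|^2 (real, positive)
    rho : complementary variance E(y - mu)^2 (complex, |rho| < r)
    """
    delta = mu1 - mu0
    det = r * r - abs(rho) ** 2  # determinant of [[r, rho], [rho*, r]]
    return (r * delta - rho * delta.conjugate()) / det


def log_likelihood_ratio(y, r, rho, mu0, mu1):
    """L = 2 Re{w^* (y - y0)} with y0 = (mu1 + mu0)/2, the scalar form of (7.23)."""
    w = matched_filter_weight(r, rho, mu0, mu1)
    y0 = 0.5 * (mu1 + mu0)
    return 2.0 * (w.conjugate() * (y - y0)).real


def log_likelihood_ratio_direct(y, r, rho, mu0, mu1):
    """Difference of the two augmented Gaussian exponents, used as a cross-check."""
    det = r * r - abs(rho) ** 2

    def quad(e):
        # e_aug^H R_aug^{-1} e_aug for e_aug = [e, e*]^T
        return (2.0 * r * abs(e) ** 2 - 2.0 * (rho.conjugate() * e * e).real) / det

    return -0.5 * quad(y - mu1) + 0.5 * quad(y - mu0)


# Hypothetical numbers: an improper scalar measurement with rho != 0.
y, r, rho = 0.3 + 0.8j, 1.0, 0.4 + 0.3j
mu0, mu1 = -0.5 + 0.0j, 1.0 + 0.5j
print(log_likelihood_ratio(y, r, rho, mu0, mu1))
print(log_likelihood_ratio_direct(y, r, rho, mu0, mu1))
```

Setting `rho = 0` reduces the weight to `(mu1 - mu0) / r`, the proper-case matching vector.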

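Figure 7.2 Geometrical interpretation of a coherent matched filter. The thick dashed line is the locus of measurements such that $[\underline{\mathbf{R}}^{-1}(\underline{\boldsymbol{\mu}}_1 - \underline{\boldsymbol{\mu}}_0)]^H (\underline{\mathbf{y}} - \underline{\mathbf{y}}_0) = 0$. (Marked points: $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_0$.)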

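The centered complex measurement $\mathbf{y} - \mathbf{y}_0$ is resolved onto the line $\underline{\mathbf{R}}^{-1}(\underline{\boldsymbol{\mu}}_1 - \underline{\boldsymbol{\mu}}_0)$ to produce the real log-likelihood ratio $L$, which is compared against a threshold. The geometry of the log-likelihood ratio is illustrated in Fig. 7.2: we choose lines parallel to the thick dashed line in order to get the desired probability of false alarm $P_F$. But how do we determine the threshold to get the desired $P_F$? We will see that the deflection, or output signal-to-noise ratio, plays a key role here.

Definition 7.4. The deflection for a likelihood-ratio test with log-likelihood ratio $L$ is defined as

$$d = \frac{[E_{\boldsymbol{\theta}_1}(L) - E_{\boldsymbol{\theta}_0}(L)]^2}{\mathrm{var}_{\boldsymbol{\theta}_0}(L)}, \tag{7.25}$$

where $\mathrm{var}_{\boldsymbol{\theta}_0}(L)$ denotes the variance of $L$ under $H_0$.

We will first show that $L$ given by (7.23) results in the deflection

$$d = (\underline{\boldsymbol{\mu}}_1 - \underline{\boldsymbol{\mu}}_0)^H \underline{\mathbf{R}}^{-1} (\underline{\boldsymbol{\mu}}_1 - \underline{\boldsymbol{\mu}}_0), \tag{7.26}$$

which is simply the output signal-to-noise ratio. Under either hypothesis, $\underline{\mathbf{y}} - \underline{\mathbf{y}}_0$ has mean $\pm\tfrac{1}{2}(\underline{\boldsymbol{\mu}}_1 - \underline{\boldsymbol{\mu}}_0)$ and augmented covariance $\underline{\mathbf{R}}$, so the mean of $L$ is $d/2$ under $H_1$ and $-d/2$ under $H_0$, and its variance is $d$ under both hypotheses. Thus, the log-likelihood ratio statistic $L$ is distributed as a real Gaussian random variable with mean value $\pm d/2$ and variance $d$, and $L$ has deflection, or output signal-to-noise ratio, $d$. In the general non-Gaussian case, the deflection is much easier to compute than a complete ROC curve. Now the false-alarm probability for the detector $\phi$ that compares $L$ against a threshold $\eta$ is

$$P_F = 1 - \int_{-\infty}^{\eta} \frac{1}{\sqrt{2\pi d}} \exp\left\{ -\frac{1}{2d} (L + d/2)^2 \right\} dL = 1 - \Phi\!\left( \frac{\eta + d/2}{\sqrt{d}} \right), \tag{7.27}$$

where $\Phi$ is the probability distribution function of a zero-mean, variance-one, real Gaussian random variable. A similar calculation shows the probability of detection to be

$$P_D = 1 - \int_{-\infty}^{\eta} \frac{1}{\sqrt{2\pi d}} \exp\left\{ -\frac{1}{2d} (L - d/2)^2 \right\} dL = 1 - \Phi\!\left( \frac{\eta - d/2}{\sqrt{d}} \right). \tag{7.28}$$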

Figure 7.3 The false-alarm probability PF (dark gray) and detection probability PD (light and dark gray) are tail probabilities of a normalized Gaussian. They are determined by the threshold η and the deflection d. (Axis markings: $(\eta - d/2)/\sqrt{d}$ and $(\eta + d/2)/\sqrt{d}$, a distance $\sqrt{d}$ apart.)

So, by choosing the threshold $\eta$, any design value of false-alarm probability may be achieved, and this choice of threshold determines the detection probability as well. In Fig. 7.3 the argument is made that it is the normalized zero-mean, variance-one Gaussian density that determines performance. The designer is given a string of length $\sqrt{d}$, whose head end is placed at location $(\eta + d/2)/\sqrt{d}$ to achieve the false-alarm probability $P_F$, and whose tail end falls at location $(\eta - d/2)/\sqrt{d}$, to determine the detection probability $P_D$. The length of the string is $\sqrt{d}$, which is the output voltage SNR, so obviously the larger the SNR the better.
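The threshold design just described amounts to inverting (7.27) for η and then evaluating (7.28). A minimal Python sketch follows, using the standard library's `statistics.NormalDist` for Φ and its inverse; the function names and the example values of α and d are illustrative choices, not taken from the text.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # zero-mean, variance-one real Gaussian


def threshold_for_false_alarm(alpha, d):
    """Threshold eta on L giving PF = alpha, obtained by inverting (7.27)."""
    return sqrt(d) * STD_NORMAL.inv_cdf(1.0 - alpha) - d / 2


def false_alarm_probability(eta, d):
    """PF from (7.27) for threshold eta and deflection d."""
    return 1.0 - STD_NORMAL.cdf((eta + d / 2) / sqrt(d))


def detection_probability(eta, d):
    """PD from (7.28) for threshold eta and deflection d."""
    return 1.0 - STD_NORMAL.cdf((eta - d / 2) / sqrt(d))


# Fix PF = 0.01 and see how PD grows with the deflection d.
for d in (1.0, 4.0, 9.0, 16.0):
    eta = threshold_for_false_alarm(0.01, d)
    print(d, eta, detection_probability(eta, d))
```

For a fixed false-alarm probability, the two normalized arguments differ by exactly √d, which is the "string length" interpretation given above.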

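7.4.2 Common mean and uncommon covariances

When the mean of an improper complex random vector is common to both the hypothesis and the alternative, but the covariances are uncommon, we might as well assume that the mean is zero, since the measurement can always be centered. Then the pdf for the measurement $\mathbf{y}\colon \Omega \to \mathbb{C}^n$ under hypothesis $H_i$ is

$$p_{\boldsymbol{\theta}_i}(\mathbf{y}) = \frac{1}{\pi^n \det^{1/2} \underline{\mathbf{R}}_i} \exp\left\{ -\tfrac{1}{2}\, \underline{\mathbf{y}}^H \underline{\mathbf{R}}_i^{-1} \underline{\mathbf{y}} \right\}, \tag{7.29}$$

where $\underline{\mathbf{R}}_i$ is the augmented covariance matrix under hypothesis $H_i$. For a measured $\mathbf{y}$ this is also the likelihood that the covariance model $\underline{\mathbf{R}}_i$ would have produced it. Up to an additive constant and a factor of 1/2, neither of which depends on the measurement, the log-likelihood ratio for comparing the likelihood of alternative $H_1$ with the likelihood of hypothesis $H_0$ is then the real widely quadratic form

$$L = \underline{\mathbf{y}}^H (\underline{\mathbf{R}}_0^{-1} - \underline{\mathbf{R}}_1^{-1}) \underline{\mathbf{y}} = \underline{\mathbf{y}}^H \underline{\mathbf{R}}_0^{-H/2} (\mathbf{I} - \underline{\mathbf{S}}^{-1}) \underline{\mathbf{R}}_0^{-1/2} \underline{\mathbf{y}}, \tag{7.30}$$

where $\underline{\mathbf{S}} = \underline{\mathbf{R}}_0^{-1/2} \underline{\mathbf{R}}_1 \underline{\mathbf{R}}_0^{-H/2}$ is the augmented signal-to-noise ratio matrix. The transformed measurement $\underline{\mathbf{R}}_0^{-1/2} \underline{\mathbf{y}}$ has augmented covariance matrix $\mathbf{I}$ under hypothesis $H_0$ and augmented covariance matrix $\underline{\mathbf{S}}$ under hypothesis $H_1$. Thus $L$ is the log-likelihood ratio for testing that the widely linearly transformed measurement, extracted as the top half of $\underline{\mathbf{R}}_0^{-1/2} \underline{\mathbf{y}}$, is proper and white versus the alternative that it is improper with augmented covariance $\underline{\mathbf{S}}$. When the SNR matrix is given its augmented EVD $\underline{\mathbf{S}} = \underline{\mathbf{U}}\, \underline{\boldsymbol{\Lambda}}\, \underline{\mathbf{U}}^H$ (cf. Result 3.1), the log-likelihood-ratio statistic may be written

$$L = \underline{\mathbf{y}}^H \underline{\mathbf{R}}_0^{-H/2}\, \underline{\mathbf{U}} (\mathbf{I} - \underline{\boldsymbol{\Lambda}}^{-1}) \underline{\mathbf{U}}^H \underline{\mathbf{R}}_0^{-1/2} \underline{\mathbf{y}}. \tag{7.31}$$

The augmented EVD of the SNR matrix $\underline{\mathbf{S}}$ is the solution to the generalized eigenvalue problem

$$\underline{\mathbf{R}}_1 (\underline{\mathbf{R}}_0^{-H/2}\, \underline{\mathbf{U}}) - \underline{\mathbf{R}}_0 (\underline{\mathbf{R}}_0^{-H/2}\, \underline{\mathbf{U}}) \underline{\boldsymbol{\Lambda}} = \mathbf{0}. \tag{7.32}$$

The characteristic function for the quadratic form $L$ may be derived, and inverted for its pdf. We do not pursue this question further, since the general result does not bring much insight into the performance of special cases.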

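7.4.3 Comparison between linear and widely linear detection

We would now like to compare the performance of linear and widely linear detection. For this, we consider a special case of testing for a common mean and uncommon covariances. We wish to detect whether the Gaussian signal $\mathbf{x}\colon \Omega \to \mathbb{C}^n$ is present in additive white Gaussian noise $\mathbf{n}\colon \Omega \to \mathbb{C}^n$. Under $H_0$, the measurement is $\mathbf{y} = \mathbf{n}$, and under $H_1$, it is $\mathbf{y} = \mathbf{x} + \mathbf{n}$. The signal $\mathbf{x}$ has augmented covariance matrix $\underline{\mathbf{R}}_{xx}$ and the noise is white and proper with augmented covariance matrix $\underline{\mathbf{R}}_{nn} = N_0 \mathbf{I}$. The signal and noise both have zero mean, and they are uncorrelated. Thus, the covariance matrix under $H_0$ is $\underline{\mathbf{R}}_0 = N_0 \mathbf{I}$, and under $H_1$ it is $\underline{\mathbf{R}}_1 = \underline{\mathbf{R}}_{xx} + N_0 \mathbf{I}$. Let $\underline{\mathbf{R}}_{xx} = \underline{\mathbf{U}}\, \underline{\boldsymbol{\Lambda}}\, \underline{\mathbf{U}}^H$ denote the augmented EVD of $\underline{\mathbf{R}}_{xx}$. Note that $\underline{\boldsymbol{\Lambda}}$ is now the augmented eigenvalue matrix of $\underline{\mathbf{R}}_{xx}$, whereas before in Section 7.4.2, $\underline{\boldsymbol{\Lambda}}$ denoted the augmented eigenvalue matrix of $\underline{\mathbf{S}}$. In the present case, $\underline{\mathbf{S}} = \underline{\mathbf{R}}_0^{-1/2} \underline{\mathbf{R}}_1 \underline{\mathbf{R}}_0^{-H/2} = \underline{\mathbf{R}}_1 / N_0 = \underline{\mathbf{R}}_{xx}/N_0 + \mathbf{I}$. Therefore, the augmented eigenvalue matrix of $\underline{\mathbf{S}}$ is now $\underline{\boldsymbol{\Lambda}}/N_0 + \mathbf{I}$, and the log-likelihood ratio (7.31) is

$$L = \frac{1}{N_0}\, \underline{\mathbf{y}}^H \underline{\mathbf{U}} \left[ \mathbf{I} - (\underline{\boldsymbol{\Lambda}}/N_0 + \mathbf{I})^{-1} \right] \underline{\mathbf{U}}^H \underline{\mathbf{y}} = \frac{1}{N_0}\, \underline{\mathbf{y}}^H \underline{\mathbf{U}}\, \underline{\boldsymbol{\Lambda}} (\underline{\boldsymbol{\Lambda}} + N_0 \mathbf{I})^{-1} \underline{\mathbf{U}}^H \underline{\mathbf{y}}. \tag{7.33}$$

It is worthwhile to point out that $L$ can also be written as

$$L = \frac{1}{N_0}\, \underline{\mathbf{y}}^H (\underline{\mathbf{R}}_{xy} \underline{\mathbf{R}}_{yy}^{-1} \underline{\mathbf{y}}) = \frac{1}{N_0}\, \underline{\mathbf{y}}^H \hat{\underline{\mathbf{x}}} = \frac{2}{N_0}\, \mathrm{Re}\{ \mathbf{y}^H \hat{\mathbf{x}} \}, \tag{7.34}$$

where $\underline{\mathbf{R}}_{xy} = \underline{\mathbf{R}}_{xx}$ and $\underline{\mathbf{R}}_{yy} = \underline{\mathbf{R}}_1 = \underline{\mathbf{R}}_{xx} + N_0 \mathbf{I}$, so that $\hat{\mathbf{x}}$ is the widely linear MMSE estimate of $\mathbf{x}$ from $\mathbf{y}$. Hence, the log-likelihood detector for the signal-plus-white-noise setup is really a cascade of a WLMMSE filter and a matched filter.

In the proper case, the augmented covariance matrix $\underline{\mathbf{R}}_{xx}$ is block-diagonal, $\underline{\mathbf{R}}_{xx} = \mathrm{Diag}(\mathbf{R}_{xx}, \mathbf{R}_{xx}^*)$, and the eigenvalues of $\underline{\mathbf{R}}_{xx}$ occur in pairs. Let $\mathbf{M} = \mathrm{Diag}(\mu_1, \ldots, \mu_n)$ be the eigenvalue matrix of the Hermitian covariance matrix $\mathbf{R}_{xx} = \mathbf{U}\mathbf{M}\mathbf{U}^H$. The log-likelihood ratio is now

$$L = \frac{2}{N_0}\, \mathrm{Re}\{ \mathbf{y}^H \mathbf{U} \mathbf{M} (\mathbf{M} + N_0 \mathbf{I})^{-1} \mathbf{U}^H \mathbf{y} \} = \frac{2}{N_0}\, \mathrm{Re}\{ \mathbf{y}^H \hat{\mathbf{x}} \}, \tag{7.35}$$

where $\hat{\mathbf{x}}$ is now the linear MMSE estimate of $\mathbf{x}$ from $\mathbf{y}$.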


So what performance advantage does widely linear processing offer over linear processing? Answering this question proceeds along the lines of Section 5.4.2. The performance criterion we choose here is the deflection $d$. Schreier et al. (2005) computed the deflection for the improper signal-plus-white-noise detection scenario:

$$d = \frac{\left( \displaystyle\sum_{i=1}^{2n} \frac{\lambda_i^2}{\lambda_i + N_0} \right)^{2}}{2 N_0^2 \displaystyle\sum_{i=1}^{2n} \left( \frac{\lambda_i}{\lambda_i + N_0} \right)^{2}}. \tag{7.36}$$

Here, $\{\lambda_i\}_{i=1}^{2n}$ are the eigenvalues of the augmented covariance matrix $\underline{\mathbf{R}}_{xx}$. In order to evaluate the maximum performance advantage of widely linear over linear processing, we need to maximize the deflection for fixed $\mathbf{R}_{xx}$ and varying $\tilde{\mathbf{R}}_{xx}$. Using Result A3.3, it can be shown that the deflection is a Schur-convex function of the eigenvalues $\{\lambda_i\}$. Therefore, maximizing the deflection requires maximum spread of the $\{\lambda_i\}$ in the sense of majorization. According to Result 3.7, this is achieved for

$$\lambda_i = 2\mu_i, \quad i = 1, \ldots, n, \qquad \text{and} \qquad \lambda_i = 0, \quad i = n+1, \ldots, 2n, \tag{7.37}$$

where $\{\mu_i\}_{i=1}^{n}$ are the eigenvalues of the Hermitian covariance matrix $\mathbf{R}_{xx}$. By plugging this into (7.36), we obtain the maximum deflection achieved by widely linear processing,

$$\max_{\tilde{\mathbf{R}}_{xx}} d = \frac{2 \left( \displaystyle\sum_{i=1}^{n} \frac{\mu_i^2}{2\mu_i + N_0} \right)^{2}}{N_0^2 \displaystyle\sum_{i=1}^{n} \left( \frac{\mu_i}{2\mu_i + N_0} \right)^{2}}. \tag{7.38}$$

On the other hand, linear processing implicitly assumes that $\tilde{\mathbf{R}}_{xx} = \mathbf{0}$, in which case the deflection is

$$d\,\Big|_{\tilde{\mathbf{R}}_{xx} = \mathbf{0}} = \frac{\left( \displaystyle\sum_{i=1}^{n} \frac{\mu_i^2}{\mu_i + N_0} \right)^{2}}{N_0^2 \displaystyle\sum_{i=1}^{n} \left( \frac{\mu_i}{\mu_i + N_0} \right)^{2}}. \tag{7.39}$$

Thus, the maximum performance advantage, as measured by deflection, is

$$\max_{N_0} \frac{\max_{\tilde{\mathbf{R}}_{xx}} d}{d\,\big|_{\tilde{\mathbf{R}}_{xx} = \mathbf{0}}} = 2, \tag{7.40}$$

which is attained for $N_0 \to 0$ or $N_0 \to \infty$. This bound was derived by Schreier et al. (2005). We note that, if (7.37) is not satisfied, the maximum performance advantage, which is then less than a factor of 2, occurs for some noise level $N_0 > 0$.
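The factor-of-2 bound can be explored numerically by evaluating the deflection formula (7.36) once with the maximally spread eigenvalues of (7.37) and once with the paired eigenvalues of the proper case. The following Python sketch does this; the function names and the eigenvalues `mus` are arbitrary illustrative choices.

```python
def deflection(eigs, n0):
    """Deflection (7.36) from the 2n eigenvalues of the augmented covariance."""
    num = sum(lam ** 2 / (lam + n0) for lam in eigs) ** 2
    den = 2.0 * n0 ** 2 * sum((lam / (lam + n0)) ** 2 for lam in eigs)
    return num / den


def deflection_gain(mus, n0):
    """Ratio of the maximum deflection (7.38) to the proper-case deflection (7.39)."""
    max_improper = [2.0 * mu for mu in mus] + [0.0] * len(mus)  # eigenvalues (7.37)
    proper = list(mus) + list(mus)                              # each mu_i occurs twice
    return deflection(max_improper, n0) / deflection(proper, n0)


# Hypothetical eigenvalues of the Hermitian covariance matrix.
mus = [3.0, 1.0, 0.25]
for n0 in (1e-6, 0.1, 1.0, 10.0, 1e6):
    print(n0, deflection_gain(mus, n0))
```

For these eigenvalues the ratio approaches 2 at both extremes of the noise level and stays below 2 in between, in line with (7.40).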


The factor of 2 is a very conservative performance advantage bound because it assumes the worst-case scenario of white additive noise. If the noise is colored, the performance difference between widely linear and linear processing can be much larger. One such example is the detection of a real message x in purely imaginary noise n. This is a trivial detection problem, but only if the widely linear operation Re y is admitted.

7.5 Composite hypothesis testing and the Karlin–Rubin theorem

So far we have tested a simple alternative $H_1$ versus a simple hypothesis $H_0$ and found the NP detector to be most powerful among all competitors whose false-alarm rates do not exceed the false-alarm rate of the NP detector. Now we would like to generalize the notion of most powerful to the notion of uniformly most powerful (UMP). That is, we would like to consider composite alternative and composite hypothesis and argue that an NP test is uniformly most powerful, which is to say most powerful against all competitors for every possible pair of parameters $(\boldsymbol{\theta}_0 \in \Theta_0, \boldsymbol{\theta}_1 \in \Theta_1)$. This is too much to ask, so we restrict ourselves to a scalar real-valued parameter $\theta$, and ask for a test that would be most powerful for every pair $(\theta_0 \in \Theta_0, \theta_1 \in \Theta_1)$. Moreover, we shall restrict the sets $\Theta_0$ and $\Theta_1$ to the form $\Theta_0 = \{\theta : \theta \le 0\}$ and $\Theta_1 = \{\theta : \theta > 0\}$, or $\Theta_0 = \{0\}$ and $\Theta_1 = \{\theta : \theta > 0\}$.

Definition 7.5. A detector $\phi$ is uniformly most powerful (UMP) for testing $H_1\colon \theta \in \Theta_1$ versus $H_0\colon \theta \in \Theta_0$ if its detection probability is greater at every value of $\theta \in \Theta_1$ than that of every competitor $\phi'$ whose false-alarm probability is no greater than that of $\phi$. That is, every competitor $\phi'$ with

$$\sup_{\theta \in \Theta_0} E_{\theta}[\phi'] \le \sup_{\theta \in \Theta_0} E_{\theta}[\phi] = \alpha \tag{7.41}$$

satisfies

$$E_{\theta}[\phi'] \le E_{\theta}[\phi], \quad \forall\, \theta \in \Theta_1. \tag{7.42}$$

Suppose that a scalar sufficient statistic $t$, such as the likelihood ratio itself, has been identified. Assume that the likelihood ratio for this statistic, namely $q_{\theta_1}(t)/q_{\theta_0}(t)$, is monotonically increasing in $t$ for all $\theta_1 > \theta_0$. Then the test function $\phi$ may be replaced by the threshold test

$$\phi(t) = \begin{cases} 1, & t > t_0, \\ \gamma, & t = t_0, \\ 0, & t < t_0, \end{cases} \tag{7.43}$$

with $(t_0, \gamma)$ chosen to satisfy the false-alarm probability. That is,

$$P_{\theta_0}[t > t_0] + \gamma\, P_{\theta_0}[t = t_0] = \alpha. \tag{7.44}$$

The Karlin–Rubin theorem, which we shall not prove, says the threshold test $\phi$ is uniformly most powerful for testing $\{\theta \le 0\}$ versus $\{\theta > 0\}$ or for testing $\{\theta = 0\}$ versus $\{\theta > 0\}$. Moreover, the false-alarm probability is computed at $\theta = 0$.

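Example 7.1. Let’s consider a variation on Section 7.4.1. The measurement is $\mathbf{y} = \boldsymbol{\mu}\theta + \mathbf{n}$, where the signal $\boldsymbol{\mu}$ is deterministic complex, $\theta$ is real, and the Gaussian noise $\mathbf{n}$ has zero mean and augmented covariance matrix $\underline{\mathbf{R}}_{nn}$. As usual we work with the augmented measurement $\underline{\mathbf{y}} = \underline{\boldsymbol{\mu}}\theta + \underline{\mathbf{n}}$. First, we whiten the augmented measurement to obtain $\underline{\mathbf{v}} = \underline{\boldsymbol{\kappa}}\theta + \underline{\mathbf{w}}$, where $\underline{\mathbf{v}} = \underline{\mathbf{R}}_{nn}^{-1/2} \underline{\mathbf{y}}$, $\underline{\boldsymbol{\kappa}} = \underline{\mathbf{R}}_{nn}^{-1/2} \underline{\boldsymbol{\mu}}$, and $\underline{\mathbf{w}} = \underline{\mathbf{R}}_{nn}^{-1/2} \underline{\mathbf{n}}$. Then $\mathbf{v}$ is proper and white, but the pdf may still be written in augmented form:

$$p_{\theta}(\mathbf{v}) = \frac{1}{\pi^n} \exp\left\{ -\tfrac{1}{2} (\underline{\mathbf{v}} - \underline{\boldsymbol{\kappa}}\theta)^H (\underline{\mathbf{v}} - \underline{\boldsymbol{\kappa}}\theta) \right\}. \tag{7.45}$$

Thus, a sufficient statistic is $t = \underline{\boldsymbol{\kappa}}^H \underline{\mathbf{v}} = 2\,\mathrm{Re}\{\boldsymbol{\kappa}^H \mathbf{v}\}$, which is a real Gaussian with mean value $\theta\, \underline{\boldsymbol{\kappa}}^H \underline{\boldsymbol{\kappa}} = 2\theta\, \mathrm{Re}\{\boldsymbol{\kappa}^H \boldsymbol{\kappa}\}$ and variance $\underline{\boldsymbol{\kappa}}^H \underline{\boldsymbol{\kappa}} = 2\,\mathrm{Re}\{\boldsymbol{\kappa}^H \boldsymbol{\kappa}\}$. It is easy to check that the likelihood ratio $l(t) = q_{\theta_1}(t)/q_{\theta_0}(t)$ increases monotonically in $t$ for all $\theta_1 > \theta_0$. Thus the Karlin–Rubin theorem applies and the test

$$\phi(t) = \begin{cases} 1, & t > t_0, \\ \gamma, & t = t_0, \\ 0, & t < t_0 \end{cases} \tag{7.46}$$

is uniformly most powerful for testing $\{\theta \le 0\}$ versus $\{\theta > 0\}$ or for testing $\{\theta = 0\}$ versus $\{\theta > 0\}$. The threshold may be set to achieve the desired false-alarm probability at $\theta = 0$. The probability of detection then depends on the scale of $\theta$:

$$P_F = 1 - \Phi\!\left( \frac{t_0}{\sqrt{d}} \right) \quad \text{and} \quad P_D = 1 - \Phi\!\left( \frac{t_0 - \theta d}{\sqrt{d}} \right), \tag{7.47}$$

where $d = \underline{\boldsymbol{\kappa}}^H \underline{\boldsymbol{\kappa}}$ is the variance of $t$. As before, $\Phi$ is the distribution function of a zero-mean, variance-one, real Gaussian random variable. With $\theta = 0$ under $H_0$, the deflection is $\theta^2 d = \theta^2 \underline{\boldsymbol{\kappa}}^H \underline{\boldsymbol{\kappa}} = \theta^2 \underline{\boldsymbol{\mu}}^H \underline{\mathbf{R}}_{nn}^{-1} \underline{\boldsymbol{\mu}}$. Figure 7.3 continues to apply.

In the proper case, the sufficient statistic is $t = \underline{\boldsymbol{\kappa}}^H \underline{\mathbf{v}} = 2\,\mathrm{Re}\{\boldsymbol{\mu}^H \mathbf{R}_{nn}^{-1} \mathbf{y}\}$. The term inside the real part operation is proper complex with real mean $\theta\, \boldsymbol{\mu}^H \mathbf{R}_{nn}^{-1} \boldsymbol{\mu}$ and variance $\boldsymbol{\mu}^H \mathbf{R}_{nn}^{-1} \boldsymbol{\mu}$. So twice the real part has mean $2\theta\, \boldsymbol{\mu}^H \mathbf{R}_{nn}^{-1} \boldsymbol{\mu}$ and variance $2 \boldsymbol{\mu}^H \mathbf{R}_{nn}^{-1} \boldsymbol{\mu}$. The resulting deflection is $2\theta^2 \boldsymbol{\mu}^H \mathbf{R}_{nn}^{-1} \boldsymbol{\mu}$.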

7.6 Invariance in hypothesis testing

There are many problems in detection theory where a channel introduces impairments to a measurement. Examples would be the addition of bias or the introduction of gain. Typically, even the most exquisite testing and calibration cannot ensure that these are, respectively, zero and one. Thus there are unknown bias and gain terms that cannot be cancelled out or equalized. Moreover, the existence of these terms in the measurement defeats our attempts to find NP tests that are most powerful or uniformly most powerful. The idea behind invariance in the theory of hypothesis testing is then to make an end-run by requiring that any detector we design be invariant with respect to these impairments. This is sensible. However, in order to develop a theory that captures the essence of this reasoning, we must be clear about what would constitute an invariant hypothesis-testing problem and its corresponding invariant test.⁴

The complex random measurement $\mathbf{y}$ is drawn from the pdf $p_{\boldsymbol{\theta}}(\mathbf{y})$. We are to test the alternative $H_1\colon \boldsymbol{\theta} \in \Theta_1$ versus the hypothesis $H_0\colon \boldsymbol{\theta} \in \Theta_0$. If $g$ is a transformation in the group $G$, where the group operation is composition, $g_1 \circ g_2(\mathbf{y}) = g_1(g_2(\mathbf{y}))$, then the transformed measurement $\mathbf{t} = g(\mathbf{y})$ will have pdf $q_{\boldsymbol{\theta}}(\mathbf{t})$. If this pdf is $q_{\boldsymbol{\theta}}(\mathbf{t}) = p_{\boldsymbol{\theta}'}(\mathbf{t})$, then we say the transformation group $G$ leaves the measurement distribution invariant, since only the parameterization of the probability law has changed, not the law itself. If the induced transformation on the parameter $\boldsymbol{\theta}$ leaves the sets $\Theta_1$ and $\Theta_0$ invariant, which is to say that $g(\Theta_i) = \Theta_i$, then the hypothesis-testing problem is said to be invariant-$G$.

Let us first recall the definition of a maximal invariant. A statistic $\mathbf{t}(\mathbf{y})$ is said to be a maximal invariant statistic if $\mathbf{t}(\mathbf{y}_2) = \mathbf{t}(\mathbf{y}_1)$ if and only if $\mathbf{y}_2 = g(\mathbf{y}_1)$ for some $g \in G$. In this way, $\mathbf{t}$ sorts measurements into orbits or equivalence classes where $\mathbf{t}$ is constant and on this orbit all measurements are within a transformation $g$ of each other. On this orbit any other invariant statistic will also be constant, making it a function of the maximal invariant statistic. It is useful to note that an invariant test is uniformly most powerful invariant (UMP-invariant) if it is more powerful than every other test that is invariant-$G$.

We illustrate the application of invariance in detection by solving a problem that has attracted a great deal of attention in the radar, sonar, and communications literature, namely matched subspace detection. This problem has previously been addressed for real-valued and proper complex-valued measurements. The solution we give extends the result to improper complex-valued measurements.⁵

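7.6.1 Matched subspace detector

Let $\mathbf{y}\colon \Omega \to \mathbb{C}^n$ denote a complex measurement of the form

$$\mathbf{y} = \mathbf{H}\boldsymbol{\theta} + \mathbf{n} \tag{7.48}$$

with $\mathbf{H} \in \mathbb{C}^{n \times p}$ and $\boldsymbol{\theta} \in \mathbb{C}^{p}$, $p < n$. We say that the signal component of the measurement is an element of the $p$-dimensional subspace $\langle \mathbf{H} \rangle$. The noise is taken, for the time being, to be proper with Hermitian covariance matrix $\mathbf{R}$. We wish to test the alternative $H_1\colon \boldsymbol{\theta} \ne \mathbf{0}$ versus the hypothesis $H_0\colon \boldsymbol{\theta} = \mathbf{0}$. But we note that $\boldsymbol{\theta} = \mathbf{0}$ if and only if $\boldsymbol{\theta}^H \mathbf{M} \boldsymbol{\theta} = 0$ for any positive definite Hermitian matrix $\mathbf{M} = \mathbf{F}^H \mathbf{F}$. So we might as well be testing $H_1\colon \boldsymbol{\theta}^H \mathbf{F}^H \mathbf{F} \boldsymbol{\theta} > 0$ versus $H_0\colon \boldsymbol{\theta}^H \mathbf{F}^H \mathbf{F} \boldsymbol{\theta} = 0$, for a matrix $\mathbf{F}$ of our choice. We may whiten the measurement with the nonsingular transformation $\mathbf{v} = \mathbf{R}^{-1/2} \mathbf{y}$ to return the measurement model

$$\mathbf{v} = \mathbf{G}\boldsymbol{\theta} + \mathbf{w}, \tag{7.49}$$

where $\mathbf{G} = \mathbf{R}^{-1/2} \mathbf{H}$, and $\mathbf{w} = \mathbf{R}^{-1/2} \mathbf{n}$ is proper and white noise. Without loss of generality, we may even replace this model with the model

$$\mathbf{v} = \mathbf{U}\boldsymbol{\phi} + \mathbf{w}, \tag{7.50}$$

where the matrix $\mathbf{U} \in \mathbb{C}^{n \times p}$ is a unitary basis for the subspace $\langle \mathbf{G} \rangle$, and the parameter $\boldsymbol{\phi} = \mathbf{F}\boldsymbol{\theta}$ is a reparameterization of $\boldsymbol{\theta}$. So the hypothesis test can be written equivalently as the alternative $H_1\colon \boldsymbol{\phi}^H \boldsymbol{\phi} > 0$ versus the hypothesis $H_0\colon \boldsymbol{\phi}^H \boldsymbol{\phi} = 0$.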


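At this point the pdf of the whitened measurement $\mathbf{v}$ may be expressed as

$$p_{\boldsymbol{\phi}}(\mathbf{v}) = \frac{1}{\pi^n} \exp\{ -(\mathbf{v} - \mathbf{U}\boldsymbol{\phi})^H (\mathbf{v} - \mathbf{U}\boldsymbol{\phi}) \}. \tag{7.51}$$

The statistic $\mathbf{t} = \mathbf{U}^H \mathbf{v}$ is sufficient for $\boldsymbol{\phi}$ and the statistic $\mathbf{t}^H \mathbf{t}$ may be written as $\mathbf{t}^H \mathbf{t} = \mathbf{v}^H \mathbf{U}\mathbf{U}^H \mathbf{v} = \mathbf{v}^H \mathbf{P}_U \mathbf{v}$. The projection onto the subspace $\langle \mathbf{U} \rangle$ is denoted $\mathbf{P}_U = \mathbf{U}\mathbf{U}^H$, and the projection onto the subspace orthogonal to $\langle \mathbf{U} \rangle$ is $\mathbf{P}_U^{\perp} = \mathbf{I} - \mathbf{U}\mathbf{U}^H$. The statistic $\mathbf{t}^H \mathbf{t}$ is invariant with respect to rotations $g(\mathbf{v}) = (\mathbf{U}\mathbf{Q}\mathbf{U}^H + \mathbf{P}_U^{\perp})\mathbf{v}$ of the whitened measurement $\mathbf{v}$ in the subspace $\langle \mathbf{U} \rangle$, where $\mathbf{Q} \in \mathbb{C}^{p \times p}$ is a rotation matrix. This rotation leaves the noise white and proper and rotates the signal $\mathbf{U}\boldsymbol{\phi}$ to the signal $\mathbf{U}\mathbf{Q}\boldsymbol{\phi}$, thus rotating the parameter $\boldsymbol{\phi}$ to the parameter $\boldsymbol{\phi}' = \mathbf{Q}\boldsymbol{\phi}$. So the pdf for $\mathbf{v}$ is invariant-$\mathbf{Q}$, with transformed parameter $\mathbf{Q}\boldsymbol{\phi}$. Moreover, $\boldsymbol{\phi}^H \mathbf{Q}^H \mathbf{Q} \boldsymbol{\phi} = \boldsymbol{\phi}^H \boldsymbol{\phi}$, leaving the hypothesis-testing problem invariant. In summary, the hypothesis-testing problem is invariant with respect to rotations of the measurement in the subspace $\langle \mathbf{U} \rangle$, and the matched subspace detector statistic

$$L = \mathbf{t}^H \mathbf{t} = \mathbf{v}^H \mathbf{P}_U \mathbf{v} \tag{7.52}$$

is invariant-$\mathbf{Q}$. If the complex measurements $\mathbf{y}$ are Gaussian, this detector statistic is the sum of squares of $p$ independent complex Gaussian random variables, each of variance 1, making it distributed as a $\chi^2_{2p}(\boldsymbol{\phi}^H \boldsymbol{\phi})$ random variable, which is to say a chi-squared random variable with $2p$ degrees of freedom and non-centrality parameter $\boldsymbol{\phi}^H \boldsymbol{\phi}$. This pdf has a monotonic likelihood ratio, so a threshold test of $L$ is a uniformly most powerful invariant test of $H_1$ versus $H_0$. The test statistic $L$ matches to the subspace $\langle \mathbf{U} \rangle$ and then computes its energy in this subspace, earning the name noncoherent matched subspace detector. This energy may be compared against a threshold because the scale of the additive noise is known a priori. Therefore large values of energy in the subspace $\langle \mathbf{U} \rangle$ are unlikely to be realized with noise only in the measurement.

Actually, rotations are not the largest set of transformations under which the statistic $L$ is invariant. It is invariant also with respect to translations of $\mathbf{v}$ in the subspace perpendicular to the subspace $\langle \mathbf{U} \rangle$. The sufficient statistic remains sufficient, the hypothesis problem remains invariant, and the quadratic form in the projection operator remains invariant. So the set of invariances for the noncoherent matched subspace detector is the cylindrical set illustrated in Fig. 7.4.

Figure 7.4 The cylinder is the set of transformations of the whitened measurement that leave the matched subspace detector invariant, and the energy in the subspace $\langle \mathbf{U} \rangle$ constant. (Labels: $\mathbf{v}$, $\mathbf{P}_U \mathbf{v}$, $\langle \mathbf{U} \rangle$.)

Example 7.2. For $p = 1$, the matrix $\mathbf{U}$ is a unit vector $\boldsymbol{\psi}$, and the measurement model is

$$\mathbf{v} = \boldsymbol{\psi}\phi + \mathbf{w}. \tag{7.53}$$

The noncoherent matched subspace detector is

$$L = |\boldsymbol{\psi}^H \mathbf{v}|^2, \tag{7.54}$$

which is invariant with respect to rotations of the whitened measurement in the subspace $\langle \boldsymbol{\psi} \rangle$ and to bias in the subspace orthogonal to $\langle \boldsymbol{\psi} \rangle$. A rotation of the measurement amounts to the transformation

$$g(\mathbf{v}) = (\boldsymbol{\psi} e^{j\beta} \boldsymbol{\psi}^H + \mathbf{P}_{\psi}^{\perp}) \mathbf{v} = \boldsymbol{\psi} e^{j\beta} \phi + \mathbf{w}', \tag{7.55}$$

where the rotated noise $\mathbf{w}'$ remains white and proper and the rotated signal is just a true phase rotation of the original signal. This is the natural transformation to be invariant to when nothing is known a priori about the amplitude and phase of the parameter $\phi$.

Let us now extend our study of this problem to the case in which the measurement noise $\mathbf{n}$ is improper with Hermitian covariance $\mathbf{R}$ and complementary covariance $\tilde{\mathbf{R}}$. To this end, we construct the augmented measurement $\underline{\mathbf{y}}$ and whiten it with the inverse square root of the augmented covariance matrix, $\underline{\mathbf{R}}^{-1/2}$. The resulting augmented measurement $\underline{\mathbf{v}} = \underline{\mathbf{R}}^{-1/2} \underline{\mathbf{y}}$ has identity augmented covariance, so $\mathbf{v}$ is white and proper, and a widely linear function of the measurement $\mathbf{y}$. However, by whitening the measurement we have turned the linear model $\mathbf{y} = \mathbf{H}\boldsymbol{\theta} + \mathbf{n}$ into the widely linear model

$$\underline{\mathbf{v}} = \underline{\mathbf{G}}\, \underline{\boldsymbol{\theta}} + \underline{\mathbf{w}}, \tag{7.56}$$

where

$$\underline{\mathbf{G}} = \underline{\mathbf{R}}^{-1/2} \begin{bmatrix} \mathbf{H} & \mathbf{0} \\ \mathbf{0} & \mathbf{H}^* \end{bmatrix} \tag{7.57}$$

and $\mathbf{w}$ is proper and white noise. This model is then replaced with

$$\underline{\mathbf{v}} = \underline{\mathbf{U}}\, \underline{\boldsymbol{\phi}} + \underline{\mathbf{w}}, \tag{7.58}$$

where the augmented matrix $\underline{\mathbf{U}}$ is a widely unitary basis for the widely linear (i.e., real) subspace $\langle \underline{\mathbf{G}} \rangle$, which satisfies $\underline{\mathbf{U}}^H \underline{\mathbf{U}} = \mathbf{I}$, and $\underline{\boldsymbol{\phi}}$ is a reparameterization of $\underline{\boldsymbol{\theta}}$. It is a straightforward adaptation of prior arguments to argue that the real test statistic

$$L = \tfrac{1}{2}\, \underline{\mathbf{v}}^H \mathbf{P}_{\underline{U}}\, \underline{\mathbf{v}} = \tfrac{1}{2}\, \underline{\mathbf{v}}^H \underline{\mathbf{U}}\, \underline{\mathbf{U}}^H \underline{\mathbf{v}} \tag{7.59}$$

is invariant with respect to rotations in the widely linear (i.e., real) subspace $\langle \underline{\mathbf{U}} \rangle$ and to bias in the widely linear (i.e., real) subspace orthogonal to $\langle \underline{\mathbf{U}} \rangle$. The statistic $L$ may also be written in terms of $\mathbf{v}$ and the components of

$$\underline{\mathbf{U}} = \begin{bmatrix} \mathbf{U}_1 & \mathbf{U}_2 \\ \mathbf{U}_2^* & \mathbf{U}_1^* \end{bmatrix} \tag{7.60}$$

as

$$L = \mathbf{v}^H (\mathbf{U}_1 \mathbf{U}_1^H + \mathbf{U}_2 \mathbf{U}_2^H) \mathbf{v} + \mathrm{Re}\{ \mathbf{v}^T (\mathbf{U}_1^* \mathbf{U}_2^H + \mathbf{U}_2^* \mathbf{U}_1^H) \mathbf{v} \}. \tag{7.61}$$

In the proper case, $\mathbf{U}_2 = \mathbf{0}$ and $\mathbf{U}_1 = \mathbf{U}$, so $L$ simplifies to (7.52). If the complex measurement $\mathbf{y}$ is Gaussian, the statistic $L$ is distributed as a $\chi^2_{2p}(\boldsymbol{\phi}^H \boldsymbol{\phi})$ random variable. This is the same distribution as in the proper case: the fact that $L$ is coded as a real-valued widely quadratic form does not change its distribution.

Figure 7.5 The double cone is the set of transformations of the whitened measurement that leave the CFAR matched subspace detector invariant, and leave the angle that the measurement makes with the subspace $\langle \mathbf{U} \rangle$ constant. (Labels: $\mathbf{v}$, $\mathbf{P}_U^{\perp} \mathbf{v}$, $\mathbf{P}_U \mathbf{v}$, $\langle \mathbf{U} \rangle$.)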
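The two invariances discussed in Example 7.2 can be checked numerically for the rank-one, proper case. The sketch below represents vectors as plain Python lists of complex numbers; the function names, the unit vector `psi`, and the measurement `v` are hypothetical choices made only for illustration.

```python
import cmath


def inner(a, b):
    """Hermitian inner product a^H b of two equal-length complex lists."""
    return sum(x.conjugate() * y for x, y in zip(a, b))


def msd_statistic(psi, v):
    """Noncoherent matched subspace statistic L = |psi^H v|^2 for unit-norm psi."""
    return abs(inner(psi, v)) ** 2


def rotate_in_subspace(psi, v, beta):
    """g(v) = (psi e^{j beta} psi^H + P_perp) v, a phase rotation inside <psi>."""
    c = inner(psi, v)  # coordinate of v along psi
    return [vi + (cmath.exp(1j * beta) - 1) * c * p for vi, p in zip(v, psi)]


def add_orthogonal_bias(psi, v, b):
    """Translate v by the component of b that is orthogonal to psi."""
    cb = inner(psi, b)
    return [vi + bi - cb * p for vi, bi, p in zip(v, b, psi)]


psi = [0.6, 0.8j]                 # unit-norm basis vector (0.36 + 0.64 = 1)
v = [1 + 2j, -0.5 + 0.3j]         # a whitened measurement
print(msd_statistic(psi, v))
print(msd_statistic(psi, rotate_in_subspace(psi, v, 1.3)))
print(msd_statistic(psi, add_orthogonal_bias(psi, v, [2 - 1j, 0.7 + 4j])))
```

All three printed values should coincide up to rounding, since the rotation changes only the phase of ψᴴv and the bias has no component along ψ.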

7.6.2 CFAR matched subspace detector

The arguments for a constant-false-alarm-rate (CFAR) matched subspace detector follow along lines very similar to the lines of argument for the matched subspace detector. The essential difference is this: the scale of the improper noise is unknown, so the Hermitian covariance and the complementary covariance are known only up to a nonnegative scale. That is, $\sigma^2 \mathbf{R}$ and $\sigma^2 \tilde{\mathbf{R}}$ are the Hermitian and complementary covariances, with $\mathbf{R}$ and $\tilde{\mathbf{R}}$ known, but $\sigma^2 > 0$ unknown. We ask for invariance with respect to one more transformation, namely scaling of the measurement by a nonzero constant, which would have the effect of scaling the variance of the noise and scaling the value of the subspace parameter $\boldsymbol{\phi}$, while leaving the hypothesis-testing problem invariant. For improper complex noise, the statistic

$$L = \frac{\underline{\mathbf{v}}^H \mathbf{P}_{\underline{U}}\, \underline{\mathbf{v}}}{\underline{\mathbf{v}}^H \mathbf{P}_{\underline{U}}^{\perp}\, \underline{\mathbf{v}}} \tag{7.62}$$

is invariant with respect to rotations in the widely linear (i.e., real) subspaces $\langle \underline{\mathbf{U}} \rangle$ and $\langle \underline{\mathbf{U}} \rangle^{\perp}$, and to scaling by a nonzero constant. The set of measurements $g(\mathbf{v})$ with respect to which the statistic $L$ is invariant is illustrated in Fig. 7.5. This invariant set demonstrates that it is the angle that a measurement makes with a subspace that matters, not its energy in that subspace. The statistic $L$ measures this angle. This is sensible when the scale of the noise is unknown. That is, $L$ normalizes the energy resolved in the subspace $\langle \underline{\mathbf{U}} \rangle$ with an estimate of noise power. Since $L$ is the ratio of two independent $\chi^2$ statistics, it is within a constant of an $F$-distributed random variable, with $2p$ and $2(n - p)$ degrees of freedom, and non-centrality parameter $\boldsymbol{\phi}^H \boldsymbol{\phi}$.
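The scale invariance of the ratio statistic can be illustrated with a small sketch. It is restricted, as an assumption made for brevity, to the proper case with a rank-one subspace spanned by a unit-norm vector `psi`; the function name and the numbers are hypothetical.

```python
def cfar_statistic(psi, v):
    """L = (v^H P v) / (v^H P_perp v) for the rank-one projector P = psi psi^H.

    psi must have unit norm; v must not lie entirely inside <psi>,
    otherwise the denominator is zero.
    """
    energy_in = abs(sum(p.conjugate() * x for p, x in zip(psi, v))) ** 2
    energy_total = sum(abs(x) ** 2 for x in v)
    return energy_in / (energy_total - energy_in)


psi = [1.0, 0.0]
v = [1.0 + 0j, 1.0 + 0j]                 # makes a 45-degree angle with <psi>
scaled = [(2 - 3j) * x for x in v]       # unknown complex gain applied to v
print(cfar_statistic(psi, v), cfar_statistic(psi, scaled))
```

Both values should agree, because numerator and denominator are scaled by the same factor |c|²; the energy statistic of the non-CFAR detector would instead grow by |c|².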

Notes

1 There is a comprehensive library of textbooks and research papers on detection theory, which we make no attempt to catalog. What we have reviewed in this chapter is just a sample of what seems relevant to detection in the improper Gaussian model, a topic that has not received much attention in the literature. Even regarding that, we have said nothing about extensions of detection theory to a theory of adaptive detection that aims to resolve hypotheses when the subspace is a priori unknown or the covariance matrix of the noise is unknown. The study of these problems, in the case of proper noise, began with the seminal work of Reed et al. (1974), Kelly (1986), and Robey et al. (1992), and has seen its most definitive conclusions in the work of Kraut and Scharf and their collaborators (see Kraut et al. (2001)) and Conte and his collaborators (see Conte et al. (2001)).

2 The Neyman–Pearson lemma, the Fisher–Neyman factorization theorem, and the Karlin–Rubin theorem are proved in most books on hypothesis testing and decision theory. Proofs are contained, for instance, in Ferguson (1967). These fundamental theorems are used in Chapter 4 of Scharf (1991) to derive a number of uniformly most powerful and uniformly most powerful invariant detectors for real-valued signals.

3 This result is originally due to Van Trees (see Van Trees (2001), where it is proved using a moment-matching argument).

4 The theory of invariance in hypothesis testing is covered comprehensively by Ferguson (1967) and applied to problems of bias and scale by Scharf (1991).

5 The results in this chapter for matched subspace detectors generalize the results of Scharf (1991) and Scharf and Friedlander (1994) from real-valued signals to complex-valued signals that may be proper or improper. Moreover, the adaptive versions of these matched subspace detectors, termed adaptive subspace detectors, may be derived, along the lines of papers by Kraut et al. (2001) and Conte et al. (2001). Adaptive subspace detectors use sample covariance matrices. These ideas can be generalized further to the matched direction detectors introduced by Besson et al. (2005). There is also a body of results addressed to the problem of testing for covariance structure. In this case the hypothesis is composite, consisting of covariance matrices of specified structure, and the alternative is composite, consisting of all Hermitian covariance matrices. A selection of these problems has been addressed in Section 4.5.

Part III

Complex random processes

8

Wide-sense stationary processes

The remaining chapters of the book deal with complex-valued random processes. In this chapter, we discuss wide-sense stationary (WSS) signals. In Chapter 9, we look at nonstationary signals, and in Chapter 10, we treat cyclostationary signals, which are an important subclass of nonstationary signals. Our discussion of WSS signals continues the preliminary exposition given in Section 2.6. WSS processes have shift-invariant second-order statistics, which leads to the definition of a time-invariant power spectral density (PSD) – an intuitively pleasing idea. For improper signals, the PSD needs to be complemented by the complementary power spectral density (C-PSD), which is generally complex-valued. In Section 8.1, we will see that WSS processes allow an easy characterization of all possible PSD/C-PSD pairs and also a spectral representation of the process itself. Section 8.2 discusses widely linear shift-invariant filtering, with an application to analytic and complex baseband signals. We also introduce the noncausal widely linear minimum mean-squared error, or Wiener, filter for estimating a message signal from a noisy measurement. In order to find the causal approximation of the Wiener filter, we need to adapt existing spectral factorization algorithms to the improper case. This is done in Section 8.3, where we build causal synthesis, analysis, and Wiener filters for improper WSS vector-valued time series. Section 8.4 introduces rotary-component and polarization analysis, which are widely used in a number of research areas, ranging from optics, geophysics, meteorology, and oceanography to radar. These techniques are usually applied to deterministic signals, but we present them in a more general stochastic framework. The idea is to represent a two-dimensional signal in the complex plane as a superposition of ellipses, which can be analyzed in terms of their shape and orientation. 
Each ellipse is the sum of a counterclockwise and a clockwise rotating phasor, called the rotary components. If there is complete coherence between the rotary components (i.e., they are linearly dependent), then the signal is completely polarized. The chapter is concluded with a brief exposition of higher-order statistics of N th-order stationary signals in Section 8.5, where we focus on higher-order moment spectra and the principal domains of analytic signals.

8.1 Spectral representation and power spectral density

Consider a zero-mean wide-sense stationary (WSS) continuous-time complex-valued random process $x(t) = u(t) + \mathrm{j}v(t)$, which is composed from the two real random processes $u(t)$ and $v(t)$ defined for all $t \in \mathbb{R}$. Let us first recall a few key definitions and results from Section 2.6. The process $x(t)$ is WSS if and only if both the covariance function $r_{xx}(\tau) = E[x(t+\tau)x^*(t)]$ and the complementary covariance function $\tilde r_{xx}(\tau) = E[x(t+\tau)x(t)]$ are independent of $t$. Equivalently, the auto- and cross-covariance functions of real and imaginary parts, $r_{uu}(\tau) = E[u(t+\tau)u(t)]$, $r_{vv}(\tau) = E[v(t+\tau)v(t)]$, and $r_{uv}(\tau) = E[u(t+\tau)v(t)]$, must all be independent of $t$. The process $x(t)$ is proper when $\tilde r_{xx}(\tau) = 0$ for all $\tau$, which means that $r_{uu}(\tau) \equiv r_{vv}(\tau)$ and $r_{uv}(\tau) \equiv -r_{uv}(-\tau)$, and thus $r_{uv}(0) = 0$.

The Fourier transform of $r_{xx}(\tau)$ is the power spectral density (PSD) $P_{xx}(f)$, and the Fourier transform of $\tilde r_{xx}(\tau)$ is the complementary power spectral density (C-PSD) $\tilde P_{xx}(f)$. Result 2.14 characterizes all possible pairs $(P_{xx}(f), \tilde P_{xx}(f))$. This result is easily obtained by considering the augmented signal $\underline{\mathbf{x}}(t) = [x(t), x^*(t)]^{\mathrm T}$ whose lag-$\tau$ covariance matrix

$$\underline{\mathbf{R}}_{xx}(\tau) = E[\underline{\mathbf{x}}(t+\tau)\underline{\mathbf{x}}^{\mathrm H}(t)] = \begin{bmatrix} r_{xx}(\tau) & \tilde r_{xx}(\tau) \\ \tilde r_{xx}^*(\tau) & r_{xx}^*(\tau) \end{bmatrix} \tag{8.1}$$

is the augmented covariance function of $x(t)$. The Fourier transform of $\underline{\mathbf{R}}_{xx}(\tau)$ is the augmented PSD matrix

$$\underline{\mathbf{P}}_{xx}(f) = \begin{bmatrix} P_{xx}(f) & \tilde P_{xx}(f) \\ \tilde P_{xx}^*(f) & P_{xx}(-f) \end{bmatrix}. \tag{8.2}$$

This matrix is a valid augmented PSD matrix if and only if it is positive semidefinite, which is equivalent to, for all $f$,

$$P_{xx}(f) \ge 0, \tag{8.3}$$

$$\tilde P_{xx}(f) = \tilde P_{xx}(-f), \tag{8.4}$$

$$|\tilde P_{xx}(f)|^2 \le P_{xx}(f)P_{xx}(-f). \tag{8.5}$$

Thus, for any pair of $L^1$-functions $(P_{xx}(f), \tilde P_{xx}(f))$ that satisfy these three conditions, there exists a WSS random process $x(t)$ with PSD $P_{xx}(f)$ and C-PSD $\tilde P_{xx}(f)$. Conversely, the PSD $P_{xx}(f)$ and C-PSD $\tilde P_{xx}(f)$ of any given WSS random process $x(t)$ satisfy the conditions (8.3)–(8.5). The same conditions hold for a discrete-time process $x[k] = u[k] + \mathrm{j}v[k]$.

The PSD and C-PSD of $x(t)$ are connected to the PSDs of real and imaginary parts, $P_{uu}(f)$ and $P_{vv}(f)$, and the cross-PSD between real and imaginary part, $P_{uv}(f)$, through the familiar real-to-complex transformation

$$\mathbf{T} = \begin{bmatrix} 1 & \mathrm{j} \\ 1 & -\mathrm{j} \end{bmatrix}, \qquad \mathbf{T}\mathbf{T}^{\mathrm H} = \mathbf{T}^{\mathrm H}\mathbf{T} = 2\mathbf{I}, \tag{8.6}$$

as

$$\underline{\mathbf{P}}_{xx}(f) = \mathbf{T} \begin{bmatrix} P_{uu}(f) & P_{uv}(f) \\ P_{uv}^*(f) & P_{vv}(f) \end{bmatrix} \mathbf{T}^{\mathrm H}, \tag{8.7}$$

from which we determine

$$P_{xx}(f) = P_{uu}(f) + P_{vv}(f) + 2\,\mathrm{Im}\,P_{uv}(f), \tag{8.8}$$

$$P_{xx}(-f) = P_{uu}(f) + P_{vv}(f) - 2\,\mathrm{Im}\,P_{uv}(f), \tag{8.9}$$

$$\tilde P_{xx}(f) = P_{uu}(f) - P_{vv}(f) + 2\mathrm{j}\,\mathrm{Re}\,P_{uv}(f). \tag{8.10}$$
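The relations (8.8)–(8.10) can be checked numerically at a single frequency. The following is a minimal sketch in Python with NumPy; the function name `augmented_psd` and the numerical values are assumptions chosen for illustration, not taken from the text.

```python
import numpy as np

# Real-to-complex transformation T from (8.6).
T = np.array([[1, 1j],
              [1, -1j]])

def augmented_psd(P_uu, P_vv, P_uv):
    """Return the 2x2 augmented PSD matrix T P_zz T^H of (8.7) at one frequency."""
    P_zz = np.array([[P_uu, P_uv],
                     [np.conj(P_uv), P_vv]])
    return T @ P_zz @ T.conj().T

# Hypothetical values at one frequency f (chosen so that P_zz is positive definite).
P_uu, P_vv, P_uv = 2.0, 1.0, 0.3 + 0.4j
P = augmented_psd(P_uu, P_vv, P_uv)

psd_pos = P[0, 0].real   # P_xx(f),  compare with (8.8)
psd_neg = P[1, 1].real   # P_xx(-f), compare with (8.9)
c_psd = P[0, 1]          # complementary PSD, compare with (8.10)
```

With these values the three quantities agree with the right-hand sides of (8.8)–(8.10), and the bound (8.5) holds because the Cartesian PSD matrix was chosen positive definite.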

The PSD and C-PSD provide a statistical description of the spectral properties of $x(t)$. But is there a spectral representation of the process $x(t)$ itself? The following result, whose proper version is due to Cramér, is affirmative.

Result 8.1. Provided that $r_{xx}(\tau)$ is continuous, a complex second-order random process $x(t)$ can be written as

$$x(t) = \int_{-\infty}^{\infty} e^{\mathrm{j}2\pi f t}\, d\xi(f), \tag{8.11}$$

where $\xi(f)$ is a spectral process with orthogonal increments $d\xi(f)$ whose second-order moments are $E|d\xi(f)|^2 = P_{xx}(f)\,df$ and $E[d\xi(f)\,d\xi(-f)] = \tilde P_{xx}(f)\,df$.

This result bears comment. The expression (8.11) almost looks like an inverse Fourier transform. In fact, if the derivative process $X(f) = d\xi(f)/df$ existed, then $X(f)$ would be the Fourier transform of $x(t)$. We would then be able to replace $d\xi(f)$ in (8.11) with $X(f)\,df$, turning (8.11) into an actual inverse Fourier transform. However, $X(f)$ can never be a second-order random process with finite second-order moment if $x(t)$ is WSS. This is why we have to resort to the orthogonal increments $d\xi(f)$. This issue is further clarified in Section 9.2.

The process $\xi(f)$ has orthogonal increments, which means that $E[d\xi(f_1)\,d\xi^*(f_2)] = 0$ and $E[d\xi(f_1)\,d\xi(-f_2)] = 0$ for $f_1 \ne f_2$. For $f_1 = f_2 = f$, $E|d\xi(f)|^2 = P_{xx}(f)\,df$ and $E[d\xi(f)\,d\xi(-f)] = \tilde P_{xx}(f)\,df$. The minus sign in the expression for $\tilde P_{xx}(f)$ is owed to the fact that the spectral process corresponding to $x^*(t)$ is $\xi^*(-f)$:

$$x^*(t) = \left[\int_{-\infty}^{\infty} e^{\mathrm{j}2\pi f t}\, d\xi(f)\right]^* = \int_{-\infty}^{\infty} e^{\mathrm{j}2\pi f t}\, d\xi^*(-f). \tag{8.12}$$

The process $x(t)$ can be decomposed into two orthogonal and WSS parts as $x(t) = x_c(t) + x_d(t)$, where $x_c(t)$ belongs to a continuous spectral process $\xi_c(f)$ and $x_d(t)$ to a purely discontinuous spectral process $\xi_d(f)$. The spectral representation of $x_d(t)$ reduces to a countable sum

$$x_d(t) = \int_{-\infty}^{\infty} e^{\mathrm{j}2\pi f t}\, d\xi_d(f) = \sum_n C_{f_n} e^{\mathrm{j}2\pi f_n t}, \tag{8.13}$$

where $\{f_n\}$ is the set of discontinuities of $\xi(f)$. The random variables

$$C_{f_n} = \lim_{\varepsilon \to 0^+} \left[\xi_d(f_n + \varepsilon) - \xi_d(f_n - \varepsilon)\right] \tag{8.14}$$

are orthogonal in the sense that $E[C_{f_n} C_{f_m}^*] = 0$ and $E[C_{f_n} C_{-f_m}] = 0$ for $n \ne m$. At frequency $f_n$, the process $x_d(t)$ has power $E|C_{f_n}|^2$ and complementary power $E[C_{f_n} C_{-f_n}]$, and thus unbounded power spectral density. The discontinuities $\{f_n\}$ correspond to periodic components of the correlation function $r_{xx}(\tau)$. The process $x_c(t)$ with continuous spectral process $\xi_c(f)$ has zero power at all frequencies (but nonzero power spectral density, unless $x_c(t) \equiv 0$).

Example 8.1. Consider the ellipse

$$x(t) = C_+ e^{\mathrm{j}2\pi f_0 t} + C_- e^{-\mathrm{j}2\pi f_0 t}, \tag{8.15}$$

where $C_+$ and $C_-$ are two complex random variables. The spectral process $\xi(f)$ is purely discontinuous:

$$\xi(f) = \begin{cases} 0, & f < -f_0, \\ C_-, & -f_0 \le f < f_0, \\ C_+ + C_-, & f_0 \le f. \end{cases} \tag{8.16}$$

In Example 1.5 we found that $x(t)$ is WSS if and only if $E[C_+ C_-^*] = 0$, $E[C_+^2] = 0$, and $E[C_-^2] = 0$, in which case the covariance function is

$$r_{xx}(\tau) = E|C_+|^2 e^{\mathrm{j}2\pi f_0 \tau} + E|C_-|^2 e^{-\mathrm{j}2\pi f_0 \tau} \tag{8.17}$$

and the complementary covariance function is

$$\tilde r_{xx}(\tau) = 2E[C_+ C_-]\cos(2\pi f_0 \tau). \tag{8.18}$$

The covariance and complementary covariance are periodic functions of $\tau$. The power of $x(t)$ at frequency $f_0$ is $E|C_+|^2$, and at $-f_0$ it is $E|C_-|^2$. The complementary power at $\pm f_0$ is $E[C_+ C_-]$. The PSD and C-PSD are thus unbounded line spectra:

$$P_{xx}(f) = E|C_+|^2\, \delta(f - f_0) + E|C_-|^2\, \delta(f + f_0), \tag{8.19}$$

$$\tilde P_{xx}(f) = E[C_+ C_-]\,[\delta(f - f_0) + \delta(f + f_0)]. \tag{8.20}$$

Aliasing in the spectral representation of a discrete-time process x[k] will be discussed in the context of nonstationary processes in Section 9.2.

8.2 Filtering

The output of a widely linear shift-invariant (WLSI) filter with impulse response $(h_1(t), h_2(t))$ and input $x(t)$ is

$$y(t) = (h_1 * x)(t) + (h_2 * x^*)(t) = \int_{-\infty}^{\infty} [h_1(t-\tau)x(\tau) + h_2(t-\tau)x^*(\tau)]\, d\tau. \tag{8.21}$$

If $x(t)$ and $y(t)$ have spectral representations with orthogonal increments $d\xi(f)$ and $d\upsilon(f)$, respectively, they are connected as

$$d\upsilon(f) = H_1(f)\, d\xi(f) + H_2(f)\, d\xi^*(-f). \tag{8.22}$$

This may be written in augmented notation as

$$\begin{bmatrix} d\upsilon(f) \\ d\upsilon^*(-f) \end{bmatrix} = \begin{bmatrix} H_1(f) & H_2(f) \\ H_2^*(-f) & H_1^*(-f) \end{bmatrix} \begin{bmatrix} d\xi(f) \\ d\xi^*(-f) \end{bmatrix}. \tag{8.23}$$

We call

$$\underline{\mathbf{H}}(f) = \begin{bmatrix} H_1(f) & H_2(f) \\ H_2^*(-f) & H_1^*(-f) \end{bmatrix} \tag{8.24}$$

the augmented frequency-response matrix of the WLSI filter. The relationship between the augmented PSD matrices of $x(t)$ and $y(t)$ is

$$\underline{\mathbf{P}}_{yy}(f) = \underline{\mathbf{H}}(f)\,\underline{\mathbf{P}}_{xx}(f)\,\underline{\mathbf{H}}^{\mathrm H}(f). \tag{8.25}$$

In the following, we will take a look at three important (W)LSI filters: the filter constructing the analytic signal using the Hilbert transform, the noncausal Wiener filter, and the causal Wiener filter.

8.2.1 Analytic and complex baseband signals

In Section 1.4, we discussed the role of analytic and equivalent complex baseband signals in communications. In this section, we will investigate properties of random analytic and complex baseband signals. Let $x(t) = u(t) + \mathrm{j}\hat u(t)$ denote the analytic signal constructed from the real signal $u(t)$ and its Hilbert transform $\hat u(t)$. As discussed in Section 1.4.2, a Hilbert transformer is a linear shift-invariant (LSI) filter with impulse response $h(t) = 1/(\pi t)$ and frequency response $H(f) = -\mathrm{j}\,\mathrm{sgn}(f)$. Producing the analytic signal $x(t)$ from the real signal $u(t)$ is therefore an LSI operation with augmented frequency-response matrix

$$\underline{\mathbf{H}}(f) = \begin{bmatrix} 1 + \mathrm{j}H(f) & 0 \\ 0 & 1 - \mathrm{j}H(f) \end{bmatrix} = \begin{bmatrix} 1 + \mathrm{j}H(f) & 0 \\ 0 & 1 - \mathrm{j}H^*(-f) \end{bmatrix}, \tag{8.26}$$

where the second equality is due to $H(f) = H^*(-f)$. The augmented PSD matrix of the real input signal $u(t)$ is

$$\underline{\mathbf{P}}_{uu}(f) = \begin{bmatrix} P_{uu}(f) & P_{uu}(f) \\ P_{uu}(f) & P_{uu}(f) \end{bmatrix}, \tag{8.27}$$

which is rank-deficient for all frequencies $f$. The augmented PSD matrix of the analytic signal $x(t)$ is

$$\underline{\mathbf{P}}_{xx}(f) = \underline{\mathbf{H}}(f)\,\underline{\mathbf{P}}_{uu}(f)\,\underline{\mathbf{H}}^{\mathrm H}(f). \tag{8.28}$$

With the mild assumption that $u(t)$ has no DC component, $P_{uu}(0) = 0$, we obtain the PSD of $x(t)$

$$P_{xx}(f) = P_{uu}(f) + H(f)P_{uu}(f)H^*(f) + 2\,\mathrm{Im}\{P_{uu}(f)H^*(f)\} = 2P_{uu}(f)(1 + \mathrm{sgn}(f)) = 4\Theta(f)P_{uu}(f), \tag{8.29}$$

where $\Theta(f)$ is the unit-step function. The C-PSD is

$$\tilde P_{xx}(f) = P_{uu}(f) - H(f)P_{uu}(f)H^*(f) + 2\mathrm{j}\,\mathrm{Re}\{P_{uu}(f)H^*(f)\} = 0. \tag{8.30}$$

So the analytic signal constructed from a WSS real signal has zero PSD for negative frequencies and four times the PSD of the real signal for positive frequencies, and it is proper. Propriety also follows from the bound (8.5) since

$$|\tilde P_{xx}(f)|^2 \le P_{xx}(f)P_{xx}(-f) = 0. \tag{8.31}$$
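The factor of four and the vanishing negative-frequency content can be observed on a single finite-length realization. The sketch below is an illustration under assumptions: it builds a discrete-time analytic signal with the FFT (a hypothetical helper `analytic_signal`) and compares unnormalized periodograms rather than true PSDs, so it mirrors (8.29) only at the sample level.

```python
import numpy as np

def analytic_signal(u):
    """Analytic signal x = u + j*H{u} for a real sequence u (FFT-based sketch)."""
    n = len(u)
    U = np.fft.fft(u)
    f = np.fft.fftfreq(n)
    H = -1j * np.sign(f)        # Hilbert transformer, with H(0) = 0
    return np.fft.ifft((1 + 1j * H) * U)

rng = np.random.default_rng(0)   # arbitrary seed
u = rng.standard_normal(255)     # odd length: no Nyquist bin to special-case
u -= u.mean()                    # no DC component, cf. P_uu(0) = 0

x = analytic_signal(u)
f = np.fft.fftfreq(len(u))
Pu = np.abs(np.fft.fft(u)) ** 2  # periodogram of u (unnormalized)
Px = np.abs(np.fft.fft(x)) ** 2  # periodogram of x (unnormalized)
```

For positive frequencies `Px` equals `4 * Pu`, for negative frequencies it vanishes up to rounding, and the real part of `x` reproduces `u`.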

There is also a close interplay between wide-sense stationarity and propriety for equivalent complex baseband signals. As in Section 1.4, let $p(t) = \mathrm{Re}\{x(t)e^{\mathrm{j}2\pi f_0 t}\}$ denote a real passband signal obtained by complex modulation of the complex baseband signal $x(t)$ with bandwidth less than $f_0$. It is easy to see that a complex-modulated signal $x(t)e^{\mathrm{j}2\pi f_0 t}$ is proper if and only if $x(t)$ is proper. If $x(t)e^{\mathrm{j}2\pi f_0 t}$ is the analytic signal obtained from $p(t)$, we find the following.

Result 8.2. A real passband signal $p(t)$ is WSS if and only if the equivalent complex baseband signal $x(t)$ is WSS and proper.

An alternative way to prove this is to express the covariance function of $p(t)$ as

$$r_{pp}(t, \tau) = \mathrm{Re}\{r_{xx}(t, \tau)e^{\mathrm{j}2\pi f_0 \tau}\} + \mathrm{Re}\{\tilde r_{xx}(t, \tau)e^{\mathrm{j}2\pi f_0 (2t + \tau)}\}. \tag{8.32}$$

If $r_{pp}(t, \tau)$ is to be independent of $t$, we need the covariance function $r_{xx}(t, \tau)$ to be independent of $t$ and the complementary covariance function $\tilde r_{xx}(t, \tau) \equiv 0$.

We note a subtle difference between analytic signals and complex baseband signals: while there are no improper WSS analytic signals, improper WSS complex baseband signals do exist. However, the real passband signal produced from an improper WSS complex baseband signal is cyclostationary rather than WSS. Result 8.2 is important for communications because passband thermal noise is modeled as WSS. Hence, its complex baseband representation is WSS and proper.

8.2.2 Noncausal Wiener filter

The celebrated Wiener filter produces a linear estimate $\hat x(t)$ of a message signal $x(t)$ from an observation (or measurement) $y(t)$ based on the PSD of $y(t)$, denoted $P_{yy}(f)$, and the cross-PSD between $x(t)$ and $y(t)$, denoted $P_{xy}(f)$. This assumes that $x(t)$ and $y(t)$ are jointly WSS. The estimate $\hat x(t)$ is optimal in the sense that it minimizes the mean-squared error $E|\hat x(t) - x(t)|^2$, which is independent of $t$ because $x(t)$ and $y(t)$ are jointly WSS. The frequency response of the noncausal Wiener filter is

$$H(f) = \frac{P_{xy}(f)}{P_{yy}(f)}. \tag{8.33}$$

If $y(t) = x(t) + n(t)$, with uncorrelated noise $n(t)$ of PSD $P_{nn}(f)$, then the frequency response is

$$H(f) = \frac{P_{xx}(f)}{P_{xx}(f) + P_{nn}(f)}. \tag{8.34}$$

This has an intuitively appealing interpretation: the filter attenuates frequency components where the noise is strong compared with the signal.

The extension of the linear Wiener filter to the widely linear Wiener filter for improper complex signals presents no particular difficulties. The augmented frequency-response matrix of the noncausal widely linear Wiener filter is

$$\underline{\mathbf{H}}(f) = \underline{\mathbf{P}}_{xy}(f)\,\underline{\mathbf{P}}_{yy}^{-1}(f). \tag{8.35}$$
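The matrix formula (8.35) is straightforward to evaluate numerically at a fixed frequency. The sketch below assumes the additive-noise model of Example 8.2, $y(t) = x(t) + n(t)$ with proper white noise of PSD $N_0$ uncorrelated with $x(t)$; the function name and the numerical values are hypothetical.

```python
import numpy as np

def wl_wiener_additive_noise(P_pos, P_neg, P_c, N0):
    """Augmented frequency response (8.35) for y = x + n with proper white noise.

    P_pos = P_xx(f), P_neg = P_xx(-f), P_c = complementary PSD at f, N0 = noise PSD.
    """
    P_xx = np.array([[P_pos, P_c],
                     [np.conj(P_c), P_neg]])
    P_yy = P_xx + N0 * np.eye(2)   # proper noise adds N0 on the diagonal only
    P_xy = P_xx                    # x and n are uncorrelated
    return P_xy @ np.linalg.inv(P_yy)

H = wl_wiener_additive_noise(P_pos=3.0, P_neg=2.0, P_c=1.0 + 0.5j, N0=0.5)
H1, H2 = H[0, 0], H[0, 1]          # first row holds H1(f) and H2(f)

# Proper special case: the complementary PSD is zero.
H_proper = wl_wiener_additive_noise(P_pos=3.0, P_neg=2.0, P_c=0.0, N0=0.5)
```

In the proper case the off-diagonal entry vanishes and the first diagonal entry reduces to the scalar Wiener filter (8.34) with $P_{nn}(f) = N_0$.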

Using the matrix-inversion lemma (A1.42) to invert the $2 \times 2$ matrix $\underline{\mathbf{P}}_{yy}(f)$, we can derive explicit formulae for $H_1(f)$ and $H_2(f)$, similarly to the vector case in Result 5.3.

Example 8.2. If $y(t) = x(t) + n(t)$, where $n(t)$ is proper, white noise, uncorrelated with $x(t)$, with PSD $P_{nn}(f) = N_0$, we obtain

$$H_1(f) = \frac{P_{xx}(f)(P_{xx}(-f) + N_0) - |\tilde P_{xx}(f)|^2}{(P_{xx}(f) + N_0)(P_{xx}(-f) + N_0) - |\tilde P_{xx}(f)|^2},$$

$$H_2(f) = \frac{\tilde P_{xx}(f) N_0}{(P_{xx}(f) + N_0)(P_{xx}(-f) + N_0) - |\tilde P_{xx}(f)|^2}.$$

It is easy to see that, for $\tilde P_{xx}(f) = 0$, the frequency response simplifies to

$$H_1(f) = \frac{P_{xx}(f)}{P_{xx}(f) + N_0} \quad \text{and} \quad H_2(f) = 0.$$

8.3 Causal Wiener filter

The Wiener filter derived above is not suitable for real-time applications because it is noncausal. In this section, we find the causal Wiener filter for an improper WSS vector-valued time series. This requires the spectral factorization of an augmented PSD matrix. We will see that it is straightforward to apply existing spectral factorization algorithms to the improper case, following Spurbeck and Schreier (2007).

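8.3.1 Spectral factorization

So far, we have dealt with scalar continuous-time processes. Consider now the zero-mean WSS vector-valued discrete-time process $\mathbf{x}[k] = \mathbf{u}[k] + \mathrm{j}\mathbf{v}[k]$ with matrix-valued covariance function

$$\mathbf{R}_{xx}[\kappa] = E\,\mathbf{x}[k + \kappa]\mathbf{x}^{\mathrm H}[k] \tag{8.36}$$

and matrix-valued complementary covariance function

$$\tilde{\mathbf{R}}_{xx}[\kappa] = E\,\mathbf{x}[k + \kappa]\mathbf{x}^{\mathrm T}[k]. \tag{8.37}$$

The augmented matrix covariance function is

$$\underline{\mathbf{R}}_{xx}[\kappa] = E\,\underline{\mathbf{x}}[k + \kappa]\underline{\mathbf{x}}^{\mathrm H}[k] \tag{8.38}$$

for $\underline{\mathbf{x}}[k] = [\mathbf{x}^{\mathrm T}[k], \mathbf{x}^{\mathrm H}[k]]^{\mathrm T}$. It is positive semidefinite for $\kappa = 0$ and $\underline{\mathbf{R}}_{xx}[\kappa] = \underline{\mathbf{R}}_{xx}^{\mathrm H}[-\kappa]$. The augmented PSD matrix $\underline{\mathbf{P}}_{xx}(\theta)$ is the discrete-time Fourier transform (DTFT) of $\underline{\mathbf{R}}_{xx}[\kappa]$. However, in order to spectrally factor $\underline{\mathbf{P}}_{xx}(\theta)$, we need to work with the z-transform

$$\underline{\mathbf{P}}_{xx}(z) = \begin{bmatrix} \mathbf{P}_{xx}(z) & \tilde{\mathbf{P}}_{xx}(z) \\ \tilde{\mathbf{P}}_{xx}^*(z^*) & \mathbf{P}_{xx}^*(z^*) \end{bmatrix} \tag{8.39}$$

instead. It satisfies the symmetry property $\underline{\mathbf{P}}_{xx}(z) = \underline{\mathbf{P}}_{xx}^{\mathrm H}(z^{-*})$, and thus $\mathbf{P}_{xx}(z) = \mathbf{P}_{xx}^{\mathrm H}(z^{-*})$ and $\tilde{\mathbf{P}}_{xx}(z) = \tilde{\mathbf{P}}_{xx}^{\mathrm T}(z^{-1})$. We indulge in a minor abuse of notation by writing $\underline{\mathbf{P}}_{xx}(\theta) = \underline{\mathbf{P}}_{xx}(z)|_{z = e^{\mathrm{j}\theta}}$. The augmented PSD matrix $\underline{\mathbf{P}}_{xx}(\theta)$ is positive semidefinite and Hermitian: $\underline{\mathbf{P}}_{xx}(\theta) = \underline{\mathbf{P}}_{xx}^{\mathrm H}(\theta)$. We wish to factor $\underline{\mathbf{P}}_{xx}(z)$ as

$$\underline{\mathbf{P}}_{xx}(z) = \underline{\mathbf{A}}_{xx}(z)\,\underline{\mathbf{A}}_{xx}^{\mathrm H}(z^{-*}), \tag{8.40}$$

where $\underline{\mathbf{A}}_{xx}(z)$ is minimum phase, meaning that both $\underline{\mathbf{A}}_{xx}(z)$ and $\underline{\mathbf{A}}_{xx}^{-1}(z)$ are causal and stable.¹ This means that both the coloring filter $\underline{\mathbf{A}}_{xx}(z)$ and the whitening filter $\underline{\mathbf{A}}_{xx}^{-1}(z)$ have all their zeros and poles inside the unit circle. Moreover, since $\underline{\mathbf{A}}_{xx}(z)$ must represent a WLSI filter, it needs to satisfy the pattern

$$\underline{\mathbf{A}}_{xx}(z) = \begin{bmatrix} \mathbf{A}_{xx}(z) & \tilde{\mathbf{A}}_{xx}(z) \\ \tilde{\mathbf{A}}_{xx}^*(z^*) & \mathbf{A}_{xx}^*(z^*) \end{bmatrix}. \tag{8.41}$$

We first determine $\mathbf{P}_{zz}(z)$ of the real vector time series $\mathbf{z}[k] = [\mathbf{u}^{\mathrm T}[k], \mathbf{v}^{\mathrm T}[k]]^{\mathrm T}$ as $\mathbf{P}_{zz}(z) = \tfrac{1}{4}\mathbf{T}^{\mathrm H}\underline{\mathbf{P}}_{xx}(z)\mathbf{T}$. We will assume that $\underline{\mathbf{P}}_{xx}(z)$ and thus also $\mathbf{P}_{zz}(z)$ are rational, which means that every element can be written as

$$[\mathbf{P}_{zz}]_{ij}(z) = \frac{b_{ij}(z)}{a_{ij}(z)}, \tag{8.42}$$

where the polynomials $b_{ij}(z)$ and $a_{ij}(z)$ are coprime with real coefficients. We further require that no $a_{ij}(z)$ have roots on the unit circle. Now we factor $\mathbf{P}_{zz}(z)$ into a matrix of polynomials $\boldsymbol{\beta}(z)$ divided by a single denominator polynomial $\alpha(z)$:

$$\mathbf{P}_{zz}(z) = \frac{1}{\alpha(z)}\boldsymbol{\beta}(z). \tag{8.43}$$

This factorization is achieved by finding the polynomial $\alpha(z)$ that is the least common multiple of all the denominator polynomials $\{a_{ij}(z)\}$. The polynomial $\alpha(z)$ may be iteratively computed using Euclid's algorithm for finding the greatest common divisor of two polynomials (see, e.g., Blahut (1985)). The elements of $\boldsymbol{\beta}(z)$ are then

$$\beta_{ij}(z) = b_{ij}(z)\frac{\alpha(z)}{a_{ij}(z)}. \tag{8.44}$$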

Now we factor the polynomial $\alpha(z)$ to obtain a minimum-phase polynomial $\alpha_0(z)$ such that $\alpha(z) = \alpha_0(z)\alpha_0(z^{-1})$. Next, we can utilize an efficient algorithm such as the one provided by Jezek and Kucera (1985) or Spurbeck and Mullis (1998) to factor the polynomial matrix $\boldsymbol{\beta}(z) = \boldsymbol{\beta}_0(z)\boldsymbol{\beta}_0^{\mathrm T}(z^{-1})$ with $\boldsymbol{\beta}_0(z)$ minimum phase. Note that $\alpha_0(z^{-1}) = \alpha_0^*(z^{-*})$ and $\boldsymbol{\beta}_0^{\mathrm T}(z^{-1}) = \boldsymbol{\beta}_0^{\mathrm H}(z^{-*})$ because the polynomials in $\alpha(z)$ and $\boldsymbol{\beta}(z)$ have real coefficients. Then, $\boldsymbol{\beta}_0(z)/\alpha_0(z)$ is a rational minimum-phase coloring filter for the real time series $\mathbf{z}[k]$. Thus,

$$\underline{\mathbf{A}}_{xx}(z) = \frac{1}{\sqrt{2}\,\alpha_0(z)}\mathbf{T}\boldsymbol{\beta}_0(z)\mathbf{T}^{\mathrm H} \tag{8.45}$$

is a rational minimum-phase coloring filter for the complex time series $\mathbf{x}[k]$. It is easy to verify that (8.40) holds for $\underline{\mathbf{A}}_{xx}(z)$ given by (8.45).
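The scalar step $\alpha(z) = \alpha_0(z)\alpha_0(z^{-1})$ can be illustrated with a simple root-selection approach. This is a minimal sketch under assumptions: the helper `min_phase_factor` is hypothetical, it handles only a scalar polynomial with real coefficients and no roots on the unit circle, and it is not the matrix algorithm of Jezek and Kucera (1985) or Spurbeck and Mullis (1998), which is still needed for $\boldsymbol{\beta}(z)$.

```python
import numpy as np

def min_phase_factor(c):
    """Factor a symmetric Laurent polynomial c(z) = a0(z) a0(1/z).

    c holds the real coefficients of z^n, ..., z^0, ..., z^-n (length 2n+1).
    Returns the coefficients of a0 in powers z^0, z^-1, ..., z^-n, with all
    roots inside the unit circle. Assumes no roots on the unit circle.
    """
    c = np.asarray(c, dtype=float)
    n = (len(c) - 1) // 2
    roots = np.roots(c)                      # roots come in pairs (r, 1/r)
    inside = roots[np.abs(roots) < 1.0]      # keep the minimum-phase half
    monic = np.real(np.poly(inside))         # 1 + a1 z^-1 + ... + an z^-n
    gain = np.sqrt(c[n] / np.sum(monic ** 2))
    return gain * monic

a0_true = np.array([1.0, -0.9, 0.2])         # roots at 0.5 and 0.4
c = np.convolve(a0_true, a0_true[::-1])      # coefficients of a0(z) a0(1/z)
a0 = min_phase_factor(c)
```

Selecting the roots inside the unit circle is what makes the factor minimum phase; the gain is fixed by matching the central coefficient of the product.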

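8.3.2 Causal synthesis, analysis, and Wiener filters

We can now causally synthesize a WSS vector-valued time series $\mathbf{x}[k]$ with desired PSD matrix $\mathbf{P}_{xx}(\theta)$ and C-PSD matrix $\tilde{\mathbf{P}}_{xx}(\theta)$ from a complex white vector sequence $\mathbf{w}[k]$. To do this, we factor the augmented PSD matrix as $\underline{\mathbf{P}}_{xx}(z) = \underline{\mathbf{A}}_{xx}(z)\underline{\mathbf{A}}_{xx}^{\mathrm H}(z^{-*})$ as described above. The z-transform of the synthesized vector time series $\mathbf{x}[k] \longleftrightarrow \mathbf{X}(z)$ is then

$$\mathbf{X}(z) = \mathbf{A}_{xx}(z)\mathbf{W}(z) + \tilde{\mathbf{A}}_{xx}(z)\mathbf{W}^*(z^*), \tag{8.46}$$

where the input $\mathbf{w}[k] \longleftrightarrow \mathbf{W}(z)$ is a complex white and proper vector time series with $\mathbf{P}_{ww}(\theta) = \mathbf{I}$ and $\tilde{\mathbf{P}}_{ww}(\theta) = \mathbf{0}$.

We can also causally whiten (analyze) a WSS vector time series $\mathbf{x}[k]$. In augmented notation, the z-transform of the filter output is

$$\underline{\mathbf{W}}(z) = \underline{\mathbf{A}}_{xx}^{-1}(z)\,\underline{\mathbf{X}}(z). \tag{8.47}$$

The augmented transfer function of the noncausal Wiener filter that provides the widely linear MMSE estimate $\hat{\mathbf{x}}[k]$ of $\mathbf{x}[k]$ from measurements $\mathbf{y}[k]$ is

$$\underline{\mathbf{H}}_{\mathrm{nc}}(z) = \underline{\mathbf{P}}_{xy}(z)\,\underline{\mathbf{P}}_{yy}^{-1}(z). \tag{8.48}$$

With the factorization $\underline{\mathbf{P}}_{yy}(z) = \underline{\mathbf{A}}_{yy}(z)\underline{\mathbf{A}}_{yy}^{\mathrm H}(z^{-*})$, the augmented transfer function of the causal Wiener filter is then

$$\underline{\mathbf{H}}_{\mathrm{c}}(z) = \left[\underline{\mathbf{P}}_{xy}(z)\,\underline{\mathbf{A}}_{yy}^{-\mathrm H}(z^{-*})\right]_+ \underline{\mathbf{A}}_{yy}^{-1}(z), \tag{8.49}$$

where $[\cdot]_+$ denotes the causal part. The z-transform of the estimate is given in augmented notation by $\hat{\underline{\mathbf{X}}}(z) = \underline{\mathbf{H}}_{\mathrm{c}}(z)\underline{\mathbf{Y}}(z)$.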

8.4 Rotary-component and polarization analysis

Rotary-component and polarization analysis are widely used in many research areas, including optics, geophysics, meteorology, oceanography, and radar.² The idea is to represent a two-dimensional signal in the complex plane as an integral of ellipses, and each ellipse as the sum of a counterclockwise and a clockwise rotating phasor, which are called the rotary components. From a mathematical point of view, each ellipse is the widely linear minimum mean-squared error approximation of the signal from its rotary components at a given frequency $f$. From a practical point of view, the analysis of ellipse properties (such as its shape and orientation) and properties of the rotary components (such as their coherence) carries information about the underlying physics.

Figure 8.1 The ellipse traced out by the monochromatic and deterministic signal $x(t)$. [Axes $u$ and $v$; labels $a$, $b$, $\psi$, $2A_u$, $2A_v$.]

8.4.1 Rotary components

In order to illustrate the fundamental ideas of rotary-component analysis, we begin with a monochromatic and deterministic bivariate signal.³ The most general such signal is described as

$$u(t) = A_u \cos(2\pi f_0 t + \theta_u), \qquad v(t) = A_v \cos(2\pi f_0 t + \theta_v), \tag{8.50}$$

where $A_u$ and $A_v$ are two given nonnegative amplitudes and $\theta_u$ and $\theta_v$ are two given phase offsets. As shown in Fig. 8.1, this signal moves periodically around an ellipse, which is inscribed in a rectangle whose sides are parallel to the $u$- and $v$-axes and have lengths $2A_u$ and $2A_v$. The description (8.50) is said to be a decomposition of this ellipse into its linearly polarized components $u(t)$ and $v(t)$. A complex representation equivalent to (8.50) is

$$x(t) = u(t) + \mathrm{j}v(t) = \underbrace{A_+ e^{\mathrm{j}\theta_+} e^{\mathrm{j}2\pi f_0 t}}_{x_+(t)} + \underbrace{A_- e^{-\mathrm{j}\theta_-} e^{-\mathrm{j}2\pi f_0 t}}_{x_-(t)}. \tag{8.51}$$

This is the sum of the counterclockwise (CCW) turning phasor $x_+(t)$ and the clockwise (CW) turning phasor $x_-(t)$, which are called the rotary components or circularly polarized components. The ellipse itself is called the polarization ellipse.

The real description (8.50) and complex description (8.51) can be related through their PSD matrices. Restricted to nonnegative frequencies $f$, the PSD matrix of the vector $\mathbf{z}(t) = [u(t), v(t)]^{\mathrm T}$ is

$$\mathbf{P}_{zz}(f) = \frac{1}{4}\begin{bmatrix} A_u^2 & A_u A_v e^{\mathrm{j}\theta} \\ A_u A_v e^{-\mathrm{j}\theta} & A_v^2 \end{bmatrix}\delta(f - f_0), \tag{8.52}$$

with $\theta = \theta_u - \theta_v$. Since $\mathbf{P}_{zz}(f) = \mathbf{P}_{zz}^*(-f)$, the PSDs of positive and negative frequencies are trivially related and the restriction to nonnegative frequencies $f$ presents no loss of information. The augmented PSD matrix for the complex description is

$$\underline{\mathbf{P}}_{xx}(f) = \begin{bmatrix} A_+^2 & A_+ A_- e^{\mathrm{j}2\psi} \\ A_+ A_- e^{-\mathrm{j}2\psi} & A_-^2 \end{bmatrix}\delta(f - f_0), \tag{8.53}$$

with $2\psi = \theta_+ - \theta_-$. The PSD matrices $\mathbf{P}_{zz}(f)$ and $\underline{\mathbf{P}}_{xx}(f)$ are connected through the real-to-complex transformation $\mathbf{T}$ as

$$\underline{\mathbf{P}}_{xx}(f) = \mathbf{T}\mathbf{P}_{zz}(f)\mathbf{T}^{\mathrm H}, \tag{8.54}$$

from which we obtain the following expressions:

$$A_+^2 + A_-^2 = \frac{A_u^2 + A_v^2}{2}, \tag{8.55}$$

$$A_+^2 - A_-^2 = A_u A_v \sin\theta, \tag{8.56}$$

$$\tan(2\psi) = \frac{\mathrm{Im}\{A_+ A_- e^{\mathrm{j}2\psi}\}}{\mathrm{Re}\{A_+ A_- e^{\mathrm{j}2\psi}\}} = \frac{2A_u A_v \cos\theta}{A_u^2 - A_v^2}. \tag{8.57}$$

Thus, the ellipse is parameterized by either $(A_u, A_v, \theta)$ or $(A_+, A_-, \psi)$. If the reader suspects that the latter will turn out to be more useful, then he or she is right. We have already determined in Section 1.3 that $\psi = (\theta_+ - \theta_-)/2$ is the angle between the major axis and the $u$-axis, simply referred to as the orientation of the ellipse. Moreover, we found that $2a = 2(A_+ + A_-)$ is the length of the major axis and $2b = 2|A_+ - A_-|$ the length of the minor axis. From this, we obtain the area of the ellipse

$$ab\pi = (A_+ + A_-)|A_+ - A_-|\pi = |A_+^2 - A_-^2|\pi. \tag{8.58}$$

The numerical eccentricity

$$\frac{\sqrt{a^2 - b^2}}{a} = \frac{2\sqrt{A_+ A_-}}{A_+ + A_-} \tag{8.59}$$

is seen to be the ratio of geometric mean to arithmetic mean of $A_+$ and $A_-$. In optics, it is common to introduce another angle $\chi$ through

$$\sin(2\chi) = \pm\frac{2ab}{a^2 + b^2} = \frac{A_+^2 - A_-^2}{A_+^2 + A_-^2}, \qquad -\frac{\pi}{4} \le \chi \le \frac{\pi}{4}, \tag{8.60}$$

where the expression in the middle is the ratio of geometric mean to arithmetic mean of $a^2$ and $b^2$, and its sign indicates the rotation direction of the ellipse: CCW for "+" and CW for "−." Because

$$\tan\chi = \pm\frac{b}{a} = \frac{A_+ - A_-}{A_+ + A_-} \tag{8.61}$$

we may say that $\chi$ characterizes the scale-invariant shape of the ellipse.
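The conversion between the two parameterizations can be sketched in a few lines. The complex amplitudes of the two phasors follow from writing each cosine in (8.50) as a sum of two exponentials; the function name and the numerical values below are hypothetical and serve only to check (8.55)–(8.57) and (8.60).

```python
import numpy as np

def rotary_from_cartesian(A_u, A_v, theta_u, theta_v):
    """Rotary and ellipse parameters of the monochromatic signal (8.50)."""
    # Complex amplitudes of the CCW and CW phasors in (8.51).
    c_plus = 0.5 * (A_u * np.exp(1j * theta_u) + 1j * A_v * np.exp(1j * theta_v))
    c_minus = 0.5 * (A_u * np.exp(-1j * theta_u) + 1j * A_v * np.exp(-1j * theta_v))
    A_plus, A_minus = abs(c_plus), abs(c_minus)
    psi = 0.5 * np.angle(c_plus * c_minus)   # orientation: 2*psi = theta_+ - theta_-
    a = A_plus + A_minus                     # semi-major axis
    b = abs(A_plus - A_minus)                # semi-minor axis
    chi = np.arctan2(A_plus - A_minus, A_plus + A_minus)  # shape angle, cf. (8.61)
    return A_plus, A_minus, psi, a, b, chi

A_u, A_v, theta_u, theta_v = 2.0, 1.0, 0.3, -0.7   # hypothetical values
A_plus, A_minus, psi, a, b, chi = rotary_from_cartesian(A_u, A_v, theta_u, theta_v)
theta = theta_u - theta_v
```

Here `chi` is positive because $\sin\theta > 0$ for the chosen phases, so this ellipse is traced out in the CCW direction.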

Figure 8.2 Linear and circular shape-polarization. [Panel labels: $\chi = 0$, $\chi = 0$, $\chi = -\pi/4$, $\chi = \pi/4$.]

- For $-\pi/4 \le \chi < 0$, $x(t)$ traces out the ellipse in the clockwise direction, and $x(t)$ is called CW polarized or right(-handed) polarized. If $\chi = -\pi/4$, then the ellipse degenerates into a circle because $A_+ = 0$, and $x(t)$ is called CW circularly polarized or right-circularly polarized.
- For $0 < \chi \le \pi/4$, $x(t)$ traces out the ellipse in the counterclockwise direction, and $x(t)$ is called CCW polarized or left(-handed) polarized. If $\chi = \pi/4$, the ellipse becomes a circle because $A_- = 0$, and $x(t)$ is CCW circularly polarized or left-circularly polarized.
- For $\chi = 0$, the ellipse degenerates into a line because $A_+ = A_-$, and $x(t)$ is called linearly polarized. The angle of the line is the ellipse orientation $\psi$.

Linear and circular polarization is illustrated in Fig. 8.2. Unfortunately, the term polarization has two quite different meanings in the literature. The type of polarization discussed here is "shape polarization," which characterizes the shape (elliptical, circular, or linear) and rotation direction of the ellipse. There is another type of polarization, to be discussed in Section 8.4.3, which is not related to shape polarization.

8.4.2 Rotary components of random signals

Now we will generalize the discussion of the previous subsection to a WSS random signal $x(t)$. Using the spectral representation from Result 8.1, we write $x(t)$ as

$$x(t) = \int_0^{\infty} \left[d\xi(f)e^{\mathrm{j}2\pi f t} + d\xi(-f)e^{-\mathrm{j}2\pi f t}\right]. \tag{8.62}$$

This creates the minor technical issue that $d\xi(0)$ is counted twice in the integral, which can be remedied by scaling $d\xi(0)$ by a factor of 1/2. Throughout this section, we will let $f$ denote a nonnegative frequency. The representation (8.62) is the superposition of ellipses, and one ellipse

$$\varepsilon_f(t) = \underbrace{d\xi(f)e^{\mathrm{j}2\pi f t}}_{\varepsilon_{f+}(t)} + \underbrace{d\xi(-f)e^{-\mathrm{j}2\pi f t}}_{\varepsilon_{f-}(t)} \tag{8.63}$$

can be constructed for a given frequency $f \ge 0$. We emphasize that the ellipse $\varepsilon_f(t)$ and the rotary components $\varepsilon_{f+}(t)$ and $\varepsilon_{f-}(t)$ are WSS random processes. The rotary components are each individually proper, but the ellipse $\varepsilon_f(t)$ is generally improper.

Interpretation of the random ellipse

There is a powerful interpretation of the ellipse $\varepsilon_f(t)$. Assume that we wanted to linearly estimate the signal $x(t)$ from the rotary components at a given frequency $f$, i.e., from $\varepsilon_{f+}(t) = d\xi(f)e^{\mathrm{j}2\pi f t}$ and $\varepsilon_{f-}(t) = d\xi(-f)e^{-\mathrm{j}2\pi f t}$. Because $d\xi(f)$ and $d\xi^*(-f)$ are generally correlated, meaning that $\tilde P_{xx}(f)\,df = E[d\xi(f)\,d\xi(-f)]$ may be nonzero, we have the foresight to use a widely linear estimator of the form

$$\hat x_f(t) = [H_1(f)\,d\xi(f) + H_2(f)\,d\xi^*(-f)]e^{\mathrm{j}2\pi f t} + [H_1(-f)\,d\xi(-f) + H_2(-f)\,d\xi^*(f)]e^{-\mathrm{j}2\pi f t}. \tag{8.64}$$

So, for a fixed frequency $f$, we need to determine $H_1(f)$, $H_1(-f)$, $H_2(f)$, and $H_2(-f)$ such that $E|\hat x_f(t) - x(t)|^2$ (which is independent of $t$ since $x(t)$ is WSS) is minimized. To solve this problem, we first note that the vector $[d\xi(f), d\xi^*(-f)]$ is uncorrelated with the vector $[d\xi(-f), d\xi^*(f)]$ because $x(t)$ is WSS. This allows us to determine the pair $(H_1(f), H_2(f))$ independently from the pair $(H_1(-f), H_2(-f))$. In other words, the WLMMSE estimate $\hat x_f(t)$ is the sum of

- the LMMSE estimate of $x(t)$ from $\varepsilon_{f+}(t) = d\xi(f)e^{\mathrm{j}2\pi f t}$ and $\varepsilon_{f-}^*(t) = d\xi^*(-f)e^{\mathrm{j}2\pi f t}$, and
- the LMMSE estimate of $x(t)$ from $\varepsilon_{f+}^*(t) = d\xi^*(f)e^{-\mathrm{j}2\pi f t}$ and $\varepsilon_{f-}(t) = d\xi(-f)e^{-\mathrm{j}2\pi f t}$.

The first is

$$\begin{bmatrix} E(x(t)\,d\xi^*(f)e^{-\mathrm{j}2\pi f t}) & E(x(t)\,d\xi(-f)e^{-\mathrm{j}2\pi f t}) \end{bmatrix} \begin{bmatrix} E|d\xi(f)|^2 & E(d\xi(f)\,d\xi(-f)) \\ E(d\xi^*(f)\,d\xi^*(-f)) & E|d\xi(-f)|^2 \end{bmatrix}^{-1} \begin{bmatrix} \varepsilon_{f+}(t) \\ \varepsilon_{f-}^*(t) \end{bmatrix} \tag{8.65}$$

$$= \begin{bmatrix} P_{xx}(f)\,df & \tilde P_{xx}(f)\,df \end{bmatrix} \begin{bmatrix} P_{xx}(f)\,df & \tilde P_{xx}(f)\,df \\ \tilde P_{xx}^*(f)\,df & P_{xx}(-f)\,df \end{bmatrix}^{-1} \begin{bmatrix} \varepsilon_{f+}(t) \\ \varepsilon_{f-}^*(t) \end{bmatrix}$$

$$= \frac{1}{P_{xx}(f)P_{xx}(-f) - |\tilde P_{xx}(f)|^2} \begin{bmatrix} P_{xx}(f)P_{xx}(-f) - |\tilde P_{xx}(f)|^2 \\ -P_{xx}(f)\tilde P_{xx}(f) + \tilde P_{xx}(f)P_{xx}(f) \end{bmatrix}^{\mathrm T} \begin{bmatrix} \varepsilon_{f+}(t) \\ \varepsilon_{f-}^*(t) \end{bmatrix} \tag{8.66}$$

$$= \varepsilon_{f+}(t). \tag{8.67}$$

In (8.65) we have used

$$E(x(t)\,d\xi^*(f)e^{-\mathrm{j}2\pi f t}) = E\left(\int_{-\infty}^{\infty} d\xi(\nu)e^{\mathrm{j}2\pi\nu t}\, d\xi^*(f)e^{-\mathrm{j}2\pi f t}\right) = E|d\xi(f)|^2 = P_{xx}(f)\,df, \tag{8.68}$$

$$E(x(t)\,d\xi(-f)e^{-\mathrm{j}2\pi f t}) = E\left(\int_{-\infty}^{\infty} d\xi(\nu)e^{\mathrm{j}2\pi\nu t}\, d\xi(-f)e^{-\mathrm{j}2\pi f t}\right) = E(d\xi(f)\,d\xi(-f)) = \tilde P_{xx}(f)\,df, \tag{8.69}$$

which hold because $\xi(f)$ has orthogonal increments, i.e., $E[d\xi(\nu)\,d\xi^*(f)] = 0$ and $E[d\xi(\nu)\,d\xi(-f)] = 0$ for $\nu \ne f$. A completely analogous computation to (8.66)–(8.67) shows that the LMMSE estimate of $x(t)$ from $\varepsilon_{f+}^*(t)$ and $\varepsilon_{f-}(t)$ is simply $\varepsilon_{f-}(t)$. All of this taken together means that

$$\hat x_f(t) = \varepsilon_f(t) = \varepsilon_{f+}(t) + \varepsilon_{f-}(t), \tag{8.70}$$

i.e., $H_1(f) = H_1(-f) = 1$ and $H_2(f) = H_2(-f) = 0$. This says that the ellipse $\varepsilon_f(t)$ in (8.63) is actually the best widely linear estimate of $x(t)$ that may be constructed from the rotary components at a given frequency $f$. Moreover, the widely linear estimate turns out to be strictly linear. That is, even if the rotary components $\varepsilon_{f+}(t)$ and $\varepsilon_{f-}(t)$ have nonzero complementary correlation, it cannot be exploited to build a widely linear estimator with smaller mean-squared error than a strictly linear estimator.
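The conclusion $H_1(f) = 1$, $H_2(f) = 0$ amounts to the observation that the cross-covariance row vector in (8.65) is the first row of the covariance matrix it multiplies. A minimal numerical sketch, with a hypothetical function name and hypothetical PSD values, makes this concrete:

```python
import numpy as np

def wl_weights(P_pos, P_neg, P_c):
    """Row vector of LMMSE weights in (8.65)-(8.67); the common factor df cancels."""
    cross = np.array([P_pos, P_c])                    # cross-covariances (8.68), (8.69)
    cov = np.array([[P_pos, P_c],
                    [np.conj(P_c), P_neg]])           # covariance of [eps_f+, eps_f-^*]
    return cross @ np.linalg.inv(cov)

# Hypothetical PSD values with |P_c|^2 < P_pos * P_neg, so the matrix is invertible.
w = wl_weights(P_pos=3.0, P_neg=2.0, P_c=1.0 + 0.5j)
```

The result is the vector $[1, 0]$ up to rounding, regardless of the value of the complementary PSD, as long as the covariance matrix is invertible.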

Statistical properties of the random ellipse

The second-order statistical properties of the WSS random ellipse $\varepsilon_f(t)$ can be approximately determined from the augmented PSD matrix $\underline{\mathbf{P}}_{xx}(f)$. From (8.58), we approximate the expected area of the random ellipse as

$$\pi\,|P_{xx}(f) - P_{xx}(-f)|\,df. \tag{8.71}$$

This is a crude approximation because it involves exchanging the order of the expectation operator and the absolute value. The expected rotation direction of the ellipse is given by the sign of $P_{xx}(f) - P_{xx}(-f)$, where "+" indicates CCW and "−" CW direction. The angle $\psi(f)$, obtained as half the phase of the C-PSD through

$$\tan(2\psi(f)) = \frac{\mathrm{Im}\,\tilde P_{xx}(f)}{\mathrm{Re}\,\tilde P_{xx}(f)}, \tag{8.72}$$

gives an approximation of the expected ellipse orientation. Similarly, $\chi(f)$ obtained through

$$\sin(2\chi(f)) = \frac{P_{xx}(f) - P_{xx}(-f)}{P_{xx}(f) + P_{xx}(-f)} \tag{8.73}$$

approximates the expected ellipse shape. The approximations (8.72) and (8.73) are derived from (8.57) and (8.60), respectively, by applying the expectation operator to the numerator and denominator separately, thus ignoring the fact that these are generally correlated, and exchanging the order of the expectation operator and the sin or tan function. However, it was shown by Rubin-Delanchy and Walden (2008) that, in the Gaussian case, $\psi(f)$ is the mean ellipse orientation, not just an approximation.

The properties of the ellipse all depend on the PSD and C-PSD, which we have defined as ensemble averages. In most practical applications, only one realization of a random process is available. Assuming that the random process is ergodic (which is often a reasonable assumption), all ensemble averages can be replaced by time averages.
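Given values (or estimates) of the PSD at $\pm f$ and of the C-PSD at $f$, the approximations (8.71)–(8.73) reduce to a few arithmetic operations. The sketch below is illustrative only: the function name and inputs are hypothetical, the area is returned per unit frequency (without the factor $df$), and the orientation is computed as half the phase of the C-PSD, which resolves the quadrant that $\tan(2\psi)$ alone leaves open.

```python
import numpy as np

def ellipse_properties(P_pos, P_neg, P_c):
    """Approximate expected ellipse properties at a frequency f >= 0.

    P_pos = P_xx(f), P_neg = P_xx(-f), P_c = complementary PSD at f.
    """
    area_density = np.pi * abs(P_pos - P_neg)                 # (8.71) without df
    direction = "CCW" if P_pos > P_neg else "CW" if P_pos < P_neg else "none"
    psi = 0.5 * np.angle(P_c)                                 # half the C-PSD phase, (8.72)
    chi = 0.5 * np.arcsin((P_pos - P_neg) / (P_pos + P_neg))  # (8.73)
    return area_density, direction, psi, chi

area_density, direction, psi, chi = ellipse_properties(3.0, 1.0, 1.0 + 1.0j)
```

In practice the three inputs would themselves be spectral estimates obtained by time averaging, as discussed above.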

8.4.3 Polarization and coherence

Definition 8.1. Let $x(t)$ be a WSS complex random process with augmented PSD matrix $\underline{\mathbf{P}}_{xx}(f)$, and let $\Lambda_1(f)$ and $\Lambda_2(f)$ denote the two real eigenvalues of $\underline{\mathbf{P}}_{xx}(f)$, assuming $\Lambda_1(f) \ge \Lambda_2(f)$. The degree of polarization of $x(t)$ at frequency $f$ is defined as

$$\Phi(f) = \frac{\Lambda_1(f) - \Lambda_2(f)}{\Lambda_1(f) + \Lambda_2(f)}, \tag{8.74}$$

which satisfies $0 \le \Phi(f) \le 1$. If $\Phi(f) = 0$, then $x(t)$ is called unpolarized at frequency $f$. If $\Phi(f) = 1$, then $x(t)$ is called completely polarized at frequency $f$.

We need to emphasize again that there are two meanings of the term polarization: one is this definition and the other is "shape polarization" as discussed in Section 8.4.1. A signal that is polarized in the sense of Definition 8.1 does not have to be polarized in the sense that the shape of the ellipse degenerates into a line or circle, which is called linear or circular polarization. Similarly, a linearly or circularly polarized signal does not have to be polarized in the sense of Definition 8.1 either.

The degree of polarization has an intricate relationship with and is in fact sometimes confused with the magnitude-squared coherence between the CCW and CW rotary components. "Coherence" is a synonymous term for correlation coefficient, but, in the frequency domain, "coherence" is much more commonly used than "correlation coefficient." In the terminology of Chapter 4, the coherence defined in the following would be called a reflectional correlation coefficient.

Definition 8.2. The complex coherence between CCW and CW rotary components is

$$\rho_{xx}(f) = \frac{E[d\xi(f)\,d\xi(-f)]}{\sqrt{E|d\xi(f)|^2\, E|d\xi(-f)|^2}} = \frac{\tilde P_{xx}(f)}{\sqrt{P_{xx}(f)P_{xx}(-f)}}. \tag{8.75}$$

If either $P_{xx}(f) = 0$ or $P_{xx}(-f) = 0$, then we define $\rho_{xx}(f) = 1$.

We usually consider only the magnitude-squared coherence $|\rho_{xx}(f)|^2$, which satisfies $|\rho_{xx}(f)|^2 \le 1$. If and only if $x(t)$ is completely polarized at $f$, then $\Lambda_2(f) = 0$, $|\tilde P_{xx}(f)|^2 = P_{xx}(f)P_{xx}(-f)$, and thus $\det \underline{\mathbf{P}}_{xx}(f) = 0$. We thus find the following connection between the degree of polarization and the magnitude-squared coherence.

Result 8.3. A WSS complex random process $x(t)$ is completely polarized at frequency $f$, i.e., $\Phi(f) = 1$, if and only if $|\rho_{xx}(f)|^2 = 1$. The corresponding ellipse $\varepsilon_f(t)$ has complete coherence between its CCW and CW rotary components. Provided that $P_{xx}(-f) \ne 0$, we have $d\xi(f) = c_f\, d\xi^*(-f)$ with probability 1, where the complex constant $c_f$ is

$$c_f = \frac{\tilde P_{xx}(f)}{P_{xx}(-f)} = \frac{P_{xx}(f)}{\tilde P_{xx}^*(f)}. \tag{8.76}$$

If $x(t)$ is completely polarized at $f$, all sample functions of the random ellipse $\varepsilon_f(t)$ turn in the same direction, which is either clockwise (if $-\pi/4 \le \chi(f) < 0$, $|c_f| < 1$ with $c_f$ given by (8.76)) or counterclockwise (if $0 < \chi(f) \le \pi/4$, $|c_f| > 1$). Monochromatic deterministic signals are always completely polarized, since their augmented PSD matrix is always rank-deficient. Moreover, all analytic signals are completely polarized for all $f$, as is evident from $\det \underline{\mathbf{P}}_{xx}(f) \equiv 0$ because $P_{xx}(f)P_{xx}(-f) \equiv 0$. However, $\rho_{xx}(f) = 0$, i.e., $\tilde P_{xx}(f) = 0$, is only a necessary but not sufficient condition for a signal to be unpolarized at $f$, which requires $\Lambda_1(f) = \Lambda_2(f)$. In fact, for $\rho_{xx}(f) = 0$ the degree of polarization takes on a particularly simple form:

$$\Phi(f) = \left|\frac{P_{xx}(f) - P_{xx}(-f)}{P_{xx}(f) + P_{xx}(-f)}\right| = |\sin(2\chi(f))|. \tag{8.77}$$

The additional condition of equal power in the rotary components, $P_{xx}(f) = P_{xx}(-f)$, for a signal to be unpolarized at $f$ can thus be visualized as a requirement that the random ellipse $\varepsilon_f(t)$ have no preferred rotation direction.

We can find an expression for the degree of polarization $\Phi(f)$ by using the formula (A1.13) for the two eigenvalues of the $2 \times 2$ augmented PSD matrix:

$$\Lambda_{1,2}(f) = \tfrac{1}{2}\,\mathrm{tr}\,\underline{\mathbf{P}}_{xx}(f) \pm \tfrac{1}{2}\sqrt{\mathrm{tr}^2\,\underline{\mathbf{P}}_{xx}(f) - 4\det\underline{\mathbf{P}}_{xx}(f)}. \tag{8.78}$$

By inserting this expression into (8.74), we then obtain

$$\Phi(f) = \sqrt{1 - \frac{4\det\underline{\mathbf{P}}_{xx}(f)}{\mathrm{tr}^2\,\underline{\mathbf{P}}_{xx}(f)}}, \tag{8.79}$$

without the need for an explicit eigenvalue decomposition of $\underline{\mathbf{P}}_{xx}(f)$. For a given frequency $f$, it is possible to decompose any WSS signal $x(t)$ into a WSS completely polarized signal $p(t)$ and a WSS unpolarized signal $n(t)$ (but this decomposition is generally different for different frequencies).⁴ This is achieved by expanding the eigenvalue matrix as

$$\boldsymbol{\Lambda}_{xx}(f) = \boldsymbol{\Lambda}_{pp}(f) + \boldsymbol{\Lambda}_{nn}(f) = \begin{bmatrix} \Lambda_1(f) - \Lambda_2(f) & 0 \\ 0 & 0 \end{bmatrix} + \begin{bmatrix} \Lambda_2(f) & 0 \\ 0 & \Lambda_2(f) \end{bmatrix}. \tag{8.80}$$
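The agreement between the eigenvalue definition (8.74) and the trace-determinant shortcut (8.79) can be checked numerically. The following sketch uses hypothetical function names and PSD values; it also evaluates the complex coherence (8.75) for the same values and a second, completely polarized example.

```python
import numpy as np

def degree_of_polarization(P_pos, P_neg, P_c):
    """Degree of polarization from trace and determinant, as in (8.79)."""
    tr = P_pos + P_neg
    det = P_pos * P_neg - abs(P_c) ** 2
    return np.sqrt(1.0 - 4.0 * det / tr ** 2)

def coherence(P_pos, P_neg, P_c):
    """Complex coherence (8.75) between the CCW and CW rotary components."""
    return P_c / np.sqrt(P_pos * P_neg)

P_pos, P_neg, P_c = 3.0, 1.0, 1.0 + 1.0j          # hypothetical values
P = np.array([[P_pos, P_c], [np.conj(P_c), P_neg]])
lam2, lam1 = np.linalg.eigvalsh(P)                 # eigenvalues in ascending order
phi_eig = (lam1 - lam2) / (lam1 + lam2)            # definition (8.74)
phi = degree_of_polarization(P_pos, P_neg, P_c)    # shortcut (8.79)
rho = coherence(P_pos, P_neg, P_c)

# Completely polarized example: |P_c|^2 = P_pos * P_neg, so det = 0.
phi_full = degree_of_polarization(4.0, 1.0, 2.0j)
```

The shortcut avoids the eigenvalue decomposition, which matters mainly when the computation is repeated over a dense frequency grid.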

The degree of polarization $\Phi(f)$ at frequency $f$ is thus the ratio of the polarized power to the total power:

$$\Phi(f) = \frac{P_{\mathrm{pol}}(f)}{P_{\mathrm{tot}}(f)}. \tag{8.81}$$

The degree of polarization and the magnitude-squared coherence have one more important property.

Result 8.4. The degree of polarization $\Phi(f)$ and the magnitude-squared coherence $|\rho_{xx}(f)|^2$ are both invariant under coordinate rotation, i.e., $x(t)$ and $x(t)e^{\mathrm{j}\phi}$ have the same $\Phi(f)$ and $|\rho_{xx}(f)|^2$ for a fixed real angle $\phi$.

This is easy to see. If $x(t)$ is replaced with $x(t)e^{\mathrm{j}\phi}$, the PSD and C-PSD transform as

$$P_{xx}(f) \longrightarrow P_{xx}(f), \qquad \tilde P_{xx}(f) \longrightarrow \tilde P_{xx}(f)e^{\mathrm{j}2\phi}.$$

That is, $P_{xx}(f)$ and $|\tilde P_{xx}(f)|$ are invariant under coordinate rotation. It follows that the ellipse shape $\chi(f)$, the degree of polarization $\Phi(f)$, and the magnitude-squared coherence $|\rho_{xx}(f)|^2$ are all invariant under coordinate rotation. On the other hand, the ellipse orientation $\psi(f)$ is covariant under coordinate rotation: since $2\psi(f)$ is the phase of the C-PSD, it transforms as $\psi(f) \longrightarrow \psi(f) + \phi$.

The invariance property of $|\rho_{xx}(f)|^2$ is another key advantage of the rotary component versus the Cartesian description. It is obviously also possible to define the Cartesian coherence

$$\rho_{uv}(f) = \frac{P_{uv}(f)}{\sqrt{P_{uu}(f)P_{vv}(f)}} \tag{8.82}$$

as a normalized cross-spectrum between the $u$- and $v$-components. However, the Cartesian magnitude-squared coherence $|\rho_{uv}(f)|^2$ does depend on the orientation of the $u$- and $v$-axes, thus limiting its usefulness.

8.4.4 Stokes and Jones vectors

George G. Stokes introduced a set of four parameters to characterize the state of polarization of partially polarized light. These parameters are closely related to the PSD and C-PSD.

Definition 8.3. The Stokes vector $\boldsymbol\Sigma(f) = [\Sigma_0(f), \Sigma_1(f), \Sigma_2(f), \Sigma_3(f)]^{\mathrm T}$ is defined as

$$\begin{aligned}
\Sigma_0(f) &= \mathrm{Ev}\, P_{xx}(f) = \tfrac12\,[P_{xx}(f) + P_{xx}(-f)],\\
\Sigma_1(f) &= \mathrm{Re}\, \tilde P_{xx}(f),\\
\Sigma_2(f) &= \mathrm{Im}\, \tilde P_{xx}(f),\\
\Sigma_3(f) &= \mathrm{Od}\, P_{xx}(f) = \tfrac12\,[P_{xx}(f) - P_{xx}(-f)],
\end{aligned}$$

where $\mathrm{Ev}\, P_{xx}(f)$ and $\mathrm{Od}\, P_{xx}(f)$ denote the even and odd parts of $P_{xx}(f)$.


The four real-valued parameters $(\Sigma_0(f), \Sigma_1(f), \Sigma_2(f), \Sigma_3(f))$ are an equivalent parameterization of $(P_{xx}(f), P_{xx}(-f), \tilde P_{xx}(f))$, where $P_{xx}(\pm f)$ is real and $\tilde P_{xx}(f)$ complex. Hence, the polarization-ellipse properties may be determined from $\boldsymbol\Sigma(f)$. The polarized power is

$$P_{\mathrm{pol}}(f) = \sqrt{\Sigma_1^2(f) + \Sigma_2^2(f) + \Sigma_3^2(f)} = \sqrt{|\tilde P_{xx}(f)|^2 + \mathrm{Od}^2 P_{xx}(f)}, \tag{8.83}$$

and the total power is

$$P_{\mathrm{tot}}(f) = \Sigma_0(f) = \mathrm{Ev}\, P_{xx}(f). \tag{8.84}$$

The degree of polarization (8.79) can thus be expressed as

$$\Phi(f) = \frac{\sqrt{|\tilde P_{xx}(f)|^2 + \mathrm{Od}^2 P_{xx}(f)}}{\mathrm{Ev}\, P_{xx}(f)}. \tag{8.85}$$

An unpolarized signal has $\Sigma_1(f) = \Sigma_2(f) = \Sigma_3(f) = 0$, and a completely polarized signal has $\Sigma_1^2(f) + \Sigma_2^2(f) + \Sigma_3^2(f) = \Sigma_0^2(f)$. Thus, three real-valued parameters suffice to characterize a completely polarized signal. Which three parameters to choose is, of course, somewhat arbitrary. The most common way uses $(P_{uu}(f), P_{vv}(f), \arg P_{uv}(f))$, which are combined in the Jones vector

$$\mathbf j(f) = \begin{bmatrix} j_1(f) \\ j_2(f) \end{bmatrix} = \begin{bmatrix} P_{uu}(f) \\ P_{vv}(f)\,e^{j \arg P_{uv}(f)} \end{bmatrix}. \tag{8.86}$$

We note that only the angle of the cross-PSD $P_{uv}(f)$ between the real part $u(t)$ and the imaginary part $v(t)$ is specified. The magnitude $|P_{uv}(f)|$ is determined by the assumption that the signal is completely polarized, which means that the determinant of the PSD matrix $\mathbf P_{zz}(f)$ is zero:

$$\det \mathbf P_{zz}(f) = \det \begin{bmatrix} P_{uu}(f) & P_{uv}(f) \\ P_{uv}^*(f) & P_{vv}(f) \end{bmatrix} = 0 \iff |P_{uv}(f)|^2 = P_{uu}(f)P_{vv}(f). \tag{8.87}$$

The reason why the Jones vector uses the Cartesian coordinate system rather than the rotary components is simply historical. It would be just as easy (and maybe more insightful) to instead use the three parameters $(P_{xx}(f), P_{xx}(-f), \arg \tilde P_{xx}(f))$, and fix the magnitude of the C-PSD as $|\tilde P_{xx}(f)|^2 = P_{xx}(f)P_{xx}(-f)$. On the other hand, we may also express the Stokes vector in the Cartesian coordinate system as $\Sigma_0(f) = P_{uu}(f) + P_{vv}(f)$, $\Sigma_1(f) = P_{uu}(f) - P_{vv}(f)$, $\Sigma_2(f) = 2\,\mathrm{Re}\, P_{uv}(f)$, and $\Sigma_3(f) = 2\,\mathrm{Im}\, P_{uv}(f)$. This allows us to determine equivalent Jones and Stokes vectors for fully polarized signals. Remember that the Jones vectors cannot describe partially polarized or unpolarized signals.

The effect of optical elements on Stokes vectors can be described by multiplication from the left with a real-valued 4 × 4 matrix. This is called the Mueller calculus. For completely polarized light, there exists an analogous calculus for manipulation of Jones vectors, which uses complex-valued 2 × 2 matrices. This is called the Jones calculus.⁵


Example 8.3. The following table lists some states of polarization and corresponding Jones and Stokes vectors:

Polarization                  Jones vector     Stokes vector
Horizontal linear             [1, 0]           [1, 1, 0, 0]
Vertical linear               [0, 1]           [1, −1, 0, 0]
Linear at 45°                 [1/2, 1/2]       [1, 0, 1, 0]
Right-hand (CW) circular      [1/2, −j/2]      [1, 0, 0, −1]
Left-hand (CCW) circular      [1/2, j/2]       [1, 0, 0, 1]

The effects of two exemplary optical elements on Jones and Stokes vectors are described by pre-multiplying the vectors with the following matrices:

Optical element: Left-circular (CCW) polarizer
Jones matrix: $\frac12 \begin{bmatrix} 1 & -j \\ j & 1 \end{bmatrix}$
Mueller matrix: $\frac12 \begin{bmatrix} 1 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 \end{bmatrix}$

Optical element: Horizontal linear polarizer
Jones matrix: $\begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}$
Mueller matrix: $\frac12 \begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}$



In summary, we now have the following alternative characterizations of the statistical properties of the ellipse $\varepsilon_f(t)$:

• the PSD and C-PSD: $(P_{xx}(f), P_{xx}(-f), \tilde P_{xx}(f))$
• the PSDs and cross-PSD of the Cartesian coordinates: $(P_{uu}(f), P_{vv}(f), P_{uv}(f))$
• the PSD, the ellipse orientation (i.e., half the phase of the C-PSD), and the magnitude-squared coherence: $(P_{xx}(f), P_{xx}(-f), \psi(f), |\tilde\rho_{xx}(f)|^2)$
• the Stokes vector $\boldsymbol\Sigma(f)$
• the Jones vector $\mathbf j(f)$, which assumes a completely polarized signal: $|\tilde\rho_{xx}(f)|^2 = 1$

All of these descriptions are equivalent because each of them can be derived from any other description.
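The conversions between these characterizations are short enough to write out. The sketch below assumes NumPy and uses hypothetical function names; it builds the Stokes vector from the Cartesian quantities, evaluates the degree of polarization as in (8.85), and converts a Jones vector of the form (8.86) by fixing $|P_{uv}(f)|$ through the complete-polarization condition (8.87).

```python
import numpy as np

def stokes_from_cartesian(p_uu, p_vv, p_uv):
    """Stokes vector at one frequency from P_uu, P_vv (real) and P_uv (complex)."""
    return np.array([p_uu + p_vv,
                     p_uu - p_vv,
                     2.0 * np.real(p_uv),
                     2.0 * np.imag(p_uv)])

def degree_of_polarization(stokes):
    """Ratio of polarized power to total power, as in (8.85)."""
    s0, s1, s2, s3 = stokes
    return np.sqrt(s1**2 + s2**2 + s3**2) / s0

def stokes_from_jones(jones):
    """Jones vector [P_uu, P_vv * exp(j arg P_uv)] -> Stokes vector.

    Assumes complete polarization, so |P_uv|^2 = P_uu * P_vv as in (8.87).
    """
    p_uu = np.real(jones[0])
    p_vv = np.abs(jones[1])
    p_uv = np.sqrt(p_uu * p_vv) * np.exp(1j * np.angle(jones[1]))
    return stokes_from_cartesian(p_uu, p_vv, p_uv)

# Left-hand circular state from Example 8.3.
s_left = stokes_from_jones(np.array([0.5, 0.5j]))

# A partially polarized case: equal powers, weakly correlated components.
s_partial = stokes_from_cartesian(1.0, 1.0, 0.5)
phi_partial = degree_of_polarization(s_partial)
```

For the partially polarized case the Stokes vector is [2, 0, 1, 0], so the degree of polarization is 1/2; a Jones vector could not represent this state.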

8.4.5 Joint analysis of two signals

Now consider the joint analysis of two complex jointly WSS signals $x(t)$ and $y(t)$. Individually, they are described by the augmented PSD matrices $\underline{\mathbf P}_{xx}(f)$ and $\underline{\mathbf P}_{yy}(f)$, or any of the equivalent characterizations discussed at the end of the previous subsection.


Their joint properties are described by the augmented cross-PSD matrix

$$\underline{\mathbf P}_{xy}(f) = \begin{bmatrix} P_{xy}(f) & \tilde P_{xy}(f) \\ \tilde P_{xy}^*(-f) & P_{xy}^*(-f) \end{bmatrix}, \tag{8.88}$$

which is the Fourier transform of

$$\underline{\mathbf R}_{xy}(\tau) = E[\underline{\mathbf x}(t+\tau)\,\underline{\mathbf y}^{\mathrm H}(t)] = \begin{bmatrix} r_{xy}(\tau) & \tilde r_{xy}(\tau) \\ \tilde r_{xy}^*(\tau) & r_{xy}^*(\tau) \end{bmatrix}. \tag{8.89}$$

Denoting the spectral process corresponding to $y(t)$ by $\upsilon(f)$, we can express the cross-PSD and cross-C-PSD as

$$P_{xy}(f)\,df = E[d\xi(f)\,d\upsilon^*(f)], \tag{8.90}$$

$$\tilde P_{xy}(f)\,df = E[d\xi(f)\,d\upsilon(-f)]. \tag{8.91}$$

Instead of using the four complex-valued cross-(C-)PSDs $P_{xy}(f)$, $P_{xy}(-f)$, $\tilde P_{xy}(f)$, and $\tilde P_{xy}(-f)$ directly, we can also define the magnitude-squared coherences

$$|\rho_{xy}(\pm f)|^2 = \frac{|P_{xy}(\pm f)|^2}{P_{xx}(\pm f)\,P_{yy}(\pm f)}, \tag{8.92}$$

$$|\tilde\rho_{xy}(\pm f)|^2 = \frac{|\tilde P_{xy}(\pm f)|^2}{P_{xx}(\pm f)\,P_{yy}(\mp f)}, \tag{8.93}$$

which are rotational and reflectional correlation coefficients, respectively. Let $\psi_{xy}(\pm f)$ denote the phase of $P_{xy}(\pm f)$, and $\tilde\psi_{xy}(\pm f)$ the phase of $\tilde P_{xy}(\pm f)$. Then the joint analysis of $x(t)$ and $y(t)$ is performed using the quantities

• $(P_{xx}(\pm f), |\tilde\rho_{xx}(f)|^2, \psi_{xx}(f))$ to describe the individual properties of $x(t)$,
• $(P_{yy}(\pm f), |\tilde\rho_{yy}(f)|^2, \psi_{yy}(f))$ to describe the individual properties of $y(t)$, and
• $(|\rho_{xy}(\pm f)|^2, |\tilde\rho_{xy}(\pm f)|^2, \psi_{xy}(\pm f), \tilde\psi_{xy}(\pm f))$ to describe the joint properties of $x(t)$ and $y(t)$.

It is also possible to define a total coherence between $x(t)$ and $y(t)$ as a correlation coefficient between the vectors $[d\xi(f), d\xi^*(-f)]^{\mathrm T}$ and $[d\upsilon(f), d\upsilon^*(-f)]^{\mathrm T}$. This would proceed along the lines discussed in Chapter 4.

8.5 Higher-order spectra

So far in this chapter, we have considered only second-order properties. We now look at higher-order moments of a complex $N$th-order stationary signal $x(t)$ that has moments defined and bounded up to $N$th order. For $n \le N$, it is possible to define $2^n$ different $n$th-order moment functions, depending on where complex conjugate operators are placed:

$$m_{x,\diamond}(\boldsymbol\tau) = E\left[ x^{\square_n}(t) \prod_{i=1}^{n-1} x^{\square_i}(t + \tau_i) \right]. \tag{8.94}$$


In this equation, $\boldsymbol\tau = [\tau_1, \ldots, \tau_{n-1}]^{\mathrm T}$ is a vector of time lags, and $\diamond = [\square_n, \square_1, \square_2, \ldots, \square_{n-1}]$ has elements $\square_i$ that are either 1 or the conjugating star $*$. For example, if $\diamond = [1, 1, *]$, then $m_{x,\diamond}(\boldsymbol\tau) = m_{xxx^*}(\tau_1, \tau_2) = E[x(t)x(t+\tau_1)x^*(t+\tau_2)]$. The number of stars in $\diamond$ will be denoted by $q$ and the number of 1s by $n - q$. If $x(t)$ is $N$th-order stationary, then all moments up to order $N$ are functions of $\boldsymbol\tau$ only. It is possible that some moments do not depend on $t$, whereas others do. Such a process is not $N$th-order stationary. For $n = 2$, $m_{x,\diamond}(\boldsymbol\tau)$ is a correlation function (Hermitian or complementary), for $n = 3$ it is a bicorrelation function, and for $n = 4$ it is a tricorrelation function.

Many of the $2^n$ moments obtained for the $2^n$ possible conjugation patterns are redundant. First of all, moments with the same number of stars are equivalent because they are related to each other through a simple coordinate transformation. For example,

$$m_{xxx^*}(\tau_1, \tau_2) = m_{x^*xx}(\tau_1 - \tau_2, -\tau_2). \tag{8.95}$$

Secondly, moments with $q$ stars are related to moments with $n - q$ stars through complex conjugation and possibly an additional coordinate transformation. For instance,

$$m_{xxx^*}(\tau_1, \tau_2) = m^*_{x^*xx^*}(\tau_2, \tau_1). \tag{8.96}$$

We call two moments equivalent if they can be expressed in terms of each other, and distinct if they cannot be expressed in terms of each other. There are $\lfloor n/2 \rfloor + 1$ distinct $n$th-order moment functions, where $\lfloor n/2 \rfloor$ denotes the greatest integer less than or equal to $n/2$. In general, all distinct $n$th-order moment functions are required for an $n$th-order description of $x(t)$.

Higher-order statistical analysis often uses cumulants rather than moments. For simplicity, we consider only moments here. However, the extension to cumulants is straightforward, using the connection between cumulants and moments given in Section 2.5.
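The count of distinct moment functions follows directly from the two equivalence rules just stated. As a small sketch (the function names are illustrative and not from the text), one can enumerate all conjugation patterns and group them by the number of stars $q$, identifying $q$ with $n - q$:

```python
from itertools import product

def conjugation_patterns(n):
    """All 2**n patterns; each element is '1' (no conjugate) or '*' (conjugate)."""
    return list(product(("1", "*"), repeat=n))

def distinct_classes(n):
    """Equivalence classes of patterns, labeled by min(q, n - q).

    Uses the two rules from the text: patterns with the same number of
    stars q are equivalent, and q stars are equivalent to n - q stars.
    """
    classes = set()
    for pattern in conjugation_patterns(n):
        q = pattern.count("*")
        classes.add(min(q, n - q))
    return sorted(classes)

n_patterns_third_order = len(conjugation_patterns(3))
classes_third_order = distinct_classes(3)
classes_fourth_order = distinct_classes(4)
```

For third order this gives 8 patterns in 2 classes (labels 0 and 1), and for fourth order 16 patterns in 3 classes, in agreement with $\lfloor n/2 \rfloor + 1$.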

8.5.1 Moment spectra and principal domains

We assume that the spectral process $\xi(f)$ corresponding to $x(t)$ is an $N$th-order random function with moments defined and bounded up to $N$th order. In the following, let $\square_i f_i$ denote $-f_i$ if $\square_i$ is the conjugating star $*$. The $(n-1)$-dimensional Fourier transform of $m_{x,\diamond}(\boldsymbol\tau)$ is the $n$th-order moment spectrum $M_{x,\diamond}(\mathbf f)$, with $\mathbf f = [f_1, f_2, \ldots, f_{n-1}]^{\mathrm T}$:

$$M_{x,\diamond}(\mathbf f) = \int_{\mathbb R^{n-1}} m_{x,\diamond}(\boldsymbol\tau)\, e^{-j2\pi \mathbf f^{\mathrm T}\boldsymbol\tau}\, d^{n-1}\boldsymbol\tau. \tag{8.97}$$

This can be expressed in terms of the increments of the spectral process $\xi(f)$ as

$$M_{x,\diamond}(\mathbf f)\, d^{n-1}\mathbf f = E\left[ d\xi^{\square_n}(-\square_n \mathbf f^{\mathrm T}\mathbf 1) \prod_{i=1}^{n-1} d\xi^{\square_i}(\square_i f_i) \right], \tag{8.98}$$

with $\mathbf 1 = [1, 1, \ldots, 1]^{\mathrm T}$ and $\mathbf f^{\mathrm T}\mathbf 1 = f_1 + f_2 + \cdots + f_{n-1}$. The relationship (8.98) will be further illuminated in Section 9.5. For $n = 2$, $M_{x,\diamond}(\mathbf f)$ is a power spectrum (the PSD or C-PSD), for $n = 3$ it is a bispectrum, and for $n = 4$ it is a trispectrum. Which ones of the equivalent spectra to use is mainly a matter of taste, but each spectrum comes with its own symmetry properties. It is sufficient to compute a moment spectrum over a smaller region, called the principal domain, because the complete spectrum can be reconstructed from its symmetry properties. For complex signals, each moment spectrum with a different conjugation pattern has its own symmetries and therefore its own principal domain.

Example 8.4. We consider the bispectrum of a lowpass signal $x(t)$ that is limited to the frequency band $[-f_{\max}, f_{\max}]$. The two distinct third-order correlations we choose are $m_{xxx}(\tau_1, \tau_2)$ and $m_{x^*xx}(\tau_1, \tau_2)$. The corresponding frequency-domain expressions are

$$M_{xxx}(f_1, f_2)\,df_1\,df_2 = E[d\xi(f_1)\,d\xi(f_2)\,d\xi(-f_1 - f_2)],$$
$$M_{x^*xx}(f_1, f_2)\,df_1\,df_2 = E[d\xi(f_1)\,d\xi(f_2)\,d\xi^*(f_1 + f_2)].$$

We first look at $M_{xxx}(f_1, f_2)$. The support of $M_{xxx}(f_1, f_2)$ is the overlap of the support of $d\xi(f_1)d\xi(f_2)$, the square in Fig. 8.3(a), with the support of $d\xi(-f_1 - f_2)$, the northwest to southeast corridor between the two solid 45° lines. The lines of symmetry are $f_2 = f_1$, $f_2 = -2f_1$, and $f_2 = -\tfrac12 f_1$. Thus, the principal domain of $M_{xxx}(f_1, f_2)$ is the gray region in Fig. 8.3(a). Now consider $M_{x^*xx}(f_1, f_2)$. This bispectrum has only one line of symmetry, which is $f_2 = f_1$. Thus, its principal domain is the gray polygon in Fig. 8.3(b). If $x(t)$ is real, the increments of its spectral process satisfy the Hermitian symmetry $d\xi(f) = d\xi^*(-f)$. This means that the bispectrum of a real signal has many more symmetry relations, and thus a smaller principal domain. For comparison, the principal domain of the bispectrum for a real signal is shown in Fig. 8.3(c).⁶

8.5.2 Analytic signals

Now consider the WSS analytic signal $x(t) = u(t) + j\hat u(t)$, where $\hat u(t)$ is the Hilbert transform of the real signal $u(t)$. The increments of the spectral processes corresponding to $x(t)$ and $u(t)$ are related by $d\xi(f) = 2\Theta(f)\,d\eta(f)$, where $\Theta(f)$ is the Heaviside unit-step function. We thus find that the $n$th-order moment spectra of complex $x(t)$ are connected to the $n$th-order moment spectrum of its corresponding real signal $u(t)$ as

$$M_{x,\diamond}(\mathbf f) = 2^n\, \Theta(-\square_n \mathbf f^{\mathrm T}\mathbf 1) \prod_{i=1}^{n-1} \Theta(\square_i f_i)\, M_u(\mathbf f). \tag{8.99}$$

This shows that $M_{x,\diamond}(\mathbf f)$ cuts out regions of $M_u(\mathbf f)$. Therefore, on its nonzero domain, $M_{x,\diamond}(\mathbf f)$ also inherits the symmetries of $M_u(\mathbf f)$. The conjugation pattern determines the selected region and thus the symmetry properties. For example, the pattern $\diamond = [*, 1, 1, \ldots, 1]$ selects the orthant $f_1 \ge 0, f_2 \ge 0, \ldots, f_{n-1} \ge 0$.

[Figure 8.3 Principal domains (gray) of different bispectra in the $(f_1, f_2)$-plane: (a) $M_{xxx}(f_1, f_2)$, (b) $M_{x^*xx}(f_1, f_2)$, and (c) for a real signal. Dashed lines are lines of symmetries: $f_2 = f_1$, $f_2 = -f_1/2$, and $f_2 = -2f_1$ in panels (a) and (c), and $f_2 = f_1$ in panel (b).]

Example 8.5. Figure 8.4 shows the support of bispectra of a bandlimited analytic signal $x(t)$ with different conjugation patterns. All the conjugation patterns shown are equivalent, meaning that every bispectrum with one or two stars ($q = 1, 2$) contains the same information. Yet since $M_{x^*xx}(f_1, f_2)$ covers the principal domain of $M_{xxx}(f_1, f_2)$ in its most common definition, gray in the figure, one could regard $x^*xx$ as the canonical starring pattern. Because $\Theta(f_1)\Theta(f_2)\Theta(-f_1 - f_2) \equiv 0$ and $\Theta(-f_1)\Theta(-f_2)\Theta(f_1 + f_2) \equiv 0$, the support for $M_{xxx}(f_1, f_2)$ and $M_{x^*x^*x^*}(f_1, f_2)$ is empty. In other words, if $x(t)$ is a WSS analytic signal, bispectra with $q = 0$ or $q = 3$ are identically zero.

[Figure 8.4 Support of bispectra with different conjugation patterns for a bandlimited analytic signal $x(t) = u(t) + j\hat u(t)$, drawn in the $(f_1, f_2)$-plane. The regions are labeled $x^*x^*x$, $xx^*x$, $x^*xx$, $xx^*x^*$, $x^*xx^*$, and $xxx^*$. The principal domain of the real signal $u(t)$ is gray.]

The observation in this example can be generalized: moment spectra of order $n$ for analytic signals can be forced to be zero depending on the number of conjugates $q$ and the nonzero bandwidth of $d\xi(f)$. Let $f_{\min}$ and $f_{\max}$ denote the minimum and maximum frequencies at which $d\xi(f)$ is nonzero. In order to obtain a nonzero moment spectrum $M_{x,\diamond}(\mathbf f)$, there must be overlap between the support of the random hypercube $d\xi^{\square_1}(\square_1 f_1) \cdots d\xi^{\square_{n-1}}(\square_{n-1} f_{n-1})$ and the support of $d\xi^{\square_n}(-\square_n \mathbf f^{\mathrm T}\mathbf 1)$. The lowest nonzero frequency of $d\xi^{\square_i}(\square_i f_i)$, $i = 1, \ldots, n-1$, is $f_{\min}$ if $\square_i = 1$ and $-f_{\max}$ if $\square_i = *$, and similarly, the highest nonzero frequency is $f_{\max}$ if $\square_i = 1$ and $-f_{\min}$ if $\square_i = *$. Take first the case in which $q \ge 1$, so we can assume without loss of generality that $\square_n = *$. Then we obtain the required overlap if both

$$(n - q) f_{\min} - (q - 1) f_{\max} < f_{\max} \tag{8.100}$$

and

$$(n - q) f_{\max} - (q - 1) f_{\min} > f_{\min}. \tag{8.101}$$

Now, if $q = 0$, then $\square_n = 1$ and we require

$$(n - 1) f_{\min} < -f_{\min} \tag{8.102}$$

and

$$(n - 1) f_{\max} > -f_{\max}, \tag{8.103}$$

which shows that (8.100) and (8.101) also hold for $q = 0$. Since one of the inequalities (8.100) and (8.101) will always be trivially satisfied, a simple necessary (not sufficient) condition for a nonzero $M_{x,\diamond}(\mathbf f)$ is

$$\begin{aligned}
(n - q) f_{\min} &< q f_{\max}, && \text{if } 2q \le n,\\
(n - q) f_{\max} &> q f_{\min}, && \text{if } 2q > n.
\end{aligned} \tag{8.104}$$


If either $q = 0$ or $q = n$, this condition requires $f_{\min} < 0$, which is impossible for an analytic signal. Thus, moment spectra of analytic signals with $q = 0$ or $q = n$ must be identically zero.⁷

Example 8.6. For a WSS analytic signal $x(t)$, evaluating (8.100) and (8.101) for $n = 4$ confirms that trispectra with $q = 0$ or $q = 4$ are zero. A necessary condition for trispectra with $q = 1$ or $q = 3$ to be nonzero is $3f_{\min} < f_{\max}$, and trispectra with $q = 2$ cannot be identically zero for a nonzero signal. For a real WSS signal $u(t)$, we may thus calculate the fourth-order moment $E\,u^4(t) = m_{uuuu}(0, 0, 0)$ from the analytic signal $x(t) = u(t) + j\hat u(t)$ as

$$E\,u^4(t) = \tfrac{1}{16}\, E[x(t) + x^*(t)]^4 = \tfrac12\, \mathrm{Re}\{E[x^3(t)x^*(t)]\} + \tfrac38\, E|x(t)|^4,$$

taking into account that $E\,x^4(t) = 0$ and $E\,x^{*4}(t) = 0$. If $3f_{\min} > f_{\max}$, tricorrelations with $q = 1$ or $q = 3$ must also be identically zero, in which case $E\,u^4(t) = \tfrac38 E|x(t)|^4$. The term $E[x^3(t)x^*(t)]$ is equal to the integral of a trispectrum for one particular conjugation pattern with $q = 1$ or $q = 3$, and $E|x(t)|^4$ is the integral of a trispectrum for one particular conjugation pattern with $q = 2$.

Reversing the inequalities in (8.104) leads to a sufficient condition for a zero moment spectrum. Let us assume that $2q \le n$, which constitutes no loss of generality since moment spectra with $q$ and $n - q$ conjugates are equivalent. If (8.104) holds for $q - 1$ and some fixed order $n$, it also holds for $q$. Thus, if trispectra with $q$ conjugates are zero, so are trispectra with $q - 1$ conjugates. We may therefore ask up to which order $N$ a signal with given $f_{\min}$ and $f_{\max}$ is circular. Remember that a signal is $N$th-order circular ($N$th-order proper) if its only nonzero moment functions up to order $N$ have an equal number of conjugated and nonconjugated terms, and hence all odd-order moments (up to order $N$) are zero. Moreover, if a signal is circular up to order $N = 2m - 1$, then it is also circular for order $N = 2m$. Therefore, $N$ can be assumed even without loss of generality. If $x(t)$ is $N$th-order circular, all moment spectra with $q \le N/2 - 1$ are zero but (8.104) does not hold for $q = N/2 - 1$. This immediately leads to

$$N \le 2 \left\lfloor \frac{f_{\max} + f_{\min}}{f_{\max} - f_{\min}} \right\rfloor, \tag{8.105}$$

where $\lfloor \cdot \rfloor$ again denotes the floor function.⁸
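The necessary condition (8.104) and the bound (8.105) are easy to evaluate for given band edges. The following sketch uses hypothetical function names and assumes $0 < f_{\min} < f_{\max}$; it only restates the two formulas and is not a test for whether a particular signal's moment spectrum actually vanishes.

```python
import math

def may_be_nonzero(n, q, f_min, f_max):
    """Necessary (not sufficient) condition (8.104) for a nonzero
    nth-order moment spectrum with q conjugates."""
    if 2 * q <= n:
        return (n - q) * f_min < q * f_max
    return (n - q) * f_max > q * f_min

def max_circular_order(f_min, f_max):
    """Upper bound (8.105) on the order N; assumes f_max > f_min > 0."""
    return 2 * math.floor((f_max + f_min) / (f_max - f_min))

# Trispectra (n = 4) of an analytic signal occupying the band [1, 2].
f_min, f_max = 1.0, 2.0
trispectrum_flags = {q: may_be_nonzero(4, q, f_min, f_max) for q in range(5)}
order_bound = max_circular_order(f_min, f_max)
```

With band edges 1 and 2, only $q = 2$ passes the necessary condition for $n = 4$ because $3f_{\min} > f_{\max}$, and the bound (8.105) evaluates to 6.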

Notes

1 The multivariate spectral factorization problem was first addressed by Wiener and Masani (1957, 1958), who provided the proof that a causal factorization exists for continuous-time spectral density matrices. Further developments were reported by Youla (1961) and Rozanov (1963), who provide general algorithms that perform spectral factorization on rational spectral-density matrices for continuous-time WSS processes. Several spectral-factorization algorithms for discrete-time spectral-density matrices have been proposed by Davis (1963), Wilson (1972), and Jezek and Kucera (1985).

2 The seminal paper that first investigated the coherence properties of partially polarized light is Wolf (1959), and an extended discussion is provided by Born and Wolf (1999). The rotary component method was introduced in oceanography by Gonella (1972) and Mooers (1973). Other key papers that deal with polarization and coherence are Jones (1979) and Samson (1980).

3 The presentation in Section 8.4.1 follows Schreier (2008b). This paper is © IEEE, and portions are used with permission. A similar discussion is also provided by Lilly and Gascard (2006).

4 There are other interesting signal decompositions for a given frequency $f$ besides the one into a completely polarized and an unpolarized signal. Rubin-Delanchy and Walden (2008) present a decomposition into two ellipses whose orientations are orthogonal, but with the same aspect ratio. The magnitude-squared coherence between CCW and CW rotary components controls the relative influences of the two ellipses. If $|\tilde\rho_{xx}(f)|^2 = 1$, then the second ellipse vanishes. If $|\tilde\rho_{xx}(f)|^2 = 0$, then the two ellipses have equal influence.

5 The Stokes parameters were introduced by Stokes (1852), and the Jones parameters and calculus by Jones (1941). Mueller calculus was invented by Hans Müller (whose last name is commonly transcribed as “Mueller”) at MIT in the early 1940s. Interestingly, the first time Müller published work using his calculus was only several years later, in Mueller (1948).

6 The symmetry relations of the bispectrum for real signals are detailed by Pflug et al. (1992) and Molle and Hinich (1995). The bispectrum of complex signals has been discussed by Jouny and Moses (1992), but they did not consider the bispectrum with no conjugated terms. Higher-order spectra with all distinct conjugation patterns were discussed by Schreier and Scharf (2006b). The principal domains of the bispectrum and trispectrum of real signals were derived by Pflug et al. (1992).

7 Necessary conditions for nonzero moment spectra of analytic signals were given by Picinbono (1994) and Amblard et al. (1996b). This discussion was extended by Izzo and Napolitano (1997) to necessary conditions for nonzero spectra of equivalent complex baseband signals.

8 A thorough discussion of circular random signals, including the interplay between circularity and stationarity, is provided by Picinbono (1994). The result that connects $N$th-order circularity to the bandwidth of analytic signals is also due to Picinbono (1994), and was derived by Amblard et al. (1996b) as well.

9 Nonstationary processes

Wide-sense stationary (WSS) processes admit a spectral representation (see Result 8.1) in terms of the Fourier basis, which allows a frequency interpretation. The transform-domain description of a WSS signal $x(t)$ is a spectral process $\xi(f)$ with orthogonal increments $d\xi(f)$. For nonstationary signals, we have to sacrifice either the Fourier basis, and thus its frequency interpretation, or the orthogonality of the transform-domain representation. We will discuss both possibilities.

The Karhunen–Loève (KL) expansion uses an orthonormal basis other than the Fourier basis but retains the orthogonality of the transform-domain description. The KL expansion is applied to a continuous-time signal of finite duration, which means that its transform-domain description is a countably infinite number of orthogonal random coefficients. This is analogous to the Fourier series, which produces a countably infinite number of Fourier coefficients, as opposed to the Fourier transform, which is applied to an infinite-duration continuous-time signal. The KL expansion presented in Section 9.1 takes into account the complementary covariance of an improper signal. It can be considered the continuous-time equivalent of the eigenvalue decomposition of improper random vectors discussed in Section 3.1.

An alternative approach is the Cramér–Loève (CL) spectral representation, which retains the Fourier basis and its frequency interpretation but sacrifices the orthogonality of the increments $d\xi(f)$. As discussed in Section 9.2, the increments $d\xi(f)$ of the spectral process of an improper signal can have nonzero Hermitian correlation and complementary correlation between different frequencies. Starting from the CL representation, we introduce energy and power spectral densities for nonstationary signals. We then discuss the CL representation for analytic signals and discrete-time signals.
Yet another description, which allows deep insights into the time-varying nature of nonstationary signals, is possible in the joint time–frequency domain. In Section 9.3, we focus our attention on the Rihaczek distribution, which is a member of Cohen’s class of bilinear time–frequency distributions. The Rihaczek distribution is not as widely used as, for instance, the Wigner–Ville distribution, but it possesses a compelling property: it is an inner product between the spectral increments and the time-domain process at a given point in the time–frequency plane. This property leads to an evocative geometrical interpretation. It is also the basis for extending the rotary-component and polarization analysis presented in the previous chapter to nonstationary signals. This is discussed in Section 9.4. Finally, Section 9.5 presents a short exposition of higher-order statistics for nonstationary signals.

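9.1 Karhunen–Loève expansion

We first consider a representation for finite-length continuous-time signals. Like the Fourier series, this representation produces a countably infinite number of uncorrelated random coefficients. However, because it uses an orthonormal basis other than the Fourier basis it generally does not afford these random coefficients a frequency-domain interpretation. The improper KL expansion is the continuous-time equivalent of the finite-dimensional improper eigenvalue decomposition, which is given in Result 3.1.¹

Result 9.1. Suppose that $\{x(t), 0 \le t \le T\}$ is a zero-mean second-order complex random process with augmented covariance function $\underline{\mathbf R}_{xx}(t_1, t_2)$, where both the covariance function $r_{xx}(t_1, t_2) = E[x(t_1)x^*(t_2)]$ and the complementary covariance function $\tilde r_{xx}(t_1, t_2) = E[x(t_1)x(t_2)]$ are continuous on $[0, T] \times [0, T]$. Then $\underline{\mathbf R}_{xx}(t_1, t_2)$ can be expanded in the Mercer series

$$\underline{\mathbf R}_{xx}(t_1, t_2) = \sum_{n=1}^{\infty} \underline{\boldsymbol\Phi}_n(t_1)\,\underline{\boldsymbol\Lambda}_n\,\underline{\boldsymbol\Phi}_n^{\mathrm H}(t_2), \tag{9.1}$$

which converges uniformly in $t_1$ and $t_2$. The augmented eigenvalue matrix $\underline{\boldsymbol\Lambda}_n$ is real and it contains two nonnegative eigenvalues $\lambda_{2n-1}$ and $\lambda_{2n}$:

$$\underline{\boldsymbol\Lambda}_n = \frac12 \begin{bmatrix} \lambda_{2n-1} + \lambda_{2n} & \lambda_{2n-1} - \lambda_{2n} \\ \lambda_{2n-1} - \lambda_{2n} & \lambda_{2n-1} + \lambda_{2n} \end{bmatrix}. \tag{9.2}$$

The augmented eigenfunction matrix

$$\underline{\boldsymbol\Phi}_n(t) = \begin{bmatrix} \phi_n(t) & \tilde\phi_n(t) \\ \tilde\phi_n^*(t) & \phi_n^*(t) \end{bmatrix} \tag{9.3}$$

satisfies the orthogonality condition

$$\int_0^T \underline{\boldsymbol\Phi}_n^{\mathrm H}(t)\,\underline{\boldsymbol\Phi}_m(t)\,dt = \mathbf I\,\delta_{nm}. \tag{9.4}$$

The matrices $\underline{\boldsymbol\Lambda}_n$ and $\underline{\boldsymbol\Phi}_n(t)$ are found as the solutions to the equation

$$\underline{\boldsymbol\Phi}_n(t_1)\,\underline{\boldsymbol\Lambda}_n = \int_0^T \underline{\mathbf R}_{xx}(t_1, t_2)\,\underline{\boldsymbol\Phi}_n(t_2)\,dt_2. \tag{9.5}$$

Then $x(t)$ can be represented by the Karhunen–Loève (KL) expansion

$$\underline{\mathbf x}(t) = \sum_{n=1}^{\infty} \underline{\boldsymbol\Phi}_n(t)\,\underline{\mathbf x}_n \iff x(t) = \sum_{n=1}^{\infty} \left[\phi_n(t)\,x_n + \tilde\phi_n(t)\,x_n^*\right], \tag{9.6}$$

where equality holds in the mean-square sense, and convergence is uniform in $t$. The complex KL coefficients $x_n$ are given by

$$\underline{\mathbf x}_n = \int_0^T \underline{\boldsymbol\Phi}_n^{\mathrm H}(t)\,\underline{\mathbf x}(t)\,dt \iff x_n = \int_0^T \left[\phi_n^*(t)\,x(t) + \tilde\phi_n(t)\,x^*(t)\right] dt. \tag{9.7}$$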

The KL coefficients are improper, with covariance and complementary covariance

$$E(x_n x_m^*) = \tfrac12(\lambda_{2n-1} + \lambda_{2n})\,\delta_{nm}, \tag{9.8}$$

$$E(x_n x_m) = \tfrac12(\lambda_{2n-1} - \lambda_{2n})\,\delta_{nm}. \tag{9.9}$$

The proof of this result provides some interesting insights. Let $\mathbb C_*^2$ be the image of $\mathbb R^2$ under the map

$$\mathbf T = \begin{bmatrix} 1 & j \\ 1 & -j \end{bmatrix}. \tag{9.10}$$

The space of augmented square-integrable functions defined on $[0, T]$ is denoted by $L^2([0, T], \mathbb C_*^2)$. This space is $\mathbb R$-linear (i.e., $\mathbb C$-widely linear) but not $\mathbb C$-linear. Using the results of Kelly and Root (1960) for vector-valued random processes, we can write down the KL expansion for an augmented signal. Let the assumptions be as in the statement of Result 9.1. Then the augmented covariance matrix $\underline{\mathbf R}_{xx}(t_1, t_2)$ can be expanded in the uniformly convergent series (called Mercer's expansion)

$$\underline{\mathbf R}_{xx}(t_1, t_2) = \sum_{n=1}^{\infty} \lambda_n\,\underline{\mathbf f}_n(t_1)\,\underline{\mathbf f}_n^{\mathrm H}(t_2), \tag{9.11}$$

where $\{\lambda_n\}_{n=1}^{\infty}$ are the nonnegative scalar eigenvalues and $\{\underline{\mathbf f}_n(t) = [f_n(t), f_n^*(t)]^{\mathrm T}\}$ are the corresponding orthonormal augmented eigenfunctions. Each $\underline{\mathbf f}_n(t)$ is in $L^2([0, T], \mathbb C_*^2)$. Eigenvalues and eigenfunctions are obtained as solutions to the integral equation

$$\lambda_n\,\underline{\mathbf f}_n(t_1) = \int_0^T \underline{\mathbf R}_{xx}(t_1, t_2)\,\underline{\mathbf f}_n(t_2)\,dt_2, \qquad 0 \le t_1 \le T, \tag{9.12}$$

where the augmented eigenfunctions $\underline{\mathbf f}_n(t)$ are orthonormal in $L^2([0, T], \mathbb C_*^2)$:

$$\langle \underline{\mathbf f}_n(t), \underline{\mathbf f}_m(t) \rangle = \int_0^T \underline{\mathbf f}_n^{\mathrm H}(t)\,\underline{\mathbf f}_m(t)\,dt = 2\,\mathrm{Re} \int_0^T f_n^*(t)\,f_m(t)\,dt = \delta_{nm}. \tag{9.13}$$

Then $\underline{\mathbf x}(t) \iff x(t)$ can be represented by the series

$$\underline{\mathbf x}(t) = \sum_{n=1}^{\infty} u_n\,\underline{\mathbf f}_n(t) \iff x(t) = \sum_{n=1}^{\infty} u_n\,f_n(t), \tag{9.14}$$

where equality holds in the mean-square sense and convergence is uniform in $t$. The KL coefficients are

$$u_n = \langle \underline{\mathbf f}_n(t), \underline{\mathbf x}(t) \rangle = \int_0^T \underline{\mathbf f}_n^{\mathrm H}(t)\,\underline{\mathbf x}(t)\,dt = 2\,\mathrm{Re} \int_0^T f_n^*(t)\,x(t)\,dt. \tag{9.15}$$

The surprising result here is that these coefficients are real scalars with covariance

$$E(u_n u_m) = \lambda_n\,\delta_{nm}. \tag{9.16}$$

The reason why the coefficients are real is found in (9.13). It shows that the functions $f_n(t)$ do not have to be orthogonal in $L^2([0, T], \mathbb C)$ to ensure that the augmented functions $\underline{\mathbf f}_n(t)$ be orthogonal in $L^2([0, T], \mathbb C_*^2)$. In fact, there are twice as many orthogonal augmented functions $\underline{\mathbf f}_n(t)$ in $L^2([0, T], \mathbb C_*^2)$ as there are orthogonal functions $f_n(t)$ in $L^2([0, T], \mathbb C)$.
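A finite-dimensional analog makes the real-valued coefficients in (9.15) concrete. The sketch below is an illustration under stated assumptions rather than the construction used in the proof: it replaces the interval $[0, T]$ by `N` time samples, draws an arbitrary covariance for the real composite vector $[\mathbf u; \mathbf v]$ with $\mathbf x = \mathbf u + j\mathbf v$, and uses the block version `T_N` of the map (9.10). All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4  # number of time samples standing in for the interval [0, T]

# Covariance of the real composite vector z = [u; v], with x = u + j v.
A = rng.standard_normal((2 * N, 2 * N))
R_zz = A @ A.T

# Block version of the map T in (9.10): [x; x*] = T_N z.
I = np.eye(N)
T_N = np.block([[I, 1j * I], [I, -1j * I]])
R_aug = T_N @ R_zz @ T_N.conj().T  # augmented covariance of [x; x*]

# Real eigenvectors of R_zz map to augmented eigenvectors of the form [f; f*].
mu, G = np.linalg.eigh(R_zz)
F = T_N @ G / np.sqrt(2)  # columns are the augmented eigenvectors
lam = 2 * mu              # eigenvalues of the augmented covariance

# One realization of the signal and its coefficients u_n = f_n^H [x; x*].
z = A @ rng.standard_normal(2 * N)
x_aug = T_N @ z
u = F.conj().T @ x_aug
```

In this discrete setting the bottom half of each column of `F` is the conjugate of its top half, the coefficients `u` have zero imaginary part up to rounding, and `F @ u` reproduces the augmented vector.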

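That's why we have been able to reduce the dimension of the internal description by a factor of 2 (real rather than complex KL coefficients).

From (9.11) it is not clear how the improper version of Mercer's expansion specializes to its proper version. To make this connection apparent, we rewrite (9.11) by combining terms with $2n - 1$ and $2n$ as

$$\begin{aligned}
\underline{\mathbf R}_{xx}(t_1, t_2) &= \sum_{n=1}^{\infty} [\underline{\mathbf f}_{2n-1}(t_1), \underline{\mathbf f}_{2n}(t_1)]\, \frac{\mathbf T^{\mathrm H}}{\sqrt2}\, \frac{\mathbf T}{\sqrt2} \begin{bmatrix} \lambda_{2n-1} & 0 \\ 0 & \lambda_{2n} \end{bmatrix} \frac{\mathbf T^{\mathrm H}}{\sqrt2} \times \frac{\mathbf T}{\sqrt2} \begin{bmatrix} \underline{\mathbf f}_{2n-1}^{\mathrm H}(t_2) \\ \underline{\mathbf f}_{2n}^{\mathrm H}(t_2) \end{bmatrix} \\
&= \sum_{n=1}^{\infty} \begin{bmatrix} \phi_n(t_1) & \tilde\phi_n(t_1) \\ \tilde\phi_n^*(t_1) & \phi_n^*(t_1) \end{bmatrix} \begin{bmatrix} \tfrac12(\lambda_{2n-1} + \lambda_{2n}) & \tfrac12(\lambda_{2n-1} - \lambda_{2n}) \\ \tfrac12(\lambda_{2n-1} - \lambda_{2n}) & \tfrac12(\lambda_{2n-1} + \lambda_{2n}) \end{bmatrix} \begin{bmatrix} \phi_n^*(t_2) & \tilde\phi_n(t_2) \\ \tilde\phi_n^*(t_2) & \phi_n(t_2) \end{bmatrix} \\
&= \sum_{n=1}^{\infty} \underline{\boldsymbol\Phi}_n(t_1)\,\underline{\boldsymbol\Lambda}_n\,\underline{\boldsymbol\Phi}_n^{\mathrm H}(t_2),
\end{aligned} \tag{9.17}$$

where

$$\phi_n(t) = \frac{1}{\sqrt2}\,[f_{2n-1}(t) - j f_{2n}(t)], \tag{9.18}$$

$$\tilde\phi_n(t) = \frac{1}{\sqrt2}\,[f_{2n-1}(t) + j f_{2n}(t)]. \tag{9.19}$$

Thus, the latent representation is now given by complex KL coefficients

$$x_n = \frac{1}{\sqrt2}\,(u_{2n-1} + j u_{2n}), \tag{9.20}$$

$$\underline{\mathbf x}_n = \begin{bmatrix} x_n \\ x_n^* \end{bmatrix} = \frac{\mathbf T}{\sqrt2} \begin{bmatrix} u_{2n-1} \\ u_{2n} \end{bmatrix} = \int_0^T \underline{\boldsymbol\Phi}_n^{\mathrm H}(t)\,\underline{\mathbf x}(t)\,dt. \tag{9.21}$$

For these coefficients we find, because of (9.16),

$$E(x_n x_m) = \tfrac12\,[E(u_{2n-1}u_{2m-1}) - E(u_{2n}u_{2m}) + jE(u_{2n-1}u_{2m}) + jE(u_{2n}u_{2m-1})] = \tfrac12(\lambda_{2n-1} - \lambda_{2n})\,\delta_{nm} \tag{9.22}$$

and $E(x_n x_m^*) = \tfrac12(\lambda_{2n-1} + \lambda_{2n})\,\delta_{nm}$. This completes the proof.

If $x(t)$ is proper, $\tilde r_{xx}(t_1, t_2) \equiv 0$, the KL expansion simplifies because $\lambda_{2n-1} = \lambda_{2n}$ and $\tilde\phi_n(t) \equiv 0$ for all $n$, and the KL coefficients $x_n$ are proper. In the proper case, the Mercer series expands the Hermitian covariance function as

$$r_{xx}(t_1, t_2) = \sum_{n=1}^{\infty} \mu_n\,\phi_n(t_1)\,\phi_n^*(t_2). \tag{9.23}$$

Here $\{\mu_n\}$ are the eigenvalues of the kernel $r_{xx}(t_1, t_2)$, which are related to the eigenvalues of the diagonal kernel $\underline{\mathbf R}_{xx}(t_1, t_2)$ as $\mu_n = \lambda_{2n-1} = \lambda_{2n}$.

In the proper or improper case, finding the eigenvalues and eigenfunctions of a linear integral equation with arbitrary kernel in (9.12) ranges from difficult to impossible. An alternative is the Rayleigh–Ritz technique, which numerically solves operator equations. The Rayleigh–Ritz technique is discussed by Chen et al. (1997); Navarro-Moreno et al. (2006) have applied this technique to obtain approximate series expansions of stochastic processes.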

Example 9.1. To gain more insight into the improper KL expansion, consider the following communications example. Suppose we want to detect a real waveform $u(t)$ that is transmitted over a channel that rotates it by some random phase $\phi$ and adds complex white Gaussian noise $n(t)$. The observations are then given by

$$y(t) = u(t)\,e^{j\phi} + n(t), \tag{9.24}$$

where we assume pairwise mutual independence of $u(t)$, $n(t)$, and $\phi$. Furthermore, we denote the rotated signal by $x(t) = u(t)e^{j\phi}$. Its covariance is given by $r_{xx}(t_1, t_2) = E[u(t_1)u(t_2)]$ and its complementary covariance is $\tilde r_{xx}(t_1, t_2) = E[u(t_1)u(t_2)] \cdot E\,e^{j2\phi}$. There are two important special cases. If the phase $\phi$ is uniformly distributed, then $\tilde r_{xx}(t_1, t_2) \equiv 0$ and detection is noncoherent. The eigenvalues of the augmented covariance of $x(t)$ satisfy $\lambda_{2n-1} = \lambda_{2n} = \mu_n$. On the other hand, if $\phi$ is known, then $\tilde r_{xx}(t_1, t_2) \equiv e^{j2\phi}\,r_{xx}(t_1, t_2)$ and detection is coherent. If we order the eigenvalues appropriately, we have $\lambda_{2n-1} = 2\mu_n$ and $\lambda_{2n} = 0$. Therefore, the coherent case is the most improper case under the power constraint $\lambda_{2n-1} + \lambda_{2n} = 2\mu_n$. These comments are clarified by noting that, from (9.14), (9.16), and (9.18)–(9.20),

$$\tfrac12\,\lambda_{2n-1}\,[\phi_n(t) + \tilde\phi_n(t)] = E\{x(t)\,\mathrm{Re}\,x_n\}, \qquad \tfrac{j}{2}\,\lambda_{2n}\,[\phi_n(t) - \tilde\phi_n(t)] = E\{x(t)\,\mathrm{Im}\,x_n\}. \tag{9.25}$$

Thus, $\lambda_{2n-1}$ measures the covariance between the real part of the observable coordinate $x_n$ and the continuous-time signal, and $\lambda_{2n}$ does so for the imaginary part. In the noncoherent version of (9.24), these two covariances are equal, suggesting that the information is carried equally in the real and imaginary parts of $x_n$. In the coherent version, $\lambda_{2n-1} = 2\mu_n$, $\lambda_{2n} = 0$ shows that the information is carried exclusively in the real part of $x_n$, making $\mathrm{Re}\,x_n$ a sufficient statistic for the decision on $x(t)$. Therefore, in the coherent problem, WL processing amounts to considering only the real part of the internal description. The more interesting applications of WL filtering, however, lie between the coherent and the noncoherent case, being characterized by a nonuniform phase distribution, or in adaptive realizations of coherent algorithms.
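The two special cases of this example can be reproduced in a sampled setting. The sketch below rests on illustrative assumptions: the waveform covariance `R_uu` is an arbitrary exponential model on five samples, and `c` stands for $E\,e^{j2\phi}$. It forms the augmented covariance of $x = u\,e^{j\phi}$ and compares its eigenvalues with the eigenvalues $\mu_n$ of `R_uu`.

```python
import numpy as np

def augmented_eigenvalues(R_uu, c):
    """Eigenvalues (descending) of the augmented covariance of x = u exp(j phi).

    R_uu is the covariance of the real waveform samples and c = E exp(j 2 phi),
    so the complementary covariance is c * R_uu.
    """
    R_aug = np.block([[R_uu, c * R_uu],
                      [np.conj(c) * R_uu, R_uu]])
    return np.sort(np.linalg.eigvalsh(R_aug))[::-1]

# Illustrative covariance of the real waveform on 5 samples.
idx = np.arange(5)
R_uu = 0.8 ** np.abs(np.subtract.outer(idx, idx))
mu = np.linalg.eigvalsh(R_uu)

# Noncoherent case: uniformly distributed phase, E exp(j 2 phi) = 0.
lam_noncoherent = augmented_eigenvalues(R_uu, 0.0)

# Coherent case: known phase (0.7 rad chosen arbitrarily), |E exp(j 2 phi)| = 1.
lam_coherent = augmented_eigenvalues(R_uu, np.exp(2j * 0.7))
```

In the noncoherent case every $\mu_n$ appears twice; in the coherent case the eigenvalues are $2\mu_n$ together with an equal number of zeros, as stated in the example.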

9.1.1 Estimation

The KL expansion can be used to solve the following problem: widely linearly estimate a nonstationary improper complex zero-mean random signal $x(t)$ with augmented covariance $\underline{\mathbf{R}}_{xx}(t_1,t_2)$ in complex white (i.e., proper) noise $n(t)$ with power spectral density $N_0$. The observations are

$$y(t) = x(t) + n(t), \qquad 0 \le t \le T, \tag{9.26}$$

and the noise will be assumed to be uncorrelated with the signal. We are looking for a widely linear (WL) estimator $\hat{\underline{\mathbf{x}}}(t) \Leftrightarrow \hat{x}(t)$ of the form

$$\hat{\underline{\mathbf{x}}}(t) = \int_0^T \underline{\mathbf{H}}(t,v)\,\underline{\mathbf{y}}(v)\,dv, \tag{9.27}$$

with augmented filter impulse response

$$\underline{\mathbf{H}}(t,v) = \begin{bmatrix} h_1(t,v) & h_2(t,v) \\ h_2^*(t,v) & h_1^*(t,v) \end{bmatrix}. \tag{9.28}$$

Thus,

$$\hat{x}(t) = \int_0^T h_1(t,v)\,y(v)\,dv + \int_0^T h_2(t,v)\,y^*(v)\,dv. \tag{9.29}$$

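To make this estimator a widely linear minimum mean-squared error (WLMMSE) estimator, we require that it satisfy the orthogonality condition

$$[\underline{\mathbf{x}}(t) - \hat{\underline{\mathbf{x}}}(t)] \perp \underline{\mathbf{y}}(u), \qquad \forall (t,u) \in [0,T] \times [0,T], \tag{9.30}$$

which translates to

$$\underline{\mathbf{R}}_{xx}(t,u) = \int_0^T \underline{\mathbf{H}}(t,v)\,\underline{\mathbf{R}}_{yy}(v,u)\,dv. \tag{9.31}$$

Using the KL expansion of Result 9.1 in this equation, we obtain

$$\sum_{n=1}^{\infty} \boldsymbol{\Phi}_n(t)\boldsymbol{\Lambda}_n\boldsymbol{\Phi}_n^{H}(u) = \int_0^T \underline{\mathbf{H}}(t,v)\left[\sum_{n=1}^{\infty} \boldsymbol{\Phi}_n(v)(\boldsymbol{\Lambda}_n + N_0\mathbf{I})\boldsymbol{\Phi}_n^{H}(u)\right] dv. \tag{9.32}$$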

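We now attempt a solution of the form

$$\underline{\mathbf{H}}(t,v) = \sum_{n=1}^{\infty} \boldsymbol{\Phi}_n(t)\mathbf{H}_n\boldsymbol{\Phi}_n^{H}(v), \tag{9.33}$$

$$\mathbf{H}_n = \begin{bmatrix} h_{n,1} & h_{n,2} \\ h_{n,2}^* & h_{n,1}^* \end{bmatrix}. \tag{9.34}$$

Inserting (9.33) into (9.32) we get

$$\sum_{n=1}^{\infty} \boldsymbol{\Phi}_n(t)\boldsymbol{\Lambda}_n\boldsymbol{\Phi}_n^{H}(u) = \sum_{n=1}^{\infty} \boldsymbol{\Phi}_n(t)\mathbf{H}_n(\boldsymbol{\Lambda}_n + N_0\mathbf{I})\boldsymbol{\Phi}_n^{H}(u), \tag{9.35}$$

which means

$$\mathbf{H}_n = \boldsymbol{\Lambda}_n(\boldsymbol{\Lambda}_n + N_0\mathbf{I})^{-1}. \tag{9.36}$$

Thus, the terms of $\mathbf{H}_n$ are

$$h_{n,1} = \frac{\lambda_{2n-1}\lambda_{2n} + (N_0/2)(\lambda_{2n-1} + \lambda_{2n})}{\lambda_{2n-1}\lambda_{2n} + N_0(\lambda_{2n-1} + \lambda_{2n}) + N_0^2}, \tag{9.37}$$

$$h_{n,2} = \frac{(N_0/2)(\lambda_{2n-1} - \lambda_{2n})}{\lambda_{2n-1}\lambda_{2n} + N_0(\lambda_{2n-1} + \lambda_{2n}) + N_0^2}. \tag{9.38}$$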
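The closed forms (9.37) and (9.38) can be checked numerically against the matrix expression (9.36). The sketch below is only illustrative: it assumes that $\boldsymbol{\Lambda}_n$ is the real symmetric $2\times 2$ matrix with eigenvalues $\lambda_{2n-1}$ and $\lambda_{2n}$ and equal diagonal entries (a structure not restated in this section), and the function name `wlmmse_gain` and the numerical values are hypothetical.

```python
import numpy as np

def wlmmse_gain(lam_odd, lam_even, N0):
    """Closed-form entries (h_n1, h_n2) as written in (9.37) and (9.38)."""
    den = lam_odd * lam_even + N0 * (lam_odd + lam_even) + N0**2
    h1 = (lam_odd * lam_even + 0.5 * N0 * (lam_odd + lam_even)) / den
    h2 = 0.5 * N0 * (lam_odd - lam_even) / den
    return h1, h2

lam_odd, lam_even, N0 = 3.0, 0.5, 0.8   # illustrative values

# Assumed structure of Lambda_n: eigenvalues lam_odd and lam_even.
Lam = 0.5 * np.array([[lam_odd + lam_even, lam_odd - lam_even],
                      [lam_odd - lam_even, lam_odd + lam_even]])

# Matrix form (9.36): H_n = Lambda_n (Lambda_n + N0 I)^{-1}
H = Lam @ np.linalg.inv(Lam + N0 * np.eye(2))
h1, h2 = wlmmse_gain(lam_odd, lam_even, N0)
print(H)
print(h1, h2)
```

Under this assumption the diagonal of `H` reproduces $h_{n,1}$ and the off-diagonal reproduces $h_{n,2}$; setting the two eigenvalues equal makes the off-diagonal entry vanish, which is the proper case.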

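If we denote the resolution of the observed signal onto the KL basis functions by

$$\underline{\mathbf{y}}_n = \int_0^T \boldsymbol{\Phi}_n^{H}(t)\,\underline{\mathbf{y}}(t)\,dt \;\Leftrightarrow\; y_n = \int_0^T \left[\phi_n^*(t)\,y(t) + \tilde{\phi}_n(t)\,y^*(t)\right] dt, \tag{9.39}$$

the WL estimator in (9.27) is

$$\hat{\underline{\mathbf{x}}}(t) = \sum_{n=1}^{\infty} \boldsymbol{\Phi}_n(t)\mathbf{H}_n\underline{\mathbf{y}}_n \;\Leftrightarrow\; \hat{x}(t) = \sum_{n=1}^{\infty} \left[\hat{x}_n\phi_n(t) + \hat{x}_n^*\tilde{\phi}_n(t)\right] \tag{9.40}$$

with

$$\hat{x}_n = h_{n,1}\,y_n + h_{n,2}\,y_n^*. \tag{9.41}$$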

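The WLMMSE in the interval $[0,T]$ is

$$\begin{aligned} \text{WLMMSE} &= \frac{1}{2}\int_0^T \operatorname{tr} E\left\{\underline{\mathbf{x}}(t)[\underline{\mathbf{x}}(t) - \hat{\underline{\mathbf{x}}}(t)]^{H}\right\} dt \\ &= \frac{1}{2}\int_0^T \operatorname{tr}\left[\underline{\mathbf{R}}_{xx}(t,t) - \int_0^T \underline{\mathbf{H}}(t,v)\,\underline{\mathbf{R}}_{xx}^{H}(t,v)\,dv\right] dt \\ &= \frac{N_0}{2}\sum_{n=1}^{\infty} \frac{\lambda_n}{\lambda_n + N_0}. \end{aligned} \tag{9.42}$$

In the proper case, $\lambda_{2n-1} = \lambda_{2n} = \mu_n$, $\tilde{\phi}_n(t) \equiv 0$ for all $n$, and we have

$$h_{n,1} = \frac{\mu_n}{\mu_n + N_0}, \qquad h_{n,2} = 0. \tag{9.43}$$

Thus, the solution simplifies to the LMMSE estimator

$$\hat{x}(t) = \sum_{n=1}^{\infty} \frac{\mu_n}{\mu_n + N_0}\,y_n\phi_n(t) \tag{9.44}$$

with LMMSE in the interval $[0,T]$ of

$$\text{LMMSE} = N_0\sum_{n=1}^{\infty} \frac{\mu_n}{\mu_n + N_0}. \tag{9.45}$$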

Equations (9.42) and (9.45) hide the fact that there are two λn s for every µn , and thus the sum in (9.42) contains two terms for every one in the sum of (9.45). The discussion in Section 5.4.2 for the estimation problem y = x + n also applies to the continuous-time version y(t) = x(t) + n(t) discussed here. In particular, Result 5.6 still holds: the maximum performance advantage of WLMMSE over LMMSE processing, in additive white and proper noise, is a factor of 2. This is achieved in the maximally improper case λ2n−1 = 2µn , λ2n = 0. If the noise is colored or improper, the performance advantage can be arbitrarily large.
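The factor-of-2 statement can be illustrated by evaluating (9.42) and (9.45) for a finite set of eigenvalues. This is a minimal sketch under stated assumptions: the values in `mu` are made up, the sums are truncated to three terms, and the linear estimator is assumed to be governed by the $\mu_n$ with $\lambda_{2n-1} + \lambda_{2n} = 2\mu_n$, as in the power constraint used earlier in this section.

```python
import numpy as np

def lmmse(mu, N0):
    """LMMSE as in (9.45): N0 * sum_n mu_n / (mu_n + N0)."""
    mu = np.asarray(mu, dtype=float)
    return N0 * np.sum(mu / (mu + N0))

def wlmmse(lam, N0):
    """WLMMSE as in (9.42): (N0 / 2) * sum over all lambda_n."""
    lam = np.asarray(lam, dtype=float)
    return 0.5 * N0 * np.sum(lam / (lam + N0))

mu = np.array([4.0, 1.0, 0.25])                  # hypothetical eigenvalues mu_n

# Proper case: lambda_{2n-1} = lambda_{2n} = mu_n.
lam_proper = np.repeat(mu, 2)
# Maximally improper case: lambda_{2n-1} = 2 mu_n, lambda_{2n} = 0.
lam_improper = np.column_stack([2.0 * mu, np.zeros_like(mu)]).ravel()

for N0 in (1.0, 1e-2, 1e-4):
    ratio = lmmse(mu, N0) / wlmmse(lam_improper, N0)
    print(f"N0 = {N0:g}: LMMSE / WLMMSE = {ratio:.4f}")
```

In this toy setting the ratio stays below 2 and approaches 2 as $N_0$ becomes small relative to the eigenvalues, while the proper eigenvalue set gives identical WLMMSE and LMMSE.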

9.1.2 Detection

The KL expansion can also be used to solve the following detection problem. We observe a complex signal $y(t)$ over the time interval $0 \le t \le T$. We would like to test the hypotheses

$$H_0: y(t) = n(t), \qquad H_1: y(t) = x(t) + n(t), \tag{9.46}$$

where the noise n(t) is zero-mean complex white (i.e., proper) Gaussian with power spectral density N0 , and the zero-mean complex Gaussian signal x(t) has augmented covariance R x x (t1 , t2 ). Let yn be the resolution of y(t) onto the KL basis functions as given by (9.39). Now collect the first N of these coefficients in the vector y = [y1 , . . ., y N ]T . This leads to a finite-dimensional detection problem exactly like the one discussed in Section 7.4.3. Owing to Grenander’s theorem 2 the finite-dimensional log-likelihood ratio converges to the log-likelihood ratio of the infinite-dimensional detection problem as N → ∞. This means that the solution and discussion provided in Section 7.4.3 also apply to the detection problem (9.46). In particular, in additive white and proper Gaussian noise the maximum performance advantage of widely linear processing over linear processing, as measured by deflection, is a factor of 2.

9.2 Cramér–Loève spectral representation

The Cramér–Loève (CL) spectral representation uses the (orthonormal) Fourier basis and thus preserves the frequency interpretation of the transform-domain representation. This, however, comes at the price of correlated spectral increments. Signals that have a CL representation are called harmonizable. The proper version of the following result is due to Loève (1978). 3

Result 9.2. A nonstationary continuous-time signal $x(t)$ can be represented as

$$x(t) = \int_{-\infty}^{\infty} d\xi(f)\,e^{j2\pi ft}, \tag{9.47}$$

where $\xi(f)$ is a spectral process with correlated increments

$$E[d\xi(f_1)\,d\xi^*(f_2)] = S_{xx}(f_1,f_2)\,df_1\,df_2, \tag{9.48}$$

$$E[d\xi(f_1)\,d\xi(-f_2)] = \tilde{S}_{xx}(f_1,f_2)\,df_1\,df_2, \tag{9.49}$$

if and only if

$$S_{xx}(f_1,f_2) = \iint_{\mathbb{R}^2} r_{xx}(t_1,t_2)\,e^{-j2\pi(f_1t_1 - f_2t_2)}\,dt_1\,dt_2, \tag{9.50}$$

$$\tilde{S}_{xx}(f_1,f_2) = \iint_{\mathbb{R}^2} \tilde{r}_{xx}(t_1,t_2)\,e^{-j2\pi(f_1t_1 - f_2t_2)}\,dt_1\,dt_2. \tag{9.51}$$

The representation (9.47) looks like the spectral representation (8.11) in Result 8.1 in the WSS case except that the spectral increments $d\xi(f)$ are now correlated between different frequencies. This correlation is measured by the (Hermitian) spectral correlation $S_{xx}(f_1,f_2)$ and complementary spectral correlation $\tilde{S}_{xx}(f_1,f_2)$, which are also called dual-frequency spectra or Loève spectra. The spectral correlation and complementary spectral correlation are the two-dimensional Fourier transforms of the time-correlation function $r_{xx}(t_1,t_2) = E[x(t_1)x^*(t_2)]$ and complementary correlation function $\tilde{r}_{xx}(t_1,t_2) = E[x(t_1)x(t_2)]$, respectively. We note, however, that the two-dimensional Fourier transforms in (9.50) and (9.51) are defined with different signs on $f_1$ and $f_2$. This is done so that the CL representation simplifies to the spectral representation in Result 8.1 for WSS signals. If necessary, the Fourier integrals (9.50) and (9.51) should be interpreted as generalized Fourier transforms that lead to $\delta$-functions in $S_{xx}(f_1,f_2)$ and $\tilde{S}_{xx}(f_1,f_2)$.

Result 9.3. The spectral correlation and complementary spectral correlation satisfy the symmetries and bounds

$$S_{xx}(f_1,f_2) = S_{xx}^*(f_2,f_1), \tag{9.52}$$

$$\tilde{S}_{xx}(f_1,f_2) = \tilde{S}_{xx}(-f_2,-f_1), \tag{9.53}$$

$$|S_{xx}(f_1,f_2)|^2 \le S_{xx}(f_1,f_1)\,S_{xx}(f_2,f_2), \tag{9.54}$$

$$|\tilde{S}_{xx}(f_1,f_2)|^2 \le S_{xx}(f_1,f_1)\,S_{xx}(-f_2,-f_2). \tag{9.55}$$

These bounds are due to the Cauchy–Schwarz inequality. Result 9.3 is the nonstationary extension of the WSS Result 2.14. However, in contrast to the WSS case, Result 9.3 lists only necessary conditions. That is, a pair of functions $(S_{xx}(f_1,f_2), \tilde{S}_{xx}(f_1,f_2))$ that satisfy (9.52)–(9.55) are not necessarily valid spectral correlation and complementary spectral correlation functions. A necessary and sufficient condition is that

$$\underline{\mathbf{S}}_{xx}(f_1,f_2) = \begin{bmatrix} S_{xx}(f_1,f_2) & \tilde{S}_{xx}(f_1,f_2) \\ \tilde{S}_{xx}^*(-f_1,-f_2) & S_{xx}^*(-f_1,-f_2) \end{bmatrix}$$

be positive semidefinite in the sense that

$$\iint_{\mathbb{R}^2} \mathbf{g}^{H}(f_1)\,\underline{\mathbf{S}}_{xx}(f_1,f_2)\,\mathbf{g}(f_2)\,df_1\,df_2 \ge 0 \tag{9.56}$$

for all continuous functions $\mathbf{g}(f): \mathbb{R} \to \mathbb{C}^2$. Unfortunately, it is not easy to interpret this condition as a relation between $S_{xx}(f_1,f_2)$ and $\tilde{S}_{xx}(f_1,f_2)$.

9.2.1 Four-corners diagram

We can gain more insight into the time-varying nature of $x(t)$ by expressing the correlation functions in terms of a global (absolute) time variable $t = t_2$ and a local (relative) time lag $\tau = t_1 - t_2$. For convenience, but in an abuse of notation, we will reuse the same symbols $r$ and $\tilde{r}$ for the correlation and complementary correlation functions, now expressed in terms of $t$ and $\tau$:

$$r_{xx}(t,\tau) = E[x(t+\tau)\,x^*(t)], \tag{9.57}$$

$$\tilde{r}_{xx}(t,\tau) = E[x(t+\tau)\,x(t)]. \tag{9.58}$$

Figure 9.1 Four-corners diagrams showing equivalent second-order characterizations. [Each diagram has the corners $r_{xx}(t,\tau)$, $V_{xx}(t,f)$, $A_{xx}(\nu,\tau)$, and $S_{xx}(\nu,f)$, with tildes for the complementary quantities, connected by the Fourier transforms $\tau \to f$ and $t \to \nu$.]

Equivalent representations can be found by applying Fourier transformations to $t$ or $\tau$, as shown in the “four-corners diagrams” in Fig. 9.1. There is a diagram for the Hermitian correlation, and another for the complementary correlation. Each arrow in Fig. 9.1 stands for a Fourier transform in one variable. In contrast to (9.50) and (9.51), these Fourier transforms now use equal signs for both dimensions. The variables $t$ and $f$ are global (absolute) variables, and $\tau$ and $\nu$ are local (relative) variables. A Fourier transform applied to a local time lag yields a global frequency variable, and a Fourier transform applied to a global time variable yields a local frequency offset. We have already encountered the spectral correlations in the southeast corners of Fig. 9.1. Again we are reusing the same symbols $S$ and $\tilde{S}$ for the spectral correlation and complementary spectral correlation, now expressed in terms of $\nu = f_1 - f_2$ and $f = f_1$. The magnitude of the Jacobian determinant of this transformation is 1, i.e., the area of the infinitesimal element $d\nu\,df$ equals the area of $df_1\,df_2$. Thus,

$$S_{xx}(\nu,f)\,d\nu\,df = E[d\xi(f)\,d\xi^*(f-\nu)], \tag{9.59}$$

$$\tilde{S}_{xx}(\nu,f)\,d\nu\,df = E[d\xi(f)\,d\xi(\nu-f)]. \tag{9.60}$$

The northeast corners of the diagrams contain the Rihaczek time–frequency representations $V_{xx}(t,f)$ and $\tilde{V}_{xx}(t,f)$ in terms of a global time $t$ and global frequency $f$. We immediately find the expressions

$$V_{xx}(t,f)\,df = E[x^*(t)\,d\xi(f)\,e^{j2\pi ft}], \tag{9.61}$$

$$\tilde{V}_{xx}(t,f)\,df = E[x(t)\,d\xi(f)\,e^{j2\pi ft}]. \tag{9.62}$$

Finally, the southwest corners are the ambiguity functions A x x (ν, τ ) and A˜ x x (ν, τ ), which play a prominent role in radar. The ambiguity functions are representations in terms of a local time τ and local frequency ν. We remark that there are other ways of defining global/local time and frequency variables. Another definition splits the time lag τ symmetrically as t1 = t + τ/2 and t2 = t − τ/2. This alternative definition leads to the frequency variables f 1 = f + ν/2 and f 2 = f − ν/2. Section 9.3 will motivate our choice of definition. 4

9.2.2 Energy and power spectral densities

Definition 9.1. If the derivative process $X(f) = d\xi(f)/df$ exists, then

$$E|X(f)|^2 = S_{xx}(0,f) \tag{9.63}$$

is the energy spectral density (ESD) and

$$E[X(f)X(-f)] = \tilde{S}_{xx}(0,f) \tag{9.64}$$

is the complementary energy spectral density (C-ESD).

The ESD shows how the energy of a random process is distributed over frequency. The total energy of $x(t)$ is obtained by integrating the ESD:

$$E_x = \int_{-\infty}^{\infty} S_{xx}(0,f)\,df. \tag{9.65}$$

Wide-sense stationary signals

We will now show that, for a given frequency $f$, the ESD of a wide-sense stationary (WSS) process is either 0 or $\infty$. If $x(t)$ is WSS, both $r_{xx}(t,\tau) = r_{xx}(\tau)$ and $\tilde{r}_{xx}(t,\tau) = \tilde{r}_{xx}(\tau)$ are independent of $t$. For the spectral correlation we find

$$S_{xx}(\nu,f) = \iint_{\mathbb{R}^2} r_{xx}(\tau)\,e^{-j2\pi(\nu t + f\tau)}\,dt\,d\tau = P_{xx}(f)\,\delta(\nu), \tag{9.66}$$

where $P_{xx}(f)$ is the power spectral density (PSD) of $x(t)$. This produces the Wiener–Khinchin relation

$$P_{xx}(f) = \int_{-\infty}^{\infty} r_{xx}(\tau)\,e^{-j2\pi f\tau}\,d\tau. \tag{9.67}$$

Similarly, the complementary spectral correlation is $\tilde{S}_{xx}(\nu,f) = \tilde{P}_{xx}(f)\,\delta(\nu)$, where $\tilde{P}_{xx}(f)$ is the complementary power spectral density (C-PSD) of $x(t)$, which is the Fourier transform of $\tilde{r}_{xx}(\tau)$. We see that

$$E[d\xi(f)\,d\xi^*(f-\nu)] = P_{xx}(f)\,\delta(\nu)\,d\nu\,df, \tag{9.68}$$

$$E[d\xi(f)\,d\xi(\nu-f)] = \tilde{P}_{xx}(f)\,\delta(\nu)\,d\nu\,df \tag{9.69}$$

and the CL representation of WSS signals simplifies to the spectral representation given in Result 8.1. The line $\nu = 0$ is called the stationary manifold. A few remarks on WSS processes are now in order.

• The ESD at any given frequency is either 0 or $\infty$. Hence, a WSS process $x(t)$ cannot have a second-order Fourier transform $X(f)$.
• The spectral correlation and complementary spectral correlation are zero outside the stationary manifold. Figuratively speaking, “the (C-)ESD is the (C-)PSD sitting on top of a $\delta$-ridge.”
• The (C-)PSD itself may contain $\delta$-functions if the signal contains periodic components (see Section 8.1).

What do the remaining two corners of the four-corners diagram look like for WSS signals? The time–frequency distribution of a WSS signal is simply the PSD,

$$V_{xx}(t,f) = \int_{-\infty}^{\infty} r_{xx}(\tau)\,e^{-j2\pi f\tau}\,d\tau = P_{xx}(f), \tag{9.70}$$

and also $\tilde{V}_{xx}(t,f) = \tilde{P}_{xx}(f)$. Finally, the ambiguity function of a WSS signal is

$$A_{xx}(\nu,\tau) = \int_{-\infty}^{\infty} r_{xx}(\tau)\,e^{-j2\pi\nu t}\,dt = r_{xx}(\tau)\,\delta(\nu), \tag{9.71}$$

and also $\tilde{A}_{xx}(\nu,\tau) = \tilde{r}_{xx}(\tau)\,\delta(\nu)$.
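The WSS forms of the four corners can be mimicked on a finite grid by replacing each Fourier transform with a DFT. The following sketch is an illustration only: the grid size, the circular lag convention, and the particular correlation sequence `r_lag` are assumptions chosen for the example, and the discrete analogue of the $\delta$-ridge is a single nonzero row at the index $\nu = 0$.

```python
import numpy as np

N = 32
tau = np.arange(N)

# Illustrative WSS correlation: depends on the (circular) lag only.
r_lag = 0.9 ** np.minimum(tau, N - tau) * np.exp(2j * np.pi * 4 * tau / N)
r = np.tile(r_lag, (N, 1))      # r[t, tau]: rows are global time, columns are lag

V = np.fft.fft(r, axis=1)       # tau -> f : time-frequency corner V[t, f]
A = np.fft.fft(r, axis=0)       # t -> nu  : ambiguity corner A[nu, tau]
S = np.fft.fft(V, axis=0)       # both     : spectral correlation S[nu, f]
P = np.fft.fft(r_lag)           # discrete Wiener-Khinchin relation

# For a WSS input, V does not depend on t, and S and A live on nu = 0 only.
print(np.allclose(V, P[None, :]))
print(np.abs(S[1:]).max(), np.abs(A[1:]).max())
```

A correlation array that does depend on `t` would spread `S` and `A` over nonzero `nu`, which is the discrete counterpart of leaving the stationary manifold.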

Nonstationary signals

There are two types of nonstationary signals:

• energy signals, which have nonzero bounded ESD and thus zero PSD; and
• power signals, which have nonzero, but not necessarily bounded, PSD and thus unbounded ESD.

WSS signals are always power signals and never energy signals. For energy signals, the ESD and C-ESD are the Fourier transforms of time-integrated $r_{xx}(t,\tau)$ and $\tilde{r}_{xx}(t,\tau)$, respectively:

$$S_{xx}(0,f) = \int_{-\infty}^{\infty}\left[\int_{-\infty}^{\infty} r_{xx}(t,\tau)\,dt\right] e^{-j2\pi f\tau}\,d\tau, \tag{9.72}$$

$$\tilde{S}_{xx}(0,f) = \int_{-\infty}^{\infty}\left[\int_{-\infty}^{\infty} \tilde{r}_{xx}(t,\tau)\,dt\right] e^{-j2\pi f\tau}\,d\tau. \tag{9.73}$$

Power signals have a WSS component, i.e., $S_{xx}(\nu,f)$ has a $\delta$-ridge on the stationary manifold $\nu = 0$. The definitions of the PSD and C-PSD for WSS signals in Result 8.1 may also be applied to nonstationary signals:

$$E|d\xi(f)|^2 = P_{xx}(f)\,df \quad\text{and}\quad E[d\xi(f)\,d\xi(-f)] = \tilde{P}_{xx}(f)\,df. \tag{9.74}$$

The PSD and C-PSD are the Fourier transforms of time-averaged $r_{xx}(t,\tau)$ and $\tilde{r}_{xx}(t,\tau)$:

$$P_{xx}(f) = \int_{-\infty}^{\infty}\left[\lim_{T\to\infty}\frac{1}{T}\int_{-T/2}^{T/2} r_{xx}(t,\tau)\,dt\right] e^{-j2\pi f\tau}\,d\tau, \tag{9.75}$$

$$\tilde{P}_{xx}(f) = \int_{-\infty}^{\infty}\left[\lim_{T\to\infty}\frac{1}{T}\int_{-T/2}^{T/2} \tilde{r}_{xx}(t,\tau)\,dt\right] e^{-j2\pi f\tau}\,d\tau. \tag{9.76}$$

The time-averaged correlations, and thus the PSD and C-PSD, vanish for energy signals. It is important to keep in mind that the PSD and C-PSD (or ESD and C-ESD) are only an incomplete second-order characterization of nonstationary signals, since they read out the spectral and complementary spectral correlations on the stationary manifold, and thus ignore everything outside the manifold. For instance, the spectral correlation of cyclostationary signals, discussed in Chapter 10, has $\delta$-ridges parallel to the stationary manifold.

Figure 9.2 Support of the spectral correlation (dark gray) and complementary spectral correlation (light gray) of a bandlimited analytic signal $x(t)$ compared with the support of the spectral correlation of the corresponding real signal $u(t)$ (thick-lined square/parallelogram): (a) uses the global–global frequencies $(f_1, f_2)$ and (b) the local–global frequencies $(\nu, f)$. This figure is adapted from Schreier and Scharf (2003b) © IEEE, and used here with permission.

9.2.3 Analytic signals

Now consider the important special case of the complex analytic signal $x(t) = u(t) + j\hat{u}(t)$, where $\hat{u}(t)$ is the Hilbert transform of $u(t)$. Thus, $d\xi(f) = 2\Gamma(f)\,d\eta(f)$, where $\Gamma(f)$ is the unit-step function and $\eta(f)$ the spectral process of $u(t)$, whose increments satisfy the Hermitian symmetry $d\eta(f) = d\eta^*(-f)$. In the global–global coordinate system $(f_1, f_2)$, we find the connection

$$S_{xx}(f_1,f_2) = \Gamma(f_1)\Gamma(f_2)\,S_{uu}(f_1,f_2), \tag{9.77}$$

$$\tilde{S}_{xx}(f_1,f_2) = \Gamma(f_1)\Gamma(-f_2)\,S_{uu}(f_1,f_2) \tag{9.78}$$

between the spectral and complementary spectral correlations of $x(t)$ and the spectral correlation of $u(t)$. In the local–global coordinate system $(\nu, f)$, this connection becomes

$$S_{xx}(\nu,f) = \Gamma(f)\Gamma(f-\nu)\,S_{uu}(\nu,f), \tag{9.79}$$

$$\tilde{S}_{xx}(\nu,f) = \Gamma(f)\Gamma(\nu-f)\,S_{uu}(\nu,f). \tag{9.80}$$

Thus, the spectral and complementary spectral correlations of $x(t)$ cut out regions from the spectral correlation of $u(t)$. This is illustrated in Fig. 9.2, which shows the support of the spectral correlation (dark gray) and complementary spectral correlation (light gray) of a bandlimited analytic signal $x(t)$ in the two different coordinate systems: global–global in (a) and local–global in (b). The Hermitian symmetry $d\eta(f) = d\eta^*(-f)$ leads to many symmetry properties of $S_{uu}$. To illustrate these in Fig. 9.2, the circle and the square represent two exemplary values of the spectral correlation. A filled square or circle is the conjugate of an empty square or circle. On their regions of support, $S_{xx}$ and $\tilde{S}_{xx}$ inherit these symmetries from

$S_{uu}$. Figure 9.2 also shows how $\tilde{S}_{xx}$ complements the information in $S_{xx}$. It is clearly not possible to reconstruct $S_{uu}$ from $S_{xx}$ alone. The stationary manifold is the dashed line in Fig. 9.2. This line does not pass through the region of support of $\tilde{S}_{xx}$. This proves once again that WSS analytic signals, whose support is limited to the stationary manifold, must be proper.

Figure 9.3 Aliasing in the spectral correlation of $x[k]$. [Axes: $\nu$ (horizontal) and $f$ (vertical), with the sampling frequency $f_s$ marked on both.]

9.2.4 Discrete-time signals

We now investigate what happens if we sample the random process $x(t)$ with sampling frequency $f_s = T_s^{-1}$ to obtain $x[k] = x(kT_s)$. The spectral representation of the sampled process $x[k]$ is

$$x[k] = \int_{-\infty}^{\infty} e^{j2\pi fkT_s}\,d\xi(f) = \int_{-f_s/2}^{f_s/2} e^{j\frac{2\pi}{f_s}fk} \sum_{m=-\infty}^{\infty} d\xi(f + mf_s). \tag{9.81}$$

The spectral representation of $x[k]$ is thus subject to aliasing unless $d\xi(f) = 0$ for $|f| > f_s/2$. The spectral correlation of $x[k]$ is

$$S_{xx}^{d}(\nu,f) = \sum_{m=-\infty}^{\infty}\sum_{n=-\infty}^{\infty} S_{xx}(\nu + mf_s, f + nf_s), \tag{9.82}$$

where the letter $d$ stands for discrete-time. The same formula is obtained for the complementary spectral correlation. The term with $m = 0$ and $n = 0$ in (9.82) is called the principal replica of $S_{xx}(\nu,f)$. Figure 9.3 illustrates aliasing for the spectral correlation of a bandlimited, undersampled signal. The support of the principal replica is the thick-lined parallelogram centered at the origin. Figure 9.3 also shows the replicas that overlap with the principal replica. Areas of overlap are gray. The following points are noteworthy.

• There are regions of the principal replica that overlap with one other replica (light gray) and regions that overlap with three other replicas (dark gray). (This assumes, as in Fig. 9.3, that $d\xi(f) = 0$ for $|f| > f_s$; otherwise the principal replica could overlap with further replicas.)
• In general, replicas with $m \ne 0$ and/or $n \ne 0$ contribute to aliasing in the ESD or PSD along the stationary manifold $\nu = 0$.
• However, the spectral correlation of a WSS signal is zero off the stationary manifold. Therefore, only the replicas with $m = 0$ and $n \ne 0$ can contribute to aliasing in the PSD of a WSS signal.

9.3 Rihaczek time–frequency representation

Let’s explore some of the properties of the Rihaczek time–frequency representation, 5 which comprises the Hermitian Rihaczek time–frequency distribution (HR-TFD)

$$V_{xx}(t,f) = \int_{-\infty}^{\infty} r_{xx}(t,\tau)\,e^{-j2\pi f\tau}\,d\tau \tag{9.83}$$

and the complementary Rihaczek time–frequency distribution (CR-TFD)

$$\tilde{V}_{xx}(t,f) = \int_{-\infty}^{\infty} \tilde{r}_{xx}(t,\tau)\,e^{-j2\pi f\tau}\,d\tau. \tag{9.84}$$

It is obvious that the Rihaczek time–frequency representation is shift-covariant in both time and frequency:

$$x(t - t_0) \longleftrightarrow V_{xx}(t - t_0, f),\ \tilde{V}_{xx}(t - t_0, f), \tag{9.85}$$

$$x(t)\,e^{j2\pi f_0 t} \longleftrightarrow V_{xx}(t, f - f_0),\ \tilde{V}_{xx}(t, f - f_0). \tag{9.86}$$

Both the HR-TFD and CR-TFD are generally complex-valued. However, the time-marginal of the HR-TFD

$$\int_{-\infty}^{\infty} V_{xx}(t,f)\,df = r_{xx}(t,0) = E|x(t)|^2 \ge 0, \tag{9.87}$$

which is the instantaneous power at $t$, and the frequency-marginal of the HR-TFD

$$\int_{-\infty}^{\infty} V_{xx}(t,f)\,dt = S_{xx}(0,f) \ge 0, \tag{9.88}$$

which is the energy spectral density at $f$, are nonnegative.

It is common to interpret time–frequency distributions as distributions of energy or power over time and frequency. However, such an interpretation is fraught with problems because of the following. One would think that the minimum requirements for a distribution to have an interpretation as a distribution of energy or power are that

• it is a bilinear function of the signal (so that it has the right physical units),
• it is covariant with respect to shifts in time and frequency,
• it has the correct time- and frequency-marginals (instantaneous power and energy spectral density, respectively), and
• it is nonnegative.

Yet Wigner’s theorem 6 says that such a distribution does not exist. The Rihaczek distribution satisfies the first three properties but it is complex-valued. The Wigner–Ville distribution, 7 which is more popular than the Rihaczek distribution, also satisfies the first three properties but it can take on negative values. Thus, we will not attempt an energy/power-distribution interpretation. Instead we argue for the Rihaczek distribution as a distribution of correlation and then present an evocative geometrical interpretation. This geometrical interpretation is the main reason why we prefer the Rihaczek distribution over the Wigner–Ville distribution.
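The marginal properties (9.87) and (9.88), and the fact that the distribution is complex-valued, can be observed on a single finite-length realization. The sketch below drops the expectation and uses the DFT in place of the spectral increments, so it is a sample analogue of (9.61) under those assumptions rather than the distribution itself; the test signal is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 64
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)   # arbitrary test signal
X = np.fft.fft(x)

k = np.arange(N)[:, None]    # time index (rows)
m = np.arange(N)[None, :]    # frequency bin (columns)

# Sample analogue of V(t, f) df = x*(t) dxi(f) exp(j 2 pi f t), no expectation.
V = np.conj(x)[:, None] * X[None, :] * np.exp(2j * np.pi * k * m / N) / N

time_marginal = V.sum(axis=1)    # compare with |x[k]|^2
freq_marginal = V.sum(axis=0)    # compare with |X[m]|^2 / N

print(np.allclose(time_marginal, np.abs(x) ** 2))
print(np.allclose(freq_marginal, np.abs(X) ** 2 / N))
print(np.abs(V.imag).max())      # individual values are complex
```

Both marginals come out real and nonnegative even though the individual values of `V` are complex, which is the situation described by the first three requirements in the list above.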

9.3.1 Interpretation

A key insight is that the HR/CR-TFDs are inner products in the Hilbert space of second-order random variables, between the time-domain signal at a fixed time instant $t$ and the frequency-domain representation at a fixed frequency $f$:

$$V_{xx}(t,f)\,df = E[x^*(t)\,d\xi(f)\,e^{j2\pi ft}] = \left\langle x(t),\, d\xi(f)\,e^{j2\pi ft} \right\rangle, \tag{9.89}$$

$$\tilde{V}_{xx}(t,f)\,df = E[x(t)\,d\xi(f)\,e^{j2\pi ft}] = \left\langle x^*(t),\, d\xi(f)\,e^{j2\pi ft} \right\rangle. \tag{9.90}$$

How should these inner products be interpreted? Let’s construct a linear minimum mean-squared error (LMMSE) estimator of the random variable $x(t)$, at fixed $t$, from the random variable $d\xi(f)\,e^{j2\pi ft}$, at fixed $f$: 8

$$\hat{x}_f(t) = \frac{E[x(t)\,d\xi^*(f)\,e^{-j2\pi ft}]}{E[d\xi(f)\,e^{j2\pi ft}\,d\xi^*(f)\,e^{-j2\pi ft}]}\,d\xi(f)\,e^{j2\pi ft} = \frac{V_{xx}^*(t,f)}{S_{xx}(0,f)}\,\frac{d\xi(f)}{df}\,e^{j2\pi ft}. \tag{9.91}$$

The MMSE is thus

$$E|\hat{x}_f(t) - x(t)|^2 = r_{xx}(t,0)\,(1 - |\rho_{xx}(t,f)|^2), \tag{9.92}$$

where

$$|\rho_{xx}(t,f)|^2 = \frac{|V_{xx}(t,f)|^2}{r_{xx}(t,0)\,S_{xx}(0,f)} \tag{9.93}$$

is the magnitude-squared rotational time–frequency coherence 9 between the time- and frequency-domain descriptions. Similarly, we can linearly estimate $x(t)$ from

$d\xi^*(f)\,e^{-j2\pi ft}$ as

$$\hat{\tilde{x}}_f(t) = \frac{E[x(t)\,d\xi(f)\,e^{j2\pi ft}]}{E[d\xi^*(f)\,e^{-j2\pi ft}\,d\xi(f)\,e^{j2\pi ft}]}\,d\xi^*(f)\,e^{-j2\pi ft} = \frac{\tilde{V}_{xx}(t,f)}{S_{xx}(0,f)}\,\frac{d\xi^*(f)}{df}\,e^{-j2\pi ft}, \tag{9.94}$$

and the corresponding magnitude-squared reflectional time–frequency coherence is

$$|\tilde{\rho}_{xx}(t,f)|^2 = \frac{|\tilde{V}_{xx}(t,f)|^2}{r_{xx}(t,0)\,S_{xx}(0,f)}. \tag{9.95}$$

The magnitude-squared time–frequency coherences satisfy $0 \le |\rho_{xx}(t,f)|^2 \le 1$ and $0 \le |\tilde{\rho}_{xx}(t,f)|^2 \le 1$. It is particularly illuminating to discuss them for the case of analytic signals $x(t) = u(t) + j\hat{u}(t)$. Bilinear time–frequency analysis favors the use of analytic signals because there is no interference between positive and negative frequencies due to $d\xi(f) = 0$ for $f < 0$. For a fixed frequency $f > 0$ but varying time $t'$, $d\xi(f)\,e^{j2\pi ft'}$ is a counterclockwise rotating phasor, and $d\xi^*(f)\,e^{-j2\pi ft'}$ a clockwise rotating phasor. For fixed $(t,f)$ and varying $t'$, the counterclockwise rotating phasor

$$\hat{x}_f(t') = \frac{V_{xx}^*(t,f)}{S_{xx}(0,f)}\,\frac{d\xi(f)}{df}\,e^{j2\pi ft'} \tag{9.96}$$

approximates $x(t)$ at $t' = t$. If $|\rho_{xx}(t,f)|^2 = 1$, this is a perfect approximation. This is illustrated in Fig. 9.4. Similarly, the clockwise rotating phasor

$$\hat{\tilde{x}}_f(t') = \frac{\tilde{V}_{xx}(t,f)}{S_{xx}(0,f)}\,\frac{d\xi^*(f)}{df}\,e^{-j2\pi ft'} \tag{9.97}$$

approximates $x(t)$ at $t' = t$, where $|\tilde{\rho}_{xx}(t,f)|^2 = 1$ indicates a perfect approximation. Thus, the magnitude of the HR-TFD, normalized by its time- and frequency-marginals, measures how well the time-domain signal $x(t)$ can be approximated by counterclockwise rotating phasors. Similarly, the magnitude of the CR-TFD, normalized by the time- and frequency-marginals of the HR-TFD, measures how well the time-domain signal $x(t)$ can be approximated by clockwise rotating phasors. This suggests that the distribution of coherence (rather than energy or power) over time and frequency is a fundamental descriptor of a nonstationary signal.

Figure 9.4 The dark solid line shows the time-domain signal $x(t)$ as it moves through the complex plane. At $t' = t$, the counterclockwise turning circle is a perfect approximation.

Finally we point out that, if $x(t)$ is WSS, the estimator $\hat{x}_f(t)$ simplifies to

$$\hat{x}_f(t) = d\xi(f)\,e^{j2\pi ft} \tag{9.98}$$

for all $t$, which has already been shown in Section 8.4.2. Furthermore, since WSS analytic signals are proper, the conjugate-linear estimator is identically zero: $\hat{\tilde{x}}_f(t) = 0$ for all $t$. Hence, WSS analytic signals can be estimated only from counterclockwise rotating phasors, whereas nonstationary improper analytic signals may also be estimated from clockwise rotating phasors.

9.3.2 Kernel estimators

The Rihaczek distribution is a mathematical expectation. As a practical matter, it must be estimated. Because such an estimator is likely to be implemented digitally, we present it for a time series $x[k]$. We require that our estimator be a bilinear function of $x[k]$ that is covariant with respect to shifts in time and frequency. Estimators that satisfy these properties constitute Cohen’s class. 10 (Since we want to perform time- and frequency-smoothing, we do not expect our estimator to have the correct time- and frequency-marginals.) The discrete-time version of Cohen’s class is

$$\hat{V}_{xx}[k,\theta) = \sum_{m}\sum_{\mu} x[k+m+\mu]\,\phi[m,\mu]\,x^*[k+m]\,e^{-j\mu\theta}, \tag{9.99}$$

where $m$ is a global and $\mu$ a local time variable. The choice of the dual-time kernel $\phi[m,\mu]$ determines the properties of the HR-TFD estimator $\hat{V}_{xx}[k,\theta)$. Omitting the conjugation in (9.99) gives an estimator of the CR-TFD. It is our objective to design a suitable kernel $\phi[m,\mu]$. A factored kernel that preserves the spirit of the Rihaczek distribution is

$$\phi[m,\mu] = w_1[m+\mu]\,w_2[\mu]\,w_3^*[m]. \tag{9.100}$$

In practice, the three windows $w_1$, $w_2$, and $w_3$ might be chosen real and even. By inserting (9.100) into (9.99), we obtain

$$\hat{V}_{xx}[k,\theta) = \sum_{m} w_3^*[m]\,x^*[k+m] \int_{-\pi}^{\pi} W_2(\omega)\,F_1[k,\theta-\omega)\,e^{jm(\theta-\omega)}\,\frac{d\omega}{2\pi}. \tag{9.101}$$

Here $W_2(\omega)$ is the discrete-time Fourier transform (DTFT) of $w_2[\mu]$ and $F_1$ is the short-time Fourier transform (STFT) of $x[k]$ using window $w_1$, defined as

$$F_1[k,\theta) = \sum_{n} w_1[n]\,x[n+k]\,e^{-jn\theta}. \tag{9.102}$$

The corresponding expression for the CR-TFD estimator is obtained by omitting the two conjugations in (9.101). In the estimator (9.101), the STFT $F_1[k,\theta-\omega)\,e^{jm(\theta-\omega)}$ plays the role of $d\xi(\theta)\,e^{jk\theta}$ in the CL spectral representation. The STFT is averaged over time with window $w_3[k]$ and over frequency with window $W_2(\omega)$. Thus, the three windows play different roles. Window $w_1$ should be a smooth tapering window to stabilize the STFT $F_1$, whereas the windows $w_3$ and $W_2$ should be localized to concentrate the estimator in time and frequency, respectively. A computationally efficient implementation of this estimator was presented by Hindberg et al. (2006). Choosing $\phi[m,\mu] = \delta[m]$, i.e., $w_1[n] = 1$, $w_2[\mu] = 1$, and $w_3[m] = \delta[m]$, produces an instantaneous estimator of the Rihaczek distribution without any time- or frequency-smoothing.
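The double sum (9.99) can be evaluated directly for a short signal. The sketch below is a deliberately naive implementation, not the efficient one cited above: the function name `cohen_hr_tfd`, the zero-padding convention outside the observed samples, and the test signal are all assumptions made for the example. It checks that the kernel $\phi[m,\mu] = \delta[m]$ reproduces the instantaneous estimate $x^*[k]\,X(\theta)\,e^{jk\theta}$, where $X(\theta)$ is the DTFT of the finite-length signal.

```python
import numpy as np

def cohen_hr_tfd(x, phi, k, theta, m_range, mu_range):
    """Direct evaluation of (9.99); samples outside 0..N-1 are treated as zero."""
    N = len(x)
    total = 0j
    for m in m_range:
        for mu in mu_range:
            i1, i2 = k + m + mu, k + m
            if 0 <= i1 < N and 0 <= i2 < N:
                total += x[i1] * phi(m, mu) * np.conj(x[i2]) * np.exp(-1j * mu * theta)
    return total

rng = np.random.default_rng(0)
N = 16
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)   # arbitrary test signal

def delta_kernel(m, mu):
    """phi[m, mu] = delta[m]: no time- or frequency-smoothing."""
    return 1.0 if m == 0 else 0.0

k, theta = 5, 2 * np.pi * 3 / N
v_hat = cohen_hr_tfd(x, delta_kernel, k, theta, range(-2, 3), range(-(N - 1), N))

# Instantaneous estimate: x*[k] X(theta) exp(j k theta), X = DTFT of x.
X_theta = np.sum(x * np.exp(-1j * theta * np.arange(N)))
v_inst = np.conj(x[k]) * X_theta * np.exp(1j * k * theta)
print(v_hat, v_inst)
```

Replacing `delta_kernel` by a product of three window functions, as in (9.100), would add smoothing in time and frequency at the cost of the exact marginals.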

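Statistical properties

From (9.99) we find that

$$E\,\hat{V}_{xx}[k,\theta) = \sum_{m}\sum_{\mu} \phi[m,\mu]\,r_{xx}[k+m,\mu]\,e^{-j\mu\theta} = \sum_{m}\int_{-\pi}^{\pi} V_{xx}[m,\omega)\,\Phi[m-k,\theta-\omega)\,\frac{d\omega}{2\pi}, \tag{9.103}$$

where $r_{xx}[k,\kappa] = E(x[k+\kappa]\,x^*[k])$ and $\Phi[m,\omega)$ is the DTFT of $\phi[m,\mu]$ in $\mu$. For the factored kernel (9.100) we have

$$\Phi[m,\omega) = w_3^*[m]\int_{-\pi}^{\pi} W_2(\omega-\nu)\,W_1(\nu)\,e^{jm\nu}\,\frac{d\nu}{2\pi}. \tag{9.104}$$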

The mean (9.103) of the HR-TFD estimator is thus a time–frequency-smoothed version of the Rihaczek distribution. The same result holds for the CR-TFD estimator. In order to derive the second-order properties of the estimator, we assume Gaussian signals. This avoids the appearance of fourth-order terms (cf. Section 2.5.3). An exact expression for the covariance of the HR-TFD estimator is 11

$$\begin{aligned} \operatorname{cov}\{\hat{V}_{xx}[k_1,\theta_1), \hat{V}_{xx}[k_2,\theta_2)\} = \sum_{m_1}\sum_{m_2}\sum_{\mu_1}\sum_{\mu_2} &\phi[m_1,\mu_1]\,\phi^*[m_2,\mu_2]\,e^{-j(\mu_1\theta_1 - \mu_2\theta_2)} \\ \times \{ &r_{xx}[k_1+m_1+\mu_1,\ k_2-k_1+m_2-m_1+\mu_2-\mu_1]\; r_{xx}^*[k_1+m_1,\ k_2-k_1+m_2-m_1] \\ + &\tilde{r}_{xx}[k_1+m_1,\ k_2-k_1+m_2-m_1+\mu_2]\; \tilde{r}_{xx}^*[k_1+m_1+\mu_1,\ k_2-k_1+m_2-m_1-\mu_1] \}. \end{aligned}$$

We observe that the covariance of the HR-TFD estimator in general depends on both the Hermitian and the complementary correlation. The covariance of the CR-TFD estimator, on the other hand, depends on the Hermitian correlation only:

$$\begin{aligned} \operatorname{cov}\{\hat{\tilde{V}}_{xx}[k_1,\theta_1), \hat{\tilde{V}}_{xx}[k_2,\theta_2)\} = \sum_{m_1}\sum_{m_2}\sum_{\mu_1}\sum_{\mu_2} &\phi[m_1,\mu_1]\,\phi^*[m_2,\mu_2]\,e^{-j(\mu_1\theta_1 - \mu_2\theta_2)} \\ \times \{ &r_{xx}[k_1+m_1+\mu_1,\ k_2-k_1+m_2-m_1+\mu_2-\mu_1]\; r_{xx}[k_1+m_1,\ k_2-k_1+m_2-m_1] \\ + &r_{xx}[k_1+m_1,\ k_2-k_1+m_2-m_1+\mu_2]\; r_{xx}[k_1+m_1+\mu_1,\ k_2-k_1+m_2-m_1-\mu_1] \}. \end{aligned}$$

We can derive the following approximate expression for the variance of the HR-TFD estimator, assuming analytic and quasi-stationary signals whose time duration of stationarity is much greater than the duration of correlation:

$$\operatorname{var}\{\hat{V}_{xx}[k,\theta)\} \approx \sum_{m}\int_{0}^{\pi} \left\{ |\Phi[m-k,\theta-\omega)|^2\,|V_{xx}[k,\omega)|^2 + \Phi[m-k,\theta-\omega)\,\Phi^*[m-k,\theta+\omega)\,|V_{xx}[k,\omega)|^2 \right\} \frac{d\omega}{2\pi}. \tag{9.105}$$

However, the same simplifying approximation of large duration of stationarity leads to a vanishing CR-TFD estimator because stationary analytic signals have zero complementary correlation. This result should not be taken as an indication that the CR-TFD is not important for analytic signals. Rather it shows that the assumption of quasi-stationarity with large duration of stationarity is a rather crude approximation to the general class of nonstationary signals.

9.4 Rotary-component and polarization analysis

In Section 9.3.1, we determined the LMMSE estimator of the time-domain signal at a fixed time instant $t$ from the frequency-domain representation at a fixed frequency $f$. This showed us how well the signal can be approximated from clockwise or counterclockwise turning circles (phasors). It thus seems natural to ask how well the signal can be approximated from ellipses, which combine the contributions from clockwise and counterclockwise circles. This framework will allow us to extend rotary-component and polarization analysis from the stationary case, described in Section 8.4, to a nonstationary setting. 12 We thus construct a WLMMSE estimator of $x(t)$, for fixed $t$, from the frequency-domain representation at $+f$ and $-f$, where throughout this entire section $f$ shall denote a fixed nonnegative frequency:

$$\hat{x}_f(t) = W_1(t,f)\,d\xi(f)\,e^{j2\pi ft} + W_2(t,-f)\,d\xi^*(f)\,e^{-j2\pi ft} + W_1(t,-f)\,d\xi(-f)\,e^{-j2\pi ft} + W_2(t,f)\,d\xi^*(-f)\,e^{j2\pi ft}. \tag{9.106}$$

Since $(t,f)$ are fixed, $W_1(t,f)$, $W_1(t,-f)$, $W_2(t,f)$, and $W_2(t,-f)$ are four complex coefficients that are determined such that $E|\hat{x}_f(t) - x(t)|^2$ is minimized. By combining terms with $e^{j2\pi ft}$ and terms with $e^{-j2\pi ft}$, we obtain

$$\hat{x}_f(t) = \underbrace{[W_1(t,f)\,d\xi(f) + W_2(t,f)\,d\xi^*(-f)]}_{d\zeta(t,f)}\,e^{j2\pi ft} + \underbrace{[W_1(t,-f)\,d\xi(-f) + W_2(t,-f)\,d\xi^*(f)]}_{d\zeta(t,-f)}\,e^{-j2\pi ft}. \tag{9.107}$$

For fixed (t, f ) but varying t ,

εt, f (t ) = dζ (t, f )ej2π f t + dζ (t, − f )e−j2π f t

εt, f+ (t ) εt, f− (t )

(9.108)

describes an ellipse in the complex time-domain plane. We may say that the varying $t'$ traces out a local ellipse $\varepsilon_{t,f}(t')$ whose parameters are "frozen" for a given $(t,f)$. Since $\varepsilon_{t,f}(t) = \hat{x}_f(t)$, the local ellipse $\varepsilon_{t,f}(t')$ provides the best approximation of $x(t)$ at $t' = t$. Following the terminology introduced in Section 8.4.1, the local ellipse $\varepsilon_{t,f}(t')$ is also called the polarization ellipse at $(t,f)$, and $\varepsilon_{t,f+}(t')$ and $\varepsilon_{t,f-}(t')$ are the rotary components.

One ellipse can be constructed for every time–frequency point $(t,f)$. This raises the following obvious question: which ellipses in the time–frequency plane should the analysis focus on? We propose to consider those points $(t,f)$ where the local ellipses provide a good approximation of the nonstationary signal $x(t)$. The quality of the approximation can be measured in terms of a time–frequency coherence, which is introduced in the next subsection.

In the WSS case, we found that the WSS ellipse

\[
\varepsilon_f(t) = d\xi(f)\,e^{j2\pi ft} + d\xi(-f)\,e^{-j2\pi ft} \tag{9.109}
\]

is the WLMMSE estimate of $x(t)$, for all $t$, from the rotary components at frequency $f$. On comparing the WSS ellipse with the nonstationary solution (9.108), we see that the random variables $d\zeta(t,f)$ and $d\zeta(t,-f)$ now play the roles of the spectral increments $d\xi(f)$ and $d\xi(-f)$ in the WSS ellipse.

In order to determine the optimum coefficients $W_1(t,\pm f)$ and $W_2(t,\pm f)$ in (9.107), we introduce the short-hand notation

\[
d\Xi(t,f) = \begin{bmatrix} d\xi(f)\,e^{j2\pi ft} \\ d\xi^*(-f)\,e^{j2\pi ft} \\ d\xi(-f)\,e^{-j2\pi ft} \\ d\xi^*(f)\,e^{-j2\pi ft} \end{bmatrix}. \tag{9.110}
\]

The local ellipse is found as the output of a WLMMSE filter

\[
\varepsilon_{t,f}(t') = \mathbf{V}_x(t,f)\,\mathbf{K}^{\dagger}(t,f)\,\frac{d\Xi(t',f)}{df}, \tag{9.111}
\]

Figure 9.5 The dark solid line shows the time-domain signal $x(t)$ as it moves through the complex plane. At $t' = t$, the ellipse is a perfect approximation.

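with

\[
\mathbf{V}_x(t,f)\,df = E\big[x(t)\,d\Xi^H(t,f)\big] = \begin{bmatrix} V_{xx}^*(t,f) & \tilde{V}_{xx}(t,-f) & V_{xx}^*(t,-f) & \tilde{V}_{xx}(t,f) \end{bmatrix} df \tag{9.112}
\]

and

\[
\mathbf{K}(t,f)\,(df)^2 = E\big[d\Xi(t,f)\,d\Xi^H(t,f)\big]. \tag{9.113}
\]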

In (9.111), $(\cdot)^{\dagger}$ denotes the pseudo-inverse, which is necessary because $\mathbf{K}(t,f)$ can be singular. Some background on the pseudo-inverse is given in Section A1.3.2 of Appendix 1. Finding a closed-form solution for the ellipse $\varepsilon_{t,f}(t')$ is tedious because it involves the pseudo-inverse of the $4 \times 4$ matrix $\mathbf{K}(t,f)$. However, in special cases (i.e., proper, WSS, or analytic signals), $\mathbf{K}(t,f)$ has many zero entries and the computations simplify accordingly. In particular, in the WSS case, $\varepsilon_{t,f}(t')$ simplifies to the stationary ellipse $\varepsilon_f(t)$ in (9.109). The analytic case is discussed in some detail below.

9.4.1

Ellipse properties

In order to measure how well the local ellipse $\varepsilon_{t,f}(t')$ approximates $x(t)$ at $t' = t$, we may consult the magnitude-squared time–frequency coherence (Note 13)

\[
|\bar{\rho}_{xx}(t,f)|^2 = \frac{\mathbf{V}_x(t,f)\,\mathbf{K}^{\dagger}(t,f)\,\mathbf{V}_x^H(t,f)}{r_{xx}(t,0)}, \tag{9.114}
\]

which is closely related to the approximation error at $t' = t$:

\[
E|\varepsilon_{t,f}(t) - x(t)|^2 = r_{xx}(t,0)\,\big(1 - |\bar{\rho}_{xx}(t,f)|^2\big). \tag{9.115}
\]
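The coherence (9.114) and the error (9.115) can be evaluated numerically once $\mathbf{V}_x$, $\mathbf{K}$, and $r_{xx}(t,0)$ are available. The following is a minimal sketch, not the authors' implementation: the function name `tf_coherence`, the randomly generated joint covariance, and the tolerance `rcond` are assumptions made for illustration. One observation is duplicated on purpose so that $\mathbf{K}$ is singular and the pseudo-inverse is actually needed.

```python
import numpy as np

def tf_coherence(V, K, r0, rcond=1e-10):
    """Evaluate (9.114) and (9.115) for given second-order quantities.

    V  : cross-covariance row vector, playing the role of V_x(t, f)
    K  : Hermitian positive semidefinite matrix, the role of K(t, f)
    r0 : signal power r_xx(t, 0) > 0
    """
    V = np.asarray(V, dtype=complex).reshape(1, -1)
    K = np.asarray(K, dtype=complex)
    # The pseudo-inverse copes with a singular K, as required in (9.111).
    quad = (V @ np.linalg.pinv(K, rcond=rcond) @ V.conj().T).real.item()
    coh2 = quad / r0
    return coh2, r0 * (1.0 - coh2)

# Hypothetical joint covariance of [x, y1, y2, y3, y4] (illustration only).
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 6)) + 1j * rng.standard_normal((5, 6))
A[4] = A[3]                       # y4 duplicates y3, so K is singular
R = A @ A.conj().T
r0, V, K = R[0, 0].real, R[0, 1:], R[1:, 1:]

coh2, mse = tf_coherence(V, K, r0)
# Dropping the redundant observation must give the same coherence.
coh2_ref, _ = tf_coherence(R[0, 1:4], R[1:4, 1:4], r0)
```

Because the quadratic form in (9.114) is invariant to the common factor $df$ in $\mathbf{V}_x\,df$ and $\mathbf{K}\,(df)^2$, the sketch works directly with covariances.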

The magnitude-squared time–frequency coherence satisfies $0 \le |\bar{\rho}_{xx}(t,f)|^2 \le 1$. If $|\bar{\rho}_{xx}(t,f)|^2 = 1$, the ellipse $\varepsilon_{t,f}(t')$ is a perfect approximation of $x(t)$ at $t' = t$. This is illustrated in Fig. 9.5. If $|\bar{\rho}_{xx}(t,f)|^2 = 0$, the best-fit ellipse $\varepsilon_{t,f}(t')$ has zero amplitude. The time–frequency coherence tells us which regions of the time–frequency plane our analysis should focus on: these are the points $(t,f)$ that have magnitude-squared coherence close to 1.

As we have seen above, the random variables $d\zeta(t,f)$ and $d\zeta(t,-f)$ in the nonstationary case play the roles of the spectral increments $d\xi(f)$ and $d\xi(-f)$ in the WSS case. The properties of the local ellipse may therefore be analyzed using the augmented ESD matrix $\underline{\mathbf{J}}_{\varepsilon\varepsilon}(t,f)$ defined by

\[
\underline{\mathbf{J}}_{\varepsilon\varepsilon}(t,f)\,(df)^2 = \begin{bmatrix} J_{\varepsilon\varepsilon}(t,f) & \tilde{J}_{\varepsilon\varepsilon}(t,f) \\ \tilde{J}_{\varepsilon\varepsilon}^*(t,f) & J_{\varepsilon\varepsilon}(t,-f) \end{bmatrix} (df)^2 \tag{9.116}
\]

\[
= E \begin{bmatrix} |d\zeta(t,f)|^2 & d\zeta(t,f)\,d\zeta(t,-f) \\ d\zeta^*(t,f)\,d\zeta^*(t,-f) & |d\zeta(t,-f)|^2 \end{bmatrix}, \tag{9.117}
\]

instead of the augmented PSD matrix $\underline{\mathbf{P}}_{xx}(f)$ in the WSS case (Note 14). Just as in the WSS case discussed in Section 8.4.2, the expected orientation of the ellipse is then approximated by

\[
\tan(2\psi(t,f)) = \frac{\operatorname{Im} \tilde{J}_{\varepsilon\varepsilon}(t,f)}{\operatorname{Re} \tilde{J}_{\varepsilon\varepsilon}(t,f)} \tag{9.118}
\]

and the area by

\[
\pi\,|J_{\varepsilon\varepsilon}(t,f) - J_{\varepsilon\varepsilon}(t,-f)|\,(df)^2, \tag{9.119}
\]

and its shape is approximately characterized by

\[
\sin(2\chi(t,f)) = \frac{J_{\varepsilon\varepsilon}(t,f) - J_{\varepsilon\varepsilon}(t,-f)}{J_{\varepsilon\varepsilon}(t,f) + J_{\varepsilon\varepsilon}(t,-f)}. \tag{9.120}
\]

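The analysis of polarization and coherence proceeds completely analogously to Section 8.4.3. For instance, as in (8.79), the degree of polarization is

\[
\Phi(t,f) = \sqrt{1 - \frac{4 \det \underline{\mathbf{J}}_{\varepsilon\varepsilon}(t,f)}{\operatorname{tr}^2 \underline{\mathbf{J}}_{\varepsilon\varepsilon}(t,f)}}. \tag{9.121}
\]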
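The ellipse descriptors (9.118) to (9.121) are simple functions of the entries of the augmented ESD matrix. The sketch below is an illustration under stated assumptions: the function name `ellipse_parameters` is hypothetical, the factor $(df)^2$ is dropped from the area, and the branch of $2\psi$ is fixed with a four-quadrant arctangent, which is a choice the text does not prescribe. The test case uses a deterministic ellipse with rotary amplitudes $a$ and $b$, for which the matrix has rank one.

```python
import numpy as np

def ellipse_parameters(J_aug):
    """Orientation, shape angle, area measure, and degree of polarization
    from a 2x2 augmented ESD matrix laid out as in (9.116)."""
    J_aug = np.asarray(J_aug, dtype=complex)
    J_pos = J_aug[0, 0].real          # J(t, f)
    J_neg = J_aug[1, 1].real          # J(t, -f)
    J_comp = J_aug[0, 1]              # complementary entry J~(t, f)
    psi = 0.5 * np.arctan2(J_comp.imag, J_comp.real)         # (9.118)
    area = np.pi * abs(J_pos - J_neg)                         # (9.119), without (df)^2
    chi = 0.5 * np.arcsin((J_pos - J_neg) / (J_pos + J_neg))  # (9.120)
    det = J_pos * J_neg - abs(J_comp) ** 2
    dop = np.sqrt(max(0.0, 1.0 - 4.0 * det / (J_pos + J_neg) ** 2))  # (9.121)
    return psi, chi, area, dop

# Deterministic ellipse a*exp(j*w*t') + b*exp(-j*w*t') (assumed test case).
a = 2.0 * np.exp(1j * np.pi / 3)
b = 1.0 * np.exp(1j * np.pi / 6)
J = np.array([[abs(a) ** 2, a * b],
              [np.conj(a * b), abs(b) ** 2]])
psi, chi, area, dop = ellipse_parameters(J)
```

For this test case the orientation equals the mean of the two phasor phases, $\pi/4$, the area measure is $\pi(|a|^2 - |b|^2)$, and the degree of polarization is 1.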

9.4.2

Analytic signals

For analytic signals, these expressions simplify. If $x(t)$ is analytic, $d\xi(-f) = 0$ and the local ellipse becomes

\[
\varepsilon_{t,f}(t') = \underbrace{W_1(t,f)\,d\xi(f)}_{d\zeta(t,f)}\,e^{j2\pi ft'} + \underbrace{W_2(t,-f)\,d\xi^*(f)}_{d\zeta(t,-f)}\,e^{-j2\pi ft'}. \tag{9.122}
\]

Thus,

\[
\underline{\mathbf{J}}_{\varepsilon\varepsilon}(t,f)\,(df)^2 = E\left\{ \begin{bmatrix} W_1(t,f)\,d\xi(f) \\ W_2^*(t,-f)\,d\xi(f) \end{bmatrix} \begin{bmatrix} W_1^*(t,f)\,d\xi^*(f) & W_2(t,-f)\,d\xi^*(f) \end{bmatrix} \right\} \tag{9.123}
\]

and then

\[
\underline{\mathbf{J}}_{\varepsilon\varepsilon}(t,f) = S_{xx}(0,f) \begin{bmatrix} |W_1(t,f)|^2 & W_1(t,f)\,W_2(t,-f) \\ W_1^*(t,f)\,W_2^*(t,-f) & |W_2(t,-f)|^2 \end{bmatrix}. \tag{9.124}
\]

From (9.111), we determine the filter coefficients

\[
W_1(t,f)\,df = \frac{V_{xx}^*(t,f)\,S_{xx}(0,f) - \tilde{V}_{xx}(t,f)\,\tilde{S}_{xx}^*(2f,f)\,e^{-j4\pi ft}}{S_{xx}^2(0,f) - |\tilde{S}_{xx}(2f,f)|^2}, \tag{9.125}
\]

\[
W_2(t,-f)\,df = \frac{\tilde{V}_{xx}(t,f)\,S_{xx}(0,f) - V_{xx}^*(t,f)\,\tilde{S}_{xx}(2f,f)\,e^{j4\pi ft}}{S_{xx}^2(0,f) - |\tilde{S}_{xx}(2f,f)|^2}, \tag{9.126}
\]

provided that $S_{xx}^2(0,f) \neq |\tilde{S}_{xx}(2f,f)|^2$. If $S_{xx}^2(0,f) = |\tilde{S}_{xx}(2f,f)|^2$, then $d\xi^*(f) = e^{j\alpha}\,d\xi(f)$ for some real $\alpha$ and the WLMMSE estimator becomes strictly linear. We find for this special case

\[
W_1(t,f)\,df = \frac{V_{xx}^*(t,f)}{S_{xx}(0,f)}, \tag{9.127}
\]

\[
W_2(t,-f)\,df = 0, \tag{9.128}
\]

which makes the approximating ellipse a circle. It is interesting to examine the ellipse shape and polarization, which, for $S_{xx}^2(0,f) \neq |\tilde{S}_{xx}(2f,f)|^2$, can be done through the angle $\chi(t,f)$ defined by

\[
\sin(2\chi(t,f)) = \frac{\big(|V_{xx}(t,f)|^2 - |\tilde{V}_{xx}(t,f)|^2\big)\big(S_{xx}^2(0,f) - |\tilde{S}_{xx}(2f,f)|^2\big)}{D}, \tag{9.129}
\]

where

\[
D = \big(|V_{xx}(t,f)|^2 + |\tilde{V}_{xx}(t,f)|^2\big)\big(S_{xx}^2(0,f) + |\tilde{S}_{xx}(2f,f)|^2\big) - 4\operatorname{Re}\big[V_{xx}^*(t,f)\,\tilde{V}_{xx}^*(t,f)\,S_{xx}(0,f)\,\tilde{S}_{xx}(2f,f)\,e^{j4\pi ft}\big].
\]

If $x(t)$ is proper at time $t$ and frequency $f$, $\tilde{V}_{xx}(t,f) = 0$ and $\tilde{S}_{xx}(2f,f) = 0$, then $\chi(t,f) = \pi/4$. This says that a proper analytic signal is counterclockwise circularly polarized. On the other hand, if $S_{xx}^2(0,f) = |\tilde{S}_{xx}(2f,f)|^2$, then $d\xi^*(f) = e^{j\alpha}\,d\xi(f)$ and thus $|V_{xx}(t,f)|^2 = |\tilde{V}_{xx}(t,f)|^2$. In this case, the signal $x(t)$ can be regarded as maximally improper at frequency $f$. A maximally improper analytic signal has $\chi(t,f) = \pi/4$ and is therefore also counterclockwise circularly polarized. Since $\det \underline{\mathbf{J}}_{\varepsilon\varepsilon}(t,f) = 0$ for all $(t,f)$, all analytic signals are completely polarized.

Yet, while $|\tilde{S}_{xx}(2f,f)|^2 \le S_{xx}^2(0,f)$, the magnitude of the HR-TFD does not provide an upper bound on the magnitude of the CR-TFD, i.e., $|\tilde{V}_{xx}(t,f)|^2 \nleq |V_{xx}(t,f)|^2$. Moreover, $|\tilde{V}_{xx}(t,f)|^2 = |V_{xx}(t,f)|^2$ does not imply $S_{xx}^2(0,f) = |\tilde{S}_{xx}(2f,f)|^2$. Therefore it is possible that $x(t)$ is clockwise polarized at $(t,f)$, i.e., $\chi(t,f) < 0$, provided that the signal is "sufficiently improper" at $(t,f)$. This result may seem surprising, considering that an analytic signal is synthesized from counterclockwise phasors only.

The quality of the approximation can be judged by computing the magnitude-squared time–frequency coherence $|\bar{\rho}_{xx}(t,f)|^2$ defined in (9.114). For analytic signals, we obtain the simplified expression

\[
|\bar{\rho}_{xx}(t,f)|^2 = \frac{N}{r_{xx}(t,0)\big(S_{xx}^2(0,f) - |\tilde{S}_{xx}(2f,f)|^2\big)}, \tag{9.130}
\]

where

\[
N = S_{xx}(0,f)\big(|V_{xx}(t,f)|^2 + |\tilde{V}_{xx}(t,f)|^2\big) - 2\operatorname{Re}\big[V_{xx}^*(t,f)\,\tilde{V}_{xx}^*(t,f)\,\tilde{S}_{xx}(2f,f)\,e^{j4\pi ft}\big],
\]

if $S_{xx}^2(0,f) \neq |\tilde{S}_{xx}(2f,f)|^2$. Both in the proper case, which is characterized by $\tilde{V}_{xx}(t,f) = 0$ and $\tilde{S}_{xx}(2f,f) = 0$, and in the maximally improper case, characterized by $S_{xx}^2(0,f) = |\tilde{S}_{xx}(2f,f)|^2$, the magnitude-squared time–frequency coherence simplifies to the magnitude-squared rotational time–frequency coherence (9.93), i.e.,

\[
|\bar{\rho}_{xx}(t,f)|^2 = |\rho_{xx}(t,f)|^2 = \frac{|V_{xx}(t,f)|^2}{r_{xx}(t,0)\,S_{xx}(0,f)}. \tag{9.131}
\]

In these cases x(t) can be estimated from counterclockwise rotating phasors only. In other words, the optimum WLMMSE estimator is the LMMSE estimator, i.e., W2 (t, − f ) = 0.
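The closed-form coefficients (9.125) and (9.126) can be checked against the generic filter (9.111) restricted to the two nonzero observations $d\xi(f)e^{j2\pi ft}$ and $d\xi^*(f)e^{-j2\pi ft}$ of an analytic signal. The sketch below is a numerical sanity check under assumptions: the values of $V_{xx}$, $\tilde{V}_{xx}$, $S_{xx}(0,f)$, $\tilde{S}_{xx}(2f,f)$, $f$, and $t$ are arbitrary illustrative numbers (with $|\tilde{S}_{xx}|^2 < S_{xx}^2$), and the function name is hypothetical. It also compares (9.129) with the shape angle obtained from (9.120) and (9.124).

```python
import numpy as np

def analytic_coeffs(V, Vt, S, St, f, t):
    """Closed forms (9.125)-(9.126); returns W1(t,f)df and W2(t,-f)df.
    V, Vt: HR-TFD and CR-TFD values; S, St: S_xx(0,f) and S~_xx(2f,f)."""
    rot = np.exp(1j * 4 * np.pi * f * t)
    den = S ** 2 - abs(St) ** 2
    W1 = (np.conj(V) * S - Vt * np.conj(St) / rot) / den
    W2 = (Vt * S - np.conj(V) * St * rot) / den
    return W1, W2

# Arbitrary illustrative values (assumptions, not data from the text).
V, Vt, S, St, f, t = 0.8 - 0.3j, 0.2 + 0.5j, 2.0, 0.6 + 0.9j, 1.5, 0.2
W1, W2 = analytic_coeffs(V, Vt, S, St, f, t)

# Same coefficients from the 2x2 version of (9.111)-(9.113).
rot = np.exp(1j * 4 * np.pi * f * t)
K = np.array([[S, St * rot],
              [np.conj(St) / rot, S]])
W_pinv = np.array([np.conj(V), Vt]) @ np.linalg.pinv(K)

# Shape angle: directly from |W1|, |W2| and from the closed form (9.129).
sin2chi_direct = (abs(W1) ** 2 - abs(W2) ** 2) / (abs(W1) ** 2 + abs(W2) ** 2)
D = ((abs(V) ** 2 + abs(Vt) ** 2) * (S ** 2 + abs(St) ** 2)
     - 4 * np.real(np.conj(V) * np.conj(Vt) * S * St * rot))
sin2chi_formula = (abs(V) ** 2 - abs(Vt) ** 2) * (S ** 2 - abs(St) ** 2) / D
```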

9.5

Higher-order statistics

We conclude this chapter with a very brief introduction to the higher-order statistics of a continuous-time signal $x(t)$ (Note 15). We denote the $n$th-order moment function by

\[
m_{x,\diamond}(t,\boldsymbol{\tau}) = E\Big[ x^{\diamond_n}(t) \prod_{i=1}^{n-1} x^{\diamond_i}(t+\tau_i) \Big], \tag{9.132}
\]

where, as in Section 8.5, $\boldsymbol{\tau} = [\tau_1, \ldots, \tau_{n-1}]^T$, and $\diamond = [\diamond_n, \diamond_1, \diamond_2, \ldots, \diamond_{n-1}]^T$ contains elements $\diamond_i$ that are either 1 or the conjugating star $*$. This leads to $2^n$ different $n$th-order moment functions, depending on which terms are conjugated. As explained in Section 8.5, not all of these functions are required for a complete statistical description. We assume that $x(t)$ can be represented as in the CL spectral representation (9.47),

\[
x(t) = \int_{-\infty}^{\infty} d\xi(f)\,e^{j2\pi ft}, \tag{9.133}
\]

but $\xi(f)$ is now an $N$th-order spectral process with moments defined and bounded up to $N$th order. That is, the moment functions can be expressed in terms of the increment process $d\xi(f)$ as

\[
m_{x,\diamond}(t,\boldsymbol{\tau}) = \int_{\mathbb{R}^n} E\Big[ d\xi^{\diamond_n}(-\diamond_n f_n) \prod_{i=1}^{n-1} d\xi^{\diamond_i}(\diamond_i f_i) \Big]\, e^{j2\pi[\mathbf{f}^T\boldsymbol{\tau} + (f_1 + \cdots + f_{n-1} - f_n)t]}, \tag{9.134}
\]

where $n = 1, \ldots, N$, $\mathbf{f} = [f_1, \ldots, f_{n-1}]^T$, and $\diamond_i f_i = -f_i$ if $\diamond_i$ is the conjugation star $*$. The "$-$" sign for $f_n$ is to ensure that (9.134) complies with (9.50) and (9.51) in the second-order case $n = 2$. Since $t$ is a global time variable and $\boldsymbol{\tau}$ contains local time lags, we call $f_1, \ldots, f_{n-1}$ global frequencies and $\nu = f_1 + \cdots + f_{n-1} - f_n = \mathbf{f}^T\mathbf{1} - f_n$ a local frequency-offset. With this substitution, we obtain

\[
m_{x,\diamond}(t,\boldsymbol{\tau}) = \int_{\mathbb{R}^n} E\Big[ d\xi^{\diamond_n}\big(\diamond_n(\nu - \mathbf{f}^T\mathbf{1})\big) \prod_{i=1}^{n-1} d\xi^{\diamond_i}(\diamond_i f_i) \Big]\, e^{j2\pi(\mathbf{f}^T\boldsymbol{\tau} + \nu t)}. \tag{9.135}
\]

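We now define

\[
S_{x,\diamond}(\nu,\mathbf{f})\,d\nu\,d^{n-1}\mathbf{f} = E\Big[ d\xi^{\diamond_n}\big(\diamond_n(\nu - \mathbf{f}^T\mathbf{1})\big) \prod_{i=1}^{n-1} d\xi^{\diamond_i}(\diamond_i f_i) \Big]. \tag{9.136}
\]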

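We call $S_{x,\diamond}(\nu,\mathbf{f})$ the $n$th-order spectral correlation, which we recognize as the $n$-dimensional Fourier transform of the moment function $m_{x,\diamond}(t,\boldsymbol{\tau})$. The spectral correlation may contain $\delta$-functions (e.g., in the stationary case).

Consider now a signal $x(t)$ that is stationary up to order $N$. Then none of the moment functions $m_{x,\diamond}(t,\boldsymbol{\tau})$, for $n = 1, \ldots, N$ and arbitrary $\diamond$, depends on $t$: $m_{x,\diamond}(t,\boldsymbol{\tau}) = m_{x,\diamond}(\boldsymbol{\tau})$. This can be true only if $S_{x,\diamond}(\nu,\mathbf{f})$ is zero outside the stationary manifold $\nu = 0$. The $(n-1)$-dimensional Fourier transform of $m_{x,\diamond}(\boldsymbol{\tau})$,

\[
M_{x,\diamond}(\mathbf{f}) = \int_{\mathbb{R}^{n-1}} m_{x,\diamond}(\boldsymbol{\tau})\,e^{-j2\pi\mathbf{f}^T\boldsymbol{\tau}}\,d^{n-1}\boldsymbol{\tau}, \tag{9.137}
\]

is the $n$th-order moment spectrum introduced in (8.97). The moment spectrum and spectral correlation of a stationary process are related as

\[
S_{x,\diamond}(\nu,\mathbf{f}) = M_{x,\diamond}(\mathbf{f})\,\delta(\nu). \tag{9.138}
\]

The moment spectrum can be expressed directly in terms of the increments of the spectral process as

\[
M_{x,\diamond}(\mathbf{f})\,d^{n-1}\mathbf{f} = E\Big[ d\xi^{\diamond_n}(-\diamond_n\mathbf{f}^T\mathbf{1}) \prod_{i=1}^{n-1} d\xi^{\diamond_i}(\diamond_i f_i) \Big]. \tag{9.139}
\]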
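As a check on the notation, it may help to write out the second-order case $n = 2$ of (9.132) to (9.136). The following worked step is a sketch derived from those definitions only; it assumes the convention stated above that a conjugating star flips the sign of the corresponding frequency argument.

```latex
% Case n = 2 with conjugation pattern (\diamond_2, \diamond_1) = (*, 1):
\begin{align*}
m_{x,\diamond}(t,\tau_1) &= E\big[x^*(t)\,x(t+\tau_1)\big], \\
m_{x,\diamond}(t,\tau_1) &= \int_{\mathbb{R}^2} E\big[d\xi^*(f_2)\,d\xi(f_1)\big]\,
    e^{j2\pi[f_1\tau_1 + (f_1 - f_2)t]}, \\
S_{x,\diamond}(\nu,f_1)\,d\nu\,df_1 &= E\big[d\xi(f_1)\,d\xi^*(f_1-\nu)\big],
    \qquad \nu = f_1 - f_2 .
\end{align*}
% Case n = 2 with conjugation pattern (\diamond_2, \diamond_1) = (1, 1):
\begin{align*}
m_{x,\diamond}(t,\tau_1) &= E\big[x(t)\,x(t+\tau_1)\big], \\
S_{x,\diamond}(\nu,f_1)\,d\nu\,df_1 &= E\big[d\xi(f_1)\,d\xi(\nu-f_1)\big].
\end{align*}
```

The first pattern reproduces the Hermitian correlation and its spectral correlation, the second the complementary pair, which is the sense in which the "$-$" sign on $f_n$ keeps (9.134) compatible with the second-order definitions.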

Notes

1 Section 9.1 closely follows the presentation by Schreier et al. (2005). This paper is © IEEE, and portions are reused with permission.

2 For Grenander's theorem, see for instance, Section VI.B.2 in Poor (1998).

3 Section 9.2 draws on material presented by Picinbono and Bondon (1997) and Schreier and Scharf (2003b).

4 Note that Schreier and Scharf (2003b) use a slightly different way of defining global and local variables: they define $t_1 = t$, $t_2 = t - \tau$ and thus $f_1 = f + \nu$, $f_2 = f$, whereas we define $t_1 = t + \tau$, $t_2 = t$ and thus $f_1 = f$, $f_2 = f - \nu$. While this leads to results that look slightly different, they are fundamentally the same.

5 The Rihaczek distribution for deterministic signals was introduced by Rihaczek (1968). The extension to stochastic signals that is based on harmonizability is due to Martin (1982). Note that we do not follow the (somewhat confusing) convention of differentiating between time–frequency representations for deterministic and stochastic signals by calling the former "distributions" and the latter "spectra." Time–frequency distributions for stochastic signals in general are discussed by Flandrin (1999), and the Rihaczek distribution in particular by Scharf et al. (2005).

6 For a discussion of Wigner's theorem and its consequences for time–frequency analysis, see Flandrin (1999).

7 The Wigner–Ville distribution is the Fourier transform (in $\tau$) of $E[x(t + \tau/2)\,x^{(*)}(t - \tau/2)]$. That is, the local time lag $\tau$ is split symmetrically between $t_1 = t + \tau/2$ and $t_2 = t - \tau/2$, which leads to a symmetric split of the frequency-offset $\nu$ between the frequency variables $f_1 = f + \nu/2$ and $f_2 = f - \nu/2$. For a thorough treatment of the Wigner–Ville distribution, see Flandrin (1999).

8 Estimating $x(t)$ from the random variable $d\xi(f)e^{j2\pi ft}$ assumes that $x(t)$ is an energy signal whose Fourier transform $X(f) = d\xi(f)/df$ exists. If $x(t)$ is not an energy signal, we must estimate the increment $x(t)\,df$ instead. This requires only minor changes to the remaining development in Sections 9.3 and 9.4.

9 Following the terminology introduced in Chapter 4, $\rho_{xx}(t,f)$ is a rotational correlation coefficient between $x(t)$ and $d\xi(f)e^{j2\pi ft}$, at fixed $t$ and $f$. However, in this context, the term "coherence" is preferred over "correlation coefficient." The time–frequency coherence should not be confused with the coherence introduced in Definition 8.2, which is a frequency-domain coherence between the rotary components at a given frequency. The time–frequency coherence does not simplify to the frequency-domain coherence for a WSS signal.

10 Cohen's class was introduced in quantum mechanics by Cohen (1966). Members of this class can also be regarded as estimators of the Wigner–Ville distribution, see Martin and Flandrin (1985).

11 The estimator of Section 9.3.2 was presented by Scharf et al. (2005), building upon prior work by Scharf and Friedlander (2001). The paper Scharf et al. (2005) is © IEEE, and portions are reused with permission. The evaluation of the statistical properties of the HR/CR-TFD estimators is due to Scharf et al. (2005), following the lead of Martin and Flandrin (1985). Note again that the results presented in Section 9.3.2 look different at first than those of Scharf et al. (2005) even though they are really identical. This is because Scharf et al. (2005) define $r_{xx}[k,\kappa] = E(x[k]\,x^*[k-\kappa])$, whereas this book defines $r_{xx}[k,\kappa] = E(x[k+\kappa]\,x^*[k])$.

12 Section 9.4 follows Schreier (2008b). The paper is © IEEE, and portions are reused with permission. Previous extensions of rotary-component and polarization analysis to nonstationary signals have been centered mainly around the wavelet transform, and to a much lesser extent, the STFT. The interested reader is referred to Lilly and Park (1995), Olhede and Walden (2003a, 2003b), Lilly and Gascard (2006), and Roueff et al. (2006).

13 In the language of Chapter 4, $\bar{\rho}_{xx}(t,f)$ is a total correlation coefficient between $x(t)$ and $[d\xi(f)e^{j2\pi ft},\ d\xi(-f)e^{-j2\pi ft}]$.

14 This assumes that $x(t)$ is an energy signal with bounded ESD. If $x(t)$ is a power signal, we have to work with PSDs instead. That is, instead of using $J_{\varepsilon\varepsilon}(t,f)(df)^2 = E|d\zeta(t,f)|^2$, we use $P_{\varepsilon\varepsilon}(t,f)\,df = E|d\zeta(t,f)|^2$. This requires only minor changes in the remainder of the section.

15 A detailed discussion of higher-order statistics for nonstationary signals (albeit real-valued) is given by Hanssen and Scharf (2003).

10

Cyclostationary processes

Cyclostationary processes are an important class of nonstationary processes that have periodically varying correlation properties. They can model periodic phenomena occurring in science and technology, including communications (modulation, sampling, and multiplexing), meteorology, oceanography, climatology, astronomy (rotation of the Earth and other planets), and economics (seasonality). While cyclostationarity can manifest itself in statistics of arbitrary order, we will restrict our attention to phenomena in which the second-order correlation and complementary correlation functions are periodic in their global time variable. Our program for this chapter is as follows. In Section 10.1, we discuss the spectral properties of harmonizable cyclostationary processes. We have seen in Chapter 8 that the second-order averages of a WSS process are characterized by the power spectral density (PSD) and complementary power spectral density (C-PSD). These each correspond to a single δ-ridge (the stationary manifold) in the spectral correlation and complementary spectral correlation. Cyclostationary processes have a (possibly countably infinite) number of so-called cyclic PSDs and C-PSDs. These correspond to δ-ridges in the spectral correlation and complementary spectral correlation that are parallel to the stationary manifold. In Section 10.2, we derive the cyclic PSDs and C-PSDs of linearly modulated digital communication signals. We will see that there are two types of cyclostationarity: one related to the symbol rate, the other to impropriety and carrier modulation. Because cyclostationary processes are spectrally correlated between different frequencies, they have spectral redundancy. This redundancy can be exploited in optimum estimation. The widely linear minimum mean-squared error filter is called the cyclic Wiener filter, which is discussed in Section 10.3. It can be considered the cyclostationary extension of the WSS Wiener filter. 
The cyclic Wiener filter consists of a bank of linear shift-invariant filters that are applied to frequency-shifted versions of the signal and its conjugate. In Section 10.4, we develop an efficient causal filter-bank implementation of the cyclic Wiener filter for cyclostationary discrete-time series, which is based on a connection between scalar-valued cyclostationary and vector-valued WSS time series. We need to emphasize that, due to space constraints, we can barely scratch the surface of the very rich topic of cyclostationarity in this chapter. For further information, we refer the reader to the review article by Gardner et al. (2006) and the extensive bibliography by Serpedin et al. (2005).

10.1

Characterization and spectral properties

Cyclostationary processes have a correlation function $r_{xx}(t,\tau) = E[x(t+\tau)\,x^*(t)]$ and complementary correlation function $\tilde{r}_{xx}(t,\tau) = E[x(t+\tau)\,x(t)]$ that are both periodic in their global time variable $t$.

Definition 10.1. A zero-mean process $x(t)$ is cyclostationary (CS) if there exists $T > 0$ such that

\[
r_{xx}(t,\tau) = r_{xx}(t+T,\tau), \tag{10.1}
\]

\[
\tilde{r}_{xx}(t,\tau) = \tilde{r}_{xx}(t+T,\tau) \tag{10.2}
\]

for all $t$ and $\tau$.

CS processes are sometimes also called periodically correlated. If $x(t)$ does not have zero mean, we require that the mean be $T$-periodic as well, i.e., $\mu_x(t) = \mu_x(t+T)$ for all $t$. Note that $r_{xx}(t,\tau)$ might actually be $T_1$-periodic and $\tilde{r}_{xx}(t,\tau)$ $T_2$-periodic. The period $T$ would then be the least common multiple of $T_1$ and $T_2$. It is important to stress that the periodicity is in the global time variable $t$, not the local time offset $\tau$. Periodicity in $\tau$ may occur independently of periodicity in $t$. It should also be emphasized that periodicity refers to the correlation and complementary correlation functions, not the process itself. That is, in general, $x(t) \neq x(t+T)$.

For WSS processes, (10.1) and (10.2) hold for arbitrary $T > 0$. Thus, WSS processes are a subclass of CS processes. The CS processes we consider are more precisely called wide-sense (or second-order) CS, since Definition 10.1 considers only second-order statistics. A process can also be higher-order CS if higher-order correlation functions are periodic in $t$, and strict-sense CS if the probability density function is periodic in $t$.

Adding two uncorrelated CS processes with periods $T_1$ and $T_2$ produces another CS process whose period is the least common multiple (LCM) of $T_1$ and $T_2$. However, if the two periods are incommensurate, e.g., $T_1 = 1$ and $T_2 = \pi$, then the LCM is $\infty$. Such a process is CS in a generalized sense, which is called almost CS because its correlation function is an almost periodic function in the sense of Bohr. For fixed $\tau$, the correlation and complementary correlation functions of an almost CS process are each a possibly infinite sum of periodic functions. Many of the results for CS processes generalize to almost CS processes.

10.1.1

Cyclic power spectral density

We now discuss the spectral properties of harmonizable CS processes, which possess a Cramér–Loève (CL) spectral representation (cf. Section 9.2). The spectral correlation and complementary correlation are the two-dimensional Fourier transforms of the time correlation and complementary correlation function. However, since $r_{xx}(t,\tau)$ and $\tilde{r}_{xx}(t,\tau)$ are periodic in $t$, we compute the Fourier series coefficients rather than the Fourier transform in $t$. These Fourier series coefficients are

\[
p_{xx}(n/T,\tau) = \frac{1}{T}\int_0^T r_{xx}(t,\tau)\,e^{-j2\pi nt/T}\,dt, \tag{10.3}
\]

\[
\tilde{p}_{xx}(n/T,\tau) = \frac{1}{T}\int_0^T \tilde{r}_{xx}(t,\tau)\,e^{-j2\pi nt/T}\,dt, \tag{10.4}
\]

and they are called the cyclic correlation function and cyclic complementary correlation function, respectively. There is a subtle difference between the cyclic correlation function and the ambiguity function, which is defined as the Fourier transform of the correlation function in $t$ (see Section 9.2.1). The cyclic correlation function and the ambiguity function have different physical units.

The Fourier transform of the cyclic correlation and cyclic complementary correlation functions in $\tau$ yields the cyclic power spectral density and cyclic complementary power spectral density:

\[
P_{xx}(n/T,f) = \int_{-\infty}^{\infty} p_{xx}(n/T,\tau)\,e^{-j2\pi f\tau}\,d\tau, \tag{10.5}
\]

\[
\tilde{P}_{xx}(n/T,f) = \int_{-\infty}^{\infty} \tilde{p}_{xx}(n/T,\tau)\,e^{-j2\pi f\tau}\,d\tau. \tag{10.6}
\]

The appropriate generalization for the cyclic correlation and complementary correlation of almost CS processes is

\[
p_{xx}(\nu_n,\tau) = \lim_{T\to\infty}\frac{1}{T}\int_{-T/2}^{T/2} r_{xx}(t,\tau)\,e^{-j2\pi\nu_n t}\,dt, \tag{10.7}
\]

\[
\tilde{p}_{xx}(\tilde{\nu}_n,\tau) = \lim_{T\to\infty}\frac{1}{T}\int_{-T/2}^{T/2} \tilde{r}_{xx}(t,\tau)\,e^{-j2\pi\tilde{\nu}_n t}\,dt. \tag{10.8}
\]

The frequency offset variables $\nu_n$ and $\tilde{\nu}_n$ are called the cycle frequencies. The set of cycle frequencies $\{\nu_n\}$ for which $p_{xx}(\nu_n,\tau)$ is not identically zero and the set of cycle frequencies $\{\tilde{\nu}_n\}$ for which $\tilde{p}_{xx}(\tilde{\nu}_n,\tau)$ is not identically zero are both countable (but possibly countably infinite). These two sets do not have to be identical. CS processes are a subclass of almost CS processes in which the frequencies $\{\nu_n\}$ and $\{\tilde{\nu}_n\}$ are contained in a lattice of the form $\{n/T\}$.

Relationships (10.5) and (10.6) remain valid for almost CS processes if $n/T$ is replaced with the cycle frequencies $\nu_n$ and $\tilde{\nu}_n$:

\[
P_{xx}(\nu_n,f) = \int_{-\infty}^{\infty} p_{xx}(\nu_n,\tau)\,e^{-j2\pi f\tau}\,d\tau, \tag{10.9}
\]

\[
\tilde{P}_{xx}(\tilde{\nu}_n,f) = \int_{-\infty}^{\infty} \tilde{p}_{xx}(\tilde{\nu}_n,\tau)\,e^{-j2\pi f\tau}\,d\tau. \tag{10.10}
\]

Harmonizable signals may be expressed as

\[
x(t) = \int_{-\infty}^{\infty} d\xi(f)\,e^{j2\pi ft}. \tag{10.11}
\]


The next result for harmonizable almost CS processes, which follows immediately from the CL spectral representation in Result 9.2, connects the cyclic PSD and C-PSD to the second-order properties of the increments of the spectral process $\xi(f)$.

Result 10.1. The cyclic PSD $P_{xx}(\nu_n,f)$ can be expressed as

\[
P_{xx}(\nu_n,f)\,df = E[d\xi(f)\,d\xi^*(f-\nu_n)], \tag{10.12}
\]

and the cyclic C-PSD $\tilde{P}_{xx}(\tilde{\nu}_n,f)$ as

\[
\tilde{P}_{xx}(\tilde{\nu}_n,f)\,df = E[d\xi(f)\,d\xi(\tilde{\nu}_n-f)]. \tag{10.13}
\]

Moreover, the spectral correlation $S_{xx}(\nu,f)$ and the complementary spectral correlation $\tilde{S}_{xx}(\nu,f)$, defined in Result 9.2, contain the cyclic PSD and C-PSD on $\delta$-ridges parallel to the stationary manifold $\nu = 0$:

\[
S_{xx}(\nu,f) = \sum_n P_{xx}(\nu_n,f)\,\delta(\nu-\nu_n), \tag{10.14}
\]

\[
\tilde{S}_{xx}(\nu,f) = \sum_n \tilde{P}_{xx}(\tilde{\nu}_n,f)\,\delta(\nu-\tilde{\nu}_n). \tag{10.15}
\]

On the stationary manifold $\nu = 0$, the cyclic PSD is the usual PSD, i.e., $P_{xx}(0,f) = P_{xx}(f)$, and the cyclic C-PSD the usual C-PSD, i.e., $\tilde{P}_{xx}(0,f) = \tilde{P}_{xx}(f)$. It is clear from the bounds in Result 9.3 that it is not possible for the spectral correlation $S_{xx}(\nu,f)$ to have only $\delta$-ridges for $\nu \neq 0$. CS processes must always have a WSS component, i.e., they must have a PSD $P_{xx}(f) \not\equiv 0$. However, it is possible that a process has cyclic PSD $P_{xx}(\nu,f) \equiv 0$ for all $\nu \neq 0$ but cyclic C-PSD $\tilde{P}_{xx}(\tilde{\nu}_n,f) \not\equiv 0$ for some $\tilde{\nu}_n \neq 0$.

Example 10.1. Let $u(t)$ be a real WSS random process with PSD $P_{uu}(f)$. The complex-modulated process $x(t) = u(t)e^{j2\pi f_0 t}$ has PSD $P_{xx}(f) = P_{uu}(f - f_0)$ and cyclic PSD $P_{xx}(\nu,f) \equiv 0$ for all $\nu \neq 0$. However, $x(t)$ is CS rather than WSS because the cyclic C-PSD is nonzero outside the stationary manifold for $\nu = 2f_0$: $\tilde{P}_{xx}(2f_0,f) = P_{uu}(f - f_0)$.

Unfortunately, there is not universal agreement on the nomenclature for CS processes. It is also quite common to call the cyclic PSD $P_{xx}(\nu_n,f)$ the "spectral correlation," a term that we use for $S_{xx}(\nu,f)$. In the CS literature, one can often find the statement "only (almost) CS processes can exhibit spectral correlation." This can thus be translated as "only (almost) CS processes can have $P_{xx}(\nu_n,f) \not\equiv 0$ for $\nu_n \neq 0$," which means that $S_{xx}(\nu,f)$ has $\delta$-ridges outside the stationary manifold $\nu = 0$. However, there also exist processes whose $S_{xx}(\nu,f)$ have $\delta$-ridges that are curves (not necessarily straight lines) in the $(\nu,f)$-plane. These processes have been termed "generalized almost CS" and were first investigated by Izzo and Napolitano (1998).
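The behaviour described in Example 10.1 can be observed in a short simulation. The sketch below is a discrete-time analogue, which is an assumption on our part: $u$ is a white real Gaussian sequence, the carrier `f0` is given in cycles per sample, and the helper `cyclic_corr` (a hypothetical name) replaces the expectation at lag $\tau = 0$ by a time average over one long realization.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000
f0 = 0.1                                  # carrier in cycles/sample (assumed)
k = np.arange(N)
u = rng.standard_normal(N)                # real, white, unit-variance sequence
x = u * np.exp(2j * np.pi * f0 * k)       # complex-modulated process

def cyclic_corr(x, nu, conjugate=True):
    """Time-average estimate of the cyclic correlation at lag 0.
    conjugate=True  -> Hermitian version,     average of x x* e^{-j2 pi nu k}
    conjugate=False -> complementary version, average of x x  e^{-j2 pi nu k}"""
    k = np.arange(len(x))
    prod = x * np.conj(x) if conjugate else x * x
    return np.mean(prod * np.exp(-2j * np.pi * nu * k))

p_herm_0 = cyclic_corr(x, 0.0)                         # power, close to 1
p_herm_2f0 = cyclic_corr(x, 2 * f0)                    # close to 0
p_comp_0 = cyclic_corr(x, 0.0, conjugate=False)        # close to 0
p_comp_2f0 = cyclic_corr(x, 2 * f0, conjugate=False)   # close to 1
```

Only the complementary quantity shows a component at the cycle frequency $2f_0$, while the Hermitian quantity is concentrated on the stationary manifold, in line with the example.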

10.1.2

Cyclic spectral coherence

The cyclic spectral coherence quantifies how much the frequency-shifted spectral process $\xi(f - \nu_n)$ tells us about the spectral process $\xi(f)$. It is defined as the cyclic PSD, normalized by the PSDs at $f$ and $f - \nu_n$. There is also a complementary version.

Definition 10.2. The rotational cyclic spectral coherence of an (almost) CS process is defined as

\[
\rho_{xx}(\nu_n,f) = \frac{E[d\xi(f)\,d\xi^*(f-\nu_n)]}{\sqrt{E|d\xi(f)|^2\,E|d\xi(f-\nu_n)|^2}} = \frac{P_{xx}(\nu_n,f)}{\sqrt{P_{xx}(f)\,P_{xx}(f-\nu_n)}} \tag{10.16}
\]

and the reflectional cyclic spectral coherence is

\[
\tilde{\rho}_{xx}(\tilde{\nu}_n,f) = \frac{E[d\xi(f)\,d\xi(\tilde{\nu}_n-f)]}{\sqrt{E|d\xi(f)|^2\,E|d\xi(\tilde{\nu}_n-f)|^2}} = \frac{\tilde{P}_{xx}(\tilde{\nu}_n,f)}{\sqrt{P_{xx}(f)\,P_{xx}(\tilde{\nu}_n-f)}}. \tag{10.17}
\]

The terms "rotational" and "reflectional" are borrowed from the language of Chapter 4, which discusses rotational and reflectional correlation coefficients. Correlation coefficients in the frequency domain are usually called coherences. It is also possible to define a total cyclic spectral coherence, which takes into account both Hermitian and complementary correlations.

The squared magnitudes of the cyclic spectral coherences are bounded as $0 \le |\rho_{xx}(\nu_n,f)|^2 \le 1$ and $0 \le |\tilde{\rho}_{xx}(\tilde{\nu}_n,f)|^2 \le 1$. If $|\rho_{xx}(\nu_n,f)|^2 = 1$ for a given $(\nu_n,f)$, then $d\xi(f)$ can be perfectly linearly estimated from $d\xi(f-\nu_n)$. If $|\tilde{\rho}_{xx}(\tilde{\nu}_n,f)|^2 = 1$ for a given $(\tilde{\nu}_n,f)$, then $d\xi(f)$ can be perfectly linearly estimated from $d\xi^*(\tilde{\nu}_n-f)$. Note that, for $\tilde{\nu}_n = 0$, $\tilde{\rho}_{xx}(0,f)$ is the coherence $\tilde{\rho}_{xx}(f)$ in Definition 8.2. It is also clear that $\rho_{xx}(0,f) = 1$.

10.1.3

Estimating the cyclic power-spectral density

In practice, the cyclic correlation function and cyclic PSD must be estimated from realizations of the random process. For ergodic WSS processes, ensemble averages can be replaced by time averages over a single realization. Under certain conditions (Note 1) this is also possible for CS processes. For simplicity, we will consider only the Hermitian cyclic correlation and PSD in this section. Expressions for the complementary quantities follow straightforwardly.

Since the correlation function is $T$-periodic, we need to perform synchronized $T$-periodic time-averaging. The correlation function is thus estimated as

\[
\hat{r}_{xx}(t,\tau) = \lim_{K\to\infty}\frac{1}{2K+1}\sum_{k=-K}^{K} x(t+kT+\tau)\,x^*(t+kT). \tag{10.18}
\]

In essence, this estimator regards each period of the process as a separate realization. Consequently, an estimate of the cyclic correlation function is obtained by

\[
\hat{p}_{xx}(n/T,\tau) = \frac{1}{T}\int_0^T \hat{r}_{xx}(t,\tau)\,e^{-j2\pi nt/T}\,dt = \lim_{T'\to\infty}\frac{1}{T'}\int_{-T'/2}^{T'/2} x(t+\tau)\,x^*(t)\,e^{-j2\pi nt/T}\,dt. \tag{10.19}
\]


This estimator has the form of (10.7) without the expected value operator. It can also be used for almost CS processes if $n/T$ is replaced with $\nu_n$.

A straightforward estimator of the cyclic PSD is the cyclic periodogram

\[
\mathcal{P}_{xx}(\nu_n,f;t_0,T) = \frac{1}{T}\,X(f;t_0,T)\,X^*(f-\nu_n;t_0,T), \tag{10.20}
\]

where

\[
X(f;t_0,T) = \int_{t_0-T/2}^{t_0+T/2} x(t)\,e^{-j2\pi ft}\,dt \tag{10.21}
\]

is the windowed Fourier transform of $x(t)$ of length $T$, centered at $t_0$. This estimator is asymptotically unbiased as $T \to \infty$ but not consistent. A consistent estimator of the cyclic PSD is the infinitely time-smoothed cyclic periodogram

\[
\hat{P}_{xx}^{t}(\nu_n,f) = \lim_{\Delta f\to 0}\,\lim_{T\to\infty}\frac{1}{T}\int_{t_0-T/2}^{t_0+T/2} \mathcal{P}_{xx}(\nu_n,f;t,1/\Delta f)\,dt \tag{10.22}
\]

or the infinitely frequency-smoothed cyclic periodogram

\[
\hat{P}_{xx}^{f}(\nu_n,f) = \lim_{\Delta f\to 0}\,\lim_{T\to\infty}\frac{1}{\Delta f}\int_{f-\Delta f/2}^{f+\Delta f/2} \mathcal{P}_{xx}(\nu_n,f';t_0,T)\,df'. \tag{10.23}
\]

The order of the two limits cannot be interchanged. It can be shown that the two estimators $\hat{P}_{xx}^{t}(\nu_n,f)$ and $\hat{P}_{xx}^{f}(\nu_n,f)$ are equivalent (Note 2).
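A finite-length, discrete-time counterpart of the frequency-smoothed cyclic periodogram (10.20) and (10.23) can be sketched with the FFT. This is an illustration under assumptions rather than the estimator of the text: the limits are replaced by a fixed record length and a fixed moving-average width `n_smooth` (assumed odd), the smoothing is circular in frequency, and the test signal is a BPSK sequence with a rectangular pulse of `sps` samples per symbol. The function name is hypothetical.

```python
import numpy as np

def smoothed_cyclic_periodogram(x, nu, n_smooth):
    """Frequency-smoothed cyclic periodogram on the FFT grid.
    nu is the cycle frequency in cycles/sample; n_smooth must be odd."""
    N = len(x)
    k = np.arange(N)
    X = np.fft.fft(x)                                        # X(f)
    X_shifted = np.fft.fft(x * np.exp(2j * np.pi * nu * k))  # X(f - nu)
    I = X * np.conj(X_shifted) / N                           # cyclic periodogram
    pad = n_smooth // 2
    I_ext = np.concatenate([I[-pad:], I, I[:pad]])           # circular extension
    return np.convolve(I_ext, np.ones(n_smooth) / n_smooth, mode="valid")

rng = np.random.default_rng(1)
sps, n_sym = 8, 4096                      # symbol interval T = sps samples
d = rng.choice([-1.0, 1.0], size=n_sym)   # uncorrelated BPSK symbols
x = np.repeat(d, sps)                     # rectangular transmit pulse
N = len(x)
bin_f = N // 16                           # frequency f = 1/(2T)

# Magnitude at the cycle frequency 1/T and at a non-cycle frequency 0.5/T.
on_cycle = abs(smoothed_cyclic_periodogram(x, 1 / sps, 1025)[bin_f])
off_cycle = abs(smoothed_cyclic_periodogram(x, 0.5 / sps, 1025)[bin_f])
```

The estimate at the symbol-rate cycle frequency is clearly nonzero, while the estimate at a frequency offset that is not a multiple of $1/T$ shrinks as the smoothing width grows. Widening the smoothing window reduces variance at the cost of frequency resolution, which is the finite-data trade-off hidden in the double limit.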

10.2

Linearly modulated digital communication signals

In this section, we derive the cyclic PSD and C-PSD of linearly modulated digital communication signals such as Pulse Amplitude Modulation (PAM), Phase Shift Keying (PSK), and Quadrature Amplitude Modulation (QAM) (Note 3). We will find that there are two sources of cyclostationarity: one is related to the repetition of pulses at the symbol interval, the other to impropriety and carrier modulation.

10.2.1

Symbol-rate-related cyclostationarity

An expression for a digitally modulated complex baseband signal is

\[
x(t) = \sum_{k=-\infty}^{\infty} d_k\,b(t-kT), \tag{10.24}
\]

where $b(t)$ is the deterministic transmit pulse shape, $T$ the symbol interval ($1/T$ is the symbol rate), and $d_k$ a complex-valued WSS random data sequence with correlation $r_{dd}[\kappa] = E[d_{k+\kappa}d_k^*]$ and complementary correlation $\tilde{r}_{dd}[\kappa] = E[d_{k+\kappa}d_k]$. The correlation function of $x(t)$ is

\[
r_{xx}(t,\tau) = \sum_{k=-\infty}^{\infty}\sum_{m=-\infty}^{\infty} E[d_k d_m^*]\,b(t+\tau-kT)\,b^*(t-mT). \tag{10.25}
\]


With the variable substitution $\kappa = k - m$, we obtain

\[
r_{xx}(t,\tau) = \sum_{\kappa=-\infty}^{\infty} r_{dd}[\kappa] \sum_{m=-\infty}^{\infty} b(t+\tau-\kappa T-mT)\,b^*(t-mT). \tag{10.26}
\]

It is easy to see that $r_{xx}(t,\tau) = r_{xx}(t+T,\tau)$, and a similar development for the complementary correlation shows that $\tilde{r}_{xx}(t,\tau) = \tilde{r}_{xx}(t+T,\tau)$. Thus, $x(t)$ is CS with period $T$. The cyclic correlation function is

\[
\begin{aligned}
p_{xx}(n/T,\tau) &= \frac{1}{T}\int_0^T r_{xx}(t,\tau)\,e^{-j2\pi nt/T}\,dt \\
&= \frac{1}{T}\sum_{\kappa=-\infty}^{\infty} r_{dd}[\kappa] \sum_{m=-\infty}^{\infty}\int_0^T b(t+\tau-\kappa T-mT)\,b^*(t-mT)\,e^{-j2\pi nt/T}\,dt \\
&= \frac{1}{T}\sum_{\kappa=-\infty}^{\infty} r_{dd}[\kappa] \int_{-\infty}^{\infty} b(t+\tau-\kappa T)\,b^*(t)\,e^{-j2\pi nt/T}\,dt.
\end{aligned} \tag{10.27}
\]

Now we define

\[
\phi(n/T,\tau) = \int_{-\infty}^{\infty} b(t+\tau)\,b^*(t)\,e^{-j2\pi nt/T}\,dt, \tag{10.28}
\]

whose Fourier transform $\Phi(n/T,f)$ is the spectral correlation of the deterministic pulse shape $b(t) \longleftrightarrow B(f)$:

\[
\Phi(n/T,f) = B(f)\,B^*(f-n/T). \tag{10.29}
\]

Inserting (10.28) into (10.27) yields

\[
p_{xx}(n/T,\tau) = \frac{1}{T}\sum_{\kappa=-\infty}^{\infty} r_{dd}[\kappa]\,\phi(n/T,\tau-\kappa T). \tag{10.30}
\]

Its Fourier transform in $\tau$ is the cyclic PSD

\[
P_{xx}(n/T,f) = \frac{1}{T}P_{dd}(f)\,\Phi(n/T,f) = \frac{1}{T}P_{dd}(f)\,B(f)\,B^*(f-n/T), \tag{10.31}
\]

where $P_{dd}(f)$ is the PSD of the data sequence $d_k$, which is the discrete-time Fourier transform of $r_{dd}[\kappa]$. For $n = 0$, $P_{xx}(0,f)$ is the PSD. A completely analogous development leads to the cyclic C-PSD

\[
\tilde{P}_{xx}(n/T,f) = \frac{1}{T}\tilde{P}_{dd}(f)\,B(f)\,B(n/T-f), \tag{10.32}
\]

where $\tilde{P}_{dd}(f)$ is the C-PSD of the data sequence $d_k$. For $n = 0$, $\tilde{P}_{xx}(0,f)$ is the C-PSD. Applying these findings to the special case of an uncorrelated data sequence leads to the following result.

Result 10.2. If the data sequence $d_k$ is uncorrelated with zero mean, variance $\sigma_d^2$, and complementary variance $\tilde{\sigma}_d^2$, then the cyclic PSD and cyclic C-PSD of the linearly modulated baseband signal (10.24) are

\[
P_{xx}(n/T,f) = \frac{\sigma_d^2}{T}\,B(f)\,B^*(f-n/T), \tag{10.33}
\]

\[
\tilde{P}_{xx}(n/T,f) = \frac{\tilde{\sigma}_d^2}{T}\,B(f)\,B(n/T-f). \tag{10.34}
\]

We see that the transmit pulse b(t) determines not only the bandwidth of x(t) but also the number of cycle frequencies for which the cyclic PSD and C-PSD are nonzero.

Example 10.2. In Quaternary Phase Shift Keying (QPSK), the transmitted data are uncorrelated, equally likely $d_k \in \{\pm 1, \pm j\}$, so that $\sigma_d^2 = 1$ and $\tilde{\sigma}_d^2 = 0$. This produces a proper baseband signal $x(t)$. We will consider two transmit pulse shapes. The first is the rectangular pulse

\[
b_1(t) = \begin{cases} 1, & |t| < T/2, \\ \tfrac{1}{2}, & |t| = T/2, \\ 0, & |t| > T/2 \end{cases}
\]

with Fourier transform $B_1(f) = \sin(\pi fT)/(\pi f)$. Another transmit pulse commonly employed in digital communications is the Nyquist roll-off (or raised-cosine) pulse with roll-off factor $0 \le \alpha \le 1$:

\[
b_2(t) = \operatorname{sinc}\!\left(\frac{t}{T}\right)\frac{\cos(\alpha\pi t/T)}{1 - 4\alpha^2(t/T)^2}. \tag{10.35}
\]

Its Fourier transform is

\[
B_2(f) = \begin{cases} T, & |f| \le \dfrac{1-\alpha}{2T}, \\[2mm] \dfrac{T}{2}\left[1 - \sin\!\left(\dfrac{\pi T}{\alpha}\left(|f| - \dfrac{1}{2T}\right)\right)\right], & \dfrac{1-\alpha}{2T} < |f| \le \dfrac{1+\alpha}{2T}, \\[2mm] 0, & \text{else.} \end{cases} \tag{10.36}
\]

Figure 10.1(a) shows the magnitude of the cyclic PSD of QPSK for the rectangular pulse $b_1(t)$, and Fig. 10.1(b) is for the Nyquist pulse $b_2(t)$ with roll-off factor $\alpha = 1$. The rectangular pulse $b_1(t)$ is not bandlimited, so the cyclic PSD is spread out over the entire $(\nu, f)$-plane. The Nyquist pulse $b_2(t)$, on the other hand, has a limited two-sided bandwidth of $2/T$, and the cyclic PSD is nonzero only for $\{\nu_n\} = \{-1/T, 0, 1/T\}$. Since QPSK signals are proper, the cyclic complementary PSD is identically zero.

In Binary Phase Shift Keying (BPSK), the transmitted data are equally likely $d_k \in \{\pm 1\}$, so that $\sigma_d^2 = \tilde\sigma_d^2 = 1$. The rectangular pulse and the Nyquist roll-off pulse are both real-valued, which means $B(f) = B^*(-f)$. Therefore, the BPSK baseband signal is maximally improper and the cyclic C-PSD equals the cyclic PSD: $\tilde P_{xx}(n/T, f) = P_{xx}(n/T, f)$.
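The statement that the Nyquist pulse with $\alpha = 1$ leaves only the cycle frequencies $\{-1/T, 0, 1/T\}$ can be checked numerically. The following is a minimal sketch, assuming $T = 1$, $\sigma_d^2 = 1$ and $\alpha > 0$; the function names and the frequency grid are illustrative choices, not part of the text.

```python
import numpy as np

def raised_cosine_spectrum(f, T=1.0, alpha=1.0):
    """B2(f) of the Nyquist roll-off pulse as in (10.36); assumes alpha > 0."""
    f = np.abs(np.asarray(f, dtype=float))
    lo, hi = (1 - alpha) / (2 * T), (1 + alpha) / (2 * T)
    B = np.zeros_like(f)
    B[f <= lo] = T
    roll = (f > lo) & (f <= hi)
    B[roll] = T / 2 * (1 - np.sin(np.pi * T / alpha * (f[roll] - 1 / (2 * T))))
    return B

def cyclic_psd(n, f, T=1.0, alpha=1.0, var_d=1.0):
    """Cyclic PSD (10.33): (var_d / T) B(f) B*(f - n/T); B2 is real-valued."""
    B = raised_cosine_spectrum(f, T, alpha)
    B_shift = raised_cosine_spectrum(np.asarray(f) - n / T, T, alpha)
    return var_d / T * B * B_shift

f = np.linspace(-2.0, 2.0, 2001)  # illustrative frequency grid
peak = {n: float(np.max(np.abs(cyclic_psd(n, f)))) for n in range(-2, 3)}
```

In this sketch `peak[n]` is the largest magnitude of the cyclic PSD along $f$ for cycle frequency $n/T$; it vanishes (up to rounding) for $|n| \ge 2$ because the supports of $B(f)$ and $B(f - n/T)$ no longer overlap.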

Figure 10.1 The magnitude of the cyclic power spectral density of a QPSK baseband signal with rectangular transmit pulse (a), and Nyquist roll-off pulse (b), with parameters T = 1, α = 1. Axes: cycle frequency ν and frequency f.

10.2.2 Carrier-frequency-related cyclostationarity

Carrier modulation is an additional source of cyclostationarity if the complex baseband signal $x(t)$ is improper (see Note 4). As explained in Section 1.4, the carrier-modulated bandpass signal is

$$p(t) = \operatorname{Re}\{x(t)e^{j2\pi f_0 t}\}. \tag{10.37}$$

We denote the spectral process of $p(t)$ by $\pi(f)$. Its increments are related to the increments of $\xi(f)$ as

$$d\pi(f) = \tfrac{1}{2}\left[d\xi(f - f_0) + d\xi^*(-f - f_0)\right], \tag{10.38}$$

and the cyclic PSD of $p(t)$ is therefore

$$P_{pp}(\nu, f) = \tfrac{1}{4}\left[P_{xx}(\nu, f - f_0) + P_{xx}^*(-\nu, -f - f_0) + \tilde P_{xx}(\nu - 2f_0, f - f_0) + \tilde P_{xx}^*(-2f_0 - \nu, -f - f_0)\right]. \tag{10.39}$$

We see that the cyclic PSD of the bandpass signal exhibits additional CS components around the cycle frequencies $\nu = \pm 2f_0$ if the baseband signal is improper. Note that, since the modulated signal $p(t)$ is real, its cyclic C-PSD equals the cyclic PSD.

Example 10.3. Figure 10.2 shows the cyclic PSD of a carrier-modulated QPSK signal (a) and BPSK signal (b) using a Nyquist roll-off pulse. Since QPSK has a proper baseband signal, its cyclic PSD has only two peaks, at $\nu = 0$ and $f = \pm f_0$, which correspond to the first two terms in (10.39). BPSK, on the other hand, has a maximally improper baseband signal with $\tilde P_{xx}(\nu, f) = P_{xx}(\nu, f)$. Its cyclic PSD has two additional peaks located at $\nu = 2f_0$, $f = f_0$, and $\nu = -2f_0$, $f = -f_0$, which are due to the last two terms in (10.39).

Figure 10.2 The cyclic power spectral density of a QPSK bandpass signal (a) and a BPSK bandpass signal (b) with Nyquist roll-off pulse, with parameters T = 1, α = 1, f0 = 5. Axes: cycle frequency ν and frequency f.

10.2.3 Cyclostationarity as frequency diversity

Let us compute the rotational cyclic spectral coherence in Definition 10.2 for the baseband signal $x(t)$ in (10.24). Using Result 10.2, we immediately find $|\rho_{xx}(n/T, f)|^2 = 1$ as long as the product $B(f)B^*(f - n/T) \neq 0$. The consequences of this are worth dwelling on. First of all, it means that the baseband signal $x(t)$ is spectrally redundant if its two-sided bandwidth exceeds the Nyquist bandwidth $1/T$. In fact, any nonzero spectral band of width $1/T$ of $d\xi(f)$ can be used to reconstruct the entire process $\xi(f)$ because, for frequencies outside this band, $d\xi(f)$ can be perfectly estimated from frequency-shifted versions $d\xi(f - n/T)$, if $|\rho_{xx}(n/T, f)|^2 = 1$. Every nonzero spectral band of width $1/T$ can be considered a replica of the signal, which contains the entire information about the transmitted data. Thus, CS processes exhibit a form of frequency diversity (see Note 5). Increasing the bandwidth of the transmit pulse $b(t)$ leads to more redundant replicas. This redundancy can be used to combat interference, as illustrated in the following example.

Example 10.4. Consider a baseband QPSK signal, using a Nyquist roll-off pulse with $\alpha = 1$ as described in Example 10.2. This pulse has a two-sided bandwidth of $2/T$ and the spectrum of $x(t)$ thus contains the information about the transmitted data sequence $d_k$ exactly twice. Assume now that there is a narrowband interferer at frequency $f = 1/(2T)$. If this interferer is not CS with period $T$ it will not affect $d\xi(f - 1/T)$ at $f = 1/(2T)$. Hence, $d\xi(f - 1/T)$ can be used to compensate for the effects of the narrowband interferer. This is illustrated in Fig. 10.3.

There is an analogous story for the reflectional cyclic spectral coherence. If we assume a real-valued pulse shape $b(t)$, its Fourier transform satisfies $B(n/T - f) = B^*(f - n/T)$. Thus, as long as the product $B(f)B^*(f - n/T) \neq 0$, we obtain

$$|\tilde\rho_{xx}(n/T, f)|^2 = \frac{|\tilde\sigma_d^2|^2}{\sigma_d^4}, \tag{10.40}$$

Figure 10.3 The solid dark line shows the support of dξ(f), and the dashed dark line the support of frequency-shifted dξ(f − 1/T). The interferer disturbs dξ(f) at f = 1/(2T) and dξ(f − 1/T) at f = 3/(2T). Since dξ(f) and dξ(f − 1/T) correlate perfectly on [0, 1/T], dξ(f − 1/T) can be used to compensate for the effects of the narrowband interferer.

where $\sigma_d^2$ and $\tilde\sigma_d^2$ are the variance and complementary variance of the data sequence $d_k$. Hence, if $d_k$ is maximally improper, there is complete reflectional coherence $|\tilde\rho_{xx}(n/T, f)|^2 = 1$ as long as $B(f)B^*(f - n/T) \neq 0$. Thus, $d\xi(f)$ can be perfectly estimated from frequency-shifted conjugated versions $d\xi^*(n/T - f)$. This spectral redundancy comes on top of the spectral redundancy already discussed above. For instance, the spectrum of baseband BPSK, using a Nyquist roll-off pulse with $\alpha = 1$, contains the information about the transmitted real-valued data sequence $d_k$ exactly four times. On the other hand, if $d_k$ is proper (as in QPSK), then $\tilde\rho_{xx}(n/T, f) = 0$ and there is no additional complementary spectral redundancy that can be exploited using conjugate-linear operations.
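As a small illustration of the ratio of squared complementary variance to squared variance that governs the reflectional coherence, the following sketch evaluates it exactly for the BPSK and QPSK alphabets. It assumes equally likely, zero-mean symbols; the function name is a hypothetical helper introduced here.

```python
import numpy as np

def reflectional_coherence(alphabet):
    """|complementary variance|^2 / variance^2 for equally likely, zero-mean symbols."""
    d = np.asarray(alphabet, dtype=complex)
    var = np.mean(np.abs(d) ** 2)   # variance E|d|^2
    cvar = np.mean(d ** 2)          # complementary variance E[d^2]
    return float(np.abs(cvar) ** 2 / var ** 2)

bpsk = reflectional_coherence([1, -1])            # maximally improper data
qpsk = reflectional_coherence([1, -1, 1j, -1j])   # proper data
```

The BPSK alphabet gives complete reflectional coherence, whereas the QPSK alphabet gives none, in line with the discussion above.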

10.3 Cyclic Wiener filter

The problem discussed in the previous subsection can be generalized to the setting where we observe a random process $y(t)$, with corresponding spectral process $\upsilon(f)$, that is a noisy measurement of a message $x(t)$. Measurement and message are assumed to be individually and jointly (almost) CS. We are interested in constructing a filter that produces an estimate $\hat x(t)$ of the message $x(t)$ from the measurement $y(t)$ by exploiting CS properties. In the frequency domain, such a filter adds suitably weighted spectral components of $y(t)$, which have been frequency-shifted by the cyclic frequencies, to produce an estimate of the spectral process:

$$d\hat\xi(f) = \sum_{n=1}^{N} G_n(f)\,d\upsilon(f - \nu_n) + \sum_{n=1}^{\tilde N} \tilde G_n(f)\,d\upsilon^*(\tilde\nu_n - f). \tag{10.41}$$

Such a filter is called a (widely linear) frequency-shift (FRESH) filter. The corresponding noncausal time-domain estimate is

$$\hat x(t) = \sum_{n=1}^{N} g_n(t) * \left(y(t)e^{j2\pi\nu_n t}\right) + \sum_{n=1}^{\tilde N} \tilde g_n(t) * \left(y^*(t)e^{j2\pi\tilde\nu_n t}\right). \tag{10.42}$$

Since this FRESH filter allows arbitrary frequency shifts $\{\nu_n\}_{n=1}^{N}$ and $\{\tilde\nu_n\}_{n=1}^{\tilde N}$, with $N$ and $\tilde N$ possibly infinite, it can be applied to CS or almost CS signals. Of course, only those frequency shifts that correspond to nonzero cyclic PSD/C-PSD are actually useful.

Figure 10.4 Widely linear frequency-shift (FRESH) filter.

For instance, if $y(t)$ is a noisy QPSK baseband signal, using a Nyquist roll-off pulse as described in Example 10.2, then $\{\nu_n\} = \{-1/T, 0, 1/T\}$ and $\{\tilde\nu_n\} = \emptyset$ because QPSK is proper. The FRESH filter structure is shown in Fig. 10.4. It consists of a bank of linear shift-invariant filters that are applied to frequency-shifted versions of the signal and its conjugate.

We will now discuss how the optimum FRESH filter responses $\{g_n(t)\} \longleftrightarrow \{G_n(f)\}$ and $\{\tilde g_n(t)\} \longleftrightarrow \{\tilde G_n(f)\}$ are determined for minimum mean-squared error estimation. Let us define the $(N + \tilde N)$-dimensional vectors

$$d\boldsymbol\Upsilon(f) = \left[d\upsilon(f - \nu_1), \ldots, d\upsilon(f - \nu_N), d\upsilon^*(\tilde\nu_1 - f), \ldots, d\upsilon^*(\tilde\nu_{\tilde N} - f)\right]^{\mathrm T}, \tag{10.43}$$

$$\mathbf F(f) = \left[G_1(f), \ldots, G_N(f), \tilde G_1(f), \ldots, \tilde G_{\tilde N}(f)\right]. \tag{10.44}$$

If $\{\nu_n\} = \{\tilde\nu_n\}$, $d\boldsymbol\Upsilon(f)$ and $\mathbf F(f)$ both have the structure of augmented vectors. We can now write (10.41) compactly as

$$d\hat\xi(f) = \mathbf F(f)\,d\boldsymbol\Upsilon(f). \tag{10.45}$$

This turns the FRESH filtering problem into a standard Wiener filtering problem. In order to minimize the mean-squared error $E|d\hat\xi(f) - d\xi(f)|^2$ for any given frequency $f$, we need to choose

$$\mathbf F(f) = E\!\left[d\xi(f)\,d\boldsymbol\Upsilon^{\mathrm H}(f)\right] E\!\left[d\boldsymbol\Upsilon(f)\,d\boldsymbol\Upsilon^{\mathrm H}(f)\right]^{\dagger}. \tag{10.46}$$

The vector $E[d\xi(f)\,d\boldsymbol\Upsilon^{\mathrm H}(f)]$ contains the cyclic cross-PSDs and C-PSDs

$$E[d\xi(f)\,d\upsilon^*(f - \nu_n)] = P_{xy}(\nu_n, f)\,df, \tag{10.47}$$

$$E[d\xi(f)\,d\upsilon(\tilde\nu_n - f)] = \tilde P_{xy}(\tilde\nu_n, f)\,df. \tag{10.48}$$

The matrix $E[d\boldsymbol\Upsilon(f)\,d\boldsymbol\Upsilon^{\mathrm H}(f)]$ contains the cyclic PSDs and C-PSDs

$$E[d\upsilon(f - \nu_n)\,d\upsilon^*(f - \nu_m)] = P_{yy}(\nu_m - \nu_n, f - \nu_n)\,df, \tag{10.49}$$

$$E[d\upsilon^*(\tilde\nu_n - f)\,d\upsilon(\tilde\nu_m - f)] = P_{yy}^*(\tilde\nu_n - \tilde\nu_m, \tilde\nu_n - f)\,df, \tag{10.50}$$

$$E[d\upsilon(f - \nu_n)\,d\upsilon(\tilde\nu_m - f)] = \tilde P_{yy}(\tilde\nu_m - \nu_n, f - \nu_n)\,df. \tag{10.51}$$

Equation (10.46) holds for all frequencies $f$. The sets of frequency shifts $\{\nu_n\}$ and $\{\tilde\nu_n\}$ need to be chosen such that (10.46) produces all filter frequency responses $\{G_n(f)\}$ and $\{\tilde G_n(f)\}$ that are not identically zero. With these choices, the optimal FRESH filter (10.41) is called the cyclic Wiener filter (see Note 6). It is easy to see that the cyclic Wiener filter becomes the standard noncausal Wiener filter if measurement and message are jointly WSS.
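The Wiener solution (10.46) has the familiar form "cross-correlation times pseudo-inverse of the observation correlation". The following sketch illustrates that form on a finite-dimensional stand-in: a scalar message and a length-K observation vector replace the spectral increments, and expectations are replaced by sample averages. The mixing weights, noise level and sample size are hypothetical choices made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K, S = 3, 20000  # K observation components, S independent realizations

def crandn(*shape):
    """Unit-variance proper complex Gaussian samples."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

x = crandn(S)                               # message samples
mix = crandn(K)                             # hypothetical mixing weights
Y = np.outer(mix, x) + 0.5 * crandn(K, S)   # noisy observations, one column per realization

r_xY = (x[None, :] * Y.conj()).mean(axis=1)  # sample version of E[x Y^H], length K
R_YY = (Y @ Y.conj().T) / S                  # sample version of E[Y Y^H]
F = r_xY @ np.linalg.pinv(R_YY)              # same form as (10.46)

err = F @ Y - x
orth = (err[None, :] * Y.conj()).mean(axis=1)  # orthogonality principle: E[e Y^H] = 0
mse = float(np.mean(np.abs(err) ** 2))
```

The residual `orth` is zero up to rounding, which is the orthogonality principle that characterizes the minimum mean-squared error solution.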

10.4 Causal filter-bank implementation of the cyclic Wiener filter

We now develop an efficient causal filter-bank implementation of the cyclic Wiener filter for CS discrete-time series. Starting from first principles, we construct a widely linear shift-invariant Multiple-In Multiple-Out (WLSI-MIMO) filter for CS sequences, which is based on a connection between scalar-valued CS and vector-valued WSS processes. This filter suffers from an inherent delay and is thus unsuitable for a causal implementation. This problem is addressed in a modified implementation, which uses a sliding input window and filter branches with staggered execution times. We then show that this implementation is equivalent to FRESH filtering, which we knew it would have to be if it is supposed to be optimum. So what have we gained from this detour? By establishing a connection between FRESH filtering and WLSI-MIMO filtering, we are able to enlist the spectral factorization algorithm from Section 8.3 to find the causal part of the cyclic Wiener filter. The presentation in this section follows Spurbeck and Schreier (2007).

10.4.1

Connection between scalar CS and vector WSS processes Let y[k] be a discrete-time CS process whose correlation function r yy [k, κ] = r yy [k + M, κ] and complementary correlation function r˜yy [k, κ] = r˜yy [k + M, κ] are M-periodic. For simplicity, we assume that the period M is the same integer for both correlation and complementary correlation function. If this is not the case, the sampling rate must be converted so that the sampling times are synchronized with the period of correlation. Such a sampling-rate conversion will not be possible for almost CS signals.

Therefore, our results in this section do not apply directly to almost CS signals even though such an extension exists. We first establish a connection between scalar-valued CS and vector-valued WSS processes (see Note 7). We perform a serial-to-parallel conversion that produces the vector-valued sequence

$$\mathbf y[k] = \begin{bmatrix} y[kM] & y[kM - 1] & \cdots & y[(k-1)M + 1] \end{bmatrix}^{\mathrm T} = \begin{bmatrix} y_0[k] & y_1[k] & \cdots & y_{M-1}[k] \end{bmatrix}^{\mathrm T}. \tag{10.52}$$

For convenience, we index elements in vectors and matrices starting from 0 in this section. The subsequence $y_m[k]$, $m \in \{0, 1, \ldots, M - 1\}$, is obtained by delaying the original CS scalar sequence $y[k]$ by $m$ and then decimating it by a factor of $M$. Each subsequence $y_m[k]$ is WSS and any two subsequences $y_m[k]$ and $y_n[k]$ are also jointly WSS. Hence, the vector-valued time series $\mathbf y[k]$ is WSS with matrix correlation

$$\mathbf R_{yy}[\kappa] = E\!\left[\mathbf y[k + \kappa]\mathbf y^{\mathrm H}[k]\right] = \begin{bmatrix} r_{yy}[0, \kappa M] & r_{yy}[M-1, \kappa M + 1] & \cdots & r_{yy}[1, (\kappa+1)M - 1]\\ r_{yy}^*[M-1, -\kappa M + 1] & r_{yy}[M-1, \kappa M] & \cdots & r_{yy}[1, (\kappa+1)M - 2]\\ \vdots & \vdots & \ddots & \vdots\\ r_{yy}^*[1, (-\kappa+1)M - 1] & r_{yy}^*[1, (-\kappa+1)M - 2] & \cdots & r_{yy}[1, \kappa M] \end{bmatrix} \tag{10.53}$$

and matrix complementary correlation

$$\tilde{\mathbf R}_{yy}[\kappa] = E\!\left[\mathbf y[k + \kappa]\mathbf y^{\mathrm T}[k]\right] = \begin{bmatrix} \tilde r_{yy}[0, \kappa M] & \tilde r_{yy}[M-1, \kappa M + 1] & \cdots & \tilde r_{yy}[1, (\kappa+1)M - 1]\\ \tilde r_{yy}[M-1, -\kappa M + 1] & \tilde r_{yy}[M-1, \kappa M] & \cdots & \tilde r_{yy}[1, (\kappa+1)M - 2]\\ \vdots & \vdots & \ddots & \vdots\\ \tilde r_{yy}[1, (-\kappa+1)M - 1] & \tilde r_{yy}[1, (-\kappa+1)M - 2] & \cdots & \tilde r_{yy}[1, \kappa M] \end{bmatrix}, \tag{10.54}$$

both independent of $k$. As required for a vector-valued WSS time series, the matrix correlation satisfies $\mathbf R_{yy}[\kappa] = \mathbf R_{yy}^{\mathrm H}[-\kappa]$, and the matrix complementary correlation $\tilde{\mathbf R}_{yy}[\kappa] = \tilde{\mathbf R}_{yy}^{\mathrm T}[-\kappa]$. The z-transform of $\mathbf R_{yy}[\kappa]$ will be denoted by $\mathbf P_{yy}(z)$, and the z-transform of $\tilde{\mathbf R}_{yy}[\kappa]$ by $\tilde{\mathbf P}_{yy}(z)$. These can be arranged in the augmented matrix

$$\underline{\mathbf P}_{yy}(z) = \begin{bmatrix} \mathbf P_{yy}(z) & \tilde{\mathbf P}_{yy}(z)\\ \tilde{\mathbf P}_{yy}^*(z^*) & \mathbf P_{yy}^*(z^*) \end{bmatrix}. \tag{10.55}$$
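The serial-to-parallel conversion (10.52) is simple to express in code. The following is a minimal sketch for a finite record, assuming the record is indexed from 0 so that block $k \ge 1$ uses the samples $y[(k-1)M+1], \ldots, y[kM]$; the function name is hypothetical.

```python
import numpy as np

def serial_to_parallel(y, M):
    """Return an (M, K) array whose column k-1 is [y[kM], y[kM-1], ..., y[(k-1)M+1]]^T."""
    y = np.asarray(y)
    K = (len(y) - 1) // M                  # number of complete blocks k = 1, ..., K
    k = np.arange(1, K + 1)
    m = np.arange(M)
    return y[k[None, :] * M - m[:, None]]  # entry [m, k-1] equals y[kM - m]

Y = serial_to_parallel(np.arange(10), M=3)
```

Row $m$ of the result is the subsequence $y_m[k]$, that is, the input delayed by $m$ samples and decimated by $M$.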

Figure 10.5 A widely linear polyphase filter bank for cyclostationary time series. The filter H(z) operates at a reduced rate that is 1/M times the rate of the CS sequence.

10.4.2 Sliding-window filter bank

The correspondence between scalar-valued CS and vector-valued WSS processes allows us to use a WLSI-MIMO filter for CS sequences. As shown in Fig. 10.5, the CS time series is first transformed into a vector-valued WSS process, then MIMO-filtered, and finally transformed back into a CS time series. The MIMO filter operation is

$$\hat{\mathbf x}[k] = \sum_{\kappa} \mathbf H[\kappa]\mathbf y[k - \kappa] + \tilde{\mathbf H}[\kappa]\mathbf y^*[k - \kappa]. \tag{10.56}$$

In the z-domain, the filter output is, in augmented notation,

$$\begin{bmatrix} \hat{\mathbf X}(z)\\ \hat{\mathbf X}^*(z^*) \end{bmatrix} = \begin{bmatrix} \mathbf H(z) & \tilde{\mathbf H}(z)\\ \tilde{\mathbf H}^*(z^*) & \mathbf H^*(z^*) \end{bmatrix} \begin{bmatrix} \mathbf Y(z)\\ \mathbf Y^*(z^*) \end{bmatrix}, \tag{10.57}$$

$$\underline{\hat{\mathbf X}}(z) = \underline{\mathbf H}(z)\,\underline{\mathbf Y}(z). \tag{10.58}$$

Therefore,

$$\underline{\mathbf P}_{\hat x\hat x}(z) = \underline{\mathbf H}(z)\,\underline{\mathbf P}_{yy}(z)\,\underline{\mathbf H}^{\mathrm H}(z^{-*}). \tag{10.59}$$

This filter processes $M$ samples at a time, operating at a reduced rate that is $1/M$ times the rate of the CS sequence. Consequently, the CS output signal $\hat x[k]$ is delayed by $M - 1$ samples with respect to the input signal $y[k]$. While this delay might not matter in a noncausal filter, it is not acceptable in a causal realization. In order to avoid it, we use a filter bank with a sliding input window such that the output subsequence $\hat x_m[k] = \hat x[kM - m]$ is produced as soon as the input subsequence $y_m[k] = y[kM - m]$ is available. This necessitates the use of downsamplers and upsamplers that decimate and expand by $M$ with a phase offset corresponding to $m$ sample periods of the incoming CS sequence $y[k]$. These will be denoted by $\downarrow M(m)$ and $\uparrow M(m)$.

Figure 10.6 Branch m of the widely linear filter bank with sliding input window.

The filter bank consists of $M$ branches, each handling a single output phase. Figure 10.6 shows the $m$th branch, which produces the subsequence $\hat x_m[k]$. It operates on the input vector sequence

$$\mathbf y^{(m)}[k] = \begin{bmatrix} y[kM - m]\\ y[kM - m - 1]\\ \vdots\\ y[(k-1)M - m + 1] \end{bmatrix}, \tag{10.60}$$

which can be more conveniently expressed in the z-domain as

$$\mathbf Y^{(m)}(z) = \mathbf D^{(m)}(z)\mathbf Y(z) \tag{10.61}$$

with

$$\mathbf D^{(m)}(z) = \begin{bmatrix} \mathbf 0 & \mathbf I_{M-m}\\ z^{-1}\mathbf I_m & \mathbf 0 \end{bmatrix}. \tag{10.62}$$

Each branch thus contains $M$ filter pairs, each of which handles a single input phase. Specifically, the filter pair $H_n^{(m)}(z)$ and $\tilde H_n^{(m)}(z)$ determines the contribution from input phase $m + n$ to output phase $m$. All filter branches operate at a reduced rate that is $1/M$ times the sampling rate of the CS sequence. However, the execution times of the branches are staggered by the sampling interval of the CS sequence.
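The action of $\mathbf D^{(m)}(z)$ can be illustrated on a finite record: the sliding-window vector for phase $m$ consists of the last $M - m$ entries of the current block followed by the first $m$ entries of the previous block. The following sketch, with hypothetical function names and 0-based indexing of the record, compares the two constructions.

```python
import numpy as np

def block(y, M, k, m=0):
    """Sliding-window vector with entries y[kM-m], y[kM-m-1], ..., y[(k-1)M-m+1]."""
    idx = k * M - m - np.arange(M)
    return np.asarray(y)[idx]

def shifted_block(y, M, k, m):
    """Same vector assembled from blocks k and k-1, as the shift matrix prescribes."""
    cur, prev = block(y, M, k), block(y, M, k - 1)
    return np.concatenate([cur[m:], prev[:m]])  # [0 I_{M-m}] acts on cur, z^{-1} I_m on prev

y = np.arange(20)
M, k = 4, 3
pairs = [(block(y, M, k, m), shifted_block(y, M, k, m)) for m in range(M)]
```

For every phase offset the two vectors in `pairs` coincide; the indices used here require $k \ge 2$ so that the previous block exists in the record.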

10.4.3 Equivalence to FRESH filtering

The sliding-window filter bank is an efficient equivalent implementation of a FRESH filter. This can be shown along the lines of Ferrara (1985). We first combine the filters $H_0^{(m)}(z), \ldots, H_{M-1}^{(m)}(z)$ and $\tilde H_0^{(m)}(z), \ldots, \tilde H_{M-1}^{(m)}(z)$, which are operating on $\mathbf y^{(m)}[k]$ at a rate that is $1/M$ times the rate of $y[k]$, into filters $F_m(z)$ and $\tilde F_m(z)$, which are operating on $y[k]$ at full rate. Owing to the multirate identities, these full-rate filters are

$$F_m(z) = \sum_{n=0}^{M-1} z^{-n} H_n^{(m)}(z^M), \tag{10.63}$$

$$\tilde F_m(z) = \sum_{n=0}^{M-1} z^{-n} \tilde H_n^{(m)}(z^M), \qquad m = 0, \ldots, M - 1. \tag{10.64}$$

The subsequence $\hat x_m[k]$ is obtained by decimating the output of

$$f_m[k] * y[k] + \tilde f_m[k] * y^*[k], \tag{10.65}$$

where $*$ denotes convolution, by a factor of $M$ with a phase offset of $m$. Such decimation is described by multiplication with

$$\delta[(k - m) \bmod M] = \frac{1}{M}\sum_{n=0}^{M-1} e^{j2\pi n(k - m)/M}, \tag{10.66}$$

where $\delta[\cdot]$ denotes the Kronecker delta. Therefore, the spectral process corresponding to $\hat x_m[k]$ has increments

$$d\hat\xi_m(\theta) = \left[F_m(\theta)\,d\upsilon(\theta) + \tilde F_m(\theta)\,d\upsilon^*(-\theta)\right] * \frac{1}{M}\sum_{n=0}^{M-1} e^{-j2\pi nm/M}\,\delta(\theta - 2\pi n/M), \tag{10.67}$$

where $\delta(\cdot)$ is the Dirac delta function, and $F_m(\theta) = F_m(z)|_{z = e^{j\theta}}$. The CS output sequence $\hat x[k]$ is the sum of its subsequences,

$$\hat x[k] = \sum_{m=0}^{M-1} \hat x_m[k], \tag{10.68}$$

and the spectral process has increments

$$d\hat\xi(\theta) = \frac{1}{M}\sum_{n=0}^{M-1}\sum_{m=0}^{M-1} e^{-j2\pi nm/M} F_m(\theta - 2\pi n/M)\,d\upsilon(\theta - 2\pi n/M) + \frac{1}{M}\sum_{n=0}^{M-1}\sum_{m=0}^{M-1} e^{-j2\pi nm/M} \tilde F_m(\theta - 2\pi n/M)\,d\upsilon^*(2\pi n/M - \theta). \tag{10.69}$$
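The decimation identity (10.66), which expresses the phase-selecting Kronecker delta as an average of $M$ complex exponentials, can be checked numerically. The following sketch uses a hypothetical function name and an arbitrary choice of $M$ and $m$.

```python
import numpy as np

def phase_selector(k, m, M):
    """Right-hand side of (10.66): (1/M) * sum_n exp(j 2 pi n (k - m) / M)."""
    n = np.arange(M)
    return np.exp(2j * np.pi * n * (k - m) / M).sum() / M

M, m = 5, 2
values = {k: phase_selector(k, m, M) for k in range(-10, 10)}
```

Each entry of `values` is one when $k - m$ is a multiple of $M$ and zero otherwise, up to rounding, so multiplying a sequence by this selector keeps only the samples of phase $m$.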

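If we define a new set of filter frequency responses by

$$G_n(\theta) = \frac{1}{M}\sum_{m=0}^{M-1} e^{-j2\pi nm/M} F_m(\theta - 2\pi n/M), \tag{10.70}$$

$$\tilde G_n(\theta) = \frac{1}{M}\sum_{m=0}^{M-1} e^{-j2\pi nm/M} \tilde F_m(\theta - 2\pi n/M), \qquad n = 0, \ldots, M - 1, \tag{10.71}$$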

Figure 10.7 A widely linear frequency shift (FRESH) filter.

we can finally write

$$d\hat\xi(\theta) = \sum_{n=0}^{M-1} G_n(\theta)\,d\upsilon(\theta - 2\pi n/M) + \tilde G_n(\theta)\,d\upsilon^*(2\pi n/M - \theta), \tag{10.72}$$

which is a FRESH filtering operation. The corresponding time-domain filter operation is

$$\hat x[k] = \sum_{n=0}^{M-1} g_n[k] * \left(y[k]e^{j2\pi nk/M}\right) + \tilde g_n[k] * \left(y^*[k]e^{j2\pi nk/M}\right), \tag{10.73}$$

which is depicted in Fig. 10.7.
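The time-domain FRESH operation of Fig. 10.7 is a sum of ordinary convolutions applied to frequency-shifted copies of the input and of its conjugate. The following is a minimal sketch, assuming causal FIR branch filters supplied as hypothetical coefficient lists `g[n]` and `g_tilde[n]`, with the frequency shift of branch $n$ applied as $e^{j2\pi nk/M}$.

```python
import numpy as np

def fresh_filter(y, g, g_tilde):
    """Widely linear FRESH filter: branch n shifts by n/M, then applies an FIR filter."""
    y = np.asarray(y, dtype=complex)
    M = len(g)
    k = np.arange(len(y))
    out = np.zeros(len(y), dtype=complex)
    for n in range(M):
        shift = np.exp(2j * np.pi * n * k / M)
        out += np.convolve(g[n], y * shift)[: len(y)]                 # linear branch
        out += np.convolve(g_tilde[n], np.conj(y) * shift)[: len(y)]  # conjugate-linear branch
    return out

y = np.arange(6) + 1j
same = fresh_filter(y, [[1.0], [0.0]], [[0.0], [0.0]])     # only the unshifted linear branch
flipped = fresh_filter(y, [[0.0], [1.0]], [[0.0], [0.0]])  # only the shift-by-1/2 branch
conj = fresh_filter(y, [[0.0], [0.0]], [[1.0], [0.0]])     # only the unshifted conjugate branch
```

With $M = 2$, the single active branch in each call returns the input itself, the input modulated by $(-1)^k$, and the conjugated input, respectively.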

10.4.4 Causal approximation

We can now proceed to the causal approximation of a cyclic Wiener filter. Let $\hat x[k]$ be the widely linear estimate of the CS message $x[k]$ based on the noisy observation $y[k]$. Furthermore, let $\hat{\mathbf x}[k]$, $\mathbf x[k]$, and $\mathbf y[k]$ denote the equivalent WSS vector representations. Given the correlation and complementary correlation functions $r_{yy}[k, \kappa]$, $\tilde r_{yy}[k, \kappa]$, $r_{xy}[k, \kappa]$, and $\tilde r_{xy}[k, \kappa]$, we first compute the augmented spectral density matrices $\underline{\mathbf P}_{yy}(z)$ and $\underline{\mathbf P}_{xy}(z)$ of their associated WSS sequences, along the lines of Section 10.4.1.

The polyphase filter structure depicted in Fig. 10.5 cannot be used to produce a delay-free, causal estimate $\hat x[k]$. The WLSI-MIMO filter in this structure operates at a rate that is $1/M$ times the rate of the incoming CS data sequence. Therefore, the estimated sequence $\hat x[k]$ will be delayed by $M - 1$ with respect to the input $y[k]$. Moreover, the estimate $\hat x[k]$ is not causal because the subsequence $\hat x_m[k] = \hat x[kM - m]$ depends on the future values $y[kM - m + 1], \ldots, y[kM - 1]$. The only exception to this is the subsequence $\hat x_0[k]$, whose sampling phase matches the phase of the decimators.

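To circumvent this problem, we use the filter bank with sliding input window whose $m$th branch is shown in Fig. 10.6. The subsequence estimate $\hat x_m[k]$ is based on the vector sequence $\mathbf y^{(m)}[k]$ with a phase offset of $m$ samples relative to $\mathbf y[k]$, as defined in (10.60)–(10.62). The vector sequence $\mathbf y^{(m)}[k]$ is still WSS with augmented spectral density matrix

$$\underline{\mathbf P}_{yy}^{(m)}(z) = \begin{bmatrix} \mathbf D^{(m)}(z) & \mathbf 0\\ \mathbf 0 & \mathbf D^{(m)}(z) \end{bmatrix} \underline{\mathbf P}_{yy}(z) \begin{bmatrix} \mathbf D^{(m)}(z^{-1}) & \mathbf 0\\ \mathbf 0 & \mathbf D^{(m)}(z^{-1}) \end{bmatrix}^{\mathrm T}. \tag{10.74}$$

We may now utilize our results from Section 8.3 to spectrally factor

$$\underline{\mathbf P}_{yy}^{(m)}(z) = \underline{\mathbf A}_{yy}^{(m)}(z)\left(\underline{\mathbf A}_{yy}^{(m)}\right)^{\mathrm H}(z^{-*}), \tag{10.75}$$

where $\underline{\mathbf A}_{yy}^{(m)}(z)$ is minimum phase, meaning that both $\underline{\mathbf A}_{yy}^{(m)}(z)$ and its inverse are causal and stable. The cross-spectral density matrix $\underline{\mathbf P}_{xy}^{(m)}(z)$ is found analogously to (10.74),

$$\underline{\mathbf P}_{xy}^{(m)}(z) = \begin{bmatrix} \mathbf D^{(m)}(z) & \mathbf 0\\ \mathbf 0 & \mathbf D^{(m)}(z) \end{bmatrix} \underline{\mathbf P}_{xy}(z) \begin{bmatrix} \mathbf D^{(m)}(z^{-1}) & \mathbf 0\\ \mathbf 0 & \mathbf D^{(m)}(z^{-1}) \end{bmatrix}^{\mathrm T}. \tag{10.76}$$

The transfer function of a causal WLMMSE filter for the augmented WSS vector sequence $\underline{\mathbf y}^{(m)}[k]$ can now be obtained as

$$\underline{\mathbf H}^{(m)}(z) = \left[\underline{\mathbf P}_{xy}^{(m)}(z)\left(\underline{\mathbf A}_{yy}^{(m)}\right)^{-\mathrm H}(z^{-*})\right]_+ \left(\underline{\mathbf A}_{yy}^{(m)}\right)^{-1}(z), \tag{10.77}$$

where $[\cdot]_+$ denotes the causal part. Again, only the subsequence $\hat x_m[k]$ produced by this filter is a delay-free, causal MMSE estimate. Hence, for each polyphase matrix $\underline{\mathbf H}^{(m)}(z)$, we extract the top row vector, denoted by

$$\begin{bmatrix} H_0^{(m)}(z) & \cdots & H_{M-1}^{(m)}(z) & \tilde H_0^{(m)}(z) & \cdots & \tilde H_{M-1}^{(m)}(z) \end{bmatrix}. \tag{10.78}$$

This vector is the polyphase representation of a causal WLMMSE filter that produces the $m$th subsequence $\hat x_m[k]$ of the CS estimate $\hat x[k]$. This filter constitutes the $m$th branch shown in Fig. 10.6. The causal widely linear MMSE filter for producing $\hat x[k]$ from $y[k]$ comprises a parallel collection of $M$ such filters, whose outputs are summed.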

Notes

1. CS processes for which ensemble averages can be replaced by time averages are sometimes called cycloergodic. An in-depth treatment of cycloergodicity is provided by Boyles and Gardner (1983). Gardner (1988) presents a treatment of cyclostationary processes in the fraction-of-time probabilistic framework, which is entirely based on time averages.

2. A thorough discussion of estimators for the cyclic correlation and PSD, including the proof that the time-smoothed and frequency-smoothed cyclic periodograms are asymptotically equivalent, is provided by Gardner (1986).

3. The cyclic PSDs of several digital communication signals were derived by Gardner et al. (1987). A closer look at the cyclic PSD of continuous-phase modulated (CPM) signals is provided by Napolitano and Spooner (2001).

4. Improper digital communication signals include all real-valued baseband signals such as Pulse Amplitude Modulation (PAM) and Binary Phase Shift Keying (BPSK), but also Offset Quaternary Phase Shift Keying (OQPSK), Minimum Shift Keying (MSK), and Gaussian Minimum Shift Keying (GMSK). GMSK is used in the mobile-phone standard GSM.

5. The term “diversity” is commonly used in mobile communications, where the receiver obtains several replicas of the same information-bearing signal. In “frequency diversity,” the signal is transmitted on several carriers that are sufficiently separated in frequency.

6. FRESH filtering and the cyclic Wiener filter were introduced and analyzed by Gardner (1988, 1993). The cyclic Wiener filter utilizes the frequency diversity of CS signals in a way that is reminiscent of the RAKE receiver in wireless communications (see Proakis (2001) for background on the RAKE receiver).

7. The connection between scalar-valued CS and vector-valued WSS time series was found by Gladyshev (1961). There is a corresponding result for continuous-time CS processes due to Gladyshev (1963). If $x(t)$ is CS with period $T$, and $\theta$ an independent random variable uniformly distributed over $[0, T]$, then $x_m(t) = x(t - \theta)e^{j2\pi m(t - \theta)/T}$ is WSS, and $x_m(t)$ and $x_n(t)$ are jointly WSS.

Appendix 1 Rudiments of matrix analysis

In this appendix, we present a very concise review of some basic results in matrix analysis that are used in this book. As an excellent reference for further reading we recommend Horn and Johnson (1985).

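A1.1 Matrix factorizations

A1.1.1 Partitioned matrices

If $\mathbf A$ and $\mathbf D$ are square matrices and have inverses, there exist the factorizations

$$\underline{\mathbf A} = \begin{bmatrix} \mathbf A & \mathbf B\\ \mathbf C & \mathbf D \end{bmatrix} = \begin{bmatrix} \mathbf I & \mathbf B\mathbf D^{-1}\\ \mathbf 0 & \mathbf I \end{bmatrix} \begin{bmatrix} \mathbf A - \mathbf B\mathbf D^{-1}\mathbf C & \mathbf 0\\ \mathbf 0 & \mathbf D \end{bmatrix} \begin{bmatrix} \mathbf I & \mathbf 0\\ \mathbf D^{-1}\mathbf C & \mathbf I \end{bmatrix} \tag{A1.1}$$

$$= \begin{bmatrix} \mathbf I & \mathbf 0\\ \mathbf C\mathbf A^{-1} & \mathbf I \end{bmatrix} \begin{bmatrix} \mathbf A & \mathbf 0\\ \mathbf 0 & \mathbf D - \mathbf C\mathbf A^{-1}\mathbf B \end{bmatrix} \begin{bmatrix} \mathbf I & \mathbf A^{-1}\mathbf B\\ \mathbf 0 & \mathbf I \end{bmatrix}. \tag{A1.2}$$

The factors $\mathbf A - \mathbf B\mathbf D^{-1}\mathbf C$ and $\mathbf D - \mathbf C\mathbf A^{-1}\mathbf B$ are referred to as the Schur complements of $\mathbf D$ and $\mathbf A$, respectively, within $\underline{\mathbf A}$. These factorizations establish the determinant formulae

$$\det\underline{\mathbf A} = \det(\mathbf A - \mathbf B\mathbf D^{-1}\mathbf C)\det\mathbf D \tag{A1.3}$$

$$= \det\mathbf A\,\det(\mathbf D - \mathbf C\mathbf A^{-1}\mathbf B). \tag{A1.4}$$

A1.1.2 Eigenvalue decomposition

A matrix $\mathbf A \in \mathbb C^{n\times n}$ is normal if it commutes with its Hermitian transpose: $\mathbf A\mathbf A^{\mathrm H} = \mathbf A^{\mathrm H}\mathbf A$. A matrix is Hermitian if it is Hermitian-symmetric: $\mathbf A^{\mathrm H} = \mathbf A$. Any Hermitian matrix is therefore normal. Every normal matrix has the factorization

$$\mathbf A = \mathbf U\boldsymbol\Lambda\mathbf U^{\mathrm H}, \tag{A1.5}$$

called an eigenvalue (or spectral) decomposition (EVD), where $\mathbf U$ is unitary. The columns $\mathbf u_i$ of $\mathbf U$ are called the eigenvectors, and the diagonal matrix $\boldsymbol\Lambda = \operatorname{Diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$ contains the eigenvalues on its diagonal. We will assume the ordering $|\lambda_1| \ge |\lambda_2| \ge \cdots \ge |\lambda_n|$. The number $r$ of nonzero eigenvalues is the rank of $\mathbf A$, and the zero eigenvalues are $\lambda_{r+1} = \lambda_{r+2} = \cdots = \lambda_n = 0$. Thus, $\mathbf A$ may be expressed as

$$\mathbf A = \sum_{i=1}^{r} \lambda_i\mathbf u_i\mathbf u_i^{\mathrm H}, \tag{A1.6}$$

and, for any polynomial $f$,

$$f(\mathbf A) = \sum_{i=1}^{r} f(\lambda_i)\mathbf u_i\mathbf u_i^{\mathrm H}. \tag{A1.7}$$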

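An important fact is that the nonzero eigenvalues of a product of matrices $\mathbf A\mathbf B$ are the nonzero eigenvalues of $\mathbf B\mathbf A$, for arbitrary $\mathbf A$ and $\mathbf B$. The following list contains equivalent statements for different types of normal matrices.

$$\mathbf A = \mathbf A^{\mathrm H} \Leftrightarrow \lambda_i \text{ real for all } i, \tag{A1.8}$$

$$\mathbf A \text{ unitary} \Leftrightarrow |\lambda_i| = 1 \text{ for all } i, \tag{A1.9}$$

$$\mathbf A \text{ positive definite} \Leftrightarrow \lambda_i > 0 \text{ for all } i, \tag{A1.10}$$

$$\mathbf A \text{ positive semidefinite} \Leftrightarrow \lambda_i \ge 0 \text{ for all } i. \tag{A1.11}$$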

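2 × 2 matrix

The two eigenvalues of the 2 × 2 matrix

$$\mathbf A = \begin{bmatrix} a & b\\ c & d \end{bmatrix} \tag{A1.12}$$

are

$$\lambda_{1,2} = \tfrac{1}{2}\operatorname{tr}\mathbf A \pm \tfrac{1}{2}\sqrt{\operatorname{tr}^2\mathbf A - 4\det\mathbf A} \tag{A1.13}$$

and the unit-length eigenvectors are

$$\mathbf u_1 = \frac{1}{\|\mathbf u_1\|}\begin{bmatrix} b\\ \lambda_1 - a \end{bmatrix}, \tag{A1.14}$$

$$\mathbf u_2 = \frac{1}{\|\mathbf u_2\|}\begin{bmatrix} b\\ \lambda_2 - a \end{bmatrix}. \tag{A1.15}$$

A1.1.3 Singular value decomposition

Every matrix $\mathbf A \in \mathbb C^{n\times m}$ has the factorization

$$\mathbf A = \mathbf F\mathbf K\mathbf G^{\mathrm H}, \tag{A1.16}$$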

called a singular value decomposition (SVD), where F ∈ C n× p and G ∈ C m× p , p = min(n, m), both have unitary columns, FH F = I and GH G = I. The columns fi of F are called the left singular vectors, and the columns gi of G are called the right singular vectors. The diagonal matrix K = Diag(k1 , k2 , . . ., k p ) contains the singular values k1 ≥ k2 ≥ · · · ≥ k p ≥ 0 on its diagonal. The number r of nonzero singular values is the rank of A.

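Since $\mathbf A\mathbf A^{\mathrm H} = \mathbf F\mathbf K\mathbf G^{\mathrm H}(\mathbf G\mathbf K\mathbf F^{\mathrm H}) = \mathbf F\mathbf K^2\mathbf F^{\mathrm H}$ and $\mathbf A^{\mathrm H}\mathbf A = (\mathbf G\mathbf K\mathbf F^{\mathrm H})\mathbf F\mathbf K\mathbf G^{\mathrm H} = \mathbf G\mathbf K^2\mathbf G^{\mathrm H}$, the singular values of $\mathbf A$ are the nonnegative roots of the eigenvalues of $\mathbf A\mathbf A^{\mathrm H}$ or $\mathbf A^{\mathrm H}\mathbf A$. So, for normal $\mathbf A$, how are the EVD $\mathbf A = \mathbf U\boldsymbol\Lambda\mathbf U^{\mathrm H}$ and the SVD $\mathbf A = \mathbf F\mathbf K\mathbf G^{\mathrm H}$ related? If we write $\boldsymbol\Lambda = |\boldsymbol\Lambda|\mathbf D$ with $|\boldsymbol\Lambda| = \operatorname{Diag}(|\lambda_1|, |\lambda_2|, \ldots, |\lambda_n|)$ and $\mathbf D = \operatorname{Diag}(\exp(j\angle\lambda_1), \exp(j\angle\lambda_2), \ldots, \exp(j\angle\lambda_n))$ then

$$\mathbf A = \mathbf U\boldsymbol\Lambda\mathbf U^{\mathrm H} = \mathbf U|\boldsymbol\Lambda|(\mathbf U\mathbf D^{\mathrm H})^{\mathrm H} = \mathbf F\mathbf K\mathbf G^{\mathrm H} \tag{A1.17}$$

is an SVD of $\mathbf A$ with $\mathbf F = \mathbf U$, $\mathbf K = |\boldsymbol\Lambda|$, and $\mathbf G = \mathbf U\mathbf D^*$. Therefore, the singular values of $\mathbf A$ are the absolute values of the eigenvalues of $\mathbf A$. If $\mathbf A$ is Hermitian, then the eigenvalues and thus $\mathbf D$ are real, and $\mathbf D = \operatorname{Diag}(\operatorname{sign}(\lambda_1), \operatorname{sign}(\lambda_2), \ldots, \operatorname{sign}(\lambda_n))$. If $\mathbf A$ is Hermitian and positive semidefinite, then $\mathbf F = \mathbf G = \mathbf U$ and $\mathbf K = \boldsymbol\Lambda$. That is, the SVD and EVD of a Hermitian positive semidefinite matrix are identical.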
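The relation between the EVD and the SVD of a normal matrix can be verified numerically. The following sketch builds a normal (non-Hermitian) test matrix from a random unitary matrix and random complex eigenvalues, both hypothetical test data, and forms the SVD factors with the eigenvalue phases collected in the diagonal matrix `D`.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
# Normal test matrix A = U diag(lam) U^H with unitary U from a QR factorization.
U, _ = np.linalg.qr(rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)))
lam = rng.standard_normal(n) + 1j * rng.standard_normal(n)
A = U @ np.diag(lam) @ U.conj().T

D = np.diag(np.exp(1j * np.angle(lam)))          # unit-modulus phases of the eigenvalues
F, K, G = U, np.diag(np.abs(lam)), U @ D.conj()  # F = U, K = |Lambda|, G = U D^*
sv = np.linalg.svd(A, compute_uv=False)
```

Here `F @ K @ G.conj().T` reproduces `A`, and the singular values returned by the library routine agree with the absolute values of the eigenvalues (after sorting, since the test eigenvalues are not ordered by magnitude).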

A1.2

Positive definite matrices Positive definite and semidefinite matrices are important because correlation and covariance matrices are positive (semi)definite. A Hermitian matrix A ∈ C n×n is called positive definite if xH Ax > 0 for all nonzero x ∈ C n

(A1.18)

and positive semidefinite if the weaker condition xH Ax ≥ 0 holds. A Hermitian matrix is positive definite if and only if all of its eigenvalues are positive, and positive semidefinite if and only if all eigenvalues are nonnegative.

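A1.2.1 Matrix square root and Cholesky decomposition

A matrix $\mathbf A$ is positive definite if and only if there exists a nonsingular lower-triangular matrix $\mathbf L$ with positive diagonal entries such that $\mathbf A = \mathbf L\mathbf L^{\mathrm H}$. This factorization is often called the Cholesky decomposition of $\mathbf A$. It determines a square root $\mathbf A^{1/2} = \mathbf L$, which is neither positive definite nor Hermitian. The unique positive semidefinite square root of a positive semidefinite matrix $\mathbf A$ is obtained via the EVD of $\mathbf A = \mathbf U\boldsymbol\Lambda\mathbf U^{\mathrm H}$ as

$$\mathbf A^{1/2} = \mathbf U\boldsymbol\Lambda^{1/2}\mathbf U^{\mathrm H} \tag{A1.19}$$

with $\boldsymbol\Lambda^{1/2} = \operatorname{Diag}(\sqrt{\lambda_1}, \sqrt{\lambda_2}, \ldots, \sqrt{\lambda_n})$.

A1.2.2 Updating the Cholesky factors of a Grammian matrix

Consider the data matrix $\mathbf X_p = [\mathbf x_1, \mathbf x_2, \ldots, \mathbf x_p] \in \mathbb C^{n\times p}$ with $n \ge p$ and positive definite Grammian matrix

$$\mathbf G_p = \mathbf X_p^{\mathrm H}\mathbf X_p = \begin{bmatrix} \mathbf G_{p-1} & \mathbf h_p\\ \mathbf h_p^{\mathrm H} & h_{pp} \end{bmatrix}, \tag{A1.20}$$

where $\mathbf G_{p-1} = \mathbf X_{p-1}^{\mathrm H}\mathbf X_{p-1}$, $\mathbf h_p = \mathbf X_{p-1}^{\mathrm H}\mathbf x_p$ and $h_{pp} = \mathbf x_p^{\mathrm H}\mathbf x_p$. The Cholesky decomposition of $\mathbf G_{p-1}$ is $\mathbf G_{p-1} = \mathbf L_{p-1}\mathbf L_{p-1}^{\mathrm H}$, where $\mathbf L_{p-1}$ is lower-triangular. It follows that the Cholesky factors of $\mathbf G_p$ may be updated as

$$\mathbf G_p = \mathbf L_p\mathbf L_p^{\mathrm H} = \begin{bmatrix} \mathbf L_{p-1} & \mathbf 0\\ \mathbf m_p^{\mathrm T} & m_{pp} \end{bmatrix} \begin{bmatrix} \mathbf L_{p-1}^{\mathrm H} & \mathbf m_p^*\\ \mathbf 0 & m_{pp}^* \end{bmatrix} = \begin{bmatrix} \mathbf G_{p-1} & \mathbf h_p\\ \mathbf h_p^{\mathrm H} & h_{pp} \end{bmatrix}, \tag{A1.21}$$

where $\mathbf m_p$ and $m_{pp}$ solve the following consistency equations:

$$\mathbf L_{p-1}\mathbf m_p^* = \mathbf h_p, \tag{A1.22}$$

$$\mathbf m_p^{\mathrm T}\mathbf m_p^* + |m_{pp}|^2 = h_{pp}. \tag{A1.23}$$

This factorization may then be inverted for $\mathbf G_p^{-1}$ to achieve the upper-triangular–lower-triangular factorization

$$\mathbf G_p^{-1} = \mathbf L_p^{-\mathrm H}\mathbf L_p^{-1} = \begin{bmatrix} \mathbf L_{p-1}^{-\mathrm H} & \mathbf k_p\\ \mathbf 0 & k_{pp} \end{bmatrix} \begin{bmatrix} \mathbf L_{p-1}^{-1} & \mathbf 0\\ \mathbf k_p^{\mathrm H} & k_{pp}^* \end{bmatrix}. \tag{A1.24}$$

From the constraint that $\mathbf L_p^{-1}\mathbf L_p = \mathbf I$, $\mathbf k_p$ and $k_{pp}$ are determined from the consistency equations

$$\mathbf k_p^{\mathrm H}\mathbf L_{p-1} = -k_{pp}^*\mathbf m_p^{\mathrm T}, \tag{A1.25}$$

$$k_{pp}^* m_{pp} = 1. \tag{A1.26}$$

A1.2.3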

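Partial ordering

For $\mathbf A, \mathbf B \in \mathbb C^{n\times n}$, we write $\mathbf A > \mathbf B$ when $\mathbf A - \mathbf B$ is positive definite, and $\mathbf A \ge \mathbf B$ when $\mathbf A - \mathbf B$ is positive semidefinite. This is a partial ordering of the set of $n \times n$ Hermitian matrices. It is partial because we may have $\mathbf A \not\ge \mathbf B$ and $\mathbf B \not\ge \mathbf A$. For $\mathbf A > \mathbf 0$, $\mathbf B \ge \mathbf 0$, and nonsingular $\mathbf C$, we have

$$\mathbf C^{\mathrm H}\mathbf A\mathbf C > \mathbf 0, \tag{A1.27}$$

$$\mathbf C^{\mathrm H}\mathbf B\mathbf C \ge \mathbf 0, \tag{A1.28}$$

$$\mathbf A^{-1} > \mathbf 0, \tag{A1.29}$$

$$\mathbf A + \mathbf B \ge \mathbf A > \mathbf 0, \tag{A1.30}$$

$$\mathbf A \ge \mathbf B \Rightarrow \operatorname{ev}_i(\mathbf A) \ge \operatorname{ev}_i(\mathbf B). \tag{A1.31}$$

The last statement assumes that the eigenvalues of $\mathbf A$ and $\mathbf B$ are each arranged in decreasing order. As a particular consequence, $\det\mathbf A \ge \det\mathbf B$ and $\operatorname{tr}\mathbf A \ge \operatorname{tr}\mathbf B$. The partitioned Hermitian matrix

$$\underline{\mathbf A} = \begin{bmatrix} \mathbf A & \mathbf B\\ \mathbf B^{\mathrm H} & \mathbf D \end{bmatrix}, \tag{A1.32}$$

with square blocks $\mathbf A$ and $\mathbf D$, is positive definite if and only if $\mathbf A > \mathbf 0$ and its Schur complement $\mathbf D - \mathbf B^{\mathrm H}\mathbf A^{-1}\mathbf B > \mathbf 0$, or $\mathbf D > \mathbf B^{\mathrm H}\mathbf A^{-1}\mathbf B$.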
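The Schur-complement criterion for positive definiteness of a partitioned Hermitian matrix translates directly into a test on the blocks. The following sketch compares it against an eigenvalue test on the full matrix; the function names and the small test blocks are hypothetical.

```python
import numpy as np

def is_pd(X):
    """Positive definiteness of a Hermitian matrix via its eigenvalues."""
    return bool(np.all(np.linalg.eigvalsh(X) > 0))

def is_pd_via_schur(A, B, D):
    """Test [[A, B], [B^H, D]] > 0 through A > 0 and D - B^H A^{-1} B > 0."""
    if not is_pd(A):
        return False
    S = D - B.conj().T @ np.linalg.solve(A, B)
    return is_pd((S + S.conj().T) / 2)  # symmetrize against rounding errors

A = np.array([[2.0, 0.0], [0.0, 2.0]], dtype=complex)
B = np.array([[1.0], [1.0j]])
D_good = np.array([[2.0]], dtype=complex)  # Schur complement 2 - 1 = 1 > 0
D_bad = np.array([[0.5]], dtype=complex)   # Schur complement 0.5 - 1 < 0

full_good = np.block([[A, B], [B.conj().T, D_good]])
full_bad = np.block([[A, B], [B.conj().T, D_bad]])
```

Both tests agree on the two cases: the first partitioned matrix is positive definite and the second is not.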

A1.2.4 Inequalities

There are several important determinant inequalities, which can all be proved from partial-ordering results. Alternative closely related proofs are also possible via majorization, which is discussed in Appendix 3. The Hadamard determinant inequality for a positive semidefinite $\mathbf A \in \mathbb C^{n\times n}$ is

$$\det\mathbf A \le \prod_{i=1}^{n} A_{ii}. \tag{A1.33}$$

For positive definite $\mathbf A$, equality holds if and only if $\mathbf A$ is diagonal. The Fischer determinant inequality for the partitioned positive definite matrix $\underline{\mathbf A}$ in (A1.32) is

$$\det\underline{\mathbf A} \le \det\mathbf A\,\det\mathbf D. \tag{A1.34}$$

The Minkowski determinant inequality for positive definite $\mathbf A, \mathbf B \in \mathbb C^{n\times n}$ is

$$\det{}^{1/n}(\mathbf A + \mathbf B) \ge \det{}^{1/n}\mathbf A + \det{}^{1/n}\mathbf B, \tag{A1.35}$$

with equality if and only if $\mathbf B = c\mathbf A$ for some positive constant $c$.

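A1.3 Matrix inverses

A1.3.1 Partitioned matrices

The inverse of a block-diagonal matrix of nonsingular blocks is obtained by taking the inverses of the individual blocks,

$$[\operatorname{Diag}(\mathbf A_1, \mathbf A_2, \ldots, \mathbf A_k)]^{-1} = \operatorname{Diag}(\mathbf A_1^{-1}, \mathbf A_2^{-1}, \ldots, \mathbf A_k^{-1}). \tag{A1.36}$$

Given the factorizations (A1.1) and (A1.2) for the partitioned matrix $\underline{\mathbf A}$ and the identities

$$\begin{bmatrix} \mathbf I & \mathbf X\\ \mathbf 0 & \mathbf I \end{bmatrix}^{-1} = \begin{bmatrix} \mathbf I & -\mathbf X\\ \mathbf 0 & \mathbf I \end{bmatrix} \quad\text{and}\quad \begin{bmatrix} \mathbf I & \mathbf 0\\ \mathbf X & \mathbf I \end{bmatrix}^{-1} = \begin{bmatrix} \mathbf I & \mathbf 0\\ -\mathbf X & \mathbf I \end{bmatrix} \tag{A1.37}$$

there are the following corresponding factorizations for $\underline{\mathbf A}^{-1}$:

$$\underline{\mathbf A}^{-1} = \begin{bmatrix} \mathbf A & \mathbf B\\ \mathbf C & \mathbf D \end{bmatrix}^{-1} = \begin{bmatrix} \mathbf I & \mathbf 0\\ -\mathbf D^{-1}\mathbf C & \mathbf I \end{bmatrix} \begin{bmatrix} (\mathbf A - \mathbf B\mathbf D^{-1}\mathbf C)^{-1} & \mathbf 0\\ \mathbf 0 & \mathbf D^{-1} \end{bmatrix} \begin{bmatrix} \mathbf I & -\mathbf B\mathbf D^{-1}\\ \mathbf 0 & \mathbf I \end{bmatrix} \tag{A1.38}$$

$$= \begin{bmatrix} \mathbf I & -\mathbf A^{-1}\mathbf B\\ \mathbf 0 & \mathbf I \end{bmatrix} \begin{bmatrix} \mathbf A^{-1} & \mathbf 0\\ \mathbf 0 & (\mathbf D - \mathbf C\mathbf A^{-1}\mathbf B)^{-1} \end{bmatrix} \begin{bmatrix} \mathbf I & \mathbf 0\\ -\mathbf C\mathbf A^{-1} & \mathbf I \end{bmatrix}. \tag{A1.39}$$

These lead to the determinant formulae

$$\det\underline{\mathbf A}^{-1} = \det(\mathbf A - \mathbf B\mathbf D^{-1}\mathbf C)^{-1}\det\mathbf D^{-1} \tag{A1.40}$$

$$= \det\mathbf A^{-1}\det(\mathbf D - \mathbf C\mathbf A^{-1}\mathbf B)^{-1} \tag{A1.41}$$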

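and a formula for $\underline{\mathbf A}^{-1}$, which is sometimes called the matrix-inversion lemma:

$$\underline{\mathbf A}^{-1} = \begin{bmatrix} \mathbf A & \mathbf B\\ \mathbf C & \mathbf D \end{bmatrix}^{-1} = \begin{bmatrix} (\mathbf A - \mathbf B\mathbf D^{-1}\mathbf C)^{-1} & -(\mathbf A - \mathbf B\mathbf D^{-1}\mathbf C)^{-1}\mathbf B\mathbf D^{-1}\\ -(\mathbf D - \mathbf C\mathbf A^{-1}\mathbf B)^{-1}\mathbf C\mathbf A^{-1} & (\mathbf D - \mathbf C\mathbf A^{-1}\mathbf B)^{-1} \end{bmatrix}. \tag{A1.42}$$

Different expressions may be derived by employing the Woodbury identities

$$(\mathbf A - \mathbf B\mathbf D^{-1}\mathbf C)^{-1} = \mathbf A^{-1} + \mathbf A^{-1}\mathbf B(\mathbf D - \mathbf C\mathbf A^{-1}\mathbf B)^{-1}\mathbf C\mathbf A^{-1}, \tag{A1.43}$$

$$(\mathbf A - \mathbf B\mathbf D^{-1}\mathbf C)^{-1}\mathbf B\mathbf D^{-1} = \mathbf A^{-1}\mathbf B(\mathbf D - \mathbf C\mathbf A^{-1}\mathbf B)^{-1}. \tag{A1.44}$$

A simple special case is the inverse of a 2 × 2 matrix

$$\begin{bmatrix} a & b\\ c & d \end{bmatrix}^{-1} = \frac{1}{ad - bc}\begin{bmatrix} d & -b\\ -c & a \end{bmatrix}. \tag{A1.45}$$

A1.3.2 Moore–Penrose pseudo-inverse

Let $\mathbf A \in \mathbb C^{n\times m}$, possibly rank-deficient. Its Moore–Penrose pseudo-inverse (or generalized inverse) $\mathbf A^{\dagger} \in \mathbb C^{m\times n}$ satisfies the following conditions:

$$\mathbf A^{\dagger}\mathbf A = (\mathbf A^{\dagger}\mathbf A)^{\mathrm H}, \tag{A1.46}$$

$$\mathbf A\mathbf A^{\dagger} = (\mathbf A\mathbf A^{\dagger})^{\mathrm H}, \tag{A1.47}$$

$$\mathbf A^{\dagger}\mathbf A\mathbf A^{\dagger} = \mathbf A^{\dagger}, \tag{A1.48}$$

$$\mathbf A\mathbf A^{\dagger}\mathbf A = \mathbf A. \tag{A1.49}$$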

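The pseudo-inverse always exists and is unique. It may be computed via the SVD $\mathbf{A} = \mathbf{F}\mathbf{K}\mathbf{G}^H$ as $\mathbf{A}^{\dagger} = \mathbf{G}\mathbf{K}^{\dagger}\mathbf{F}^H$. The pseudo-inverse of $\mathbf{K} = \mathrm{Diag}(k_1, k_2, \ldots, k_r, 0, \ldots, 0)$, $k_r > 0$, is $\mathbf{K}^{\dagger} = \mathrm{Diag}(k_1^{-1}, k_2^{-1}, \ldots, k_r^{-1}, 0, \ldots, 0)$. The SVD is expensive to compute, but the pseudo-inverse may be computed inexpensively via a Gram–Schmidt factorization.

Example A1.1. If the 2 × 2 positive semidefinite Hermitian matrix

$$\mathbf{A} = \begin{bmatrix} a & b \\ b^* & d \end{bmatrix} \tag{A1.50}$$

is singular, i.e., $ad = |b|^2$, then its eigen- and singular values are $\lambda_1 = a + d$ and $\lambda_2 = 0$, thanks to (A1.13). The eigen- and singular vector associated with $\lambda_1$ is

$$\mathbf{u}_1 = \frac{1}{\sqrt{|b|^2 + d^2}} \begin{bmatrix} b \\ d \end{bmatrix}. \tag{A1.51}$$

The pseudo-inverse of $\mathbf{A}$ is therefore

$$\mathbf{A}^{\dagger} = \lambda_1^{-1}\mathbf{u}_1\mathbf{u}_1^H = \frac{1}{(a + d)(|b|^2 + d^2)} \begin{bmatrix} |b|^2 & bd \\ b^* d & d^2 \end{bmatrix}. \tag{A1.52}$$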


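There are special cases when $\mathbf{A}$ has full rank. If $m \le n$, then $\mathbf{A}^{\dagger} = (\mathbf{A}^H\mathbf{A})^{-1}\mathbf{A}^H$, and $\mathbf{A}^{\dagger}\mathbf{A} = \mathbf{I}_{m\times m}$. If $n \le m$, then $\mathbf{A}^{\dagger} = \mathbf{A}^H(\mathbf{A}\mathbf{A}^H)^{-1}$, and $\mathbf{A}\mathbf{A}^{\dagger} = \mathbf{I}_{n\times n}$. If $\mathbf{A}$ is square and nonsingular, then $\mathbf{A}^{\dagger} = \mathbf{A}^{-1}$.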
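The four defining conditions and the closed form of Example A1.1 can be checked on a concrete singular matrix. This is a minimal sketch, assuming NumPy; the numerical values of `a` and `b` are arbitrary illustrative choices, and `np.linalg.pinv` is NumPy's SVD-based routine rather than the Gram–Schmidt approach mentioned above.

```python
import numpy as np

a, b = 2.0, 1.0 + 1.0j
d = abs(b) ** 2 / a                      # enforces a*d = |b|^2, so A is singular
A = np.array([[a, b], [np.conj(b), d]])

# Closed form of Example A1.1.
scale = (a + d) * (abs(b) ** 2 + d ** 2)
A_pinv_closed = np.array([[abs(b) ** 2, b * d],
                          [np.conj(b) * d, d ** 2]]) / scale

# SVD-based pseudo-inverse; rcond discards the numerically zero singular value.
A_pinv = np.linalg.pinv(A, rcond=1e-10)

# The four Moore-Penrose conditions.
penrose_ok = (
    np.allclose(A_pinv @ A, (A_pinv @ A).conj().T)
    and np.allclose(A @ A_pinv, (A @ A_pinv).conj().T)
    and np.allclose(A_pinv @ A @ A_pinv, A_pinv)
    and np.allclose(A @ A_pinv @ A, A)
)
```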

A1.3.3

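Projections

A matrix $\mathbf{P} \in \mathbb{C}^{n\times n}$ that satisfies $\mathbf{P}^2 = \mathbf{P}$ is called a projection. If, in addition, $\mathbf{P} = \mathbf{P}^H$, then $\mathbf{P}$ is an orthogonal projection. For $\mathbf{H} \in \mathbb{C}^{n\times p}$, the matrix $\mathbf{P}_{\mathbf{H}} = \mathbf{H}\mathbf{H}^{\dagger}$ is an orthogonal projection whose range is the range of $\mathbf{H}$. If $\mathbf{H}$ has full rank and $p \le n$, $\mathbf{P}_{\mathbf{H}} = \mathbf{H}\mathbf{H}^{\dagger} = \mathbf{H}(\mathbf{H}^H\mathbf{H})^{-1}\mathbf{H}^H$. If $[\mathbf{H}, \mathbf{A}]$ is a full-rank $n \times n$ matrix, where $\mathbf{H} \in \mathbb{C}^{n\times p}$ and $\mathbf{A} \in \mathbb{C}^{n\times(n-p)}$ with $\mathbf{H}^H\mathbf{A} = \mathbf{0}$, then

$$\mathbf{x} = \mathbf{P}_{\mathbf{H}}\mathbf{x} + \mathbf{P}_{\mathbf{A}}\mathbf{x} \tag{A1.53}$$

decomposes the vector $\mathbf{x}$ into two orthogonal components, one of which lies in the range of $\mathbf{H}$ and the other of which lies in the range of $\mathbf{A}$. Note that $\mathbf{P}_{\mathbf{A}} = \mathbf{A}(\mathbf{A}^H\mathbf{A})^{-1}\mathbf{A}^H = \mathbf{I} - \mathbf{P}_{\mathbf{H}}$.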
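The projection properties above can be illustrated numerically. The following is a sketch, assuming NumPy; the random test matrix and the names `P_H`, `P_perp` are illustrative. Here the complementary projection is formed directly as I − P_H, so no explicit basis A of the orthogonal complement is needed.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 5, 2
H = rng.standard_normal((n, p)) + 1j * rng.standard_normal((n, p))

P_H = H @ np.linalg.pinv(H)                                 # P_H = H H^dagger
P_H_alt = H @ np.linalg.solve(H.conj().T @ H, H.conj().T)   # H (H^H H)^{-1} H^H

P_perp = np.eye(n) - P_H          # projection onto the orthogonal complement
x = rng.standard_normal(n) + 1j * rng.standard_normal(n)
x_H, x_perp = P_H @ x, P_perp @ x  # the two orthogonal components of x
```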

Appendix 2 Complex differential calculus (Wirtinger calculus)

In statistical signal processing, we often deal with a real nonnegative cost function, such as a likelihood function or a quadratic form, which is then either analytically or numerically optimized with respect to a vector or matrix of parameters. This involves taking derivatives with respect to vectors or matrices, leading to gradient vectors and Jacobian and Hessian matrices. What happens when the parameters are complex-valued? That is, how do we differentiate a real-valued function with respect to a complex argument?

What makes this situation confusing is that classical complex analysis tells us that a complex function is differentiable on its entire domain if and only if it is holomorphic (which is a synonym for complex analytic). A holomorphic function with nonzero derivative is conformal because it preserves angles (including their orientations) and the shapes of infinitesimally small figures (but not necessarily their size) in the complex plane. Since nonconstant real-valued functions defined on the complex domain cannot be holomorphic, their classical complex derivatives do not exist.

We can, of course, regard a function $f$ defined on $\mathbb{C}^n$ as a function defined on $\mathbb{R}^{2n}$. If $f$ is differentiable on $\mathbb{R}^{2n}$, it is said to be real-differentiable, and if $f$ is differentiable on $\mathbb{C}^n$, it is complex-differentiable. A function is complex-differentiable if and only if it is real-differentiable and the Cauchy–Riemann equations hold. Is there a way to define generalized complex derivatives for functions that are real-differentiable but not complex-differentiable? This would extend complex differential calculus in a way similar to the way that impropriety extends the theory of complex random variables. It is indeed possible to do this. The theory was developed by the Austrian mathematician Wilhelm Wirtinger (1927), which is why this generalized complex differential calculus is sometimes referred to as Wirtinger calculus.

In the engineering literature, Wirtinger calculus was rediscovered by Brandwood (1983) and then further developed by van den Bos (1994a). In this appendix, we mainly follow the outline by van den Bos (1994a), with some minor extensions. The key idea of Wirtinger calculus is to formally regard $f$ as a function of two independent complex variables $x$ and $x^*$. A generalized complex derivative is then formally defined as the derivative with respect to $x$, while treating $x^*$ as a constant. Another generalized derivative is defined as the derivative with respect to $x^*$, while formally treating $x$ as a constant. The generalized derivatives exist whenever $f$ is real-differentiable. These ideas extend in a straightforward fashion to complex gradients, Jacobians, and Hessians.


A2.1

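Complex gradients

We shall begin with a scalar real-valued, totally differentiable function $f(\mathbf{z})$, $\mathbf{z} = [u, v]^T \in \mathbb{R}^2$. Its linear approximation at $\mathbf{z}_0 = [u_0, v_0]^T$ is

$$f(\mathbf{z}) = f(u, v) \approx f(\mathbf{z}_0) + \nabla_{\mathbf{z}} f(\mathbf{z}_0)(\mathbf{z} - \mathbf{z}_0), \tag{A2.1}$$

with the gradient (a row vector)

$$\nabla_{\mathbf{z}} f(\mathbf{z}_0) = \begin{bmatrix} \dfrac{\partial f}{\partial u}(\mathbf{z}_0) & \dfrac{\partial f}{\partial v}(\mathbf{z}_0) \end{bmatrix}. \tag{A2.2}$$

How should the approximation (A2.1) work for a real-valued function with complex argument? Letting $x = u + jv$, we may express $f$ as a function of complex $x$. In a slight abuse of notation, we write $f(x) = f(\mathbf{z})$. Utilizing the 2 × 2 real-to-complex matrix

$$\mathbf{T} = \begin{bmatrix} 1 & j \\ 1 & -j \end{bmatrix}, \tag{A2.3}$$

which satisfies $\mathbf{T}^H\mathbf{T} = \mathbf{T}\mathbf{T}^H = 2\mathbf{I}$, we now obtain

$$\nabla_{\mathbf{z}} f(\mathbf{z}_0)(\mathbf{z} - \mathbf{z}_0) = \left(\tfrac{1}{2}\nabla_{\mathbf{z}} f(\mathbf{z}_0)\mathbf{T}^H\right)\left(\mathbf{T}(\mathbf{z} - \mathbf{z}_0)\right) \tag{A2.4}$$

$$= \nabla_{\underline{\mathbf{x}}} f(x_0) \begin{bmatrix} x - x_0 \\ x^* - x_0^* \end{bmatrix} \tag{A2.5}$$

$$= \nabla_{\underline{\mathbf{x}}} f(x_0)(\underline{\mathbf{x}} - \underline{\mathbf{x}}_0). \tag{A2.6}$$

This introduces the augmented vector

$$\underline{\mathbf{x}} = \begin{bmatrix} x \\ x^* \end{bmatrix} = \begin{bmatrix} u + jv \\ u - jv \end{bmatrix} = \mathbf{T}\mathbf{z}. \tag{A2.7}$$

Using (A2.4)–(A2.6), we define the complex gradient

$$\nabla_{\underline{\mathbf{x}}} f(x_0) \triangleq \tfrac{1}{2}\nabla_{\mathbf{z}} f(\mathbf{z}_0)\mathbf{T}^H = \begin{bmatrix} \dfrac{1}{2}\left(\dfrac{\partial f}{\partial u} - j\dfrac{\partial f}{\partial v}\right)(\mathbf{z}_0) & \dfrac{1}{2}\left(\dfrac{\partial f}{\partial u} + j\dfrac{\partial f}{\partial v}\right)(\mathbf{z}_0) \end{bmatrix}. \tag{A2.8}$$

In (A2.5), the first component of $\nabla_{\underline{\mathbf{x}}} f(x_0)$ operates on $x$ and the second component operates on $x^*$. Thus,

$$\nabla_{\underline{\mathbf{x}}} f(x_0) = \begin{bmatrix} \dfrac{\partial f}{\partial x}(x_0) & \dfrac{\partial f}{\partial x^*}(x_0) \end{bmatrix}. \tag{A2.9}$$

By equating this expression with the right-hand side of (A2.8), we obtain the following definition.

Definition A2.1. The generalized complex differential operator is defined as

$$\frac{\partial}{\partial x} \triangleq \frac{1}{2}\left(\frac{\partial}{\partial u} - j\frac{\partial}{\partial v}\right) \tag{A2.10}$$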


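and the conjugate generalized complex differential operator as

$$\frac{\partial}{\partial x^*} \triangleq \frac{1}{2}\left(\frac{\partial}{\partial u} + j\frac{\partial}{\partial v}\right). \tag{A2.11}$$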

The derivatives obtained by applying the generalized complex and conjugate complex differential operators are called Wirtinger derivatives after the pioneering work of Wirtinger (1927). These generalized differential operators can be formally implemented by treating $x$ and $x^*$ as independent variables. That is, when applying $\partial/\partial x$, we take the derivative with respect to $x$ while formally treating $x^*$ as a constant. Similarly, $\partial/\partial x^*$ yields the derivative with respect to $x^*$, regarding $x$ as a constant. The following example shows how this approach works.

Example A2.1. Consider the function $f(x) = |x|^2 = x x^*$. Treating $x$ and $x^*$ as two independent variables, we find

$$\frac{\partial f(x)}{\partial x} = x^* \quad\text{and}\quad \frac{\partial f(x)}{\partial x^*} = x.$$

This can be checked by writing $f(x) = f(u, v) = u^2 + v^2$ and

$$\frac{1}{2}\left(\frac{\partial}{\partial u} - j\frac{\partial}{\partial v}\right) f(u, v) = u - jv = x^*,$$

$$\frac{1}{2}\left(\frac{\partial}{\partial u} + j\frac{\partial}{\partial v}\right) f(u, v) = u + jv = x.$$

The linear approximation of $f$ at $x_0$ is

$$f(x) \approx f(x_0) + \begin{bmatrix} \dfrac{\partial f}{\partial x}(x_0) & \dfrac{\partial f}{\partial x^*}(x_0) \end{bmatrix} \begin{bmatrix} x - x_0 \\ x^* - x_0^* \end{bmatrix}$$

$$= f(x_0) + x_0^*(x - x_0) + x_0(x^* - x_0^*) = f(x_0) + 2\,\mathrm{Re}(x_0^* x) - 2|x_0|^2.$$

There is actually nothing in our development thus far that would prevent us from applying the generalized complex differential operators to complex-valued functions. Therefore, we no longer assume that f is real-valued.

A2.1.1

Holomorphic functions

In classical complex analysis, the derivative of a function $f(x)$ at $x_0$ is defined as

$$\frac{df}{dx}(x_0) = \lim_{x \to x_0} \frac{f(x) - f(x_0)}{x - x_0}. \tag{A2.12}$$


This limit exists only if it is independent of the direction with which $x$ approaches $x_0$ in the complex plane. If the limit exists at $x_0$, $f$ is called complex-differentiable at $x_0$. A fundamental result in complex analysis is that $f(x)$ is complex-differentiable if and only if $f(u, v)$ is real-differentiable and the Cauchy–Riemann equations hold:

$$\frac{\partial(\mathrm{Re}\, f)}{\partial u} = \frac{\partial(\mathrm{Im}\, f)}{\partial v} \quad\text{and}\quad \frac{\partial(\mathrm{Re}\, f)}{\partial v} = -\frac{\partial(\mathrm{Im}\, f)}{\partial u}. \tag{A2.13}$$

Result A2.1. The Cauchy–Riemann equations are more simply stated as

$$\frac{\partial f}{\partial x^*} = 0. \tag{A2.14}$$

Definition A2.2. If a function f on an open domain A ⊆ C is complex-differentiable for every x ∈ A it is called holomorphic or analytic. We now see that a real-differentiable function f is holomorphic if and only if it does not depend on x ∗ . For a holomorphic function f , the Wirtinger derivative ∂ f /∂ x is the standard complex derivative. A holomorphic function with nonzero derivative is conformal because it only rotates and scales infinitesimally small figures in the complex plane, preserving oriented angles and shapes. The Cauchy–Riemann equations immediately make clear that nonconstant real-valued functions cannot be holomorphic.
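Definition A2.1 and Result A2.1 lend themselves to a quick numerical check. The sketch below uses plain Python and central differences; the function name `wirtinger`, the step size `h`, and the test point `x0` are illustrative assumptions, not part of the original development.

```python
def wirtinger(f, x, h=1e-6):
    """Central-difference estimates of (df/dx, df/dx*) at the complex point x."""
    df_du = (f(x + h) - f(x - h)) / (2 * h)            # partial along u = Re x
    df_dv = (f(x + 1j * h) - f(x - 1j * h)) / (2 * h)  # partial along v = Im x
    d_dx = 0.5 * (df_du - 1j * df_dv)       # operator (A2.10)
    d_dxconj = 0.5 * (df_du + 1j * df_dv)   # operator (A2.11)
    return d_dx, d_dxconj

x0 = 1.5 - 0.5j

# Example A2.1: f(x) = |x|^2 is real-valued and not holomorphic.
d_abs2 = wirtinger(lambda x: abs(x) ** 2, x0)

# f(x) = x^2 is holomorphic, so its conjugate derivative should vanish (A2.14).
d_sq = wirtinger(lambda x: x ** 2, x0)
```

For `d_abs2` the two estimates should approximate the conjugate of `x0` and `x0` itself, while for `d_sq` the first estimate should approximate `2 * x0` and the second should be close to zero.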

A2.1.2

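Complex gradients and Jacobians

Wirtinger derivatives may also be computed for functions with vector-valued domain, and for vector-valued functions.

Definition A2.3. Let $f: \mathbb{C}^n \to \mathbb{C}$. Assuming that $\mathrm{Re}\, f(\mathbf{z})$ and $\mathrm{Im}\, f(\mathbf{z})$ are each real-differentiable, the complex gradient is the $1 \times 2n$ row vector

$$\nabla_{\underline{\mathbf{x}}} f = \frac{\partial f}{\partial \underline{\mathbf{x}}} = \begin{bmatrix} \dfrac{\partial f}{\partial \mathbf{x}} & \dfrac{\partial f}{\partial \mathbf{x}^*} \end{bmatrix} \tag{A2.15}$$

with

$$\frac{\partial f}{\partial \mathbf{x}} = \begin{bmatrix} \dfrac{\partial f}{\partial x_1} & \dfrac{\partial f}{\partial x_2} & \cdots & \dfrac{\partial f}{\partial x_n} \end{bmatrix}, \tag{A2.16}$$

$$\frac{\partial f}{\partial \mathbf{x}^*} = \begin{bmatrix} \dfrac{\partial f}{\partial x_1^*} & \dfrac{\partial f}{\partial x_2^*} & \cdots & \dfrac{\partial f}{\partial x_n^*} \end{bmatrix}. \tag{A2.17}$$

Definition A2.4. The complex Jacobian of a vector-valued function $\mathbf{f}: \mathbb{C}^n \to \mathbb{C}^m$ is the $m \times 2n$ matrix

$$\mathbf{J}_{\underline{\mathbf{x}}} = \begin{bmatrix} \nabla_{\underline{\mathbf{x}}} f_1 \\ \nabla_{\underline{\mathbf{x}}} f_2 \\ \vdots \\ \nabla_{\underline{\mathbf{x}}} f_m \end{bmatrix}. \tag{A2.18}$$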


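Result A2.2. The Cauchy–Riemann equations for a vector-valued function $\mathbf{f}$ are

$$\frac{\partial \mathbf{f}}{\partial \mathbf{x}^*} = \mathbf{0}. \tag{A2.19}$$

If these hold everywhere on the domain, $\mathbf{f}$ is holomorphic. This means that $\mathbf{f}$ must be a function of $\mathbf{x}$ only, and must not depend on $\mathbf{x}^*$.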

A2.1.3

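Properties of Wirtinger derivatives

From Definition A2.1, we easily obtain the following rules for working with Wirtinger derivatives:

$$\frac{\partial \mathbf{x}}{\partial \mathbf{x}} = \mathbf{I} \quad\text{and}\quad \frac{\partial \mathbf{x}^*}{\partial \mathbf{x}^*} = \mathbf{I}, \tag{A2.20}$$

$$\frac{\partial \mathbf{x}}{\partial \mathbf{x}^*} = \mathbf{0} \quad\text{and}\quad \frac{\partial \mathbf{x}^*}{\partial \mathbf{x}} = \mathbf{0}, \tag{A2.21}$$

$$\frac{\partial \mathbf{f}^*}{\partial \mathbf{x}} = \left(\frac{\partial \mathbf{f}}{\partial \mathbf{x}^*}\right)^* \quad\text{and}\quad \frac{\partial \mathbf{f}^*}{\partial \mathbf{x}^*} = \left(\frac{\partial \mathbf{f}}{\partial \mathbf{x}}\right)^*, \tag{A2.22}$$

$$\frac{\partial \mathbf{f}(\mathbf{g})}{\partial \mathbf{x}} = \frac{\partial \mathbf{f}}{\partial \mathbf{g}}\frac{\partial \mathbf{g}}{\partial \mathbf{x}} + \frac{\partial \mathbf{f}}{\partial \mathbf{g}^*}\frac{\partial \mathbf{g}^*}{\partial \mathbf{x}}, \tag{A2.23}$$

$$\frac{\partial \mathbf{f}(\mathbf{g})}{\partial \mathbf{x}^*} = \frac{\partial \mathbf{f}}{\partial \mathbf{g}}\frac{\partial \mathbf{g}}{\partial \mathbf{x}^*} + \frac{\partial \mathbf{f}}{\partial \mathbf{g}^*}\frac{\partial \mathbf{g}^*}{\partial \mathbf{x}^*}. \tag{A2.24}$$

The last two identities are the chain rule for non-holomorphic functions. If $\mathbf{f}$ and $\mathbf{g}$ are both holomorphic, (A2.23) simplifies to the usual chain rule since the second summand is zero, and (A2.24) vanishes. If $f$ is real-valued, it follows from $f(\mathbf{x}) = f^*(\mathbf{x})$ and (A2.22) that

$$\frac{\partial f}{\partial \mathbf{x}^*} = \left(\frac{\partial f}{\partial \mathbf{x}}\right)^*. \tag{A2.25}$$

The gradient of a scalar real-valued function $f(\mathbf{x})$ is therefore an augmented vector

$$\nabla_{\underline{\mathbf{x}}} f = \begin{bmatrix} \dfrac{\partial f}{\partial \mathbf{x}} & \left(\dfrac{\partial f}{\partial \mathbf{x}}\right)^* \end{bmatrix}, \tag{A2.26}$$

where the last $n$ components are the conjugates of the first $n$ components. The linear approximation of $f$ at $\mathbf{x}_0$ is

$$f(\mathbf{x}) \approx f(\mathbf{x}_0) + \nabla_{\underline{\mathbf{x}}} f(\mathbf{x}_0)(\underline{\mathbf{x}} - \underline{\mathbf{x}}_0) = f(\mathbf{x}_0) + 2\,\mathrm{Re}\left\{\frac{\partial f}{\partial \mathbf{x}}(\mathbf{x}_0)(\mathbf{x} - \mathbf{x}_0)\right\}. \tag{A2.27}$$

When we are looking for local extrema, we search for points with $\nabla_{\underline{\mathbf{x}}} f = \mathbf{0}$. The following result shows that local extrema of real-valued functions can be identified using the Wirtinger derivative or the conjugate Wirtinger derivative.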


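Result A2.3. For a real-valued function $f$, the following three conditions are equivalent:

$$\nabla_{\underline{\mathbf{x}}} f = \mathbf{0} \;\Leftrightarrow\; \frac{\partial f}{\partial \mathbf{x}} = \mathbf{0} \;\Leftrightarrow\; \frac{\partial f}{\partial \mathbf{x}^*} = \mathbf{0}. \tag{A2.28}$$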

In the engineering literature, the gradient of a real-valued function is frequently defined as (∂/∂u + j ∂/∂v) f , which is equal to 2∂ f /∂x∗ . This definition is justified only insofar as it can be used to search for local extrema of f , thanks to Result A2.3. However, the preceding development has made it clear that (∂/∂u + j ∂/∂v) f is not the right definition for the complex gradient of a real (and therefore non-holomorphic) function of a complex vector x.
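As an illustration of using the conjugate Wirtinger derivative to search for an extremum, the sketch below runs a plain fixed-step descent on the real-valued cost f(x) = ‖y − Hx‖², whose conjugate derivative, written as a column vector, is Hᴴ(Hx − y). This is a minimal sketch assuming NumPy; the cost function, the step size `mu`, the iteration count, and all names are illustrative choices rather than anything prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 8, 3
H = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))
y = rng.standard_normal(m) + 1j * rng.standard_normal(m)

def conj_derivative(x):
    """(df/dx*)^T for f(x) = ||y - Hx||^2, written as a column vector."""
    return H.conj().T @ (H @ x - y)

# Fixed step size below 2 / lambda_max(H^H H), which keeps the iteration stable.
mu = 1.0 / np.linalg.norm(H, 2) ** 2

x = np.zeros(n, dtype=complex)
for _ in range(5000):
    x = x - mu * conj_derivative(x)

# Reference least-squares solution, where the conjugate derivative vanishes.
x_ls = np.linalg.lstsq(H, y, rcond=None)[0]
```

The iterate `x` should agree with `x_ls`, the point at which the conjugate derivative is zero, consistent with Result A2.3.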

A2.2

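Special cases

In this section, we present some formulae for derivatives of common expressions involving linear transformations, quadratic forms, traces, and determinants. We emphasize that there is no need to develop new differentiation rules for Wirtinger derivatives. All rules for taking derivatives of real functions remain valid. However, care must be taken to properly distinguish between the variables with respect to which differentiation is performed and those that are formally regarded as constants. Some expressions that involve derivatives with respect to a complex vector $\mathbf{x}$ follow (it is assumed that $\mathbf{a}$ and $\mathbf{A}$ are independent of $\mathbf{x}$ and $\mathbf{x}^*$):

$$\frac{\partial}{\partial \mathbf{x}} \mathbf{a}^H\mathbf{x} = \mathbf{a}^H \quad\text{and}\quad \frac{\partial}{\partial \mathbf{x}^*} \mathbf{a}^H\mathbf{x} = \mathbf{0}, \tag{A2.29}$$

$$\frac{\partial}{\partial \mathbf{x}} \mathbf{x}^H\mathbf{a} = \mathbf{0} \quad\text{and}\quad \frac{\partial}{\partial \mathbf{x}^*} \mathbf{x}^H\mathbf{a} = \mathbf{a}^T, \tag{A2.30}$$

$$\frac{\partial}{\partial \mathbf{x}} \mathbf{x}^H\mathbf{A}\mathbf{x} = \mathbf{x}^H\mathbf{A} \quad\text{and}\quad \frac{\partial}{\partial \mathbf{x}^*} \mathbf{x}^H\mathbf{A}\mathbf{x} = \mathbf{x}^T\mathbf{A}^T, \tag{A2.31}$$

$$\frac{\partial}{\partial \mathbf{x}} \mathbf{x}^T\mathbf{A}\mathbf{x} = \mathbf{x}^T(\mathbf{A} + \mathbf{A}^T) \quad\text{and}\quad \frac{\partial}{\partial \mathbf{x}^*} \mathbf{x}^T\mathbf{A}\mathbf{x} = \mathbf{0}, \tag{A2.32}$$

$$\frac{\partial}{\partial \mathbf{x}} \exp\left(-\tfrac{1}{2}\mathbf{x}^H\mathbf{A}^{-1}\mathbf{x}\right) = -\tfrac{1}{2}\exp\left(-\tfrac{1}{2}\mathbf{x}^H\mathbf{A}^{-1}\mathbf{x}\right)\mathbf{x}^H\mathbf{A}^{-1}, \tag{A2.33}$$

$$\frac{\partial}{\partial \mathbf{x}} \ln\left(\mathbf{x}^H\mathbf{A}\mathbf{x}\right) = \left(\mathbf{x}^H\mathbf{A}\mathbf{x}\right)^{-1}\mathbf{x}^H\mathbf{A}. \tag{A2.34}$$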
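The quadratic-form rules (A2.31) and (A2.32) can be verified by finite differences. This is a minimal sketch, assuming NumPy; `wirtinger_grad` is a hypothetical helper that applies the operators of Definition A2.1 component by component, and the random test data are illustrative.

```python
import numpy as np

def wirtinger_grad(f, x, h=1e-6):
    """Finite-difference row vectors (df/dx, df/dx*) for f: C^n -> C."""
    n = x.size
    d_dx = np.zeros(n, dtype=complex)
    d_dxc = np.zeros(n, dtype=complex)
    for i in range(n):
        e = np.zeros(n, dtype=complex)
        e[i] = h
        df_du = (f(x + e) - f(x - e)) / (2 * h)            # along Re x_i
        df_dv = (f(x + 1j * e) - f(x - 1j * e)) / (2 * h)  # along Im x_i
        d_dx[i] = 0.5 * (df_du - 1j * df_dv)
        d_dxc[i] = 0.5 * (df_du + 1j * df_dv)
    return d_dx, d_dxc

rng = np.random.default_rng(3)
n = 3
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
x = rng.standard_normal(n) + 1j * rng.standard_normal(n)

g_herm = wirtinger_grad(lambda z: z.conj() @ A @ z, x)   # x^H A x, cf. (A2.31)
g_bilin = wirtinger_grad(lambda z: z @ A @ z, x)         # x^T A x, cf. (A2.32)
```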

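Sometimes we encounter derivatives of a scalar-valued function $f(\mathbf{X})$ with respect to an $n \times m$ complex matrix $\mathbf{X}$. These derivatives are defined as

$$\frac{\partial f}{\partial \mathbf{X}} = \begin{bmatrix} \dfrac{\partial f}{\partial x_{11}} & \cdots & \dfrac{\partial f}{\partial x_{1m}} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f}{\partial x_{n1}} & \cdots & \dfrac{\partial f}{\partial x_{nm}} \end{bmatrix} \quad\text{and}\quad \frac{\partial f}{\partial \mathbf{X}^*} = \begin{bmatrix} \dfrac{\partial f}{\partial x_{11}^*} & \cdots & \dfrac{\partial f}{\partial x_{1m}^*} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f}{\partial x_{n1}^*} & \cdots & \dfrac{\partial f}{\partial x_{nm}^*} \end{bmatrix}. \tag{A2.35}$$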


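A few important special cases follow:

$$\frac{\partial}{\partial \mathbf{X}} \mathrm{tr}(\mathbf{A}\mathbf{X}) = \frac{\partial}{\partial \mathbf{X}} \mathrm{tr}(\mathbf{X}\mathbf{A}) = \mathbf{A}^T, \tag{A2.36}$$

$$\frac{\partial}{\partial \mathbf{X}} \mathrm{tr}(\mathbf{A}\mathbf{X}^{-1}) = -\mathbf{X}^{-T}\mathbf{A}^T\mathbf{X}^{-T}, \tag{A2.37}$$

$$\frac{\partial}{\partial \mathbf{X}} \mathrm{tr}(\mathbf{X}^H\mathbf{A}\mathbf{X}) = \mathbf{A}^T\mathbf{X}^* \quad\text{and}\quad \frac{\partial}{\partial \mathbf{X}^*} \mathrm{tr}(\mathbf{X}^H\mathbf{A}\mathbf{X}) = \mathbf{A}\mathbf{X}, \tag{A2.38}$$

$$\frac{\partial}{\partial \mathbf{X}} \mathrm{tr}\, \mathbf{X}^k = k(\mathbf{X}^{k-1})^T, \tag{A2.39}$$

$$\frac{\partial}{\partial \mathbf{X}} \det \mathbf{X} = (\det \mathbf{X})\mathbf{X}^{-T}, \tag{A2.40}$$

$$\frac{\partial}{\partial \mathbf{X}} \ln \det \mathbf{X} = \mathbf{X}^{-T}, \tag{A2.41}$$

$$\frac{\partial}{\partial \mathbf{X}} \det \mathbf{X}^k = k(\det \mathbf{X})^k\mathbf{X}^{-T}. \tag{A2.42}$$

A2.3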

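Complex Hessians

Consider the second-order approximation of a scalar real-valued function $f(\mathbf{z})$, $\mathbf{z} = [\mathbf{u}^T, \mathbf{v}^T]^T$, $\mathbf{u}, \mathbf{v} \in \mathbb{R}^n$, at $\mathbf{z}_0 = [\mathbf{u}_0^T, \mathbf{v}_0^T]^T$,

$$f(\mathbf{z}) = f(\mathbf{u}, \mathbf{v}) \approx f(\mathbf{z}_0) + \nabla_{\mathbf{z}} f(\mathbf{z}_0)(\mathbf{z} - \mathbf{z}_0) + \tfrac{1}{2}(\mathbf{z} - \mathbf{z}_0)^T\mathbf{H}_{zz}(\mathbf{z}_0)(\mathbf{z} - \mathbf{z}_0), \tag{A2.43}$$

with Hessian matrix

$$\mathbf{H}_{zz}(\mathbf{z}_0) = \frac{\partial}{\partial \mathbf{z}}\left(\frac{\partial f}{\partial \mathbf{z}}\right)^T(\mathbf{z}_0) = \begin{bmatrix} \dfrac{\partial}{\partial \mathbf{u}}\left(\dfrac{\partial f}{\partial \mathbf{u}}\right)^T(\mathbf{z}_0) & \dfrac{\partial}{\partial \mathbf{v}}\left(\dfrac{\partial f}{\partial \mathbf{u}}\right)^T(\mathbf{z}_0) \\ \dfrac{\partial}{\partial \mathbf{u}}\left(\dfrac{\partial f}{\partial \mathbf{v}}\right)^T(\mathbf{z}_0) & \dfrac{\partial}{\partial \mathbf{v}}\left(\dfrac{\partial f}{\partial \mathbf{v}}\right)^T(\mathbf{z}_0) \end{bmatrix} = \begin{bmatrix} \mathbf{H}_{uu}(\mathbf{z}_0) & \mathbf{H}_{vu}(\mathbf{z}_0) \\ \mathbf{H}_{uv}(\mathbf{z}_0) & \mathbf{H}_{vv}(\mathbf{z}_0) \end{bmatrix}. \tag{A2.44}$$

Assuming that the second-order partial derivatives are continuous, the Hessian is symmetric, which implies

$$\mathbf{H}_{uu} = \mathbf{H}_{uu}^T, \quad \mathbf{H}_{vv} = \mathbf{H}_{vv}^T, \quad\text{and}\quad \mathbf{H}_{vu} = \mathbf{H}_{uv}^T. \tag{A2.45}$$

However, the Hessian is not generally positive definite or semidefinite. In Section A2.1, we derived an equivalent description for the linear term in (A2.43) for a function $f(\mathbf{x}) = f(\mathbf{u}, \mathbf{v})$ with $\mathbf{x} = \mathbf{u} + j\mathbf{v}$. We would now like to do the same for the quadratic term. Using the $2n \times 2n$ real-to-complex matrix

$$\mathbf{T} = \begin{bmatrix} \mathbf{I} & j\mathbf{I} \\ \mathbf{I} & -j\mathbf{I} \end{bmatrix} \tag{A2.46}$$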


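with $\mathbf{T}^H\mathbf{T} = \mathbf{T}\mathbf{T}^H = 2\mathbf{I}$, we obtain

$$(\mathbf{z} - \mathbf{z}_0)^T\mathbf{H}_{zz}(\mathbf{z}_0)(\mathbf{z} - \mathbf{z}_0) = (\mathbf{z} - \mathbf{z}_0)^T\mathbf{T}^H\left(\tfrac{1}{4}\mathbf{T}\mathbf{H}_{zz}(\mathbf{z}_0)\mathbf{T}^H\right)\left(\mathbf{T}(\mathbf{z} - \mathbf{z}_0)\right) = (\underline{\mathbf{x}} - \underline{\mathbf{x}}_0)^H\,\underline{\mathbf{H}}_{xx}(\mathbf{x}_0)(\underline{\mathbf{x}} - \underline{\mathbf{x}}_0). \tag{A2.47}$$

Therefore, we define the complex augmented Hessian matrix as

$$\underline{\mathbf{H}}_{xx}(\mathbf{x}_0) \triangleq \tfrac{1}{4}\mathbf{T}\mathbf{H}_{zz}(\mathbf{z}_0)\mathbf{T}^H. \tag{A2.48}$$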

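This is equivalent to the following more detailed definition.

Definition A2.5. The complex augmented Hessian matrix of a function $f: \mathbb{C}^n \to \mathbb{R}$ is the $2n \times 2n$ matrix

$$\underline{\mathbf{H}}_{xx} = \frac{\partial}{\partial \underline{\mathbf{x}}}\left(\frac{\partial f}{\partial \underline{\mathbf{x}}}\right)^H = \begin{bmatrix} \mathbf{H}_{xx} & \widetilde{\mathbf{H}}_{xx} \\ \widetilde{\mathbf{H}}_{xx}^* & \mathbf{H}_{xx}^* \end{bmatrix}, \tag{A2.49}$$

whose $n \times n$ blocks are the complex Hessian matrix

$$\mathbf{H}_{xx} = \frac{\partial}{\partial \mathbf{x}}\left(\frac{\partial f}{\partial \mathbf{x}}\right)^H = \frac{\partial}{\partial \mathbf{x}}\left(\frac{\partial f}{\partial \mathbf{x}^*}\right)^T \tag{A2.50}$$

and the complex complementary Hessian matrix

$$\widetilde{\mathbf{H}}_{xx} = \frac{\partial}{\partial \mathbf{x}^*}\left(\frac{\partial f}{\partial \mathbf{x}}\right)^H = \frac{\partial}{\partial \mathbf{x}^*}\left(\frac{\partial f}{\partial \mathbf{x}^*}\right)^T. \tag{A2.51}$$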

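Example A2.2. Consider $f(\mathbf{x}) = \mathbf{x}^H\mathbf{A}\mathbf{x} + \mathrm{Re}(\mathbf{a}^H\mathbf{x})$. If $f$ is to be real-valued, we need to have $\mathbf{A} = \mathbf{A}^H$, in which case $\mathbf{x}^H\mathbf{A}\mathbf{x}$ is a Hermitian quadratic form. The complex gradient and Hessians are

$$\nabla_{\underline{\mathbf{x}}} f(\mathbf{x}) = \begin{bmatrix} \mathbf{x}^H\mathbf{A} + \tfrac{1}{2}\mathbf{a}^H & \mathbf{x}^T\mathbf{A}^* + \tfrac{1}{2}\mathbf{a}^T \end{bmatrix}, \quad \mathbf{H}_{xx} = \mathbf{A} \quad\text{and}\quad \widetilde{\mathbf{H}}_{xx} = \mathbf{0}.$$

The gradient is zero at $2\mathbf{A}\mathbf{x} = -\mathbf{a} \Leftrightarrow 2\mathbf{A}^*\mathbf{x}^* = -\mathbf{a}^*$, which again demonstrates the equivalences of Result A2.3. Now consider $g(\mathbf{x}) = \mathbf{x}^H\mathbf{A}\mathbf{x} + \mathrm{Re}(\mathbf{x}^H\mathbf{B}\mathbf{x}^*) + \mathrm{Re}(\mathbf{a}^H\mathbf{x})$ with $\mathbf{A} = \mathbf{A}^H$ and $\mathbf{B} = \mathbf{B}^T$. We obtain

$$\nabla_{\underline{\mathbf{x}}} g(\mathbf{x}) = \begin{bmatrix} \mathbf{x}^H\mathbf{A} + \mathbf{x}^T\mathbf{B}^* + \tfrac{1}{2}\mathbf{a}^H & \mathbf{x}^T\mathbf{A}^* + \mathbf{x}^H\mathbf{B} + \tfrac{1}{2}\mathbf{a}^T \end{bmatrix}, \quad \mathbf{H}_{xx} = \mathbf{A} \quad\text{and}\quad \widetilde{\mathbf{H}}_{xx} = \mathbf{B}.$$

We note that neither $f(\mathbf{x})$ nor $g(\mathbf{x})$ is holomorphic, yet $f$ somehow seems "better behaved" because $\widetilde{\mathbf{H}}_{xx}$ vanishes.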


A2.3.1

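Properties

From the connection $\underline{\mathbf{H}}_{xx} = \tfrac{1}{4}\mathbf{T}\mathbf{H}_{zz}\mathbf{T}^H$ we obtain

$$\mathbf{H}_{xx} = \tfrac{1}{4}\left[\mathbf{H}_{uu} + \mathbf{H}_{vv} + j(\mathbf{H}_{uv} - \mathbf{H}_{uv}^T)\right], \tag{A2.52}$$

$$\widetilde{\mathbf{H}}_{xx} = \tfrac{1}{4}\left[\mathbf{H}_{uu} - \mathbf{H}_{vv} + j(\mathbf{H}_{uv} + \mathbf{H}_{uv}^T)\right]. \tag{A2.53}$$

The augmented Hessian $\underline{\mathbf{H}}_{xx}$ is Hermitian, so

$$\mathbf{H}_{xx} = \mathbf{H}_{xx}^H \quad\text{and}\quad \widetilde{\mathbf{H}}_{xx} = \widetilde{\mathbf{H}}_{xx}^T. \tag{A2.54}$$

The second-order approximation of a scalar real-valued function $f(\mathbf{x})$ at $\mathbf{x}_0$ may therefore be written as

$$f(\mathbf{x}) \approx f(\mathbf{x}_0) + \nabla_{\underline{\mathbf{x}}} f(\mathbf{x}_0)(\underline{\mathbf{x}} - \underline{\mathbf{x}}_0) + \tfrac{1}{2}(\underline{\mathbf{x}} - \underline{\mathbf{x}}_0)^H\,\underline{\mathbf{H}}_{xx}(\mathbf{x}_0)(\underline{\mathbf{x}} - \underline{\mathbf{x}}_0)$$

$$= f(\mathbf{x}_0) + 2\,\mathrm{Re}\left\{\frac{\partial f}{\partial \mathbf{x}}(\mathbf{x}_0)(\mathbf{x} - \mathbf{x}_0)\right\} + (\mathbf{x} - \mathbf{x}_0)^H\mathbf{H}_{xx}(\mathbf{x}_0)(\mathbf{x} - \mathbf{x}_0) + \mathrm{Re}\left\{(\mathbf{x} - \mathbf{x}_0)^H\widetilde{\mathbf{H}}_{xx}(\mathbf{x}_0)(\mathbf{x}^* - \mathbf{x}_0^*)\right\}. \tag{A2.55}$$

While the augmented Hessian matrix $\underline{\mathbf{H}}_{xx}$ (just like any augmented matrix) contains some redundancy, it is often more convenient to work with than considering $\mathbf{H}_{xx}$ and $\widetilde{\mathbf{H}}_{xx}$ separately. In particular, as a simple consequence of $\underline{\mathbf{H}}_{xx} = \tfrac{1}{4}\mathbf{T}\mathbf{H}_{zz}\mathbf{T}^H$, the following holds.

Result A2.4. The eigenvalues of $\underline{\mathbf{H}}_{xx}$ are the eigenvalues of $\mathbf{H}_{zz}$ multiplied by 1/2. This means that $\underline{\mathbf{H}}_{xx}$ shares the definiteness properties of $\mathbf{H}_{zz}$. That is, $\underline{\mathbf{H}}_{xx}$ is positive or negative (semi)definite if and only if $\mathbf{H}_{zz}$ is respectively positive or negative (semi)definite. Moreover, the condition numbers of $\underline{\mathbf{H}}_{xx}$ and $\mathbf{H}_{zz}$ are identical. These properties are important for numerical optimization methods.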
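Result A2.4 and the block pattern of the augmented Hessian can be confirmed on a random real symmetric Hessian. The following is a minimal sketch, assuming NumPy; the matrix `H_zz` is synthetic test data, not the Hessian of any particular function.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 3
M = rng.standard_normal((2 * n, 2 * n))
H_zz = M + M.T                               # real symmetric 2n x 2n Hessian

I = np.eye(n)
T = np.block([[I, 1j * I], [I, -1j * I]])    # real-to-complex matrix (A2.46)
H_aug = 0.25 * T @ H_zz @ T.conj().T         # augmented Hessian (A2.48)

H_xx = H_aug[:n, :n]        # complex Hessian block
H_tilde = H_aug[:n, n:]     # complementary Hessian block

eig_zz = np.linalg.eigvalsh(H_zz)    # ascending eigenvalues of the real Hessian
eig_aug = np.linalg.eigvalsh(H_aug)  # ascending eigenvalues of the augmented Hessian
```

Because T/√2 is unitary, `eig_aug` should equal `0.5 * eig_zz`, which is why the definiteness and the condition number carry over.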

A2.3.2

Extension to complex-valued functions We can also define the Hessian of a complex-valued function f . The key difference is that the Hessian ∂f ∂f H (A2.56) ∂x ∂x no longer satisfies the block pattern of an augmented matrix because H ∗ T ∂f ∂ ∂ ∂ f ∗x x . Hx x = =H ∂x ∂x ∂x∗ ∂x

(A2.57)

On the other hand, ∂ ∂x∗

∂f ∂x

T

∂ = ∂x

∂f ∂x

H T = HTx x ,

(A2.58)


but Hx x = HHx x . The following example illustrates that, for complex-valued f , we need x x , and Hx x defined in (A2.57). to consider three n × n Hessian matrices: Hx x , H Example A2.3. Let f (x) = xH Ax + 12 xH Bx∗ + 12 xT Cx with B = BT and C = CT . However, we may have A = AH so that xH Ax is generally complex. We obtain ∂ ∂f H = AH , Hx x = ∂x ∂x ∂f T xx = ∂ = B. H ∂x∗ ∂x∗ However, in order to fully characterize the quadratic function f , we also need to consider the third Hessian matrix ∂ ∂f T = C. Hx x = ∂x ∂x For complex-valued f , the second-order approximation at x0 is therefore f (x) ≈ f (x0 ) + ∇x f (x − x0 ) + 12 (x − x0 )H Hx x (x − x0 ) = f (x0 ) +

∂f ∂f (x − x0 ) + ∗ (x∗ − x∗0 ) ∂x ∂x

+ 12 (x − x0 )H Hx x (x − x0 ) + 12 (x − x0 )T HTx x (x∗ − x∗0 ) x x (x∗ − x∗0 ) + 1 (x − x0 )T Hx x (x − x0 ), + 12 (x − x0 )H H 2

(A2.59)

where all gradients and Hessians are evaluated at x0 . For holomorphic f , the only nonzero Hessian is Hx x .

Appendix 3 Introduction to majorization

The origins of majorization theory can be traced to Schur (1923), who studied the relationship between the diagonal elements of a positive semidefinite Hermitian matrix $\mathbf{H}$ and its eigenvalues as a means to illuminate Hadamard's determinant inequality. Schur found that the diagonal elements $H_{11}, \ldots, H_{nn}$ are majorized by the eigenvalues $\lambda_1, \ldots, \lambda_n$, written as $[H_{11}, \ldots, H_{nn}]^T \prec [\lambda_1, \ldots, \lambda_n]^T$, which we shall precisely define in due course. Intuitively, this majorization relation means that the eigenvalues are more spread out than the diagonal elements. Importantly, majorization defines an ordering. Schur identified all functions $g$ that preserve this ordering, i.e., $\mathbf{x} \prec \mathbf{y} \Rightarrow g(\mathbf{x}) \le g(\mathbf{y})$. These functions are now called Schur-convex. In doing so, Schur implicitly characterized all possible inequalities that relate a function of the diagonal elements of a Hermitian matrix to the same function of the eigenvalues. (Schur originally worked with positive semidefinite matrices, but this restriction subsequently turned out to be unnecessary.)

For us, the main implication of this result is the following. In order to maximize any Schur-convex function (or minimize any Schur-concave function) of the diagonal elements of a Hermitian matrix $\mathbf{H}$ with given eigenvalues $\lambda_1, \ldots, \lambda_n$, we must unitarily diagonalize $\mathbf{H}$. For instance, from the fact that the product $x_1 x_2 \cdots x_n$ is Schur-concave, we obtain Hadamard's determinant inequality

$$\prod_{i=1}^{n} H_{ii} \ge \prod_{i=1}^{n} \lambda_i = \det \mathbf{H}.$$

Hence, the product of the diagonal elements of H with prescribed eigenvalues is minimized if H is diagonal. It is possible to generalize these results to apply to the singular values of nonsquare non-Hermitian matrices. Majorization is not as well known in the signal-processing community as it perhaps should be. In this book, we consider many problems from a majorization point of view, even though they can also be solved using other, better-known but also more cumbersome, algebraic tools. The concise introduction presented in this appendix barely scratches the surface of the rich theory of majorization. For an excellent treatment of


this topic, including proofs and references for all unreferenced results in this chapter, we refer the reader to the book by Marshall and Olkin (1979).

A3.1

Basic definitions

A3.1.1

Majorization

Definition A3.1. A vector $\mathbf{x} \in \mathbb{R}^n$ is said to be majorized by a vector $\mathbf{y} \in \mathbb{R}^n$, written as $\mathbf{x} \prec \mathbf{y}$, if

$$\sum_{i=1}^{r} x_{[i]} \le \sum_{i=1}^{r} y_{[i]}, \quad r = 1, \ldots, n - 1, \tag{A3.1}$$

$$\sum_{i=1}^{n} x_{[i]} = \sum_{i=1}^{n} y_{[i]}, \tag{A3.2}$$

where $[\cdot]$ is a permutation such that $x_{[1]} \ge x_{[2]} \ge \cdots \ge x_{[n]}$. That is, the sum of the $r$ largest components of $\mathbf{x}$ is less than or equal to the sum of the $r$ largest components of $\mathbf{y}$, with equality required for the total sum. Intuitively, if $\mathbf{x} \prec \mathbf{y}$, the components of $\mathbf{x}$ are less spread out or "more equal" than the components of $\mathbf{y}$.

Example A3.1. Under the constraints

$$\sum_{i=1}^{n} x_i = N, \quad x_i \ge 0, \; i = 1, \ldots, n,$$

the vector with the least spread-out components in the sense of majorization is $[N/n, N/n, \ldots, N/n]^T$, and the vector with the most spread-out components is $[N, 0, \ldots, 0]^T$ (or any permutation thereof). Thus, for any vector $\mathbf{x}$,

$$\frac{1}{n}[N, N, \ldots, N]^T \prec \mathbf{x} \prec [N, 0, \ldots, 0]^T.$$

Definition A3.2. If the weaker condition

$$\sum_{i=1}^{r} x_{[i]} \le \sum_{i=1}^{r} y_{[i]}, \quad r = 1, \ldots, n \tag{A3.3}$$

holds, without equality for r = n, x is said to be weakly majorized by y, written as x ≺w y. The inequality (A3.3) is sometimes referred to as weak submajorization in order to distinguish it from another form of weak majorization, which is called weak supermajorization. For our purposes, we will not require the concept of weak supermajorization. Because the components of x and y are reordered in the definitions of majorization and weak majorization, their original order is irrelevant. Note, however, that majorization is


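sometimes also defined with respect to increasing rather than decreasing order. In that case, a vector $\mathbf{x}$ is majorized by a vector $\mathbf{y}$ if the components of $\mathbf{x}$ are more spread out than the components of $\mathbf{y}$. Majorization defines a preordering on a set $\mathcal{A}$ since it is reflexive,

$$\mathbf{x} \prec \mathbf{x}, \quad \forall\, \mathbf{x} \in \mathcal{A}, \tag{A3.4}$$

and transitive,

$$(\mathbf{x} \prec \mathbf{y} \text{ and } \mathbf{y} \prec \mathbf{z}) \Rightarrow \mathbf{x} \prec \mathbf{z}, \quad \forall\, (\mathbf{x}, \mathbf{y}, \mathbf{z}) \in \mathcal{A}^3. \tag{A3.5}$$

Yet, in a strict sense, majorization does not constitute a partial ordering because it is not antisymmetric,

$$\mathbf{x} \prec \mathbf{y} \prec \mathbf{x} \;\not\Rightarrow\; \mathbf{x} = \mathbf{y}. \tag{A3.6}$$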

However, if x ≺ y and y ≺ x, then x is simply a permutation of y. Therefore, majorization does determine a partial ordering if it is restricted to the set of ordered n-tuples D = {x: x1 ≥ x2 ≥ · · · ≥ xn }. It is straightforward to verify that all statements in this paragraph regarding preordering and partial ordering also apply to weak majorization.
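Definitions A3.1 and A3.2 translate directly into a partial-sum comparison. The following plain-Python sketch illustrates this; the function names and the tolerance `tol` (which absorbs floating-point rounding) are illustrative assumptions.

```python
from itertools import accumulate

def partial_sums(v):
    """Cumulative sums of the components of v sorted in decreasing order."""
    return list(accumulate(sorted(v, reverse=True)))

def weakly_majorized(x, y, tol=1e-12):
    """True if x is weakly (sub)majorized by y, cf. (A3.3)."""
    return len(x) == len(y) and all(
        sx <= sy + tol for sx, sy in zip(partial_sums(x), partial_sums(y))
    )

def majorized(x, y, tol=1e-12):
    """True if x is majorized by y, cf. (A3.1) and (A3.2)."""
    return weakly_majorized(x, y, tol) and abs(sum(x) - sum(y)) <= tol

# Example A3.1 with N = 6 and n = 3.
N, n = 6.0, 3
uniform = [N / n] * n      # least spread-out vector
peaked = [N, 0.0, 0.0]     # most spread-out vector
x = [3.0, 1.0, 2.0]        # some vector satisfying the constraints
```

With these inputs, `majorized(uniform, x)` and `majorized(x, peaked)` both hold, and a vector and any permutation of it majorize each other, which is the failure of antisymmetry noted in (A3.6).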

A3.1.2

Schur-convex functions

Functions that preserve the preordering of majorization are called Schur-convex.

Definition A3.3. A real-valued function $g$ defined on a set $\mathcal{A} \subset \mathbb{R}^n$ is Schur-convex on $\mathcal{A}$ if

$$\mathbf{x} \prec \mathbf{y} \text{ on } \mathcal{A} \;\Rightarrow\; g(\mathbf{x}) \le g(\mathbf{y}). \tag{A3.7}$$

If strict inequality holds when $\mathbf{x}$ is not a permutation of $\mathbf{y}$, then $g$ is called strictly Schur-convex. A function $g$ is called (strictly) Schur-concave if $-g$ is (strictly) Schur-convex. Functions that preserve the preordering of weak majorization must be Schur-convex and increasing.

Definition A3.4. A real-valued function $g$ defined on $\mathcal{A} \subset \mathbb{R}^n$ is called increasing if

$$(x_i \le y_i, \; i = 1, \ldots, n) \;\Rightarrow\; g(\mathbf{x}) \le g(\mathbf{y}). \tag{A3.8}$$

It is called strictly increasing if the right-hand inequality in (A3.8) is strict when $\mathbf{x} \ne \mathbf{y}$, and (strictly) decreasing if $-g$ is (strictly) increasing.

Result A3.1. We have

$$\mathbf{x} \prec_w \mathbf{y} \text{ on } \mathcal{A} \;\Rightarrow\; g(\mathbf{x}) \le g(\mathbf{y}) \tag{A3.9}$$

if and only if $g$ is Schur-convex and increasing on $\mathcal{A}$. The strict version of this result goes as follows:

$$(\mathbf{x} \prec_w \mathbf{y} \text{ on } \mathcal{A} \text{ and } \mathbf{x} \text{ not a permutation of } \mathbf{y}) \;\Rightarrow\; g(\mathbf{x}) < g(\mathbf{y}) \tag{A3.10}$$

if and only if $g$ is strictly Schur-convex and strictly increasing on $\mathcal{A}$.


Testing whether or not a function $g$ is Schur-convex is usually straightforward but possibly tedious. We will look at tests for Schur-convexity in the next section. If $g$ is differentiable, it is a simple matter to check whether it is increasing.

Result A3.2. If $I \subset \mathbb{R}$ is an open interval, then $g$ is increasing on $I$ if and only if the derivative $g'(x) \ge 0$ for all $x \in I$. If $\mathcal{A} \subset \mathbb{R}^n$ is a convex set with nonempty interior and $g$ is differentiable on the interior of $\mathcal{A}$ and continuous on the boundary of $\mathcal{A}$, then $g$ is increasing on $\mathcal{A}$ if and only if all partial derivatives

$$\frac{\partial}{\partial x_i} g(\mathbf{x}) \ge 0, \quad i = 1, \ldots, n,$$

for all $\mathbf{x}$ in the interior of $\mathcal{A}$.

A3.2

Tests for Schur-convexity

We will need the following definition.

Definition A3.5. A function $g$ is called symmetric if it is invariant under permutations of the arguments, i.e., $g(\mathbf{x}) = g(\boldsymbol{\Pi}\mathbf{x})$ for all $\mathbf{x}$ and all permutation matrices $\boldsymbol{\Pi}$.

One of the most important characterizations of Schur-convex functions is the following.

Result A3.3. Let $I \subset \mathbb{R}$ be an open interval and $g: I^n \to \mathbb{R}$ be continuously differentiable. Then $g$ is Schur-convex on $I^n$ if and only if

$$g \text{ is symmetric on } I^n \tag{A3.11}$$

and

$$(x_1 - x_2)\left(\frac{\partial}{\partial x_1} g(\mathbf{x}) - \frac{\partial}{\partial x_2} g(\mathbf{x})\right) \ge 0, \quad \forall\, \mathbf{x} \in I^n. \tag{A3.12}$$

The fact that (A3.12) can be evaluated for $x_1$ and $x_2$ rather than two general components $x_i$ and $x_j$ is a consequence of the symmetry of $g$. The conditions (A3.11) and (A3.12) are also necessary and sufficient if $g$ has domain $\mathcal{A} \subset \mathbb{R}^n$, which is not a Cartesian product, provided that (1) $\mathbf{x} \in \mathcal{A} \Rightarrow \boldsymbol{\Pi}\mathbf{x} \in \mathcal{A}$ for all permutation matrices $\boldsymbol{\Pi}$, (2) $\mathcal{A}$ is convex and has nonempty interior, and (3) $g$ is continuously differentiable on the interior of $\mathcal{A}$ and continuous on $\mathcal{A}$. We will not consider tests for strict Schur-convexity in this book.

Example A3.2. Consider the function

$$g(\mathbf{x}) = \sum_{i=1}^{n} \frac{x_i^2}{1 - x_i^2}$$

defined on $I^n = (0, 1)^n$. This function is obviously symmetric on $I^n$. It is also increasing on $I^n$ since, for all $\mathbf{x} \in I^n$,

$$\frac{\partial}{\partial x_i} g(\mathbf{x}) = \frac{2x_i}{(1 - x_i^2)^2} \ge 0, \quad i = 1, \ldots, n.$$

The condition (A3.12) becomes

$$(x_1 - x_2)\left(\frac{2x_1}{(1 - x_1^2)^2} - \frac{2x_2}{(1 - x_2^2)^2}\right) \ge 0, \quad \forall\, \mathbf{x} \in I^n.$$

This can be verified by computing the derivative of

$$f(x) = \frac{2x}{(1 - x^2)^2},$$

which can be seen to be positive for all $x \in I$,

$$f'(x) = \frac{2(1 - x^2)(1 + 3x^2)}{(1 - x^2)^4} > 0.$$

Thus, $g$ is Schur-convex.

A3.2.1

Specialized tests

It is possible to explicitly determine the form of condition (A3.12) for special cases. The following list contains but a small sample of these results.

1. If $I$ is an interval and $h: I \to \mathbb{R}$ is convex, then

$$g(\mathbf{x}) = \sum_{i=1}^{n} h(x_i) \tag{A3.13}$$

is Schur-convex on $I^n$.

2. Let $h$ be a continuous nonnegative function on $I \subset \mathbb{R}$. The function

$$g(\mathbf{x}) = \prod_{i=1}^{n} h(x_i) \tag{A3.14}$$

is Schur-convex on $I^n$ if and only if $\log h(x)$ is convex on $I$.

3. A more general version, which subsumes the first two results, considers the composition of functions of the form

$$g(\mathbf{x}) = f(h(x_1), h(x_2), \ldots, h(x_n)), \tag{A3.15}$$

where $f: \mathbb{R}^n \to \mathbb{R}$ and $h: \mathbb{R} \to \mathbb{R}$. If $f$ is increasing and Schur-convex and $h$ is convex (increasing and convex), then $g$ is Schur-convex (increasing and Schur-convex). Alternatively, if $f$ is decreasing and Schur-convex and $h$ is concave (decreasing and concave), then $g$ is also Schur-convex (increasing and Schur-convex).

These results establish that there is a connection between convexity (in the conventional sense) and Schur-convexity. In addition to the cases listed above, if $g$ is


symmetric and convex, then it is also Schur-convex. The reverse conclusion is not valid.

Example A3.3. Consider again the function $g(\mathbf{x})$ from Example A3.2, which is of the form (A3.13) with

$$h(x_i) = \frac{x_i^2}{1 - x_i^2}.$$

It can be checked that $h(x_i)$ is convex by showing that its second derivative is nonnegative. Therefore, $g(\mathbf{x})$ is Schur-convex. This is exactly what we have done in Example A3.2, thus implicitly deriving the special rule for functions of the form (A3.13).

Example A3.4. Consider a discrete random variable $X$ that takes on $n$ values with probabilities $p_1, p_2, \ldots, p_n$, which we collect in the vector $\mathbf{p} = [p_1, p_2, \ldots, p_n]^T$. The entropy of $X$ is given by

$$H(X) = -\sum_{i=1}^{n} p_i \log p_i.$$

Since $h(x) = x \log x$ is strictly convex, entropy is strictly Schur-concave. Using the result from Example A3.1, it follows that minimum entropy is achieved when probabilities are "most unequal", i.e., $\mathbf{p} = [1, 0, \ldots, 0]^T$ or any permutation thereof. Maximum entropy is achieved when all probabilities are equal, i.e., $\mathbf{p} = [1/n, 1/n, \ldots, 1/n]^T$.
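The Schur-concavity of entropy can be illustrated on a chain of probability vectors ordered by majorization. This is a small plain-Python sketch; the three probability vectors are illustrative, and the natural logarithm is an arbitrary choice of base.

```python
from math import log

def entropy(p):
    """Shannon entropy in nats, using the convention 0 log 0 = 0."""
    return -sum(pi * log(pi) for pi in p if pi > 0)

n = 4
p_uniform = [1 / n] * n            # majorized by every probability vector
p_mid = [0.4, 0.3, 0.2, 0.1]       # lies between the two extremes
p_peaked = [1.0, 0.0, 0.0, 0.0]    # majorizes every probability vector

# p_uniform ≺ p_mid ≺ p_peaked, so the entropies should decrease along the chain.
entropies = [entropy(p) for p in (p_uniform, p_mid, p_peaked)]
```

The first entry equals log n, the maximum, and the last entry is zero, the minimum.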

A3.2.2

Functions defined on $\mathcal{D}$

We know from Result A3.3 that Schur-convex functions are necessarily symmetric if they are defined on $\mathbb{R}^n$. However, it is not required that functions defined on the set of ordered $n$-tuples $\mathcal{D} = \{\mathbf{x}: x_1 \ge x_2 \ge \cdots \ge x_n\}$ be symmetric in order to be Schur-convex.

Result A3.4. Let $g: \mathcal{D} \to \mathbb{R}$ be continuous on $\mathcal{D}$ and continuously differentiable on the interior of $\mathcal{D}$. Then $g$ is Schur-convex if and only if

$$\frac{\partial}{\partial x_i} g(\mathbf{x}) \ge \frac{\partial}{\partial x_{i+1}} g(\mathbf{x}), \quad i = 1, \ldots, n - 1, \tag{A3.16}$$

for all $\mathbf{x}$ in the interior of $\mathcal{D}$. For functions of the form

$$g(\mathbf{x}) = \sum_{i=1}^{n} h_i(x_i), \tag{A3.17}$$

where each $h_i: \mathbb{R} \to \mathbb{R}$ is differentiable, this condition simplifies to

$$h_i'(x_i) \ge h_{i+1}'(x_{i+1}), \quad \forall\, x_i \ge x_{i+1}, \; i = 1, \ldots, n - 1. \tag{A3.18}$$

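Figure A3.1 An orthogonal transform coder. The input $\mathbf{x}$ is passed through the coder $\mathbf{U}^T$ to give $\boldsymbol{\xi}$, which is quantized to $\hat{\boldsymbol{\xi}}$ and passed through the decoder $\mathbf{U}$ to give $\hat{\mathbf{x}}$.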

In some situations, it is possible to assume without loss of generality that the components of vectors are always arranged in decreasing order. This applies, for instance, when there is no a-priori ordering of components, as is the case with eigenvalues and singular values. Under this assumption, majorization is indeed a partial ordering since, on D, x ≺ y ≺ x implies that x = y. In the context of Schur-convex functions, the fact that majorization only defines a preordering rather than a partial ordering on IRn is actually irrelevant. On IRn , x ≺ y ≺ x means that x is a permutation of y. However, since Schur-convex functions defined on IRn are necessarily symmetric (i.e., invariant with respect to permutations), x ≺ y ≺ x implies g(x) = g(y) for all Schur-convex functions g.

A3.3

Eigenvalues and singular values

A3.3.1

Diagonal elements and eigenvalues Result A3.5. Let H be a complex n × n Hermitian matrix with diagonal elements diag(H) = [H11 , H22 , . . ., Hnn ]T and eigenvalues ev(H) = [λ1 , λ2 , . . ., λn ]T . Then, diag(H) ≺ ev(H).

(A3.19)

Combined with the concept of Schur-convexity, this result can be used to establish a number of well-known inequalities (e.g., Hadamard’s determinant inequality) and the optimality of the eigenvalue decomposition for many applications.
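Result A3.5 can be spot-checked by comparing partial sums of the sorted diagonal elements and eigenvalues of a random Hermitian matrix. This is a minimal sketch, assuming NumPy; the test matrix is synthetic and deliberately not positive semidefinite, in line with the remark that this restriction is unnecessary.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5
G = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
H = G + G.conj().T                  # Hermitian, generally indefinite

diag_sorted = np.sort(np.diag(H).real)[::-1]       # diagonal, decreasing order
ev_sorted = np.sort(np.linalg.eigvalsh(H))[::-1]   # eigenvalues, decreasing order

# diag(H) ≺ ev(H): every partial-sum gap is nonnegative and the last one is zero.
gaps = np.cumsum(ev_sorted) - np.cumsum(diag_sorted)
```

The final gap is zero because both vectors sum to the trace of H.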

Example A3.5. Consider the transform coder shown in Fig. A3.1. The real random vector $\mathbf{x} = [x_1, x_2, \ldots, x_n]^T$, which is assumed to be zero-mean Gaussian, is passed through an orthogonal $n \times n$ coder $\mathbf{U}^T$. The output of the coder is $\boldsymbol{\xi} = \mathbf{U}^T\mathbf{x}$, which is subsequently processed by a bank of $n$ scalar quantizers. That is, each component of $\boldsymbol{\xi}$ is independently quantized. The quantizer output $\hat{\boldsymbol{\xi}}$ is then decoded as $\hat{\mathbf{x}} = \mathbf{U}\hat{\boldsymbol{\xi}}$. The transformation $\mathbf{U}$ therefore determines the internal coordinate system in which quantization takes place. It is well known that the mean-squared error $E\|\mathbf{x} - \hat{\mathbf{x}}\|^2$ is minimized by choosing the coder $\mathbf{U}^T$ to contain the $n$ eigenvectors of the covariance matrix $\mathbf{R}_{xx}$, irrespective of how the total bit budget for quantization is distributed over individual components. In this example, we use majorization to prove this result, following ideas by Goyal et al. (2000) and also Schreier and Scharf (2006a).

We first write the quantizer output as $\hat{\boldsymbol{\xi}} = \boldsymbol{\xi} + \mathbf{q}$, where $\mathbf{q} = [q_1, q_2, \ldots, q_n]^T$ denotes additive quantization noise. The variance of $q_i$ can be modeled as $E q_i^2 = d_i f(b_i)$, where $d_i \ge 0$ is the variance of component $\xi_i$, and $f(b_i)$ is a decreasing function of the number of bits $b_i$ spent on quantizing $\xi_i$. The function $f$ characterizes the quantizer. We may now express the MSE as

$$E\|\mathbf{x} - \hat{\mathbf{x}}\|^2 = E\|\mathbf{U}(\boldsymbol{\xi} - \hat{\boldsymbol{\xi}})\|^2 = E\|\mathbf{q}\|^2 = \sum_{i=1}^{n} d_i f(b_i).$$

We can assume without loss of generality that the components of $\boldsymbol{\xi}$ are arranged such that

$$d_i \ge d_{i+1}, \quad i = 1, \ldots, n - 1.$$

Then, the minimum MSE solution requires that $b_i \ge b_{i+1}$, $i = 1, \ldots, n - 1$, because $f(b_i)$ is a decreasing function. With these assumptions, $\mathrm{diag}(\mathbf{R}_{\xi\xi}) = [d_1, d_2, \ldots, d_n]^T$ and $\mathbf{b} = [b_1, b_2, \ldots, b_n]^T$ are both ordered $n$-tuples. For a fixed but arbitrary, ordered, bit assignment vector $\mathbf{b}$, the MSE is a Schur-concave function of the diagonal elements of $\mathbf{R}_{\xi\xi} = \mathbf{U}^T\mathbf{R}_{xx}\mathbf{U}$. In order to show this, we note that the MSE is of the form (A3.17) and $f(b_i) \le f(b_{i+1})$, $i = 1, \ldots, n - 1$, because $f$ is decreasing and $b_i \ge b_{i+1}$. With this in mind, the majorization

$$\mathrm{diag}(\mathbf{U}^T\mathbf{R}_{xx}\mathbf{U}) \prec \mathrm{ev}(\mathbf{R}_{xx})$$

establishes that the MSE is minimized if $\mathbf{U}^T\mathbf{R}_{xx}\mathbf{U}$ is diagonal. Therefore, $\mathbf{U}$ is determined by the eigenvalue decomposition of $\mathbf{R}_{xx}$.

A3.3.2 Diagonal elements and singular values

For an arbitrary nonsquare, non-Hermitian matrix $\mathbf{A} \in \mathbb{C}^{m \times n}$, the diagonal elements and eigenvalues are generally complex, so that majorization would be possible only for absolute values or real parts. Unfortunately, no such relationship exists. However, there is the following comparison, which involves the singular values of $\mathbf{A}$.

Result A3.6. With $p = \min(m, n)$ and $|\mathrm{diag}(\mathbf{A})| \triangleq [|A_{11}|, |A_{22}|, \ldots, |A_{pp}|]^T$, we have the weak majorization

$$|\mathrm{diag}(\mathbf{A})| \prec_w \mathrm{sv}(\mathbf{A}). \tag{A3.20}$$

This generalizes Result A3.5 to arbitrary matrices. Just as Result A3.5 establishes many optimality results for the eigenvalue decomposition, Result A3.6 does so for the singular value decomposition.
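Result A3.6 can be illustrated with a short numerical check. This is only a sketch under stated assumptions: the test matrices are random complex matrices of arbitrarily chosen sizes, and `is_weakly_majorized_by` is a hypothetical helper that tests the partial-sum inequalities of weak majorization without requiring equal totals.

```python
import numpy as np

def is_weakly_majorized_by(x, y, tol=1e-9):
    """Return True if x ≺_w y: every partial sum of the decreasingly sorted x
    is at most the corresponding partial sum of the decreasingly sorted y."""
    cx = np.cumsum(np.sort(x)[::-1])
    cy = np.cumsum(np.sort(y)[::-1])
    return bool(np.all(cx <= cy + tol))

rng = np.random.default_rng(1)
results = []
for m, n in [(3, 5), (5, 3), (4, 4)]:
    A = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))
    abs_diag = np.abs(np.diagonal(A))          # p = min(m, n) diagonal magnitudes
    sv = np.linalg.svd(A, compute_uv=False)    # p singular values, decreasing
    results.append(is_weakly_majorized_by(abs_diag, sv))

print("|diag(A)| ≺_w sv(A) in every trial:", all(results))
```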

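A3.3.3 Partitioned matrices

In Result A3.5, eigenvalues are compared with diagonal elements. We shall now compare the eigenvalues of the $n \times n$ Hermitian matrix

$$\mathbf{H} = \begin{bmatrix} \mathbf{H}_{11} & \mathbf{H}_{12} \\ \mathbf{H}_{12}^H & \mathbf{H}_{22} \end{bmatrix} \tag{A3.21}$$

with the eigenvalues of the corresponding block-diagonal matrix

$$\mathbf{H}_0 = \begin{bmatrix} \mathbf{H}_{11} & \mathbf{0} \\ \mathbf{0} & \mathbf{H}_{22} \end{bmatrix}. \tag{A3.22}$$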

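The submatrices are $\mathbf{H}_{11} \in \mathbb{C}^{n_1 \times n_1}$, $\mathbf{H}_{12} \in \mathbb{C}^{n_1 \times n_2}$, and $\mathbf{H}_{22} \in \mathbb{C}^{n_2 \times n_2}$, where $n = n_1 + n_2$. Obviously, $\mathrm{ev}(\mathbf{H}_0) = (\mathrm{ev}(\mathbf{H}_{11}), \mathrm{ev}(\mathbf{H}_{22}))$, and $\mathrm{tr}\,\mathbf{H} = \mathrm{tr}\,\mathbf{H}_0$.

Result A3.7. There exists the majorization

$$\mathrm{ev}(\mathbf{H}_0) \prec \boldsymbol{\lambda} = \mathrm{ev}(\mathbf{H}). \tag{A3.23}$$

More generally, let

$$\mathbf{H}_\alpha = \begin{bmatrix} \mathbf{H}_{11} & \alpha\mathbf{H}_{12} \\ \alpha\mathbf{H}_{12}^H & \mathbf{H}_{22} \end{bmatrix}. \tag{A3.24}$$

Then,

$$\mathrm{ev}(\mathbf{H}_{\alpha_1}) \prec \mathrm{ev}(\mathbf{H}_{\alpha_2}), \quad 0 \le \alpha_1 < \alpha_2 \le 1. \tag{A3.25}$$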

This shows that the “stronger” the off-diagonal blocks are, the more spread out the eigenvalues become. An immediate consequence is the “block Hadamard inequality” $\det \mathbf{H}_0 \ge \det \mathbf{H}$, or more generally, $\det \mathbf{H}_{\alpha_1} \ge \det \mathbf{H}_{\alpha_2}$. There is another result for block matrices of the form (A3.21) with $n_1 = n_2 = n/2$, due to Thompson and Therianos (1972).

Result A3.8. Assuming that $n_1 = n_2$ and the eigenvalues of $\mathbf{H}$, $\mathbf{H}_{11}$, and $\mathbf{H}_{22}$ are each ordered decreasingly, we have

$$\sum_{i=1}^{k} \left(\lambda_i + \lambda_{n-k+i}\right) \le \sum_{i=1}^{k} \mathrm{ev}_i(\mathbf{H}_{11}) + \sum_{i=1}^{k} \mathrm{ev}_i(\mathbf{H}_{22}), \quad k = 1, \ldots, n_1. \tag{A3.26}$$

These inequalities are reminiscent of majorization. They do not necessarily constitute a majorization because the partial sums on the left-hand side of (A3.26) need not be maximal for every $k$. However, if

$$\lambda_i + \lambda_{n-i+1} \ge \lambda_{i+1} + \lambda_{n-i}, \quad i = 1, \ldots, n_1 - 1, \tag{A3.27}$$

then (A3.26) becomes

$$[\lambda_1 + \lambda_n, \lambda_2 + \lambda_{n-1}, \ldots, \lambda_{n_1} + \lambda_{n-n_1+1}]^T \prec \mathrm{ev}(\mathbf{H}_{11}) + \mathrm{ev}(\mathbf{H}_{22}), \tag{A3.28}$$

which is an actual majorization relation.
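Results A3.7 and A3.8 can also be checked numerically. The sketch below makes several assumptions of its own: the test matrix is a random Hermitian positive definite matrix (so that the determinants are positive), the block sizes are $n_1 = n_2 = 3$, and the grid of $\alpha$ values is arbitrary. The helpers `is_majorized_by`, `ev`, and `H_alpha` are hypothetical names used only here.

```python
import numpy as np

def is_majorized_by(x, y, tol=1e-9):
    """Return True if x ≺ y (decreasing partial sums dominated, equal totals)."""
    cx = np.cumsum(np.sort(x)[::-1])
    cy = np.cumsum(np.sort(y)[::-1])
    return bool(np.all(cx <= cy + tol) and abs(cx[-1] - cy[-1]) <= tol)

def ev(M):
    """Eigenvalues of a Hermitian matrix, ordered decreasingly."""
    return np.linalg.eigvalsh(M)[::-1]

rng = np.random.default_rng(2)
n1 = n2 = 3
n = n1 + n2
B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
H = B @ B.conj().T      # ASSUMED test matrix: Hermitian positive definite

def H_alpha(alpha):
    """Scale the off-diagonal blocks of H by alpha, as in (A3.24)."""
    Ha = H.copy()
    Ha[:n1, n1:] *= alpha
    Ha[n1:, :n1] *= alpha
    return Ha

# Result A3.7: eigenvalues spread out as alpha grows, determinants shrink.
alphas = [0.0, 0.25, 0.5, 0.75, 1.0]
pairs = list(zip(alphas[:-1], alphas[1:]))
chain_ok = all(is_majorized_by(ev(H_alpha(a1)), ev(H_alpha(a2))) for a1, a2 in pairs)
dets = [np.linalg.det(H_alpha(a)).real for a in alphas]
det_ok = all(d1 >= d2 - 1e-9 * abs(d1) for d1, d2 in zip(dets[:-1], dets[1:]))

# Result A3.8: inequality (A3.26) for k = 1, ..., n1.
lam = ev(H)
e11 = ev(H[:n1, :n1])
e22 = ev(H[n1:, n1:])
a326_ok = all(
    np.sum(lam[:k]) + np.sum(lam[n - k:]) <= np.sum(e11[:k]) + np.sum(e22[:k]) + 1e-9
    for k in range(1, n1 + 1)
)

print("majorization chain (A3.25):", chain_ok)
print("determinants nonincreasing in alpha:", det_ok)
print("inequalities (A3.26):", a326_ok)
```

For $k = n_1$, inequality (A3.26) holds with equality because both sides equal $\mathrm{tr}\,\mathbf{H}$, which is why the check uses a small tolerance.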

References

Adali, T. and Calhoun, V. D. (2007). Complex ICA of brain imaging data. IEEE Signal Processing Mag., 24:136–139.
Adali, T., Li, H., Novey, M., and Cardoso, J.-F. (2008). Complex ICA using nonlinear functions. IEEE Trans. Signal Processing, 56:4536–4544.
Amblard, P. O., Gaeta, M., and Lacoume, J. L. (1996a). Statistics for complex variables and signals – part I: variables. Signal Processing, 53:1–13.
Amblard, P. O., Gaeta, M., and Lacoume, J. L. (1996b). Statistics for complex variables and signals – part II: signals. Signal Processing, 53:15–25.
Andersson, S. A. and Perlman, M. D. (1984). Two testing problems relating the real and complex multivariate normal distribution. J. Multivariate Analysis, 15:21–51.
Anttila, L., Valkama, M., and Renfors, M. (2008). Circularity-based I/Q imbalance compensation in wideband direct-conversion receivers. IEEE Trans. Vehicular Techn., 57:2099–2113.
Bangs, W. J. (1971). Array Processing with Generalized Beamformers. Ph.D. thesis, Yale University.
Bartmann, F. C. and Bloomfield, P. (1981). Inefficiency and correlation. Biometrika, 68:67–71.
Besson, O., Scharf, L. L., and Vincent, F. (2005). Matched direction detectors and estimators for array processing with subspace steering vector uncertainties. IEEE Trans. Signal Processing, 53:4453–4463.
Blahut, R. E. (1985). Fast Algorithms for Digital Signal Processing. Reading, MA: Addison-Wesley.
Bloomfield, P. and Watson, G. S. (1975). The inefficiency of least squares. Biometrika, 62:121–128.
Born, M. and Wolf, E. (1999). Principles of Optics. Cambridge: Cambridge University Press.
Boyles, R. A. and Gardner, W. A. (1983). Cycloergodic properties of discrete parameter nonstationary stochastic processes. IEEE Trans. Inform. Theory, 29:105–114.
Brandwood, D. H. (1983). A complex gradient operator and its application in adaptive array theory. IEE Proceedings H, 130:11–16.
Brown, W. M. and Crane, R. B. (1969). Conjugate linear filtering. IEEE Trans. Inform. Theory, 15:462–465.
Burt, W., Cummings, T., and Paulson, C. A. (1974). Mesoscale wind field over ocean. J. Geophys. Res., 79:5625–5632.
Buzzi, S., Lops, M., and Sardellitti, S. (2006). Widely linear reception strategies for layered space-time wireless communications. IEEE Trans. Signal Processing, 54:2252–2262.
Cacciapuoti, A. S., Gelli, G., and Verde, F. (2007). FIR zero-forcing multiuser detection and code designs for downlink MC-CDMA. IEEE Trans. Signal Processing, 55:4737–4751.
Calman, J. (1978). On the interpretation of ocean current spectra. J. Physical Oceanography, 8:627–652.


Charge, P., Wang, Y., and Saillard, J. (2001). A non-circular sources direction finding method using polynomial rooting. Signal Processing, 81:1765–1770.
Chen, M., Chen, Z., and Chen, G. (1997). Approximate Solutions of Operator Equations. Singapore: World Scientific.
Chevalier, P. and Blin, A. (2007). Widely linear MVDR beamformers for the reception of an unknown signal corrupted by noncircular interferences. IEEE Trans. Signal Processing, 55:5323–5336.
Chevalier, P. and Picinbono, B. (1996). Complex linear-quadratic systems for detection and array processing. IEEE Trans. Signal Processing, 44:2631–2634.
Chevalier, P. and Pipon, F. (2006). New insights into optimal widely linear array receivers for the demodulation of BPSK, MSK, and GMSK interferences – application to SAIC. IEEE Trans. Signal Processing, 54:870–883.
Cohen, L. (1966). Generalized phase-space distribution functions. J. Math. Phys., 7:781–786.
Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36:287–314.
Conte, E., De Maio, A., and Ricci, G. (2001). GLRT-based adaptive detection algorithms for range-spread targets. IEEE Trans. Signal Processing, 49:1336–1348.
Coxhead, P. (1974). Measuring the relationship between two sets of variables. Brit. J. Math. Statist. Psych., 27:205–212.
Cramer, E. M. and Nicewander, W. A. (1979). Some symmetric, invariant measures of multivariate association. Psychometrika, 44:43–54.
Davis, M. C. (1963). Factoring the spectral matrix. IEEE Trans. Automatic Control, 8:296–305.
DeLathauwer, L. and DeMoor, B. (2002). On the blind separation of non-circular sources, in Proc. EUSIPCO, pp. 99–102.
Delmas, J. P. (2004). Asymptotically minimum variance second-order estimation for noncircular signals with application to DOA estimation. IEEE Trans. Signal Processing, 52:1235–1241.
Delmas, J. P. and Abeida, H. (2004). Stochastic Cramér–Rao bound for noncircular signals with application to DOA estimation. IEEE Trans. Signal Processing, 52:3192–3199.
Dietl, G., Zoltowski, M. D., and Joham, M. (2001). Recursive reduced-rank adaptive equalization for wireless communication, Proc. SPIE, vol. 4395.
Drury, S. W. (2002). The canonical correlations of a 2 × 2 block matrix with given eigenvalues. Lin. Algebra Appl., 354:103–117.
Drury, S. W., Liu, S., Lu, C.-Y., Puntanen, S., and Styan, G. P. H. (2002). Some comments on several matrix inequalities with applications to canonical correlations: historical background and recent development. Sankhya A, 64:453–507.
Eaton, M. L. (1983). Multivariate Statistics: A Vector Space Approach. New York: Wiley.
Eriksson, J. and Koivunen, V. (2006). Complex random vectors and ICA models: identifiability, uniqueness, and separability. IEEE Trans. Inform. Theory, 52:1017–1029.
Erkmen, B. I. (2008). Phase-Sensitive Light: Coherence Theory and Applications to Optical Imaging. Ph.D. thesis, Massachusetts Institute of Technology.
Erkmen, B. I. and Shapiro, J. H. (2006). Optical coherence theory for phase-sensitive light, in Proc. SPIE, volume 6305. Bellingham, WA: The International Society for Optical Engineering.
Fang, K.-T., Kotz, S., and Ng, K. W. (1990). Symmetric Multivariate and Related Distributions. London: Chapman and Hall.


Ferguson, T. S. (1967). Mathematical Statistics: A Decision Theoretic Approach. New York: Academic Press.
Ferrara, E. R. (1985). Frequency-domain implementations of periodically time-varying filters. IEEE Trans. Acoustics, Speech, Signal Processing, 33:883–892.
Flandrin, P. (1999). Time–Frequency/Time–Scale Analysis. San Diego: Academic.
Gardner, W. A. (1986). Measurement of spectral correlation. IEEE Trans. Acoust. Speech Signal Processing, 34:1111–1123.
Gardner, W. A. (1988). Statistical Spectral Analysis. Englewood Cliffs, NJ: Prentice Hall.
Gardner, W. A. (1993). Cyclic Wiener filtering: theory and method. IEEE Trans. Commun., 41:151–163.
Gardner, W. A., Brown, W. A., and Chen, C.-K. (1987). Spectral correlation of modulated signals, part II: digital modulation. IEEE Trans. Commun., 35:595–601.
Gardner, W. A., Napolitano, A., and Paura, L. (2006). Cyclostationarity: half a century of research. Signal Processing, 86:639–697.
Gelli, G., Paura, L., and Ragozini, A. R. P. (2000). Blind widely linear multiuser detection. IEEE Commun. Lett., 4:187–189.
Gersho, A. and Gray, R. M. (1992). Vector Quantization and Signal Compression. Boston, MA: Kluwer.
Gerstacker, H., Schober, R., and Lampe, A. (2003). Receivers with widely linear processing for frequency-selective channels. IEEE Trans. Commun., 51:1512–1523.
Gladyshev, E. (1963). Periodically and almost periodically correlated random processes with continuous time parameter. Theory Probab. Applic., 8:173–177.
Gladyshev, E. D. (1961). Periodically correlated random sequences. Soviet Math. Dokl., 2:385–388.
Gleason, T. C. (1976). On redundancy in canonical analysis. Psych. Bull., 83:1004–1006.
Goh, S. L. and Mandic, D. P. (2007a). An augmented ACRTRL for complex valued recurrent neural networks. Neural Networks, 20:1061–1066.
Goh, S. L. and Mandic, D. P. (2007b). An augmented extended Kalman filter algorithm for complex-valued recurrent neural networks. Neural Computation, 19:1039–1055.
Goldstein, J. S., Reed, I. S., and Scharf, L. L. (1998). A multistage representation of the Wiener filter based on orthogonal projections. IEEE Trans. Inform. Theory, 44:2943–2959.
Gonella, J. (1972). A rotary-component method for analysing meteorological and oceanographic vector time series. Deep-Sea Res., 19:833–846.
Goodman, N. R. (1963). Statistical analysis based on a certain multivariate complex Gaussian distribution (an introduction). Ann. Math. Statist., 34:152–177.
Gorman, J. D. and Hero, A. (1990). Lower bounds for parametric estimation with constraints. IEEE Trans. Inform. Theory, 26:1285–1301.
Goyal, V. K., Zhuang, J., and Vetterli, M. (2000). Transform coding with backward adaptive updates. IEEE Trans. Inform. Theory, 46:1623–1633.
Grettenberg, T. L. (1965). A representation theorem for complex normal processes. IEEE Trans. Inform. Theory, 11:305–306.
Haardt, M. and Roemer, F. (2004). Enhancements of unitary ESPRIT for non-circular sources, in Proc. ICASSP, pp. 101–104.
Hanson, B., Klink, K., Matsuura, K., Robeson, S. M., and Willmott, C. J. (1992). Vector correlation: review, exposition, and geographic application. Ann. Association Am. Geographers, 82:103–116.
Hanssen, A. and Scharf, L. (2003). A theory of polyspectra for nonstationary stochastic processes. IEEE Trans. Signal Processing, 51:1243–1252.


Hindberg, H., Birkelund, Y., Øigård, T. A., and Hanssen, A. (2006). Kernel-based estimators for the Kirkwood–Rihaczek time-frequency spectrum, in Proc. European Signal Processing Conference.
Horn, R. A. and Johnson, C. R. (1985). Matrix Analysis. Cambridge: Cambridge University Press.
Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28:321–377.
Hua, Y., Nikpour, M., and Stoica, P. (2001). Optimal reduced-rank estimation and filtering. IEEE Trans. Signal Processing, 49:457–469.
Izzo, L. and Napolitano, A. (1997). Higher-order statistics for Rice’s representation of cyclostationary signals. Signal Processing, 56:179–292.
Izzo, L. and Napolitano, A. (1998). Multirate processing of time-series exhibiting higher-order cyclostationarity. IEEE Trans. Signal Processing, 46:429–439.
Jeon, J. J., Andrews, J. G., and Sung, K. M. (2006). The blind widely linear minimum output energy algorithm for DS-CDMA systems. IEEE Trans. Signal Processing, 54:1926–1931.
Jezek, J. and Kucera, V. (1985). Efficient algorithm for matrix spectral factorization. Automatica, 21:663–669.
Jones, A. G. (1979). On the difference between polarisation and coherence. J. Geophys., 45:223–229.
Jones, R. C. (1941). New calculus for the treatment of optical systems. J. Opt. Soc. Am., 31:488–493.
Jouny, I. I. and Moses, R. L. (1992). The bispectrum of complex signals: definitions and properties. IEEE Trans. Signal Processing, 40:2833–2836.
Jupp, P. E. and Mardia, K. V. (1980). A general correlation coefficient for directional data and related regression problems. Biometrika, 67:163–173.
Kelly, E. J. (1986). An adaptive detection algorithm. IEEE Trans. Aerosp. Electron. Syst., 22:115–127.
Kelly, E. J. and Root, W. L. (1960). A representation of vector-valued random processes. Group Rept. MIT Lincoln Laboratory, no. 55–21.
Kraut, S., Scharf, L. L., and McWhorter, L. T. (2001). Adaptive subspace detectors. IEEE Trans. Signal Processing, 49:1–16.
Lampe, A., Schober, R., and Gerstacker, W. (2002). A novel iterative multiuser detector for complex modulation schemes. IEEE J. Sel. Areas Commun., 20:339–350.
Lee, E. A. and Messerschmitt, D. G. (1994). Digital Communication. Boston, MA: Kluwer.
Lehmann, E. L. and Romano, J. P. (2005). Testing Statistical Hypotheses, 2nd edn. New York: Springer.
Li, H. and Adali, T. (2008). A class of complex ICA algorithms based on the kurtosis cost function. IEEE Trans. Neural Networks, 19:408–420.
Lilly, J. M. and Gascard, J.-C. (2006). Wavelet ridge diagnosis of time-varying elliptical signals with application to an oceanic eddy. Nonlin. Processes Geophys., 13:467–483.
Lilly, J. M. and Park, J. (1995). Multiwavelet spectral and polarization analysis. Geophys. J. Int., 122:1001–1021.
Loève, M. (1978). Probability Theory II, 4th edn. New York: Springer.
Mandic, D. P. and Goh, V. S. L. (2009). Complex Valued Nonlinear Adaptive Filters. New York: Wiley.
Mardia, K. V., Kent, J. T., and Bibby, J. M. (1979). Multivariate Analysis. New York: Academic.
Marshall, A. W. and Olkin, I. (1979). Inequalities: Theory of Majorization and Its Applications. New York: Academic.


Martin, W. (1982). Time–frequency analysis of random signals, in Proc. Int. Conf. Acoustics., Speech, Signal Processing, pp. 1325–1328.
Martin, W. and Flandrin, P. (1985). Wigner–Ville spectral analysis of nonstationary processes. IEEE Trans. Acoustics, Speech, Signal Processing, 33:1461–1470.
Marzetta, T. L. (1993). A simple derivation of the constrained multiple parameter Cramer–Rao bound. IEEE Trans. Signal Processing, 41:2247–2249.
McWhorter, L. T. and Scharf, L. (1993a). Geometry of the Cramer–Rao bound. Signal Processing, 31:301–311.
McWhorter, L. T. and Scharf, L. (1993b). Cramér–Rao bounds for deterministic modal analysis. IEEE Trans. Signal Processing, 41:1847–1865.
McWhorter, L. T. and Scharf, L. (1993c). Properties of quadratic covariance bounds, in Proc. 27th Asilomar Conf. Signals, Systems, Computers, pp. 1176–1180.
McWhorter, T. and Schreier, P. (2003). Widely-linear beamforming, in Proc. 37th Asilomar Conf. Signals, Systems, Comput., pp. 753–759.
Mirbagheri, A., Plataniotis, N., and Pasupathy, S. (2006). An enhanced widely linear CDMA receiver with OQPSK modulation. IEEE Trans. Commun., 54:261–272.
Mitra, S. K. (2006). Digital Signal Processing, 3rd edn. Boston, MA: McGraw-Hill.
Molle, J. W. D. and Hinich, M. J. (1995). Trispectral analysis of stationary random time series. IEEE Trans. Signal Processing, 97:2963–2978.
Mooers, C. N. K. (1973). A technique for the cross spectrum analysis of pairs of complex-valued time series, components and rotational invariants. Deep-Sea Res., 20:1129–1141.
Morgan, D. R. (2006). Variance and correlation of square-law detected allpass channels with bandpass harmonic signals in Gaussian noise. IEEE Trans. Signal Processing, 54:2964–2975.
Morgan, D. R. and Madsen, C. K. (2006). Wide-band system identification using multiple tones with allpass filters and square-law detectors. IEEE Trans. Circuits Systems I, 53:1151–1165.
Mueller, H. (1948). The foundation of optics. J. Opt. Soc. Am., 38:661–672.
Mullis, C. T. and Scharf, L. L. (1996). Applied Probability. Course notes for ECE5612, Univ. of Colorado, Boulder, CO.
Napolitano, A. and Spooner, C. M. (2001). Cyclic spectral analysis of continuous-phase modulated signals. IEEE Trans. Signal Processing, 49:30–44.
Napolitano, A. and Tanda, M. (2004). Doppler-channel blind identification for noncircular transmissions in multiple-access systems. IEEE Trans. Commun., 52:2073–2078.
Navarro-Moreno, J., Ruiz-Molina, J. C., and Fernández-Alcalá, R. M. (2006). Approximate series representations of linear operations on second-order stochastic processes: application to simulation. IEEE Trans. Inform. Theory, 52:1789–1794.
Neeser, F. D. and Massey, J. L. (1993). Proper complex random processes with applications to information theory. IEEE Trans. Inform. Theory, 39:1293–1302.
Nilsson, R., Sjoberg, F., and LeBlanc, J. P. (2003). A rank-reduced LMMSE canceller for narrowband interference suppression in OFDM-based systems. IEEE Trans. Commun., 51:2126–2140.
Novey, M. and Adali, T. (2008a). Complex ICA by negentropy maximization. IEEE Trans. Neural Networks, 19:596–609.
Novey, M. and Adali, T. (2008b). On extending the complex FastICA algorithm to noncircular sources. IEEE Trans. Signal Processing, 56:2148–2154.
Olhede, S. and Walden, A. T. (2003a). Polarization phase relationships via multiple Morse wavelets. I. Fundamentals. Proc. Roy. Soc. Lond. A Mat., 459:413–444.


Olhede, S. and Walden, A. T. (2003b). Polarization phase relationships via multiple Morse wavelets. II. Data analysis. Proc. Roy. Soc. Lond. A Mat., 459:641–657.
Ollila, E. (2008). On the circularity of a complex random variable. IEEE Signal Processing Lett., 15:841–844.
Ollila, E. and Koivunen, V. (2004). Generalized complex elliptical distributions, in Proc. SAM Workshop, pp. 460–464.
Ollila, E. and Koivunen, V. (2009). Complex ICA using generalized uncorrelating transform. Signal Processing, 89:365–377.
Pflug, L. A., Ioup, G. E., Ioup, J. W., and Field, R. L. (1992). Properties of higher-order correlations and spectra for bandlimited, deterministic transients. J. Acoust. Soc. Am., 91:975–988.
Picinbono, B. (1994). On circularity. IEEE Trans. Signal Processing, 42:3473–3482.
Picinbono, B. (1996). Second-order complex random vectors and normal distributions. IEEE Trans. Signal Processing, 44:2637–2640.
Picinbono, B. and Bondon, P. (1997). Second-order statistics of complex signals. IEEE Trans. Signal Processing, 45:411–419.
Picinbono, B. and Chevalier, P. (1995). Widely linear estimation with complex data. IEEE Trans. Signal Processing, 43:2030–2033.
Picinbono, B. and Duvaut, P. (1988). Optimal linear-quadratic systems for detection and estimation. IEEE Trans. Inform. Theory, 34:304–311.
Poor, H. V. (1998). An Introduction to Signal Detection and Estimation. New York: Springer.
Proakis, J. G. (2001). Digital Communications, 4th edn. Boston, MA: McGraw-Hill.
Ramsay, J. O., ten Berge, J., and Styan, G. P. H. (1984). Matrix correlation. Psychometrika, 49:402–423.
Reed, I. S., Mallett, J. D., and Brennan, L. E. (1974). Rapid convergence rate in adaptive arrays. IEEE Trans. Aerosp. Electron. Syst., 10:853–863.
Renyi, A. (1959). On measures of dependence. Acta Math. Acad. Sci. Hungary, 10:441–451.
Richmond, C. D. (2006). Mean-squared error and threshold SNR prediction of maximum likelihood signal parameter estimation with estimated colored noise covariances. IEEE Trans. Inform. Theory, 52:2146–2164.
Rihaczek, A. W. (1968). Signal energy distribution in time and frequency. IEEE Trans. Inform. Theory, 14:369–374.
Rivet, B., Girin, L., and Jutten, C. (2007). Log-Rayleigh distribution: a simple and efficient statistical representation of log-spectral coefficients. IEEE Trans. Audio, Speech, Language Processing, 15:796–802.
Robert, P. and Escoufier, Y. (1976). A unifying tool for linear multivariate statistical methods: the RV-coefficient. Appl. Statist., 25:257–265.
Robey, F. C., Fuhrmann, D. R., Kelly, E. J., and Nitzberg, R. A. (1992). A CFAR adaptive matched filter detector. IEEE Trans. Aerosp. Electron. Syst., 28:208–216.
Römer, F. and Haardt, M. (2007). Deterministic Cramér–Rao bounds for strict sense non-circular sources, in Proc. ITG/IEEE Workshop on Smart Antennas.
Römer, F. and Haardt, M. (2009). Multidimensional unitary tensor-esprit for non-circular sources, in Proc. ICASSP.
Roueff, A., Chanussot, J., and Mars, J. I. (2006). Estimation of polarization parameters using time-frequency representations and its application to waves separation. Signal Processing, 86:3714–3731.
Rozanov, Y. A. (1963). Stationary Random Processes. San Francisco, CA: Holden-Day.
Rozeboom, W. W. (1965). Linear correlation between sets of variables. Psychometrika, 30:57–71.


Rubin-Delanchy, P. and Walden, A. T. (2007). Simulation of improper complex-valued sequences. IEEE Trans. Signal Processing, 55:5517–5521.
Rubin-Delanchy, P. and Walden, A. T. (2008). Kinematics of complex-valued time series. IEEE Trans. Signal Processing, 56:4189–4198.
Rykaczewski, P., Valkama, M., and Renfors, M. (2008). On the connection of I/Q imbalance and channel equalization in direct-conversion transceivers. IEEE Trans. Vehicular Techn., 57:1630–1636.
Sampson, P. D., Streissguth, A. P., Barr, H. M., and Bookstein, F. L. (1989). Neurobehavioral effects of prenatal alcohol: part II. Partial least squares analysis. Neurotoxicology Teratology, 11:477–491.
Samson, J. C. (1980). Comments on polarization and coherence. J. Geophys., 48:195–198.
Scharf, L. L. (1991). Statistical Signal Processing. Reading, MA: Addison-Wesley.
Scharf, L. L., Chong, E. K. P., Zoltowski, M. D., Goldstein, J. S., and Reed, I. S. (2008). Subspace expansion and the equivalence of conjugate direction and multistage Wiener filters. IEEE Trans. Signal Processing, 56:5013–5019.
Scharf, L. L. and Friedlander, B. (1994). Matched subspace detectors. IEEE Trans. Signal Processing, 42:2146–2157.
Scharf, L. L. and Friedlander, B. (2001). Toeplitz and Hankel kernels for estimating time-varying spectra of discrete-time random processes. IEEE Trans. Signal Processing, 49:179–189.
Scharf, L. L., Schreier, P. J., and Hanssen, A. (2005). The Hilbert space geometry of the Rihaczek distribution for stochastic analytic signals. IEEE Signal Processing Lett., 12:297–300.
Schreier, P. J. (2008a). Bounds on the degree of impropriety of complex random vectors. IEEE Signal Processing Lett., 15:190–193.
Schreier, P. J. (2008b). Polarization ellipse analysis of nonstationary random signals. IEEE Trans. Signal Processing, 56:4330–4339.
Schreier, P. J. (2008c). A unifying discussion of correlation analysis for complex random vectors. IEEE Trans. Signal Processing, 56:1327–1336.
Schreier, P. J., Adalı, T., and Scharf, L. L. (2009). On ICA of improper and noncircular sources, in Proc. Int. Conf. Acoustics, Speech, Signal Processing.
Schreier, P. J. and Scharf, L. L. (2003a). Second-order analysis of improper complex random vectors and processes. IEEE Trans. Signal Processing, 51:714–725.
Schreier, P. J. and Scharf, L. L. (2003b). Stochastic time-frequency analysis using the analytic signal: why the complementary distribution matters. IEEE Trans. Signal Processing, 51:3071–3079.
Schreier, P. J. and Scharf, L. L. (2006a). Canonical coordinates for transform coding of noisy sources. IEEE Trans. Signal Processing, 54:235–243.
Schreier, P. J. and Scharf, L. L. (2006b). Higher-order spectral analysis of complex signals. Signal Processing, 86:3321–3333.
Schreier, P. J., Scharf, L. L., and Hanssen, A. (2006). A generalized likelihood ratio test for impropriety of complex signals. IEEE Signal Processing Lett., 13:433–436.
Schreier, P. J., Scharf, L. L., and Mullis, C. T. (2005). Detection and estimation of improper complex random signals. IEEE Trans. Inform. Theory, 51:306–312.
Schur, I. (1923). Über eine Klasse von Mittelbildungen mit Anwendungen auf die Determinantentheorie. Sitzungsber. Berliner Math. Ges., 22:9–20.
Serpedin, E., Panduru, F., Sari, I., and Giannakis, G. B. (2005). Bibliography on cyclostationarity. Signal Processing, 85:2233–2303.
Shapiro, J. H. and Erkmen, B. I. (2007). Imaging with phase-sensitive light, in Int. Conf. on Quantum Information. New York: Optical Society of America.


Slepian, D. (1954). Estimation of signal parameters in the presence of noise. Trans. IRE Prof. Group Inform. Theory, 3:68–89.
Spurbeck, M. S. and Mullis, C. T. (1998). Least squares approximation of perfect reconstruction filter banks. IEEE Trans. Signal Processing, 46:968–978.
Spurbeck, M. S. and Schreier, P. J. (2007). Causal Wiener filter banks for periodically correlated time series. Signal Processing, 87:1179–1187.
Srivastava, M. S. (1965). On the complex Wishart distribution. Ann. Math. Statist., 36:313–315.
Stewart, D. and Love, W. (1968). A general canonical correlation index. Psych. Bull., 70:160–163.
Stoica, P. and Ng, B. C. (1998). On the Cramer–Rao bound under parametric constraints. IEEE Signal Processing Lett., 5:177–179.
Stokes, G. G. (1852). On the composition and resolution of streams of polarized light from different sources. Trans. Cambr. Phil. Soc., 9:399–423.
Tauböck, G. (2007). Complex noise analysis of DMT. IEEE Trans. Signal Processing, 55:5739–5754.
Thompson, R. C. and Therianos, S. (1972). Inequalities connecting the eigenvalues of a Hermitian matrix with the eigenvalues of complementary principal submatrices. Bull. Austral. Math. Soc., 6:117–132.
Tishler, A. and Lipovetsky, S. (2000). Modelling and forecasting with robust canonical analysis: method and application. Comput. Op. Res., 27:217–232.
van den Bos, A. (1994a). Complex gradient and Hessian. IEE Proc. Vision, Image, Signal Processing, 141:380–383.
van den Bos, A. (1994b). A Cramér–Rao lower bound for complex parameters. IEEE Trans. Signal Processing, 42:2859.
van den Bos, A. (1995). The multivariate complex normal distribution – a generalization. IEEE Trans. Inform. Theory, 41:537–539.
Van Trees, H. L. (2001). Detection, Estimation, and Modulation Theory: Part I. New York: Wiley.
Van Trees, H. L. and Bell, K. L., editors (2007). Bayesian Bounds for Parameter Estimation and Nonlinear Filtering/Tracking. New York: IEEE and Wiley-Interscience.
Wahlberg, P. and Schreier, P. J. (2008). Spectral relations for multidimensional complex improper stationary and (almost) cyclostationary processes. IEEE Trans. Inform. Theory, 54:1670–1682.
Walden, A. T. and Rubin-Delanchy, P. (2009). On testing for impropriety of complex-valued Gaussian vectors. IEEE Trans. Signal Processing, 57:825–834.
Weinstein, E. and Weiss, A. J. (1988). A general class of lower bounds in parameter estimation. IEEE Trans. Inform. Theory, 34:338–342.
Weippert, M. E., Hiemstra, J. D., Goldstein, J. S., and Zoltowski, M. D. (2002). Insights from the relationship between the multistage Wiener filter and the method of conjugated gradients, Proc. IEEE Workshop on Sensor Array Multichannel Signal Processing, pp. 388–392.
Weiss, A. J. and Weinstein, E. (1985). A lower bound on the mean squared error in random parameter estimation. IEEE Trans. Inform. Theory, 31:680–682.
Wiener, N. and Masani, P. (1957). The prediction theory of multivariate stochastic processes, part I. Acta Math., 98:111–150.
Wiener, N. and Masani, P. (1958). The prediction theory of multivariate stochastic processes, part II. Acta Math., 98:93–137.


Wilson, G. T. (1972). The factorization of matricial spectral densities. SIAM J. Appl. Math., 23:420–426.
Wirtinger, W. (1927). Zur formalen Theorie der Funktionen von mehr komplexen Veränderlichen. Math. Ann., 97:357–375.
Witzke, M. (2005). Linear and widely linear filtering applied to iterative detection of generalized MIMO signals. Ann. Telecommun., 60:147–168.
Wold, H. (1975). Path models with latent variables: the NIPALS approach, in Blalock, H. M., editor, Quantitative Sociology: International Perspectives on Mathematical and Statistical Modeling. New York: Academic, pp. 307–357.
Wold, H. (1985). Partial least squares, in Kotz, S. and Johnson, N. L., editors, Encyclopedia of the Statistical Sciences, New York: Wiley, pp. 581–591.
Wolf, E. (1959). Coherence properties of partially polarized electromagnetic radiation. Nuovo Cimento, 13:1165–1181.
Wooding, R. A. (1956). The multivariate distribution of complex normal variables. Biometrika, 43:212–215.
Yanai, H. (1974). Unification of various techniques of multivariate analysis by means of generalized coefficient of determination (G.C.D.). J. Behaviormetrics, 1:45–54.
Yoon, Y. C. and Leib, H. (1997). Maximizing SNR in improper complex noise and applications to CDMA. IEEE Commun. Letters, 1:5–8.
Youla, D. C. (1961). On the factorization of rational matrices. IRE Trans. Inform. Theory, 7:172–189.
Zou, Y., Valkama, M., and Renfors, M. (2008). Digital compensation of I/Q imbalance effects in space-time coded transmit diversity systems. IEEE Trans. Signal Processing, 56:2496–2508.

Index

aliasing, 236 almost cyclostationary process, 251 almost periodic function, 251 ambiguity function, 232 analytic function, 280 analytic signal, 11 higher-order spectra, 218–221 nonstationary, 235, 245 Nth-order circular, 221 WSS, 201 applications (survey), 27 augmented covariance function, 55 covariance matrix, 34, 36 eigenvalue decomposition, 62 expansion-coefficient matrix, 161 frequency-response matrix, 201 information matrix, 161 matrix, 32 mean vector, 34 PSD matrix, 55 sensitivity matrix, 161 vector, 31 baseband signal, 9, 201 Bayes detector, 180 Bayes risk, 180 Bayesian point of view, 116, 152–157 beamformer, 143 Bedrosian’s theorem, 14 best linear unbiased estimator, see LMVDR estimation bicorrelation, 217 bispectrum, 217 blind source separation, 81–84 BLUE, see LMVDR estimation BPSK spectrum, 257 C-ESD, 233 C-PSD, 55, 197 connection with C-ESD, 233 nonstationary process, 234 canonical correlations, 93–100

between x and x∗ , 65 for rank reduction, 135–137 invariance properties, 97–100 Cauchy distribution, 46 Cauchy–Riemann equations, 280, 281 CCA, see canonical correlations CFAR matched subspace detector, 193 chain rule (for differentiating non-holomorphic functions), 281 characteristic function, 49–50 Cholesky decomposition, 272 circular, 53 Nth-order, 54, 221 circularity coefficients, 65–69 circularity spectrum, 65–69 circularly polarized, 7, 206, 246 Cohen’s class, 240 coherence between rotary components (WSS), 211–216 cyclic spectral (cyclostationary), 254 time–frequency (nonstationary), 238, 244, 247 complementary covariance function, 55 covariance matrix, 34 characterization, 36, 69 energy spectral density, see C-ESD expansion-coefficient matrix, 161 Fisher information, 167 information matrix, 161 power spectral density, see C-PSD Rihaczek distribution, see Rihaczek distribution sensitivity matrix, 161 spectral correlation, see spectral correlation variance, 22 complex . . ., see corresponding entry without “complex” concentration ellipsoid, 127 conditional mean estimator, 119–121 conformal function, 280 conjugate covariance, see complementary covariance conjugate differential operator, 279 contour (pdf), 42, 46


correlation analysis, 85–110 correlation coefficient, 22, 42, 86–93, 102–108 based on canonical correlations, 103–106 based on half-canonical correlations, 106–107 based on PLS correlations, 108 for real data, 86 reflectional, 87–91, 94–97 rotational, 87–91, 94–97 total, 87–91, 94–97 correlation function, see covariance function correlation spread, 108 covariance function, 54 CR-TFD, see Rihaczek distribution Cramér–Loève spectral representation, see spectral representation Cramér–Rao bound, 163–170 stochastic, 171–174 CS process, see cyclostationary process cumulant, 52 cycle frequency, 252 cyclic complementary correlation function, 252 complementary power spectral density, 252–260 correlation function, 252 periodogram, 255 power spectral density, 252–260 spectral coherence, 254 Wiener filter, 260–268 cycloergodic, 268 cyclostationary process, 250–268 connection with vector WSS process, 262 estimation of, 260–268 spectral representation, 251–253 damped harmonic oscillator, 5 decreasing function, 289 deflection, 184 degree of impropriety, 70–77 degree of polarization, 109, 211–215, 245 demodulation, 13 detection, 177–194 nonstationary process, 230 probability, 178 DFT, 17 differential calculus, 277–286 differential entropy, see entropy discrete Fourier transform, 17 dual-frequency spectrum, see spectral correlation efficient estimator, 159, 171 eigenfunction, 224 eigenvalue decomposition, 62, 270 electromagnetic polarization, 6 elliptical distribution, 44–47 energy signal, 234

energy spectral density, see ESD
entropy, 37, 67
envelope, 9
error score, 153
ESD, 233–234
estimation, 116–149
  cyclostationary process, 260–268
  linear MMSE, 121–129
  linear MVDR, 137–144
  MMSE, 119–121
  nonstationary process, 227–229
  performance bounds, 151–175
  reduced-rank, 132–137
  widely linear MMSE, 129–132
  widely linear MVDR, 143
  widely linear–quadratic, 144–149
  WSS process, see Wiener filter
EVD, see eigenvalue decomposition
expansion-coefficient matrix, 158
  augmented, 161
  complementary, 161
expectation operator, 38, 153
extreme point, 70
false-alarm probability, 178
fast Fourier transform, 17
FFT, 17
Fischer determinant inequality, 274
Fisher information matrix, 162–170
  complementary, 167
Fisher score, 162–170
Fisher–Bayes bound, 171–174
Fisher–Bayes score, 171–174
four-corners diagram, 231
frequency-shift filter, 260–268
frequentist point of view, 116, 152–157
FRESH filter, 260–268
Gaussian distribution, 19–23, 39–44
  characteristic function, 50
generalized complex differential operator, 278
generalized likelihood-ratio test, see GLRT
generalized sidelobe canceler, 139–142
generating matrix (elliptical pdf), 45
global frequency variable, 231–232
global time variable, 231–232
GLRT for
  correlation structure, 110–114
  impropriety, 77–81
  independence, 112
  sphericity, 112
goniometer, 142
gradient, 280–282
Grammian matrix, 119, 272
Grettenberg’s theorem, 53
GSC, 139–142

Hadamard’s determinant inequality, 274, 287
half-canonical correlations, 93–97, 100
  for rank reduction, 133–135
  invariance properties, 100
harmonic oscillator, 5
harmonizable, 230
Hermitian matrix, 270
Hessian matrix, 283–286
higher-order moments, 50–54, 217–222, 247
higher-order spectra, 217–222, 248
  analytic signal, 219–222
Hilbert space of random variables, 117–119
Hilbert transform, 11–15
holomorphic function, 279–280
HR-TFD, see Rihaczek distribution
hypothesis test, see test
I/Q imbalance, 71
ICA, 81–84
improper, 35
  maximally, 22, 44, 76
impropriety
  degree of, 70–77
  test for, 77–81
in-phase component, 9
increasing function, 289
independent component analysis, 81–84
information matrix, 158
  augmented, 161
  complementary, 161
  Fisher, 162–170
inner product, 33, 117–119
instantaneous
  amplitude, 14
  frequency, 14
  phase, 14
invariant statistic, 181
invariant test, 189
inverse matrix, 274–276
Jacobian, 280
jointly proper, 41
Jones calculus, 7, 214
Jones vector, 7, 213–215
Karhunen–Loève expansion, 224
Karlin–Rubin theorem, 188
Kramers–Kronig relation, 12
Lanczos kernel, 167
left-circularly polarized, 7, 208
level curve (pdf), 42, 46
likelihood-ratio test, 179
linear minimum mean-squared error, see LMMSE
linear minimum variance distortionless response estimation, see LMVDR estimation

linear–conjugate-linear, see widely linear
linearly polarized, 6, 208
Lissajous figure, 6
LMMSE estimation, 121–129
  gain, 127
  Gaussian, 128
LMVDR estimation, 137–144, 156
  widely linear, 143
Loève spectrum, see spectral correlation
local frequency variable, 231–232
local time variable, 231–232
log-likelihood ratio, 179
majorization, 287–295
  weak, 288
matched filter, 138, 144, 183
  noncoherent adaptive, 143
matched subspace detector, 190–194
  CFAR, 193
matrix
  factorizations, 270–272
  Grammian, 272
  Hermitian, 270
  inverse, 274–276
  normal, 270
  partial ordering, 273
  positive definite, 272
  positive semidefinite, 272
  pseudo-inverse, 275
  square root, 272
maximal invariant, 98
maximally improper, 22, 44, 76
maximum likelihood, see ML
measurement score, 154
Mercer’s expansion, 224
minimal statistic, 181
minimum mean-squared error, see MMSE
Minkowski determinant inequality, 274
mixing matrix, 81
ML estimation, 156
  of covariance matrices, 48
MLR, see half-canonical correlations
MMSE estimation, 119–121
modulation, 8–10
monomial matrix, 81
Moore–Penrose inverse, 275
Mueller calculus, 214
multivariate association, 85–108
multivariate linear regression, see half-canonical correlations
MVDR estimation, see LMVDR estimation
Neyman–Pearson lemma (detector), 179
noncircular, 53
  strict-sense, see maximally improper
noncoherent matched subspace detector, 190–194

nonstationary process, 223–248
  analytic, 235, 245
  detection of, 230
  discrete time, 236
  estimation of, 227–229
  spectral representation, 230–237
normal, see Gaussian
normal matrix, 270
Nth-order circular, 54, 221
Nth-order proper, see Nth-order circular
orthogonal increments, 199
orthogonality principle, 120
PAM spectrum, 255–258
partial least squares, 93–97, 101
partial ordering
  majorization, 289
  matrices, 273
PCA, see principal components
pdf, see probability density function
pdf generator (elliptical), 44
performance bounds, 151–175
  Bayesian, 170–174
  deterministic, 157–170
  frequentist, 157–170
  stochastic, 170–174
performance comparison
  between WL and linear detection, 186–188, 230
  between WL and LMMSE estimation, 131, 229
periodically correlated, see cyclostationary
periodogram (cyclic), 255
phase splitter, 11
phasor, 5
PLS, 93–97, 101
polarization, 6–8, 211–215, 245
  circular, 7, 206, 246
  degree of, 109, 211–215, 245
  ellipse, 6, 23, 206–211, 242–247
  left circular, 7, 208
  linear, 6, 208
  right circular, 7, 208
positive definite matrix, 272
positive semidefinite matrix, 272
power
  detection, 179
  distribution (time–frequency), 237
  random process, 56
  random vector, 37
power signal, 234
power spectral density, see PSD
preordering (majorization), 289
principal components, 63
principal domain, 218

probability density function, 38–47
  Cauchy, 46
  elliptical, 44–47
  Gaussian, 19–23, 39–44
  Rayleigh, 44
  t, 46
  Wishart, 47
probability distribution function, 38
projection, 276
proper, 35
  complex baseband signal, 56
  jointly, 41
  Nth-order, see Nth-order circular
  random process, 56
PSD, 55, 197
  connection with ESD, 233
  digital modulation schemes, 255–258
  nonstationary process, 234
pseudo-covariance, see complementary covariance
pseudo-inverse, 275
PSK spectrum, 255–258
QAM spectrum, 255–258
QPSK spectrum, 257
quadratic form, 34, 145, 284
quadrature component, 9
quadrature modulation, 9
rank reduction, 64, 132–137
Rayleigh distribution, 44
Rayleigh resolution limit, 167
receiver operating characteristic, 181
rectilinear, see maximally improper
reduced-rank estimation, 132–137
redundancy index, 106
reflectional correlation coefficient, 87–91, 94–97
reflectional cyclic spectral coherence, 254
relation matrix, see complementary covariance
right-circularly polarized, 7, 208
Rihaczek distribution, 232, 237–242
  estimation of, 240–242
ROC curve, 181
rotary components
  nonstationary, 243
  WSS, 205–211
rotating phasor, 5
rotational correlation coefficient, 87–91, 94–97
rotational cyclic spectral coherence, 254
Schur complement, 270
Schur-concave, 289
Schur-convex, 289
score function, 154
  Fisher, 162–170
  Fisher–Bayes, 171–174
second-order stationary, 55

sensitivity matrix, 158
  augmented, 161
  complementary, 161
separating matrix, 81
single-sideband modulation, 15
singular value decomposition, see SVD
size, 179
spectral correlation, 230–237
  aliasing, 236
  analytic signal, 235
  cyclostationary process, 253
  higher-order, 248
spectral decomposition, see EVD
spectral factorization, 204
spectral process, see spectral representation
spectral representation
  aliasing, 236
  cyclostationary process, 251–253
  nonstationary process, 230–237
  WSS process, 197–200
spectrum analyzer, 143
spherical pdf contours, 54
spherically distributed, 54
spread (majorization), 288
SSB modulation, 15
stationary manifold, 233–237, 248, 253
Stokes parameters, 213–215
strong uncorrelating transform, 65–69, 82
subspace identification, 142
sufficient statistic, 181
  for covariance matrices, 48
SUT, see strong uncorrelating transform
SVD, 271
symmetric function, 290
t-distribution, 46
Takagi factorization, 65–69
test
  invariant, 189
  statistic, 178
test for
  common mean and uncommon covariances, 185
  correlation structure, 110–114
  impropriety, 77–81
  independence, 113

  sphericity, 112
  uncommon means and common covariance, 183–185
threshold detector, 178
time–frequency coherence, 238, 244, 247
time–frequency distribution, 232
total correlation coefficient, 87–91, 94–97
transform coder, 64, 293
tricorrelation, 217
trispectrum, 217
underdamped oscillator, 5
uniformly most powerful, 188
weak majorization, 288
wide-sense stationary, see WSS
widely linear, 25, 32
  estimation of cyclostationary process, see cyclic Wiener filter
  estimation of nonstationary process, 227–229
  estimation of WSS process, see Wiener filter
  MMSE estimation, 129–132
  MVDR estimation, 143
  reduced-rank estimation, 132–137
  shift-invariant filtering, see WLSI filtering
  time-invariant filtering, see WLSI filtering
widely linear–quadratic estimation, 144–149
widely unitary, 63
Wiener filter
  causal, 203–205
  causal cyclic, 262–268
  cyclic, 260–268
  noncausal, 202
Wiener–Khinchin relation, 233
Wigner’s theorem, 238
Wigner–Ville distribution, 238
Wirtinger calculus, 277–286
Wirtinger derivative, 279
Wishart distribution, 47
WL, see widely linear
WLQ estimation, 144–149
WLSI filtering, 57, 200
Woodbury identity, 275
WSS process, 55, 197–222
  analytic, 201, 218–221
  spectral representation, 197–200
