Statistical Methods in HYDROLOGY-Haan

March 28, 2018 | Author: 3mmahmoodea | Category: Autoregressive Integrated Moving Average, Statistical Analysis, Probability Theory, Statistical Theory, Physics & Mathematics


Statistical Methods in HYDROLOGY

Second Edition

CHARLES T. HAAN

Iowa State Press A Blackwell Publishing Company

CHARLES T. HAAN is Regents Professor and Sarkeys Distinguished Professor, Emeritus, from the Department of Biosystems and Agricultural Engineering, Oklahoma State University, Stillwater.

© 1974 Iowa State University Press
© 2002 Iowa State Press

A Blackwell Publishing Company
All rights reserved

Iowa State Press
2121 State Avenue, Ames, Iowa 50014

Orders: 1-800-862-6657
Office: 1-515-292-0140
Fax: 1-515-292-3348
Web site: www.iowastatepress.com

Authorization to photocopy items for internal or personal use, or the internal or personal use of specific clients, is granted by Iowa State Press, provided that the base fee of $.10 per copy is paid directly to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923. For those organizations that have been granted a photocopy license by CCC, a separate system of payments has been arranged. The fee code for users of the Transactional Reporting Service is 0-8138-1503-7/2002 $.10.

Printed on acid-free paper in the United States of America

First edition, 1974
Second edition, 2002

Library of Congress Cataloging-in-Publication Data

Haan, C. T. (Charles Thomas)
Statistical methods in hydrology / Charles T. Haan. 2nd ed.
p. cm.
Includes bibliographical references and index.
ISBN 0-8138-1503-7 (acid-free paper)
1. Hydrology-Statistical methods. I. Title.
GB656.2.S7 H3 2002
551.48'07'27-dc21
2002000060

The last digit is the print number: 9 8 7 6 5 4 3 2 1

I dedicate this book once again to my wife, Janice, who has been my constant companion, friend, helpmate, and source of encouragement for the past 34 years.

Secondly, I dedicate the book to my two daughters, Patti and Pam, and to my son Chris, his wife Rie, and their two children, Katrina and Daniel.

Thirdly, I dedicate the book to my parents, Charles and Dorothy, who gave me a start in life and taught me many of the values I hold dear.

Finally, the book is dedicated to the many graduate students that I have worked with. They have been a constant source of renewal, challenge, inspiration, and joy.

Contents

PREFACE TO SECOND EDITION
PREFACE TO FIRST EDITION
ACKNOWLEDGMENTS FOR THE SECOND EDITION
ACKNOWLEDGMENTS FOR THE FIRST EDITION

1 INTRODUCTION
    Hydrologic data

2 PROBABILITY AND PROBABILITY DISTRIBUTIONS - BASIC CONCEPTS
    Probability
    Total probability theorem
    Bayes theorem
    Counting
    Graphical presentation
    Random variables
    Univariate probability distributions
    Bivariate distributions
    Marginal distributions
    Conditional distributions
    Independence
    Derived distributions
    Mixed distributions
    Exercises

3 PROPERTIES OF RANDOM VARIABLES
    Moments and expectation - univariate distributions
    Measures of central tendency
        Arithmetic mean
        Geometric mean
        Median
        Mode
        Weighted mean
    Measures of dispersion
        Range
        Variance
    Measures of symmetry
    Measures of peakedness
    Moments and expectation - jointly distributed random variables
        Covariance
        Correlation coefficient
        Further properties of moments
        Sample moments
    Probability-weighted moments and L-moments
    Parameter estimation
        Unbiasedness
        Consistency
        Efficiency
        Sufficiency
        Method of moments
        Maximum likelihood
    Chebyshev inequality
    Law of large numbers
    Exercises

4 SOME DISCRETE PROBABILITY DISTRIBUTIONS AND THEIR APPLICATIONS
    Hypergeometric distribution
    Bernoulli processes
        Binomial distribution
        Geometric distribution
        Negative binomial distribution
        Summary of Bernoulli process
    Poisson process
        Poisson distribution
        Exponential distribution
        Gamma distribution
        Summary of Poisson process
    Multinomial distribution
    Exercises

5 NORMAL DISTRIBUTION
    General normal distribution
    Reproductive properties
    Standard normal distribution
    Approximations for standard normal distribution
    Central limit theorem
    Constructing pdf curves for data
    Normal approximations for other distributions
        Binomial distribution
        Negative binomial distribution
        Poisson distribution
        Continuous distributions
    Exercises

6 CONTINUOUS PROBABILITY DISTRIBUTIONS
    Uniform distribution
    Triangular distribution
    Exponential distribution
    Gamma distribution
    Lognormal distribution
    Extreme value distributions
        Extreme Value Type I
        Extreme Value Type III Minimum (Weibull)
        Discussion
    Generalized extreme value distribution
    Beta distribution
    Pearson distributions
    Some important distributions of sample statistics
        Chi-square distribution
        The t distribution
        The F distribution
    Transformations
    Exercises

7 FREQUENCY ANALYSIS
    Probability plotting
        Historical data
        Outliers
    Analytical hydrologic frequency analysis
        Normal distribution
        Lognormal distribution
        Log Pearson type III distribution
        Extreme value type I distribution (Gumbel distribution)
        Other distributions
    General considerations
    Confidence intervals
    Treatment of zeros
    Truncation of low flows
    Use of paleohydrologic data
    Probable maximum flood
    Discussion of flood frequency determinations
    Regional frequency analysis
        Delineation of homogeneous regions
        Historical development
        Statistical methods
        Frequency distributions
        Regression-based procedures
        Index-flood method
        Regional index-flood relationship
        Regionalization using L-moments and the GEV distribution
        Regionalization using modeling
    Frequency analysis of precipitation data
    Frequency analysis of other hydrologic variables
    Exercises

8 CONFIDENCE INTERVALS AND HYPOTHESIS TESTING
    Confidence intervals
        Mean of a normal distribution
        Variance of a normal distribution
        One-sided confidence intervals
        Parameters of probability distributions
    Hypothesis testing
        $H_0$: $\mu = \mu_1$, $H_a$: $\mu = \mu_2$, normal distribution, known variance
        $H_0$: $\mu = \mu_1$, $H_a$: $\mu = \mu_2$, normal distribution, unknown variance
        $H_0$: $\mu = \mu_0$, $H_a$: $\mu \neq \mu_0$, normal distribution, known variance
        $H_0$: $\mu = \mu_0$, $H_a$: $\mu \neq \mu_0$, normal distribution, unknown variance
        Test for differences in means of two normal distributions
        Test of $H_0$: $\sigma^2 = \sigma_0^2$ versus $H_a$: $\sigma^2 \neq \sigma_0^2$, normal population
        Test of $H_0$: $\sigma_1^2 = \sigma_2^2$ versus $H_a$: $\sigma_1^2 \neq \sigma_2^2$ for two normal populations
        Test for equality of variances from several normal distributions
    Testing the goodness of fit of data to probability distributions
        Chi-square goodness of fit test
        Distributional tests based on cumulative distributions
        Comparing two empirical distributions
        General comments on goodness of fit tests
    Exercises

9 SIMPLE LINEAR REGRESSION
    Simple regression
    Evaluating the regression
    Confidence intervals and tests of hypotheses
        Inferences on regression coefficients
        Confidence intervals on regression line
        Confidence intervals on standard error
    Extrapolation
    General considerations
    Exercises

10 MULTIPLE LINEAR REGRESSION
    Notation
    General linear model
    Confidence intervals and tests of hypotheses
        Confidence intervals on standard error
        Inferences on the regression coefficients
        Confidence intervals on the regression line
        Other inferences in regression
    Which line is best
    Extrapolation
    Autocorrelated errors
        Testing for serial correlation
        Corrective action
    Multicolinearity
        Detection of multicolinearity
    An application of multiple regression
    Transforming linear models
    Indicator variables in regression
    General comments
    Logistic regression
    Exercises

11 CORRELATION
    Inferences about population correlation coefficients
    Serial correlation
    Correlation and regional analysis
    Correlation and cause and effect
    Spurious correlation
    Exercises

12 MULTIVARIATE ANALYSIS
    Notation
    Principal components
    Regression on principal components
    Multivariate multiple regression
    Canonical correlation
    Cluster analysis
    Exercises

13 DATA GENERATION
    Univariate data generation
    Multivariate data generation
        Multivariate, correlated, normal random variables
        Multivariate, correlated, nonnormal random variables
    Applications of data generation
    Exercises

14 ANALYSIS OF HYDROLOGIC TIME SERIES
    Definitions
    Trend analysis
    Jumps
    Autocorrelation
    Periodicity
    Autoregressive integrated moving average models (ARIMA)
        Moving Average Processes (MA)
        Autoregressive processes
        Autoregressive Moving Average Models ARMA (p, q)
        Autoregressive Integrated Moving Average ARIMA (p, d, q)
        Estimate of noise variance $\sigma^2$
    Parameter estimation via least squares
        AR models
        MA models
    Parameter estimation via maximum likelihood
    Exercises

15 SOME STOCHASTIC HYDROLOGIC MODELS
    Purely random stochastic models
    First-order Markov process
    First-order Markov process with periodicity
    Higher-order autoregressive models
    Markov chain models
    Exercises

16 PROBABILISTIC METHODS FOR UNCERTAINTY, RISK, AND RELIABILITY ANALYSIS
    Sensitivity analysis
        Traditional or local sensitivity analysis
        Global sensitivity analysis
        Uncertainty analysis
    Reliability and risk analysis
        Uncertainty, risk, and reliability analysis methods
        First-order approximation method
        Simplified FOA estimates for some functional forms
        Monte Carlo simulation
        Corrected FOA method
        Correcting FOA mean and variance estimates of an individual function
        Second-order approximation method
    First-order reliability method
    Generic expectation functions
    Other methods
        Second-order reliability methods
        Point estimation methods
        Transform methods

17 GEOSTATISTICS
    Descriptive statistics
    Semivariogram models
    Combination semivariogram models
    Estimation
    An example
    Anisotropy
    Cokriging
    Local and global estimation
        Polygon declustering
        Cell declustering
        Point kriging
        Block kriging
    Estimation of cumulative distributions
    Uncertainty
    Modeling using geostatistics

APPENDIXES
    A.1. Common distributions
    Hydrologic data
        A.2. Monthly runoff (in.), Cave Creek near Fort Spring, Kentucky
        A.3. Peak discharge (cfs), Cumberland River at Cumberland Falls, Kentucky
        A.4. Peak discharge (cfs), Piscataquis River, Dover-Foxcroft, Maine
        A.5. Total Precipitation (in.) for week of March 1 to March 7, Ashland, Kentucky
        A.6. Flow and sediment load, Green River at Munfordville, Kentucky
        A.7. Streamflow (in.), Walnut Gulch near Tombstone, Arizona
        A.8. Monthly Rainfall (in.), Walnut Gulch near Tombstone, Arizona
        A.9. Annual discharge (cfs), Spray River, Banff, Canada
        A.10. Annual discharge (cfs), Piscataquis River, Dover-Foxcroft, Maine
        A.11. Annual discharge (cfs), Llano River, Junction, Texas
    Statistical tables
        A.12. Standard normal distribution
        A.13. Percentile values for the t distribution
        A.14. Percentile values for the chi square distribution
        A.15. Percentile values for the F distribution
        A.16. Critical values for the Kolmogorov-Smirnov test statistic
        A.17. Durbin-Watson test bounds

BIBLIOGRAPHY

INDEX

Preface to the Second Edition

SINCE THE publication of the first edition of this book, statistics has come to play an increasingly important role in hydrology. Advances in computing technology and data management have made the application of statistical techniques that were previously known but difficult to implement almost routine. User-friendly software for personal computers has made powerful statistical routines available to nearly all hydrologists. Generally, this software comes with user manuals or help files that lead a new user through the steps needed to use the programs. Unfortunately, these aids rarely indicate the assumptions inherent in the techniques, the limitations of the techniques, and the situations in which the techniques should or should not be used. They are generally weak in instructing one on the interpretation of the results of the analysis as well. This software is a tool that is available for use in hydrology, but it does not replace sound hydrologic understanding of the problem at hand, nor does it replace a basic understanding of the statistical technique being used.

This current edition should serve as a companion to many of the software programs available, not to explain how to use the software, but to provide guidance as to the proper routines to use for a particular problem and the interpretation of the results of the analysis. The basic philosophy of the current edition is the same as that of the first edition. Enough detail on particular statistical methods is presented to gain a working understanding of the technique. Certainly the treatment of any particular statistical technique is not exhaustive. Much theory and derivation are omitted and left to more in-depth treatments found in books dealing specifically with the various topics.

Two chapters have been added to the book. One of these chapters deals with uncertainty analysis and the other with geostatistics. Both of these topics have received great emphasis in


the past decade. Uncertainty analysis is a growing concern as it is increasingly recognized that both statistical and deterministic analyses result in estimates that are far from absolute answers. Increasingly, attempts are made to evaluate how much uncertainty should be associated with various types of analyses. Rather than providing a point estimate of some quantity, confidence limits are sought, such that one can assert with various degrees of confidence the bounds within which the sought-after quantity is thought to lie. Geostatistics has become of increasing importance as geographically referenced information becomes available and is used in geographical information systems (GISs) to produce hydrologic estimates.

The chapter on uncertainty was written by Aditya Tyagi, a former PhD candidate at Oklahoma State University and currently a water resources engineer with CH2M Hill. Jason Vogel, a research engineer and PhD candidate at Oklahoma State University, was a coauthor of the chapter on geostatistics.

Preface to the First Edition

THE RANDOM variability of such hydrologic variables as streamflow and precipitation has been recognized for centuries. The general field of hydrology was one of the first areas of science and engineering to use statistical concepts in an effort to analyze natural phenomena. Many papers have been published that amply demonstrate the value of statistical tools in analyzing and solving hydrologic problems. In spite of the long history and proven utility of statistical techniques in hydrology, relatively few comprehensive and basic treatments of statistical methods in hydrology have been published.

This book has been prepared to assist engineers and hydrologists in developing an elementary knowledge of some statistical tools that have been successfully applied to hydrologic problems. The intent of the book is to familiarize the reader with various statistical techniques, point out their strengths and weaknesses, and demonstrate their usefulness. The serious reader will want to supplement the material with formal courses or independent study of those individual topics that are major interests. No single topic has been developed completely. Books have been written covering many of the topics discussed as single chapters in this presentation. Again the purpose here is to develop understanding and illustrate the usefulness of the techniques. Most of the techniques are discussed in sufficient detail for a thorough understanding and application to problem situations. The philosophy of the presentation has been that one does not have to understand hydrodynamics to swim even though it could help one to become a more proficient swimmer.

The book has not been written for statisticians or for those primarily interested in statistical theory. Rather it has been prepared for hydrologists and engineers interested in learning how statistical models and methods can be valuable tools in the analysis and solution of many hydrologic and engineering problems.
The basic premise has been taken (and justifiably so) that


statisticians are competent so that many statistical results are presented without developing a rigorous proof of their validity. Proofs for most results can be found in mathematical statistics books, many of which are listed in the bibliography. No prior knowledge of statistics is required if one starts with Chapter 2. Those with varying degrees of statistical knowledge may choose to start with later chapters. A knowledge of calculus is required throughout and some familiarity with matrices is needed for material in later chapters. Appendix D is a review of the basic matrix manipulation used in the book (not in this new edition).

This is not a statistical "cookbook" for hydrologists. It does not contain step-by-step calculation procedures for "standard" hydrologic problems. Basic statistical concepts are discussed and illustrated in enough detail so that one can develop his own computational procedures or methods. Most of the computations in actual work situations would be done on digital computers. Computer programs have not been included because it is felt that most computer centers will have programs or programmers available. Likewise computational techniques are not emphasized. For example, in the chapter on multiple regression, efficient techniques for matrix inversion are not presented as it is felt that these techniques are readily available at most computer centers. The emphasis is thus retained on the statistical technique being used and not on the computational aspects of the problem.

Some liberties have been taken in that many terms are not precisely defined in a mathematical sense unless such a definition is warranted. Where terms are loosely defined, it is hoped that the meticulous reader will accept the general connotation of the terms for purposes of simplicity and to avoid placing emphasis on terms rather than concepts. Many of the problems require sets of data. Those data may be supplied by the reader or selected from the data in Appendix C.
I am grateful to the Literary Executor of the late Sir Ronald A. Fisher, F.R.S., to Dr. Frank Yates, F.R.S., and to Longman Group Ltd., London, for permission to reprint Table E.5 from their book Statistical Tables for Biological, Agricultural and Medical Research, 6th Edition (1974) (not in this new edition).

Acknowledgments for the Second Edition IT HAS been nearly a quarter century since I wrote the first edition of this book. During that time I have become indebted to many people. I have spent nearly this entire period with the Biosystems and Agricultural Engineering Department at Oklahoma State University. This Department has provided a wonderful atmosphere for intellectual growth and accomplishment. The faculty, staff, and students that I have been associated with have helped to create a working environment that was challenging, friendly, and one in which my only limitation was myself. I am grateful to many individuals. Bill Barfield has continued to be a valued friend and coworker. Dan Storm, Bruce Wilson, and many graduate students have been especially instrumental in much of my research and teaching in the field of statistical hydrology. My daughter, Dr. Patricia Haan, assistant professor in the Biological and Agricultural Engineering Department at Texas A&M University, has been very helpful in clarifying some points in the text and correcting errors. Certainly my wife of 34 years, Jan, has been most supportive and forgiving as I have devoted far too much time to work. As is true of all of us, I owe whatever I have accomplished to my Creator without Whom I could accomplish nothing.

Acknowledgments for the First Edition

MUCH OF the material presented in this book was developed for a course taught to students in the Agricultural Engineering and Civil Engineering Departments at the University of Kentucky. The suggestions and clarifications made by the students in this course over the past 8 years have been a great aid in attempting to make this book more understandable. Special acknowledgment must be given to Dan Carey for his careful readings of the entire manuscript. These readings resulted in several corrections and clarifications. Several individuals have read parts of the book and made valuable suggestions for its improvement. Among those reviewing parts of the manuscript were Donn DeCoursey, David Allen, David Culver, and personnel of the U.S. Soil Conservation Service under the direction of Neil Bogner. Several individuals in the Agricultural Engineering Department at the University of Kentucky offered valuable suggestions and considerable encouragement. Deserving special mention are Billy Barfield, Blaine Parker, and John Walker.

This undertaking has required sacrifice on the part of my family and especially my wife Janice. She not only typed the early drafts of the book but offered continued encouragement over the years as work and revisions were done on the book. This manuscript was reproduced from photo-ready copy. The excellent typing involved in preparing this final draft as well as an earlier draft was done by Pat Owens. Buren Plaster drafted all of the figures.

Of course any failings and shortcomings of this book must be credited to me. My hope is that it will be found useful in at least partially meeting the need for an elementary treatment of statistical methods in hydrology. Whatever is accomplished along these lines I owe to our Father for giving me the will to see this project through and the ability to withstand the setbacks experienced along the way.
Finally I express my appreciation to all of the members of the Agricultural Engineering Department at the University of Kentucky for their understanding during the preparation of this manuscript.


1. Introduction

MORE THAN 25 years ago I set about writing a book on the application of statistical techniques to hydrology. That book, published in 1977, became the first edition of this current work and was appropriately titled Statistical Methods in Hydrology. Although soundly criticized for producing a book of the general type "Statistics for ___," that was little more than a "relevant Schaum's Outline series" on statistics with a little hydrology thrown in (Burges 1978), the book has had a very wide reception, has gone through several printings, and has been widely quoted in the literature. However, as I have reflected on this critique over the years, and as I have used statistics to address problems in hydrology and observed others doing the same, I have come to the conclusion that this critique contained a large element of truth.

There is no shortage of very fine books at many levels of complexity on statistics. The theory of statistical procedures and the assumptions in statistical procedures are well explained and widely available. The same statistical techniques might be applied to hydrologic data or to the comparison of the value of the Japanese yen to the U.S. dollar. Statistical techniques are based in mathematics and probability. The units attached to the data being studied are immaterial from a statistical standpoint. What is important is the degree to which the data agree with the assumptions inherent in the statistical procedure being applied.

Similarly, there are many books on hydrology. Some of these books are quite general, some are quite theoretical, some are quite empirical, and none are really exhaustive. The problem with hydrology is that it is, in practice, very messy. For example, we can present in great detail the mathematical development of equations describing the overland flow of water on planes of various types and how flow profiles develop and how runoff hydrographs result at the lower end of these planes.
There exist very elegant solutions for these problems, albeit often numerical


procedures are required to arrive at these solutions. With rapid advances in computing technology, this presents a rapidly diminishing problem. The real problem as I see it is that we have developed an elegant solution to a nonexistent problem. In my lifetime I have observed many rainfall-runoff events and have rarely seen the type of flow described above except in artificial situations such as parts of parking lots or streets covering a tiny fraction of a drainage basin. If there is any overland flow, before it goes very far flow concentration develops and the overland flow "planes" become very nonuniform.

Does that mean it is wrong to develop and present these idealized equations? Does that mean it is wrong to use models that contain these equations to develop runoff hydrographs? NO! It simply means that one must be aware of the relationships between the mathematics of the model and the actual hydrology that is occurring. Through proper selection of roughness coefficients and other coefficients in such models, good estimates of runoff hydrographs may result. Yet that does not mean that the model actually describes in exact detail the hydrologic processes that are occurring. We must not confuse actual hydrologic processes with models of these processes.

On numerous occasions I have seen those practicing hydrology confusing hydrologic models with actual hydrologic systems. The complexity, the nonhomogeneity, and the dynamic nature of actual hydrologic systems are not recognized. The uncertainty inherent in parameters used by hydrologic models to particularize the model to a specific catchment or hydrologic problem is not recognized. The numbers produced by the model are taken as the true hydrologic response of the actual hydrologic system. More disturbing, the algorithms that make up the model are taken as true and exact representations of the hydrologic systems they purport to represent.
Quite likely the one using the hydrologic model has great skills in modeling and in computers but little understanding of the complexity of hydrologic systems.

At this point one might be wondering why I have jumped on mathematical models when this book is about statistics. The answer lies in my experience over the years that statistical methods are often criticized for not being physically based and not representing what is actually occurring in the field. Yet all hydrologic models, not just statistical models, are susceptible to this criticism. Statistical models are often applied just as mathematical models are, with little regard to the assumptions in the models. Some take model results as truth, especially if the statistical or mathematical technique is complex. Others will reject model results on the basis that not all assumptions are met. So basically, in hydrology, we face the same dilemmas whether we use mathematical or statistical models. No model describes the actual and complete hydrology of anything but the simplest of settings. Regardless of what approach we use toward solving an actual hydrologic problem, compromise must be made with the methodology employed. One can never turn professional judgment over to any particular hydrologic model whether the model is mathematical, statistical, or some combination of the two. Any model must be seen as an aid to judgment and not as a replacement for it.

There are no completely theoretical models and no completely statistical models. All models have components of both theory and statistics. Both are techniques for quantifying our understanding and our observations of hydrologic processes. The presence of theory or statistics may not be a formal presence, but it is there. This leads to the conclusion that all models have


statistical components to some degree. Any constants that are estimated based on observations, even observations formalized into tables like Manning's n values, have been determined by formal or informal application of statistics.

Any statistical model should be formulated based on some understanding of the system being modeled. This understanding may be brought into the model through a conceptual structure of the model. These conceptual components are what bring hydrology into the model as opposed to having a purely statistical model. In my view, one should not ignore hydrology when developing models for use in hydrology no matter how sophisticated the statistical techniques that are being used. To the extent that hydrologic knowledge is used in structuring a statistical model, the model may be said to contain conceptual components. Statistical models should not be developed by simply throwing data on every conceivable variable into some computerized statistical routine and hoping for the best.

As far as the hydrologist is concerned, statistics is not an end in itself. Statistics is a tool that may help one to understand hydrological data. It must be kept foremost in mind that, to hydrology, statistics is a tool. It must also be kept in mind that statistics is just one of several tools available for application in hydrology. Hydrologic processes are not driven by principles of statistics but by physical, chemical, and biological principles, the so-called "Laws of Nature". Often the hydrologic setting is of such complexity that the underlying component hydrologic processes cannot be expressed in such a way as to yield a suitable computational framework for describing the system. Perhaps the mix of surface soil properties, land uses, topography, and so forth is such that the setting of a particular hydrologic problem cannot be adequately described. Perhaps the complexity and heterogeneity of the system are such as to preclude deterministic modeling.
Perhaps data are available on a response variable such as stream flow, water quality, or ground water level, but not on the causative variables of rainfall, evaporation, infiltration, and so on. In such a case statistical techniques may be needed in an effort to uncover descriptive behavioral relationships among the data. Such relationships are not cause-effect relationships but descriptive relationships. The relationships may support hypotheses concerning cause and effect but do not conclusively establish such relationships. Over the past 20 years I have seen many inappropriate applications of statistics in hydrology. I have seen hydrologists stake their reputation as hydrologists on statements made based on poor knowledge of statistics. I have also seen statisticians make far-reaching conclusions with a very elementary knowledge of hydrology; here the argument goes "the data show . . .". The data are separated from their hydrologic reality and analyzed as pure numbers! One thing that has compounded the problem of inappropriate use of statistics in hydrology (or any other field, I suspect) is the ready availability of powerful statistical software that is easy to use. I applaud the availability of this software but shudder at some of the applications that are made with it. Sometimes a statistical procedure is improperly applied or applied in inappropriate circumstances. The numbers generated by a statistical analysis are then venerated as absolute truth. It would be better to apply a technique recognizing and admitting its shortcomings and then using the results as a guide rather than religiously adopting the results and claiming they represent reality. This long introduction has been composed to impart some of my hydrologic-statistical modeling philosophy and to alert the reader that this book will emphasize the assumptions

inherent in statistical techniques and the consequences of violating these assumptions. Statistical techniques will be explained at the practical level without many derivations and proofs. References to these will be given. The book will be most useful to someone having at least an elementary knowledge of mathematical statistics and hydrology. This book addresses the interface of these two disciplines. The question naturally arises as to what is meant by hydrology in this book. Hydrology broadly defined is the study of water. The Federal Council for Science and Technology (1962) defined hydrology as the science that treats of the waters of the Earth, their occurrence, circulation, and distribution, their chemical and physical properties, and their reaction with their environment, including their relation to living things. The domain of hydrology embraces the full life history of water on the earth. This definition is more or less used in this book. The definition is broad and includes topics some may consider to be more proper to geology, engineering, environmental science, biology, chemistry, paleontology, or some other science. Some may even feel it includes aspects which are nonscientific. By using this definition, when the word "hydrology" is used, it includes these other areas as well. Statistics will be considered in a limited sense in the context of this book. Statistics will be defined as a science devoted to developing an understanding of a system that can be used to make inferences about the system based on observation relative to that system. Models are often used in developing this understanding and in making inferences. Model is a general term that will be taken to mean a collection of physical laws and empirical observations written in mathematical terms and combined in such a way as to produce estimates based on a set of known and/or assumed conditions. 
There are many ways of collecting physical laws and empirical observations and of combining them to produce a model. Models can generally be represented as

$$O = f(I, P) + e$$

where $O$ represents the outputs or quantities to be estimated; $f(\ldots)$ represents the mathematical structure of the model; $I$ represents inputs to the model, boundary conditions, and initial conditions; $P$ represents parameters that help particularize the model to a specific situation; and $e$ represents differences between what actually occurs, $O$, and what the model predicts, $\hat{O}$.
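The roles of outputs, model structure, inputs, parameters, and errors described above can be made concrete with a small sketch. Everything specific here is an assumption made for illustration: the function `runoff_model`, its two parameters (a runoff coefficient and an initial loss), and the rainfall and runoff numbers are hypothetical and are not taken from the book.

```python
def runoff_model(rainfall, params):
    """Hypothetical model structure f(I, P): runoff = c * max(rain - loss, 0)."""
    c = params["runoff_coefficient"]
    loss = params["initial_loss"]
    return [c * max(r - loss, 0.0) for r in rainfall]


def residuals(observed, predicted):
    """Errors e: what actually occurred minus what the model predicts."""
    return [o - p for o, p in zip(observed, predicted)]


# Illustrative inputs I and observed outputs O (made-up values, in inches).
rainfall_in = [0.5, 1.2, 2.0, 0.1]
observed_runoff_in = [0.05, 0.40, 0.85, 0.00]

# Parameters P that particularize the model to one (hypothetical) catchment.
params = {"runoff_coefficient": 0.5, "initial_loss": 0.3}

predicted = runoff_model(rainfall_in, params)
errors = residuals(observed_runoff_in, predicted)

for rain, obs, pred, err in zip(rainfall_in, observed_runoff_in, predicted, errors):
    print(f"rain={rain:4.1f}  observed={obs:5.2f}  predicted={pred:5.2f}  error={err:+.2f}")
```

Changing `params` changes every prediction and every error, which is one way to see why the choice of parameter values is itself part of the modeling problem.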


There are many ways of classifying models. Some people draw sharp distinctions between statistical models and other models. In practice one cannot do a thorough modeling exercise without drawing on statistics in some way. Often some type of statistical work has to be done to come up with values for parameters for a model that might otherwise be considered a nonstatistical model. Thus, the parameters of the model become some function of observations. If another set of observations were used presumably different parameter values would result. Since observations (data) in hydrology are generally thought of as random variables and any function of a random variable is a random variable, the parameters for the model effectively become random variables and thus a statistical element enters a model that might otherwise not be considered as a statistical model. Broadly speaking, quantitative hydrologic models fall on a continuous spectrum of model "types" ranging from completely deterministic on the one hand to completely stochastic on the other. A completely deterministic model would be one arrived at through consideration of the underlying physical relationships and would require no experimental data for its application. Statistical models range in complexity from estimating the most likely outcome or result of an experiment to describing in detail a sequence (time series) of outcomes that mimic actual outcomes. All statistical approaches rely on observations. The mathematical techniques used to extract the information contained in the observations may be as simple as computing an average or so complex as to require thousands of stochastic simulations. Most hydrologic models fall somewhere between the extremities of this model spectrum. Often such models are termed parametric models. A parametric model may be thought of as deterministic in the sense that once model parameters are determined, the model always produces the same output from a given input. 
On the other hand, a parametric model is stochastic in the sense that parameter estimates depend on observed data and will change as the observed data change. A stochastic model is one whose outputs are predictable only in a probabilistic sense. With a stochastic model, repeated use of a given set of model inputs produces outputs that are not the same but follow certain statistical patterns. A statistical model is one arrived at by applying statistical methods to a set of data to produce an estimation procedure. Multiple regression models are examples of statistical models. In this sense, all stochastic models are statistical models but not all statistical models are stochastic models.

No matter how simple the hydrologic system or how complex the hydrologic model, the model is always an approximation to the system. There are no hydrologic models (deterministic, stochastic, or combined) that represent exactly anything but the most trivial of hydrologic systems. The digital computer has made possible great advances in all types of hydrologic models. These advancements are noteworthy for both stochastic and deterministic models and have led some hydrologists to vigorously adopt the philosophy that all hydrologic problems should be attacked stochastically and others the philosophy that they should be attacked deterministically. The purpose of this book is not to promote statistical or stochastic models but to present some basic statistical concepts that have been found useful as aids for the solution of hydrologic problems.

Many hydrologic problems can best be solved through the joint application of the various modeling methods. For instance, it may be possible to adequately predict the runoff hydrograph


from a simple watershed deterministically given the rainfall input. It is unlikely, however, that rainfalls that will occur during the life of a water resources project will be deterministically predictable. Thus, one approach to project evaluation would be a stochastic simulation of rainfall, deterministic conversion of the rainfall to streamflow, and a statistical analysis of the resulting streamflows. Regardless of the type of model that is used, model parameters must be determined in some way from observed hydrologic data. The validity and applicability of a model depend directly on the characteristics of the data used to estimate model parameters. A model can be no better than the data available for parameter estimation. The data used for parameter estimation must be representative of the situation in which the model is going to be used. Obviously, if one is attempting to model streamflow from an urban area, model parameters cannot be estimated from forested watersheds. Similarly, future hydrologic behavior of a watershed can be modeled based on past observations only if available historical data are representative of future conditions. If drastic land use changes are to be made, then the model parameters must be adjusted accordingly. All techniques used for hydrologic analysis rely on assumptions. Often the strict validity of the analysis depends on how well the true system meets these assumptions. This is certainly true of statistical models and statistical methods applied to hydrologic systems. There are no statistical procedures whose assumptions exactly match particular hydrologic systems. Likewise there are no hydrologic systems that exactly meet the assumptions made in any particular hydrologic model. With this in mind one is forced to the conclusion that models cannot yield an exact solution to any realistic hydrologic problem. 
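The three-stage approach mentioned above (stochastic simulation of rainfall, deterministic conversion to streamflow, statistical analysis of the result) can be sketched in a few lines. This is only an illustration under stated assumptions: the truncated normal rainfall model, the runoff-coefficient conversion, and all numeric values are hypothetical choices, not methods or data from the book.

```python
import random
import statistics


def simulate_annual_rainfall(n_years, mean_in, sd_in, rng):
    """Stochastic step: draw annual rainfall (assumed normal, truncated at zero)."""
    return [max(rng.gauss(mean_in, sd_in), 0.0) for _ in range(n_years)]


def rainfall_to_runoff(rainfall_in, runoff_coefficient, loss_in):
    """Deterministic step: the same rainfall always gives the same runoff."""
    return [runoff_coefficient * max(r - loss_in, 0.0) for r in rainfall_in]


def summarize(values):
    """Statistical step: simple descriptive statistics of the simulated flows."""
    return {
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "max": max(values),
    }


rng = random.Random(42)  # fixed seed so the sketch is repeatable
rain = simulate_annual_rainfall(50, mean_in=40.0, sd_in=8.0, rng=rng)
flow = rainfall_to_runoff(rain, runoff_coefficient=0.3, loss_in=10.0)
summary = summarize(flow)
print({name: round(value, 2) for name, value in summary.items()})
```

The summary depends on the assumed rainfall model and on the two conversion parameters, so it is only as representative as the data used to choose them.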
Models must be treated as a tool that can be used to gain insight and to arrive at potential outcomes in a given hydrologic setting, but the final decision regarding any hydrologic process rests with the hydrologist, not the models. The hydrologist may choose to adopt a solution generated from modeling considerations, but this decision must be based on the hydrologist's convictions that the solution is hydrologically sound and not simply on how well the model describes the data. How close the final real solution is to the model solution will certainly depend on how well the physical setting matches the assumptions of the modeling techniques employed. It is the hydrologist who must make the determination as to the relationship between the model result and hydrologic reality. The fact that a statistical modeling procedure requires assumptions that are not strictly met in a particular hydrologic setting does not mean that statistically derived results are of no value. Again, the statistical modeling technique is used to provide insight into the problem at hand and not the final result. Even when it is known that certain assumptions are violated, useful information can often be obtained from a statistical modeling effort. Throughout this book, assumptions that accompany the statistical technique being discussed will be set forth and discussed from a hydrologic standpoint. The potential problems associated with violating the assumptions will be discussed. One of the frustrations that is constantly faced in using statistical models to represent hydrologic systems is trying to determine if assumptions are met or to what extent assumptions are not met for a particular set of data and the effect of not meeting assumptions on conclusions reached using the method. One might come away feeling that it is inappropriate to use statistics in hydrology. That is not the case at all. What is inappropriate is for an analyst to relegate absolute hydrologic authority to

a statistical analysis at the expense of hydrologic knowledge of the system and to give no weight to other tools available, such as mathematical models and common sense.

Deterministic hydrologic models, whether numerical or conceptual, suffer the same problems in terms of assumptions as do statistical models. Rarely are hydrologic models adequately tested over the full range of conditions for which they will be applied. Rarely are all of the assumptions associated with hydrologic models actually set forth. For instance, one assumption inherent in hydrologic models is that a basin's hydrologic response to a rare or extreme event can be modeled with the same algorithms used to model common or predominant events.

In hydrologic frequency analysis, the criticism is often justifiably leveled that estimating a rare flood (say, a 500-year flood) from a record of 20 or 30 years, none of which are extraordinarily large, is fraught with the possibility of error. The question is asked, how could relatively common flow levels have information embedded in them that would determine the magnitude of a 500-year event? To put it another way by example, in Oklahoma most annual peak flows from smaller watersheds are generated from thunderstorms that arise over the Great Plains of the central United States. The really big floods may be the result of a hurricane sweeping in from the Gulf of Mexico and traveling over Oklahoma. How can flow data from thunderstorms predict flow magnitudes of hurricane-related floods? But the same questions apply to deterministic hydrologic models. If a model is formulated and parameters estimated based on common flow levels, how can one be sure these same parameter values and algorithms apply to extreme events? In both cases, flood frequency analysis and modeling, information is gained about the possible magnitude of the 500-year event. For certain neither estimate is exact!
In addition to these estimates, the hydrologist should do some field work, look at channel capacities, possibly look for evidence of extreme floods in the geologic past (paleohydrology), and rely on as much hydrologic reasoning as possible to arrive at the final estimate of the 500-year event. One should additionally attempt to place some type of uncertainty bands on the estimate. What is being suggested is that responsibility for a hydrologic estimate rests squarely on the hydrologist rather than on some analytic technique. One cannot blame the log-Pearson type III distribution for making a bad flood frequency estimate. The problem is not the distribution itself (after all, the distribution is just a mathematical equation) but the inappropriate application of the distribution in making the estimate. One cannot blame a hydrologic model if a hydraulic structure fails because the flow estimated by the model was in error. One may conclude that the model was inappropriate, but it was the hydrologist who made the estimate using the model as a tool.

HYDROLOGIC DATA

Hydrologic data seem to be simultaneously abundant and scarce. We are deluged with data on rainfall, temperature, snowfall, and relative humidity from around the world on a daily basis in newspapers, radio and television reports, and on worldwide computer information networks. Many agencies worldwide collect and archive hydrologic data on streamflow, lake and reservoir levels, ground water elevations, water quality measures, and other aspects of the hydrologic cycle. These data are available in many different forms. Access to hydrologic data is currently improving rapidly as the data are made available over electronic networks.

Yet in the face of this apparent abundance, data on a particular aspect of the hydrologic cycle at a particular location for a particular time period are often inadequate or completely lacking. It is often the task of the hydrologist to use any data that can be found having some application to the problem at hand, hydrologic models of various kinds, plus their own hydrologic knowledge to explain past, present, or anticipated hydrologic behavior of the system under study. Statistical procedures are used to evaluate the data, transfer the data to the problem at hand, select models and model parameters, evaluate model predictions, organize one's personal conception of how available data and knowledge come to bear on the problem, make predictions of future behavior of the system, and many other aspects of hydrologic problem-solving.

Hydrologic data are generally presented as values at particular times, such as a river stage at a particular time, or values averaged over time, such as the annual flow for a stream for a particular year. Aggregating data into averages over time intervals may cause a loss of information if the variability of the process within the time period is of interest. Conversely, aggregation may make it possible to more clearly visualize long-term trends because short-term variations about the trend may be removed.

The variability from observation to observation in a time series of hydrologic data may be very rapid and significant or very minor. Generally systems having a lot of storage vary more slowly than systems lacking that storage. Figure 1.1 is a plot of the water surface elevation of the Great Salt Lake near Salt Lake City, Utah. This figure shows that during the period of this record, water level changes of about 20 feet have occurred but year-to-year change is relatively slow, with the exception of 1982-1984, when a rise of about 4 feet per year occurred, and the late 1980s, when the level dropped rather quickly.
Figure 1.2 shows the annual peak discharge for the Kentucky River near Salvisa, Kentucky. There is little year-to-year carry-over or storage in this river system, so the flows vary more or less randomly from one year to the next. Figure 1.3 shows the water surface elevation of Devils Lake in North Dakota. The behavior of this lake is puzzling in that it has gone from nearly 1440 feet in elevation in 1867 to 1401 feet in 1940 in an almost continuous decline, at which point an erratic but steady increase in elevation began until it reached 1447 feet in 1999.

Fig. 1.1. Water surface elevation of the Great Salt Lake near Salt Lake City, Utah (plotted against year).


Fig. 1.2. Annual peak flows on the Kentucky River near Salvisa, Kentucky (plotted against year).

Fig. 1.3. Water surface elevation of Devils Lake, North Dakota (plotted against year).

In the case of the Salt Lake data, a model that estimated the water level in one year based solely on the level the previous year might produce reasonable estimates. The form of such a model would be $y_t = y_{t-1}$, where $y_t$ is the water level at time $t$ and $y_{t-1}$ is the water level at the previous time $t - 1$. Such a model may give a better prediction of the lake level in year $t$ than would a model $y_t = \bar{y}$, where $\bar{y}$ is the average lake level. The opposite is the case in the Kentucky River peak flow data. Here $y_t = \bar{y}$ would be better than $y_t = y_{t-1}$. The previous year's flow is of little value in predicting the current year's flow. A model for Devils Lake would be difficult to surmise based simply on lake level data, because even a reasonable estimate for the long-term average lake level could not be determined on this record of over 100 years. Simply based on the data, one cannot determine the maximum elevation reached prior to 1867 or what elevation the lake might achieve in the absence of human interference after 1999. Presumably, physical and hydrologic information would shed some light on this problem. These considerations will be discussed in detail and quantified later in the book.
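The contrast between the two simple models can be sketched numerically. The Python fragment below is an illustration only: the two series are hypothetical values chosen to mimic a slowly varying, storage-dominated record and an erratic record with little carry-over, not data read from figures 1.1 or 1.2, and the function names are assumptions made for the example. Each model is scored by its mean squared error over the same years, with the average taken over the whole series.

```python
from statistics import mean

def mse_persistence(series):
    """Mean squared error of the model y_t = y_(t-1)."""
    errors = [(curr - prev) ** 2 for prev, curr in zip(series, series[1:])]
    return mean(errors)

def mse_mean_model(series):
    """Mean squared error of the model y_t = ybar, scored on the same years.

    ybar is the average of the whole series (an in-sample average).
    """
    ybar = mean(series)
    errors = [(curr - ybar) ** 2 for curr in series[1:]]
    return mean(errors)

# Hypothetical values, not taken from the figures.
slow_series = [10, 11, 12, 13, 14, 15]       # storage-dominated, lake-like
erratic_series = [10, 20, 10, 20, 10, 20]    # little carry-over, peak-flow-like

for name, series in [("slow", slow_series), ("erratic", erratic_series)]:
    print(name, mse_persistence(series), mse_mean_model(series))
```

For the slowly varying series the previous-year model has the smaller error, while for the erratic series the average is the better predictor, which is the pattern described above for the lake and river records.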


In selecting data for model parameter estimation, it is important to establish that the data are representative and homogeneous over time or can be adjusted for any nonhomogeneities that may be present. If anything has occurred to cause a change in the characteristic being analyzed, the data must either be adjusted to account for the change or analyzed in two sections: one before the change and one after. Some common causes of nonhomogeneities are relocating gages (especially rain gages), diverting streamflows, constructing dams, watershed changes such as urbanization or deforestation, stream channel alterations and possibly weather modification, as well as natural events of a catastrophic nature such as earthquakes, hurricane floods, and so forth. In some instances the data can be corrected for changes. One possible adjustment would be by reverse reservoir routing to determine what streamflows would have been had a reservoir not been constructed. Some changes such as gradual urbanization of a watershed are difficult to correct.

The statement that the data must be representative means, for example, that data from only unusually wet or dry periods should not be used alone as this will bias the results of the analysis. If there are only a few years of record available for analysis, the chances are good that the data are not representative of the long-term variability that actually exists. Most stochastic models assume that the data being considered are homogeneous and representative.

The concept of the return period of hydrologic events plays an important role in hydrology. The return period of an event is defined as the average elapsed time between occurrences of an event with a certain magnitude or greater. For example, a 25-year peak discharge is a discharge that is equaled or exceeded on average once every 25 years over a long period of time. It does not mean that an exceedance occurs every 25 years, but that the average time between exceedances is 25 years.
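The distinction between "every 25 years" and "on average every 25 years" can be seen in a small simulation. The sketch below rests on two assumptions that are labeled here rather than taken from any data set: each year is treated as independent of the others, and the chance of an exceedance in any year is set to the reciprocal of the return period. The function names, the seed, and the record length are arbitrary choices for illustration.

```python
import random

def recurrence_intervals(exceeded):
    """Years elapsed between successive exceedances in a yearly True/False record."""
    years = [i for i, flag in enumerate(exceeded) if flag]
    return [later - earlier for earlier, later in zip(years, years[1:])]

def simulate_exceedances(return_period, n_years, seed=0):
    """Assumed model: independent years, each exceeded with probability 1 / return_period."""
    rng = random.Random(seed)
    p = 1.0 / return_period
    return [rng.random() < p for _ in range(n_years)]

# A long synthetic record for a 25-year event (hypothetical, not observed data).
record = simulate_exceedances(return_period=25, n_years=100_000)
gaps = recurrence_intervals(record)

print("mean interval:", sum(gaps) / len(gaps))
print("shortest and longest interval:", min(gaps), max(gaps))
```

The individual intervals scatter widely, from back-to-back years to gaps far longer than 25 years, while their average stays close to the return period.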
An exceedance is an event with a magnitude equal to or greater than a certain value. Sometimes the actual time between exceedances is called the recurrence interval. With this definition for recurrence interval, the average recurrence interval for a certain event is equal to the return period of that event. In this book, recurrence interval is used in the same sense as return period. Of course, the concept of return period can also be applied to low flows, droughts, shortages, and so on. In this case the return period would be the average time between events with a certain magnitude or less. Such an event might still be called an exceedance in the sense that the severity of a drought exceeds some preset level. Regardless of whether the return period is referring to an event greater than some value or to an event less than some value, the return period can be related to a probability of an exceedance. If an exceedance occurs on the average once every 25 years, then the probability or chance that the event occurs in any given year is 1/25 = 0.04 or 4%. Probability, $p$, of an event occurring in any one year and return period, $T$, in years, are thus related by

$$p = \frac{1}{T}$$

This is a fundamental definition in statistical hydrology.

The concept of a random sample is used throughout this book. A sample might be thought of as a collection of objects selected from a larger collection of these same objects. The larger


collection of objects, if it contains all of the objects possible, is called the population. For example, 20 years of peak flow data from a certain river is a sample of the possible peak flows on the river. A random sample is one that is selected in such a fashion that any other sample could have resulted with equal likelihood. If the 20 years of peak flow data are considered a random sample, then one is assuming that these 20 years of data are just as likely as any other possible 20 years of data and vice versa.

In some types of analysis it is assumed that the order of occurrence of the data is not important, only the data values are important. The traditional hydrologic frequency analysis is an example of this. If a sample contains elements that are independent of each other, then the order of occurrence of the data is not important. This is the same as saying that the magnitude of an element in the sample is not affected by the temporal pattern of the other elements in the sample. Each element in the sample might be thought of as a random sample of size 1.

On the other hand, there are situations where the order of occurrence of the events is important. In designing a storage reservoir to meet projected water demands, the fact that low flows tend to follow low flows makes it necessary to have a larger reservoir than would be required if the low flows occurred randomly throughout time. This is known as persistence and indicates the elements of the sample are not independent of each other. In this case the entire sequence of data values must be considered the random sample. That is, the sequence contained in the sample is assumed to be as likely as any other sequence. The individual events in the sample are not independent. If one wanted a random sample consisting of 7 observations of daily flows on a river during a particular year, the daily flows in a particular week of that year could not be used.
This is because the flow on the second, third, and so on, day of the week would be dependent on the flows on the preceding days. The flow on day 2, for example, would not represent all possible daily flows but would be highly dependent on the flow during day 1. To get a random sample of daily flows, each of the 365 daily flows would have to have an equal chance of being selected. The sample of flows during the 7 consecutive days could be considered as a random sample of size 1 of weekly flows (if the week was randomly selected) but not a random sample of size 7 of daily flows.

In any hydrologic data there are errors of various kinds. The errors include measurement errors, data transmittal errors, processing errors, and others. The errors may be systematic errors and show up as a bias in the data or they may be random errors. In most error analysis it is assumed that the errors are random errors and follow the normal distribution. The treatment of hydrologic data contained in this book is not concerned so much with these types of errors as it is with sampling errors. Sampling error is a misnomer in that there are no errors in the usual sense involved. Sampling errors should more properly be called sampling variability, sampling fluctuation, or sample uncertainty. What is meant by sampling error is simply that a random sample has statistical properties that are similar to the population parameters but only equal to the population properties as the sample size gets very large (or the entire population is sampled). If two samples are selected from the same population, their statistical properties will again be similar but equal to each other only as the sample size gets very large. For example, we may desire to know the average annual rainfall at a given location. Assume we can measure exactly, that is without any measurement error, the rainfall at the desired

location. Measurements are collected over a 5-year period and the average annual rainfall is calculated without error in the calculations. A second 5-year period elapses and data from this period are used to calculate the average annual rainfall. The two estimates will be different. Neither will equal the true average annual rainfall. The differences between the estimated values and the true value are the sampling errors. Note we cannot exactly determine the sampling error since the true average is not known. Thus, variability or uncertainty in the statistical properties of a population based on estimates of the properties from samples is called sampling error. It is clear that errors in the sense of mistakes, faulty data, or carelessness are not involved in sampling errors. Sampling error is simply an inherent property of random samples. If it weren't for sampling errors, this book or hundreds of others on statistics would not be needed since populations would then be completely specified by any sample from that population.

Example 1.1. The mean annual suspended sediment load for the Green River near Munfordville, Kentucky, can be estimated from the data contained in Appendix B. These data and the resulting estimated mean annual suspended sediment load may contain many types of errors. Systematic errors could result if the flow was sampled for sediment only when the depth of flow exceeded a preset stage. This is because low flows would not be sampled. Generally, the sediment concentration in low flows is less than that in higher flows. Thus a built-in bias or systematic error is produced. Measurement errors could result from plugged samplers, samplers not properly aligned with the direction of flow, allowing the sampler to pick up some bed load, and a number of other reasons.
Data transmittal errors and processing errors can result from mistakes in transcribing data from data forms, placing data in the wrong columns on spreadsheets or data entry forms, illegibly written data, and other sources. Sampling error can be illustrated by assuming that the tabulated data are exactly correct (contain no systematic, measurement, transmittal, or processing errors). If the mean annual suspended sediment load is calculated for each successive 5-year period, the results are 640,827; 484,739; 497,604; and 460,392 tons per year. Under the no-error assumption, 4 different values of the mean annual suspended sediment discharge have been calculated, each of which contains no errors yet none of which are the same. The difference in the 4 estimates is caused by natural variability in the phenomenon (sediment) being sampled. This difference is called sampling error. If conditions on the watershed contributing to the Green River near Munfordville never change and if the climatic conditions do not change, then theoretically the sampling error can be made as small as desired by an increase in the sample size above the 5 years used in this illustration. A practical limitation is imposed by the length of the available sediment load data record.

Much of the statistical machinery discussed in this book is concerned with sampling errors and the estimation of population characteristics from samples of data. The fact that sampling errors are inherent in random data does not mean, however, that statistical manipulations and sophistication can in any way overcome faulty data. The quality of any statistical analysis is no better than the quality of the data used. It can be worse but no better. Furthermore, statistical considerations should not be used to replace judgement and careful thought in analyzing hydrologic data. In many instances some intelligent thought is worth reams of computer output based


on a statistical analysis of some data. Statistics should be regarded as a tool, an aid to understanding, but never as a replacement for useful thought.

Rarely will one find a hydrologic problem that exactly fulfills all of the requirements for the application of one statistical technique or another. Two choices are thus available. One can redefine the problem so that it meets the requirement of the statistical theory and thus produce an "exact" answer to the artificial problem. The second approach is to alter the statistical technique where possible and then apply it to the real problem realizing that the results will be an approximate answer to the real problem. In this case the degree of the approximation depends on the severity of the violated assumptions. This latter approach is preferable and requires knowledge of available statistical techniques, of assumptions and theory underlying the techniques, and of the consequences of violating the assumptions. It is toward this latter approach that this book is oriented.

Most of the examples and exercises used in this book were selected for pedagogical reasons, not to promote a particular technique. Thus, when a problem involves fitting a normal distribution to annual peak flow, the purpose of the problem revolves around learning about the normal distribution and is not to demonstrate that a normal distribution is applicable to peak flows. Similarly, many examples and problems had to be simplified so that they could be realistically solved with attention focused on the statistical technique and not the many fascinating intricacies of most real problems. That is not to say the techniques do not apply to real problems; quite the contrary. However, most real problems involve multiple aspects, lots of data, and many considerations other than statistical ones. Rather than get involved in these other important aspects, many of the examples and problems are idealizations of real situations.
Because the exercises were selected as a learning aid, it will be instructive to at least read the problems at the end of each chapter. Many of the problems present useful results that supplement the material in that chapter.

Many actual problems in hydrology require considerable computation. Digital computers are used for this purpose. Special statistical-numerical procedures have been developed to simplify the computations involved and improve the accuracy of the results obtained from many of the analyses presented in this book. These procedures are not presented here. Rather the emphasis is on the principles involved. Some statistical techniques such as geostatistics and multivariate techniques often require extensive calculation, and considerable efficiency is gained by using special-purpose programs incorporating numerical shortcuts and safeguards against roundoff errors.

Finally, there are many important areas of statistical analysis applicable to hydrology that are not included in this book. These omitted techniques for the most part require knowledge of the material contained in this book before they can be applied. Thus, this book is an introduction to statistical methods in hydrology. Furthermore, the book is not intended as a handbook or statistical "cookbook" for hydrologists. The purpose of this book is to enable the reader to better apply statistical methods to hydrologic problems through a knowledge of the methods, their foundations and limitations.

2. Probability and Probability Distributions: Basic Concepts

HYDROLOGIC PROCESSES may be thought of as stochastic processes. Stochastic in this sense means involving an element of chance or uncertainty where the process may take on any of the values of a specified set with a certain probability. An example of a stochastic hydrologic process is the annual maximum daily rainfall over a period of several years. Here the variable would be the maximum daily rainfall for each year and the specified set would be the set of positive numbers. The instantaneous maximum peak flow observed during a year would be another example of a stochastic hydrologic process. Table 2.1 contains such a listing for the Kentucky River near Salvisa, Kentucky. By examining this table it can be seen that there is some order to the values yet a great deal of randomness exists as well. Even though the peak flow for each of the 99 years is listed, one cannot estimate with certainty what the peak flow for 1998 was. From the tabulated data one could surmise that the 1998 peak flow was "probably" between 20,600 cfs and 144,000 cfs. We would like to be able to estimate the magnitude of this "probably". The stochastic nature of the process, however, means that one can never estimate with certainty the exact value for the process (peak discharge) based solely on past observations.

The definition of stochastic given above has some theoretical drawbacks, as we shall see. Hydrologic processes are continuous processes. The probability of realizing a given value from a continuous probability distribution is zero. Thus, the probability that a variable will take on a certain value from a specified set is zero, if the variable is continuous. Practically this presents no problem because we are generally interested in the probabilities that the variate will be in some range of values.
For instance, we are generally not interested in the probability that the flow rate will be exactly 100,000 cfs but may desire to estimate the probability that the flow will exceed 100,000 cfs, or be less than 100,000 cfs, or be between 90,000 and 120,000 cfs.

Table 2.1. Peak discharge (cfs), Kentucky River near Salvisa, Kentucky. Columns: Year, Flow.

With this introduction, several concepts such as probability, continuous, and probability distribution have been introduced. We will now define these concepts and others as a basis for considering statistical methods in hydrology.

PROBABILITY

In the mathematical development of probability theory, the concern has been not so much how to assign probability to events, but what can be done with probability once these assignments are made. In most applied problems in hydrology, one of the most important and difficult tasks is the initial assignment of probability. We may be interested in the probability that a certain flood level will be exceeded in any year or that the elevation of a piezometric head may be

more than 30 feet below the ground surface for 20 consecutive months. We may want to determine the capacity required in a storage reservoir so that the probability of being able to meet the projected water demand is 0.97. To address these problems we must understand what probability means and how to relate magnitude to probabilities.

The definition of probability has been labored over for many years. One definition that is easy to grasp is the classical or a priori definition: If a random event can occur in $n$ equally likely and mutually exclusive ways, and if $n_a$ of these ways have an attribute $A$, then the probability of the occurrence of the event having attribute $A$ is $n_a/n$, written as

$$\mathrm{prob}(A) = \frac{n_a}{n} \tag{2.1}$$

This definition is an a priori definition because it assumes that one can determine before the fact all of the equally likely and mutually exclusive ways that an event can occur and all of the ways that an event with attribute $A$ can occur. The definition is somewhat circular in that "equally likely" is another way of saying "equally probable" and we end up using the word "probable" to define probability. This classical definition is widely used in games of chance such as card games and dice and in selecting objects with certain characteristics from a larger group of objects. This definition is difficult to apply in hydrology because we generally cannot divide hydrologic events into equally likely categories. To do that would require knowledge of the likelihood or probability of the events, which is generally the objective of our analysis and not known before the analysis.

The classical definition of probability takes on more utility in hydrology in terms of relative frequencies and limits. If a random event occurs a large number of times $n$ and the event has attribute $A$ in $n_a$ of these occurrences, then the probability of the occurrence of the event having attribute $A$ is

$$\mathrm{prob}(A) = \lim_{n \to \infty} \frac{n_a}{n} \tag{2.2}$$

The relative frequency approach to estimating probabilities is empirical in that it is based on observations. Obviously, we will not have an infinite number of observations. For this probability estimate to be very accurate, $n$ may have to be quite large. This is frequently a limitation in hydrology. The relative frequency concept of probability is the source of the relationship given in chapter 1 between the return period, $T$, of an event and its probability of occurrence, $p$.

These two definitions of probability can be illustrated by considering the probability of getting heads in a single flip of a coin. If we know a priori that the coin is balanced and not biased toward heads or tails, we can apply the first definition. There are two possible and equally likely


outcomes, heads or tails, so $n$ is 2. There is one outcome with heads so $n_a$ is 1. Thus the probability of a head is 1/2. If the coin is not balanced so that the two outcomes are not equally likely, we could not use the a priori definition. We had to know the answer to our question before we could apply the a priori definition. This is not the case when the relative frequency definition is used. Obviously we cannot flip the coin an infinite number of times. We have to resort to a finite sample of flips. Figure 2.1 shows how the estimate of the probability of a head changes as the number of trials (flips) changes. A trend toward 1/2 is noted. This is called stochastic convergence toward 1/2. One question that might be asked is, "is the coin unbiased?" One's initial reaction is that more trials are needed. It can be seen that the probability is slowly converging toward 1/2 but after 250 trials is not exactly equal to 1/2. This is the plight of the hydrologist. Many times more trials or observations are needed but are not available. Still, the data do not clearly indicate a single answer. This is where probability and statistics come into play.

Fig. 2.1. Coin flipping experiment: probability of getting a "head" plotted against the number of trials.

Equation 2.2 allows us to estimate probabilities based on observations and does not require that outcomes be equally likely or that they all be enumerated. This advantage is somewhat offset in that estimates of probability based on observations are empirical and will only stochastically converge to the true probability as the number of observations becomes large. For example, in the case of annual flood peaks, only one value per year is realized. Figure 2.2 shows the probability of an annual peak flow on the Kentucky River exceeding the mean annual peak flow as a function of time starting in 1895. Note that each year additional data become available to determine both the mean annual peak flow and the probability of exceeding that value.
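The running estimate that figures 2.1 and 2.2 trace out can be reproduced in a few lines. The sketch below applies the relative frequency estimate after every trial of a simulated coin. The coin is assumed fair, the seed is arbitrary, and the function name is a label chosen for this illustration, so the particular values it prints are not those plotted in figure 2.1.

```python
import random

def running_relative_frequency(outcomes):
    """Estimate prob(A) after each trial as n_a / n."""
    estimates = []
    n_a = 0
    for n, has_attribute in enumerate(outcomes, start=1):
        n_a += has_attribute          # True counts as 1, False as 0
        estimates.append(n_a / n)
    return estimates

rng = random.Random(1)                              # fixed seed so the run is repeatable
flips = [rng.random() < 0.5 for _ in range(250)]    # True means "heads" (assumed fair coin)
estimates = running_relative_frequency(flips)

for n in (10, 50, 250):
    print(n, round(estimates[n - 1], 3))
```

The early estimates swing widely because each new flip carries a large share of the total, and later estimates settle down without ever being guaranteed to equal the underlying probability.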
Here again, a convergence toward 1/2 is noted yet not assured. In fact, there is no reason to believe that 1/2 is the "correct"

Fig. 2.2. Probability that the annual peak flow on the Kentucky River exceeds the mean annual peak flow (plotted against year).

probability since the probability distribution of annual peak flows is likely not symmetrical about the mean. If two independent sets of observations are available (samples), an estimate of the probability of an event A could be determined from each set of observations. These two estimates of prob(A) would not necessarily equal each other nor would either estimate necessarily equal the true (population) prob(A) based on an infinitely large sample. This dilemma results in an important area of concern to hydrologists-how many observations are required to produce "acceptable" estimates for the probabilities of events? From either equation 2.1 or 2.2 it can be seen that the probability scale ranges from zero to one. An event having a probability of zero is impossible, whereas one having a probability of one will happen with certainty. Many hydrologists like to avoid the endpoints of the probability scale, zero and one, because they cannot be absolutely certain regarding the occurrence or nonoccurrence of an event. Sometimes probability is expressed as a percent chance with a scale ranging from 0% to 100%. Care must be taken to not confuse the percent chance values with true probabilities. A probability of one is very different from a 1% chance of occurrence as the former implies the event will certainly happen while the latter means it will happen only one time in 100. In mathematical statistics and probability, set and measure theory are used in defining and manipulating probabilities. An experiment is any process that generates values of random variables. All possible outcomes of an experiment constitute the total sample space known as the population. Any particular point in the sample space is a sample point or element. An event is a collection of elements known as a subset. To each element in the sample space of an experiment a non-negative weight is assigned such that the sum of the weights on all of the elements is 1. 
The magnitude of the weight is proportional to the likelihood that the experiment will result in a particular element. If an element is

quite likely to occur, that element would have a weight of near 1. If an element was quite unlikely to occur, that element would have a weight of near zero. For elements outside the sample space, a weight of zero is assigned. The weights assigned to the elements of the sample space are known as probabilities. Here again, the word likelihood is used to define probability so that the definition becomes circular. Letting $S$ represent the sample space, $E_i$ for $i = 1, 2, \ldots$ represent the elements in $S$, $A$ and $B$ represent events in $S$, and $\mathrm{prob}(E_i)$ represent the probability of $E_i$, it follows that

$$0 \le \mathrm{prob}(E_i) \le 1$$

Since the sample space is made up of the totality of elements in $S$, we have

$$S = \bigcup_i E_i$$

where $\bigcup_i$ represents the union or total collection of all of the $E_i$ and

$$\mathrm{prob}(S) = \sum_i \mathrm{prob}(E_i) = 1 \tag{2.4}$$

An event, $A$, is made up of a subset of elements in $S$ so that

$$A = \bigcup_{E_i \in A} E_i$$

and

$$\mathrm{prob}(A) = \sum_{E_i \in A} \mathrm{prob}(E_i)$$

These concepts are illustrated in figure 2.3 as a Venn diagram.

Fig. 2.3. Venn diagram illustrating a sample space, elements, and events.


Using notation from set theory and Venn diagrams, several probabilistic relationships can be illustrated. If $A$ and $B$ are two events in $S$, then the probability of $A$ or $B$, shown as the shaded areas of Figure 2.3, is given by

$$\mathrm{prob}(A \cup B) = \mathrm{prob}(A) + \mathrm{prob}(B) - \mathrm{prob}(A \cap B) \tag{2.6}$$

Note that in probability the word "or" means "either or both". The notation $\cup$ represents a union so that $A \cup B$ represents all elements in $A$ or $B$ or both. The notation $\cap$ represents an intersection so that $A \cap B$ represents all elements in both $A$ and $B$. The last term of equation 2.6 is needed since $\mathrm{prob}(A)$ and $\mathrm{prob}(B)$ both include $\mathrm{prob}(A \cap B)$. Thus, $\mathrm{prob}(A \cap B)$ must be subtracted once so the net result is only one inclusion of $\mathrm{prob}(A \cap B)$ on the right-hand side of the equation. If $A$ and $B$ are mutually exclusive, then both cannot occur and $\mathrm{prob}(A \cap B) = 0$. In this case

$$\mathrm{prob}(A \cup B) = \mathrm{prob}(A) + \mathrm{prob}(B)$$

Figure 2.3 illustrates the case where events $A$ and $B$ are mutually exclusive and figure 2.4 shows $A$ and $B$ when they are not mutually exclusive. If $A^c$ represents all elements in the sample space $S$ that are not in $A$, then

$$A \cup A^c = S$$

$A^c$ is known as the complement of $A$. Equation 2.4 indicates that

$$\mathrm{prob}(A \cup A^c) = 1$$

This statement says that the probability of $A$ or $A^c$ is certainty since one or the other must occur. All of the possibilities have been exhausted. Since $A$ and $A^c$ are mutually exclusive

$$\mathrm{prob}(A \cup A^c) = \mathrm{prob}(A) + \mathrm{prob}(A^c) = 1$$

or we have the very useful result that the probability of an event $A$ is

$$\mathrm{prob}(A) = 1 - \mathrm{prob}(A^c) \tag{2.7}$$

Equation 2.7 often makes it easy to evaluate probability by first evaluating the probability that an outcome will not occur. An example is evaluating the probability that a peak flow $q$ exceeds some particular flow $q_0$. $A$ would be all $q$'s greater than $q_0$ and $A^c$ would be all $q$'s less than $q_0$. Because $q$ must be either greater than or less than $q_0$, $\mathrm{prob}(q > q_0) = 1 - \mathrm{prob}(q < q_0)$. We show later that for continuous random variables $\mathrm{prob}(q = q_0) = 0$.

If the probability of an event $B$ depends on the occurrence of an event $A$, then we write $\mathrm{prob}(B \mid A)$, read as the probability of $B$ given $A$ or the conditional probability of $B$ given $A$ has occurred. The $\mathrm{prob}(B)$ is conditioned on the fact that $A$ has occurred. Referring to figure 2.4 it is apparent that conditioning on the occurrence of $A$ restricts consideration to $A$. Our total sample

PROBABILITY

23

Fig. 2.4. Venn diagram showing A U B and A fl B. space is now A. The occurrence of B given that A has occurred is represented by A fl B. Thus the prob(B I A) is given by

assuming of course that prob(A) f 0. Equation 2.8 can be rearranged to give the probability of A a n d B as

Now if prob(B1A) = prob(B), we say that B is independent of A. Thus the joint probability of two independent events is the product of their individual probabilities.

Example 2.1. Using the data shown in table 2.1, estimate the probability that a peak flow in excess of 100,000 cfs will occur in 2 consecutive years on the Kentucky River near Salvisa, Kentucky.

Solution: From table 2.1 it can be seen that a peak flow of 100,000 cfs was exceeded 7 times in the 99-year record. If it is assumed that the peak flows from year to year are independent, then the probability of exceeding 100,000 cfs in any one year is approximately 7/99 or 0.0707. Applying equation 2.9, the probability of exceeding 100,000 cfs in two successive years is found to be 0.0707 × 0.0707 or 0.0050.

Example 2.2. A study of daily rainfall at Ashland, Kentucky, has shown that in July the probability of a rainy day following a rainy day is 0.444, a dry day following a dry day is 0.724, a rainy day following a dry day is 0.276, and a dry day following a rainy day is 0.556. If it is observed that a certain July day is rainy, what is the probability that the next two days will also be rainy?

Solution: Let A be a rainy day 1 and B be a rainy day 2 following the initial rainy day. The probability of A is 0.444 since this is the probability of a rainy day following a rainy day.

$$\text{prob}(A \cap B) = \text{prob}(A)\,\text{prob}(B \mid A)$$

Now, the prob(B | A) is also 0.444 since this is the probability of a rainy day following a rainy day. Therefore

$$\text{prob}(A \cap B) = 0.444 \times 0.444 = 0.197$$

The probability of two rainy days following a dry day would be 0.276 × 0.444, or 0.122. Note that the probabilities of wet and dry days are dependent on the previous day. Independence does not exist. It can be shown that over a long period of time, 67% of the days will be dry and 33% will be rainy with the conditional probabilities as stated. If one had assumed independence, then the probability of two consecutive rainy days would have been 0.33 × 0.33 = 0.1089, regardless of whether the preceding day had been rainy or dry. Since the probability of a rainy day following a rainy day is much greater than following a dry day, persistence is said to exist.
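The arithmetic of example 2.2 can be checked with a short sketch. The function and constant names below are illustrative and not from the text; the long-run wet fraction is obtained from the balance condition of a two-state chain, which is one way (assumed here) to arrive at the 67%/33% split quoted above.

```python
# Conditional probabilities for July daily rainfall quoted in example 2.2.
P_WET_GIVEN_WET = 0.444
P_DRY_GIVEN_WET = 0.556
P_WET_GIVEN_DRY = 0.276
P_DRY_GIVEN_DRY = 0.724


def two_rainy_days(prob_first_rainy: float) -> float:
    """prob(A and B) = prob(A) * prob(B | A), with B a rainy day after a rainy day."""
    return prob_first_rainy * P_WET_GIVEN_WET


def long_run_wet_fraction() -> float:
    """Long-run fraction of rainy days for the two-state wet/dry chain.

    Balance condition: wet * P(dry | wet) = dry * P(wet | dry), with wet + dry = 1.
    """
    return P_WET_GIVEN_DRY / (P_WET_GIVEN_DRY + P_DRY_GIVEN_WET)


after_wet = two_rainy_days(P_WET_GIVEN_WET)   # today is rainy
after_dry = two_rainy_days(P_WET_GIVEN_DRY)   # today is dry
wet = long_run_wet_fraction()

print(f"two rainy days after a rainy day: {after_wet:.4f}")
print(f"two rainy days after a dry day:   {after_dry:.4f}")
print(f"long-run fraction of rainy days:  {wet:.4f}")
print(f"two rainy days if independent:    {wet * wet:.4f}")
```

The gap between the conditional results and the independence result is the persistence described in the example.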

TOTAL PROBABILITY THEOREM

If $B_1, B_2, \ldots, B_n$ represents a set of mutually exclusive and collectively exhaustive events, one can determine the probability of another event A from

$$\text{prob}(A) = \sum_{i=1}^{n} \text{prob}(A \mid B_i)\,\text{prob}(B_i) \qquad (2.10)$$

This is called the theorem of total probability. Equation 2.10 is illustrated by figure 2.5.

Fig. 2.5. Venn diagram for theorem of total probability.

Example 2.3. It is known that the probability that the solar radiation intensity will reach a threshold value is 0.25 for rainy days and 0.80 for nonrainy days. It is also known that for this particular location the probability that a day picked at random will be rainy is 0.36. What is the probability the threshold intensity of solar radiation will be reached on a day picked at random?

Solution: Let A represent the event that the threshold solar radiation intensity is reached, $B_1$ represent a rainy day and $B_2$ a nonrainy day. From equation 2.10, we know that

$$\text{prob}(A) = \text{prob}(A \mid B_1)\,\text{prob}(B_1) + \text{prob}(A \mid B_2)\,\text{prob}(B_2) = 0.25(0.36) + 0.80(0.64) = 0.602$$

This is an example of a weighted probability. The probability of A given a condition is weighted by the probability of the condition and summed.
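The weighted sum of equation 2.10 is easy to express in code. The following is a minimal sketch using the numbers of example 2.3; the helper name `total_probability` is an assumption made for illustration.

```python
def total_probability(conditionals, priors):
    """Return prob(A) = sum_i prob(A | B_i) * prob(B_i).

    `priors` must describe mutually exclusive, collectively exhaustive
    events, so they are required to sum to 1.
    """
    if len(conditionals) != len(priors):
        raise ValueError("need one conditional probability per event B_i")
    if abs(sum(priors) - 1.0) > 1e-9:
        raise ValueError("prob(B_i) must sum to 1")
    return sum(c * p for c, p in zip(conditionals, priors))


# Example 2.3: B_1 = rainy day, B_2 = nonrainy day.
prob_threshold = total_probability(conditionals=[0.25, 0.80], priors=[0.36, 0.64])
print(f"prob(threshold reached) = {prob_threshold:.3f}")
```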

BAYES THEOREM

By rewriting equation 2.8 in the form

$$\text{prob}(B_j \mid A) = \frac{\text{prob}(B_j \cap A)}{\text{prob}(A)} = \frac{\text{prob}(A \mid B_j)\,\text{prob}(B_j)}{\text{prob}(A)}$$

and then substituting from equation 2.10 for prob(A), we get what is called Bayes Theorem:

$$\text{prob}(B_j \mid A) = \frac{\text{prob}(A \mid B_j)\,\text{prob}(B_j)}{\sum_{i=1}^{n} \text{prob}(A \mid B_i)\,\text{prob}(B_i)} \qquad (2.11)$$

As pointed out by Benjamin and Cornell (1970), this simple derivation of Bayes Theorem belies its importance. It provides a method for incorporating new information with previous or so-called prior probability assessments to yield new values for the relative likelihood of events of interest. These new (conditional) probabilities are called posterior probabilities. Equation 2.11 is the basis of Bayesian Decision Theory. Bayes theorem provides a means of estimating probabilities of one event by observing a second event. Such an application is illustrated in example 2.4.

Example 2.4. The manager of a recreational facility has determined that the probability of 1000 or more visitors on any Sunday in July depends upon the maximum temperature for that Sunday as shown in the following table. The table also gives the probabilities that the maximum temperature will fall in the indicated ranges. On a certain Sunday in July, the facility has more than 1000 visitors. What is the probability that the maximum temperature was in the various temperature classes?

Temp class    Prob of 1000 or      Prob of being     Prob of T_j given
T_j (°F)      more visitors        in temp class     1000 or more visitors
T_1           0.05                 0.05              0.005
T_2           0.20                 0.15              0.059
T_3           0.50                 0.20              0.197
T_4           0.75                 0.35              0.517
T_5           0.50                 0.20              0.197
T_6           0.25                 0.05              0.025
Total                                                1.000

Solution: Let $T_j$ for j = 1, 2, ..., 6 represent the 6 intervals of temperature. Then from equation 2.11

$$\text{prob}(T_j \mid 1000 \text{ or more}) = \frac{\text{prob}(1000 \text{ or more} \mid T_j)\,\text{prob}(T_j)}{\sum_{i=1}^{6} \text{prob}(1000 \text{ or more} \mid T_i)\,\text{prob}(T_i)}$$

$$\text{prob}(T_j \mid 1000 \text{ or more}) = \frac{\text{prob}(1000 \text{ or more} \mid T_j)\,\text{prob}(T_j)}{0.05(0.05) + 0.20(0.15) + \cdots + 0.25(0.05)}$$

The denominator equals 0.5075. For example, for the fourth temperature class, prob($T_4$ | 1000 or more) = 0.75(0.35)/0.5075 = 0.517.
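The posterior column of the table in example 2.4 can be reproduced with a few lines of code. This is a sketch of equation 2.11 applied to the tabulated values; the function name `posterior` is illustrative only.

```python
def posterior(likelihoods, priors):
    """Apply Bayes theorem to a set of mutually exclusive, exhaustive classes.

    likelihoods[j] = prob(observation | class j)
    priors[j]      = prob(class j)
    Returns (posterior probabilities, prob(observation)).
    """
    joint = [like * prior for like, prior in zip(likelihoods, priors)]
    evidence = sum(joint)  # total probability of the observation (equation 2.10)
    return [value / evidence for value in joint], evidence


# Example 2.4: observation = "1000 or more visitors", classes = temperature ranges.
likelihoods = [0.05, 0.20, 0.50, 0.75, 0.50, 0.25]
priors = [0.05, 0.15, 0.20, 0.35, 0.20, 0.05]

post, evidence = posterior(likelihoods, priors)
for j, value in enumerate(post, start=1):
    print(f"prob(T_{j} | 1000 or more) = {value:.3f}")
print(f"prob(1000 or more) = {evidence:.4f}")
```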

prob( 6) = 1 - Px(6) Px(x)

for

X >5 = 1

Piecewise continuous distributions satisfying the requirements for a probability distribution in which the prob(X = d) is not zero are possible. Such a distribution could be defined by

$$P_X(x) = \begin{cases} P_1(x) & \text{for } X < d \\ P_2(x) & \text{for } X \ge d \end{cases} \qquad (2.24)$$

where $P_2(d) > P_1(d)$, $P_1(x)$ equals 0 at the lower limit of X, $P_2(x)$ equals 1 at the upper limit of X, and $P_1(x)$ and $P_2(x)$ are nondecreasing functions of X. Figure 2.14 is a plot of such a distribution. For this situation the prob(X = d) equals the magnitude of the jump ΔP at X = d or is equal to $P_2(d) - P_1(d)$. Any finite number of discontinuities of this type is possible.

An example of a distribution as shown in figure 2.14 is the distribution of daily rainfall amounts. The probability that no rainfall is received, prob(X = 0), is finite, whereas the probability distribution of rain on rainy days would form a continuous distribution. A second example would be the probability distribution of the water level in some reservoir. The water level may be maintained at a constant level d as much as possible but may fluctuate below or above d at times. The distribution shown in figure 2.14 could represent this situation.

Fig. 2.14. A possible piecewise continuous pdf for the case prob(X = d) ≠ 0.

The relationship between relative frequency and probability can be envisioned by considering an experiment whose outcome is a value of the random variable X. Let $p_X(x)$ be the probability density function of X. The probability that a single trial of the experiment will result in an outcome between X = a and X = b is given by

$$\text{prob}(a \le X \le b) = \int_a^b p_X(x)\,dx$$

In N independent trials of the experiment, the expected number of outcomes in the interval a to b would be

$$N \int_a^b p_X(x)\,dx$$

and the expected relative frequency of outcomes in the interval a to b is

$$\int_a^b p_X(x)\,dx$$

In general, if $x_i$ represents the midpoint of an interval of X given by $x_i - \Delta x_i/2$ to $x_i + \Delta x_i/2$, then the expected relative frequency of outcomes in this interval of repeated, independent trials of the experiment is given by

$$f_{x_i} = \int_{x_i - \Delta x_i/2}^{x_i + \Delta x_i/2} p_X(x)\,dx$$

Because the right-hand side of this equation represents the area under $p_X(x)$ between $x_i - \Delta x_i/2$ and $x_i + \Delta x_i/2$, it can be approximated by

$$f_{x_i} = \Delta x_i\, p_X(x_i) \qquad (2.25)$$

Equation 2.25 can be used to determine the expected relative frequency of repeated, independent outcomes of a random experiment whose outcome is a value of the random variable X. If N independent observations of X are available, the actual relative frequency of outcomes in an interval of width $\Delta x_i$ centered on $x_i$ may not equal $f_{x_i}$ as given by equation 2.25 because X is a random variable whose behavior can only be described probabilistically. The most probable outcome or the expected outcome will equal the observed outcome only if $p_X(x)$ is truly the probability density function for X and for an infinitely large number of observations. Even if the true probability density function is being used, the actual frequency of outcomes in the interval $\Delta x_i$ approaches the expected number only as the number of trials or observations becomes very large.

Example 2.9. Plot the expected frequency histogram using the probability density function of example 2.8 and a class interval of 1/2.

Solution: $f_{x_i} = \Delta x_i\, p_X(x_i)$

x_i      f_{x_i}
0.25     0.00075
0.75     0.00675
1.25     0.01875
1.75     0.03675
2.25     0.06075
2.75     0.09075
3.25     0.12675
3.75     0.16875
4.25     0.21675
4.75     0.27075
Sum      0.99750

The desired plot is shown in figure 2.15.

Fig. 2.15. Plot for example 2.9.
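The tabulated values of example 2.9 follow directly from equation 2.25. The sketch below assumes the density of example 2.8 is $p_X(x) = 3x^2/125$ for 0 < X < 5; that form is not stated in this section and is inferred from the tabulated values, which it reproduces. Function names are illustrative.

```python
def density(x: float) -> float:
    """Assumed density p_X(x) = 3x^2/125 on 0 < x < 5 (inferred, see lead-in)."""
    return 3.0 * x * x / 125.0 if 0.0 < x < 5.0 else 0.0


def expected_relative_frequencies(width: float, lower: float, upper: float):
    """Apply f_xi = delta_x * p_X(x_i) at the midpoint of each class interval."""
    n_classes = round((upper - lower) / width)
    rows = []
    for k in range(n_classes):
        midpoint = lower + (k + 0.5) * width
        rows.append((midpoint, width * density(midpoint)))
    return rows


rows = expected_relative_frequencies(width=0.5, lower=0.0, upper=5.0)
for midpoint, freq in rows:
    print(f"{midpoint:5.2f}  {freq:.5f}")
total = sum(freq for _, freq in rows)
print(f"  sum  {total:.5f}")
```

The sum falls slightly short of 1 because equation 2.25 approximates the area in each class by a rectangle; a narrower class interval reduces the discrepancy.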


BIVARIATE DISTRIBUTIONS

The situation frequently arises where one is interested in the simultaneous behavior of two or more random variables. An example might be the flow rates on two streams near their confluence. One might like to know the probability of both streams having peak flows exceeding a given value. A second example might be the probability of a rainfall exceeding 2.5 inches at the same time the soil is nearly saturated. Rainfall depth and soil water content would be two random variables.

Example 2.10. The magnitude of peak flows from small watersheds is often estimated from the "Rational Equation" given by Q = CIA where Q is the estimated flow, C is a coefficient, I is a rainfall intensity, and A is the watershed area. The assumption is made that the return period of flow will be the same as the return period of the rainfall that is used. To verify this assumption it is necessary to study the joint probabilities of the two random variables Q and I.

If X and Y are continuous random variables, their joint probability density function is $p_{X,Y}(x, y)$ and the corresponding cumulative probability distribution is $P_{X,Y}(x, y)$. These two are related by

$$p_{X,Y}(x, y) = \frac{\partial^2}{\partial x\,\partial y} P_{X,Y}(x, y)$$

and

$$P_{X,Y}(x, y) = \text{prob}(X \le x \text{ and } Y \le y) = \int_{-\infty}^{x}\int_{-\infty}^{y} p_{X,Y}(t, s)\,ds\,dt$$

The corresponding relationships for X and Y being discrete random variables are

$$f_{X,Y}(x_i, y_j) = \text{prob}(X = x_i \text{ and } Y = y_j) \qquad (2.28)$$

$$F_{X,Y}(x, y) = \text{prob}(X \le x \text{ and } Y \le y) = \sum_{x_i \le x}\,\sum_{y_j \le y} f_{X,Y}(x_i, y_j)$$

It should be noted that the bivariate analogy of equation 2.16 is

Some of the properties of continuous bivariate distributions are

1) $P_{X,Y}(x, \infty)$ is a cumulative univariate probability function of X only (the cumulative marginal distribution of X).

2) $P_{X,Y}(\infty, y)$ is a cumulative univariate probability function of Y only (the cumulative marginal distribution of Y).

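MARGINAL DISTRIBUTIONS

If one is interested in the behavior of one of a pair of random variables regardless of the value of the second random variable, the marginal distribution may be used. For instance, the marginal density of X, $p_X(x)$, is obtained by integrating $p_{X,Y}(x, y)$ over all possible values of Y.

$$p_X(x) = \int_{-\infty}^{\infty} p_{X,Y}(x, y)\,dy$$

The cumulative marginal distribution is given by

$$P_X(x) = P_{X,Y}(x, \infty) = \text{prob}(X \le x \text{ and } Y \le \infty) \qquad (2.32a)$$

Similarly, the marginal density and cumulative marginal distribution of Y are

$$p_Y(y) = \int_{-\infty}^{\infty} p_{X,Y}(x, y)\,dx$$

and

$$P_Y(y) = P_{X,Y}(\infty, y) = \text{prob}(X \le \infty \text{ and } Y \le y)$$

The corresponding relationships for a discrete bivariate distribution are

$$f_X(x_i) = \sum_j f_{X,Y}(x_i, y_j) \qquad F_X(x) = \sum_{x_i \le x}\,\sum_j f_{X,Y}(x_i, y_j)$$

$$f_Y(y_j) = \sum_i f_{X,Y}(x_i, y_j) \qquad F_Y(y) = \sum_{y_j \le y}\,\sum_i f_{X,Y}(x_i, y_j)$$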

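CONDITIONAL DISTRIBUTIONS

A marginal distribution is the distribution of one variable regardless of the value of the second variable. The distribution of one variable with restrictions or conditions placed on the second variable is called a conditional distribution. Such a distribution might be the distribution of X given that Y equals $y_0$ or the distribution of Y given that $x_1 \le X \le x_2$.

In general, the conditional distribution of X given that Y is in some region R is arrived at using the same reasoning that was used in obtaining equation 2.8. The total sample space of Y is now the region R. Because

$$\int_{-\infty}^{\infty}\int_R p_{X,Y}(x, s)\,ds\,dx = \int_R p_Y(s)\,ds$$

then

$$\frac{\int_R p_{X,Y}(x, s)\,ds}{\int_R p_Y(s)\,ds}$$

represents a probability density function of X given that Y is in R. The conditional density of X given Y is in R is given by

$$p_{X|Y}(x \mid Y \text{ is in } R) = \frac{\int_R p_{X,Y}(x, s)\,ds}{\int_R p_Y(s)\,ds} \qquad (2.37)$$

for X and Y continuous. Similarly the conditional distribution of (X | Y is in R) for X and Y discrete is

$$f_{X|Y}(x_i \mid Y \text{ is in } R) = \frac{\sum_{y_j \in R} f_{X,Y}(x_i, y_j)}{\sum_{y_j \in R} f_Y(y_j)} \qquad (2.38)$$

The determination of conditional probabilities from equations 2.37 and 2.38 is done in the usual way.

$$\text{prob}(a \le X \le b \mid Y \text{ is in } R) = \int_a^b p_{X|Y}(x \mid Y \text{ is in } R)\,dx$$

for X and Y continuous and

$$\text{prob}(a \le X \le b \mid Y \text{ is in } R) = \sum_{a \le x_i \le b} f_{X|Y}(x_i \mid Y \text{ is in } R)$$

for X and Y discrete. For the special case where X and Y are continuous and the conditional density of X given Y = $y_0$ is desired, equation 2.37 breaks down because both the numerator and denominator become zero. In this case

$$p_{X|Y}(x \mid Y = y_0) = \frac{p_{X,Y}(x, y_0)}{p_Y(y_0)}$$

The proof of this may be found in Neuts (1973). In most statistics books $p_{X|Y}(x \mid Y = y_0)$ is simply written as

$$p_{X|Y}(x \mid y) = \frac{p_{X,Y}(x, y)}{p_Y(y)}$$

and called the conditional density of X given Y.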

All of the above results are symmetrical with respect to X and Y. For example

$$p_{Y|X}(y \mid x) = \frac{p_{X,Y}(x, y)}{p_X(x)}$$

for X and Y continuous where

$$p_X(x) = \int_{-\infty}^{\infty} p_{X,Y}(x, y)\,dy$$

If the region R of equation 2.37 is the entire region of definition with respect to Y, then

$$\int_R p_Y(s)\,ds = 1 \quad \text{and} \quad \int_R p_{X,Y}(x, s)\,ds = p_X(x)$$

so that

$$p_{X|Y}(x \mid Y \text{ is in } R) = p_X(x)$$

This results from the fact that the condition that Y is in R when R encompasses the entire region of definition of Y is really no restriction but simply a condition stating that Y may take on any value in its range. In this case, $p_{X|Y}(x \mid Y \text{ is in } R)$ is identical to the marginal density of X.

INDEPENDENCE

From equation 2.37 or 2.38 it can be seen that in general the conditional density of X given Y is a function of y. If the random variables X and Y are independent, this functional relationship disappears (i.e., $p_{X|Y}(x \mid y)$ is not a function of y). In fact, in this case

$$p_{X|Y}(x \mid y) = p_X(x)$$

or the conditional density equals the marginal density. Furthermore, if X and Y are independent (continuous or discrete) random variables, their joint density is equal to the product of their marginal densities.

$$p_{X,Y}(x, y) = p_X(x)\,p_Y(y)$$

The random variables X and Y are independent in the probabilistic sense (stochastically independent) if and only if their joint density is equal to the product of their marginal densities. Independence is an extremely important property. A bivariate distribution is much more difficult to define and to work with than is a univariate distribution. If independence exists and the proper pdf for X and for Y can be determined, the bivariate distribution for X and Y is given as the product of these two univariate distributions.
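For a discrete bivariate distribution, the marginal, conditional, and independence relationships reduce to sums over a table. The sketch below illustrates them; the two joint distributions are hypothetical values chosen for illustration, and the function names are not from the text.

```python
from itertools import product


def marginals(joint):
    """Return (f_X, f_Y) obtained by summing the joint pmf over the other variable."""
    f_x, f_y = {}, {}
    for (x, y), p in joint.items():
        f_x[x] = f_x.get(x, 0.0) + p
        f_y[y] = f_y.get(y, 0.0) + p
    return f_x, f_y


def conditional_x_given_y(joint, y0):
    """Return f_{X|Y}(x | Y = y0) = f_{X,Y}(x, y0) / f_Y(y0)."""
    f_x, f_y = marginals(joint)
    return {x: joint.get((x, y0), 0.0) / f_y[y0] for x in f_x}


def is_independent(joint, tol=1e-12):
    """True if the joint pmf equals the product of its marginals everywhere."""
    f_x, f_y = marginals(joint)
    return all(
        abs(joint.get((x, y), 0.0) - f_x[x] * f_y[y]) <= tol
        for x, y in product(f_x, f_y)
    )


# Hypothetical joint pmfs on {0, 1} x {0, 1}.
independent_joint = {(0, 0): 0.12, (0, 1): 0.28, (1, 0): 0.18, (1, 1): 0.42}
dependent_joint = {(0, 0): 0.30, (0, 1): 0.10, (1, 0): 0.10, (1, 1): 0.50}

print(is_independent(independent_joint))
print(is_independent(dependent_joint))
print(conditional_x_given_y(dependent_joint, 0))
```

In the dependent case the conditional distribution of X changes with the value of Y, which is exactly what the independence condition rules out.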

DERIVED DISTRIBUTIONS

Situations often arise where the joint probability distribution of a set of random variables is known and the distribution of some function or transformation of these variables is desired. For example, the joint probability distribution of the flows in two tributaries of a stream may be known whereas the item of interest may be the sum of the flows in the two tributaries. Some commonly used transformations are translation or rotation of axes, logarithmic transformations, nth-root transformations for n equal to 2 and 3, and certain trigonometric transformations. Thomas (1971) presents the developments that lead to the results presented here concerning transformations and derived distributions for continuous random variables. The procedure for discrete random variables is simply an accounting procedure.

Example 2.11. Let X have the distribution function $f_X(x) = c/x$ for X = 2, 3, 4, 5. Let $Y = X^2 - 7X + 12$. The probability distribution and possible values of Y can be determined from the following table.

X      f_X(x)    Y
2      c/2       2
3      c/3       0
4      c/4       0
5      c/5       2

Thus

$$f_Y(y) = \begin{cases} c/3 + c/4 = 35c/60 & \text{for } Y = 0 \\ c/2 + c/5 = 42c/60 & \text{for } Y = 2 \\ 0 & \text{elsewhere} \end{cases}$$

The value for c can be evaluated from either the requirement that

$$\sum_i f_X(x_i) = 1 \quad \text{or} \quad \sum_j f_Y(y_j) = 1$$

In either case, the value of c will be found to be 60/77.
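The accounting procedure of example 2.11 amounts to adding up the probabilities of all X values that map to the same Y. A minimal sketch with exact rational arithmetic follows; the helper name `derived_pmf` is an assumption for illustration.

```python
from collections import defaultdict
from fractions import Fraction


def derived_pmf(pmf, transform):
    """Distribution of Y = transform(X) for a discrete X with the given pmf."""
    out = defaultdict(Fraction)  # missing keys start at Fraction(0)
    for x, p in pmf.items():
        out[transform(x)] += p
    return dict(out)


support = [2, 3, 4, 5]
# c follows from the requirement that f_X(x) = c/x sums to 1 over the support.
c = 1 / sum(Fraction(1, x) for x in support)
f_x = {x: c / x for x in support}
f_y = derived_pmf(f_x, lambda x: x * x - 7 * x + 12)

print("c =", c)
for y in sorted(f_y):
    print(f"f_Y({y}) = {f_y[y]}")
```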


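For a univariate continuous distribution of the random variable X, the distribution of U where

$$U = u(X)$$

is a monotonic function ($u(X)$ is monotonically increasing if $u(x_2) \ge u(x_1)$ for $x_2 > x_1$ and monotonically decreasing if $u(x_2) \le u(x_1)$ for $x_2 > x_1$) can be found from

$$p_U(u) = p_X(x)\left|\frac{dx}{du}\right| \qquad (2.47)$$

Example 2.12. Find the probability of 0 < U < 10 if $U = X^2$ and X is a continuous random variable with

$$p_X(x) = \frac{3x^2}{125} \quad \text{for } 0 < X < 5$$

Solution:

$$p_U(u) = p_X(x)\left|\frac{dx}{du}\right| = \frac{3x^2}{125}\cdot\frac{1}{2x} = \frac{3x}{250}$$

or, since $U = X^2$,

$$p_U(u) = \frac{3\sqrt{u}}{250} \quad \text{for } 0 < U < 25$$

A check to see that $p_U(u)$ is a probability density can be made by integrating $p_U(u)$ from 0 to 25

$$\int_0^{25} \frac{3\sqrt{u}}{250}\,du = \left.\frac{u^{3/2}}{125}\right|_0^{25} = 1$$

Now

$$\text{prob}(0 < U < 10) = \int_0^{10} \frac{3\sqrt{u}}{250}\,du = \frac{10^{3/2}}{125} = 0.253$$

The same result could have been obtained by noting that

$$\text{prob}(0 < U < 10) = \text{prob}(0 < X < \sqrt{10}) = \int_0^{\sqrt{10}} \frac{3x^2}{125}\,dx = \frac{10^{3/2}}{125}$$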
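The result of example 2.12 can be checked numerically. The sketch below assumes the density $p_X(x) = 3x^2/125$ on 0 < X < 5, which is the form consistent with the 0 to 25 range and the $10^{3/2}/125$ result in the example, and uses a simple midpoint rule (an arbitrary choice made here, not the text's method) to integrate the derived density.

```python
import math


def p_x(x: float) -> float:
    """Assumed density of X: 3x^2/125 on 0 < x < 5."""
    return 3.0 * x * x / 125.0 if 0.0 < x < 5.0 else 0.0


def p_u(u: float) -> float:
    """Derived density of U = X^2 from p_U(u) = p_X(x) |dx/du| with x = sqrt(u)."""
    if not 0.0 < u < 25.0:
        return 0.0
    x = math.sqrt(u)
    return p_x(x) * (1.0 / (2.0 * x))


def midpoint_integral(func, a: float, b: float, n: int = 100_000) -> float:
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    h = (b - a) / n
    return h * sum(func(a + (k + 0.5) * h) for k in range(n))


prob_via_u = midpoint_integral(p_u, 0.0, 10.0)            # integrate p_U over 0 < U < 10
prob_via_x = midpoint_integral(p_x, 0.0, math.sqrt(10))   # same event expressed in X
exact = 10.0 ** 1.5 / 125.0

print(f"via p_U: {prob_via_u:.5f}  via p_X: {prob_via_x:.5f}  exact: {exact:.5f}")
```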

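In the case of a continuous bivariate density, the transformation from $p_{X,Y}(x, y)$ to $p_{U,V}(u, v)$ where U = u(X, Y) and V = v(X, Y) are one-to-one continuously differentiable transformations can be made by the relationship

$$p_{U,V}(u, v) = p_{X,Y}(x, y)\,|J| \qquad (2.48)$$

where J is the Jacobian of the transformation computed as the determinant of the matrix of partial derivatives of x and y with respect to u and v

$$J = \begin{vmatrix} \partial x/\partial u & \partial x/\partial v \\ \partial y/\partial u & \partial y/\partial v \end{vmatrix}$$

The limits on U and V must be determined from the individual problem at hand.

Example 2.13. Given that $p_{X,Y}(x, y) = (5 - y/2 - x)/14$ for 0 < X < 2 and 0 < Y < 2. If U = X + Y and V = Y/2, what is the joint probability density function for U and V? What are the proper limits on U and V?

Solution: Since X = U - 2V and Y = 2V,

$$J = \begin{vmatrix} 1 & -2 \\ 0 & 2 \end{vmatrix} = 2$$

$$p_{U,V}(u, v) = \frac{2\,[5 - v - (u - 2v)]}{14} = \frac{5 - u + v}{7}$$

The limits on U and V can be determined by noting that Y = 2V and X = U - 2V. Therefore, the limit of Y = 0 maps to V = 0, Y = 2 maps to V = 1, X = 0 maps to U = 2V and X = 2 maps to U = 2V + 2. These limits are shown in figure 2.16. A check can be made by integrating $p_{U,V}(u, v)$ over the region 0 < V < 1, 2V < U < 2V + 2.

$$\int_0^1\int_{2v}^{2v+2} \frac{5 - u + v}{7}\,du\,dv = \int_0^1 \frac{8 - 2v}{7}\,dv = 1$$

Fig. 2.16. Mapping from X, Y to U, V for example 2.13.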
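The check integral of example 2.13 can also be carried out numerically. This sketch builds $p_{U,V}$ from $p_{X,Y}$ and the Jacobian of the inverse transformation (X = U - 2V, Y = 2V, so |J| = 2) and sums it over a midpoint grid on the mapped region; the grid size and function names are arbitrary choices for illustration.

```python
def p_xy(x: float, y: float) -> float:
    """Joint density of example 2.13 on 0 < x < 2, 0 < y < 2."""
    if 0.0 < x < 2.0 and 0.0 < y < 2.0:
        return (5.0 - y / 2.0 - x) / 14.0
    return 0.0


JACOBIAN = 2.0  # determinant of [[dx/du, dx/dv], [dy/du, dy/dv]] = [[1, -2], [0, 2]]


def p_uv(u: float, v: float) -> float:
    """Joint density of U = X + Y, V = Y/2 via p_UV = p_XY(x, y) * |J|."""
    x, y = u - 2.0 * v, 2.0 * v
    return p_xy(x, y) * abs(JACOBIAN)


def integrate_over_region(n: int = 200) -> float:
    """Midpoint sum of p_UV over 0 < v < 1, 2v < u < 2v + 2."""
    dv, du = 1.0 / n, 2.0 / n
    total = 0.0
    for i in range(n):
        v = (i + 0.5) * dv
        for j in range(n):
            u = 2.0 * v + (j + 0.5) * du
            total += p_uv(u, v) * du * dv
    return total


total = integrate_over_region()
print(f"integral of p_UV over the mapped region: {total:.6f}")
print(f"p_UV(1.0, 0.25) = {p_uv(1.0, 0.25):.6f}")
```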

Therefore $p_{U,V}(u, v)$ over the above defined region is indeed a probability density function.

A special case of a bivariate transformation is when the distribution of U = u(X, Y) is desired. In this case, one method of obtaining $p_U(u)$ is to define a dummy random variable V = v(X, Y). Equation 2.48 is then used to find the joint density of U and V, $p_{U,V}(u, v)$. The univariate density of U is now the marginal distribution of U found by integrating out V. Other special cases of bivariate transformations involve the sums, products and quotients of random variables. If the joint distribution of X and Y is $p_{X,Y}(x, y)$ for X > 0 and Y > 0, the following results are obtained for the distribution of U as a function of X and Y.

Function      pdf

In some cases, the function U = u(X) may be such that it is difficult to analytically determine the distribution of U from the distribution of X. In this case it may be possible to generate a large sample of X's (chapter 13), calculate the corresponding U's and then fit a probability distribution to the U's (chapter 6). It should be noted, however, that this empirical method will not in general satisfy equations 2.47 or 2.48.

MIXED DISTRIBUTIONS

If $p_i(x)$ for i = 1, 2, ..., m represent probability density functions and $\lambda_i$ for i = 1, 2, ..., m represent parameters satisfying $\lambda_i \ge 0$ and $\sum_{i=1}^{m} \lambda_i = 1$, then

$$p_X(x) = \sum_{i=1}^{m} \lambda_i\, p_i(x) \qquad (2.54)$$

is a probability density function known as a mixture or mixed distribution because it is composed of a mixture of $p_i(x)$. The parameter $\lambda_i$ may be thought of as the probability that a random variable is from the probability distribution $p_i(x)$ and $p_i(x)$ is the probability distribution of X given that X is from the ith distribution. The cumulative distribution of X is given by

$$P_X(x) = \sum_{i=1}^{m} \lambda_i\, P_i(x) \qquad (2.55)$$

where $P_i(x)$ is the cumulative distribution corresponding to $p_i(x)$.

Mixed distributions in hydrology may be applicable in situations where more than one distinct cause for an event may exist. For example, flood peaks from convective storms might be described by $p_1(x)$ and from hurricane storms by $p_2(x)$. If $\lambda_1$ is the proportion of flood peaks generated by convective storms and $\lambda_2 = (1 - \lambda_1)$ is the proportion generated by hurricane storms, then equations 2.54 and 2.55 would describe the probability distribution of flood peaks.

Singh (1974), Hawkins (1974), Singh (1987a, 1987b), Hirschboeck (1987), and Diehl and Potter (1987) discuss procedures for applying mixed distributions in the form of equation 2.54 to flood frequency determinations. Two general approaches are used. One is to allow the data and statistical estimation procedures to determine the mixing parameter, $\lambda$, and the parameters of the distributions, $p_i(x)$. The other is to use physical information on the actual events to classify them and thus determine $\lambda$. Once classified, the two sets of data can independently be used to determine the parameters of the pdfs.

Example 2.14. A certain event has probability 0.3 of being from the distribution $p_1(x) = e^{-x}$, x > 0. The event may also be from the distribution $p_2(x) = 2e^{-2x}$, x > 0. What is the probability that a random observation will be less than 1?

Solution:

$$\text{prob}(X < 1) = P_X(1) = 0.3\int_0^1 e^{-x}\,dx + 0.7\int_0^1 2e^{-2x}\,dx = 0.3(1 - e^{-1}) + 0.7(1 - e^{-2}) = 0.795$$
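Equation 2.55 is straightforward to evaluate once the component cumulative distributions are known. The sketch below applies it to the two exponential components of example 2.14, assuming the second component carries the remaining weight of 0.7; function names are illustrative.

```python
import math


def exponential_cdf(x: float, rate: float) -> float:
    """Cumulative distribution of p(x) = rate * exp(-rate * x), x > 0."""
    return 1.0 - math.exp(-rate * x) if x > 0.0 else 0.0


def mixture_cdf(x: float, weights, rates) -> float:
    """P_X(x) = sum_i lambda_i * P_i(x) for a mixture of exponential components."""
    if abs(sum(weights) - 1.0) > 1e-9 or any(w < 0.0 for w in weights):
        raise ValueError("mixing weights must be non-negative and sum to 1")
    return sum(w * exponential_cdf(x, r) for w, r in zip(weights, rates))


# Example 2.14: lambda_1 = 0.3 for p_1(x) = exp(-x), lambda_2 = 0.7 for p_2(x) = 2 exp(-2x).
prob_less_than_one = mixture_cdf(1.0, weights=[0.3, 0.7], rates=[1.0, 2.0])
print(f"prob(X < 1) = {prob_less_than_one:.4f}")
```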


Exercises

2.1. (a) Construct the theoretical relative frequency histogram for the sum of values obtained in tossing two dice. (b) Toss two dice 100 times and tabulate the frequency of occurrence of the sums of the two dice. Plot the results on the histogram of part a. (c) Why do the results of part b not equal the theoretical results of part a? What possible kinds of errors are involved? Which kind of error was the largest in your case?

2.2. Select a set of data consisting of 50 or more observations. Construct a relative frequency plot using at least two different groupings of the data. Which of the two groupings do you prefer? Why?

2.3. In a period of one week, 3 rainy days were observed. If the occurrence of a rainy day is an independent event, how many ways could the sequence consisting of 4 dry and 3 wet days be arranged?

2.4. If the occurrence of a rainy day is an independent event with probability equal to 0.3, what is the probability of (a) exactly 3 rainy days in one week? (b) the next 3 days will be rain? (c) 3 rainy days in a row during any week with the other 4 days dry?

2.5. Consider a coin with the probability of a head equal to p and the probability of a tail equal to q = 1 - p. (a) What is the probability of the sequence HHTHTTH in 7 flips of the coin? (b) What is the probability of a specified sequence resulting in r H's and s T's? (c) How many ways can r H's and s T's be arranged? (d) What is the probability of r H's and s T's without regard to the order of the sequence?

2.6. The distribution given by $f_X(x) = 1/N$ for X = 1, 2, 3, ..., N is known as the discrete uniform distribution. In the following consider N ≥ 5. (a) What is the probability that a random value from $f_X(x)$ will be equal to 5? (b) What is the probability that a random value from $f_X(x)$ will be between 3 and 5 inclusive? (c) What is the probability that in a random sample of 3 values from $f_X(x)$ all will be less than 5? (d) What is the probability that the 3 random values from $f_X(x)$ will all be less than 5 given that 1 of the values is less than 5? (e) If 2 random values are selected from $f_X(x)$, what is the probability that one will be less than 5 and the other greater than 5? (f) For what x from $f_X(x)$ is prob(X ≤ x) = 0.5?

2.7. Consider the continuous probability density function $p_X(x) = a \sin^2 mx$ for 0 < X < π. (a) What must be the value of a and m? (b) What is $P_X(x)$? (c) What is prob(0 < X < π/2)? (d) What is prob(X > a/2 | X < a/4)?

2.8. Consider the continuous probability density function given by $p_X(x) = 0.25$ for 0 < X < a. (a) What is a? (b) What is prob(X > a/2)? (c) What is prob(X > a/2 | X > a/4)? (d) What is prob(X > a/2 | X < a/4)?

2.9. Let $p_X(x) = 0.25$ for 0 < X < a as in exercise 2.8. What is the distribution of Y = ln X? Sketch $p_Y(y)$.

2.10. Many probability distributions can be defined simply by consulting a table of definite integrals. For example $\int_0^{\infty} x^{n-1} e^{-x}\,dx$ is equal to $\Gamma(n)$ where $\Gamma(n)$ is defined as the gamma function (see chapter 6). Therefore one can define $p_X(x) = x^{n-1} e^{-x}/\Gamma(n)$ to be a probability density function for n > 0 and 0 < X < ∞. This distribution is known as the 1-parameter gamma distribution. Using a table of definite integrals, define several possible continuous probability distributions. Give the appropriate range on X and any parameters.

2.11. The annual inflow into a reservoir (acre-feet) follows a probability density given by $p_X(x) = 1/(\beta_1 - \alpha_1)$. The total annual outflow in acre-feet follows a probability distribution given by $p_Y(y) = 1/(\beta_2 - \alpha_2)$. Consider that $\beta_1 > \beta_2$ and $\alpha_1 < \alpha_2$. (a) Calculate the expression for the probability distribution of the annual change in storage. (b) Plot the probability distribution of the annual change in storage. (c) If $\beta_1$ = 100,000, $\alpha_1$ = 20,000, $\beta_2$ = 70,000 and $\alpha_2$ = 50,000, what is the probability that the change in storage will be i) negative and ii) greater than 15,000 acre-feet?

2.12. The probability of receiving more than 1 inch of rain in each month is given in the following table. If a monthly rainfall record selected at random is found to have more than 1 inch of rain, what is the probability the record is for July? April?

Jan  .25    Apr  .40    Jul   .05    Oct  .05
Feb  .30    May  .20    Aug   .05    Nov  .10
Mar  .35    Jun  .10    Sept  .05    Dec  .20

2.13. It is known that the discharge from a certain plant has a probability of 0.001 of containing a fish killing pollutant. An instrument used to monitor the discharge will indicate the presence of the pollutant with probability 0.999 if the pollutant is present and with probability 0.01 if the pollutant is not present. If the instrument indicates the presence of the pollutant, what is the probability that the pollutant is really present?

2.14. A potential purchaser of a ferry across a river knows that if a flow of 100,000 cfs or more occurs, the ferry will be washed down stream, go over a low dam, and be destroyed. He knows that the probability of a flow of this kind in any year is 0.05. He also knows that for each year that the ferry operates a net profit of $10,000 is realized. The purchase price of the ferry is $50,000. Sketch the probability distribution of the potential net profit over a period of years neglecting interest rates and other complications. Assume that if a flow of 100,000 cfs or more occurs in a year, the profit for that year is zero.

2.15. Assume that the probability density function of daily rainfall is given by

(a) Is this a proper probability density function? (b) What is prob(X > 0.5)? (c) What is prob(X > 0.5 | X ≠ 0)?

2.16. Consider the probability density function given by

This is a mixture of two uniform distributions. (a) Sketch $p_X(x)$ for $\lambda_1$ = 0.5. (b) Sketch $p_X(x)$ for $\lambda_1$ = 0.1. (c) Sketch $p_X(x)$ for $\lambda_1$ = 0.333. (d) In a random sample from $p_X(x)$, 60% of the values were between 0 and 2. What would be an estimate for the value of $\lambda_1$?

2.17. Show that equations 2.50 through 2.53 are valid.

3. Properties of Random Variables

In chapter 2 random variables and their probability density functions were discussed in general and somewhat abstract terms. Actually, nearly every hydrologic variable is a random variable. This includes rainfall, streamflow, infiltration rates, evaporation, reservoir storage, and so on. Any process whose outcome is a random variable can be thought of as an experiment. A single outcome from an experiment is a realization of the experiment or an observation from the experiment. Thus, daily rainfall values are observations generated by a set of meteorologic conditions that comprise the experiment.

The terms realization and observation can be used interchangeably; however, an observation is generally taken to be a single value of a random variable and a realization is generally taken as a time series of random variables generated by a random experiment. A 10-year record of daily rainfall might be considered as a single realization of a stochastic process (daily rainfall). A second 10-year record of daily rainfall from the same location would then be a second realization of the process.

In this chapter we will be concerned mainly with observations of random variables and with the collection of possible values that these observations may take on. The complete assemblage of all of the values representative of a particular random process is called a population. Any subset of these values would be a sample from the population. For example, the pages of this book could represent a population while the pages of this chapter are a sample of that population. All of the books in a library might be taken as a population and should this book be found in the library, it would be a sample from the total population.

Generally, one has at hand a sample of observations or data from which inferences about the originating population are to be made, and then possibly inferences about another sample from this population. Streamflow records for the past 50 years on a particular stream would be a sample from which inferences about the behavior of the stream for all time (the population) could be made. This information could also be used to estimate the behavior of the stream during some future period of years (another but yet unrealized sample) so that a structure could be properly designed for the stream. Thus, one might use information gleaned from one sample to make decisions regarding another sample.

Quantities that are descriptive of a population are called parameters. In most situations these parameters must be estimated from samples of data. Sample statistics are estimates for population parameters. Sample statistics are estimated from samples of data and as such are functions of random variables (the sample values) and thus are themselves random variables. The average number of pages in all of the books in a particular library would be a parameter representing the population (the books in the library). This parameter could be estimated by determining the average number of pages in all of the books on a particular shelf in the library (a sample of the population). This estimate of the parameter would be a statistic.

As pointed out in chapter 1, for a decision based on a sample to be valid in terms of the population, the sample statistics must be representative of the population parameters. This in turn requires that the sample itself be representative of the population and that "good" parameter estimation procedures are used. One could not get a "good" estimate of the average number of pages per book in a library by sampling a shelf that contained only fat, engineering handbooks. By the same token, one cannot get "good" estimates for the parameters of a streamflow synthesis model if the estimates are based on a short period of record during which an extreme drought occurred.
One rarely, if ever, has available a population of observations on a hydrologic variable. What is generally available is a sample (of observations) from the population. Thus, population parameters are rarely, if ever, known and must be estimated by sample statistics. By the same token, the true probability density function that generated the available sample of data is not known. Thus, it is necessary to not only estimate population parameters, but it is also necessary to estimate the form of the random process (experiment) that generated the data. This chapter is devoted to a discussion of parameters descriptive of populations and how estimates (statistics) for these parameters can be obtained from samples drawn from populations.

MOMENTS AND EXPECTATION: UNIVARIATE DISTRIBUTIONS

A convenient way of quantifying the location and some measures of the shape of a probability distribution is by computing the moments of the distribution. Referring to figure 3.1, the first moment of the elemental area dA about the origin is given by

$$d\mu_1' = x\,dA$$

and the first moment of the total area about the origin is

$$\mu_1' = \int x\,dA$$

Fig. 3.1. Moment of arbitrary area.

In the case of a random variable and its associated probability density function such as shown in figure 3.2, the first moment about the origin is again given by

$$\mu_1' = \int_{-\infty}^{\infty} x\,dA$$

In this case $dA = p_X(x)\,dx$ so that

$$\mu_1' = \int_{-\infty}^{\infty} x\,p_X(x)\,dx$$

Fig. 3.2. Moment of probability distribution.

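Generalizing the situation, the $i$th moment about the origin is

$$\mu_i' = \int_{-\infty}^{\infty} x^i\,p_X(x)\,dx$$

In the case of a discrete distribution

$$\mu_i' = \sum_j x_j^i\,f_X(x_j)$$

The $i$th central moment is defined as the $i$th moment about the mean, $\mu$, of a distribution and is given by

$$\mu_i = \int_{-\infty}^{\infty} (x-\mu)^i\,p_X(x)\,dx \qquad X \text{ continuous}$$

$$\mu_i = \sum_j (x_j-\mu)^i\,f_X(x_j) \qquad X \text{ discrete}$$

The expected value of the random variable X is defined to be

$$E(X) = \int_{-\infty}^{\infty} x\,p_X(x)\,dx \qquad X \text{ continuous}$$

$$E(X) = \sum_j x_j\,f_X(x_j) \qquad X \text{ discrete}$$

If g(X) is a function of X, then the expected value of g(X) is given by

$$E[g(X)] = \int_{-\infty}^{\infty} g(x)\,p_X(x)\,dx \qquad X \text{ continuous} \qquad (3.8)$$

$$E[g(X)] = \sum_j g(x_j)\,f_X(x_j) \qquad X \text{ discrete} \qquad (3.9)$$

It is apparent that the expected value of $(X-\mu)^i$ is equal to the $i$th central moment

$$E[(X-\mu)^i] = \mu_i$$

and that $E(X) = \mu_1'$ is the first moment about the origin. Some rules for finding expected values, with c a constant, are

$$E(c) = c$$

$$E[c\,g(X)] = c\,E[g(X)]$$

$$E[g_1(X) \pm g_2(X)] = E[g_1(X)] \pm E[g_2(X)]$$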

MEASURES OF CENTRAL TENDENCY

Arithmetic Mean

Generally, the first property of a random variable that is of interest is its mean or average value. The mean, $\mu_X$, of a random variable, X, is its expected value. Thus

$$\mu_X = E(X)$$

A sample estimate of the population mean is the arithmetic average, $\bar{x}$, calculated from

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

where n is the number of observations or items in the sample. The arithmetic mean can be estimated from grouped data by

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{k} n_i x_i$$

where k is the number of groups, n is the number of observations, $n_i$ is the number of observations in the $i$th group and $x_i$ is the class mark of the $i$th group.

Geometric Mean

The sample geometric mean, $\bar{x}_G$, is defined as

$$\bar{x}_G = \left(\prod_{i=1}^{n} x_i\right)^{1/n}$$

where $\prod_{i=1}^{n} x_i = x_1 x_2 x_3 \cdots x_n$. The logarithm of $\bar{x}_G$ is equal to the arithmetic average of the logarithms of the $x_i$'s. The logarithm of the population geometric mean would be the expected value of the logarithm of X.

Median

The sample median, $x_{md}$, is the observation such that half of the values lie on either side of $x_{md}$. The population median, $\mu_{md}$, would be the value satisfying

$$\int_{-\infty}^{\mu_{md}} p_X(x)\,dx = 0.5 \qquad X \text{ continuous} \qquad (3.18)$$

or $\mu_{md} = x_p$, where p is determined from

$$\sum_{i=1}^{p} f_X(x_i) = 0.5 \qquad X \text{ discrete} \qquad (3.19)$$

The median of a sample or a population may not exist.

Mode

The mode is the most frequently occurring value. Thus the population mode, $\mu_{mo}$, would be a value of X maximizing $p_X(x)$ and thus satisfying the equations

$$\frac{dp_X(x)}{dx} = 0 \quad \text{and} \quad \frac{d^2p_X(x)}{dx^2} < 0 \qquad X \text{ continuous}$$

or the value of X associated with

$$\max_i f_X(x_i) \qquad X \text{ discrete} \qquad (3.21)$$

The sample mode, $x_{mo}$, would simply be the most frequently occurring value in the sample. A sample or a population may have none, one, or more than one mode.

Weighted Mean

The calculation of the arithmetic mean of grouped data is an example of calculating a weighted mean where $n_i/n$ is the weighting factor. In general, the weighted mean is

$$\bar{x}_w = \frac{\sum_{i=1}^{k} w_i x_i}{\sum_{i=1}^{k} w_i}$$

where $w_i$ is the weight associated with the $i$th observation or group and k is the number of observations or groups.

MEASURES OF DISPERSION

Range

The two most common measures of dispersion are the range and the variance. The range of a sample is simply the difference between the largest and smallest sample values. The range of a population is many times the interval from $-\infty$ to $\infty$ or from 0 to $\infty$. The sample range is a function of only two of the sample values but does convey some idea of the spread of the data. The population range of many continuous hydrologic variables would be 0 to $\infty$ and would convey little information. The range has the disadvantage of not reflecting the frequency or magnitude of values that deviate either positively or negatively from the mean because only the largest and smallest values are used in its determination. Occasionally, the relative range (the range divided by the mean) is used.

Variance

By far the most common measure of dispersion is the variance, or its positive square root, the standard deviation. The variance of the random variable X is defined as the second moment about the mean and is denoted by $\sigma_X^2$.

$$\sigma_X^2 = \mu_2 = E[(X-\mu)^2] = \int_{-\infty}^{\infty} (x-\mu)^2\,p_X(x)\,dx \qquad (3.23)$$

Thus, the variance is the average squared deviation from the mean. For a discrete population of size n, equation 3.23 becomes

$$\sigma_X^2 = \frac{\sum_{i=1}^{n} (x_i-\mu)^2}{n} \qquad (3.24)$$

The sample estimate of $\sigma_X^2$ is denoted by $s_X^2$ and calculated from

$$s_X^2 = \frac{\sum_{i=1}^{n} (x_i-\bar{x})^2}{n-1} \qquad (3.25)$$

Two basic differences should be noted between equations 3.24 and 3.25. First, in 3.25 $\bar{x}$ is used instead of $\mu$. This is because in dealing with a sample, the population mean would not be known. Second, n - 1 is used as the denominator in determining $s_X^2$ rather than the n used when calculating $\sigma_X^2$. The reason for this is that $\sum_i (x_i-\bar{x})^2/n$ would result in a biased estimate for $\sigma_X^2$. The variance for grouped data can be estimated from

$$s_X^2 = \frac{\sum_{i=1}^{k} n_i (x_i-\bar{x})^2}{n-1}$$

where k is the number of groups, n is the number of observations, $x_i$ is the class mark and $n_i$ the number of observations in the $i$th group. The variance of some functions of the random variable X can be determined from the following relationships, in which a, b, and c are constants:

$$\mathrm{Var}(c) = 0$$

$$\mathrm{Var}(cX) = c^2\,\mathrm{Var}(X)$$

$$\mathrm{Var}(a + bX) = b^2\,\mathrm{Var}(X)$$

The units on the variance are the same as the units on $x^2$. The units on the standard deviation are the same as the units on the random variable. A dimensionless measure of dispersion is the coefficient of variation, defined as the standard deviation divided by the mean. The coefficient of variation is estimated from

$$C_v = \frac{s_X}{\bar{x}}$$

MEASURES OF SYMMETRY

As is apparent from figure 2.13, many distributions are not symmetrical. They may tail off to the right or to the left and as such are said to be skewed. A distribution tailing to the right is said to be positively skewed and one tailing to the left is negatively skewed. The skewness is the third moment about the mean and is given by

$$\text{skewness} = \int_{-\infty}^{\infty} (x-\mu)^3\,p_X(x)\,dx$$

One measure of absolute skewness would be the difference in the mean and the mode. A measure such as this would not be too meaningful, however, because it would depend on the units of measurement. A relative measure of skewness, known as Pearson's first coefficient of skewness, can be obtained by dividing the difference in the mean and the mode by the standard deviation.

$$\text{population measure of skewness} = \frac{\mu - \mu_{mo}}{\sigma} \qquad (3.32)$$

which can be estimated by

$$\text{sample measure of skewness} = \frac{\bar{x} - x_{mo}}{s_X}$$

The mode of moderately skewed distributions can be estimated from (Parl 1967)

$$x_{mo} = \bar{x} - 3(\bar{x} - x_{md})$$

so that

$$\text{sample measure of skewness} = \frac{3(\bar{x} - x_{md})}{s_X} \qquad (3.35)$$

If sample estimates are replaced by population values in equation 3.35, Pearson's second coefficient of skewness results. The most commonly used measure of skewness is the coefficient of skew given by

$$\gamma = \frac{\mu_3}{\sigma^3}$$

An unbiased estimate for the coefficient of skew based on a sample of size n is

$$C_s = \frac{n^2 M_3}{(n-1)(n-2)\,s_X^3}$$

where $M_3$ is the sample estimate for $\mu_3$. The sample coefficient of skew has the advantage of being a function of all of the observations in the sample. Figure 3.3 shows symmetrical, positively and negatively skewed distributions.

Fig. 3.3. Location of mean, median, and mode. (Panels: symmetrical, with mean = mode = median; positive skew; negative skew.)
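The sample measures described so far (mean, variance, coefficient of variation, and coefficient of skew) can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function name `sample_statistics` is hypothetical, and the skew estimator is assumed to be $n^2 M_3 / [(n-1)(n-2)s^3]$ with $M_3$ taken as the mean cubed deviation from the sample mean.

```python
import math


def sample_statistics(x):
    """Return basic sample statistics for a list of observations.

    Assumes at least three observations and a nonzero mean and variance.
    """
    n = len(x)
    mean = sum(x) / n
    # Sample variance with the n - 1 divisor.
    variance = sum((xi - mean) ** 2 for xi in x) / (n - 1)
    std_dev = math.sqrt(variance)
    # Coefficient of variation: standard deviation divided by the mean.
    coef_variation = std_dev / mean
    # M3: assumed here to be the mean cubed deviation (divisor n).
    m3 = sum((xi - mean) ** 3 for xi in x) / n
    # Bias-adjusted coefficient of skew (assumed form, see lead-in).
    coef_skew = n ** 2 * m3 / ((n - 1) * (n - 2) * std_dev ** 3)
    return {
        "mean": mean,
        "variance": variance,
        "std_dev": std_dev,
        "coef_variation": coef_variation,
        "coef_skew": coef_skew,
    }


if __name__ == "__main__":
    symmetric = sample_statistics([1.0, 2.0, 3.0, 4.0, 5.0])
    right_tailed = sample_statistics([1.0, 1.0, 1.0, 10.0])
    print(symmetric)
    print(right_tailed)
```

A symmetric sample gives a coefficient of skew of zero, while a sample with one large value in the right tail gives a positive coefficient, in line with the definitions of positive and negative skew.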

MEASURES OF PEAKEDNESS A fourth property of random variables based on moments is the kurtosis. Kurtosis refers to the extent of peakedness or flatness of a probability distribution in comparison with the normal

[Figure: peakedness of distributions. Curves labeled: Leptokurtic, K > 3, E > 0; Normal, K = 3, E = 0; Platykurtic, K ...]

... $\varepsilon$) can be made as small as desired. For small samples (as are many times used in practice) consistency does not guarantee that a small error will be made. In spite of this, one feels more comfortable knowing that $\hat{\theta}$ would converge to $\theta$ if a larger sample were used. A single estimate of $\theta$ from a small sample is a problem because neither unbiasedness nor consistency gives us much comfort. In choosing between several methods for estimating $\theta$, in addition to being unbiased and consistent it would be desirable if $\mathrm{Var}(\hat{\theta})$ were as small as possible. This would mean that the probability distribution of $\hat{\theta}$ would be more concentrated about $\theta$.

Efficiency

An estimator $\hat{\theta}$ is said to be the most efficient estimator for $\theta$ if it is unbiased and its variance is at least as small as that of any other unbiased estimator for $\theta$. The relative efficiency of $\hat{\theta}_1$ with respect to $\hat{\theta}_2$ for estimating $\theta$ is the ratio of $\mathrm{Var}(\hat{\theta}_2)$ to $\mathrm{Var}(\hat{\theta}_1)$.

Sufficiency

Finally, it is desirable that $\hat{\theta}$ use all of the information contained in the sample relative to $\theta$. If only a fraction of the observations in a sample are used for estimating $\theta$, then some information about $\theta$ is lost. An estimator $\hat{\theta}$ is said to be a sufficient estimator for $\theta$ if $\hat{\theta}$ uses all of the information relevant to $\theta$ that is contained in the sample.

More formal statements of the above four properties of estimators and procedures for determining if an estimator has these properties can be found in books on mathematical statistics (Lindgren 1968; Freund 1962; Mood et al. 1974).

There are many ways of estimating population parameters from samples of data. A few of these are graphical procedures, matching selected points, method of moments, maximum likelihood, and minimum chi-square. The graphical procedure consists of drawing a line through plotted points and then using certain points on the line to calculate the parameters. This procedure is very arbitrary and is dependent upon the individual doing the analysis. Frequently, the method is employed when few observations are available, with the thought that few observations will not produce good parameter estimates anyway. When few points are available is precisely the time when the best methods of parameter estimation should be used.

The method of matching points is not a commonly used method but can produce reasonable first approximations to the parameters. The procedure can be valuable in getting initial estimates for the parameters to be employed in iterative solutions that can arise when the method of moments or maximum likelihood is used.

Example 3.1.
A certain set of data is thought to follow the distribution $p_X(x) = \lambda e^{-\lambda x}$ for $x > 0$. In this particular data set, 75% of the values are less than 3.0. Estimate the parameter $\lambda$.

Solution:

$$p_X(x) = \lambda e^{-\lambda x}$$

$$P_X(x) = \int_0^x \lambda e^{-\lambda t}\,dt = 1 - e^{-\lambda x}$$

$$1 - P_X(x) = e^{-\lambda x}$$

$$\lambda x = -\ln(1 - P_X(x))$$

With $P_X(3.0) = 0.75$ this gives

$$\hat{\lambda} = \frac{-\ln(1 - 0.75)}{3.0} = 0.462$$

Comment: If a sample of size n is available this procedure could be used to obtain n estimates for $\lambda$. These n estimates could then be averaged to obtain $\hat{\lambda}$. If the probability distribution of interest had m parameters, then the value of $P_X(x)$ and x at m points would be used to obtain m equations in the m unknown parameters. The method of matching points is not recommended for general use in getting final parameter estimates. Certainly this method would not use all of the information in the sample. Also, several different estimates for the parameters could be obtained from the same sample depending on which observations were used in the estimation process.
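The matching-points calculation of example 3.1 can be sketched as follows. The function name `exponential_rate_from_point` is hypothetical; the code simply solves $1 - e^{-\lambda x} = P_X(x)$ for $\lambda$ at one chosen point.

```python
import math


def exponential_rate_from_point(x, cum_prob):
    """Solve 1 - exp(-lam * x) = cum_prob for lam (one matching point)."""
    if x <= 0:
        raise ValueError("x must be positive")
    if not 0.0 < cum_prob < 1.0:
        raise ValueError("cum_prob must lie strictly between 0 and 1")
    return -math.log(1.0 - cum_prob) / x


if __name__ == "__main__":
    # Example 3.1: 75% of the values are less than 3.0.
    lam = exponential_rate_from_point(3.0, 0.75)
    print(round(lam, 3))  # 0.462
```

Choosing a different matching point from the same sample would generally give a different estimate, which is the weakness of the method noted in the comment.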

Method of Moments

One of the most commonly used methods for estimating the parameters of a probability distribution is the method of moments. For a distribution with m parameters, the procedure is to equate the first m moments of the distribution to the first m sample moments. This results in m equations which can be solved for the m unknown parameters. Moments about the origin, the mean, or any other point can be used. Generally, for 1-parameter distributions the first moment about the origin, the mean, is used. For 2-parameter distributions the mean and the variance are generally used. If a third parameter is required, the skewness may be used.

Similarly, L-moments may be used in parameter estimation by equating sample estimates of the L-moments to the population expression for the corresponding L-moment depending on the particular pdf being used. Again, for m parameters, m L-moments would be required. This technique will be illustrated in chapter 6 for some particular pdfs.

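Example 3.2. Estimate the parameter $\lambda$ of the distribution $p_X(x) = \lambda e^{-\lambda x}$ for $x > 0$ by the method of moments.

Solution: The first moment about the origin of $p_X(x)$ is

$$\mu_1' = \int_0^{\infty} x\,\lambda e^{-\lambda x}\,dx = \frac{1}{\lambda}$$

Thus, the mean of $p_X(x)$ is $1/\lambda$ so that $\lambda$ can be estimated by $\hat{\lambda} = 1/\bar{x}$.

Example 3.3. Use the method of moments to estimate the parameters of

$$p_X(x) = (2\pi\theta_2^2)^{-1/2} \exp\left[-\frac{(x-\theta_1)^2}{2\theta_2^2}\right]$$

Solution:

$$\mu_X = \int_{-\infty}^{\infty} x\,(2\pi\theta_2^2)^{-1/2} \exp\left[-\frac{(x-\theta_1)^2}{2\theta_2^2}\right] dx$$

Let $y = (x - \theta_1)/\theta_2$ so that $dx = \theta_2\,dy$ and

$$\mu_X = \theta_2 \int_{-\infty}^{\infty} \frac{y}{\sqrt{2\pi}}\,e^{-y^2/2}\,dy + \theta_1 \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\,e^{-y^2/2}\,dy$$

The first integral has an integrand h(y) such that h(-y) = -h(y) and is therefore zero. The second integral can be written as

$$2\theta_1 \int_0^{\infty} \frac{1}{\sqrt{2\pi}}\,e^{-y^2/2}\,dy = \theta_1$$

Therefore $\mu_X = \theta_1$, or the parameter $\theta_1$ of this distribution is equal to the mean of the distribution and can be estimated by

$$\hat{\theta}_1 = \bar{x}$$

The second moment about the mean is equal to the variance.

$$\sigma_X^2 = \int_{-\infty}^{\infty} (x-\theta_1)^2\,(2\pi\theta_2^2)^{-1/2} \exp\left[-\frac{(x-\theta_1)^2}{2\theta_2^2}\right] dx$$

Let $y = (x - \theta_1)/(\sqrt{2}\,\theta_2)$ so that $dx = \sqrt{2}\,\theta_2\,dy$ and

$$\sigma_X^2 = \frac{2\theta_2^2}{\sqrt{\pi}} \int_{-\infty}^{\infty} y^2 e^{-y^2}\,dy = \theta_2^2$$

Thus, the parameter $\theta_2^2$ is equal to the variance and can be estimated by $s_X^2$ (the sample variance).

$$\hat{\theta}_2^2 = s_X^2$$

Substituting the parameter estimates in terms of their population values into the expression for $p_X(x)$, the result is

$$p_X(x) = (2\pi\sigma_X^2)^{-1/2} \exp\left[-\frac{(x-\mu_X)^2}{2\sigma_X^2}\right]$$

which is the normal distribution.

Maximum Likelihood

Assume we have in hand n random observations $x_1, x_2, \ldots, x_n$. Their joint probability distribution is $p_X(x_1, x_2, \ldots, x_n; \theta_1, \theta_2, \ldots, \theta_m)$. Because for a random sample the $x_i$'s are independent, their joint distribution can be written

$$p_X(x_1; \theta_1, \ldots, \theta_m)\,p_X(x_2; \theta_1, \ldots, \theta_m) \cdots p_X(x_n; \theta_1, \ldots, \theta_m) = \prod_{i=1}^{n} p_X(x_i; \theta_1, \theta_2, \ldots, \theta_m)$$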

Now, this latter expression is proportional to the probability that the particular random sample would be obtained from the population and is known as the likelihood function.

$$L(\theta_1, \theta_2, \ldots, \theta_m) = \prod_{i=1}^{n} p_X(x_i; \theta_1, \theta_2, \ldots, \theta_m)$$

The m parameters are unknown. The values of these m parameters that maximize the likelihood that the particular sample in hand is the one that would be obtained if n random observations were selected from $p_X(x; \theta_1, \theta_2, \ldots, \theta_m)$ are known as the maximum likelihood estimators. The parameter estimation procedure becomes one of finding the values of $\theta_1, \theta_2, \ldots, \theta_m$ that maximize the likelihood function. This can be done by taking the partial derivative of $L(\theta_1, \theta_2, \ldots, \theta_m)$ with respect to each of the $\theta_i$'s and setting the resulting expressions equal to zero. These m equations in m unknowns are then solved for the m unknown parameters.

Because many probability distributions involve the exponential function, it is many times easier to maximize the natural logarithm of the likelihood function. The logarithmic function is monotonic; thus the values of the $\theta$'s that maximize the logarithm of the likelihood function also maximize the likelihood function.

Example 3.4. Find the maximum likelihood estimator for the parameter $\lambda$ of the distribution $p_X(x) = \lambda e^{-\lambda x}$ for $x > 0$.

Solution:

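$$L(\lambda) = \prod_{i=1}^{n} \lambda e^{-\lambda x_i} = \lambda^n \exp\left(-\lambda \sum_{i=1}^{n} x_i\right)$$

$$\ln L(\lambda) = n \ln \lambda - \lambda \sum_{i=1}^{n} x_i$$

$$\frac{d \ln L(\lambda)}{d\lambda} = \frac{n}{\lambda} - \sum_{i=1}^{n} x_i = 0$$

$$\hat{\lambda} = \frac{n}{\sum_{i=1}^{n} x_i} = \frac{1}{\bar{x}}$$

Note that this is the same estimate as obtained in example 3.2 using the method of moments. The two methods do not always produce the same estimates.

Example 3.5. Find the maximum likelihood estimators for the parameters $\theta_1$ and $\theta_2^2$ of the distribution

$$p_X(x) = (2\pi\theta_2^2)^{-1/2} \exp\left[-\frac{(x-\theta_1)^2}{2\theta_2^2}\right]$$

Solution (all summations from 1 to n):

$$L(\theta_1, \theta_2^2) = \prod (2\pi\theta_2^2)^{-1/2} \exp\left[-\frac{(x_i-\theta_1)^2}{2\theta_2^2}\right]$$

$$\ln L = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\theta_2^2 - \frac{1}{2\theta_2^2}\sum (x_i-\theta_1)^2$$

$$\frac{\partial \ln L}{\partial \theta_1} = \frac{1}{\theta_2^2}\sum (x_i-\theta_1) = 0$$

Therefore $\sum (x_i - \theta_1) = 0$ and

$$\hat{\theta}_1 = \frac{\sum x_i}{n} = \bar{x}$$

$$\frac{\partial \ln L}{\partial \theta_2^2} = -\frac{n}{2\theta_2^2} + \frac{1}{2\theta_2^4}\sum (x_i-\theta_1)^2 = 0$$

$$\hat{\theta}_2^2 = \frac{\sum (x_i-\bar{x})^2}{n}$$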

Example 3.5 shows that the maximum likelihood estimators are not always unbiased. It can be shown, however, that the maximum likelihood estimators are asymptotically (as $n \to \infty$) unbiased. Maximum likelihood estimators are sufficient and consistent. If an efficient estimator exists, maximum likelihood estimators, adjusted for bias, will be efficient. In addition to these four properties, maximum likelihood estimators are said to be invariant; that is, if $\hat{\theta}$ is a maximum likelihood estimator of $\theta$ and the function $h(\theta)$ is continuous, then $h(\hat{\theta})$ is a maximum likelihood estimator of $h(\theta)$.

The method of moments and the method of maximum likelihood do not always produce the same estimates for the parameters. In view of the properties of the maximum likelihood estimators, this method is generally preferred over the method of moments. Cases arise, however, where one can get maximum likelihood estimators only by iterative numerical solutions (if at all), thus leaving room for the use of more readily obtainable estimates, possibly by the method of moments. The accuracy of the method of moments is severely affected if the data contain errors in the tails of the distribution where the moment arms are long (Chow 1954). This is especially troublesome with highly skewed distributions. Finally, it should be kept in mind that the properties of maximum likelihood estimators are asymptotic properties (for large n) and there may well exist better estimation procedures for small samples for particular distributions.
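To make the comparison between the two estimation methods concrete, the short sketch below computes the exponential rate estimate (identical under both methods) and the moment and maximum likelihood estimates for the two-parameter normal density. The function names are hypothetical and the data are made up purely for illustration.

```python
def exponential_rate_estimate(x):
    """Moment and maximum likelihood estimate of the exponential rate: 1 / mean."""
    return len(x) / sum(x)


def normal_moment_estimates(x):
    """Moment estimates: sample mean and sample variance (divisor n - 1)."""
    n = len(x)
    mean = sum(x) / n
    variance = sum((xi - mean) ** 2 for xi in x) / (n - 1)
    return mean, variance


def normal_ml_estimates(x):
    """Maximum likelihood estimates: sample mean and variance with divisor n."""
    n = len(x)
    mean = sum(x) / n
    variance = sum((xi - mean) ** 2 for xi in x) / n
    return mean, variance


if __name__ == "__main__":
    data = [2.0, 4.0, 6.0]  # illustrative values only
    print(exponential_rate_estimate(data))
    print(normal_moment_estimates(data))
    print(normal_ml_estimates(data))
```

The two variance estimates differ by the factor (n - 1)/n, which is the bias in the maximum likelihood estimator; the difference vanishes as n grows.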

CHEBYSHEV INEQUALITY

Certain general statements about random variables can be made without placing restrictions on their distributions. More precise probabilistic statements require more restrictions on the distribution of the random variables. Exact probabilistic statements require complete knowledge of the probability distribution of the random variable. One general result that applies to random variables is known as the Chebyshev inequality. This inequality states that a single observation selected at random from any probability distribution will deviate more than $k\sigma$ from the mean, $\mu$, of the distribution with probability less than or equal to $1/k^2$.

$$\mathrm{prob}(|X - \mu| > k\sigma) \le \frac{1}{k^2} \qquad (3.77)$$

For most situations this is a very conservative statement. The Chebyshev inequality produces an upper bound on the probability of a deviation of a given magnitude from the mean.

Example 3.6. The data of table 2.1 have a mean of 66,540 cfs and a standard deviation of 22,322 cfs. Without making any distributional assumptions regarding the data, what can be said of the probability that the peak flow in a year selected at random will deviate more than 40,000 cfs from the mean?

Solution: Applying Chebyshev's inequality we have $k\sigma$ = 40,000 cfs. Using 22,322 cfs as an estimate for $\sigma$ we obtain k = 1.79.

$$\mathrm{prob}(|X - \mu| > 40{,}000) \le \frac{1}{k^2} = 0.311$$

The probability that the peak flow in any year will deviate more than 40,000 cfs from the mean is thus less than or equal to 0.311.

Comment: One can see that this is a very conservative figure by noting that only 6 values out of 99 (6/99 = 0.061) lie outside the interval 66,540 ± 40,000. By not making any distributional assumptions, we are forced to accept very conservative probability estimates. In later chapters we will again look at this problem making use of selected probability distributions.

LAW OF LARGE NUMBERS

Chebyshev's inequality is sometimes written in terms of the mean $\bar{x}$ of a random sample of size n. In such a case equation 3.77 becomes

$$\mathrm{prob}\left(|\bar{x} - \mu_X| > \frac{k\sigma_X}{\sqrt{n}}\right) \le \frac{1}{k^2}$$

If we now let $\delta = 1/k^2$ and choose n so that $n \ge \sigma_X^2/(\varepsilon^2\delta)$, we have the (weak) Law of Large Numbers (Mood and Graybill 1963), which states:

Let $p_X(x)$ be a probability density function with mean $\mu_X$ and finite variance $\sigma_X^2$. Let $\bar{x}_n$ be the mean of a random sample of size n from $p_X(x)$. Let $\varepsilon$ and $\delta$ be any two specified small numbers such that $\varepsilon > 0$, $0 < \delta < 1$. Then for n any integer greater than $\sigma_X^2/(\varepsilon^2\delta)$,

$$\mathrm{prob}(|\bar{x}_n - \mu_X| \ge \varepsilon) \le \delta \qquad (3.79)$$

This statement assures us that we can estimate the population mean with whatever accuracy we desire by selecting a large enough sample. The actual application of equation 3.79 requires knowledge of population parameters and is thus of limited usefulness.

Example 3.7. Assume that the standard deviation of peak flows on the Kentucky River near Salvisa, Kentucky, is 22,322 cfs. How many observations would be required to be at least 95% sure that the estimated mean peak flow was within 10,000 cfs of its true value if we know nothing of the distribution of peak flows?

Solution: Applying equation 3.79 with $\varepsilon$ = 10,000 cfs and $\delta$ = 0.05 we have

$$n > \frac{\sigma_X^2}{\varepsilon^2\delta} = \frac{(22{,}322)^2}{(10{,}000)^2(0.05)} = 99.7$$

We must have at least 100 observations to be 95% sure that the sample mean is within 10,000 cfs of the population mean if we know nothing of the population distribution except its standard deviation. This happens to be very close to the number of observations in the sample (99).

Comment: We will look at this problem again later making certain distributional assumptions.
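The Chebyshev bound of example 3.6 and the sample-size calculation of example 3.7 can be checked with a short script. The function names are hypothetical; the sample-size function assumes the bound $n > \sigma^2/(\varepsilon^2\delta)$ that follows from applying Chebyshev's inequality to the sample mean.

```python
import math


def chebyshev_bound(deviation, sigma):
    """Upper bound on prob(|X - mu| > deviation), using k = deviation / sigma."""
    k = deviation / sigma
    return min(1.0, 1.0 / k ** 2)


def sample_size_weak_lln(sigma, eps, delta):
    """Smallest integer n strictly greater than sigma^2 / (eps^2 * delta)."""
    threshold = sigma ** 2 / (eps ** 2 * delta)
    return math.floor(threshold) + 1


if __name__ == "__main__":
    # Example 3.6: deviation of 40,000 cfs with sigma estimated as 22,322 cfs.
    print(round(chebyshev_bound(40_000.0, 22_322.0), 3))
    # Example 3.7: within 10,000 cfs of the mean with 95% assurance.
    print(sample_size_weak_lln(22_322.0, 10_000.0, 0.05))
```

Because neither function uses any distributional information, both results are conservative in the sense discussed in the comment to example 3.6.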

Exercises

3.1. What is the expected mean and variance of the sum of values obtained by tossing two dice? What is the coefficient of skew and kurtosis?

3.2. Modular coefficients defined as $K_t = x_t/\bar{x}$ are occasionally used in hydrology. What is the mean, variance, and coefficient of variation of modular coefficients in terms of the original data?

3.3. What effect does the addition of a constant to each observation from a random sample have on the mean, variance, and coefficient of variation?

3.4. What effect does multiplying each observation in a random sample by a constant have on the mean, variance, and coefficient of variation?

3.5. Without any knowledge of the probability distribution of peak flows on the Kentucky River (table 2.1), what can be said about the probability that $|\bar{Q} - \mu_Q|$ is greater than 10,000 cfs?

3.6. Without any knowledge of the probability distribution of peak flows on the Kentucky River (table 2.1), what can be said about the probability that a single random observation will deviate more than 10,000 cfs from $\mu_Q$?

3.7. Using the data of exercise 2.2 calculate the mean and variance from the grouped data. How do the grouped data mean and variance compare to the ungrouped mean and variance? Which estimate do you prefer?

3.8. Calculate the covariance between the peak discharge Q in thousands of cfs and the area A in thousands of square miles for the following data.

3.9. Calculate the correlation coefficient between Q and A for the data in exercise 3.8.

3.10. Calculate the coefficient of skew for Q in exercise 3.8. Note that this estimate is relatively unreliable because of the small sample.

3.11. Calculate the kurtosis and the coefficient of excess for Q in exercise 3.8. Note that these estimates are unreliable because of the small sample size.

3.12. Complete the steps necessary to arrive at equation 3.56 from 3.55.

3.13. Show that $\sigma_X \sigma_Y \ge |\sigma_{XY}|$.

3.14. A convenient relationship for calculating the estimated variance of a sample of data is

$$s_X^2 = \frac{\sum x_i^2 - n\bar{x}^2}{n-1} = \frac{\sum x_i^2 - (\sum x_i)^2/n}{n-1}$$

Derive this relationship from equation 3.25.

3.15. The estimated covariance between X and Y of a bivariate random sample can be calculated from

Derive this expression from equations 3.49. Note that this estimated covariance is biased. In practice, the final divisor of n is replaced by n - 1 to correct for bias.

3.16. In exercise 2.14, if the future maximum life of the ferry is 15 years, what is the expected net profit? Neglect the interest or discount rate.

3.17. What are the maximum likelihood estimates for the parameters of the two parameter exponential distribution? This distribution is given by

3.18. What are the moment estimates for the parameters of the exponential distribution given in exercise 3.17?

3.19. For the following data, what are the moment and maximum likelihood estimates for the parameters of the distribution given in exercise 3.17? x = 15.0, 10.5, 11.0, 12.0, 18.0, 10.5, 19.5.

3.20. Calculate the coefficient of skew for the Kentucky River data of table 2.1.

3.21. Calculate the kurtosis of the Kentucky River data of table 2.1.

3.22. Using the data of exercise 2.2, calculate the coefficient of skew from the grouped data.

3.23. Using the data of exercise 2.2, calculate the kurtosis from the grouped data.

3.24. What are the maximum likelihood estimates for $\alpha$ and $\beta$ in the distribution

3.25. What are the mean and variance of $f_X(x) = 1/N$ for x = 1, 2, ..., N?

3.26. What are the mean and variance of $p_X(x) = a \sin^2 x$ for $0 < x < \pi$?

3.27. Use the method of moments to estimate a in $p_X(x) = a \sin^2 x$ for $0 < x < \pi$ based on the random sample given by x = 0.5, 2.0, 3.0, 2.5, 1.5, 1.8, 1.0, 0.8, 2.5, 2.2.

3.28. The $r$th moment about $x_0$ can be written as $E(X - x_0)^r$. Show that the variance is the smallest possible second moment.

4. Some Discrete Probability Distributions and Their Applications

THUS FAR, probability distributions have been considered in general terms. This chapter is devoted to some particular discrete distributions and their applications. The following two chapters are devoted to selected continuous distributions. These chapters are by no means exhaustive treatments of probability distributions; only some of the more common distributions are considered.

HYPERGEOMETRIC DISTRIBUTION

Drawing a random sample of size n (without replacement) from a finite population of size N, with the elements of the population divided into two groups with k elements belonging to one group, is an example of sampling from a hypergeometric distribution. The two groups may be defective or nondefective objects, rainy or nonrainy days, success or failure of a project, and so forth. For discussion purposes we will consider that an element (or outcome) from the population is either a success or a failure.

The probability of x successes in a sample of size n selected from a population of size N containing k successes can be determined by applying equation 2.1. The total number of possible outcomes or ways of selecting a sample of size n from N objects is $\binom{N}{n}$. The number of ways of selecting x successes and n - x failures from the population containing k successes and N - k failures is $\binom{k}{x}\binom{N-k}{n-x}$. Thus the probability is

$$f_X(x; N, n, k) = \frac{\binom{k}{x}\binom{N-k}{n-x}}{\binom{N}{n}} \qquad (4.1)$$

The distribution given by equation 4.1 is known as the hypergeometric distribution, where $f_X(x; N, n, k)$ is the probability of obtaining x successes in a sample of size n drawn from a population of size N containing k successes. The cumulative hypergeometric distribution giving the probability of x or fewer successes is

$$F_X(x; N, n, k) = \sum_{i=0}^{x} \frac{\binom{k}{i}\binom{N-k}{n-i}}{\binom{N}{n}}$$

There are certain natural restrictions on this distribution. For example: x cannot exceed k, x cannot exceed n, k cannot exceed N, and n cannot exceed N. N, n, k, and x are all nonnegative integers. Furthermore, the outcomes must be random and equally likely. The mean of the hypergeometric distribution is

$$E(X) = \frac{nk}{N}$$

and the variance is

$$\mathrm{Var}(X) = \frac{nk(N-k)(N-n)}{N^2(N-1)}$$

Example 4.1. The hypergeometric applies in example 2.5. In this example, a success is selecting a bad record and N = 10, k = 3, n = 4. The solutions can be written in terms of the hypergeometric as

$$\text{(a)} \quad f_X(1; 10, 4, 3) = \frac{\binom{3}{1}\binom{7}{3}}{\binom{10}{4}} = \frac{(3)(35)}{210} = 0.500$$

Example 4.2. Assume that during a certain September, 10 rainy days occurred. Also assume that at this particular location the occurrence of rain on any day is independent of whether or not it rained on any previous day. (This is often not a good assumption.) A sample of 10 September days is selected at random. (a) What is the probability that 4 of these days will have been rainy? (b) What is the probability that less than 4 of these days were rainy?

Solution: Use the hypergeometric distribution with N = 30, n = 10, and k = 10.

$$\text{(a)} \quad f_X(4; 30, 10, 10) = \frac{\binom{10}{4}\binom{20}{6}}{\binom{30}{10}} = 0.271$$

$$\text{(b)} \quad F_X(3; 30, 10, 10) = \sum_{x=0}^{3} \frac{\binom{10}{x}\binom{20}{10-x}}{\binom{30}{10}} = 0.560$$

Example 4.3. Examples of the hypergeometric distribution commonly found in statistics books include card sampling problems (What is the probability of exactly 2 aces in a 5-card hand selected at random from a 52-card deck?) and acceptance sampling problems (What is the probability of selecting 5 defective items from a lot of 50 items if 20 items are selected and the lot actually contains 12 defectives?).

Solution:

Card problem

$$f_X(2; 52, 5, 4) = \frac{\binom{4}{2}\binom{48}{3}}{\binom{52}{5}} = 0.0399$$

Acceptance sampling problem

$$f_X(5; 50, 20, 12) = \frac{\binom{12}{5}\binom{38}{15}}{\binom{50}{20}} = 0.260$$
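The hypergeometric probabilities in examples 4.1 to 4.3 can be evaluated directly from the counting argument. The sketch below is one way to do so; the function names `hypergeom_pmf` and `hypergeom_cdf` are hypothetical, and `math.comb` requires Python 3.8 or later.

```python
from math import comb


def hypergeom_pmf(x, N, n, k):
    """Probability of x successes in a sample of n drawn without replacement
    from a population of N items that contains k successes."""
    if x < 0 or x > n:
        return 0.0
    # comb() returns 0 when more items are requested than are available,
    # so impossible combinations contribute zero probability.
    return comb(k, x) * comb(N - k, n - x) / comb(N, n)


def hypergeom_cdf(x, N, n, k):
    """Probability of x or fewer successes."""
    return sum(hypergeom_pmf(i, N, n, k) for i in range(x + 1))


if __name__ == "__main__":
    print(hypergeom_pmf(1, 10, 4, 3))              # example 4.1(a)
    print(round(hypergeom_cdf(3, 30, 10, 10), 3))  # example 4.2(b)
    print(round(hypergeom_pmf(2, 52, 5, 4), 4))    # card problem
    print(round(hypergeom_pmf(5, 50, 20, 12), 3))  # acceptance sampling
```

Integer arithmetic is used for the combinations, so the only rounding occurs in the final division.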

BERNOULLI PROCESSES

Binomial Distribution

Consider a discrete time scale. At each point on this time scale an event may either occur or not occur. Let the probability of the event occurring be p for every point on the time scale; thus, the occurrence of the event at any point on the time scale is independent of the history of any prior occurrences or nonoccurrences. The probability of an occurrence at the $i$th point on the time scale is p for i = 1, 2, ... A process having these properties is said to be a Bernoulli process.

An example of a Bernoulli process might be the occurrence of rainy days. The time scale has units of days. On any particular day, rainfall may or may not occur. If the occurrence of rainfall on any given day is independent of the past history of rainfall occurrences, the sequence of rainy and dry days can be considered a Bernoulli process.

As an example of another Bernoulli process, consider that during any year the probability of the maximum flow exceeding 10,000 cfs on a particular stream is p. Common terminology for a flow exceeding a given value is an exceedance. Further consider that the peak flow in any year is independent from year to year (a necessary condition for the process to be a Bernoulli process). Let q = 1 - p be the probability of not exceeding 10,000 cfs. We can neglect the probability of a peak of exactly 10,000 cfs since the peak flow rates would be a continuous process. In this example the time scale is discrete with the points being nominally 1 year in time apart.

We can now make certain probabilistic statements about the occurrence of a peak flow in excess of 10,000 cfs (an exceedance). For example, the probability of an exceedance occurring in year 3 and not in years 1 or 2 can be evaluated from equation 2.9 as qqp since the process is independent from year to year. The probability of (exactly) one exceedance in any 3-year period is pqq + qpq + qqp since the exceedance could occur in either the first, second, or third year. Thus, the probability of (exactly) one exceedance in three years is $3pq^2$.

In a similar manner, the probability of 2 exceedances in 5 years can be found from the summation of the terms ppqqq, pqpqq, pqqpq, ..., qqqpp. It can be seen that each of these terms is equivalent to $p^2q^3$ and that the number of terms is equal to the number of ways of arranging 2 items (the p's) among 5 items (the p's and q's). Therefore, the total number of terms is $\binom{5}{2}$, or 10, so that the probability of exactly 2 exceedances in 5 years is $10p^2q^3$. This result can be generalized so that the probability of x exceedances in n years is $\binom{n}{x}p^xq^{n-x}$. The result is applicable to any Bernoulli process so that the probability of x occurrences of an event in n independent trials if p is the probability of an occurrence in a single trial is given by

$$f_X(x; n, p) = \binom{n}{x} p^x q^{n-x} \qquad x = 0, 1, 2, \ldots, n$$

This equation is known as the binomial distribution. The binomial distribution and the Bernoulli process are not limited to a time scale. Any process that may occur with probability p at discrete points in time or space or in individual trials may be a Bernoulli process and follow the binomial distribution.

The cumulative binomial distribution is

$$F_X(x; n, p) = \sum_{i=0}^{x} \binom{n}{i} p^i q^{n-i}$$

and gives the probability of x or fewer occurrences of an event in n independent trials if the probability of an occurrence in any trial is p. Continuing the above example, the probability of less than 3 exceedances in 5 years is

$$F_X(2; 5, p) = \sum_{i=0}^{2} \binom{5}{i} p^i q^{5-i}$$

The mean, variance, and coefficient of skew of the binomial distribution are

$$E(X) = np$$

$$\mathrm{Var}(X) = npq \qquad (4.8)$$

$$\gamma = \frac{q - p}{\sqrt{npq}}$$

The distribution is symmetrical for p = q, skewed to the right for q > p and skewed to the left for q < p. Because the probability of a success on any trial is independent of past history, the origin of the time scale of a Bernoulli process can be taken at any time point. Thus the probability of any combination of successes or failures is the same for any sequence of n points regardless of their location with respect to the origin.

Example 4.4. On the average, how many times will a 10-year flood occur in a 40-year period? What is the probability that exactly this number of 10-year floods will occur in a 40-year period?

Solution: A 10-year flood has p = 1/10 = 0.1.

$$E(X) = np = 40(0.1) = 4$$

$$f_X(4; 40, 0.1) = \binom{40}{4} (0.1)^4 (0.9)^{36} = 0.2059$$

Comment: This problem illustrates the difficulty of explaining the concept of return period. On the average a 10-year event occurs once every 10 years and in a 40-year period is expected to occur 4 times. Yet in about 80% (100[1 - 0.2059]) of all possible independent 40-year periods, the 10-year event will not occur exactly 4 times. As a matter of fact, the probability that it will occur 3 times is nearly identical to the probability it will occur 4 times (0.2003 vs. 0.2059). The number of occurrences, X, is truly a random variable (with a binomial distribution).
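The figures quoted in example 4.4 and its comment can be reproduced with a few lines of Python. The function names `binomial_pmf` and `binomial_cdf` are hypothetical; the formulas are the binomial probability and its cumulative sum as described in the text.

```python
from math import comb


def binomial_pmf(x, n, p):
    """Probability of exactly x occurrences in n independent trials."""
    return comb(n, x) * p ** x * (1.0 - p) ** (n - x)


def binomial_cdf(x, n, p):
    """Probability of x or fewer occurrences in n independent trials."""
    return sum(binomial_pmf(i, n, p) for i in range(x + 1))


if __name__ == "__main__":
    n_years, p_exceed = 40, 0.1
    print(n_years * p_exceed)                           # expected number of 10-year floods
    print(round(binomial_pmf(4, n_years, p_exceed), 4))  # exactly 4 occurrences
    print(round(binomial_pmf(3, n_years, p_exceed), 4))  # exactly 3 occurrences
```

Printing the whole distribution, for x from 0 to 40, shows how widely the number of occurrences can vary around its expected value of 4.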

The binomial distribution has an additive property (Gibra 1973). That is, if X has a binomial distribution with parameters $n_1$ and p and Y has a binomial distribution with parameters $n_2$ and p, then Z = X + Y has a binomial distribution with parameters $n = n_1 + n_2$ and p. A useful property of the binomial distribution is that

The binomial distribution can be used to approximate the hypergeometric distribution if the sample selected is small in comparison to the number of items N from which the sample is drawn. In this case, the probability of a success would be about the same for each trial, and sampling without replacement (hypergeometric) would be very similar to sampling with replacement (binomial).

Example 4.5. Compare the hypergeometric and binomial for N = 40, n = 5, k = 10 and X = 0, 1, 2, 3, 4, 5. Solution:

X    Hypergeometric f_X(x; N, n, k) = f_X(x; 40, 5, 10)    Binomial f_X(x; n, p) = f_X(x; 5, 10/40)
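The two columns compared in example 4.5 can be computed directly. The following is a minimal sketch; the function names are illustrative choices, and the hypergeometric form assumes the book's parameter order (N items in the population, n items sampled, k successes in the population).

```python
from math import comb


def hypergeometric_pmf(x, N, n, k):
    """P(X = x) when n items are drawn without replacement from N items,
    k of which are successes (assumed parameter order: N, n, k)."""
    return comb(k, x) * comb(N - k, n - x) / comb(N, n)


def binomial_pmf(x, n, p):
    """P(X = x) successes in n independent trials with success probability p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


N, n, k = 40, 5, 10
table = [
    (x, hypergeometric_pmf(x, N, n, k), binomial_pmf(x, n, k / N))
    for x in range(n + 1)
]

print(" x  hypergeometric  binomial")
for x, hyper, binom in table:
    print(f"{x:2d}  {hyper:14.4f}  {binom:8.4f}")
```

Increasing N while holding k/N fixed should bring the two columns closer together, which is the point made in the comment on this example.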

Comment: This merely indicates that drawing a small sample without replacement from a large population and drawing the same sample with replacement (so probabilities in each trial are constant) are nearly equivalent.

Example 4.6. The operator of a boat dock has decided to put in a new facility along a certain river. In an economic analysis of the situation it was decided to have the facility designed to withstand floods up to 75,000 cfs. Furthermore, it was determined that if one flood greater than this occurs in a 5-year period, repairs can be made and the operator will still break even on its operation during the 5-year period. If more than one flow in excess of 75,000 cfs occurs, money will be lost. If the probability of exceeding 75,000 cfs in any year is 0.15, what is the probability the operator will make money? Solution: Money will be made if no floods exceeding 75,000 cfs occur during the 5-year period. Let X be the number of floods. From the binomial distribution

$$f_X(0; 5, 0.15) = \binom{5}{0}(0.15)^0(0.85)^5 = 0.4437$$


Comment: The probability that the operator will make the investment, work for 5 years, and just break even is very high:

$$f_X(1; 5, 0.15) = \binom{5}{1}(0.15)^1(0.85)^4 = 0.3915$$

Thus, even though the risk or probability of losing money is low (1 - 0.3915 - 0.4437 = 0.1648), the investment may not be an attractive one.

Whenever a decision is made based on uncertain information or relative to a system subject to random inputs or behavior, there is a chance that the decision will result in an adverse outcome. A bridge that may be underdesigned, a water supply reservoir that may be too small, and an investment that may fail are examples of decisions made in the face of uncertainty. These decisions are said to be risky decisions, with risk defined as the probability of an adverse outcome. Generally, all decisions dependent on hydrologic data and hydrologic analysis are risky in this sense. A risky decision is not a bad decision. Risk must be balanced against costs and available alternatives. For informed decisions to be made under uncertainty, quantitative estimates of the resulting risk are desirable. Risk and uncertainty are treated in detail in chapter 17.

Example 4.7. In order to be 90% sure that a design storm is not exceeded in a 10-year period, what should be the return period of the design storm? Solution: Let p be the probability of the design storm being exceeded in any year. Based on the binomial distribution, the probability of no exceedances is given by

$$f_X(0; 10, p) = \binom{10}{0}p^0(1 - p)^{10} = 0.90$$

$$p = 1 - (0.90)^{1/10} = 0.0105$$

$$T = \frac{1}{p} = 95 \text{ years}$$

Comment: To be 90% sure that a design storm is not exceeded in a 10-year period, a 95-year return period storm must be used. If a 10-year return period storm is used, the chances of it being exceeded are

$$1 - f_X(0; 10, 0.1) = 1 - (0.9)^{10} = 0.6513$$

In general, the chance of at least one occurrence of a T-year event in T years is

$$1 - f_X(0; T, 1/T) = 1 - (1 - 1/T)^T$$

It can be shown that as T gets large, this expression approaches 1 - 1/e or 0.632. For T = 5, 10, and 25, the probability is 0.67, 0.65, and 0.64, respectively. Thus, if the design life of a structure and its design return period are the same, the chances are very great that the capacity of the structure will be exceeded during its design life. The risk associated with a return period T over n years is risk = 1 - (1 - 1/T)^n.

The procedure outlined in example 4.7 can be used to determine a design return period when the allowable risk is stated. Note that the design return period must be much greater than the life of the project to be reasonably sure that an exceedance will not occur. No matter what design return period is selected, there is still a chance that an exceedance will occur. Some may argue that there is an upper limit to the magnitude of natural events, such as flood peaks. They would argue that a peak of 100,000 cfs from a 1-acre watershed would be impossible. In practice the probability that would be assigned to an event of this sort is so small that it can be neglected for most practical purposes.

Figure 4.1 shows the design return period that must be used to be a certain percent confident that the design will not be exceeded during the design life of the project. The parameters on the curves are the percent chance of no exceedance during the design life. For example, to be 90% sure that a design condition will not be exceeded during a project whose design life is 100 years, the project would have to be designed on the basis of a 900-year event. Figure 4.1 is derived from calculations like those contained in example 4.7. Figure 4.1 can also be used to evaluate the risk or percent chance of an event in excess of the design event during the design life. For example, if a project is designed on the basis of a 50-year event and the design life of the project is 10 years, the designer is taking a 19% chance (100 - 81) that the design condition will be exceeded.

Fig. 4.1. Design return period required as a function of design life to be a given percent confident (curve parameter) that the design condition is not exceeded.

0.4. What is the probability of 3 successes in the next 5 trials?

Solution:

Comment: What has occurred prior to the trials of interest is of no concern since the Bernoulli process is based on the assumption of independence from trial to trial.
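The risk relation risk = 1 - (1 - 1/T)^n and the design calculation of example 4.7 lend themselves to a short numerical check. The sketch below is illustrative only; the function names are hypothetical, and values read from a chart such as figure 4.1 are necessarily rounded, so they need not match the computed figures to the last digit.

```python
def risk(T, n):
    """Probability of at least one exceedance of a T-year event in n years."""
    return 1.0 - (1.0 - 1.0 / T) ** n


def design_return_period(confidence, n):
    """Return period T for which prob(no exceedance in n years) = confidence."""
    p = 1.0 - confidence ** (1.0 / n)  # annual exceedance probability
    return 1.0 / p


# Example 4.7: 90% sure of no exceedance in 10 years.
T_design = design_return_period(0.90, 10)
print(f"design return period: {T_design:.1f} years")

# Chance of at least one T-year event in T years, and its limit 1 - 1/e.
for T in (5, 10, 25, 1000):
    print(f"T = {T:4d}: risk = {risk(T, T):.3f}")
```

Calling `design_return_period` over a range of design lives for a fixed confidence level traces one curve of the kind plotted in figure 4.1.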

Geometric Distribution

The probability that the first exceedance (or success) of a Bernoulli trial occurs on the xth trial can be found by noting that for the first exceedance to be on the xth trial there must be x - 1 preceding trials without an exceedance followed by 1 trial with an exceedance. Thus the desired probability is pq^(x-1). This is known as the geometric distribution

$$f_X(x; p) = pq^{x-1} \quad x = 1, 2, 3, \ldots$$

The mean and variance of the geometric distribution are

$$E(X) = 1/p$$

$$\mathrm{Var}(X) = q/p^2$$

E(X) = 1/p means that on the average a T-year event occurs on the Tth year, which agrees with our intuitive concept of a return period.

Example 4.9. What is the probability that a 10-year flood will occur for the first time during the fifth year after the completion of a project? What is the probability it will be at least the fifth year before a 10-year flood occurs? Solution: The probability that the first exceedance is in year 5 is

$$f_X(5; 0.1) = (0.1)(0.9)^4 = 0.0656$$

The probability that it will be at least the fifth year before the first occurrence is not the same as the probability of the first occurrence in the fifth year. The expression "at least" implies the first occurrence might be in the fifth year or some later year. The desired probability is equal to the probability of no occurrences in the first 4 years, which is (0.9)^4 = 0.6561.
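The distinction drawn in example 4.9 between "in the fifth year" and "at least the fifth year" can be checked numerically. This is a minimal sketch with an illustrative function name; it also confirms that the "at least" probability equals one minus the chance that the first exceedance falls in years 1 through 4.

```python
def geometric_pmf(x, p):
    """Probability that the first exceedance occurs on trial x (x = 1, 2, ...)."""
    return p * (1 - p) ** (x - 1)


p = 0.1  # annual exceedance probability of a 10-year flood

# First exceedance exactly in year 5.
first_in_year_5 = geometric_pmf(5, p)

# At least the fifth year: no exceedance in the first 4 years.
at_least_year_5 = (1 - p) ** 4

# The same quantity obtained from the geometric terms for years 1 to 4.
check = 1 - sum(geometric_pmf(x, p) for x in range(1, 5))

print(f"first exceedance in year 5:        {first_in_year_5:.4f}")
print(f"first exceedance in year 5 or later: {at_least_year_5:.4f}")
```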

Solution: This is the same as the probability of the first occurrence on the tenth year or

Negative Binomial Distribution

The probability that the kth exceedance (success) occurs on the xth trial (x ≥ k) of a Bernoulli process can be found by noting that there must be k - 1 exceedances in the x - 1 trials preceding the kth exceedance on the xth trial. The probability of k - 1 exceedances in the x - 1 trials is given by the binomial distribution as C(x - 1, k - 1) p^(k-1) q^(x-k), where C(x - 1, k - 1) is the binomial coefficient. The probability that the xth trial results in an exceedance is p, so the desired probability is given by the negative binomial distribution.

$$f_X(x; k, p) = \binom{x-1}{k-1} p^k q^{x-k} \quad x = k, k+1, \ldots$$

The mean and variance of the negative binomial distribution are

$$E(X) = k/p$$

$$\mathrm{Var}(X) = kq/p^2$$

As might be expected because the negative binomial is based on the binomial, the additive feature holds. Thus, if X and Y are described by f_X(x; k_1, p) and f_Y(y; k_2, p) respectively, then Z = X + Y follows the negative binomial f_Z(z; k_1 + k_2, p).

Example 4.11. What is the probability that the fourth occurrence of a 10-year flood will be on the fortieth year? Solution:

$$f_X(40; 4, 0.1) = \binom{39}{3}(0.1)^4(0.9)^{36} = 0.0206$$
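Example 4.11 can be evaluated with a few lines of code. The sketch below assumes the form of the negative binomial used in this section (x is the trial on which the kth exceedance occurs); the function name is illustrative. It also checks numerically that the terms sum to 1 and that the mean is k/p.

```python
from math import comb


def negative_binomial_pmf(x, k, p):
    """Probability that the kth exceedance occurs on trial x (x >= k)."""
    if x < k:
        return 0.0
    return comb(x - 1, k - 1) * p**k * (1 - p) ** (x - k)


k, p = 4, 0.1

# Fourth occurrence of a 10-year flood in year 40.
prob_fourth_in_year_40 = negative_binomial_pmf(40, k, p)
print(f"prob = {prob_fourth_in_year_40:.4f}")

# Numerical checks over a long horizon (truncation error is negligible here).
horizon = range(k, 2001)
total = sum(negative_binomial_pmf(x, k, p) for x in horizon)
mean = sum(x * negative_binomial_pmf(x, k, p) for x in horizon)
print(f"sum of terms = {total:.6f}, mean = {mean:.3f}")
```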

Summary of Bernoulli Process

In a Bernoulli process at each instant of time (or location, or trial) an event may either occur with probability p or not occur with probability q = 1 - p. The probability of the event occurring is independent of the time and independent of the past history of occurrences. The number of occurrences in a given time interval (or distance or number of trials) follows the binomial distribution. The probability that the first occurrence is at the xth time is described by the geometric distribution. The probability that the kth occurrence is at the xth time is described by the negative binomial distribution. It was also found that the probability distribution of the length of time between occurrences can be found from the geometric distribution by noting that the probability that x trials elapse between occurrences is the same as the probability that the first occurrence is at the (x + 1)st time, or f_X(x + 1; p) = pq^x.

POISSON PROCESS

Poisson Distribution

Consider a Bernoulli process defined over an interval of time (or space) so that p is the probability that an event may occur during the time interval. If the time interval is allowed to become shorter and shorter so that the probability, p, of an event occurring in the interval gets smaller and the number of trials, n, increases in such a fashion that np remains constant, then the expected number of occurrences in any total time interval remains the same. It can be shown that as n gets large and p gets small so that np remains a constant, λ, the binomial distribution approaches the Poisson distribution given by

$$f_X(x; \lambda) = \frac{\lambda^x e^{-\lambda}}{x!} \quad x = 0, 1, 2, \ldots;\ \lambda > 0$$

The mean, variance, and coefficient of skew of the Poisson distribution are

$$E(X) = \lambda$$

$$\mathrm{Var}(X) = \lambda$$

$$\gamma = \lambda^{-1/2}$$

As λ gets large, the distribution goes from a positively skewed distribution to a nearly symmetrical distribution. The cumulative Poisson distribution is

$$F_X(x; \lambda) = \sum_{i=0}^{x} \frac{\lambda^i e^{-\lambda}}{i!}$$

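Example 4.12. What is the probability that a storm with a return period of 20 years will occur once in a 10-year period? Solution: Using the binomial distribution the exact answer is

$$f_X(1; 10, 0.05) = \binom{10}{1}(0.05)^1(0.95)^9 = 0.3151$$

Approximating with the Poisson, with λ = np = 10(0.05) = 0.5,

$$f_X(1; 0.5) = \frac{(0.5)^1 e^{-0.5}}{1!} = 0.3033$$

Thus the solutions are not identical but are quite close to each other.

Example 4.13. What is the probability of 5 occurrences of a 2-year storm in a 10-year period? Solution: Using the binomial

$$f_X(5; 10, 0.5) = \binom{10}{5}(0.5)^5(0.5)^5 = 0.2461$$

Approximating with the Poisson, with λ = np = 10(0.5) = 5,

$$f_X(5; 5) = \frac{5^5 e^{-5}}{5!} = 0.1755$$

Comment: For this situation n is not large enough and p not small enough for a good approximation.

Example 4.14. What is the probability of fewer than 5 occurrences of a 20-year storm in a 100-year period? Solution: n is relatively large and p small so the Poisson will be used, with λ = np = 100(0.05) = 5.

$$\text{prob}(X < 5) = \sum_{x=0}^{4} \frac{5^x e^{-5}}{x!} = 0.4405$$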
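The quality of the Poisson approximation in examples 4.12 through 4.14 can be examined side by side. The sketch below is illustrative; the function and variable names are hypothetical, and λ = np is used in each case as in the text.

```python
from math import comb, exp, factorial


def binomial_pmf(x, n, p):
    """P(X = x) successes in n independent trials with success probability p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def poisson_pmf(x, lam):
    """P(X = x) for a Poisson distribution with parameter lam."""
    return lam**x * exp(-lam) / factorial(x)


# Example 4.12: one 20-year storm in 10 years (n = 10, p = 0.05).
b12, p12 = binomial_pmf(1, 10, 0.05), poisson_pmf(1, 10 * 0.05)

# Example 4.13: five 2-year storms in 10 years (n = 10, p = 0.5).
b13, p13 = binomial_pmf(5, 10, 0.5), poisson_pmf(5, 10 * 0.5)

# Example 4.14: fewer than five 20-year storms in 100 years (n = 100, p = 0.05).
b14 = sum(binomial_pmf(x, 100, 0.05) for x in range(5))
p14 = sum(poisson_pmf(x, 100 * 0.05) for x in range(5))

for label, b, p in (("4.12", b12, p12), ("4.13", b13, p13), ("4.14", b14, p14)):
    print(f"example {label}: binomial = {b:.4f}, Poisson = {p:.4f}")
```

The gap between the two columns is largest for example 4.13, where p = 0.5 is far from small.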

The Poisson distribution possesses the additive property that the sum of two Poisson random variables with parameters λ_1 and λ_2 is a Poisson random variable with parameter λ = λ_1 + λ_2.

A Poisson process for a continuous time scale can be defined analogous to a Bernoulli process on a discrete time scale. The Poisson process refers to the occurrence of events along a continuous time (or location) scale. The assumptions underlying the process are:

1. The probability of an event in any short interval t to t + Δt is λΔt (proportional to the length of the interval) for all values of t. This property is known as stationarity.

2. The probability of more than one event in any short interval t to t + Δt is negligible in comparison to λΔt.

3. The number of events in any interval of time is independent of the number of events in any other non-overlapping interval of time.

The probability distribution of the number of events X in time t for a Poisson process is given by

$$f_X(x; \lambda t) = \frac{(\lambda t)^x e^{-\lambda t}}{x!} \quad \lambda > 0;\ t > 0;\ x = 0, 1, 2, \ldots \qquad (4.20)$$

where f_X(x; λt) is the probability of x events in time t. Equation 4.20 is a Poisson distribution with parameter λt. The mean and variance of f_X(x; λt) are E(X) = λt and Var(X) = λt. The parameter λ is the average rate of occurrence of the event.

Exponential Distribution

The probability distribution of the time, T, between occurrences of the event can be found by noting that the prob(T < t) is equal to 1 - prob(T > t). The prob(T > t) is equal to the probability of no occurrences in time t, which is f_X(0; λt) or e^(-λt). Thus

$$P_T(t) = \text{prob}(T \le t) = 1 - e^{-\lambda t} \quad t > 0$$

which is a cumulative distribution known as the exponential distribution. The probability density function is

$$p_T(t) = \lambda e^{-\lambda t} \quad t > 0$$

and is the probability distribution of the length of the time interval between occurrences of the event. The mean and variance of the exponential distribution are 1/λ and 1/λ², respectively.

Gamma Distribution

The probability distribution of the time to the nth occurrence can be found by noting that the time to the nth occurrence is the sum of n independent random variables, T_1 + T_2 + ... + T_n, from the exponential distribution. The method of derived distributions can be used with the result that the probability density function of the time to the nth occurrence is

$$p_T(t) = \frac{\lambda^n t^{n-1} e^{-\lambda t}}{(n-1)!} \quad t > 0;\ n = 1, 2, \ldots$$

which is the gamma distribution for integer values of the parameter n. The gamma distribution has E(T) = n/λ and Var(T) = n/λ².

Example 4.15. Barges arrive at a lock an average of 4 each hour. (a) If the arrival of barges at the lock can be considered to follow a Poisson process, what is the probability that 6 barges will arrive in 2 hours? (b) If the lock master has just locked through all of the barges at the lock, what is the probability she can take a 15-minute break without another barge arriving? (c) If the operation of the lock is such that 4 barges can be locked through at once and the lock master insists that this always be the case, what is the probability that the first barge to arrive after 4 previous barges have been locked through will have to wait at least 1 hour before being locked through?

Solution: (a) For this problem the rate constant λ is 4 hours^-1. The probability of 6 arrivals in 2 hours can be determined from the Poisson distribution

$$f_X(x; \lambda t) = f_X(6; 8) = \frac{8^6 e^{-8}}{6!} = 0.1221$$

(b) The probability of no arrivals in 15 minutes is also from the Poisson, with λt = 4(0.25) = 1,

$$f_X(0; 1) = \frac{1^0 e^{-1}}{0!} = 0.3679$$

Note that this is not the same as the probability that it will be 15 minutes until the next arrival. The time scale is continuous so the probability that it will be exactly 15 minutes until the next arrival is zero. We can only talk of probabilities associated with time intervals, not specific times.

(c) The barge must wait for the arrival of 3 additional barges. The probability that the time T_3 for 3 barges to arrive is greater than 1 hour, prob(T_3 > 1), is 1 - prob(T_3 ≤ 1). The probability that T_3 ≤ 1 for 3 arrivals comes from the gamma distribution

$$\text{prob}(T_3 \le 1) = \int_0^1 \frac{4^3 t^2 e^{-4t}}{2!}\,dt = 0.762$$

The desired probability is 1 - 0.762 = 0.238.

Summary of Poisson Process

The Poisson process is a discrete process on a continuous time scale. Therefore, the probability distribution of the number of events in a time T is a discrete distribution, whereas the probability distributions for the time between events and the time to the nth event are continuous distributions.
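As a numerical illustration of these discrete and continuous distributions, the three parts of example 4.15 can be reproduced in a few lines. This is a sketch under stated assumptions: the helper names are hypothetical, and part (c) uses the standard identity that the time to the nth event is at most t exactly when at least n events occur in time t, instead of integrating the gamma density directly.

```python
from math import exp, factorial


def poisson_pmf(x, mean):
    """P(X = x) events when the expected number of events is mean."""
    return mean**x * exp(-mean) / factorial(x)


def time_to_nth_event_cdf(t, n, rate):
    """prob(T_n <= t) for a Poisson process with the given rate.

    Uses prob(T_n <= t) = prob(at least n events in time t), which is
    equivalent to integrating the gamma density with integer n.
    """
    return 1.0 - sum(poisson_pmf(i, rate * t) for i in range(n))


rate = 4.0  # barges per hour

part_a = poisson_pmf(6, rate * 2.0)                    # 6 arrivals in 2 hours
part_b = poisson_pmf(0, rate * 0.25)                   # no arrivals in 15 minutes
part_c = 1.0 - time_to_nth_event_cdf(1.0, 3, rate)     # wait for 3 more barges > 1 hour

print(f"(a) {part_a:.4f}  (b) {part_b:.4f}  (c) {part_c:.3f}")
```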


For a Poisson process, the probability that an event will occur in a short time interval t to t + Δt is λΔt for all t. The probability that more than one event occurs in Δt is negligible. The probability distribution of the number of events in a given time T is the Poisson distribution. The exponential distribution describes the time between events and the gamma distribution the time to the nth event.

Example 4.16. It has been proposed that an event-based rainfall simulation model can be constructed by modeling the occurrence of rainstorms by a Poisson process and the amount of rain in each storm by some continuous probability distribution. In this way, the time between rainstorms would follow an exponential distribution, the time for X rainstorms would follow a gamma distribution, and the number of rainstorms in a time interval would follow a Poisson distribution. Duckstein et al. (1975) and Fogel et al. (1974) used a modification of this approach. Part of Fogel et al.'s results are shown as figure 4.2.

Fig. 4.2. Distribution of occurrences of warm season rainfall in which the areal mean of five gages in New Orleans, Louisiana, exceeded 0.50 inches and at least one gage recorded more than 1.0 inch (Fogel et al. 1974). Axis: number of events per year (0 to 30).

MULTINOMIAL DISTRIBUTION

The binomial distribution can be generalized to include the probabilities of outcomes of several types rather than the two possible outcomes of the binomial. If the probabilities associated with each of k distinct outcomes are p_1, p_2, ..., p_k, then in n independent trials the probability of x_1 outcomes of type 1, x_2 outcomes of type 2, ..., x_k outcomes of type k is given by the multinomial distribution as

$$f_{\mathbf{X}}(\mathbf{x}; n, \mathbf{p}) = \frac{n!}{x_1!\,x_2! \cdots x_k!}\, p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k}$$

where X, x, and p are 1 × k vectors. Some restrictions on this distribution are

$$\sum_{i=1}^{k} p_i = 1 \quad \text{and} \quad \sum_{i=1}^{k} x_i = n$$

The mean and variance of the multinomial distribution are

$$E(X_i) = np_i \qquad (4.25)$$

$$\mathrm{Var}(X_i) = np_i(1 - p_i) \qquad (4.26)$$

Example 4.17. On a certain stream the probability that the maximum peak flow during a 1-year period will be less than 5,000 cfs is 0.2 and the probability that it will be between 5,000 cfs and 10,000 cfs is 0.4. In a 20-year period, what is the probability of 4 peak flows less than 5,000 cfs and 8 peak flows between 5,000 and 10,000 cfs? Solution: To apply the multinomial distribution we define the third event as a peak flow in excess of 10,000 cfs. This event has probability 1 - 0.2 - 0.4 = 0.4. The event of a peak flow greater than 10,000 cfs must occur 20 - 4 - 8 = 8 times. The desired probability is

$$\frac{20!}{4!\,8!\,8!}(0.2)^4(0.4)^8(0.4)^8 = 0.0429$$
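The multinomial probability in example 4.17 is straightforward to evaluate numerically. The sketch below uses an illustrative function name and exact integer arithmetic for the multinomial coefficient; with two outcome types it reduces to the binomial, which serves as a check.

```python
from math import factorial, prod


def multinomial_pmf(counts, probs):
    """Probability of observing counts[i] outcomes of type i in sum(counts)
    independent trials, where probs[i] is the probability of type i."""
    coefficient = factorial(sum(counts))
    for x in counts:
        coefficient //= factorial(x)  # exact: each partial quotient is an integer
    return coefficient * prod(p**x for p, x in zip(probs, counts))


# 20 annual peaks: 4 below 5,000 cfs, 8 between 5,000 and 10,000 cfs, 8 above.
prob_4_8_8 = multinomial_pmf([4, 8, 8], [0.2, 0.4, 0.4])
print(f"prob = {prob_4_8_8:.4f}")

# With k = 2 outcome types the multinomial is the binomial.
binomial_check = multinomial_pmf([2, 3], [0.5, 0.5])
print(f"binomial check = {binomial_check}")
```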

Comment: The expected result from 20 years of flood peak data would be

E(X_1) = np_1 = 20(0.2) = 4
E(X_2) = np_2 = 8
E(X_3) = np_3 = 8

This problem demonstrates that even though the expected results are 4, 8, and 8, the probability of this happening is very low.

Exercises

4.1. Compute the terms of the binomial distribution with n = 10 and p = 0.2. Plot in the form of a histogram.

4.2. Compute the terms of the cumulative binomial with n = 10 and p = 0.2. Plot the terms.

4.3. If a project is designed on a 10-year return period, what is the probability of at least 1 exceedance during the 10-year life of the project?

4.4. What design return period should be used to ensure a 95% chance that the design will not be exceeded in a 25-year period?


4.5. Construct a curve relating the design return period to the life of a project when a 90 percent chance of no exceedance is used.

4.6. What design return period should be used to ensure a 50% chance of no exceedance in a 10-year period?

4.7. What design return period should be used to ensure a 75% chance of no more than 1 exceedance in 10 years?

4.8. Construct an example where the Poisson is not a good approximation for the binomial.

4.9. In a certain locality contractors A, B, and C get about 50%, 25%, and 25% respectively of all water resources projects. Five contracts are coming up for bid. What is the probability that contractor A will get all 5 jobs? What is the probability that A will get 2 jobs and B will get 2 jobs?

4.10. In 100 years the following number of floods were recorded at a specific location. Draw a relative frequency histogram of the data. Fit a Poisson distribution to the data and plot the relative frequencies according to the Poisson distribution on the histogram. Is the Poisson a good approximation for the data?

No. of floods    No. of occurrences

4.11. Based on a Poisson approximation to the data of exercise 4.10, what is the probability of 5 successive years without a flood?

4.12. Based on a Poisson approximation to the data of exercise 4.10, what is the probability of exactly five years between floods?

4.13. Compute the probability of at least 1 n-year event in a k-year period using (a) n = 100, k = 20; (b) n = 500, k = 50.

4.14. Using the Poisson approximation to the binomial distribution, show that the probability of at least one occurrence of a T-year event in T years is 0.632.


4.15. The Bernoulli distribution is given by

$$f_X(x; p) = p^x(1 - p)^{1-x} \quad x = 0, 1$$

What is E(X) and Var(X) for this distribution?

4.16. Use the Poisson distribution to approximate the binomial distribution of exercise 4.1. Plot the terms of this Poisson distribution on the histogram of exercise 4.1.

4.17. Two widely separated watersheds are selected for a study on peak discharges. If the occurrence of flood flows on the two basins can be considered as independent events, what is the probability of experiencing a total of five 20-year events on the two watersheds in a 10-year period?

4.18. A well-known scientist has predicted that during a certain 3-year period a severe drought will occur on the plains east of the Rocky Mountains. He made this prediction based on his observance of sunspot activity. If the probability of a drought is 0.10 in any year, what is the probability that the scientist's prediction will come true if the occurrence of a drought is a strictly random phenomenon unrelated to sunspot activity?

4.19. In a certain region there are 20 possible small watersheds suitable for a research project. Unknown to the project manager, 6 of these basins have subsurface geological features that permit large quantities of surface water to enter underground formations and leave the basin via subsurface flow. The project manager wants to select 6 watersheds from the 20 for study. (a) What is the probability that 1 of the basins having the above described geologic features will be selected? (b) What is the probability that 3 of these basins will be selected? (c) What is the probability that at least one of the basins will be selected? (d) What is the probability that all of these basins will be selected?

4.20. In the situation described in exercise 4.19 the project manager wants to pick 3 pairs of watersheds for the evaluation of an evapotranspiration suppressant. One basin in each pair will be used for a control and one will be treated with the suppressant. What is the probability that all of the control watersheds will have the geologic problem while all of the rest will not?

4.21. It is desired to model the number of rainy days in July and August as a Bernoulli process. Based on the data below and the assumption that the Bernoulli model is applicable: (a) What is the probability of 10 or more rainy days in each of the months of July and August? (b) What is the probability of 20 rainy days in the 2-month period? (c) What assumptions concerning the Bernoulli process are likely violated by this problem? For this problem write answers in terms of summations. Do not evaluate the summations.

Year July August

1 10 4

2

15 9

3

17 8

4 8 3

7

8

9

10

No. of rainy days 9 19 17 12 0 10

14 2

20 8

4 6

5

6


4.22. For the binomial distribution show that f_X(x; n, p) = f_X(x - 1; n - 1, p) f_X(1; 1, p) + f_X(x; n - 1, p) f_X(0; 1, p). Write out a narrative description of the meaning of this equation.

4.23. Work exercise 4.21 using the Poisson distribution to approximate the binomial.

4.24. Pool the data of exercise 4.21 so that a single estimate is obtained for p of the binomial distribution. Compute the probability of 20 rainy days in the 2-month period of July-August. Compare this probability to the one computed in part b of exercise 4.21. Which answer would you prefer?

4.25. Using the data of exercise 4.21, what is the probability that the sixth wet day of August occurs on August 29, 30, or 31?

4.26. Show that for the Poisson process the time for n occurrences follows the gamma distribution. (Hint: Use the method of derived distributions to find the distribution of the time to 2 occurrences. Using the distribution of the time to 2 occurrences, the method of derived distributions can be used to get the time to 3 occurrences. This process can then be repeated until a pattern emerges. Induction could also be used by showing that if the time for n - 1 occurrences is given by equation 4.20 by substituting n - 1 for n then the time for n occurrences is given by equation 4.20. Also, the time for 1 occurrence is given by equation 4.19, which is the same as equation 4.20 with n = 1.)

5. Normal Distribution

THE MOST widely used and most important continuous probability distribution is the Gaussian, or normal distribution. The normal distribution has been widely used because of its early connection with the "Theory of Errors" and because it has certain useful mathematical properties. Many statistical techniques such as analysis of variance and the testing of certain hypotheses rely on the assumption of normality. The errors involved in incorrectly assuming normality (purposely or unknowingly) depend on the use under consideration. Many statistical methods derived under the assumption of normality remain approximately valid when moderate departures from normality are present and as such are said to be robust.

The very name "normal" distribution is misleading in that it implies that random variables that are not normally distributed are abnormal in some sense. The Central Limit Theorem indicates the conditions under which a random variable can be expected to be normally distributed. In a strict theoretical sense, most hydrologic variables cannot be normally distributed because the range of any random variable that is normally distributed is the entire real line (-∞ to +∞). Thus non-negative variables such as rainfall, streamflow, reservoir storage, and so on, cannot be strictly normally distributed. However, if the mean of a random variable is 3 or 4 times greater than its standard deviation, the probability of a normal random variable being less than zero is very small and can in many cases be neglected.

GENERAL NORMAL DISTRIBUTION

The normal distribution is a 2-parameter distribution whose density function is

Fig. 5.1. Normal distributions with same mean and different variances.

Fig. 5.2. Normal distributions with same variance and different means.

In examples 3.3 and 3.5 it was shown that if either the method of moments or the method of maximum likelihood is used to estimate the two parameters of this distribution, the result is θ_1 = μ and θ_2² = σ², where μ and σ² are the mean and variance of X, respectively. For this reason the normal distribution is generally written as

$$p_X(x) = (2\pi\sigma^2)^{-1/2} e^{-(x - \mu)^2/2\sigma^2} \quad -\infty < x < \infty$$

Thus, the normal distribution is a 2-parameter distribution which is bell-shaped, continuous, and symmetrical about μ (the coefficient of skew is zero). If μ is held constant and σ² varied, the distribution changes as in figure 5.1. If σ² is held constant and μ varied, the distribution does not change scale but does change location as in figure 5.2. The parameters μ and σ² are sometimes denoted as location and scale parameters. A common notation for indicating that a random variable is normally distributed with mean μ and variance σ² is N(μ, σ²).

REPRODUCTIVE PROPERTIES

If a random variable X is N(μ, σ²) and Y = a + bX, the distribution of Y can be shown to be N(a + bμ, b²σ²). Furthermore, if X_i for i = 1, 2, ..., n, are independently and normally distributed with mean μ_i and variance σ_i², then Y = a + b_1X_1 + b_2X_2 + ... + b_nX_n is normally distributed with

$$\mu_Y = a + \sum_{i=1}^{n} b_i\mu_i \qquad (5.2)$$

and

$$\sigma_Y^2 = \sum_{i=1}^{n} b_i^2\sigma_i^2 \qquad (5.3)$$

Any linear function of independent normal random variables is also a normal random variable.

Example 5.1. If x_i is a random observation from the distribution N(μ, σ²), what is the distribution of X̄ = Σ x_i/n (the sum running from i = 1 to n)? Solution: X̄ is a linear function of x_i given by X̄ = (x_1 + x_2 + ... + x_n)/n. From equations 5.2 and 5.3 and the reproductive properties of the normal distribution, X̄ is normally distributed with mean

$$\mu_{\bar{X}} = \sum_{i=1}^{n} \frac{\mu}{n} = \mu$$

and variance

$$\sigma_{\bar{X}}^2 = \sum_{i=1}^{n} \frac{\sigma^2}{n^2} = \frac{\sigma^2}{n}$$

Therefore, X̄ is N(μ, σ²/n).

STANDARD NORMAL DISTRIBUTION

The cumulative distribution function for the normal distribution is

$$P_X(x) = \int_{-\infty}^{x} (2\pi\sigma^2)^{-1/2} e^{-(t - \mu)^2/2\sigma^2}\,dt \qquad (5.4)$$

Unfortunately, equation 5.4 cannot be evaluated analytically. Approximate methods of integration are required. If a tabulation of the integral was made, a separate table would be required for each value of μ and σ². By using the linear transformation

$$Z = \frac{X - \mu}{\sigma}$$

the random variable Z will be N(0, 1). This is a special case of a + bX with a = -μ/σ and b = 1/σ. The random variable Z is said to be standardized (has μ = 0 and σ² = 1) and N(0, 1) is said to be the standard normal distribution. The standard normal distribution is given by

$$p_Z(z) = (2\pi)^{-1/2} e^{-z^2/2} \quad -\infty < z < \infty$$

and the cumulative standard normal is given by

$$P_Z(z) = \int_{-\infty}^{z} (2\pi)^{-1/2} e^{-t^2/2}\,dt$$

Fig. 5.3. Standard normal distribution (μ = 0, σ² = 1).

Figure 5.3 shows the standard normal distribution which, along with the transformation Z = (X - μ)/σ, contains all of the information shown in figures 5.1 and 5.2. Both p_Z(z) and P_Z(z) are widely tabulated. Most tables utilize the symmetry of the normal distribution so that only positive values of Z are shown. Tables of P_Z(z) may show prob(Z < z), prob(0 < Z < z), or prob(-z < Z < z). Care must be exercised when using normal probability tables to see what values are tabulated. The table of P_Z(z) in the appendix gives prob(Z < z). There are many routines programmed into computer software to evaluate the normal pdf and cdf. Some approximations for the standard normal distribution are given below.

A table of P_Z(z) shows that 68.26% of the normal distribution is within 1 standard deviation of the mean, 95.44% within 2 standard deviations of the mean, and 99.74% within 3 standard deviations of the mean. These are called the 1, 2, and 3 sigma bounds of the normal distribution. The fact that only 0.26% of the area of the normal distribution lies outside the 3 sigma bound demonstrates that the probability of a value less than μ - 3σ is only 0.0013 and is the justification for using the normal distribution in some instances even though the random variable under consideration may be bounded by X = 0. If μ is greater than 3σ, the chance that X is less than zero is often negligible (this is not always true, however).

Example 5.2. Compare the 1, 2, and 3 sigma bounds under the assumption of normality and under no distributional assumptions using Chebyshev's inequality. Solution: The 1, 2, and 3 sigma bounds of N(μ, σ²) contain 68.26, 95.44, and 99.74% of the distribution. Thus, the probability that X deviates more than σ, 2σ, and 3σ from μ is 0.3174, 0.0456, and 0.0026, respectively.

Chebyshev's inequality states that prob(|X - μ| > kσ) < 1/k². This corresponds to a probability that X deviates more than σ, 2σ, and 3σ from μ of less than 1.00, less than 0.25, and less than 0.11, respectively. Comment: By making no distributional assumptions, we are forced to make very conservative probability statements. It is emphasized that Chebyshev's inequality gives an upper bound to the probability and not the probability itself.
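The comparison in example 5.2 can be tabulated with the error function from the Python standard library. This is an illustrative sketch with hypothetical function names; values computed this way can differ in the last digit from figures derived from four-place tables.

```python
from math import erf, sqrt


def prob_within_k_sigma(k):
    """prob(|X - mu| < k * sigma) for a normally distributed X."""
    return erf(k / sqrt(2.0))


def chebyshev_bound(k):
    """Upper bound on prob(|X - mu| > k * sigma) with no distributional assumption."""
    return min(1.0, 1.0 / k**2)


rows = []
for k in (1, 2, 3):
    normal_tail = 1.0 - prob_within_k_sigma(k)
    rows.append((k, normal_tail, chebyshev_bound(k)))
    print(f"k = {k}: normal = {normal_tail:.4f}, Chebyshev bound = {chebyshev_bound(k):.4f}")
```

In every row the normal tail probability sits well below the Chebyshev bound, which is the sense in which the bound is conservative.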

Example 5.3. As an example of using tables of the normal distribution consider a sample drawn from an N(15, 25). What is the prob(15.6 < X < 20.4)? Solution: The desired probability could be evaluated from

$$\text{prob}(15.6 < X < 20.4) = \int_{15.6}^{20.4} (2\pi \cdot 25)^{-1/2} e^{-(x - 15)^2/50}\,dx$$

However, this integral is difficult to evaluate. Making use of the standard normal distribution, we can transform the limits on X to limits on Z and then use standard normal tables.

x = 15.6 transforms to z = (15.6 - 15.0)/5 = 0.12
x = 20.4 transforms to z = (20.4 - 15.0)/5 = 1.08

The desired probability is

$$\text{prob}(0.12 < Z < 1.08) = P_Z(1.08) - P_Z(0.12)$$

From the standard normal table P_Z(1.08) = 0.860 and P_Z(0.12) = 0.548. The desired probability is 0.860 - 0.548, or 0.312.
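Where software is available, the table lookup in example 5.3 can be replaced by a direct evaluation of the normal cdf. The following sketch writes the cdf in terms of the error function from the Python standard library; the function name and argument order are illustrative choices.

```python
from math import erf, sqrt


def normal_cdf(x, mu=0.0, sigma=1.0):
    """prob(X < x) for X distributed N(mu, sigma**2)."""
    z = (x - mu) / sigma  # standardize
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


mu, sigma = 15.0, 5.0  # N(15, 25) has standard deviation 5

upper = normal_cdf(20.4, mu, sigma)  # P_Z(1.08)
lower = normal_cdf(15.6, mu, sigma)  # P_Z(0.12)
prob = upper - lower

print(f"P_Z(1.08) = {upper:.3f}, P_Z(0.12) = {lower:.3f}, prob = {prob:.3f}")
```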

APPROXIMATIONS FOR STANDARD NORMAL DISTRIBUTION

Maidment (1993) presents several approximations for the normal distribution. Let P_Z(z) = p for 0.005 ≤ P_Z(z) ≤ 0.995 where Z is the standard normal variate. Then z can be approximated from

Let y = -ln(2p). For 0.005 < P_Z(z) < 0.5, an approximation for z is given by

An approximation for P_Z(z) for positive values of z is given by

$$P_Z(z) = 1 - 0.5 \exp\left[-\frac{(83z + 351)z + 562}{703/z + 165}\right]$$

Of course, for negative values of z, P_Z(z) for the absolute value of z can be obtained and then P_Z(z) = 1 - P_Z(|z|).
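The accuracy of this approximation can be checked against an error-function evaluation of the cdf. The sketch below assumes the constants 83, 351, 562, 703, and 165 as they appear in this section's formula and worked example; the function names, and the handling of z = 0 as a special case to avoid dividing by zero, are illustrative choices.

```python
from math import erf, exp, sqrt


def approx_normal_cdf(z):
    """Approximate P_Z(z) using the rational-exponent formula for z > 0
    and the symmetry P_Z(z) = 1 - P_Z(|z|) for z < 0."""
    if z == 0:
        return 0.5  # special case: the formula divides by z
    a = abs(z)
    upper = 1.0 - 0.5 * exp(-((83.0 * a + 351.0) * a + 562.0) / (703.0 / a + 165.0))
    return upper if z > 0 else 1.0 - upper


def exact_normal_cdf(z):
    """P_Z(z) evaluated with the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


for z in (-0.9, 0.12, 0.9, 1.08, 2.0):
    print(f"z = {z:5.2f}: approx = {approx_normal_cdf(z):.5f}, "
          f"erf-based = {exact_normal_cdf(z):.5f}")

# prob(-0.9 < Z < 1.08), as in example 5.4
prob = approx_normal_cdf(1.08) - approx_normal_cdf(-0.9)
print(f"prob(-0.9 < Z < 1.08) = {prob:.3f}")
```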

Example 5.4. Use a normal approximation to determine prob(10.5 < X < 20.4) if X is distributed N(15, 25). Solution: Using the approximation for P_Z(z)

$$\text{prob}(z < 1.08) = 1 - 0.5 \exp\left[-\frac{[(83)(1.08) + 351]\,1.08 + 562}{703/1.08 + 165}\right] = 0.860$$

prob(0 < z < 1.08) = 0.85987 - 0.50000 = 0.360

Similarly, prob(z < 0.9) = 0.816 so that prob(0 < z < 0.9) = 0.316 and

prob(-0.9 < z < 1.08) = 0.360 + 0.316 = 0.676

Comment: Often, in solving problems of this type, it is useful to sketch a normal distribution and then shade in the area corresponding to the desired probability. For this problem the sketch would be as in figure 5.4.


Fig. 5.4. Prob(-0.9 < z < 1.08).
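The arithmetic of example 5.4 can be reproduced with a short sketch of the rational-exponential approximation for $P_Z(z)$. The coefficients (83, 351, 562, 703, 165) are taken from the worked example; the function name and the explicit handling of z = 0 are assumptions added for illustration.

```python
import math


def cdf_approx(z):
    """Approximate standard normal CDF; negative z is handled by symmetry."""
    if z < 0.0:
        return 1.0 - cdf_approx(-z)
    if z == 0.0:
        return 0.5  # limiting value; avoids dividing by zero in 703/z
    exponent = ((83.0 * z + 351.0) * z + 562.0) / (703.0 / z + 165.0)
    return 1.0 - 0.5 * math.exp(-exponent)


# Example 5.4: X is N(15, 25); limits 10.5 and 20.4 become z = -0.9 and 1.08.
z_low = (10.5 - 15.0) / 5.0
z_high = (20.4 - 15.0) / 5.0
prob_54 = cdf_approx(z_high) - cdf_approx(z_low)
print(round(cdf_approx(1.08), 5), round(prob_54, 3))
```

The first printed value corresponds to the 0.85987 shown in the example, and the second to the final answer of 0.676.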


Example 5.5. Repeat example 3.7 assuming the Kentucky River data are normally distributed.

Solution: Since X is assumed normal, $\bar{X}$ is $N(\mu, 22{,}322^2/n)$. Therefore,

$$Z = \frac{\bar{X} - \mu}{22{,}322/\sqrt{n}}$$

is N(0, 1). From the problem statement $|\bar{X} - \mu| < 10{,}000$. So n must be determined so that

$$\mathrm{prob}\left(|Z| < \frac{10{,}000}{22{,}322/\sqrt{n}}\right) = 0.95$$

From the standard normal table it is seen that 95% of the normal distribution is enclosed by -1.96 < Z < 1.96. From this n is calculated as

$$n = \left[\frac{1.96(22{,}322)}{10{,}000}\right]^2 = 19.1$$

or at least 19 observations are required to be 95% sure that $\bar{X}$ is within 10,000 cfs of $\mu$ if X is $N(\mu, 22{,}322^2)$.

Comment: By assuming normality, the required minimum number of observations has been reduced from 100 to 19. The Law of Large Numbers has placed a lower limit on n without knowledge of the distribution of X. The price for this ignorance of the distribution of X is seen to be very great if in fact X is normally distributed.
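The two sample sizes compared in the comment can be reproduced with a few lines. This is a hedged sketch: the normal-theory formula follows the solution above, while the distribution-free bound $n \ge \sigma^2/(\varepsilon^2 \delta)$ is an assumption about how the figure of 100 was obtained in example 3.7, which is not reproduced in this section.

```python
def n_normal(sigma, tolerance, z_crit):
    """Sample size from z_crit = tolerance / (sigma / sqrt(n))."""
    return (z_crit * sigma / tolerance) ** 2


def n_distribution_free(sigma, tolerance, risk):
    """Chebyshev-type bound: prob(|Xbar - mu| > tolerance) <= sigma^2 / (n tolerance^2)."""
    return sigma ** 2 / (tolerance ** 2 * risk)


SIGMA = 22322.0      # standard deviation of the Kentucky River peaks, cfs
TOLERANCE = 10000.0  # allowed deviation of the sample mean from mu, cfs

n_norm = n_normal(SIGMA, TOLERANCE, z_crit=1.96)
n_free = n_distribution_free(SIGMA, TOLERANCE, risk=0.05)
print(round(n_norm, 2), round(n_free, 2))
```

The normal-theory value is slightly above 19, so strictly rounding up to a whole observation would give 20; the text quotes 19.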

CENTRAL LIMIT THEOREM

The conditions under which a random variable might be expected to follow a normal distribution are specified by the Central Limit Theorem.

If $S_n$ is the sum of n independently and identically distributed random variables $X_i$ each having a mean, $\mu$, and variance, $\sigma^2$, then in the limit as n approaches infinity, the distribution of $S_n$ approaches a normal distribution with mean $n\mu$ and variance $n\sigma^2$.

In practice, if the $X_i$ are identically and independently distributed, n does not have to be very large for $S_n$ to be approximated by a normal distribution. If interest lies in the central part of the distribution of $S_n$, values of n as small as 5 or 6 will result in the normal distribution producing reasonable approximations to the true distribution of $S_n$. If interest lies in the tails of the distribution of $S_n$, as it often does in hydrology, larger values of n may be required.

As stated above, the Central Limit Theorem is of limited value in hydrology since most hydrologic variables are not the sum of a large number of independently and identically distributed random variables. Fortunately, under some very general conditions it can be shown that if $X_i$ for i = 1, 2, ..., n is a random variable independent of $X_j$ for $j \ne i$ and $E(X_i) = \mu_i$ and $\mathrm{Var}(X_i) = \sigma_i^2$, then the sum $S_n = X_1 + X_2 + \cdots + X_n$ approaches a normal distribution with $E(S_n) = \sum_{i=1}^{n} \mu_i$ and $\mathrm{Var}(S_n) = \sum_{i=1}^{n} \sigma_i^2$ as n approaches infinity (Thomas 1971). One condition for this generalized Central Limit Theorem is that each $X_i$ has a negligible effect on the distribution of $S_n$ (i.e., there cannot be one or two dominating $X_i$'s).


This general theorem is very useful in that it says that if a hydrologic random variable is the sum of n independent effects and n is relatively large, the distribution of the variable will be approximately normal. Again, how large n must be depends on the area of interest (central part or tail of the distribution) and on how good an approximation is needed.

Example 5.6. In the last chapter the gamma distribution for integer values of n was derived as the sum of n exponentially distributed random variables. The mean and variance of the exponential distribution are given as $1/\lambda$ and $1/\lambda^2$, respectively. The Central Limit Theorem gives the mean and variance of the sum of n values from the exponential distribution as $n/\lambda$ and $n/\lambda^2$ for large n. This agrees with the mean and variance of the gamma distribution. In chapter 6, the coefficient of skew of the gamma distribution is given as $2/\sqrt{n}$, which approaches zero as n gets large. Thus, the sum of n random variables from an exponential distribution is a gamma distribution which approaches a normal distribution (with $\gamma$ approaching 0) as n gets large.
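The convergence described in example 5.6 can be illustrated without simulation. The sketch below compares the exact cumulative distribution of a sum of n exponential variables (the gamma distribution with integer shape, written as a finite series) with the normal distribution having mean $n/\lambda$ and variance $n/\lambda^2$. The grid, the choice $\lambda = 1$, and the function names are illustrative assumptions.

```python
import math


def erlang_cdf(x, n, lam):
    """Exact CDF of the sum of n independent exponential(lam) variables."""
    if x <= 0.0:
        return 0.0
    term = 1.0   # (lam*x)^0 / 0!
    total = 1.0
    for j in range(1, n):
        term *= lam * x / j   # builds (lam*x)^j / j! recursively
        total += term
    return 1.0 - math.exp(-lam * x) * total


def normal_cdf(x, mean, sd):
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def max_cdf_gap(n, lam=1.0, points=401):
    """Largest |exact - normal| CDF difference on a grid over mean +/- 4 sd."""
    mean = n / lam
    sd = math.sqrt(n) / lam
    lo = max(0.0, mean - 4.0 * sd)
    hi = mean + 4.0 * sd
    gap = 0.0
    for i in range(points):
        x = lo + (hi - lo) * i / (points - 1)
        gap = max(gap, abs(erlang_cdf(x, n, lam) - normal_cdf(x, mean, sd)))
    return gap


for n in (5, 20, 50):
    # n, coefficient of skew 2/sqrt(n), largest CDF discrepancy
    print(n, round(2.0 / math.sqrt(n), 3), round(max_cdf_gap(n), 4))
```

Both the skew and the largest discrepancy shrink as n grows, which is the behavior the example describes.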

CONSTRUCTING PDF CURVES FOR DATA

Frequently, the histogram of a set of observed data suggests that the data may be approximated by a particular probability density function. One way to investigate the goodness of this approximation is by superimposing a pdf on the frequency histogram and then visually comparing the two distributions. Statistical procedures for testing the hypothesis that a set of data can be approximated by a particular distribution are given in chapter 8.

Consider the data of table 2.1 and the frequency histogram of figure 2.6. The probability (or relative frequency) of a peak flow in any one of the class intervals assuming a normal distribution can be obtained by integrating the normal distribution over the limits of the class interval. For example, the expected (according to the normal distribution) relative frequency in the first interval can be calculated from

$$\int_{20{,}000}^{30{,}000} \frac{1}{22{,}322\sqrt{2\pi}} \exp\left[-\frac{(x - 66{,}540)^2}{2(22{,}322)^2}\right] dx$$

because the mean of the data is 66,540 cfs and the standard deviation is 22,322 cfs. This integral is easily evaluated using standard normal tables as 0.0322. An approximation to the relative frequency in a class interval can also be made by using equation 2.25b,

$$f_{x_i} = \Delta x_i \, p_X(x_i)$$

Using the standard normal distribution through the transformation $z_i = (x_i - \mu)/\sigma$, this becomes

$$f_{x_i} = \Delta x_i \, p_Z(z_i)/\sigma$$


Table 5.1. Expected relative frequencies according to the normal distribution for the Kentucky River data

Columns: class mark $x_i$; $z_i$; $p_Z(z_i)$; expected relative frequency $f_{x_i}$; observed relative frequency

Expected relative frequencies $f_{x_i}$, by class interval in increasing order of class mark:
0.0316  0.0659  0.1122  0.1564  0.1783  0.1663  0.1270  0.0793  0.0405  0.0169

Sum 0.9744

For the first class interval $\Delta x_i = 10{,}000$, $z_i = (25{,}000 - 66{,}540)/22{,}322 = -1.8609$, $p_Z(z_i) = 0.0706$ (from equation 5.5), and $\sigma$ is estimated by s = 22,322, so that $f_{x_i} = 10{,}000(0.0706)/22{,}322 = 0.0316$.

Similar calculations for each of the class intervals are shown in table 5.1, with the results plotted in figure 5.5. The sum of the expected relative frequencies is not 1 because the entire range of the normal distribution was not covered.
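The expected relative frequencies can be regenerated with a short sketch. The mean, standard deviation, and class width come from the text; the ten class marks (25,000 to 115,000 cfs in steps of 10,000) are an assumption inferred from the first class mark and the ten tabulated values, and the function names are illustrative.

```python
import math

MEAN = 66540.0    # cfs
SD = 22322.0      # cfs
WIDTH = 10000.0   # class interval width, cfs

# Assumed class marks: 25,000, 35,000, ..., 115,000 cfs.
class_marks = [25000.0 + WIDTH * i for i in range(10)]


def std_normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_rel_freq(x):
    """Midpoint approximation: f = width * p_Z(z) / sigma."""
    z = (x - MEAN) / SD
    return WIDTH * std_normal_pdf(z) / SD


freqs = [expected_rel_freq(x) for x in class_marks]

# Exact integral over the first class interval, 20,000 to 30,000 cfs.
exact_first = (std_normal_cdf((30000.0 - MEAN) / SD)
               - std_normal_cdf((20000.0 - MEAN) / SD))

for x, f in zip(class_marks, freqs):
    print(int(x), round(f, 4))
print("sum", round(sum(freqs), 4), "exact first interval", round(exact_first, 4))
```

The midpoint values should match the tabulated column to about four decimals, and the exact integral for the first interval should be close to the 0.0322 read from the tables.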

Fig. 5.5. Comparison of normal distribution with the observed distribution, Kentucky River peak flows (horizontal axis: peak flow, 1000 cfs).


The procedure of integrating $p_X(x)$ over each class interval or of using equation 2.25b can be used for any continuous probability distribution to get the expected relative frequencies for that distribution.

NORMAL APPROXIMATIONS FOR OTHER DISTRIBUTIONS

The normal distribution can be shown to be a good approximation to several other distributions both discrete and continuous. Before using the normal to approximate some other distribution, care must be taken to see that the conditions for the approximation to be valid are met. Generally, the approximations are quite good in the central part of the distribution with the accuracy dropping off in the tails of the distribution. Throughout our study of distributions, the sensitivity of the tails of distributions to distributional assumptions will be of concern. This is of particular importance in hydrology, when the magnitude of a rare event is to be estimated, because this estimate must come from the tail of the distribution being used.

Whenever a continuous distribution is used to approximate a discrete distribution, half-interval corrections must be applied to the continuous distribution. For example, the probability that X is equal to some positive integer x can be evaluated for a discrete distribution. This same probability is zero if a continuous distribution is used. When a continuous distribution is used to approximate the prob(X = x), the prob(x - 1/2 < X < x + 1/2) must be evaluated. This illustrates the general rule that a 1/2 interval correction must be added to the upper limit and subtracted from the lower limit. The prob(X = x, x + 1, x + 2, ..., y) in a discrete case is approximated by prob(x - 1/2 < X < y + 1/2) in the continuous case. The prob($X \le x$) in a discrete case is approximated by prob(X < x + 1/2) in the continuous case. More examples of these corrections are shown in table 5.2.

The Central Limit Theorem provides the mechanism by which the normal distribution becomes an approximation for several other distributions.
Binomial Distribution

It was stated in chapter 4 that if X is a binomial random variable with parameters $n_1$ and p and Y is a binomial random variable with parameters $n_2$ and p, then Z = X + Y is a binomial random variable with parameters $n = n_1 + n_2$ and p. Extending this to the sum of several binomial random variables, the Central Limit Theorem would indicate that the normal distribution approximates the binomial distribution if n is large. Thus, as n gets large the distribution of

$$Z = \frac{X - np}{\sqrt{np(1 - p)}}$$

approaches a N(0, 1). This is sometimes known as the DeMoivre-Laplace limit theorem (Mood et al. 1974).

Table 5.2. Corrections for approximating a discrete random variable by a continuous random variable
Discrete    Continuous

Example 5.7. X is a binomial random variable with n = 25 and p = 0.3. Compare the binomial and normal approximation to the binomial for evaluating the prob($5 < X \le 8$).

Solution: Using the binomial distribution this is equivalent to

$$\sum_{x=6}^{8} \binom{25}{x} (0.3)^x (0.7)^{25-x} = 0.483$$

Using the normal approximation, with mean np = 7.5 and standard deviation $\sqrt{np(1 - p)} = 2.29$, the probability is determined as prob(5.5 < X < 8.5), which is 0.476. Therefore, the exact probability of 0.483 is approximated by the normal to be 0.476 for an n of 25.

Negative Binomial Distribution

Following reasoning similar to that given for the binomial distribution, the negative binomial distribution with large k can be approximated by a normal distribution. In the case of the negative binomial, the distribution of

$$Z = \frac{X - k/p}{\sqrt{k(1 - p)}/p}$$

approaches N(0, 1) as k gets large.

Example 5.8. Work example 4.11 using the normal approximation for the negative binomial.

Solution: The desired probability is prob(39.5 < X < 40.5). Using the standard normal distribution, the limits on Z follow from standardizing 39.5 and 40.5 with the mean and standard deviation of the negative binomial. The resulting probability compares favorably with the 0.0206 computed using the negative binomial.
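The comparison in example 5.7 can be reproduced with the sketch below, which sums the exact binomial terms and applies the half-interval correction for the normal approximation. Function and variable names are illustrative assumptions; `math.comb` requires Python 3.8 or later.

```python
import math


def binomial_pmf(x, n, p):
    return math.comb(n, x) * p ** x * (1.0 - p) ** (n - x)


def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


n, p = 25, 0.3

# Exact: prob(5 < X <= 8) = prob(X = 6) + prob(X = 7) + prob(X = 8).
exact = sum(binomial_pmf(x, n, p) for x in range(6, 9))

# Normal approximation with half-interval correction: prob(5.5 < X < 8.5).
mean = n * p
sd = math.sqrt(n * p * (1.0 - p))
approx = std_normal_cdf((8.5 - mean) / sd) - std_normal_cdf((5.5 - mean) / sd)

print(round(exact, 3), round(approx, 3))
```

The exact sum should reproduce 0.483. The computed approximation may differ from the quoted 0.476 in the third decimal, since the text reads values from tables with z rounded to two decimals.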


Poisson Distribution

The sum of two Poisson random variables with parameters $\lambda_1$ and $\lambda_2$ is also a Poisson random variable with parameter $\lambda = \lambda_1 + \lambda_2$. Extending this to the sum of a large number of Poisson random variables, the Central Limit Theorem indicates that for large $\lambda$, the Poisson may be approximated by a normal distribution. In this case the distribution of

$$Z = \frac{X - \lambda}{\sqrt{\lambda}}$$

approaches an N(0, 1). Since the Poisson is the limiting form of the binomial and the binomial can be approximated by the normal, it is no surprise that the Poisson can also be approximated by the normal.

Continuous Distributions

Many continuous distributions can be approximated by the normal distribution for certain values of their parameters. For instance, in example 5.6, it was shown that for large n the gamma distribution approaches the normal distribution. To make these approximations one merely equates the mean and variance of the distribution to be approximated to the mean and variance of the normal and then uses the fact that

$$Z = \frac{X - \mu}{\sigma}$$

is N(0, 1) if X is $N(\mu, \sigma^2)$. Not all continuous distributions can be approximated by the normal, and for those that can the approximation is only valid for certain parameter values. Things to look for are parameters that produce near zero skew, symmetry, and tails that asymptotically approach $p_X(x) = 0$ as X approaches large and small values. Again, it is emphasized that approximations in the tails of the distributions may not be as good as in the central region of the distribution.

Exercises

5.1. Consider sampling from a normal distribution with a mean of 0 and a variance of 1. What is the probability of selecting (a) an observation between 0.5 and 1.5? (b) an observation outside the interval -0.5 to +0.5? (c) 3 observations inside and 2 observations outside the interval of 0.5 and 1.5? (d) 4 observations inside the interval 0.5 to 1.5 exactly two of which are not in the interval -0.5 to 1.0?

5.2. What is the probability of selecting an observation at random from an N(100, 2500) that is (a) less than 75? (b) equal to 75?

5.3. For the Kentucky River data of table 2.1, what is the probability of a peak flow exceeding 100,000 cfs if the peaks are assumed to be normally distributed?

5.4. Construct the theoretical distribution for the data of exercise 2.2 if it is assumed that the data are normally distributed. From a visual comparison with the data histogram, would you say the data are normally distributed?

5.5. Work exercise 4.1 using the normal approximation to the binomial and plot the results on the histogram developed for exercise 4.1.

5.6. Show that if X is $N(\mu, \sigma^2)$ then Y = a + bX is $N(a + b\mu, b^2\sigma^2)$.

5.7. For a particular set of data the coefficient of variation is 0.4. If the data are normally distributed, what percent of the data will be less than 0.0?

5.8. A sample of 150 observations has a mean of 10,000, a standard deviation of 2,500 and is normally distributed. Plot a frequency histogram showing the number of observations expected in each interval.

5.9. The appendix contains a listing of the annual runoff from Cave Creek watershed near Fort Spring, Kentucky. What is the probability that the true mean annual runoff is less than 14.0 in. if one can assume the true variance is 22.56 in.$^2$? What other assumptions are needed?

5.10. Random digits are the numbers 0, 1, 2, ..., 9 selected in such a fashion that each is equally likely (i.e., has probability 1/10 of being selected). An experiment is performed by selecting 5 random digits, adding them together and calling their sum X. The experiment is repeated 10 times and $\bar{X}$ is calculated. What is the probability that $\bar{X}$ is less than 21.5? (Exercise 13.9 requires that this experiment be carried out.)

5.11. Plot the individual terms of the Poisson distribution for $\lambda = 2$. Approximate the Poisson by the normal and plot the normal approximations on the same graph.

5.12. Repeat exercise 5.11 for $\lambda = 9$.

5.13. Assume the data of exercise 4.21 are normally distributed. (a) Within each month what is the probability of 10 or more rainy days? (b) What is the probability of 20 or more rainy days in the July-August period? (c) What is the difference in assuming the data are normally distributed, and in assuming the data are binomially distributed and approximating the binomial with the normal?

5.14. Plot the observed frequency histogram and the frequency histogram expected from the normal distribution for the annual peak flows for the following rivers. Discuss how well the normal approximates the data in terms of the coefficient of variation and skewness. (Note: data are in the appendix or may be obtained from the Internet.)
a) North Llano River near Junction, Texas
b) Cumberland River at Cumberland Falls, Kentucky
c) Piscataquis River near Dover-Foxcroft, Maine

5.15. The occurrence of rainstorms is sometimes considered to be a Poisson process so that the time between rainstorms is exponentially distributed. If for a certain locality the mean of this exponential distribution is 10 days, what is the probability that the elapsed time for 15 storms to occur will exceed 120 days?

5.16. Lane and Osborn (1973) present the following data for the mean number of days with more than 0.10 inches of precipitation at Tombstone, Arizona. If the occurrence of more than 0.10 inches of rain in any month can be considered as an independent Poisson process, what is the probability of fewer than 30 days with more than 0.10 inches of rain in one year at Tombstone?

Month    No. of days    Month    No. of days
Jan.     2              July     7
Feb.     2              Aug.     7
Mar.     2              Sept.    3
Apr.     1              Oct.     2
May      0              Nov.     2
June     2              Dec.     2
                        Total    32

5.17. An experimenter is measuring the water level in an experimental towing channel. Because of waves and surges, a single measurement of the water level is known to be inaccurate. Past experience indicates the variance of these measurements is 0.0025 ft$^2$. How many independent observations are required to be 90% confident that the mean of all the measurements will be within 0.02 feet of the true water level?

5.18. At a certain location the annual precipitation is approximately normally distributed with a mean of 45 in. and a standard deviation of 15 in. Annual runoff can be approximated by R = -7.5 + 0.5P where R is annual runoff and P is annual precipitation. What is the mean and variance of annual runoff? What is the probability that the annual runoff will exceed 20 in.?

5.19. Plot a frequency distribution for a mixture of two normal distributions. Use as the first distribution an N(0, 1) and as the second an N(1, 1). Use as values for the mixing parameter 0.2, 0.5, and 0.8.

6. Continuous Probability Distributions

There are many continuous probability distributions in addition to the normal distribution. This chapter covers some of these distributions, methods for estimating their parameters, properties of the distributions, and potential applications for them. Further discussion on distribution selection is contained in chapter 7. Other books may be consulted for more detailed treatment of the various distributions (Kececioglu 1991). Rao and Hamed (2000) is particularly applicable to hydrology.

UNIFORM DISTRIBUTION

If a continuous random process is defined over an interval $\alpha$ to $\beta$ and the probability of an outcome of this process being in a subinterval of $\alpha$ to $\beta$ is proportional to the length of the subinterval, the process is said to be uniformly distributed over the interval $\alpha$ to $\beta$ (figure 6.1). The probability density function for the continuous uniform distribution is

$$p_X(x) = \frac{1}{\beta - \alpha} \quad \text{for } \alpha < x < \beta$$

and the cumulative distribution function is

$$P_X(x) = \frac{x - \alpha}{\beta - \alpha} \quad \text{for } \alpha < x < \beta$$

Fig. 6.1. Uniform distribution.

The mean and variance of the uniform distribution are

$$E(X) = \frac{\beta + \alpha}{2}$$

$$\mathrm{Var}(X) = \frac{(\beta - \alpha)^2}{12}$$

The skewness is zero since the distribution is symmetrical about the mean. The method of moments yields the following estimators for the parameters $\alpha$ and $\beta$:

$$\hat{\alpha} = \bar{X} - \sqrt{3}\, s \qquad \hat{\beta} = \bar{X} + \sqrt{3}\, s$$

The method of maximum likelihood when applied to the uniform distribution results in the estimators for $\alpha$ and $\beta$ being the smallest and largest sample values, respectively. That this is the case can be seen by writing out the likelihood function and then selecting those values of $\alpha$ and $\beta$ (within the constraints that $\alpha < x < \beta$ for all x) that maximize the function.

The uniform distribution finds its greatest application as the distribution of $P_X(x)$ for all probability density functions. That is, $P_X(X)$ is uniformly distributed over the interval 0 to 1 for any continuous probability distribution. This fact is used in generating random observations from some probability distributions.
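The use of the uniform distribution for generating random observations can be sketched as follows. A uniform value u on (0, 1) is treated as a value of $P_X(x)$ and the cumulative distribution is inverted to obtain x. The exponential distribution is used here only because its cumulative distribution inverts in closed form; the parameter value and the fixed u values are arbitrary assumptions chosen so that the example is repeatable.

```python
import math


def exponential_cdf(x, lam):
    """P_X(x) = 1 - exp(-lam x) for x > 0."""
    return 1.0 - math.exp(-lam * x)


def exponential_inverse_cdf(u, lam):
    """Solve u = 1 - exp(-lam x) for x."""
    return -math.log(1.0 - u) / lam


LAM = 0.5                          # hypothetical rate parameter
uniform_values = [0.1, 0.5, 0.9]   # stand-ins for uniform random numbers

observations = [exponential_inverse_cdf(u, LAM) for u in uniform_values]
recovered = [exponential_cdf(x, LAM) for x in observations]

print([round(x, 3) for x in observations])
print([round(u, 3) for u in recovered])
```

In practice the fixed list would be replaced by draws from a uniform random number generator such as `random.random()`.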

Example 6.1. Use the method of moments to estimate the parameters of the uniform distribution based on the following sample: 1, 4, 3, 4, 5, 6, 7, 6, 9, 5. What are the maximum likelihood estimators for this sample?

Solution: By method of moments

$\bar{X} = 5.00$ and s = 2.26

By maximum likelihood

$\hat{\alpha} = 1.00$ (smallest sample value)
$\hat{\beta} = 9.00$ (largest sample value)

Comment: This problem illustrates that the method of moments and the method of maximum likelihood do not always produce the same parameter estimates. In this case, the parameters estimated by moments are not reasonable since values of X outside the limits of $\hat{\alpha}$ and $\hat{\beta}$ are present in the sample. This is a common problem when the method of moments is used to estimate the parameters of the uniform distribution for small samples. Of course, for large samples neither the moment nor the maximum likelihood estimates will be "good" if the sample is not truly a random sample from a uniform distribution.
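The two sets of estimates in example 6.1 can be compared with the sketch below. The moment estimators $\bar{X} \mp \sqrt{3}\,s$ follow from equating the sample mean and variance to $(\alpha + \beta)/2$ and $(\beta - \alpha)^2/12$. Note that the sample as printed gives a standard deviation (n - 1 divisor) of about 2.21 rather than the quoted 2.26, so the computed limits may differ slightly from those in the original solution; variable names are illustrative.

```python
import math
import statistics

sample = [1, 4, 3, 4, 5, 6, 7, 6, 9, 5]

xbar = statistics.mean(sample)
s = statistics.stdev(sample)   # sample standard deviation, n - 1 divisor

# Method of moments
alpha_mom = xbar - math.sqrt(3.0) * s
beta_mom = xbar + math.sqrt(3.0) * s

# Maximum likelihood: smallest and largest sample values
alpha_ml = min(sample)
beta_ml = max(sample)

# Sample values falling outside the moment-estimated limits
outside = [x for x in sample if x < alpha_mom or x > beta_mom]

print(round(alpha_mom, 2), round(beta_mom, 2), alpha_ml, beta_ml, outside)
```

The list of values outside the moment limits is what makes the moment estimates unreasonable for this sample.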

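TRIANGULAR DISTRIBUTION

The triangular distribution shown in figure 6.2 is given by

$$p_X(x) = \begin{cases} \dfrac{2(x - \alpha)}{(\beta - \alpha)(\delta - \alpha)} & \alpha \le x \le \delta \\[2ex] \dfrac{2(\beta - x)}{(\beta - \alpha)(\beta - \delta)} & \delta \le x \le \beta \end{cases}$$

It is unlikely that any natural hydrologic process would exactly follow a triangular distribution. The distribution may be a reasonable approximation to the actual but unknown distribution of some hydrologic quantities. The triangular distribution has been used in simulation studies involving bounded random variables whose central tendencies are known. The mean, variance, and coefficient of skew of the triangular distribution are

$$E(X) = \frac{\alpha + \beta + \delta}{3}$$

$$\mathrm{Var}(X) = \frac{\alpha^2 + \beta^2 + \delta^2 - \alpha\beta - \alpha\delta - \beta\delta}{18}$$

$$\gamma = \frac{\sqrt{2}\,(\alpha + \beta - 2\delta)(2\alpha - \beta - \delta)(\alpha - 2\beta + \delta)}{5(\alpha^2 + \beta^2 + \delta^2 - \alpha\beta - \alpha\delta - \beta\delta)^{3/2}}$$

Fig. 6.2. Triangular distribution (here $\gamma$ is the $\delta$ of equation 6.6).

The parameter $\delta$ gives the mode of the triangular distribution. If $\delta$ is known, the parameters $\alpha$ and $\beta$ may be estimated based on the method of moments as

$$\hat{\alpha} = \frac{-B - \sqrt{B^2 - 4AC}}{2A} \qquad \hat{\beta} = \frac{-B + \sqrt{B^2 - 4AC}}{2A}$$

where

$$A = 3$$

$$B = 3\delta - 9\bar{X}$$

$$C = 9\bar{X}^2 - 9\delta\bar{X} + 3\delta^2 - 18s^2$$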

Some special cases of the triangular distribution yield the following estimators:

Mode    $\hat{\alpha}$    $\hat{\beta}$
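The moment estimators for the triangular distribution with known mode amount to solving a quadratic whose two roots are the lower and upper limits. The sketch below, with illustrative function names, builds the coefficients A, B, and C from the sample mean, sample variance, and mode, and checks them against a triangular distribution whose limits are known (the values 0, 10, and 3 are arbitrary assumptions).

```python
import math


def triangular_mean_var(alpha, beta, delta):
    """Mean and variance of a triangular distribution on (alpha, beta) with mode delta."""
    mean = (alpha + beta + delta) / 3.0
    var = (alpha ** 2 + beta ** 2 + delta ** 2
           - alpha * beta - alpha * delta - beta * delta) / 18.0
    return mean, var


def triangular_moment_estimates(xbar, s2, delta):
    """Solve A t^2 + B t + C = 0; the smaller root estimates alpha, the larger beta."""
    a = 3.0
    b = 3.0 * delta - 9.0 * xbar
    c = 9.0 * xbar ** 2 - 9.0 * delta * xbar + 3.0 * delta ** 2 - 18.0 * s2
    root = math.sqrt(b * b - 4.0 * a * c)
    return (-b - root) / (2.0 * a), (-b + root) / (2.0 * a)


# Round trip: start from known limits, compute moments, recover the limits.
mean, var = triangular_mean_var(0.0, 10.0, 3.0)
alpha_hat, beta_hat = triangular_moment_estimates(mean, var, delta=3.0)
print(round(alpha_hat, 6), round(beta_hat, 6))
```

With sample data, `xbar` and `s2` would be the sample mean and variance, and the discriminant should be checked for being non-negative before taking the square root.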

A treatment of a generalized triangular distribution is contained in chapter 16 beginning with equation 16.59.

EXPONENTIAL DISTRIBUTION

In chapter 4 it was shown that the exponential distribution arises as the probability distribution of the time between occurrences of events of a Poisson process. Among other things, the exponential distribution has been used as the distribution of the time between rainfall events in stochastic precipitation models. The exponential density function is given by

$$p_X(x) = \lambda e^{-\lambda x} \qquad x > 0,\ \lambda > 0$$

and the cumulative exponential by

$$P_X(x) = 1 - e^{-\lambda x} \qquad x > 0$$

The mean and variance of the exponential distribution are

$$E(X) = 1/\lambda \qquad \mathrm{Var}(X) = 1/\lambda^2$$

The coefficient of skew is a constant, 2, indicating the exponential is skewed to the right for all values of $\lambda$. The curve labeled $\eta = 1$ in figure 6.4 is an exponential distribution with $\lambda = 1$. Examples 3.2 and 3.4 demonstrated that when either the method of moments or maximum likelihood is used for parameter estimation, the result is

$$\hat{\lambda} = 1/\bar{X}$$

or the parameter $\lambda$ may be estimated by the reciprocal of the sample mean.

Example 6.2. Haan and Johnson (1967) studied the physical characteristics of depressions in north-central Iowa. The data tabulated below show the number of depressions falling into various classes based on the surface area of the depression. Plot a relative frequency histogram of the data. Superimpose on the histogram the best fitting exponential distribution. Estimate the probability that a depression selected at random will have an area greater than 2.25 acres.

Area (acres), class midpoint    No. of depressions
0.25                            106
0.75                            36
1.25                            18
1.75                            9
2.25                            12
2.75                            2
3.25                            5
3.75                            1
4.25                            4
4.75                            5
5.25                            2
5.75                            6
6.25                            3
6.75                            1
7.25                            1
7.75                            1
Total                           212


Solution: The relative frequencies are computed by dividing the number of depressions in each class by the total number of depressions. The best fitting exponential is estimated by using equation 6.15 to estimate the exponential parameter $\lambda$. $\bar{X}$ is calculated from equation 3.16 as 1.27 acres. Then $\hat{\lambda} = 1/\bar{X} = 0.787$. The expected relative frequency in each class is then calculated from equation 2.25b as

$$f_{x_i} = \Delta x_i \, p_A(x_i)$$

where $x_i$ is the midpoint of the class interval, $\Delta x_i = 1/2$, and $p_A(x_i)$ is the exponential distribution of area given by

$$p_A(x_i) = \hat{\lambda} e^{-\hat{\lambda} x_i}$$

Therefore

$$f_{x_i} = (1/2)\, 0.787\, e^{-0.787 x_i}$$

For example, for the second class interval

$$f_{0.75} = 0.393\, e^{-0.787(0.75)} = 0.22$$

compared to an observed value of 36/212, or 0.17. The estimated probability that a depression will have an area in excess of 2.25 acres is

$$\mathrm{prob}(A > 2.25) = 1 - P_A(2.25) = e^{-0.787(2.25)} = 0.170$$

The observed fraction of depressions with areas in excess of 2.25 acres is 31/212, or 0.146.
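The calculations of example 6.2 can be laid out as a short sketch. The counts are those tabulated in the example; the class midpoints (0.25 to 7.75 acres in steps of 0.5) are taken from the axis of figure 6.3 and are otherwise an assumption. The code keeps the unrounded mean, so its value of the rate parameter differs slightly from the 0.787 obtained with the rounded mean of 1.27.

```python
import math

counts = [106, 36, 18, 9, 12, 2, 5, 1, 4, 5, 2, 6, 3, 1, 1, 1]
midpoints = [0.25 + 0.5 * i for i in range(len(counts))]   # acres
WIDTH = 0.5                                                # class width, acres

total = sum(counts)
mean_area = sum(m * c for m, c in zip(midpoints, counts)) / total
lam = 1.0 / mean_area        # estimate of lambda: reciprocal of the sample mean

observed = [c / total for c in counts]
expected = [WIDTH * lam * math.exp(-lam * m) for m in midpoints]

# prob(area > 2.25) = 1 - P_A(2.25) = exp(-lam * 2.25)
p_exceed = math.exp(-lam * 2.25)

print(total, round(mean_area, 3), round(lam, 3))
print(round(expected[1], 3), round(observed[1], 3), round(p_exceed, 3))
```

The second line of output gives the expected and observed relative frequencies for the second class and the estimated exceedance probability, for comparison with 0.22, 0.17, and the result in the text.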

Fig. 6.3. Observed and expected (according to the exponential distribution) number of depressions in various size categories for example 6.2 (horizontal axis: area in acres, class midpoints 0.25 to 7.75).

GAMMA DISTRIBUTION

The distribution of the sum of n exponentially distributed random variables each with parameter $\lambda$ is a gamma distribution with parameters $\eta = n$ and $\lambda$. In general, $\eta$ does not have to be an integer. A comprehensive treatment of the gamma distribution and other distributions in the gamma family of distributions is given by Bobee and Ashkar (1991). The gamma density function is given by

$$p_X(x) = \frac{\lambda^{\eta} x^{\eta - 1} e^{-\lambda x}}{\Gamma(\eta)} \qquad x, \lambda, \eta > 0$$

$\Gamma(\eta)$ is the gamma function having the properties

$$\Gamma(\eta) = (\eta - 1)! \qquad \text{for } \eta = 1, 2, 3, \ldots$$

$$\Gamma(\eta + 1) = \eta\,\Gamma(\eta) \qquad \text{for } \eta > 0$$

$$\Gamma(\eta) = \int_0^{\infty} t^{\eta - 1} e^{-t}\, dt \qquad \text{for } \eta > 0$$

The mean, variance and coefficient of skew for the gamma distribution are

$$E(X) = \eta/\lambda \qquad \mathrm{Var}(X) = \eta/\lambda^2 \qquad \gamma = 2/\sqrt{\eta}$$

The gamma distribution is positively skewed with $\gamma$ decreasing as $\eta$ increases. Plots of the distribution for various values of $\eta$ and $\lambda$ are shown in figure 6.4. A wide variety of shapes ranging from reverse J-shaped for $\eta < 1$ to single peaked with the peak (mode) at $x = (\eta - 1)/\lambda$ for $\eta > 1$ can be produced by the gamma density function. Changing $\lambda$ and holding $\eta$ constant changes the scale of the distribution, whereas changing $\eta$ and holding $\lambda$ constant changes the shape of the distribution. Thus, $\lambda$ and $\eta$ are sometimes known as scale and shape parameters. The cumulative gamma distribution is

$$P_X(x) = \int_0^x \frac{\lambda^{\eta} t^{\eta - 1} e^{-\lambda t}}{\Gamma(\eta)}\, dt$$

If $\eta$ is an integer, the cumulative gamma distribution is given by (Mood et al. 1974)

$$P_X(x) = 1 - e^{-\lambda x} \sum_{j=0}^{\eta - 1} \frac{(\lambda x)^j}{j!}$$

Some computer spreadsheets will evaluate $P_X(x)$ for the gamma distribution.

Fig. 6.4. Gamma distribution with several values for $\eta$ and $\lambda$.

The exponential distribution is a special case of the gamma distribution with $\eta = 1$. If X and Y are independent gamma random variables with parameters $\eta_1$, $\lambda$ and $\eta_2$, $\lambda$ respectively, then Z = X + Y is a gamma variable with parameters $\eta = \eta_1 + \eta_2$ and $\lambda$. This can be extended to the sum of any number of independent gamma random variables having a common parameter $\lambda$. It is an expected result because in chapter 4 the gamma distribution was shown to arise as the distribution of the sum of n independent exponential random variables. The moment estimators for the parameters of the gamma distribution result from equations 6.18 and 6.19 as

$$\hat{\lambda} = \bar{X}/s^2 \qquad \hat{\eta} = \bar{X}^2/s^2$$

The maximum likelihood estimators for $\lambda$ and $\eta$ are given by

$$\hat{\lambda} = \hat{\eta}/\bar{X}$$

$$\ln \hat{\eta} - \psi(\hat{\eta}) = \ln(\bar{X}/\bar{X}_g)$$

where $\bar{X}_g$ is the sample geometric mean and $\psi(x) = d \ln \Gamma(x)/dx$ is the psi-function. Thom (1958) has proposed an approximate relationship based on the truncation of a series expansion of the maximum likelihood estimator for $\eta$ given by

$$\hat{\eta} = \frac{1 + \sqrt{1 + 4y/3}}{4y} - \Delta\hat{\eta}$$

where y is $\ln \bar{X} - \overline{\ln X}$, $\Delta\hat{\eta}$ is a correction term arising because of the truncation, and $\overline{\ln X}$ is the mean natural logarithm of the observations.

Table 6.1. Correction factor for the maximum likelihood estimator for the parameter $\eta$ of the gamma distribution

Table 6.1 contains the values of $\Delta\hat{\eta}$ for $\hat{\eta}$ ranging from 0.2 to 5.6. For $\hat{\eta} > 5.6$ the correction is negligible (as it is anyway for many practical situations regardless of the value of $\hat{\eta}$). The procedure for finding the correction factor is to assume that $\hat{\eta}$ is equal to the first term of equation 6.24 and use the $\Delta\hat{\eta}$ from table 6.1 corresponding to this initial estimate for $\hat{\eta}$. The parameter $\lambda$ is then estimated by

$$\hat{\lambda} = \hat{\eta}/\bar{X}$$

Thom (1958) states that for $\eta < 10$ the method of moments produces unacceptable estimates for both $\lambda$ and $\eta$. For $\eta$ near 1 the method of moments uses only 50% of the sample information for estimating $\lambda$ and only 40% for $\eta$. This means the maximum likelihood estimators would do as well with one half the number of observations. Greenwood and Durand (1960) present the following rational fraction approximations for the maximum likelihood estimators

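$$\hat{\eta} = \frac{0.5000876 + 0.1648852y - 0.0544274y^2}{y} \qquad \text{for } 0 \le y \le 0.5772$$

and

$$\hat{\eta} = \frac{8.898919 + 9.059950y + 0.9775373y^2}{y(17.79728 + 11.968477y + y^2)} \qquad \text{for } 0.5772 \le y \le 17.0$$

where

$$y = \ln \bar{X} - \overline{\ln X}$$

$\lambda$ is then estimated from equation 6.25. Greenwood and Durand (1960) state that the maximum error in equation 6.26 is 0.0088% and in equation 6.27 is 0.0054%.

Equations 6.24-6.27 produce estimates for $\eta$ and $\lambda$ that have a slight asymptotic bias. For small samples the bias may be appreciable (Shenton and Bowman 1970). Bowman and Shenton (1968) present the following approximate relationship for estimating the bias in the parameter $\eta$ when equations 6.24-6.27 are used.

$$E(\hat{\eta} - \eta) \approx \frac{3\eta - 0.677 + 0.111/\eta + 0.032/\eta^2}{n - 3} \qquad \text{for } n \ge 4 \text{ and } \eta \ge 1$$

where $E(\hat{\eta} - \eta)$ is the bias in $\hat{\eta}$ with error of less than 1.4%. The result of using this relationship for estimating the bias in $\hat{\eta}$ for a sample size n from a gamma distribution having a population parameter of $\eta = 2$ is shown in figure 6.5. In practice, equation 6.29 can be used to correct $\hat{\eta}$ for bias. If the population $\eta$ were known, there would of course be no need for estimating $\eta$. Bowman and Shenton (1968) suggest that the bias in $\hat{\eta}$ can be approximated from

$$E(\hat{\eta} - \eta) \approx 3\hat{\eta}/n$$

which yields the bias-corrected estimate

$$\hat{\eta}_{\text{corrected}} = \frac{(n - 3)\hat{\eta}}{n}$$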

The gamma distribution has been widely used in hydrology (Bobee and Ashkar 1991). Rainfall probabilities for durations of days, weeks, months, and years have been estimated by the gamma distribution (Barger and Thom 1949; Barger, Shaw and Dale 1959; Friedman and Janes 1957; Mooley and Crutcher 1968). Annual runoff (Markovic 1965) has been described by the gamma distribution.

Fig. 6.5. Expected bias in $\hat{\eta}$ for the gamma distribution with $\eta = 2$ (horizontal axis: sample size n).

Example 6.3. The annual water yield for Cave Creek near Fort Spring, Kentucky (USGS # 03288500) is shown in the following table. Estimate the parameters of the gamma distribution for these data using both the method of moments and the method of maximum likelihood. Assuming the data follow a gamma distribution, estimate the probability of an annual water yield exceeding 20.00 inches.

Year    Annual Runoff (inches)

Solution:

Method of Moments

Method of Maximum Likelihood (Thom procedure)

Method of Maximum Likelihood (Greenwood and Durand procedure)

Thus, the maximum likelihood estimators are $\hat{\lambda} = 0.485$ and $\hat{\eta} = 7.107$. These estimates may be corrected for bias using either equation 6.29 or 6.30. If 6.30 is used

$$\hat{\eta} = \frac{(18 - 3)(7.107)}{18} = 5.922$$

Note that $E(\hat{\eta} - \eta) = E(\hat{\eta}) - E(\eta) = 7.107 - 5.922 = 1.185$.

If $\eta = 5.922$ is substituted into equation 6.29, the result is $E(\hat{\eta} - \eta) = 1.141$, which is in good agreement with the 1.185 produced by equation 6.30. The final estimate for $\eta$ is now $\hat{\eta} = 5.922$ and $\hat{\lambda} = \hat{\eta}/\bar{X} = 0.404$.

Using the method of moments the parameter estimates are $\hat{\eta} = 9.513$ and $\hat{\lambda} = 0.649$, whereas the maximum likelihood estimates are $\hat{\eta} = 5.922$ and $\hat{\lambda} = 0.404$. Following the recommendation of Thom (1958), the latter estimates will be used in estimating the probability of an annual water yield in excess of 20.00 inches.

$$P_X(20.00) = \int_0^{20} \frac{\lambda^{\eta} x^{\eta - 1} e^{-\lambda x}}{\Gamma(\eta)}\, dx = 0.824$$

Thus $1 - P_X(20.00)$ is 0.176, which is the desired probability. The prob(yield > 20.00) = 0.176 if the annual water yield follows a gamma distribution with parameters $\eta = 5.922$ and $\lambda = 0.404$. In these calculations Microsoft Excel 97 was used to evaluate the gamma distribution.

Comment: If the moment parameter estimates had been used, the resulting probability would have been 0.132, which is reasonably close to 0.176. This is because $\eta$ is reasonably close to the 10.00 that Thom (1958) suggested is the smallest value of $\eta$ for which the method of moments results in good parameter estimates. For these data $C_s = 2/\sqrt{\hat{\eta}} = 0.82$, so that the distribution is moderately skewed to the right. If the normal distribution had been used to estimate prob(X > 20.00), the result would have been 0.126, which again is a reasonable approximation. However, if the annual water yield with a return period of 100 years or a 1% chance of being exceeded is evaluated by the gamma with $\eta = 5.922$ and $\lambda = 0.404$ and by the normal with $\mu = 14.56$ and $\sigma = 4.75$, the results are 32.2 inches and 25.6 inches, again showing the sensitivity of estimates of rare events to the distributional assumption even though in the main body of the distribution the agreement is good.

Generally, 18 observations are not enough to make reliable probability estimates or to determine the proper probability distribution to use. It is a small enough number that one can follow through all of the needed calculations for this example in a short time on a desk calculator, however. The fact that the gamma and normal estimates differ greatly for these data at large return periods does not mean the gamma (or the normal) is a better approximation for the data. This question will be taken up later. Exercise 6.21 should be consulted for another approximate solution to this example.
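The closing numbers of example 6.3 can be checked without a spreadsheet. The sketch below is an illustration under stated assumptions: it evaluates the cumulative gamma distribution with a standard power series for the incomplete gamma function (a substitute for the spreadsheet function used in the text), finds the 1% exceedance yield by bisection, and takes the parameter values directly from the example. The normal quantile factor 2.326 for a 1% exceedance probability is a standard table value.

```python
import math


def gamma_cdf(x, eta, lam):
    """P_X(x) for the density lam^eta x^(eta-1) exp(-lam x) / Gamma(eta), by series."""
    if x <= 0.0:
        return 0.0
    t = lam * x
    term = 1.0 / eta
    total = term
    k = 0
    while term > 1e-16 * total:
        k += 1
        term *= t / (eta + k)   # next term: t^k / (eta (eta+1) ... (eta+k))
        total += term
    return math.exp(eta * math.log(t) - t - math.lgamma(eta)) * total


def gamma_quantile(p, eta, lam):
    """Value x with P_X(x) = p, found by bisection."""
    lo, hi = 0.0, eta / lam
    while gamma_cdf(hi, eta, lam) < p:
        hi *= 2.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if gamma_cdf(mid, eta, lam) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


n = 18
eta_ml = 7.107
eta_corrected = (n - 3) * eta_ml / n      # bias correction with bias of about 3*eta/n

ETA, LAM = 5.922, 0.404                   # final estimates quoted in the example
p_exceed_20 = 1.0 - gamma_cdf(20.0, ETA, LAM)
yield_100yr_gamma = gamma_quantile(0.99, ETA, LAM)
yield_100yr_normal = 14.56 + 2.326 * 4.75

print(round(eta_corrected, 3), round(p_exceed_20, 3))
print(round(yield_100yr_gamma, 1), round(yield_100yr_normal, 1))
```

The gap between the two 100-year values, compared with the close agreement at 20 inches, is the tail sensitivity that the comment emphasizes.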

LOGNORMAL DISTRIBUTION

The Central Limit Theorem was used in deriving the general result that if a random variable X is made up of the sum of many small effects, then X might be expected to be normally distributed. Similarly, if X is equal to the product of many small effects, that is if $X = X_1 X_2 \cdots X_n$, then the logarithm of X, ln X, can be expected to be normally distributed. This can be seen by letting Y = ln X so that $Y = \ln(X_1 X_2 \cdots X_n) = \ln X_1 + \ln X_2 + \cdots + \ln X_n$. Because the $X_i$ are random variables, the $\ln X_i$ are also random variables and Y = ln X is a random variable made up from the sum of many other random variables. From the Central Limit Theorem, Y can be expected to be normally distributed with mean $\mu_y$ and variance $\sigma_y^2$:

$$p_Y(y) = \frac{1}{\sigma_y\sqrt{2\pi}} \exp\left[-\frac{(y - \mu_y)^2}{2\sigma_y^2}\right] \qquad (6.31)$$

The distribution of X can be found from

$$p_X(x) = p_Y(y)\left|\frac{dy}{dx}\right|$$

Because Y = ln X

$$\left|\frac{dy}{dx}\right| = \frac{1}{x} \qquad x > 0$$

and

$$p_X(x) = \frac{1}{x\sigma_y\sqrt{2\pi}} \exp\left[-\frac{(\ln x - \mu_y)^2}{2\sigma_y^2}\right] \qquad x > 0 \qquad (6.32)$$

Note that equation 6.31 gives the distribution of Y as a normal distribution with mean $\mu_y$ and variance $\sigma_y^2$. Equation 6.32 gives the distribution of X as the lognormal distribution with parameters $\mu_y$ and $\sigma_y^2$. Y = ln X is normally distributed while X is lognormally distributed.
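The change of variable from Y = ln X to X can be sketched in a few lines. The density of X is the normal density of ln x divided by x, and because the logarithm is monotonic, the cumulative probability of X at x equals that of Y at ln x. The parameter values and integration limits below are hypothetical, chosen only to check that the density integrates to the difference in cumulative probabilities.

```python
import math


def normal_pdf(y, mu, sigma):
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def normal_cdf(y, mu, sigma):
    return 0.5 * (1.0 + math.erf((y - mu) / (sigma * math.sqrt(2.0))))


def lognormal_pdf(x, mu_y, sigma_y):
    """p_X(x) = p_Y(ln x) / x for x > 0."""
    return normal_pdf(math.log(x), mu_y, sigma_y) / x


def lognormal_cdf(x, mu_y, sigma_y):
    """prob(X <= x) = prob(Y <= ln x)."""
    return normal_cdf(math.log(x), mu_y, sigma_y)


MU_Y, SIGMA_Y = 0.0, 0.5      # hypothetical parameters of Y = ln X
a, b, steps = 0.5, 3.0, 2000  # hypothetical limits and grid for the check

# Trapezoidal integration of the lognormal density between a and b.
h = (b - a) / steps
area = 0.5 * (lognormal_pdf(a, MU_Y, SIGMA_Y) + lognormal_pdf(b, MU_Y, SIGMA_Y))
for i in range(1, steps):
    area += lognormal_pdf(a + i * h, MU_Y, SIGMA_Y)
area *= h

cdf_difference = lognormal_cdf(b, MU_Y, SIGMA_Y) - lognormal_cdf(a, MU_Y, SIGMA_Y)
print(round(area, 5), round(cdf_difference, 5))
```

The two printed values should agree closely, which confirms that the 1/x factor is what turns the normal density of Y into a proper density for X.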

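The parameters $\mu_y$ and $\sigma_y^2$ can be estimated by $\bar{Y}$ and $S_y^2$ in the usual manner by first transforming all of the $X_i$'s to $Y_i$'s by

$$Y_i = \ln X_i$$

then

$$\bar{Y} = \frac{\sum Y_i}{n}$$

and

$$S_y^2 = \frac{\sum (Y_i - \bar{Y})^2}{n - 1}$$

with all of the summations from 1 to n. If a digital computer is used the above equations are easily applied. $\bar{Y}$ and $S_y^2$ may be determined without taking the logarithms of all of the data from

$$\bar{Y} = \frac{1}{2}\ln\left[\frac{\bar{X}^2}{C_v^2 + 1}\right], \qquad S_y^2 = \ln(C_v^2 + 1)$$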

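where $C_v$ is the coefficient of variation of the original data ($C_v = S_x/\bar{X}$). These relationships are not general results but depend on the data being lognormally distributed. The mean, variance, and coefficient of variation of the lognormal distribution are

$$\mu_x = \exp(\mu_y + \sigma_y^2/2)$$

$$\sigma_x^2 = \mu_x^2\,[\exp(\sigma_y^2) - 1]$$

$$C_v = [\exp(\sigma_y^2) - 1]^{1/2}$$

The coefficient of skew of the X's is

$$\gamma = 3C_v + C_v^3$$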
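The relationships between the moments of X and the parameters of ln X can be sketched in code. This is a minimal illustration, assuming the standard lognormal moment relations $\sigma_y^2 = \ln(C_v^2 + 1)$, $\mu_y = \tfrac{1}{2}\ln[\mu_x^2/(C_v^2+1)]$ and $\gamma = 3C_v + C_v^3$; the function names and the numerical inputs (a mean of 100 and a standard deviation of 50) are hypothetical.

```python
import math


def lognormal_params_from_moments(mean_x, sd_x):
    """Return (mean, variance) of ln X from the mean and standard deviation of X."""
    cv = sd_x / mean_x
    var_y = math.log(cv ** 2 + 1.0)
    mean_y = 0.5 * math.log(mean_x ** 2 / (cv ** 2 + 1.0))
    return mean_y, var_y


def lognormal_moments(mean_y, var_y):
    """Return (mean, standard deviation, C_v, skew) of X from the parameters of ln X."""
    mean_x = math.exp(mean_y + var_y / 2.0)
    cv = math.sqrt(math.exp(var_y) - 1.0)
    skew = 3.0 * cv + cv ** 3
    return mean_x, mean_x * cv, cv, skew


# Hypothetical data summary: mean 100, standard deviation 50 (C_v = 0.5).
mean_y, var_y = lognormal_params_from_moments(100.0, 50.0)
mean_x, sd_x, cv, skew = lognormal_moments(mean_y, var_y)

print(f"mean of ln X = {mean_y:.4f}, variance of ln X = {var_y:.4f}")
print(f"recovered mean = {mean_x:.2f}, sd = {sd_x:.2f}, skew = {skew:.3f}")
```

The round trip recovers the original mean and standard deviation, and the skew depends only on $C_v$.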

Thus, the lognormal distribution is positively skewed with the skew decreasing as the coefficient of variation decreases. Based on the properties of the normal distribution, the skewness of the logarithms of lognormal data is zero.

Tables of the standard normal distribution can be used to evaluate the lognormal distribution. From equation 6.32 we have $p_X(x) = p_Y(y)/x$. But $p_Y(y)$ is a normal density function. From equation 5.7, $p_Y(y) = p_Z(z)/s_y$ or

$$p_X(x) = \frac{p_Z(z)}{x\,s_y}$$

The prob(X ≤ x) is equal to the prob(Y ≤ y) because Y = ln X is a monotonic, single valued function. Since Y is normally distributed, prob(Y ≤ y) = prob(Z ≤ z) where

$$z = \frac{y - \bar{y}}{s_y}$$

Therefore, standard normal tables can be used with the proper transformations to evaluate $p_X(x)$ and $P_X(x)$ for the lognormal distribution.

Certain reproductive properties of the lognormal follow directly from the reproductive properties of the normal distribution. For example, if X is lognormally distributed then $Y = aX^b$ is lognormally distributed with $\mu_{\ln Y} = \ln a + b\,\mu_{\ln X}$ and

$$\sigma^2_{\ln Y} = b^2 \sigma^2_{\ln X}$$

This follows from the fact that ln Y = ln a + b ln X, ln X is normally distributed, and ln Y is a linear function of ln X and so is also normally distributed. Thus Y is lognormally distributed. This can be extended so that if $X_1, X_2, \ldots, X_n$ are independent and lognormally distributed, then $Y = X_1^{b_1} X_2^{b_2} \cdots X_n^{b_n}$ is lognormally distributed with

$$\mu_{\ln Y} = \sum b_i\, \mu_{\ln X_i}$$

and

$$\sigma^2_{\ln Y} = \sum b_i^2\, \sigma^2_{\ln X_i}$$

Two special cases of the above are Z = XY and Z = X/Y. If X and Y are independently and lognormally distributed, then Z is lognormally distributed with its mean and variance easily determined from equations 6.43 and 6.44.

Because of its simplicity, its ready availability in tables for its evaluation, and the fact that many hydrologic variables are bounded by zero on the left and positively skewed, the lognormal distribution has received wide usage in hydrology.

Example 6.4. Use the lognormal distribution and calculate the expected relative frequency for the third class interval of the data in table 5.1.

Solution: The expected relative frequency according to the lognormal distribution is

The evaluation of $p_X(x)$ from equation 6.41 requires an estimate for $\mu_y$ and $\sigma_y$. These are estimated from equations 6.35 and 6.36.

or the expected relative frequency in the interval 40,000 to 50,000 according to the lognormal distribution is

Example 6.5. Assume the data of table 5.1 follow the lognormal distribution. Calculate the magnitude of the 100-year peak flow.

Solution: The 100-year peak flow corresponds to a prob(X > x) of 0.01. X must be evaluated such that $P_X(x) = 0.99$. This can be accomplished by evaluating Z such that $P_Z(z) = 0.99$ and then transforming to X. From the standard normal tables the value of Z corresponding to $P_Z(z)$ of 0.99 is 2.326. From equation 6.37, with the values of $s_y$ and $\bar{y}$ given in example 6.4,

$$y = s_y z + \bar{y} = 0.326(2.326) + 11.0524 = 11.811$$

$$x = \exp(y) = 134{,}683 \text{ cfs}$$

The 100-year peak flow according to the lognormal distribution is about 134,700 cfs.
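The calculation in example 6.5 can be sketched as a small function. This is an illustrative sketch using the Python standard library; the function name `lognormal_quantile` is hypothetical, and $\bar{y}$ and $s_y$ are the values quoted in the example. The standard normal quantile is computed rather than read from a table, so the result differs slightly from the hand calculation.

```python
import math
from statistics import NormalDist

# Mean and standard deviation of the logarithms, as quoted in the example.
y_bar = 11.0524
s_y = 0.326


def lognormal_quantile(return_period_years, y_bar, s_y):
    """Magnitude with the given return period under a lognormal assumption."""
    p_nonexceed = 1.0 - 1.0 / return_period_years  # P_X(x) = 1 - 1/T
    z = NormalDist().inv_cdf(p_nonexceed)          # standard normal quantile
    return math.exp(y_bar + s_y * z)               # back-transform from logs


q100 = lognormal_quantile(100.0, y_bar, s_y)
print(f"100-year peak flow: about {q100:,.0f} cfs")
```

The same function gives other return periods directly, for example `lognormal_quantile(50.0, y_bar, s_y)`.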

EXTREME VALUE DISTRIBUTIONS

Often, interest exists in extreme events such as the maximum peak discharge of a stream or minimum daily flows. The extreme value of a set of random variables is also a random variable. The probability distribution of this extreme value random variable will in general depend on the sample size and the parent distribution from which the sample was obtained. Hahn and Shapiro (1967), Ang and Tang (1984), Kececioglu (1991), Rao and Hamed (2000), and Benjamin and Cornell (1970) contain very readable treatments of some of the extreme value distributions.


Consider a random sample of size n consisting of $x_1, x_2, \ldots, x_n$. Let Y be the largest of the sample values. Let $P_Y(y)$ be the prob(Y ≤ y) and $P_{X_i}(x)$ be the prob($X_i$ ≤ x). Let $p_Y(y)$ and $p_{X_i}(x)$ be the corresponding probability density functions. $P_Y(y)$ = prob(Y ≤ y) = prob(all of the x's ≤ y). If the x's are independently and identically distributed we have

$$P_Y(y) = [P_X(y)]^n \qquad (6.45)$$

$$p_Y(y) = \frac{dP_Y(y)}{dy} = n\,[P_X(y)]^{n-1}\, p_X(y) \qquad (6.46)$$

Therefore, the probability distribution of the maximum of n independently and identically distributed random variables depends on the sample size n and the parent distribution $P_X(x)$ of the sample. A similar result can be derived for the distribution of the smallest of n independently and identically distributed random variables.

Example 6.6. Assume that the time between rains follows an exponential distribution with a mean of 4 days. Also assume that the time between rains is independent from one rain to the next. Irrigators may be interested in the maximum time between rains. Over a period of 10 rains, what is the probability that the maximum time between rains exceeds 8 days?

Solution: 10 rains means 9 interrain periods, or n = 9. From equation 6.45 the probability that the maximum interrain time is less than 8 days is

$$P_Y(8) = [P_X(8)]^9$$

In this example, $P_X(y)$ is the cumulative exponential with parameter $\lambda = 1/\bar{x} = 1/4$.

$$P_X(8) = 1 - \exp(-8/4) = 0.865, \qquad P_Y(8) = (0.865)^9 = 0.271$$

Therefore, the probability that the maximum interrain time will be greater than 8 days is 1 - 0.271 = 0.729.

Comment: The probability density function for the maximum interrain time is, from equation 6.46,

$$p_Y(y) = n\,[1 - \exp(-\lambda y)]^{n-1}\, \lambda \exp(-\lambda y)$$
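The distribution of the largest of n exponential variates in example 6.6 is easy to evaluate numerically. The sketch below assumes the relations $P_Y(y) = [P_X(y)]^n$ and $p_Y(y) = n[P_X(y)]^{n-1}p_X(y)$ for independent, identically distributed values; the function names are illustrative. Because no intermediate rounding is done, the result (about 0.730) differs in the third decimal from the hand calculation.

```python
import math


def exponential_cdf(x, rate):
    """Cumulative exponential distribution, P_X(x) = 1 - exp(-rate * x)."""
    return 1.0 - math.exp(-rate * x) if x > 0.0 else 0.0


def max_cdf(y, n, rate):
    """prob(largest of n i.i.d. exponential values <= y)."""
    return exponential_cdf(y, rate) ** n


def max_pdf(y, n, rate):
    """Density of the largest of n i.i.d. exponential values."""
    if y <= 0.0:
        return 0.0
    return n * exponential_cdf(y, rate) ** (n - 1) * rate * math.exp(-rate * y)


rate = 1.0 / 4.0   # mean time between rains is 4 days
n = 9              # 10 rains give 9 interrain periods

p_max_below_8 = max_cdf(8.0, n, rate)
p_max_above_8 = 1.0 - p_max_below_8

print(f"prob(max interrain time <= 8 days) = {p_max_below_8:.3f}")
print(f"prob(max interrain time >  8 days) = {p_max_above_8:.3f}")
```

Evaluating `max_pdf` over a grid of y for several n gives the family of density curves for the largest sample value.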

The density of the maximum interrain time is plotted in figure 6.6 for various values of n. Note that for even moderately large n, the probability is very high that the extreme value (longest interrain time) will be from the tail of the parent (exponential) distribution.

Fig. 6.6. Distribution of the largest sample value from a sample of size n from an exponential distribution.

Frequently the parent distribution from which the extreme is an observation is not known and cannot be determined. If the sample size is large, use can be made of certain general asymptotic results that depend on limited assumptions concerning the parent distribution to find the distribution of extreme values. Much of the work on extreme value distributions is due to Gumbel (1954, 1958). Three types of asymptotic distributions have been developed based on different (but not all) parent distributions. The types are:

a. Type I: parent distribution unbounded in direction of the desired extreme and all moments of the distribution exist (exponential type distributions).
b. Type II: parent distribution unbounded in direction of the desired extreme and all moments of the distribution do not exist (Cauchy type distributions).
c. Type III: parent distribution bounded in the direction of the desired extreme (limited distributions).

Interest may exist in either the distribution of the largest or smallest extreme values. Examples of parent distributions falling under the various types are:

a. Type I, extreme value largest: normal, lognormal, exponential, gamma
b. Type I, extreme value smallest: normal
c. Type II, extreme value largest or smallest: Cauchy distribution (Hahn and Shapiro 1967; Thomas 1971)
d. Type III, extreme value largest: beta distribution (Hahn and Shapiro 1967; Gibra 1973; Benjamin and Cornell 1970)
e. Type III, extreme value smallest: beta, lognormal, gamma, exponential

The type II or Cauchy type extreme value distributions have found little application in hydrology. The distribution of the largest extreme value in hydrology generally arises as a type I extreme value largest distribution because most hydrologic variables are unbounded on the right. (See Van Montfort [1970] for a test to determine whether a type I or type II extreme value largest best fits the observed data.) The distribution of extreme value smallest commonly found in hydrologic work is the type III extreme value smallest since many hydrologic variables are bounded on the left by zero. The following is a treatment of these two (type I largest and type III smallest) extreme value distributions plus the type I smallest because of its symmetry with the type I largest.

Extreme Value Type I

The type I extreme value has been referred to as Gumbel's extreme value distribution, the extreme value distribution, the Fisher-Tippett type I distribution, and the double exponential distribution. The type I asymptotic distribution for maximum (minimum) values is the limiting model as n approaches infinity for the distribution of the maximum (minimum) of n independent values from an initial distribution whose right (left) tail is unbounded and which is an exponential type; that is, the initial cumulative distribution approaches unity (zero) with increasing (decreasing) values of the random variable at least as fast as the exponential distribution approaches unity. The normal, lognormal, exponential, and gamma distributions all meet this requirement for maximum values while the normal distribution satisfies the requirement for minimum values. The type I extreme value distribution has been used for rainfall depth-duration-frequency studies (Hershfield 1961) and as the distribution of the yearly maximum of daily and peak river flows.
Gumbel (1958) states that this latter application assumes 1) the distribution of daily discharges (the parent distribution) is of the exponential type, 2) n = 365 is a sufficiently large sample, and 3) the daily discharges are independent. Gumbel states that the first and second assumptions cannot be checked because the analytical form of the distribution of discharges is unknown and that the third assumption is clearly not true so that the number of independent observations is something less than 365. In spite of violating the last assumption, experience with the type I for the maximum of daily discharges has been reasonably good. Maximum annual flood peaks would more nearly fulfill assumption 3 although the effective sample size would be much less than 365.

The probability density function for the type I extreme value distribution is

$$p_X(x) = \frac{1}{\alpha}\exp\left[\mp\frac{x-\beta}{\alpha} - \exp\left(\mp\frac{x-\beta}{\alpha}\right)\right]$$

where the - applies for maximum values and the + for minimum values. The parameters $\alpha$ and $\beta$ are scale and location parameters with $\beta$ being the mode of the distribution. The type I for maximum and minimum values are symmetrical with each other about $\beta$. Figure 6.7 is a plot of the distributions for $\alpha = 3{,}897$ and $\beta = 7{,}750$. The mean and variance of the extreme value type I distribution are

$$E(X) = \beta + \gamma_e\alpha \ \text{(maximum)} \qquad = \beta - \gamma_e\alpha \ \text{(minimum)} \qquad (6.48)$$

Fig. 6.7. Example of extreme value type I density curves (largest and smallest), plotted against X in thousands.

$$\text{Var}(X) = \frac{\pi^2\alpha^2}{6} \qquad \text{(both)}$$

where $\gamma_e$ is Euler's constant, which has a value of 0.577216. The skewness coefficient is

$$\gamma = 1.1396 \ \text{(maximum)} \qquad = -1.1396 \ \text{(minimum)}$$

Thus, the type I has a constant coefficient of skewness. If the transformation

$$Y = \frac{X - \beta}{\alpha}$$

is used, the type I extreme value density function becomes

$$p_Y(y) = \exp[\mp y - \exp(\mp y)]$$

where the - applies for the maximum values and the + for the minimum values. The cumulative distribution is

$$P_Y(y) = \int_{-\infty}^{y} \exp[\mp t - \exp(\mp t)]\,dt$$

$$= \exp[-\exp(-y)] \qquad \text{(maximum)}$$

$$= 1 - \exp[-\exp(y)] \qquad \text{(minimum)}$$
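The type I relations for maximum values can be checked numerically. The following is a minimal sketch assuming the mean $\beta + \gamma_e\alpha$, the variance $\pi^2\alpha^2/6$, and the cumulative distribution $\exp[-\exp(-y)]$ with $y = (x - \beta)/\alpha$; the function names are illustrative, and the parameter values are those quoted for figure 6.7.

```python
import math

EULER_GAMMA = 0.577216  # Euler's constant, as quoted in the text


def gumbel_max_moments(alpha, beta):
    """Mean and variance of the type I distribution for maximum values."""
    mean = beta + EULER_GAMMA * alpha
    variance = math.pi ** 2 * alpha ** 2 / 6.0
    return mean, variance


def gumbel_max_cdf(x, alpha, beta):
    """prob(X <= x) for the type I distribution for maximum values."""
    y = (x - beta) / alpha
    return math.exp(-math.exp(-y))


def gumbel_max_pdf(x, alpha, beta):
    """Density of the type I distribution for maximum values."""
    y = (x - beta) / alpha
    return math.exp(-y - math.exp(-y)) / alpha


alpha, beta = 3897.0, 7750.0          # parameters quoted for figure 6.7
mean, variance = gumbel_max_moments(alpha, beta)
sd = math.sqrt(variance)
p_at_mode = gumbel_max_cdf(beta, alpha, beta)

print(f"mean = {mean:,.0f}, standard deviation = {sd:,.0f}")
print(f"prob(X <= mode) = {p_at_mode:.4f}")
```

With these parameters the mean is close to 10,000 and the standard deviation close to 5,000, and the cumulative probability at the mode is exp(-1), or about 0.368, whatever the parameter values.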

For $\kappa > 0$, the distribution has a finite upper bound at $\xi + \alpha/\kappa$ and corresponds to the extreme value type III distribution for maximums that are bounded on the right. The moments of the GEV are

The Var(X) exists for $\kappa > -0.5$. Other restrictions are

for $\kappa > 0$, then $x < \xi + \alpha/\kappa$
for $\kappa < 0$, then $x > \xi + \alpha/\kappa$

For $\kappa > -1/3$, the skewness is

where sign($\kappa$) is +1 or -1 depending on the sign of $\kappa$. For $\kappa > -1$, the order r probability weighted moment $\beta_r$ of a GEV is

and may be estimated by equations 3.71. The L-moments, $\lambda_i$, may then be estimated by equations 3.74. The parameters of the GEV in terms of L-moments are:

where

Quantile values from the GEV can be determined from

where $P_X(x_p)$ is the cdf of X. In chapter 7 an example of the use of the GEV for flood frequency analysis is given.
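GEV quantiles can be sketched in code under an explicit assumption about the form of the distribution. The sketch below assumes the commonly used parameterization $P_X(x) = \exp\{-[1 - \kappa(x - \xi)/\alpha]^{1/\kappa}\}$, which is consistent with the bound $\xi + \alpha/\kappa$ stated above and gives $x_p = \xi + (\alpha/\kappa)\{1 - [-\ln P_X(x_p)]^{\kappa}\}$. The function names and the parameter values are hypothetical, not taken from the text.

```python
import math


def gev_cdf(x, xi, alpha, kappa):
    """GEV cdf under the assumed parameterization (kappa = 0 gives type I)."""
    if kappa == 0.0:
        return math.exp(-math.exp(-(x - xi) / alpha))
    arg = 1.0 - kappa * (x - xi) / alpha
    if arg <= 0.0:
        # x is beyond the bound xi + alpha/kappa: above the upper bound when
        # kappa > 0, below the lower bound when kappa < 0.
        return 1.0 if kappa > 0.0 else 0.0
    return math.exp(-arg ** (1.0 / kappa))


def gev_quantile(p, xi, alpha, kappa):
    """Value x_p with cumulative probability p, obtained by inverting gev_cdf."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if kappa == 0.0:
        return xi - alpha * math.log(-math.log(p))
    return xi + (alpha / kappa) * (1.0 - (-math.log(p)) ** kappa)


# Hypothetical parameters, chosen only to illustrate the calculation.
xi, alpha, kappa = 100.0, 30.0, 0.1

x100 = gev_quantile(0.99, xi, alpha, kappa)   # 100-year event
upper_bound = xi + alpha / kappa

print(f"100-year quantile = {x100:.1f}, upper bound = {upper_bound:.1f}")
```

For positive $\kappa$ every quantile stays below the upper bound, however large the return period.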

BETA DISTRIBUTION A distribution that has both an upper and lower bound is the beta distribution. Generally, the beta distribution is defined over the interval 0 to 1. It can, however, be transformed to any interval a to p. If the limits of the distribution are unknown, they become parameters of the distribution, making it a 4-parameter rather than a 2-parameter distribution. The beta density function is given by

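The function $B(\alpha, \beta) = \int_0^1 x^{\alpha-1}(1-x)^{\beta-1}\,dx$ is called the beta function. The beta function is related to the gamma function by

$$B(\alpha, \beta) = \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha+\beta)}$$

The beta function is tabulated. The mean and variance of the beta distribution are

$$E(X) = \frac{\alpha}{\alpha+\beta}, \qquad \text{Var}(X) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$$

The mean and variance can be used to get the moment estimators for $\alpha$ and $\beta$.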
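The moment estimators follow from solving the mean and variance expressions for the two parameters. The sketch below assumes the beta distribution on the interval 0 to 1 with mean $\alpha/(\alpha+\beta)$ and variance $\alpha\beta/[(\alpha+\beta)^2(\alpha+\beta+1)]$; the function names and test values are illustrative.

```python
def beta_moments(alpha, beta):
    """Mean and variance of the beta distribution on the interval 0 to 1."""
    total = alpha + beta
    mean = alpha / total
    variance = alpha * beta / (total ** 2 * (total + 1.0))
    return mean, variance


def beta_moment_estimates(sample_mean, sample_variance):
    """Method-of-moments estimates of alpha and beta from a mean and variance."""
    if not 0.0 < sample_mean < 1.0:
        raise ValueError("the mean must lie between 0 and 1")
    # From the two moment equations: alpha + beta = mean(1 - mean)/variance - 1.
    total = sample_mean * (1.0 - sample_mean) / sample_variance - 1.0
    if total <= 0.0:
        raise ValueError("variance too large for a beta distribution")
    return sample_mean * total, (1.0 - sample_mean) * total


# Round trip with hypothetical parameters alpha = 2, beta = 5.
mean, variance = beta_moments(2.0, 5.0)
alpha_hat, beta_hat = beta_moment_estimates(mean, variance)
print(f"alpha = {alpha_hat:.3f}, beta = {beta_hat:.3f}")
```

Data on some other interval would first be rescaled to the interval 0 to 1 before these estimators are applied.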

PEARSON DISTRIBUTIONS

Karl Pearson (Elderton 1953) proposed that frequency distributions can be represented by

By choosing appropriate values for the parameters, equation 6.90 becomes a large number of families of distributions including the normal, beta, and gamma distributions. The Pearson type III has found application in hydrology, especially as the distribution of logarithms of flood peaks. This distribution can be written

with the mode at X = 0. The lower bound of the distribution is X = -a. The difference in the mean and mode is $\delta$ and the value of $p_X(x)$ at the mode is $p_0$. It can be shown that the Pearson type III is the same as the 3-parameter gamma distribution. By shifting equation 6.91 so that the mode is at X = a and the lower bound is at X = 0, we have

The gamma distribution has the mode at $(\eta - 1)/\lambda$ and the mean at $\eta/\lambda$. Thus $a = (\eta - 1)/\lambda$ and $\delta = \eta/\lambda - (\eta - 1)/\lambda = 1/\lambda$. The value of $p_X(x)$ at the mode for the gamma distribution is

Substituting these quantities into 6.92 results in

which is the gamma distribution.

SOME IMPORTANT DISTRIBUTIONS OF SAMPLE STATISTICS

We have already seen that sample statistics as functions of random variables are themselves random variables. Statistical tests depend on the probability distribution of test statistics, which are merely sample statistics. In this section three of the important distributions of sample statistics are briefly discussed.

Chi-Square Distribution

If Z is a standardized, normally distributed random variable, $Z = (X - \mu)/\sigma$, then

$$Y = \sum_{i=1}^{n} Z_i^2$$

where Y is the sum of squares of n random values of Z and has a chi-square distribution with n degrees of freedom. The chi-square distribution is a special case of the gamma distribution when $\lambda = 1/2$ and $\eta$ is a multiple of 1/2. The distribution thus has a single parameter $\nu = 2\eta$ known as the degrees of freedom. The expression for the distribution is

$$p_X(x) = \frac{x^{\nu/2 - 1}\, e^{-x/2}}{2^{\nu/2}\,\Gamma(\nu/2)}, \qquad x > 0$$

The mean and variance of the distribution are

$$E(X) = \nu, \qquad \text{Var}(X) = 2\nu$$

The parameter $\nu$ is usually known in any application of the chi-square to statistical testing. Equation 6.95 produces the moment estimator for $\nu$ as $\hat{\nu} = \bar{X}$. In figure 6.4, the curve labeled $\lambda = 1/2$ is a chi-square distribution with $\nu = 6$. The coefficient of skew for the chi-square distribution is $2\sqrt{2}/\sqrt{\nu}$. The cumulative chi-square distribution is contained in the appendix in the form

for various values of $\nu$ and $\alpha$. The fact that the sum of squares of n random standard normal variates is a chi-square distribution with $\nu = n$ makes it apparent that if $X_i$ is a chi-square random variable with parameter $n_i$ then $X = \sum X_i$ is a chi-square random variable with parameter $\nu = \sum n_i$ if all the $X_i$ are independent. If $z_1, z_2, \ldots, z_n$ is a random sample from a standard normal distribution, then $y = \sum z_i^2 = \sum (x_i - \bar{x})^2/\sigma^2$ has a chi-square distribution with $\nu = n - 1$. Furthermore, since $s^2 = \sum (x_i - \bar{x})^2/(n - 1)$, the quantity $(n - 1)s^2/\sigma^2$ has a chi-square distribution with $\nu = n - 1$ (Mood et al. 1974).

The t Distribution

If Y is a standardized normal variate and U is a chi-square variate with $\nu$ degrees of freedom and Y and U are independent, then

$$T = \frac{Y\sqrt{\nu}}{\sqrt{U}}$$

has a t distribution with $\nu$ degrees of freedom. The t distribution is given by

$$p_T(t) = \frac{\Gamma[(\nu+1)/2]}{\sqrt{\pi\nu}\;\Gamma(\nu/2)}\left(1 + \frac{t^2}{\nu}\right)^{-(\nu+1)/2}$$

The mean and variance of the t distribution are

$$E(T) = 0 \ \text{ for } \nu > 1, \qquad \text{Var}(T) = \frac{\nu}{\nu - 2} \ \text{ for } \nu > 2$$

A table of the cumulative t distribution is in the appendix in the form

for various values of $\nu$ and $\alpha$. One use of the t distribution is as the sampling distribution of the mean from a normal distribution with unknown variance. If we write

$$T = \frac{\bar{X} - \mu}{s/\sqrt{n}}$$

and then divide the numerator and denominator by $\sigma$, we get

$$T = \frac{(\bar{X} - \mu)/(\sigma/\sqrt{n})}{\sqrt{s^2/\sigma^2}} = \frac{Y\sqrt{n-1}}{\sqrt{U}}$$

where $Y = (\bar{X} - \mu)/(\sigma/\sqrt{n})$ has a standard normal distribution and $U = (n - 1)s^2/\sigma^2$ has a $\chi^2$ distribution. Thus from equation 6.98, T has a t distribution with n - 1 degrees of freedom.

As $\nu$ of the t distribution gets large, the t distribution approaches the standard normal distribution. Thus, for large samples, the sampling distribution of the mean of a normal distribution with unknown variance approaches a normal distribution. We have already seen that the distribution of the sample mean from a normal distribution with a known variance is exactly a normal distribution. One can reason that as the sample size increases, the estimate for the variance improves to the point where the sampling distribution of the mean of a normal distribution with unknown variance can be approximated by the sampling distribution of the mean of a normal distribution with a known variance, which is itself a normal distribution. In practice one rarely knows the variance of the distribution from which a sample is obtained.

Example 6.8. A sample of size 8 from a normal distribution results in $\bar{x} = 12.7$ and $s^2 = 9.8$. What is the probability that $\bar{x}$ is in error by more than 1.0?

Solution: $(\bar{X} - \mu_x)/\sqrt{s^2/n}$ has a t distribution with n - 1 degrees of freedom. To be in error by more than 1.0 units we must have $|\bar{x} - \mu_x| > 1.0$.

$$t = \frac{1.0}{\sqrt{9.8/8}} = 0.904$$

The desired probability is the area to the right of t = 0.904. By interpolation in the t table, this value is found to be 0.198. By symmetry, the area to the left of -0.904 is 0.198. The desired probability is 0.198 + 0.198 = 0.396. If a standard normal distribution had been used rather than a t distribution, it would have been necessary to find prob(|Z| > 0.904). This probability can be found from the standard normal table to be 0.366. Thus, even for a sample as small as 8, the normal is a reasonable approximation.

The F Distribution

If U is a chi-square variate with $\nu = m$ degrees of freedom and V is a chi-square variate with $\nu = n$ degrees of freedom and U and V are independent, then

$$F = \frac{U/m}{V/n}$$

has an F distribution with $\nu_1 = m$ and $\nu_2 = n$ degrees of freedom (m and n are known as the numerator and denominator degrees of freedom, respectively). The F distribution is given by

$$p_F(f) = \frac{\Gamma[(m+n)/2]}{\Gamma(m/2)\,\Gamma(n/2)}\left(\frac{m}{n}\right)^{m/2} f^{\,m/2 - 1}\left(1 + \frac{m f}{n}\right)^{-(m+n)/2}, \qquad f > 0$$

The mean and variance of the F distribution are

$$E(F) = \frac{n}{n-2} \ \text{ for } n > 2, \qquad \text{Var}(F) = \frac{2n^2(m + n - 2)}{m(n-2)^2(n-4)} \ \text{ for } n > 4$$

The cumulative F distribution is contained in the appendix as a function of m and n for values of $P_F(f)$ = 0.90, 0.95, 0.975, 0.99, and 0.995. The table contains values of $F_{\alpha,m,n}$ such that 100$\alpha$% of the distribution with m and n degrees of freedom lies to the left of $F_{\alpha,m,n}$. For example, the probability that a random observation from an F distribution with 5 numerator and 10 denominator degrees of freedom exceeds 4.2 is 1.00 - 0.975 = 0.025.
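The probabilities in example 6.8 can be checked without tables. The sketch below is an illustration only: it integrates the standard t density numerically with Simpson's rule (a choice made here because the Python standard library has no t distribution) and compares the result with the standard normal approximation. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def t_pdf(t, dof):
    """Density of the t distribution with dof degrees of freedom."""
    coeff = math.gamma((dof + 1) / 2.0) / (
        math.sqrt(math.pi * dof) * math.gamma(dof / 2.0)
    )
    return coeff * (1.0 + t * t / dof) ** (-(dof + 1) / 2.0)


def t_upper_tail(t, dof, steps=2000):
    """prob(T > t) for t >= 0: Simpson's rule on [0, t], then symmetry."""
    if steps % 2:
        steps += 1  # Simpson's rule needs an even number of intervals
    h = t / steps
    total = t_pdf(0.0, dof) + t_pdf(t, dof)
    for i in range(1, steps):
        total += (4.0 if i % 2 else 2.0) * t_pdf(i * h, dof)
    area_0_to_t = total * h / 3.0
    return 0.5 - area_0_to_t


n = 8            # sample size in example 6.8
s2 = 9.8         # sample variance
error = 1.0      # allowed error in the sample mean

t_stat = error / math.sqrt(s2 / n)
p_t = 2.0 * t_upper_tail(t_stat, n - 1)
p_normal = 2.0 * (1.0 - NormalDist().cdf(t_stat))

print(f"t = {t_stat:.3f}")
print(f"prob(|error| > 1.0), t distribution:      {p_t:.3f}")
print(f"prob(|error| > 1.0), normal approximation: {p_normal:.3f}")
```

The two results reproduce the 0.396 and 0.366 of the example to three decimals.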

TRANSFORMATIONS

Often a transformation can be made in an attempt to arrive at a probability distribution that will describe the data. Common transformations are logarithmic transformations, translations along the x axis, and nth power transformations with fractional values of n or with n = 2 or 3. We have already made one application of the logarithmic transformation to get the lognormal distribution from the normal distribution. Other distributions can be transformed by means of this transformation as well. Benson (1968) and an Interagency Subcommittee on Water Data (1982) discuss the use of the log-Pearson type III distribution for flood frequencies.

Translations are especially useful in the case of bounded distributions. We made use of a translation in deriving the 3-parameter extreme value type III for minimums from the corresponding 2-parameter distribution. In general, a translation is accomplished by subtracting a location parameter, $\varepsilon$, from the random variable. For example

$$p_X(x) = \lambda \exp[-\lambda(x - \varepsilon)], \qquad x > \varepsilon$$

could be considered a 2-parameter exponential distribution with the lower bound at $X = \varepsilon$. Exercises 3.17 and 3.18 deal with estimating the two parameters of this distribution. In general, the addition of a displacement parameter, if the displacement parameter is unknown, makes parameter estimation via maximum likelihood much more difficult. Moment estimators are relatively simple in that the addition of a displacement parameter shifts the mean by $\varepsilon$ and has no effect on the variance or skewness. Thus, a 3-parameter gamma distribution might be given by

$$p_X(x) = \frac{\lambda^{\eta}(x - \varepsilon)^{\eta - 1} e^{-\lambda(x - \varepsilon)}}{\Gamma(\eta)}, \qquad x > \varepsilon$$

with the moment estimators for $\lambda$, $\eta$, and $\varepsilon$ determined from

$$\hat{\eta} = \frac{4}{\hat{\gamma}^2}, \qquad \hat{\lambda} = \frac{\sqrt{\hat{\eta}}}{s}, \qquad \hat{\varepsilon} = \bar{x} - \frac{\hat{\eta}}{\hat{\lambda}}$$
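The moment estimators for the 3-parameter gamma can be sketched as follows. The sketch assumes the gamma relations used earlier in the chapter (mean $\eta/\lambda$ and skew $2/\sqrt{\eta}$) together with a variance of $\eta/\lambda^2$, and the fact stated above that a displacement shifts only the mean. The function names and the numerical values are hypothetical.

```python
import math


def gamma3_moments(lam, eta, eps):
    """Mean, standard deviation and skew of the displaced gamma distribution."""
    mean = eps + eta / lam
    sd = math.sqrt(eta) / lam
    skew = 2.0 / math.sqrt(eta)
    return mean, sd, skew


def gamma3_moment_estimates(mean, sd, skew):
    """Method-of-moments estimates (lam, eta, eps) from mean, sd and skew."""
    if skew <= 0.0:
        raise ValueError("this sketch handles positive skew only")
    eta = 4.0 / skew ** 2          # from skew = 2 / sqrt(eta)
    lam = math.sqrt(eta) / sd      # from variance = eta / lam**2
    eps = mean - eta / lam         # displacement shifts only the mean
    return lam, eta, eps


# Round trip with hypothetical parameters lam = 0.5, eta = 4, eps = 10.
mean, sd, skew = gamma3_moments(0.5, 4.0, 10.0)
lam_hat, eta_hat, eps_hat = gamma3_moment_estimates(mean, sd, skew)
print(f"lam = {lam_hat:.3f}, eta = {eta_hat:.3f}, eps = {eps_hat:.3f}")
```

Because $\hat{\eta}$ depends on the square of the sample skew, small errors in the skew produce large changes in all three estimates.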


The fact that $\gamma$ must now be used to estimate the parameters means that for small samples accuracy is lost, because $\gamma$ is based on the third sample moment. As shown earlier, the 3-parameter gamma is the same as the Pearson type III distribution.

Sangal and Biswas (1970) have used the 3-parameter lognormal distribution obtained by fitting a normal distribution to the logarithms of $(X - \varepsilon)$ where $\varepsilon$ is a parameter that must be estimated from the data. They found for 10 Canadian rivers that the 3-parameter lognormal distribution fit the observed distribution of peak flows. They also state that the Gumbel extreme value distribution is a special case of the 3-parameter lognormal distribution. The 3-parameter lognormal is given by

where

Stidd (1953) and Kendall (1967) discuss transforming variables by a power transformation of X and then fitting a normal distribution to the transformed variable Y. They discuss this transformation in terms of precipitation probabilities.

Exercises

6.1. Show that the mean of the uniform distribution is $(\beta + \alpha)/2$ and the variance is $(\beta - \alpha)^2/12$.

6.2. What is the skewness and kurtosis of the uniform distribution?

6.3. What is the skewness and coefficient of variation for the exponential distribution?

6.4. Fit the gamma distribution to the data of exercise 2.2. Plot the expected relative frequency according to the gamma distribution on the plot of exercise 2.2.

6.5. Repeat exercise 6.4 using the lognormal distribution.

6.6. Fit the lognormal distribution to the Kentucky River data of table 2.1. Is this a good approximation for the data?

6.7. Work exercise 5.8 using the lognormal distribution.

6.8. A set of data having a mean of 4.5 and a standard deviation of 2.0 is thought to follow the type I extreme value distribution for maximums. What proportion of the observations from this distribution exceed 6.0? Plot the probability density function.

6.9. Repeat exercise 6.8 using the type I extreme value distribution for minimums.

6.10. Repeat exercise 6.8 using the Weibull distribution.

6.11. Repeat exercise 6.8 using the lognormal distribution.

6.12. Show that the exponential distribution is memoryless [i.e., show that prob($X \ge t + \tau \mid X > t$) = prob($X > \tau$)].

6.13. Plot the probability density function and the cumulative probability distribution for the lognormal distribution with $\mu_x = 50{,}000$ and $\sigma_x = 25{,}000$.

6.14. Plot the theoretical distribution of the largest value selected from a normal distribution with $\mu = 4$ and $\sigma = 4$ for sample sizes of n = 2, 5, 9, and 33. Compare the results with those of example 6.6.

6.15. Derive expressions analogous to equations 6.45 and 6.46 for the smallest of n independently and identically distributed random variables.

6.16. Verify equation 6.52 from equations 6.47 and 6.51.

6.17. Assume that during month 1 the mean and standard deviation of the monthly rainfall are 0.750 and 0.433 inches, respectively. Similarly, during month 2 the mean and standard deviation of monthly rainfall are 3.000 and 0.866 inches, respectively. Assume monthly rainfall amounts can be approximated by the gamma distribution and that rainfall in month 2 is independent of rainfall in month 1. What is the probability of receiving more than 3 inches of rain during the two-month period?

6.18. Show that for the 2-parameter Weibull distribution the parameter $\alpha$ is a function only of the coefficient of variation. Using this fact, describe a procedure for estimating $\alpha$ and $\beta$ of the distribution.

6.19. If peak discharge, q, is lognormally distributed with mean $\mu_q$ and variance $\sigma_q^2$, what is the probability distribution of stage S? Assume stage and discharge are related by $q = aS^b$.

6.20. Work exercise 6.19 assuming the peak discharges are distributed as the type I extreme value distribution.

6.21. In example 6.3 let $\hat{\eta}$ be approximated by 6.0. Calculate $\hat{\lambda}$ from equation 6.25 and then evaluate the prob(yield > 20.0) by using the equation following equation 6.21. Compare the results with those of example 6.3.

6.22. Use the method of moments to estimate the parameters of the 3-parameter lognormal distribution for the North Llano River near Junction, Texas. What is the return period of a mean annual flow of 273 cfs or more?

6.23. Calculate the return period associated with an annual runoff of 0.500 inches for Walnut Gulch near Tombstone, Arizona (data in Appendix C). Assume (a) lognormal distribution, (b) gamma distribution, (c) extreme value type I, (d) normal distribution.

6.24. Assume the data of exercise 4.10 are distributed as a 2-parameter exponential distribution. Estimate the parameters of this distribution and prepare a table comparing the observed and expected number of floods over the 100-year period.

7. Frequency Analysis

One of the earliest and most frequent uses of statistics in hydrology has been that of frequency analysis. Early applications of frequency analysis were largely in the area of flood flow estimation. Today nearly every phase of hydrology is subjected to frequency analysis. Although most of the discussion in this chapter centers on flood flows or peak flows, the techniques are generally applicable to a wide range of problems including runoff volumes, low flows, rainfall events of various kinds, water quality parameters, measures of ground water levels and flows, and many other environmental variables. The statistical and mathematical manipulations discussed in this chapter do not depend on the units of measurement or the quantity measured. The assumptions that are made, however, must be carefully compared to the situation under study.

The goal of a frequency analysis is to estimate the magnitude of an event having a given frequency of occurrence or to estimate the frequency of occurrence of an event having a given magnitude. The frequency is often stated in terms of a return period, T, in years, or a probability of occurrence in any one year, p. Other terminology commonly used includes the estimation of a "quantile" or "percentile" of the probability distribution of the quantity of interest. The 100pth percentile is simply the event having a probability, p, of occurring. The term "quantile" is used in a similar manner. The 90th quantile is the same as the 90th percentile. The 100pth percentile or the 100pth quantile is the value, $x_p$, of the random variable X satisfying

where p is the exceedance probability, or the probability that $x_p$ is exceeded.

There have been and continue to be volumes of material written on the proper probability distribution to use in various situations. One cannot, in most instances, analytically determine which probability distribution should be used. Certain limit theorems such as the Central Limit Theorem and Extreme Value Theorems might provide guidance. One should also evaluate the experience that has been accumulated with the various distributions and how well they describe the phenomena of interest. Certain properties of the distributions can be used in screening distributions for possible application in a particular situation. For example, the range of the distribution, the general shape of the distribution, and the skewness of the distribution often indicate that a particular distribution may or may not be applicable in a given situation. When two or more distributions appear to describe a given set of data equally well, the distribution that has been traditionally used should be selected unless there are contrary overriding reasons for selecting another distribution. However, if a traditionally used distribution is inferior, its use should not be continued just because "that's the way it's always been done". The first part of this chapter discusses empirical frequency analysis by plotting data in the form of a cumulative probability distribution. The second topic covered is analytical frequency analysis based on probability distributions. A simplified technique based on frequency factors is shown for determining the magnitude of an event with a given return period. In general, the frequency factor is a function of the distributional assumption that is made and of the mean, variance, and, for some distributions, the coefficient of skew of the data. Regional frequency analysis is then discussed. Regional frequency analysis attempts to use data from several locations in a "homogeneous" region to determine the frequency relationship for a point having limited data. 
The chapter closes with a discussion of the frequency analysis of precipitation data and other forms of hydrologic data.

Frequency analysis of hydrologic data requires that the data be homogeneous and independent. The restriction of homogeneity ensures that all the observations are from the same population. Nonhomogeneity may result from a stream gaging station being moved, a watershed becoming urbanized, or structures being placed in the stream or its major tributaries. Different types of storms, such as frontal storms and storms associated with hurricanes, may introduce nonhomogeneity. In this latter situation a mixed population model may be required for the frequency analysis. The restriction of independence ensures that a hydrologic event such as a single large storm does not enter the data set more than once. For example, a single storm system may produce two or more large runoff peaks, only one of which (the largest) should enter the data set. Dependence may also result when a major rainfall occurs, producing very wet antecedent conditions on a catchment. A subsequent rainfall may then produce much larger flows than would have occurred had a more normal antecedent condition existed. The flow from the second storm is then dependent on the fact that the first storm had occurred. Runoff from only one of these events, the largest one, should enter the analysis. For the prediction of the frequency of future events, the restriction of homogeneity requires that the data on hand be representative of future flows (i.e., there will be no new structures, diversions, land use changes, etc., in the case of stream flow data). Recently, the possibility of climate change has been raised as a factor contributing to nonhomogeneity of a hydrologic record. If climate change is occurring at a rate rapid enough to affect the usefulness of a particular hydrologic analysis, this change must be reckoned with in the analysis.


Hydrologic frequency analysis can be made with or without making any distributional assumptions. The procedure to be followed in either case is much the same. If no distributional assumptions are made, the observed data are plotted on any kind of paper (not necessarily probability paper) and judgment used to determine the magnitude of past or future events for various return periods. If a distributional assumption is made, the magnitude of events for various return periods is selected from the theoretical "best-fit" line according to the assumed distribution. If an analytical technique is used, the data should still be plotted so that one can get an idea of how well the data fit the assumed analytical form and to spot potential problems.

PROBABILITY PLOTTING Once data for a frequency analysis have been selected, they must be carefully scrutinized to ensure all of the observations are all valid representations of the hydrologic characteristic under consideration. For example, in a flood frequency data set consisting of the annual maximum flow, it is possible that the lower values are merely flows somewhat above the flows for the remainder of the year but do not truly represent high flows or flood flows. In such a case, some truncation of low flows might be instituted with the analysis done on the truncated data set and adjusted to the full record length. After accepting the data as valid, basic statistics (mean, variance, skewness) of the data should be computed and the data plotted as a probability plot. Plotting probability density functions and cumulative probability distributions on arithmetic paper has already been discussed. In general, when the cumulative distribution function, Px(x), is plotted on arithmetic paper versus the value of X, a straight line does not result. To get a straight line on arithmetic paper, Px(x) would have to be given by the expression Px(x) = ax + b or p,(x) = a, the uniform distribution. Thus, if the cumulative distribution of a set of data plots as a straight line on arithmetic paper, the data follows a uniform distribution. Probability paper can be developed so that any cumulative distribution can be plotted as a straight line. Generally, the scaling of the probability axes is unique for each of the different probability distributions to plot as a straight line. The scaling of the probability axis may even have to change as the parameters of a particular distribution change. Constructing probability paper is a process of transforming the probability scale so that the resulting cumulative curve is a straight line. 
Many types of probability paper are commercially available, including paper for the normal, lognormal, exponential, certain cases of the gamma, extreme value (type I), Weibull, and chi-square distributions. A few computer software packages provide for plotting using a normal distribution probability scale. Some of the packages will plot probability directly whereas others use the Z transformation of the normal distribution. The resulting plots are similar. When the Z transformation is used, the probability associated with the plotted Z values must be independently determined. The most common probability paper has a normal probability scale and either an arithmetic (normal probability paper) or logarithmic (lognormal probability paper) scale for the variable. Normally distributed data will plot as a straight line on normal probability paper and lognormally distributed data will plot as a straight line on lognormal probability paper. One way to determine if data might be from a normal or lognormal distribution is to plot the data on normal and lognormal probability paper and visually determine if a straight line is obtained.
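The Z transformation mentioned above can be illustrated with a short sketch. It uses Python's standard-library `statistics.NormalDist`; the mean and standard deviation are made-up values chosen only for illustration, and the function name `normal_probability_scale` is hypothetical.

```python
from statistics import NormalDist

def normal_probability_scale(p):
    """Map a cumulative probability p (0 < p < 1) to its standard normal Z value."""
    return NormalDist().inv_cdf(p)

# Hypothetical normally distributed variable (made-up mean and standard deviation).
mean, sd = 100.0, 20.0
dist = NormalDist(mean, sd)

# Pair each probability's Z value with the corresponding quantile of the variable.
probs = [0.01, 0.10, 0.50, 0.90, 0.99]
points = [(normal_probability_scale(p), dist.inv_cdf(p)) for p in probs]

# On the transformed scale the quantiles fall on the line x = mean + sd * z,
# so the slope between any two consecutive points equals the standard deviation.
slopes = [
    (x2 - x1) / (z2 - z1)
    for (z1, x1), (z2, x2) in zip(points, points[1:])
]
```

Plotting the variable against Z rather than against probability is what makes normally distributed data appear as a straight line; the probability labels on normal probability paper are simply the Z axis relabeled.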

A probability plot is a plot of a magnitude versus a probability. Determining the probability to assign a data point is commonly referred to as determining the plotting position. For a population, determining the plotting position is merely a matter of determining the fraction of the data values less (greater) than or equal to the value in question. Thus the smallest (largest) population value would plot at 0 and the largest (smallest) population value would plot at 1.00. Assigning plotting positions to sample data is not as straightforward. Generally, a sample will not contain the smallest or largest value of the unknown population. Thus, plotting positions of 0 and 1 should be avoided for sample data unless one has additional information on the population limits. Plotting position may be expressed as a probability from 0 to 1 or a percent from 0 to 100. Which method is being used should be clear from the context.

In some discussions of probability plotting, especially in hydrologic literature, the probability scale is used to denote prob(X > x) or 1 - Px(x). In this book we will adopt this convention. The reason for this is that the return period, Tx(x), is 1/prob(X > x) = 1/(1 - Px(x)), or the reciprocal of the probability scale. One can always transform the probability scale from 1 - Px(x) to Px(x) or even Tx(x) if desired.

Probability plotting of hydrologic data requires that individual observations or data points be independent of each other and that the sample data be representative of the population (unbiased). Some common types of sample data are complete duration series, annual series, partial duration series, and extreme value series. The complete duration series consists of all available data. An example would be all the available daily flow data for a stream. This particular data set would most likely not have independent observations.
Complete duration series data are rarely subjected to a standard frequency analysis because they likely contain significant serial correlation. Since what is generally of interest are rare events, often only the largest or smallest event over a period of time, generally one year, is selected. Such a series is known as the annual series. The data in table 2.1 are an annual series. The partial duration series consists of all values above (below) a certain base. All peak flows above 40,000 cfs in the Kentucky River, Salvisa, Kentucky, would represent a partial duration series. This series may have more or fewer values in it than the annual series. For example, there would be 9 years that would not have contributed any data to a partial duration series with a base of 40,000 cfs for the data in table 2.1; however, some years may have more than one peak above the base. The partial duration series is also known as the 'peaks over threshold' series. The annual series and the partial duration series approach one another for long return periods. Beard (1974) has shown that the relationship between annual series and partial duration series flood peaks varies throughout the United States and recommends the use of empirically derived, regionalized relationships. Frequently, the annual series and the partial duration series are combined so that the largest (smallest) annual value plus all independent values above (below) some base are used. For periods of record longer than about 10 years, the annual series and the partial duration series give very similar results. The extreme value series consists of the largest (smallest) observation in a given time interval. The annual series is a special case of the extreme value series with the time interval being one year.
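The difference between the annual series and the partial duration series can be sketched as follows. The peaks are hypothetical (they are not the Kentucky River record), the function names are illustrative, and the peaks are assumed to have already been screened for independence.

```python
def annual_series(peaks):
    """Largest peak in each year. `peaks` is an iterable of (year, flow) pairs."""
    largest = {}
    for year, flow in peaks:
        if year not in largest or flow > largest[year]:
            largest[year] = flow
    return [largest[year] for year in sorted(largest)]

def partial_duration_series(peaks, base):
    """All peaks above the base, regardless of the year in which they occurred."""
    return [flow for _, flow in peaks if flow > base]

# Hypothetical independent flood peaks as (year, flow in cfs).
peaks = [
    (1990, 52_000), (1990, 43_000),
    (1991, 31_000),
    (1992, 61_000), (1992, 47_000), (1992, 41_000),
    (1993, 38_000),
]

annual = annual_series(peaks)                         # one value per year
partial = partial_duration_series(peaks, base=40_000)  # zero or more values per year
```

In this made-up record, two years contribute nothing to the partial duration series while two other years contribute more than one peak, which is the behavior described above for a base of 40,000 cfs.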


Regardless of the type of sample data used, the plotting position can be determined in the same manner. Gumbel (1958) states the following criteria for plotting position relationships:

1. The plotting position must be such that all observations can be plotted.
2. The plotting position should lie between the observed frequencies of (m - 1)/n and m/n, where m is the rank of the observation beginning with m = 1 for the largest (smallest) value and n is the number of years of record (if applicable) or the number of observations.
3. The return period of a value equal to or larger than the largest observation and the return period of a value equal to or smaller than the smallest observation should converge toward n.
4. The observations should be equally spaced on the frequency scale.
5. The plotting position should have an intuitive meaning, be analytically simple, and be easy to use.

Several plotting position relationships are presented in Chow (1964) and Singh (1992). A general plotting position relationship is given by

where a and b are constants (Adamowski 1981). Some of the most common relationships for plotting positions are shown in table 7.1. Unless specifically stated to the contrary, the Weibull relationship is used in the remainder of this book.

Table 7.1. Common plotting position relationships

Name         Source                  Relationship
California   California (1923)       m/n
Hazen        Hazen (1930)            (2m - 1)/(2n)
Weibull      Weibull (1939)          m/(n + 1)
Cunnane      Cunnane (1978)          (m - 0.4)/(n + 0.2)
Gringorten   Rao and Hamed (2000)    (m - 0.44)/(n + 0.12)
Adamowski    Adamowski (1981)        (m - 0.25)/(n + 0.5)

Benson (1962a), in a comparative study of several plotting position relationships, found on the basis of theoretical sampling from extreme value and the normal distributions that the Weibull relationship provided estimates that were consistent with experience. The Weibull plotting position formula meets all 5 of the above criteria: 1) All of the observations can be plotted since the plotting positions range from 1/(n + 1), which is greater than zero, to n/(n + 1), which is less than one. Probability paper for distributions with infinitely long tails does not contain the points zero and one; 2) The relationship m/(n + 1) lies between (m - 1)/n and m/n for all values of m and n; 3) The return period of the largest value is (n + 1)/1, which approaches n as n gets large, and the return period of the smallest value is (n + 1)/n = 1 + 1/n, which approaches 1 as n gets large; 4) The difference between the plotting positions of the (m + 1)st and mth values is 1/(n + 1) for all values of m and n; and 5) The fact that condition 3 is met plus the simplicity of the Weibull relationship fulfills condition 5.

One objection to the Hazen plotting position is that the return period for the largest (m = 1) event is 2n, or twice the record length. An objection to the California plotting position is that the smallest value (m = n) has a plotting position of 1, which implies that the smallest sample value is the smallest possible value. A value of 1 cannot be plotted on many types of probability paper.

It should be noted that all of the relationships give similar values near the center of the distribution but may vary considerably in the tails. Predicting extreme events depends on the tails of the distribution, so care must be exercised.

The quantity 1 - Px(x) represents the probability of an event with a magnitude equal to or greater than the event in question. When the data are ranked from the largest (m = 1) to the smallest (m = n), the plotting positions correspond to 1 - Px(x). If the data are ranked from the smallest (m = 1) to the largest (m = n), the plotting position formulas are still valid; however, the plotting position now corresponds to the probability of an event equal to or smaller than the event in question, which is Px(x). Probability paper may contain scales of Px(x), 1 - Px(x), Tx(x), or a combination of these. Plotting data on probability paper results in an empirical distribution of the data.

As an example of probability plotting, consider the data in table 2.1. The steps in plotting these data are:

1. Rank the data from the largest (smallest) to the smallest (largest) value. If two or more observations have the same value, several procedures can be used for assigning a plotting position. The procedure adopted here is to treat them as if they had different values and assign each a unique rank. For example, in the data of table 7.2, the value of 82,900 occurs twice, and the two occurrences are assigned ranks 22 and 23.
2. Calculate the plotting position.
3. Select the type of probability paper to be used. Normal probability paper is used in this example.
4. Plot the observations on the probability paper.

The data of table 2.1 are ranked and the plotting positions calculated based on the Weibull relationship in table 7.2. Figure 7.1 presents the plotted data.

Table 7.2. Determination of plotting position for Kentucky River data (columns: Flow, Rank, pp)
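The ranking and plotting position steps above can be sketched in a few lines. This is a minimal illustration: the flows are hypothetical (not the table 2.1 data), the function names are made up, and only the Weibull, Hazen, and California relationships described in the text are included.

```python
def plotting_positions(flows, formula="weibull"):
    """Rank flows from largest (m = 1) to smallest (m = n) and return
    (flow, rank, plotting position) tuples. With this ranking the
    plotting position estimates the exceedance probability 1 - Px(x)."""
    formulas = {
        "weibull": lambda m, n: m / (n + 1),
        "hazen": lambda m, n: (2 * m - 1) / (2 * n),
        "california": lambda m, n: m / n,
    }
    pp = formulas[formula]
    n = len(flows)
    ranked = sorted(flows, reverse=True)  # tied values still receive distinct ranks
    return [(x, m, pp(m, n)) for m, x in enumerate(ranked, start=1)]

def return_period(exceedance_probability):
    """Return period as the reciprocal of the exceedance probability."""
    return 1.0 / exceedance_probability

# Hypothetical annual peak flows in cfs; 82,900 occurs twice.
flows = [41_000, 82_900, 55_300, 82_900, 67_200, 98_100, 73_400, 60_500, 49_800]

weibull = plotting_positions(flows)
hazen = plotting_positions(flows, "hazen")
california = plotting_positions(flows, "california")
```

With n = 9 the largest flow receives a Weibull return period of n + 1 = 10 years and a Hazen return period of 2n = 18 years, while the California relationship places the smallest flow at a plotting position of 1, which illustrates the objections raised above.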

Fig. 7.1. Normal probability plot of Kentucky River flow data (horizontal axis: exceedance probability, in percent).

When probability plots are made and a line drawn through the data, the tendency to extrapolate the data to high return periods is great. The distance on the probability paper from a return period of 20 years to a return period of 200 years is not very much; however, it represents a 10-fold extrapolation of the data. If the data do not truly follow the assumed distribution with population parameters equal to the sample statistics, the error in this extrapolation can be quite large. This fact has already been referred to when it was stated that the estimation of probabilities in the tails of distributions is very sensitive to distributional assumptions. Because one of the usual purposes of probability plotting is to estimate events with longer return periods, Blench (1959) and Dalrymple (1960) have criticized the blind use of analytical flood frequency methods because of this tendency toward extrapolation. If a set of data plots as a straight line on probability paper, the data can be said to be distributed as the distribution corresponding to the probability paper. Because it would be rare for a set of data to plot exactly on a line, a decision must be made as to whether or not the deviations from the line are random deviations or represent true deviations, indicating that the data does not follow the given probability distribution. Examining figure 7.1, it is apparent that, with the exception of the largest value, the deviations from a straight line are small. It might be assumed that the data can be approximated by the normal distribution. So far two tests, both based on judgment, have been described for determining if a set of data follows a certain distribution. The first method was to visually compare observed and theoretical frequency histograms and the second to visually compare observed and theoretical cumulative frequency curves in the form of probability plots. In chapter 8, statistical tests based on these two visual tests will be presented. 
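The sensitivity of extrapolated estimates to the distributional assumption can be illustrated numerically. The sketch below uses the mean (66,540 cfs) and coefficient of variation (0.335) that this chapter uses for the Kentucky River data in its normal-distribution example. The lognormal parameters are obtained by the method of moments from those same two statistics, which is an assumption made here for illustration rather than the procedure of the text.

```python
from math import exp, log, sqrt
from statistics import NormalDist

def normal_quantile(mean, cv, T):
    """T-year event assuming a normal distribution."""
    z = NormalDist().inv_cdf(1 - 1 / T)
    return mean * (1 + cv * z)

def lognormal_quantile(mean, cv, T):
    """T-year event assuming a lognormal distribution whose mean and
    coefficient of variation match the sample (method of moments)."""
    z = NormalDist().inv_cdf(1 - 1 / T)
    sigma_y = sqrt(log(1 + cv ** 2))
    mu_y = log(mean) - 0.5 * sigma_y ** 2
    return exp(mu_y + sigma_y * z)

mean, cv = 66_540, 0.335  # statistics quoted in this chapter for the table 2.1 data

# Ratio of the lognormal estimate to the normal estimate at two return periods.
ratio_20 = lognormal_quantile(mean, cv, 20) / normal_quantile(mean, cv, 20)
ratio_200 = lognormal_quantile(mean, cv, 200) / normal_quantile(mean, cv, 200)
```

Both distributions share the same first two moments, yet the ratio of the two estimates grows as the return period increases from 20 to 200 years, which is the tail sensitivity the paragraph above warns about.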
Historical Data

Occasionally, flood information outside of the systematic flow record is available from historical sources such as newspaper reports, earlier flood investigations, or from paleohydrologic investigations. Such data contain valuable information that should not be ignored in a frequency analysis. Bulletin 17B of the United States Water Resources Council (1981) demonstrates computing the plotting position of the historical observations on the basis of the historical record length. Likewise, the plotting position of the systematic data is computed on the basis of the historic record length, except that the rank used in the calculation is adjusted by a factor, W, depending on the historic record length, H, the number of historic flows, Z, and the length of the systematic record, N. These are related by

$$W = \frac{H - Z}{N}$$

The adjusted rank for the systematic data is

$$\tilde{m} = W m - (W - 1)(Z + 0.5)$$

with m being the unadjusted rank of the total record (systematic plus historic).


Thus, if 20 years of systematic data and 2 historic observations larger than any values in the systematic record are available from a 50-year period preceding the systematic record, the plotting position for the 2 largest values would be 1/71 = 0.014 and 2/71 = 0.028. The weighting factor would be

$$W = \frac{70 - 2}{20} = 3.40$$

The remaining plotting positions would be calculated from the adjusted rank given by

$$\tilde{m} = 3.40m - (3.40 - 1)(2 + 0.5) = 3.40m - 6$$

The adjusted rank is then used in the plotting position relationship (equation 7.2). Thus, for m = 3 (the largest systematic flow observation), the plotting position using the Weibull plotting position relationship would be [3.40(3) - 6]/71, or 0.0592, and for m = 22 (the smallest value) the plotting position would be [3.40(22) - 6]/71, or 0.9690. This compares to plotting positions of 1/21, or 0.0476, and 20/21, or 0.9524, respectively, if the historic data had been ignored. If the historic data had simply been used to augment the systematic record without using the weighting factor, the plotting positions for these two events would have been 1/23, or 0.0435, and 2/23, or 0.0870, respectively. Clearly, a plotting position of 0.0435 assigns too high a probability of occurrence to the largest systematic value. Knowledge that there were 48 years with no flows larger than the two historical events has been ignored in this latter case. It is also apparent that the weighting procedure adjusts the plotting position toward a more frequent occurrence for the largest systematic value, thus taking into account the fact that two flows greater in magnitude than the largest systematic flow occurred.

Bulletin 17B also suggests the flow statistics be computed by weighting the contribution of the systematic record to the various statistics by the factor W. Thus the adjusted mean is

where the X represents the systematic record and $X_z$ the historic data. Similarly, the variance and skew can be determined from

If a log-based distribution such as the lognormal or log Pearson III is being used, the X's and $X_z$'s would be based on logarithms.
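The historic adjustment can be checked numerically with the worked example of this section (H = 70, Z = 2, N = 20). The sketch below is hedged in two ways: the function names are illustrative, and `weighted_moments` follows the historically weighted moment formulas of Bulletin 17B as recalled here for the case in which no low values have been removed from the record, which is an assumption rather than a transcription of this book's equations.

```python
def historic_weight(H, Z, N):
    """Weighting factor W for the systematic record."""
    return (H - Z) / N

def adjusted_rank(m, W, Z):
    """Ranks 1..Z belong to the historic peaks and are unchanged;
    the ranks of systematic observations are adjusted by W."""
    if m <= Z:
        return float(m)
    return W * m - (W - 1) * (Z + 0.5)

def weibull_pp(rank, H):
    """Weibull plotting position based on the historic record length."""
    return rank / (H + 1)

def weighted_moments(systematic, historic, W):
    """Mean, variance, and skew with systematic values weighted by W
    (assumed Bulletin 17B form with no low values removed)."""
    H = W * len(systematic) + len(historic)  # equals the historic record length
    mean = (W * sum(systematic) + sum(historic)) / H

    def weighted_sum(power):
        return (W * sum((x - mean) ** power for x in systematic)
                + sum((x - mean) ** power for x in historic))

    variance = weighted_sum(2) / (H - 1)
    skew = H * weighted_sum(3) / ((H - 1) * (H - 2) * variance ** 1.5)
    return mean, variance, skew

H, Z, N = 70, 2, 20
W = historic_weight(H, Z, N)
pp = [weibull_pp(adjusted_rank(m, W, Z), H) for m in range(1, Z + N + 1)]
```

Here `pp[2]` and `pp[-1]` correspond to the largest and smallest systematic flows of the example. With W = 1 and no historic values, `weighted_moments` reduces to the ordinary sample mean, variance, and skew.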


Outliers

When probability plots of hydrologic data are made, frequently one or two extreme events are present that appear to be from a different population because they plot far off of the line defined by the other points. For example, it is entirely possible that a 100-year event is contained in 10 years of record. If this is the case, assigning a normal plotting position of 1/11 to this value would not be reflective of its true return period. Unfortunately, the true return period is not known. The treatment of these "outliers" is an unresolved and controversial question. The fact that this occurs frequently in hydrologic data should not be surprising. Using methods discussed in chapter 4, the probability of at least one occurrence of an n-year event in a k-year record can be calculated as $1 - (1 - 1/n)^k$. For example, the probability of at least one occurrence of a 100-year event in a 32-year record is $1 - 0.99^{32}$, or 0.275. If we have four independent 32-year records, we expect one to contain at least one 100-year event. This is the case even though the 100-year event is from the same population as the other 31 events in the 32-year record. Bulletin 17B suggests that outliers can be identified from

where $X_H$ and $X_L$ are threshold values for high and low outliers and $K_n$ can be approximated from

$$K_n = 1.055 + 0.981 \log_{10} n \qquad (7.8)$$

where n is the number of observations. If a peak in the record exceeds XH and historical information of the type discussed earlier is available regarding that peak, it should be removed from the systematic record and treated as historical observation as discussed in the section Historical Data. If historical information on the flow is not available, it should be retained as a part of the systematic data. If a flow is less than XL, that value should be deleted from the record and conditional probability procedures as explained in the section on Treatment of Zeros should be employed. More detail on the treatment of outliers is contained in Bulletin 17B. One should be very reluctant to ignore high outliers in a flood frequency analysis unless strong evidence exists that the data point contains substantial error.
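The outlier calculations above lend themselves to a short sketch. The probability formula and the approximation for $K_n$ (equation 7.8) are taken from the text. The threshold function assumes the Bulletin 17B convention, in which the test is applied to the base-10 logarithms of the flows as mean ± $K_n$ times the standard deviation. That convention and the sample statistics used below are assumptions made for illustration.

```python
from math import log10

def prob_at_least_one(T, k):
    """Probability of at least one T-year event in a k-year record."""
    return 1 - (1 - 1 / T) ** k

def outlier_k(n):
    """Approximate outlier test factor K_n (equation 7.8)."""
    return 1.055 + 0.981 * log10(n)

def outlier_thresholds(mean_log, sd_log, n):
    """High and low outlier thresholds in log10 units, assuming the
    Bulletin 17B form: mean of the logs +/- K_n * standard deviation."""
    k = outlier_k(n)
    return mean_log + k * sd_log, mean_log - k * sd_log

p = prob_at_least_one(100, 32)

# Hypothetical log10 flow statistics for a 32-year record.
high_log, low_log = outlier_thresholds(mean_log=4.5, sd_log=0.25, n=32)
high_flow, low_flow = 10 ** high_log, 10 ** low_log  # back to flow units
```

The value of `p` reproduces the 0.275 quoted in the text for a 100-year event in a 32-year record.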

ANALYTICAL HYDROLOGIC FREQUENCY ANALYSIS

Probability plotting without any distributional assumptions is an empirical method of frequency analysis. If one is willing to make a distributional assumption concerning a data set, an analytical frequency analysis may be done by estimating the parameters of the assumed distribution and then using the fitted distribution to estimate the relationship between magnitudes and probabilities. Such a procedure would be a direct application of the techniques of chapter 6. For example, if the lognormal distribution is used, the parameters of the distribution would be estimated based on either the actual observations or their logarithms. Then the magnitude of a flow having a particular exceedance probability or return period would be based on the lognormal distribution and the estimated parameters.

Fitting probability distributions to data and estimating quantiles or probabilities from these distributions has the advantage of smoothing the data and of making it possible to standardize frequency estimation procedures. It also provides a consistent way for extrapolating short records to obtain estimates corresponding to 50- to 200-year flows. Of course, such extrapolations are fraught with ambiguities. The selection of an appropriate probability density function is critical, as is having an adequate sample from which to estimate the parameters of the selected distribution. Rao and Hamed (2000) have an extensive discussion of the mathematical properties of most of the probability distributions that are used in hydrologic frequency analysis.

Chow (1951) has shown that many frequency analyses can be reduced to the form

$$X_T = \bar{X}(1 + C_v K_T) \qquad (7.9)$$

where $X_T$ is the magnitude of the event having a return period T and $K_T$ is a frequency factor. This relationship comes about by writing any X as

$$X = \bar{X} + \Delta X \qquad (7.10)$$

and then stating that $\Delta X$, the deviation from the mean, is the product of the standard deviation s and a frequency factor K:

$$X_T = \bar{X} + s K_T \qquad (7.11)$$

$K_T$ depends on the probability distribution being used and the return period. Recalling that $C_v = s/\bar{X}$, equation 7.11 takes on the form of equation 7.9. Chow (1951, 1964) presents the frequency factors for many different types of frequency distributions. Equation 7.9 can also be used to construct the probability scale on plotting paper so that the distribution corresponding to $K_T$ plots as a straight line. The use of frequency factors is equivalent to using the method of moments for estimating the parameters of a pdf.

Normal Distribution

For the normal distribution it can easily be shown that $K_T$ is the standardized normal variate Z. The standard normal distribution, along with equation 7.9, can be used to determine the magnitude of normally distributed events corresponding to various probabilities. For example, the magnitude of a 20-year peak flow for the data of table 2.1 can be determined by calculating

and

The 20-year event corresponds to a prob(X > x) of 0.05, so the probability of an event less than the 20-year event is 0.95. The value of Z corresponding to a probability of 0.95 is found from standard normal tables to be 1.645. Thus

$$X_{20} = \bar{X}(1 + C_v K_{20}) = 66{,}540(1 + 0.335 \times 1.645) = 103{,}209 \text{ cfs}$$

which agrees with the value given by figure 7.1.

Lognormal Distribution

For the lognormal distribution, the magnitude of a flow with a given return period can be determined by recalling that the logarithms of the flow are normally distributed. The data are first converted to their natural logarithms by Y = ln(X). The mean and standard deviation of the logarithms are then determined. $X_T$ is then given by

$$X_T = \exp(\bar{Y} + s_Y K_T)$$

where $\bar{Y}$ and $s_Y$ are based on the natural logarithms of X, and $K_T$ is from the standard normal distribution.

Log Pearson Type III Distribution

Benson (1968) reported on a method of flood frequency analysis based on the log Pearson type III distribution, which is obtained when the base 10 logarithms of observed data are used along with the Pearson type III distribution (equation 6.91). This method is applied as follows:

1. Transform the n annual flood magnitudes, $X_i$, to their logarithmic values, $Y_i$ (i.e., $Y_i = \log_{10} X_i$ for i = 1, 2, ..., n).
2. Compute the mean logarithm, $\bar{Y}$.
3. Compute the standard deviation of the logarithms, $s_Y$.
4. Compute the coefficient of skewness, $C_s$.
5. Compute

$$Y_T = \bar{Y} + s_Y K_T$$

where $K_T$ is obtained from table 7.3. Note that this relationship is identical to equation 7.9 except the logarithms are used.
6. Compute $X_T = \text{antilog}\, Y_T = 10^{Y_T}$.

This method has as a special case the lognormal distribution when $C_s = 0$. For short periods of record, the skew coefficient calculated from equation 7.13 may not be a reliable estimate of the population skew coefficient and it may be desirable to replace it with a regionalized coefficient (Beard 1962, 1974; Benson 1968). Figure 7.2 contains regionalized skew coefficients of annual streamflow maximum logarithms computed by the U.S. Geological Survey.

Table 7.3a. $K_T$ values for positive skew coefficients, Pearson type III distribution (columns: skew coefficient; recurrence intervals of 1.0101, 2, 5, 10, 25, 50, 100, and 200 years, corresponding to 99, 50, 20, 10, 4, 2, 1, and 0.5 percent chance). Source: Interagency Advisory Committee on Water Data (1982).

Table 7.3b. $K_T$ values for negative skew coefficients, Pearson type III distribution (columns: skew coefficient; recurrence intervals of 1.0101, 2, 5, 10, 25, 50, 100, and 200 years, corresponding to 99, 50, 20, 10, 4, 2, 1, and 0.5 percent chance). Source: Interagency Advisory Committee on Water Data (1982).

The frequency factors of table 7.3 can be used for the Pearson type III distribution in the same manner as for the log Pearson type III. The actual data values rather than their logarithms would then be used. Approximate values of $K_T$ for the Pearson type III distribution are given by

$$K_T = \frac{2}{C_s}\left\{\left[\left(K_n - \frac{C_s}{6}\right)\frac{C_s}{6} + 1\right]^3 - 1\right\} \qquad (7.14)$$

where $K_n$ is the standard normal deviate (Interagency Advisory Committee on Water Data 1982). Because of certain limitations on this approximation, the use of the table for $K_T$ is recommended. Obviously, the use of analytic approximations for $K_T$ for any of the distributions makes the calculations for flows of various return periods quite easy using spreadsheets or other computer software. Table 7.4 contains the maximum percent error in equation 7.14 as compared to table 7.3. Note that a 1% error in $K_T$ does not translate directly to a 1% error in flow. For example, when the log Pearson type III is used in example problem 7.2, the 100-year flow is estimated at 29,719 cfs. The skewness of the logarithms was 0.296, so use of equation 7.14 has a maximum error of 0.09%. With such an error, $K_T$ would be 1.0009 × 2.542, or 2.544, and the resulting flow estimate would be 29,752 cfs, which represents a difference of 0.11% from the estimate using the table value. This is a very small error when one considers the uncertainties present in estimates of this kind. Often interpolation has to be done in table 7.3, which may introduce more error than the use of equation 7.14. Only for $C_s < -2.5$ is the error in $K_T$ greater than 2% for T of 50, 100, and 200 years.

Table 7.4. Errors in the use of equation 7.14 for estimating $K_T$, log Pearson distribution

Fig. 7.2. Generalized skew coefficients of annual maximum stream flow logarithms.

Extreme Value Type I Distribution (Gumbel Distribution)

Chow (1951) presents the following relationship for the frequency factor for the extreme value type I maximum distribution

$$K_T = -\frac{\sqrt{6}}{\pi}\left\{\gamma + \ln\left[\ln\left(\frac{T_X(x)}{T_X(x) - 1}\right)\right]\right\} \qquad (7.15)$$

where $\gamma$ is Euler's constant (0.577216) and $T_X(x)$ is the desired return period of the quantity being calculated. Potter (1949) presents some curves that simplified the application of the extreme value type I. Kendall (1959) presents the frequency factors shown in table 7.5 for the extreme value type I distribution. The values computed from equation 7.15 are equivalent to an infinite sample size in table 7.5.
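The frequency factor approach of this section can be sketched for the normal, Pearson type III, and extreme value type I distributions. The function names are illustrative. The Pearson type III function uses the Wilson-Hilferty form given in Bulletin 17B and the extreme value function uses the standard form attributed to Chow (1951); both are written from memory here and should be read as assumptions about equations 7.14 and 7.15 rather than transcriptions of them.

```python
from math import log, pi, sqrt
from statistics import NormalDist

EULER_GAMMA = 0.577216  # Euler's constant as quoted in the text

def k_normal(T):
    """Frequency factor for the normal distribution: the standard normal
    variate with non-exceedance probability 1 - 1/T."""
    return NormalDist().inv_cdf(1 - 1 / T)

def k_pearson3(T, skew):
    """Approximate Pearson type III frequency factor (assumed
    Wilson-Hilferty form as used in Bulletin 17B)."""
    z = k_normal(T)
    if abs(skew) < 1e-8:  # zero skew reduces to the normal case
        return z
    g6 = skew / 6
    return (2 / skew) * (((z - g6) * g6 + 1) ** 3 - 1)

def k_gumbel(T):
    """Extreme value type I frequency factor for an infinite sample size."""
    return -(sqrt(6) / pi) * (EULER_GAMMA + log(log(T / (T - 1))))

def quantile(mean, sd, k):
    """Magnitude of the T-year event from a frequency factor."""
    return mean + sd * k

# Worked example from the text: 20-year flow, normal distribution.
mean, cv = 66_540, 0.335
x20 = quantile(mean, cv * mean, k_normal(20))

k100_lp3 = k_pearson3(100, 0.296)  # skew of the logarithms quoted in the text
k100_ev1 = k_gumbel(100)
```

`x20` differs from the 103,209 cfs of the text by only a few cfs because the code does not round Z to 1.645. For the log Pearson type III distribution, `quantile` would be applied to the mean and standard deviation of the base-10 logarithms and the result raised as a power of 10.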

Table 7.5. Frequency factors for extreme value type I distribution (rows: sample size n; columns: return period)

Other Distributions

Any of the distributions discussed in chapter 6 can be fit to data by using the methods discussed in that chapter. Frequency factors for some of the other distributions are given by Chow (1951, 1964).

GENERAL CONSIDERATIONS

Many proponents (and opponents) of one analytical form or another for flood flow frequencies have come to the fore over the past few decades. The proponents claim that some particular method is superior to some other method and "prove" their claim by a few rationalizations and some case studies. The fact remains that these rationalizations involve questionable assumptions. There is no direct theoretical connection between any analytical form of the frequency distribution and the underlying mechanisms governing flood flows except through the limit theorems. The primary consideration in selecting a particular analytical form for the frequency distribution is that the distribution "fit" the observed data (Anderson 1967; Benson 1968).

Benson (1968) reported on the results of a study by a work group consisting of 18 representatives from 12 federal agencies of the U.S. government. This group studied 6 methods of flood frequency analysis on 10 streams located throughout the United States. The records on these streams ranged in length from 40 to 97 years with an average of 55 years. The drainage areas ranged from 16.4 to 36,800 square miles. The six methods of analysis consisted of 1) the gamma distribution, 2) Gumbel distribution, 3) Gumbel distribution using the logarithms of the data, 4) lognormal distribution, 5) log Pearson type III distribution, and 6) Hazen's method. The computational procedures used were much like those presented in this book. The Hazen method consists of using an equation like equation 7.8 along with a table of empirically derived frequency factors that are a function of the return period and the coefficient of skew (Hazen 1930).
Large differences were produced by the 6 different methods, especially at long return periods. The results showed that the lognormal, log Pearson type III, and Hazen methods were about equally good. The group suggested that the log Pearson type III be used unless there was a good reason to use some other method. This recommendation was made even though the group realized that "there are no rigorous statistical criteria on which to base a choice of method". Benson's (1968) report states that the study showed that "the range of uncertainty in flood analysis, regardless of the method used, is still quite large" and that many questions concerning it remain unresolved.

In a follow-up study, Beard (1974) examined flood peaks from 300 stations scattered throughout the United States. Several probability distributions were tried, including the log Pearson type III, lognormal, Gumbel's extreme value distribution, and the 2- and 3-parameter gamma distributions. Beard concluded that only the lognormal and log Pearson type III with a regionalized skew coefficient were not greatly biased in estimating future flood frequencies. He stated that the latter distribution produced somewhat more consistent results but that "... regardless of the methodology employed, substantial uncertainty in frequency estimates from station data will exist ...".

In selecting a particular analytical form for a frequency curve, one may be tempted to select a distribution with a large number of parameters. Generally, the more parameters a distribution has, the better it will adapt to a set of data. However, for the sample size usually available in hydrology, the reliability in estimating more than 2 or 3 parameters may be quite low. Thus, a compromise must be made between flexibility of the distribution and reliability of the parameters.


Recognizing the short record lengths often available for frequency analysis, methods of augmenting natural data by synthetic data are being developed. In some cases the rainfall record pertaining to a watershed is much longer than its streamflow record. In this event it may be possible to calibrate a deterministic streamflow model to the watershed and then use the long rainfall record to generate a long synthetic streamflow hydrograph. This synthetic hydrograph can then be combined with existing data into a single frequency analysis. In the absence of rainfall records, it may be possible to transfer records from a nearby station or to stochastically generate a series of rainfall data. These data could then be used with the calibrated deterministic model to augment natural streamflow data. One might consider weighting the natural data more than the augmented data in the final frequency analysis. Regression and correlation techniques might be used to relate peak flows to rainfall or to peaks from nearby gages, and this relationship used to extend the available record.

It was because of the many factors and uncertainties that are involved in the selection of a probability distribution to use in flood frequency determinations that several agencies of the U.S. Federal government developed the guidelines published as "Guidelines for Determining Flood Flow Frequency," commonly known as Bulletin 17B (Interagency Advisory Committee on Water Data, 1982). Bulletin 17B has become a standard for flood frequency analysis of annual flood peak discharges. The developers of Bulletin 17B recognized that "there is no procedure or set of procedures that can be adopted which, when rigidly applied to the available data, will accurately define the flood potential of any given watershed. Statistical analysis alone will not resolve all flood frequency problems." The basic Bulletin 17B approach is to use the log Pearson type III distribution as explained above.
Because this distribution is a 3-parameter distribution, the coefficient of skew is used when estimating the parameters by the method of moments. The skew coefficient is sensitive to extreme flood values and thus difficult to estimate from small samples typically available for many hydrologic studies. Figure 7.2 presents a map of generalized skew coefficients for the logs of peak flows taken from Bulletin 17B of the Interagency Committee. The station skew coefficient calculated from observed data and generalized skew coefficients can be combined to improve the overall estimate for the skew coefficient. Under the assumption that the generalized skew is unbiased and independent of the station skew, the mean square error (MSE) of the weighted estimate is minimized by weighting the station and generalized skew in inverse proportion to their individual mean square errors according to the equation (Tasker 1978):

where Gw is the weighted skew coefficient, G is the station skew (from equation 7.1 3), G is the generalized skew (from figure 7.2), MSEc is the mean square error of the generalized skew, and MSEGis the mean square error of the station skew. MSEE is taken as a constant, 0.302, when the generalized skew is estimated from figure 7.2. MSEGcan be estimated from (Wallis, Matalas, and Slack 1974):

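$$MSE_G = 10^{\,A - B\log_{10}(N/10)} \tag{7.17}$$

where

$$A = \begin{cases} -0.33 + 0.08\,|G| & \text{if } |G| \le 0.90 \\ -0.52 + 0.30\,|G| & \text{if } |G| > 0.90 \end{cases}$$

$$B = \begin{cases} 0.94 - 0.26\,|G| & \text{if } |G| \le 1.50 \\ 0.55 & \text{if } |G| > 1.50 \end{cases} \tag{7.18}$$

and $N$ is the record length in years.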

It is recommended that if the generalized and station skews differ by more than 0.5, the data and flood producing characteristics of the watershed should be examined and possibly greater weight given to the station skew.
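The weighting of station and generalized skew described above can be sketched in a few lines of Python. This is a minimal sketch, not a definitive implementation: the function names are hypothetical, the coefficients for A and B are assumed to be those of Bulletin 17B, and the default of 0.302 for the generalized-skew MSE applies only when the generalized skew is read from the map.

```python
import math

# MSE of the generalized skew when it is read from the generalized skew map
MSE_GENERALIZED = 0.302


def mse_station_skew(g, n):
    """Approximate mean square error of station skew g for n years of record.

    Coefficients are assumed to follow Bulletin 17B.
    """
    ag = abs(g)
    a = -0.33 + 0.08 * ag if ag <= 0.90 else -0.52 + 0.30 * ag
    b = 0.94 - 0.26 * ag if ag <= 1.50 else 0.55
    return 10.0 ** (a - b * math.log10(n / 10.0))


def weighted_skew(g_station, g_general, n, mse_general=MSE_GENERALIZED):
    """Weight the two skews in inverse proportion to their mean square errors."""
    mse_station = mse_station_skew(g_station, n)
    return (mse_general * g_station + mse_station * g_general) / (
        mse_general + mse_station
    )


if __name__ == "__main__":
    # Hypothetical inputs: station skew 0.5, generalized skew -0.1, 40 years
    print(weighted_skew(0.5, -0.1, 40))
```

Because the station-skew MSE shrinks as the record lengthens, the weighted value moves toward the station skew for long records and toward the generalized skew for short ones.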

CONFIDENCE INTERVALS

Any streamflow record is but a sample of all possible such records. How well the sample represents the population depends on the sample size and the underlying population probability distribution, which is unknown. Both the form and parameters of the underlying distribution must be estimated. If a second sample of data were available, different estimates would certainly result for the parameters of the distribution even if the same distribution were selected. Different parameter estimates will obviously result in different return period flow estimates. If many samples were available, many estimates could be made of the distribution parameters and consequently many estimates could be made of return period flows, say $Q_{100}$. One could then examine the probabilistic behavior of these estimates of $Q_{100}$. The fraction of the $Q_{100}$'s that fell between certain limits could be determined. In actuality, we have just one sample of data from which to make estimates of $Q_T$. Statistical procedures are available for estimating confidence intervals about estimated values of $Q_T$ that give a measure of the uncertainty associated with $Q_T$. A confidence interval is constructed so that it contains the true value of $Q_T$ with a stated probability. A 90% confidence interval indicates that 90% of the intervals so calculated will contain the true value of $Q_T$. Letting $L_T$ and $U_T$ be the lower and upper confidence limits,

$$\text{prob}(L_T \le Q_T \le U_T) = \alpha$$

where $\alpha$ is the degree of confidence expressed as a percent. Exact determination of $L_T$ and $U_T$ depends on the underlying parent population. Bulletin 17B of the Interagency Committee presents some approximate relationships for confidence intervals:

$$L_T = \bar{X} + s_x K_{T,L} \qquad U_T = \bar{X} + s_x K_{T,U} \tag{7.20}$$

where $\bar{X}$ and $s_x$ are the sample mean and standard deviation and $K_{T,L}$ and $K_{T,U}$ are the lower and upper confidence coefficients. If a distribution like the log Pearson type III distribution is used, $\bar{X}$ and $s_x$ are based on the logarithms of the data and $L_T$ and $U_T$ are the logarithms of the confidence limits. Approximations for $K_{T,L}$ and $K_{T,U}$ based on large samples and the noncentral t-distribution are

$$K_{T,L} = \frac{K_T - \sqrt{K_T^2 - ab}}{a} \qquad K_{T,U} = \frac{K_T + \sqrt{K_T^2 - ab}}{a} \tag{7.21}$$

where

$$a = 1 - \frac{Z_c^2}{2(n-1)} \tag{7.22a}$$

$$b = K_T^2 - \frac{Z_c^2}{n} \tag{7.22b}$$

In these relationships, $K_T$ is the frequency factor of equation 7.9 and $Z_c$ is the standard normal deviate with cumulative probability $c = 50 + \alpha/2$, with $\alpha$ expressed as a percent. If $\alpha$ is 90%, then $c$ is 95%. The sample size is $n$. Confidence limits can be placed on frequency curves plotted on probability paper by making calculations such as those above for several values of $T$.
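The confidence-limit calculation can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and the expressions for the confidence coefficients are assumed to be the large-sample approximations of Bulletin 17B. The frequency factor in the usage lines is an illustrative value, not one taken from a table.

```python
import math
from statistics import NormalDist


def confidence_coefficients(k_t, n, alpha_pct=90.0):
    """Return (K_TL, K_TU) for frequency factor k_t and sample size n.

    alpha_pct is the degree of confidence in percent; the normal deviate Z_c
    has cumulative probability c = 50 + alpha/2 percent.
    """
    c = (50.0 + alpha_pct / 2.0) / 100.0
    z = NormalDist().inv_cdf(c)
    a = 1.0 - z ** 2 / (2.0 * (n - 1))
    b = k_t ** 2 - z ** 2 / n
    root = math.sqrt(k_t ** 2 - a * b)
    return (k_t - root) / a, (k_t + root) / a


def confidence_limits(mean, std, k_t, n, alpha_pct=90.0, log_space=True):
    """Confidence limits on a quantile; mean and std are of the logs if log_space."""
    k_lo, k_hi = confidence_coefficients(k_t, n, alpha_pct)
    lo, hi = mean + std * k_lo, mean + std * k_hi
    return (math.exp(lo), math.exp(hi)) if log_space else (lo, hi)


if __name__ == "__main__":
    # Illustrative inputs: log mean 8.568, log std 0.681, K_T = 2.54, n = 53
    print(confidence_limits(8.568, 0.681, 2.54, 53))
```

Setting `log_space=False` returns the limits in the units of the data, which suits a distribution fitted to untransformed flows.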

TREATMENT OF ZEROS

Most hydrologic variables are bounded on the left by zero. A zero in a set of data that is being logarithmically transformed requires special handling. One solution is to add a small constant to all of the observations. Another method is to analyze the nonzero values and then adjust the relation to the full period of record. This method biases the results because the zero values are essentially ignored. A third and theoretically sounder method is to use the theorem of total probability (equation 2.10):

$$\text{prob}(X \ge x) = \text{prob}(X \ge x \mid X = 0)\,\text{prob}(X = 0) + \text{prob}(X \ge x \mid X \ne 0)\,\text{prob}(X \ne 0)$$

Because $\text{prob}(X \ge x \mid X = 0)$ is zero, the relationship reduces to

$$\text{prob}(X \ge x) = \text{prob}(X \ge x \mid X \ne 0)\,\text{prob}(X \ne 0)$$

In this relationship, $\text{prob}(X \ne 0)$ would be estimated by the fraction of nonzero values and $\text{prob}(X \ge x \mid X \ne 0)$ would be estimated by a standard analysis of the nonzero values with the sample size taken to be equal to the number of nonzero values. This relation can be written as a function of cumulative probability distributions:

$$1 - P_X(x) = k\,[1 - P_X^*(x)]$$

or

$$P_X(x) = 1 - k + k\,P_X^*(x) \tag{7.23}$$

where $P_X(x)$ is the cumulative probability distribution of all $X$ ($\text{prob}(X \le x \mid X \ge 0)$), $k$ is the probability that $X$ is not zero, and $P_X^*(x)$ is the cumulative probability distribution of the nonzero values of $X$ (i.e., $\text{prob}(X \le x \mid X \ne 0)$). This type of mixed distribution with a finite probability that $X = 0$ and a continuous distribution of probability for $X > 0$ was discussed in chapter 2. Jennings and Benson (1969) have demonstrated the applicability of this approach to analyzing flood flow frequencies with zeros present. Equation 7.23 can be used to estimate the magnitude of an event with return period $T_X(x)$ by solving first for $P_X^*(x)$ and then using the inverse transformation of $P_X^*(x)$ to get the value of $X$. For example, the 10-year event with $k = 0.95$ is found to be the value of $X$ satisfying

$$P_X^*(x) = \frac{P_X(x) - 1 + k}{k} = \frac{0.90 - 1 + 0.95}{0.95} = 0.895$$

Note that it is possible to generate negative estimates for $P_X^*(x)$ from equation 7.23. For example, if $k = 0.25$ and $P_X(x) = 0.50$, the estimated $P_X^*(x)$ is

$$P_X^*(x) = \frac{0.50 - 1 + 0.25}{0.25} = -1.0$$

This merely means that the value of $X$ corresponding to $P_X(x) = 0.50$ is zero. This makes sense because $P_X(x) = 0.50$ corresponds to the 2-year flow, or the flow equaled or exceeded every other year. If only 25%, or 1/4, of the annual flows are greater than zero, then the flow exceeded every other year must be zero.

Example 7.1. Seventy-five years of peak flow data are available from an annual series; 20 of the values are zero; and the remaining 55 values have a mean of 100 cfs, a standard deviation of 35.1 cfs, and are lognormally distributed. (a) Estimate the probability of a peak exceeding 125 cfs. (b) Estimate the magnitude of the 25-year peak flow.

Solution: (a)

$$\text{prob}(X > 125) = 1 - \text{prob}(X \le 125) = 1 - P_X(125)$$

Applying equation 7.23 with $k = 55/75 = 0.733$,

$$P_X(125) = 1 - k + k\,P_X^*(125)$$

$P_X^*(125)$ can be evaluated by solving equation 7.12 for $K_N$ and then using the table for the normal distribution to get the desired probability:

$$K_N = \frac{\ln 125 - \bar{Y}}{s_y}$$

From equations 6.35 and 6.36, with $C_v = 35.1/100 = 0.351$,

$$s_y = \sqrt{\ln(C_v^2 + 1)} = 0.341 \qquad \bar{Y} = \ln \bar{X} - \frac{s_y^2}{2} = 4.547$$

so that $K_N = (4.828 - 4.547)/0.341 = 0.825$. From a table of the standard normal distribution, this $K_N$ for a $C_v$ of 0.351 corresponds to a $\text{prob}(X_N < x)$ of 0.795.

$$P_X(125) = 1 - 0.733 + 0.733(0.795) = 0.850$$

$$\text{prob}(X \ge 125) = 1 - P_X(125) = 0.15 \quad \text{or} \quad T = 1/0.15 = 6.7 \text{ yrs}$$

The probability of a peak flow in any year exceeding 125 cfs is 0.15. The conditional probability of a peak exceeding 125 cfs given that the peak is not zero is 1 - 0.795 = 0.205.

(b)

$$P_X^*(x) = \frac{P_X(x) - 1 + k}{k} = \frac{1 - (1/T) - 1 + k}{k} = \frac{1 - 0.04 - 1 + 0.733}{0.733} = 0.945$$

The value of $X$ corresponding to $P_X^*(x) = 0.945$ can be obtained from equation 7.12. $Z$ for $P(x) = 0.945$ is 1.60. Therefore, $X_{25} = \exp(4.547 + 0.341 \times 1.60) = 163$ cfs.
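The mixed-distribution calculation of Example 7.1 can be checked numerically. The sketch below uses hypothetical function names and assumes the usual lognormal moment relations for the mean and standard deviation of the logarithms; with the example's inputs it should return values close to 0.15 and 163 cfs.

```python
import math
from statistics import NormalDist


def lognormal_params(mean, std):
    """Mean and standard deviation of the logs from the moments of the data."""
    cv = std / mean
    s_y = math.sqrt(math.log(1.0 + cv ** 2))
    m_y = math.log(mean) - 0.5 * s_y ** 2
    return m_y, s_y


def exceedance_with_zeros(x, k, m_y, s_y):
    """prob(X > x) when a fraction (1 - k) of the record is zero."""
    p_star = NormalDist(m_y, s_y).cdf(math.log(x))  # cdf of the nonzero values
    p_all = 1.0 - k + k * p_star                    # cdf of all values
    return 1.0 - p_all


def quantile_with_zeros(T, k, m_y, s_y):
    """T-year event; returns 0 when the required P* is not positive."""
    p_star = (1.0 - 1.0 / T - 1.0 + k) / k
    if p_star <= 0.0:
        return 0.0
    return math.exp(NormalDist(m_y, s_y).inv_cdf(p_star))


if __name__ == "__main__":
    k = 55 / 75  # fraction of nonzero annual peaks
    m_y, s_y = lognormal_params(100.0, 35.1)
    print(exceedance_with_zeros(125.0, k, m_y, s_y))
    print(quantile_with_zeros(25, k, m_y, s_y))
```

The early return in `quantile_with_zeros` covers the case of a negative estimate of the nonzero cumulative probability, for which the event magnitude is zero.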

Example 7.2. Table 7.6 contains annual peak flow data for Black Bear Creek near Pawnee, Oklahoma, for the years 1945 through 1997. (a) Plot the data on normal and lognormal probability paper. (b) Plot the "best" fitting normal, lognormal, extreme value type I, and log Pearson type III distributions on the plot of part a. (c) Estimate the 100-year peak flow based on the four distributions of part b. (d) Estimate the 90% confidence intervals on the log Pearson type III estimates. (e) Estimate the 100-year peak flow using the log Pearson type III with a weighted skew coefficient based on the station skew and the generalized skew coefficients.

Table 7.6. Annual peak flow data for Black Bear Creek near Pawnee, Oklahoma (columns: Year, Flow (cfs))

Solution: (a) The plotting positions are calculated by ranking the data from largest to smallest and then using the relationship pp = m/(n + 1), where m is the rank and n is the number of observations (53). Since the largest observation is 30,200 cfs, it is assigned a pp of 1/54, or 0.0185. The second largest value is 19,200 cfs with a pp of 2/54, or 0.0370, and so forth until the smallest value of 1,560 cfs with a pp of 53/54, or 0.9815. The data are plotted in figure 7.3.
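The ranking and plotting-position step can be written compactly. In this sketch the function name is hypothetical and the short list of flows is made up for illustration; it is not the Table 7.6 record.

```python
def plotting_positions(flows):
    """Rank flows from largest to smallest and assign pp = m / (n + 1).

    Returns a list of (flow, exceedance plotting position) pairs.
    """
    ranked = sorted(flows, reverse=True)
    n = len(ranked)
    return [(q, m / (n + 1)) for m, q in enumerate(ranked, start=1)]


if __name__ == "__main__":
    # Hypothetical annual peaks in cfs
    for flow, pp in plotting_positions([4200, 30200, 1560, 19200, 6100]):
        print(flow, round(pp, 4))
```

Each plotting position is an exceedance probability, so the largest flow receives the smallest value.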

(b) The best fitting lines for the various distributions can be obtained by calculating several points from equation 7.9. The basic statistics of the data are found to be

Statistic    Data     ln of data
Mean         6683     8.568
Std dev      5337     0.681
Skewness     2.262    0.296

The next step is to determine the appropriate frequency factors for various return periods for the four distributions. The frequency factor for the normal and lognormal distributions comes from the standard normal distribution. $K_T$ for the extreme value and log Pearson distributions comes from equations 7.15 and 7.14, respectively. Sample calculations follow for a return period of 20 years.

Fig. 7.3a. Flood frequency curves for Black Bear Creek using the standard normal z and arithmetic flow scales.

Fig. 7.3b. Flood frequency curves for Black Bear Creek using the standard normal z and logarithmic flow scales.

Fig. 7.3c. Flood frequency curves for Black Bear Creek using normal probability paper.

Fig. 7.3d. Flood frequency curves for Black Bear Creek using lognormal probability paper.

Legend for figures 7.3a-d: normal, lognormal, extreme value, log Pearson, data.

Normal distribution:

Lognormal distribution:

$$X_T = \exp\left[\bar{Y}(1 + C_{vy}K_T)\right] = \exp\left[8.568\left(1 + \frac{0.681}{8.568}\,1.645\right)\right] = 16132$$

Extreme value distribution:

Log Pearson distribution: The calculations must be based on the logarithms.

$$X_T = \exp\left[\bar{Y}(1 + C_{vy}K_T)\right] = \exp\left[8.568\left(1 + \frac{0.681}{8.568}\,1.724\right)\right] = 17029$$
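The frequency-factor calculations for a 20-year return period can be sketched as below. Two forms are assumptions on my part rather than quotations of equations 7.14 and 7.15: the Wilson-Hilferty approximation for the Pearson type III frequency factor, and the large-sample Gumbel expression for the extreme value type I factor. The function names are hypothetical. With a skew of 0.296 the Wilson-Hilferty form gives a factor close to the 1.724 used in the log Pearson sample calculation.

```python
import math
from statistics import NormalDist


def k_normal(T):
    """Standard normal deviate with exceedance probability 1/T."""
    return NormalDist().inv_cdf(1.0 - 1.0 / T)


def k_gumbel(T):
    """Large-sample extreme value type I frequency factor (assumed form)."""
    return -(math.sqrt(6.0) / math.pi) * (
        0.5772 + math.log(math.log(T / (T - 1.0)))
    )


def k_pearson3(T, skew):
    """Wilson-Hilferty approximation to the Pearson type III frequency factor."""
    z = k_normal(T)
    if abs(skew) < 1e-8:
        return z
    g = skew / 6.0
    return (2.0 / skew) * ((1.0 + g * z - g * g) ** 3 - 1.0)


if __name__ == "__main__":
    mean, std = 6683.0, 5337.0               # statistics of the flows
    m_y, s_y, g_y = 8.568, 0.681, 0.296      # statistics of the natural logs
    T = 20
    print("normal       ", mean + std * k_normal(T))
    print("lognormal    ", math.exp(m_y + s_y * k_normal(T)))
    print("extreme value", mean + std * k_gumbel(T))
    print("log Pearson  ", math.exp(m_y + s_y * k_pearson3(T, g_y)))
```

Small differences from the hand calculations are to be expected, because the sketch carries more digits in the normal deviate than a printed table does.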

Figures 7.3a-d show the resulting plots of the data and the best fitting distributions. The four plots all contain the same information in different formats. The first two plots use the z transformation and the second two plots use normal probability scales. Both arithmetic and logarithmic scales are shown for flow. Note that the normal distribution plots as a straight line when the arithmetic scale is used and the lognormal distribution plots as a straight line when the logarithmic scale is used. (c) The 100-year flow estimates are contained in the last line of the above table. (d) The calculations of the confidence intervals are contained in the following table:

Columns (4) $K_{T,L}$ and (5) $K_{T,U}$; column (5): -1.7499, 0.1786, 1.1122, 1.6588, 2.1360, 2.2795, 2.6994, 3.0891


Fig. 7.4. Flood frequency curves for Black Bear Creek with confidence intervals.

Explanation of columns in above table: (1) Return period; (2) From equation 7.14; (3) From equation 7.22b; (4) and (5) From equation 7.21; (6) and (7) Equation 7.20; (8) Exp(col(6)); (9) Exp(col(7)); (10) Last column of previous table.

The results are plotted in figure 7.4.

(e) Station skew, $G$, is 0.296. The generalized skew from figure 7.2 is -0.22. The weighted skew coefficient is obtained from equations 7.16 to 7.18; the corresponding frequency factor for $T = 100$ is 2.3082, giving

$$Q_{100} = \exp\left[\bar{Y}(1 + C_{vy}K_T)\right] = \exp\left[8.568\left(1 + \frac{0.681}{8.568}\,2.3082\right)\right] = 25333 \text{ cfs}$$


Example 7.3. Estimate the 100-year flow for Black Bear Creek using the GEV distribution.

Solution: From example problem 7.2, $\bar{X} = 6683$, $s_x = 5337$, and $C_s = 0.296$.

From equations 3.71

From equations 3.74

From equations 6.81-6.84

$$\xi = \lambda_1 + \frac{\alpha\,[\Gamma(1 + \kappa) - 1]}{\kappa} = 4108$$

From equation 6.85 for $T = 100$ and $P_X(x_T) = 0.99$

$$x_T = \xi + \frac{\alpha}{\kappa}\left(1 - \left[-\ln(P_X(x_T))\right]^{\kappa}\right) = 29{,}803 \text{ cfs}$$
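A GEV fit of this kind can be sketched in Python. The sketch rests on several assumptions: the parameters are estimated from sample L-moments using Hosking's approximation for the shape parameter, which may or may not be the exact form of equations 6.81-6.84; the function names are hypothetical; and the flows in the usage lines are made up, not the Black Bear Creek record.

```python
import math


def sample_l_moments(data):
    """Return (lambda1, lambda2, tau3) from unbiased probability weighted moments."""
    x = sorted(data)
    n = len(x)
    b0 = sum(x) / n
    b1 = sum(i / (n - 1) * x[i] for i in range(n)) / n
    b2 = sum(i * (i - 1) / ((n - 1) * (n - 2)) * x[i] for i in range(n)) / n
    l1 = b0
    l2 = 2.0 * b1 - b0
    l3 = 6.0 * b2 - 6.0 * b1 + b0
    return l1, l2, l3 / l2


def fit_gev(l1, l2, t3):
    """GEV location xi, scale alpha, shape kappa (Hosking's approximation)."""
    c = 2.0 / (3.0 + t3) - math.log(2.0) / math.log(3.0)
    kappa = 7.8590 * c + 2.9554 * c * c
    if abs(kappa) < 1e-6:
        raise ValueError("kappa is near zero; use the extreme value type I form")
    gam = math.gamma(1.0 + kappa)
    alpha = l2 * kappa / ((1.0 - 2.0 ** (-kappa)) * gam)
    xi = l1 + alpha * (gam - 1.0) / kappa
    return xi, alpha, kappa


def gev_quantile(p, xi, alpha, kappa):
    """Flow with cumulative (non-exceedance) probability p."""
    return xi + (alpha / kappa) * (1.0 - (-math.log(p)) ** kappa)


if __name__ == "__main__":
    # Hypothetical annual peaks in cfs
    flows = [1560, 2300, 3100, 3900, 4400, 5200, 6100, 7400, 9800, 19200, 30200]
    xi, alpha, kappa = fit_gev(*sample_l_moments(flows))
    print(gev_quantile(0.99, xi, alpha, kappa))
```

A negative shape parameter indicates a heavy upper tail, which is common for annual flood peaks; the quantile function is the same expression used in the last step of the example.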

TRUNCATION OF LOW FLOWS

In some situations the lower flows in an annual series of peak flow data may not truly represent flood flows. Such a situation may arise when no particularly heavy rainfalls or snowmelts occur over a period of a year. It may then be desirable to truncate or delete these low peak flow values and analyze the remaining peaks, adjusting the probabilities as discussed in the Treatment of Zeros section. Figure 7.5 shows the estimated 100-year peak flow for Black Bear Creek using data from example problem 7.2, the lognormal distribution, and various truncation levels. For example, if a truncation level of 3000 cfs is selected, the 11 values less than 3000 would be truncated and the k in equation 7.23 would be (53 - 11)/53, or 0.79.

Fig. 7.5. Estimated 100-year flow as a function of truncation level (cfs) for Black Bear Creek.

USE OF PALEOHYDROLOGIC DATA

Baker (1987) has defined paleohydrology as the study of past or ancient flood events that occurred prior to the time of human observation or direct measurement. Paleohydrologic techniques provide means of obtaining data over periods of time much longer than are available from systematic records or even historical data. Paleohydrologic data may enable the evaluation of long-term hydrologic conditions by complementing existing short-term systematic and historical records, providing information at ungaged locations, and helping reduce uncertainty in flow estimates. Paleohydrology is discussed by Baker (1987), Kochel and Baker (1982), Costa (1987), Jarrett (1991), and Stedinger and Baker (1987). Once the magnitude and year of occurrence of a paleoflood are determined, that flow value can be assigned a return period. For example, if it is determined that 3000 years ago there was a flood in excess of any flow since that time, the flow could be assigned a return period of 3000 years. Questions of the stationarity of flood flows, the dating of paleofloods, and the difficulty of estimating the magnitude of paleofloods must be addressed in any paleoflood study.

PROBABLE MAXIMUM FLOOD

The probable maximum flood (PMF) is the flow that can reasonably be expected under conditions that maximize runoff, that is, from the most severe combination of meteorologic and hydrologic conditions for the drainage basin in question.
A PMF does not directly enter into a flood frequency analysis since the probability of such a flood is unknown. The PMF may provide an upper bound to a frequency analysis. The concept of a PMF has been criticized (Yevjevich 1968) as being neither probable nor maximum, yet it has found wide use for hydrologic designs for facilities whose failure would endanger human life or cause great economic loss.


DISCUSSION OF FLOOD FREQUENCY DETERMINATIONS

A flood frequency study concerning the American River near Sacramento, California (National Research Council 1999), illustrates many of the difficulties in flood frequency analysis. Data available in that study included 93 years of systematic data, historical data, paleohydrologic data, and data derived through the use of rainfall-runoff models. Still, considerable controversy exists as to the proper estimate for the 100- and 200-year flood flows.

The foundation of any frequency analysis is the selection of a particular probability distribution for describing the data. The parameters of this distribution are estimated and the magnitudes of events for various return periods are calculated. Methods for plotting the observed data on probability paper and for constructing the best fitting line according to the selected distribution have also been discussed. At this point it should be clear that there is nothing inherently hydrologic about frequency analysis procedures. They are simply statistical techniques that operate on numbers. The fact that the numbers being used are peak flows is of no concern to the technique. It should be of great concern to the analyst, however.

Statistical frequency analysis simply attempts to extract information about the probabilistic behavior of a set of numbers from the numbers themselves. In hydrologic frequency analysis this probabilistic behavior is then generally extrapolated by the analyst to frequencies of occurrence well beyond that contained in the original set of numbers. From these extrapolations the flows having return periods of 25, 50, 100, or even 500 years are determined. The straightforward application of hydrologic frequency analysis as generally employed uses little or no hydrologic knowledge.
In actuality, rare flows are determined by the hydrologic conditions that exist at the time of these flows and not by the statistical behavior of a sample of maximum peak flows that may have occurred some time in the past. Resolving the apparent conflict between these statements is what separates the hydrologist from the statistician. Statistics are descriptive of a set of observed data. Statistics do not define a cause and effect relationship or a physical relationship. Any conclusion drawn on the basis of a statistical frequency analysis assumes that the sample of data on hand is representative of a wider range of data known as the population. In hydrologic terms what this means is that if we have a sample of 15 years or so of observed annual maximum peak flows and use this data to estimate the 100-year flood, we are assuming the hydrologic behavior of the basin during the 100-year flood is somehow imprinted in the 15 years of observed data and that the statistical technique being used can uncover this imprint and use it. To determine if this is truly the case, the hydrology of the basin must be examined. Some of the questions that must be answered are:

1. Is the type of storm that is likely to produce the 100-year flow represented in the observed sample?

2. Is the contributing area of the basin the same for extreme floods as it is for small ones?

3. Are there ponds and reservoirs that may discharge at high rates during rare floods and not during smaller flows? What is the possibility of a dam breach and what would be the resulting flow?


4. Are the channel flow and storage characteristics the same for extreme flows as they are for smaller flows?

5. Are land use and soil characteristics such that flows from rare storms may relate to precipitation in a manner different from more common storms?

6. Are there seasonal effects such that rare floods are more likely to occur in a different season than the more common floods?

7. Is the rare flood represented in the sample of data? If so, how is it treated? Is it assigned a return period of 15 years when in fact its return period may be much greater than that?

8. Are changes going on within the basin that may cause change in the hydrologic response of the basin to rainstorms?

9. Are there climatic changes occurring that may influence flood flow frequencies?

These last few paragraphs paint a discouraging picture for flood frequency analysis. That need not be the case as long as one does not discard hydrologic knowledge in the process. Often, the questions posed can be answered in such a way as to make the statistical analysis valid. At other times, when problems with the statistical procedures are recognized, adjustments can be made in the resulting flow estimates to more accurately reflect the hydrology of the situation. Hydrologic frequency analysis should be used as an aid in estimating rare floods. Sometimes the estimates made on the basis of the statistical frequency analysis can be taken as the final estimate. Sometimes the statistical estimate may need to be adjusted to better reflect the hydrology of the situation.

It should be kept in mind that other hydrologic estimation techniques suffer from some of the same difficulties as do the statistical techniques. For example, if a hydrologic model is being employed, the parameters of the model must be estimated in some way. This is generally done on the basis of observed data from the basin in question, from observed data from a similar basin, or from so-called physical relationships such as Manning's equation, infiltration parameters, and so forth, and a set of accompanying tables. Regardless of how the parameters are estimated, the same type of questions regarding these estimates and the nature of the hydrologic model itself must be answered as outlined above for frequency analysis estimates. We cannot substitute mathematical and empirical relationships for hydrologic knowledge any more than we can substitute statistics for hydrologic knowledge.

Based on this discussion, one might conclude that the magnitude of rare events should not be estimated because the estimates may be so uncertain. Generally, however, this is not one of the options available. An estimate must be made.
Hydrology must not be ignored in making this estimate. Statistical, modeling, or empirical flow estimates should be made and then adjusted, if required, to reflect the hydrologic situation. This is not to say a factor of safety is to be applied. Adjustments should be based on hydrology, not rules of thumb.


REGIONAL FREQUENCY ANALYSIS

Regional flood frequency analysis has three major components, namely, delineation of homogeneous regions, determination of appropriate probability density functions (or frequency curves) of the observed data, and the development of a regional flood frequency model (i.e., a relationship between flows of different return periods, basin characteristics, and climatic data).

Delineation of Homogeneous Regions

Effective regionalization requires defining regions, generally geographic regions, that are similar and then capturing hydrologic relationships, generally empirical, for the region. The reason for this is that better predictions should result using data from a hydrologically similar region than from a dissimilar region. The standard error of estimate for homogeneous regions should be less than the standard error of estimate obtained without dividing the area into homogeneous regions. Regions are generally defined based on several considerations. Among these considerations are political boundaries, catchment boundaries, geologic boundaries, and climatic boundaries. Regional definitions require judgment. Generally, regions cannot be defined independent of the personal judgment of the individual doing the analysis. A particular technique for assigning data to homogeneous groupings, known as Cluster Analysis, is covered in chapter 12. Often, the actual definition of a region is arrived at somewhat by trial and error. Data are analyzed, grouped, and examined. Data may be moved from one group to another in an attempt to improve the quality of the analytic relationship being developed.

Although regional estimation techniques, such as the regional flood frequency analysis, have been useful in the transfer of data from gaged to ungaged sites, they have also ushered in several problems. If all the gaged stations simply represent realizations of the same underlying population, then a straightforward pooling approach would be appropriate.
The Index Flood method discussed earlier makes such an assumption. This method assumes that the region from which the observed data are obtained is homogeneous. The first task that must be completed in the process of regional flood frequency analysis is therefore the identification of homogeneous regions. Homogeneous regions may be defined as regions having similar hydrologic, climatic, and physiographic characteristics. The criteria most often used to delineate homogeneous regions are based on either geographic considerations (basin characteristics, weather regimes) or basin response characteristics (such as probability distributions and regional statistical flood parameters, e.g., skewness, coefficient of variation, etc.). There seems to be no uniquely objective approach to the delineation of homogeneous regions. It is generally agreed, however, that grouping basins within a homogeneous region will yield regional relationships with lower standard errors than those for entirely different areas (Kite 1977). Residual analysis has occasionally been used as a tool for defining homogeneous regions. The residual pattern from a linear regression of a given design flood for the entire study area is examined and regions are then delineated on the basis of geographic proximity of the positive and negative residuals (Gingras and Adamowski 1993). A second approach to defining homogeneous regions is to group all stations with the same probability distributions or those that have constant distribution parameters (Hosking et al. 1985a,b).


De Coursey (1973) applied discriminant analysis, a multivariate procedure, to flood data from Oklahoma to form groups of basins having a similar flood response. Burn (1988, 1989, 1990) described techniques for identifying homogeneous regions based on the correlation structure of the observed data, cluster analysis, and the Region of Influence (ROI) approach, respectively. The importance of identifying hydrologically homogeneous regions was further demonstrated by Lettenmaier et al. (1987) in a study that showed the effect on extreme flow estimation of regions containing heterogeneity. Of the many approaches that have been used to identify homogeneous regions, cluster analysis, a multivariate technique, has been gaining prominence in this field. This is primarily because, although cluster analysis does not entirely eliminate the subjective decisions associated with the other methods, it greatly facilitates interpretation of a data set. The objective of cluster analysis is to group gaging stations that have similar hydrologic or basin characteristics. The most common similarity measure in cluster analysis is the Euclidean distance.

Historical Development

Weldu (1995) reviewed several articles on regional flow estimation. The earliest approach to the regionalization problem was to use empirical equations relating flood flow to drainage area within a particular region (Benson 1962c). The formulas were based on few data for a particular region and contain one or more constants whose values are empirically determined. Such a formula, in generalized form, is

where Q is the flow, C is a coefficient related to the region, and A is the drainage area. The above equation, although simple to derive and apply, addresses neither the frequency of the flow nor the effect of variations in precipitation or topography on the flows. The various "culvert formulas" used by railroad and highway engineers, such as the Talbot formula (AISI 1967), are of this general type. The Talbot formula is widely used and is denoted by

where a = cross-sectional area of culvert in ft². Various empirical formulas were later devised that attempted to include the concept of frequency and that involved rainfall in computing flood peaks. Perhaps the most widely used of such formulas is the Rational Formula (Shaw 1983), which expresses the peak flow ($Q_p$) in terms of the rainfall intensity ($i$) with the desired return period, drainage area ($A$), and a coefficient that accounts for basin characteristics ($C$) as

One major weakness in this type of empirical formula is that the coefficients will remain constant only within regions in which other hydrologic factors vary little, which implies that the regions must of necessity be fairly small.
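As a small illustration, the Rational Formula is commonly written $Q_p = CiA$. The sketch below assumes U.S. customary units (intensity in inches per hour and area in acres, giving a peak flow in approximately cubic feet per second, since the unit conversion factor of about 1.008 is usually dropped). The function name and the input values are hypothetical.

```python
def rational_peak_flow(c, intensity_in_per_hr, area_acres):
    """Peak flow (approximately cfs) from the Rational Formula Q_p = C i A.

    c is the dimensionless runoff coefficient, expected to lie in [0, 1].
    """
    if not 0.0 <= c <= 1.0:
        raise ValueError("runoff coefficient should lie between 0 and 1")
    return c * intensity_in_per_hr * area_acres


if __name__ == "__main__":
    # Hypothetical basin: C = 0.4, i = 2.5 in/hr, A = 100 acres
    print(rational_peak_flow(0.4, 2.5, 100.0))
```

The return period of the computed peak is taken to be that of the rainfall intensity supplied, which is how the formula brings frequency into the estimate.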

Statistical Methods

Other methods of regionalization include the application of statistical techniques to hydrologic data. Statistics provides a means of reducing a mass of data to a few useful and meaningful figures. The distribution of the data could be represented by a probability density function or a curve that defines the frequency of values of the variable. Statistical procedures may also provide methods of relating dependent variables to one or more independent variables through regression analysis. Most applications of statistical techniques require a considerable amount of data. The value of the analysis is directly related to the quantity and quality of the data that are available. Often, hydrologic estimates are required at locations where there is little or no data. The design of a bridge opening or culvert, for example, on one of the many streams for which there is no data may be required. Regionalization is an attempt to use data from locations in the same region as the point of interest to make hydrologic estimates at the point of interest. Regional flood frequency models have been used extensively in hydrology for transferring data from gaged to ungaged sites. Two such regionalization procedures, namely the index-flood and regression-based methods, have evolved over the years and have been used extensively in regional flood frequency analysis. This treatment will focus on flood frequency analysis. The goal is to estimate flood flows of various return periods for streams and locations where there is little or no data.

Frequency Distributions

After a homogeneous region has been identified, the next stage in the specification of the regional flood frequency model is the choice of appropriate frequency distribution(s) to represent the observed data. The distributions most commonly used in hydrology are normal, lognormal, Gumbel extreme value distribution (type I), and log Pearson type III. The U.S.
Water Resources Council (1982) compared different probability distribution functions and recommended the log Pearson type III as the basic distribution for defining the annual flood series. The Council also recommended that this distribution be fitted to sample data using the method of moments. In a more detailed study, the U.K. Natural Environment Research Council (1975) found that 3-parameter distributions such as the log Pearson type III and the generalized extreme value distribution (GEV) fit data from 35 annual flood series better than the 2-parameter distribution functions. The log Pearson type III (LP III) distribution has been used extensively in flood frequency analysis since its favorable recommendation by the Water Resources Council in 1976. The frequent use of the LP III attracted a number of detailed mathematical and statistical studies regarding its role in flood frequency analysis. Various alternative fitting techniques for the LP III distribution have been suggested by Matalas and Wallis (1973) and Condie (1977). These researchers carried out comparisons between the method of moments and the method of maximum likelihood, and concluded that the latter method yielded solutions that are less biased than the method of moments estimates. Bobee (1975) and Bobee and Robitaille (1977) suggested using moments of the original data instead of using moments of the logarithmic values. Nozdryn-Plotnicki and Watt (1979) studied the method of moments, the method of maximum likelihood, and the procedure proposed by Bobee (1975), found that none of the methods was superior to the others, and concluded that the method of moments was the best because of its computational ease.

An important step in a regional flood frequency analysis is to ensure that the data that are being used are of good quality. The data must be representative of the region and they must be representative of the long-term flood characteristics of the region. Data on the physical characteristics of the catchments and any other data that are used must be of good quality. There are no regional flood frequency techniques that can overcome faulty data.

After collecting and screening the data, the first step is to fit various pdfs to the observed peak flow data at locations where sufficient data exist. Once all of the available data are fit to the candidate distributions, assumptions and statistical tests must be made in an effort to select the distribution that best describes each data set. This selection of pdfs is based on probability plots of observed data along with the fitted distributions. Statistical tests such as the chi-square test and the Kolmogorov-Smirnov test discussed in chapter 8 may be made. Personal judgment based on the probability plots is also used. Once the best fitting pdf is selected for each data set, that pdf can be used to estimate the peak flow for various return periods. The pdf that best fits the data for the majority of the stations or locations included in the study is generally used for all locations. Several options are now available for the next phase of the analysis:

1. Develop a relationship between the peak flows of various return periods and measurable characteristics of the catchments producing the flows ($\mathbf{Q}_T = f(\mathbf{X})$).

2. Develop a relationship between parameters of the pdf that best fits a majority of the flow data and measurable characteristics of the catchments producing the flows ($\boldsymbol{\theta} = f(\mathbf{X})$).

3.
Develop a dimensionless flood frequency curve for the region plus a relationship between some index flood for each catchment and measurable characteristics of the catchments (QT/Q vs T and Q, = f(X)). Regression-Based Procedures All three of the options mentioned above require relationships with measurable characteristics from the catchments for which flow data are available. Characteristics that might be included in the analysis include precipitation variables, such as mean annual rainfall and 24-hour rainfalls for various return periods. Physical characteristics such as catchment area, land slopes, stream lengths, stream slopes, and land use might be included. Soils information such as permeability and water holding capacities can be used. There are also a large number of geomorphic parameters such as drainage density, catchment shape factors, and measures of elevation changes that might be included. The result of this data collection effort will be a matrix of data having n observations on m X is an m X n matrix. The n observations on each catchment come about catchments. Therefore from making a single measurement or observation on each of the n characteristics included in the analysis. The m represents the number of catchments in the study. Thus, a study that involved

30 catchments and 12 characteristics on each catchment would produce a data matrix having 30 rows (one for each catchment) and 12 columns (one for each characteristic).

A regional flood frequency approach, in addition to the m × n data matrix of independent variables, will include an m × p data matrix of dependent variables, which are the peak flow estimates for the various return periods. With return periods of 2, 5, 10, 25, 50, and 100 years, p will be 6. With 30 catchments, a 30 × 6 matrix of dependent variables, where the rows are the catchments and the columns correspond to the various return periods, will result. Multiple regression techniques can now be used to relate the dependent variables to the independent variables based on the 30 observations on hand. Regression based on the regional data and based on logarithms of the regional data can be investigated. Through the estimation process based on multiple regression, the independent variables that are not useful in predicting the dependent variables can be eliminated. The goal is to determine whether the peak flow for the various return periods can be estimated based on a small subset of the original n catchment characteristics. Although not always possible, it is desirable to use the same subset of independent variables for predicting each of the p dependent variables. This will help to ensure a consistent set of predictions for various return periods on a particular catchment.

Using multiple regression to estimate the magnitude of a flood event that will occur on average once in T years, denoted by $Q_T$, from physical and climatic characteristics of the watershed has a long history (Benson 1962c, 1964; Benson and Matalas 1967; Thomas and Benson 1970). Sauer (1974) developed regional equations relating flood frequency data for unregulated streams in Oklahoma to basin characteristics through multiple linear regression techniques. Similar studies have been done throughout the United States (Jennings et al. 1993).
The Hydrology Committee of the U.S. Water Resources Council (1981) investigated numerous methods of estimating peak flows from ungaged watersheds and found that the results obtained using regional regression compared favorably with more complex watershed models. A logarithmic transformation of the $Q_T$, physiographic, and climatic data may be required to linearize the regression model and to satisfy other assumptions of regression analysis. The relationship most commonly used is of the form

$$Q_T = b_0 X_1^{b_1} X_2^{b_2} \cdots X_n^{b_n}$$

where $X_1, X_2, \dots, X_n$ represent the basin and climatic data, and $b_0, b_1, b_2, \dots, b_n$ are the regression parameters. Regression parameters may be estimated using ordinary least squares (OLS), weighted least squares (WLS), or generalized least squares (GLS). OLS does not account for unequal variances in flood characteristics or any correlations that may exist between streamflows from nearby stations. To overcome these deficiencies in the OLS method, Tasker (1980) proposed the use of WLS regression with the variance of the errors of the observed flow characteristics estimated as an inverse function of the record length. Using a weighting function of

where N is the number of stations, $t_0$ and $t_1$ are constants, and $n_i$ is the record length of station i, Tasker (1980) reported that the WLS produced a smaller expected standard error of prediction than the OLS. Using Monte Carlo simulation, Stedinger and Tasker (1985) demonstrated that the WLS and GLS provide more accurate estimates of regression parameters than the OLS. A major drawback of the WLS and GLS is the need to estimate the covariance matrix of the residual errors. The covariance matrix of the residual errors is a function of the precision with which the model can predict the streamflow values.

Estimating a peak flow for some return period on an ungaged catchment now becomes an exercise in applying the appropriate regression equation to the ungaged catchment. The required catchment characteristics are used in the appropriate prediction equations to estimate the peak flow.

Regional frequency analysis using option 2 is very similar to option 1 except that the dependent variables in the regression analysis are the parameters, or some function of the parameters, of the pdf selected to represent the flood peak flows. If a lognormal distribution is used, there will be 2 dependent variables, the mean and standard deviation of the logarithms of the flows. If the log Pearson type III is used, there will be 3 dependent variables.
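The regression approach described above can be sketched in a few lines. This is a minimal illustration only: it assumes a single hypothetical predictor (catchment area), a product-of-powers model that becomes linear after taking logarithms, and optional weights taken proportional to record length. The weights are a simplification chosen for illustration and are not Tasker's (1980) weighting function. The function name `fit_power_law` and all numbers are assumptions, not values from any study.

```python
import math

def fit_power_law(areas, flows, weights=None):
    """Fit Q = b0 * A**b1 by (weighted) least squares on logarithms.

    With weights=None every station counts equally (OLS in log space).
    """
    if weights is None:
        weights = [1.0] * len(areas)
    x = [math.log(a) for a in areas]
    y = [math.log(q) for q in flows]
    sw = sum(weights)
    # Weighted means of the log-transformed variables.
    xbar = sum(w * xi for w, xi in zip(weights, x)) / sw
    ybar = sum(w * yi for w, yi in zip(weights, y)) / sw
    # Weighted slope and intercept of the straight line in log space.
    sxy = sum(w * (xi - xbar) * (yi - ybar) for w, xi, yi in zip(weights, x, y))
    sxx = sum(w * (xi - xbar) ** 2 for w, xi in zip(weights, x))
    b1 = sxy / sxx
    b0 = math.exp(ybar - b1 * xbar)
    return b0, b1

# Hypothetical gaged catchments: areas and synthetic T-year peak flows.
areas = [12.0, 45.0, 130.0, 310.0, 800.0]
flows = [50.0 * a ** 0.7 for a in areas]
record_lengths = [15, 40, 22, 60, 35]  # hypothetical years of record

b0, b1 = fit_power_law(areas, flows)
b0_w, b1_w = fit_power_law(areas, flows, weights=record_lengths)

# Apply the fitted equation to a hypothetical ungaged catchment.
q_ungaged = b0 * 200.0 ** b1
```

Because the synthetic flows follow the assumed model exactly, the weighted and unweighted fits agree; with real data they would differ, and a study would normally include several catchment characteristics rather than one.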

Fig. 7.6. Regional flood frequency curve. [Figure not reproduced. Horizontal axis: return period (yrs.) with a corresponding probability scale. Symbols mark points from individual stations and the median of 18 stations; only the maximum and minimum station values are plotted in part of the range. Based on data from Durant and Blackwell (1959) for stations in parts of Alberta and Saskatchewan, Canada, 1911-1956 (Region A).]

Again, it is desirable to use the same set of independent variables to predict all of the parameters of the selected pdf. This is because the parameters will most likely be correlated. Using the same set of independent variables helps ensure that one maintains a consistent relationship among the parameters of the pdf. Estimating peak flows for an ungaged catchment consists of using the derived prediction equations to estimate the parameters of the flow frequency pdf. These parameters are then used in the pdf to estimate flow magnitudes with the desired return periods.

Index-flood method

Another widely used statistical procedure in regional flood frequency analysis is the index-flood method. This method, first described by Dalrymple (1960), involves the derivation and use of a dimensionless flood frequency distribution applicable to all basins within a homogeneous region.

Regional-Index Flood Relationship

The next step in the index-flood method is to define the index flood. The ratios of peak flows of various return periods to this index flood are then computed. The ratios are of the form $Q_T/Q_I$ where $Q_T$ is the flood with return period T, and $Q_I$ is the index flood. The index flood is often taken as the mean annual flood or the 2-year flood. A plot is made of $Q_T/Q_I$ versus T containing data for all of the watersheds. A line is drawn through the median of the data in this plot. The resulting line is the regional flood frequency line.

In the past, the index-flood method was widely used to perform regional frequency analysis (Dalrymple 1960; Benson 1962). The basic premise of the method is that a combination of streamflow records maintained at a number of gaging stations will produce a more reliable record than that of a single station and thus will increase the reliability of frequency analysis within a region. The index-flood method consists of two major steps.
The first involves the development of dimensionless ratios by dividing the floods at various frequencies by an index flood, such as the mean annual flood for each gaging station (Stedinger 1983; Lettenmaier and Potter 1985; Lettenmaier et al. 1987). The averages or medians of the ratios are then determined for each return period to estimate a dimensionless regional frequency curve. The second step consists of the development of a relationship between the index flood and physiographic and climatic characteristics of the basin. Flood magnitudes and frequencies at required locations within the region can then be estimated by rescaling the corresponding dimensionless quantile by the index flood.

The index-flood method, once the standard U.S. Geological Survey (USGS) approach, is based on the assumption that the floods at every station in the region arise from the same or similar distributions (Chowdhury et al. 1991). At some stage this procedure fell out of favor, primarily because the coefficient of variation of the flows, which is assumed to be constant in the index-flood method, was found to be inversely related to the watershed area (Stedinger 1983). This implies that the standard deviations of the normalized data do not remain constant for various values of basin area, because the coefficient of variation of the observed data is equal to the standard deviation of the normalized flows. This can be demonstrated as follows. Let $Y_i$ be the normalized flows given by:

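$$Y_i = \frac{x_i}{\bar{x}}$$

where $x_i$ represents the ordered observed flows (with $x_1$ being the largest observation and $x_n$ the smallest) and $\bar{x}$ is the mean observed flow, then the coefficient of variation of the observed data, $CV_x$, is given by:

$$CV_x = \frac{s_x}{\bar{x}} = \frac{1}{\bar{x}} \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

Substituting the standardized value of $x_i$, that is $x_i = Y_i \bar{x}$,

$$CV_x = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - 1)^2}{n - 1}}$$
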
the right-hand side of this equation is nothing but the standard deviation of the normalized flows (the normalized flows have a mean of 1).

The index-flood method became popular once again in the late 1970s and early 1980s following the introduction of probability weighted moments (PWM), a generalization of the usual moments of a probability distribution (Greenwood et al. 1979). Greis and Wood (1983) reported that applying the PWM gave improved regional estimates of flood quantiles compared with conventional methods such as the method of moments and maximum likelihood estimation. Parameter estimation by PWM requires the calculation of moments $M_{i,j,k}$ defined as

$$M_{i,j,k} = E\left[X^i F^j (1 - F)^k\right] = \int_0^1 [x(F)]^i F^j (1 - F)^k \, dF$$

where i, j, and k are real numbers and X is a random variable with distribution function F(x), where $F(x) = \text{Prob}(X \le x)$. $M_{1,0,0}$ is identical to the conventional moment about the origin and the probability weighted moments corresponding to $M_{1,0,k}$ or $M_k$ are denoted as

$$M_k = M_{1,0,k} = E\left[X (1 - F)^k\right]$$

All higher-order PWMs are linear combinations of the ranked observations $x_1 \le \dots \le x_n$, which is an indication that PWM estimators are subject to less bias than ordinary moments. Ordinary moment estimators such as the variance ($s^2$) and coefficient of skewness ($C_s$) involve squaring and cubing of observations, respectively, with a potential to give greater weight to outliers, resulting in substantial bias and variance. However, one major weakness of the PWM is that it cannot be used to estimate parameters for those distributions that cannot be expressed in inverse form, such as the LP III.

Regionalization Using L-Moments and the GEV Distribution

Hosking et al. (1985) and Stedinger et al. (1994) discuss regional flood frequency analysis using L-moments and the generalized extreme value distribution. The following is adapted from their work.

The generalized extreme value distribution (GEV) is given by

$$P_x(x) = \exp\left\{-\left[1 - \frac{\kappa (x - \xi)}{\alpha}\right]^{1/\kappa}\right\}$$

where $\xi$, $\alpha$, and $\kappa$ are the location, scale, and shape parameters. Consider K sites with flood records $X_i(k)$ for $i = 1, 2, \dots, n_k$ and $k = 1, 2, \dots, K$. Normalize the $X_i(k)$ by dividing the observations at a site by the mean of the observations at that site.

1. At each site compute the three L-moments $\hat{\lambda}_1(k)$, $\hat{\lambda}_2(k)$, and $\hat{\lambda}_3(k)$ of the normalized observations using the probability weighted moments (PWM) estimators. The L-moments are linear combinations of the ranked observations.

where $x_{(j)}$ is the jth order statistic of the normalized observations with $x_{(1)}$ the smallest and $x_{(n)}$ the largest.

2. To get a normalized frequency distribution, compute the average of the normalized L-moments of order r = 2 and r = 3.

$$\hat{\lambda}_r^R = \frac{\sum_{k=1}^{K} w_k \left[\hat{\lambda}_r(k) / \hat{\lambda}_1(k)\right]}{\sum_{k=1}^{K} w_k} \qquad \text{for } r = 2, 3$$

The $w_k$ are weights. The weights might be based on $n_k$.

3. Use the $\hat{\lambda}_r^R$ to obtain the parameters and quantiles $\hat{x}_p^R$ of the normalized regional GEV by letting

Then, from the expression of $P_x(x)$, compute

$$\hat{x}_p^R = \xi + \frac{\alpha}{\kappa}\left[1 - (-\ln p)^{\kappa}\right]$$

which is the regional flood frequency curve evaluated at probability p, where p = 1 - 1/T.

4. Estimate the 100p-th percentile of the flood distribution at any site k by

$$\hat{x}_p(k) = \hat{\lambda}_1^k \, \hat{x}_p^R$$

where $\hat{\lambda}_1^k$ is the at-site sample mean for site k.

For sites without flow records on which to estimate $\hat{\lambda}_1^k$, a regional regression could be used to develop an equation of the form

$$\hat{\lambda}_1^k = f(X)$$

where X is a set of physical and hydrologic characteristics.

Regionalization Using Modeling

Conceptual hydrologic models are, in a sense, regionalization tools. A hydrologic model is used to estimate flow characteristics at a particular location. The model requires as input certain parameter values that must be estimated. In the absence of flow data at the point of interest, these parameters must be estimated from experience on other similar catchments. Some way of correlating model parameter values with catchment characteristics is required. These relationships are then used to estimate the values for the parameters of the catchment of interest. This represents a regionalization approach in that parameters are estimated by transferal of information from other basins to the basin of interest.
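The regional L-moment procedure above can be sketched as follows. Several pieces are assumptions rather than the text's own equations, which are not legible here: the unbiased PWM estimators with observations ranked smallest first, the L-moment approximations for the GEV parameters published by Hosking et al. (1985), weights equal to record length, and the regional first L-moment set to 1 because each record is divided by its own mean. The function names and flood records are hypothetical.

```python
import math

def sample_lmoments(data):
    """First three sample L-moments from unbiased PWM estimators.

    Assumes ascending ranks: x[0] is the smallest observation.
    """
    x = sorted(data)
    n = len(x)
    b0 = sum(x) / n
    b1 = sum((j - 1) / (n - 1) * x[j - 1] for j in range(1, n + 1)) / n
    b2 = sum((j - 1) * (j - 2) / ((n - 1) * (n - 2)) * x[j - 1]
             for j in range(1, n + 1)) / n
    return b0, 2 * b1 - b0, 6 * b2 - 6 * b1 + b0

def gev_from_lmoments(l1, l2, l3):
    """GEV location, scale, shape from L-moments (Hosking approximation)."""
    c = 2 * l2 / (l3 + 3 * l2) - math.log(2) / math.log(3)
    kappa = 7.8590 * c + 2.9554 * c ** 2
    g = math.gamma(1 + kappa)
    alpha = kappa * l2 / (g * (1 - 2 ** (-kappa)))
    xi = l1 + alpha * (g - 1) / kappa
    return xi, alpha, kappa

def gev_quantile(p, xi, alpha, kappa):
    """Value with non-exceedance probability p (kappa must be nonzero)."""
    return xi + alpha / kappa * (1 - (-math.log(p)) ** kappa)

def regional_quantile(sites, p):
    """Dimensionless regional quantile from record-length-weighted L-moments."""
    num2 = num3 = den = 0.0
    for flows in sites:
        mean = sum(flows) / len(flows)
        l1, l2, l3 = sample_lmoments([q / mean for q in flows])
        w = len(flows)  # assumed weight: record length
        num2 += w * l2 / l1
        num3 += w * l3 / l1
        den += w
    xi, alpha, kappa = gev_from_lmoments(1.0, num2 / den, num3 / den)
    return gev_quantile(p, xi, alpha, kappa)

# Hypothetical annual peak records at two sites in one region.
sites = [[100, 150, 220, 130, 180, 300], [40, 55, 90, 60, 75, 48, 120]]
x50 = regional_quantile(sites, 0.50)
x99 = regional_quantile(sites, 0.99)   # 100-year event, p = 1 - 1/T

# Rescale by an at-site mean to get the 100-year estimate for site 0.
site0_mean = sum(sites[0]) / len(sites[0])
q100_site0 = site0_mean * x99
```

The sketch does not handle the limiting case of a shape parameter equal to zero, and real records would be far longer than these illustrative ones.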

FREQUENCY ANALYSIS OF PRECIPITATION DATA

The amount of rainfall (depth) that can be expected to occur in a given period of time (duration) on the average once every so many years (frequency) is an important design variable for many hydraulic structures. Depth-duration-frequency relationships have been developed for the United States (Hershfield 1961) for durations of 30 minutes to 24 hours and return periods of 1 to 100 years and published as U.S. Weather Bureau TP 40.

Table 7.7. Empirical factors for converting partial duration series to annual series (Hershfield 1961)

Return period        Conversion factor

The procedure used in developing TP 40 (Hershfield 1961) was to prepare four key base maps showing the 2-year, 1-hour; 2-year, 24-hour; 100-year, 1-hour; and 100-year, 24-hour rainfalls for the United States. Annual series data were used consisting of the maximum 60-minute
and 24-hour rainfall depths converted to a partial duration series by using the factors shown in table 7.7. For example, if the 5-year partial duration series value estimated from the maps is 2.00 inches, the corresponding annual series depth would be 0.96(2.00) or 1.92 inches. For return periods greater than 10 years, the conversion factor is essentially unity.

The 2-year rainfall amounts were determined by plotting on log-log paper the return period versus the rainfall depth using the California plotting position formula (table 7.1), drawing a smooth curve through the points, and reading the 2-year value. The 100-year rainfall amounts were determined by using the type I extreme value distribution for selected stations with long rainfall records. The ratio of the 100-year to the 2-year rainfall amount was then determined for these stations and a map prepared showing the value of this ratio. The 100-year rainfall amounts for the stations with short records were estimated by the 100-year to 2-year ratio.

The rainfall depths for other return periods were determined by plotting the 2-year and 100-year depths on special paper, connecting the points by a straight line, and reading off the desired rainfall depths. The spacing of the return periods along the abscissa of this special paper was empirical from 1 to 10 years, based on free-hand plotting of partial duration series data, and theoretical from 20 to 100 years, according to the type I extreme value distribution. The transition between 10 and 20 years was smoothed by hand from the type I values.

The rainfall depths for durations other than 1 hour or 24 hours were obtained by plotting the 1-hour and 24-hour values on a second special paper and connecting the points with a straight line. This diagram was obtained empirically from an analysis of records from 200 first-order U.S. Weather Bureau stations. The depth of rainfall for the 30-minute duration is obtained by multiplying the 1-hour value by 0.79.
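The simple multiplicative steps in this procedure can be written out as a short sketch. Only the 5-year factor of 0.96 and the 30-minute factor of 0.79 come from the text; the function names, the 100-year to 2-year ratio of 2.2, and the other depths are hypothetical values chosen for illustration.

```python
def annual_from_partial(partial_depth, factor):
    """Convert a partial duration series depth to an annual series depth."""
    return factor * partial_depth

def hundred_year_depth(two_year_depth, ratio_100_to_2):
    """Estimate a short-record station's 100-year depth from a mapped ratio."""
    return ratio_100_to_2 * two_year_depth

def thirty_minute_depth(one_hour_depth, factor=0.79):
    """30-minute depth as a fixed fraction of the 1-hour depth."""
    return factor * one_hour_depth

# Worked example from the text: 5-year partial duration value of 2.00 inches.
annual_5yr = annual_from_partial(2.00, 0.96)

# Hypothetical station: 2-year depth of 1.5 inches, mapped ratio of 2.2.
depth_100yr = hundred_year_depth(1.5, 2.2)

# Hypothetical 1-hour depth of 2.0 inches.
depth_30min = thirty_minute_depth(2.0)
```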
From these analyses, curves called depth (or intensity)-duration-frequency curves can be prepared. Data from the maps in TP 40 can be used to determine depth-duration-frequency (DDF) relationships for locations where actual data do not exist. Often, in developing DDF curves, the interpolation from the maps of TP 40 may result in rather rough plots. The curves can be smoothed by using an empirical smoothing equation. One such equation is

$$D = \frac{K T F^x}{(T + b)^n}$$

where D is the depth, T is the duration, and F is the frequency of the rainfall. The coefficients K, x, b, and n may be estimated using nonlinear regression techniques. Figure 7.7 shows the results of such an analysis for Stillwater, Oklahoma, based on TP 40 data.
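As a hedged illustration of the smoothing equation, the sketch below evaluates D for given coefficients and defines the sum of squared errors that a nonlinear regression routine would minimize. The coefficient values and the observation triples are hypothetical and are not the Stillwater results; the function names are likewise assumptions.

```python
def ddf_depth(duration, frequency, K, x, b, n):
    """Smoothed rainfall depth D = K*T*F**x / (T + b)**n."""
    return K * duration * frequency ** x / (duration + b) ** n

def sum_squared_error(params, observations):
    """Objective for nonlinear regression.

    observations is a list of (duration, frequency, depth) triples.
    """
    K, x, b, n = params
    return sum((d - ddf_depth(t, f, K, x, b, n)) ** 2
               for t, f, d in observations)

# Hypothetical coefficients and map-derived points (hours, years, inches).
true_params = (1.5, 0.2, 0.3, 0.8)
observations = [(t, f, ddf_depth(t, f, *true_params))
                for t in (0.5, 1.0, 6.0, 24.0) for f in (2, 10, 100)]

sse_at_truth = sum_squared_error(true_params, observations)
sse_perturbed = sum_squared_error((1.6, 0.2, 0.3, 0.8), observations)
```

In practice the four coefficients would be found by passing an objective like `sum_squared_error` to a nonlinear optimizer; that step is omitted here to keep the sketch dependency-free.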

Fig. 7.7. Rainfall depth-duration-frequency relationship for Stillwater, Oklahoma. [Figure not reproduced. Horizontal axis: duration (hrs), logarithmic scale from 0.1 to 100.]

Rainfall data for longer durations, such as weeks or months, can be analyzed by using the gamma distribution. Barger and Thom (1949) have shown the gamma distribution to be applicable to rainfall data. Barger, Shaw, and Dale (1959), Friedman and Janes (1957), Strommen and Horsfield (1969), and Mooley and Crutcher (1968) are among those who have used the gamma distribution for rainfall. By using equation 7.23, it can be seen that the probability of a rainfall R exceeding x is given by

$$\text{prob}(R > x) = k\left[1 - P^*(x)\right]$$

and the probability of R being less than x is given by

$$\text{prob}(R < x) = 1 - k + k P^*(x)$$

where k is the probability of rain or the proportion of time intervals with rainfall and $P^*(x)$ is the cumulative probability distribution of rain given that $R \ne 0$. Often the gamma distribution is used for rainfall data. The parameters of the gamma distribution generally are determined by using equations 6.18 and 6.19. Bridges and Haan (1972) have presented a technique for determining the reliability of rainfall estimates from the gamma distribution based on simulation studies.

FREQUENCY ANALYSIS OF OTHER HYDROLOGIC VARIABLES

The principles set forth on flood frequencies and rainfall frequencies also apply to frequencies of other hydrologic variables. Basically, the quantity to be analyzed must be defined, the data tabulated, and then a frequency analysis made. For instance, in the case of flow volume-frequency
studies, the duration(s) of interest must be specified and then the maximum or minimum flow volumes for each year having the specified duration are tabulated. The maximum flow volumes would be used in the case of flood-flow volumes and the minimum volumes would be used in the case of low-flow studies.

Frequency analysis can be applied to water quality parameters such as dissolved oxygen, biological oxygen demand, sediment loads, and many other quantities. Care must be taken to see that the data used meet the necessary requirements of homogeneity, independence, and representativeness. For example, if sediment concentration frequencies are being studied and part of the data are collected during low flows and part during high flows, the data may not be homogeneous because of the relationship between sediment concentration and flow rate.

Exercises

7.1. Assume that daily rainfall on rainy days follows an exponential distribution. The average daily rainfall on rainy days is 0.3 inches. If 30% of all days are rainy, what is the probability that on some future day, the amount of rainfall received will exceed 1.00 inch? Assume daily rainfalls are independent.

7.2. Derive a table of frequency factors for the exponential distribution corresponding to T = 2, 5, 10, 20, 50, and 100 years.

7.3. Select several streams in a single locality and prepare a plot of the ratio of the T-year flood to the mean annual flood (as in figure 7.6).

7.4. An analysis of 50 years of data showed that the probability of a flood peak exceeding 90,000 cfs on a certain river was 0.02. During a 10-year period 2 such peaks occurred. If the original estimate of the probability of this exceedance was correct, what is the probability of getting 2 such exceedances in 10 years?

7.5. Forty years of peak streamflow data are available. All but one of the data points indicate that a lognormal distribution with $\bar{x}$ = 125,000 cfs and $s_x$ = 50,000 describes the data very nicely. The one outlier is equal to 285,000 cfs. What is the probability that an event of 285,000 cfs or greater could occur in the 40-year period if the flood peaks truly follow the lognormal distribution with $\bar{x}$ and $s_x$ as given?

7.6. Select a set of data consisting of 20 or more independent observations. Plot these data on normal probability paper using several of the plotting position relationships contained in table 7.1.

7.7. Compute the 100-year peak flow for the annual series data of example 7.2 assuming the data follow the gamma distribution.

7.8. Prepare a plot on log-log paper of low flow frequency-volume-duration for Cave Creek near Fort Spring, Kentucky. Plot volume in inches as the ordinate, duration in months (use 1, 2, 3, 6, and 12 months) as the abscissa, and use frequency as the curve parameter (use 2, 5, 10, and 25 years).

7.9. Work exercise 7.8 for maximum flow frequency-volume-duration on Cave Creek.

7.10. Plot the annual runoff data for Walnut Gulch near Tombstone, Arizona, on normal and lognormal probability paper. Does either of these distributions appear to "fit" the data?

7.11. Plot on normal probability paper the annual runoff data for (a) Piscataquis River near Dover-Foxcroft, Maine, (b) North Llano River near Junction, Texas, and (c) Spray River, Banff, Canada. Is there any apparent relationship between the curvature (or lack of it) and the skewness?

7.12. Work exercise 7.11, only plot the data on lognormal probability paper.

7.13. For the Piscataquis River near Dover-Foxcroft, Maine, estimate the 100-year annual flow assuming the data follow the (a) normal distribution, (b) lognormal distribution, (c) Pearson type III distribution, (d) log Pearson type III distribution, (e) extreme value distribution.

7.14. Work exercise 7.13 for the 100-year annual flow on the North Llano River near Junction, Texas.

7.15. Work exercise 7.13 for the 100-year annual flow on the Spray River, Banff, Canada.

7.16. In reference to exercises 7.13, 7.14, and 7.15, which distribution would you expect to give the "best" estimate for the 100-year flow on each of the three rivers? Discuss in terms of the means, variances, coefficients of variation, and skewness.

7.17. Plot the annual peak discharge of Walnut Gulch near Tombstone, Arizona, on lognormal probability paper. Draw in what you consider the best fitting straight line. Estimate the mean and variance of the data from this plot.

7.18. Plot the suspended sediment load data for the Green River at Munfordville, Kentucky, on normal and lognormal probability paper. Draw in the best fitting straight line.

7.19. Use the lognormal distribution to estimate the 25-year runoff volume for July on Walnut Gulch near Tombstone, Arizona. Plot the data on lognormal probability paper and draw in the theoretical best fitting straight line.

8. Confidence Intervals and Hypothesis Testing

In chapter 3, parameter estimation was discussed in general terms. In chapters 4, 5, and 6 specific methods for estimating the parameters of certain probability distributions were discussed. Again, it should be recalled that parameter estimates are called statistics, are functions of the sample (random) values, and are themselves random variables. Parameter estimates have associated with them probability distributions. Thus far we have discussed methods of getting point estimates for parameters and certain properties of these point estimates. The possible errors in these point estimates due to inherent variability in random samples of data have not been discussed. This chapter considers the reliability of parameter estimates and the testing of hypotheses regarding population parameters.

Hypothesis testing and confidence interval estimation may be classed as parametric or nonparametric depending on whether or not assumptions are made regarding the probability distribution of the observations and/or the parameters under consideration. Parametric and nonparametric tests have certain assumptions in common. They both rely on independence in the observations and randomness of the sample. They both require samples of data to be representative of the situation under analysis. Parametric statistics deal with actual values of observations while nonparametric methods often rely on the ranking or relative position of data values.

The use of parametric statistics is frequently criticized because of deviations from the distributions assumed by a particular test. One of the consequences of deviating from the assumed distribution is that the level of significance of the test is no longer exact. This may be a serious problem, but in most cases is not. Generally, the selection of the level of significance is somewhat arbitrary. Early statisticians used 5 and 10%, so everybody uses 5 and 10%! If one
doesn't know how to select a level of significance, it makes little sense to be overly concerned if the level of significance is unknown due to deviations from distributional assumptions. What is purported to be an exact test becomes an approximate test, but that is often the nature of hydrologic analysis. Uncertainty abounds! An approximate test provides information to the decision maker just as a so-called "exact" test does and is certainly better than no test at all. Several papers are available indicating that nonparametric procedures are nearly as good as parametric procedures for some tests when distributional assumptions are met and are superior when distributional assumptions are not met (Helsel and Hirsch 1992).

In any application of hypothesis testing or confidence interval estimation, it must be kept in mind that assumptions must be made concerning the data and the process under study. It is unlikely that in an actual application the assumptions will be exactly met. Again, if the assumptions are not fully met, then the tests or confidence intervals become approximate.

If we reject the hypothesis that two streams have different BOD loadings, we do not necessarily believe their BOD loadings are exactly the same. It would be rare indeed to have two natural streams that have identical BOD loadings or any other quantifiable characteristic. We know before we run the test, indeed before we collect any data, that the BOD loadings are not precisely the same on two streams. What we are really concerned with is whether the BOD loadings are "significantly" different. In statistical jargon, we are assessing whether the difference we detect in BOD is of such a magnitude that it cannot be attributed to chance if the BOD loadings in the two streams are in fact the same and meet the conditions of the test.

For example, consider a situation where the BOD level on two streams is sampled.
Assume that on each of the streams the true distribution of BOD is N(4, 1) and the BOD in the two streams is uncorrelated. These are strong assumptions that we can never verify completely. If we could, then statistical testing would be superfluous. It is hypothesized that the BOD levels are the same in the two streams. The investigator decides to sample each of the streams and declare the BOD levels different if the samples from the two streams differ by more than 1 mg/l. What is the probability an error will be made? The error that might be made is to declare the BOD in the two streams different when, in fact, they are, unknown to the investigator, the same. Since the BOD level is actually N(4, 1), the difference in two independent samples is N(0, 2). The probability of selecting a random number from an N(0, 2) that is larger in absolute value than one is the probability of making an error with the test. Since the test statistic, the observed difference, has an N(0, 2) pdf, the standardized Z value corresponding to a difference in excess of the absolute value of one is $(1 - 0)/\sqrt{2} = 0.707$. The probability of Z exceeding 0.70 in absolute value for a standard normal distribution is 0.48. There is a 48% chance of rejecting the hypothesis even though it is true.

If the investigator thinks this probability of an error is too great, the appropriate value for the test statistic consistent with the acceptable error probability can be determined. For example, if the investigator wants to be 90% confident of not concluding the streams are different when in fact they are not, the cutoff value for Z is such that prob(Z > z) = 0.05, which corresponds to Z = 1.645. Then the actual difference is computed from $(d - 0)/\sqrt{2} = 1.645$ or d = 2.33. Therefore, the streams would be considered not significantly different unless the absolute value of the difference in the samples from the streams exceeded 2.33 mg/l.
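The two calculations in this example can be checked with a short sketch using the standard normal distribution from Python's standard library. It assumes, as the text does, one independent N(4, 1) sample from each stream, so the difference has variance 2; the variable names are illustrative.

```python
import math
from statistics import NormalDist

z = NormalDist()              # standard normal distribution
sd_diff = math.sqrt(1 + 1)    # std. deviation of the difference of two N(4, 1) samples

# Probability of declaring a difference (|d| > 1 mg/l) when none exists.
z_value = (1 - 0) / sd_diff
alpha = 2 * (1 - z.cdf(z_value))

# Cutoff on |d| that leaves 5% in each tail (90% confidence).
cutoff = z.inv_cdf(0.95) * sd_diff
```

Here `alpha` comes out near 0.48 and `cutoff` near 2.33 mg/l, matching the values quoted in the example.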

If the BOD distribution on one stream was N(3, 1) and on the other N(4, 1), the distribution of BOD would have been truly different on the two streams. The distribution of the difference in BOD would be N(1, 2). The probability of getting a difference in excess of ±1 would be the probability of a value in excess of ±1 from an N(1, 2). Again, using the standard normal distribution, this probability can be found to be 0.74. In this case, the BOD distributions are different yet there is a 26% chance of erroneously concluding they are not different.

What becomes apparent is that there is always a chance of making an error in statistical tests of hypotheses. The first part of the example demonstrates how one could wrongly conclude a difference when none existed and the second part shows how one could fail to detect a difference when one does exist. These two errors are rejecting a true hypothesis, known as a Type I error, or accepting a false hypothesis, known as a Type II error. The probabilities of a Type I and a Type II error are usually denoted by $\alpha$ and $\beta$, respectively. In this example when the true situation was no difference, $\alpha$ was 0.48. In the situation where there was a difference, $\beta$ was 0.26.

CONFIDENCE INTERVALS

A parameter $\theta$ is estimated by $\hat{\theta}$. The statistic $\hat{\theta}$ is a random variable having a probability distribution. If $\hat{\theta}$ can take on any value in some continuous range, then prob($\theta = \hat{\theta}$) is zero. Rather than a point estimate for $\theta$, it may be more desirable to get an interval estimate such that the probability that this interval contains $\theta$ can be specified. Such an interval is known as a confidence interval. This statement may be written

$$\text{prob}(L < \theta < U) = 1 - \alpha \qquad (8.1)$$

where L and U are the lower and upper confidence limits, so that the interval from L to U is the confidence interval and $1 - \alpha$ is the confidence level, or confidence coefficient. Note that in equation 8.1, $\theta$ is not a random variable. One does not say that the probability that $\theta$ is between L and U is $1 - \alpha$ but that the probability is $1 - \alpha$ that the interval L to U contains $\theta$. The difference in these two interpretations is subtle but is based on the fact that $\theta$ is a constant while L and U are random variables.

Mood et al. (1974) discuss a general method for determining confidence intervals. Ostle (1963) presents expressions for the confidence intervals for many different statistics. In the discussion to follow, a procedure known as the method of pivotal quantities for determining confidence limits will be illustrated. This method consists of finding a random variable V that is a function of the parameter $\theta$ but whose distribution does not involve any other unknown parameters. Then $v_1$ and $v_2$ are determined such that

$$\text{prob}(v_1 < V < v_2) = 1 - \alpha \qquad (8.2)$$

This inequality is then manipulated so that it is in the form of equation 8.1 where U and L are random variables depending on V but not $\theta$.


Mean of a Normal Distribution

As an example of using equation 8.2, the confidence intervals on the mean of a normal distribution will be determined. We have shown that the quantity

$$\frac{\bar{X} - \mu}{s_{\bar{X}}}$$

has a t distribution with $n - 1$ degrees of freedom, where $n$ is the number of observations used to estimate $\bar{X}$. Using equation 8.2 we have

$$\mathrm{prob}\left(v_1 < \frac{\bar{X} - \mu}{s_{\bar{X}}} < v_2\right) = 1 - \alpha \tag{8.3}$$

If it is desired that the confidence interval be symmetrical in probability, $v_1$ and $v_2$ can be chosen so that the probability that a random t is less than $v_1$ equals the probability that a random t exceeds $v_2$. Since the $100(1 - \alpha)$ percent confidence interval is being sought, both of these probabilities must be $\alpha/2$. The probability that the confidence intervals do not contain $\theta$ has been divided equally between the upper and lower bounds. In the following, the notation $t_{\alpha,n}$ corresponds to the value of t such that the probability of a random t with $n$ degrees of freedom being less than $t_{\alpha,n}$ is $\alpha$ (see figure 8.1). Equation 8.3 is equivalent to

$$\mathrm{prob}\left(t_{\alpha/2,\,n-1} < \frac{\bar{X} - \mu}{s_{\bar{X}}} < t_{1-\alpha/2,\,n-1}\right) = 1 - \alpha$$

Since the t distribution is symmetrical, $t_{\alpha/2,\,n-1} = -t_{1-\alpha/2,\,n-1}$. Therefore

$$\mathrm{prob}\left(\bar{X} - t_{1-\alpha/2,\,n-1}\, s_{\bar{X}} < \mu < \bar{X} + t_{1-\alpha/2,\,n-1}\, s_{\bar{X}}\right) = 1 - \alpha$$

Fig. 8.1. Illustration of confidence intervals using the t distribution.

This latter equation is in the form of equation 8.1, so the confidence limits are

$$l = \bar{x} - t_{1-\alpha/2,\,n-1}\, s_{\bar{x}}, \qquad u = \bar{x} + t_{1-\alpha/2,\,n-1}\, s_{\bar{x}} \tag{8.4}$$

Because $\bar{X}$ and $s_{\bar{X}}$ are both random variables, $L$ and $U$ are random variables as well, with estimates $l$ and $u$ given by equation 8.4. Note that the assumption that the observations are normally distributed was made.

Example 8.1. The sample mean and standard deviation of the Kentucky River data contained in table 2.1 have been calculated as $\bar{x} = 66{,}540$ and $s_x = 22{,}322$. What are the 95% confidence limits on the mean assuming the sample is from a normal population?

Solution: From the t table in the appendix

$$t_{0.975,\,98} = 1.99$$

From equation 8.4, with $s_{\bar{x}} = s_x/\sqrt{n}$ and $n = 99$,

$$l = 66{,}540 - 1.99\,(22{,}322/\sqrt{99}) = 62{,}076$$

$$u = 66{,}540 + 1.99\,(22{,}322/\sqrt{99}) = 71{,}004$$

Thus, we can say that we are 95% confident that the interval 62,076 to 71,004 contains the true population mean.

Comment: If a 90% confidence interval is calculated, it is found to be 62,817 to 70,263. Thus, the 90% confidence interval is shorter than the 95% confidence interval, but our degree of confidence that the interval contains $\mu$ has decreased from 95% to 90%. If a second independent sample of peak flows on the Kentucky River near Salvisa were available, this sample would have a different mean and variance. In this case, the 95% confidence intervals would be different as well. If many samples were available and the 95% confidence limits were calculated for each, 95% of the confidence intervals would contain the true population mean while 5% would not, if the data were actually from a normal distribution.

The $100(1 - \alpha)\%$ confidence interval on the mean can be made as small as desired by increasing the sample size. This is because $s_{\bar{x}}$ decreases as the sample size is increased. An increase in the reliability of the sample mean comes at the expense of an increase in the sample size. Unfortunately, in many hydrologic problems the sample size is fixed. For a normal distribution, equation 8.4 provides a means for determining the sample size required in order to estimate $\mu$ within a given reliability.
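The arithmetic of example 8.1 can be checked with a few lines of Python. This is only a sketch: the function name `mean_confidence_limits` is a hypothetical helper, and the critical value $t_{0.975,98} = 1.99$ is passed in as an assumed table value because the Python standard library has no t distribution.

```python
import math


def mean_confidence_limits(xbar, s, n, t_crit):
    """Two-sided confidence limits on the mean of a normal population (equation 8.4).

    t_crit is the tabulated value t_{1-alpha/2, n-1}; it is supplied by the
    caller (an assumption of this sketch) rather than computed here.
    """
    s_xbar = s / math.sqrt(n)        # standard error of the mean
    half_width = t_crit * s_xbar     # distance from xbar to either limit
    return xbar - half_width, xbar + half_width


# Kentucky River values quoted in example 8.1; 1.99 is the assumed table value.
lower, upper = mean_confidence_limits(66540.0, 22322.0, 99, 1.99)
print(round(lower), round(upper))
```

Repeating the call with a smaller critical value (a lower confidence level) or a larger `n` shows directly how the interval shortens.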


If the population variance of the normal distribution is known, then the pivotal quantity in equation 8.3 becomes $(\bar{X} - \mu)/\sigma_{\bar{X}}$, which has a standard normal distribution. The confidence limits then become

$$l = \bar{x} - z_{1-\alpha/2}\, \sigma_{\bar{x}}, \qquad u = \bar{x} + z_{1-\alpha/2}\, \sigma_{\bar{x}} \tag{8.5}$$

where $z_{1-\alpha/2}$ is the value of $Z$ from the standard normal distribution such that the area to the right of $Z$ is $\alpha/2$.

Equations 8.4 and 8.5 are based on the assumption that the underlying population of the random variable $X$ has a normal distribution. Only through the Central Limit Theorem can these relations be applied to non-normal distributions. Confidence limits calculated by these relationships for the means of random samples from non-normal populations are only approximate, with the approximation improving as the sample size increases. If these approximations are not satisfactory, other methods are available (Ostle 1963; Mood et al. 1974).

Variance of a Normal Distribution

The quantity $(n - 1)s_x^2/\sigma_x^2$ has a chi-square distribution with $n - 1$ degrees of freedom. Letting this quantity equal $V$ in equation 8.2 results in

$$\mathrm{prob}\left(v_1 < \frac{(n - 1)s_x^2}{\sigma_x^2} < v_2\right) = 1 - \alpha$$

Choose $v_1$ equal to $\chi^2_{\alpha/2,\,n-1}$ and $v_2$ as $\chi^2_{1-\alpha/2,\,n-1}$. Then

$$\mathrm{prob}\left(\frac{(n - 1)s_x^2}{\chi^2_{1-\alpha/2,\,n-1}} < \sigma_x^2 < \frac{(n - 1)s_x^2}{\chi^2_{\alpha/2,\,n-1}}\right) = 1 - \alpha$$

which is in the form of equation 8.1. Thus, the confidence limits on $\sigma_x^2$ are

$$l = \frac{(n - 1)s_x^2}{\chi^2_{1-\alpha/2,\,n-1}}, \qquad u = \frac{(n - 1)s_x^2}{\chi^2_{\alpha/2,\,n-1}} \tag{8.6}$$

Again, equations 8.6 are strictly valid only if $X$ is from a normal distribution and approximate for $X$ from a non-normal distribution, with the approximation improving as the sample size increases.
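As a rough numerical illustration of equations 8.6, the sketch below computes 90% limits for the Kentucky River sample of example 8.1 ($s_x = 22{,}322$, $n = 99$). Because the Python standard library has no chi-square distribution, the quantiles are approximated with the Wilson-Hilferty cube-root formula; that substitution for the chi-square table, and the helper names, are assumptions of this sketch and not part of the text.

```python
import math
from statistics import NormalDist


def chi2_quantile_wh(p, df):
    """Approximate chi-square quantile (Wilson-Hilferty cube-root approximation)."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * math.sqrt(c)) ** 3


def variance_confidence_limits(s2, n, alpha):
    """Two-sided limits on the variance of a normal population (equations 8.6)."""
    df = n - 1
    lower = df * s2 / chi2_quantile_wh(1.0 - alpha / 2.0, df)
    upper = df * s2 / chi2_quantile_wh(alpha / 2.0, df)
    return lower, upper


var_lo, var_hi = variance_confidence_limits(22322.0 ** 2, 99, 0.10)
sd_lo, sd_hi = math.sqrt(var_lo), math.sqrt(var_hi)
print(round(sd_lo), round(sd_hi))
```

The limits on the standard deviation come out near 20,000 and 25,300 cfs. Small differences from hand calculations are to be expected because the quantiles here are approximate rather than read from a table.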

Fig. 8.2. Confidence limits on a chi-square distribution.

The chi-square distribution is not symmetrical, so that $s_x^2 - l$ is not equal to $u - s_x^2$. As the sample size and, thus, the degrees of freedom of the chi-square distribution increase, the distribution approaches a symmetrical distribution so that the upper and lower confidence limits are nearly the same distance from $s_x^2$. This is illustrated in figure 8.2.

Example 8.2. Determine the 90% confidence limits on the variance for the situation described in example 8.1.

Solution: From equations 8.6 with $n - 1 = 98$ and $s_x = 22{,}322$,

$$l = \frac{98\,(22{,}322)^2}{\chi^2_{0.95,\,98}}, \qquad u = \frac{98\,(22{,}322)^2}{\chi^2_{0.05,\,98}}$$

The 90% confidence intervals on the standard deviation are found (by taking the square roots of the above limits) to be 20,001 to 25,331 cfs.

Comment: In the preceding two examples the confidence limits on the mean and variance of a normal distribution were calculated. If the joint confidence limits on $\mu$ and $\sigma^2$ are desired, they cannot be computed separately as was done in these examples. Mood et al. (1974) discuss the estimation of joint confidence intervals.

One-Sided Confidence Intervals

Situations may arise where one is only interested in an interval estimate on one side of a parameter. For instance, it may be desired to find only a lower confidence limit. In this situation equation 8.1 becomes

$$\mathrm{prob}(L < \theta) = 1 - \alpha$$

The same procedure for finding $L$ would be followed as was used in the two-sided case, except now all of the probability $\alpha$ will be in one tail. For instance, the one-sided lower limit on the mean of a normal distribution with an unknown variance would be

$$l = \bar{x} - t_{1-\alpha,\,n-1}\, s_{\bar{x}}$$

The analogous results would hold for any one-sided, lower or upper confidence limit.

Parameters of Probability Distributions

For a wide class of distributions for large samples, the maximum likelihood estimators for the parameters of the distribution are asymptotically normally distributed with the true parameter value as the mean and a variance of

$$\left\{-n\,E\left[\frac{\partial^2}{\partial\theta^2}\ln p_X(x;\theta)\right]\right\}^{-1}$$

Using this information, it is possible to construct confidence intervals and joint confidence intervals for the parameters of these distributions. The book by Mood et al. (1974) should be consulted for the procedures to be used.

HYPOTHESIS TESTING

Often the acceptability of statistical models can be judged without actually making any statistical tests. This would be the case when observed data are predicted very closely by the model or when observed data deviate very greatly from the model. On the other hand, a common occurrence is for the observed data to deviate some from the model but not enough for one to state that the model is obviously inadequate. In this latter situation one must determine whether the deviations represent true inadequacies in the model, or whether the deviations are chance variations from the true model. The general procedure to be followed in making statistical tests is

1. Formulate the hypothesis to be tested.
2. Formulate an alternative hypothesis.
3. Determine a test statistic.
4. Determine the distribution of the test statistic.
5. Define the rejection region or critical region of the test statistic.
6. Collect the data needed to calculate the test statistic.
7. Determine if the calculated value of the test statistic falls in the rejection region of the distribution of the test statistic.

Table 8.1. Errors in hypothesis testing

                         True situation
Decision                 Hypothesis true     Hypothesis false
Accept hypothesis        No error            Type II error
Reject hypothesis        Type I error        No error

For many statistical tests, steps 2 through 4 have been completed and may be found in a wide variety of statistics books. For many of the tests that a hydrologist might like to make, adequate test statistics and their distributions have not been determined, largely because of restrictive assumptions. Nonparametric tests relieve this problem to some extent.

It is not possible to develop tests that are absolutely conclusive. All of the tests have a possibility of two kinds of error: rejecting a true hypothesis (Type I error) or accepting a false hypothesis (Type II error). Table 8.1 depicts the two types of errors. The probability of a Type I error is denoted by $\alpha$ and the probability of a Type II error by $\beta$. The significance level is defined as $100\alpha$ (in percent). In testing hypotheses, the probability of a Type I error can be specified; however, the probability of a Type II error is not known unless the true parameter values being tested are known. In general, as the value of $\alpha$ decreases, the magnitude of $\beta$ increases.

As an example, assume we select an observation $x_0$ at random from a normal distribution with variance $\sigma_0^2$ and hypothesize that the distribution has a mean $\mu_0$. The test statistic could be $x_0$ itself, which has a normal distribution with unknown mean and variance $\sigma_0^2$. If the hypothesis is true (something that is not known or the test would not be made), the distribution of the test statistic would be a normal distribution with mean $\mu_0$ and variance $\sigma_0^2$ and would appear as in figure 8.3. If it is decided to accept the hypothesis if $x_0$ is within 2 standard deviations of $\mu_0$ and reject the hypothesis otherwise, the critical region or rejection region would be the shaded area in figure 8.3. From the properties of the normal distribution, it is known that 95.44% of the area of the normal curve is within 2 standard deviations of the mean, so the critical region occupies 4.56% of the area.

It is also apparent that there is a 4.56% chance that $x_0$ will be in the critical region and the hypothesis rejected even though it is true. Thus, by definition $\alpha = 0.0456$, or there is a 4.56% chance of making a Type I error due to random variation in the $x_0$ selected. It is more common to specify $\alpha$ and from this information determine the critical region. For example, if one wanted $\alpha$ to be 0.10, then the critical region would be $|(x_0 - \mu_0)/\sigma_0| > 1.645$, where 1.645 is the value of the standard normal distribution such that the area outside the limits $-1.645$ to $1.645$ is 0.10.
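The two directions described above (from a chosen critical region to $\alpha$, and from a chosen $\alpha$ to the critical region) can be sketched with the standard normal distribution in Python's `statistics` module. The function names are hypothetical and serve only this illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal, mean 0 and standard deviation 1


def type_i_error(k):
    """alpha when the hypothesis is rejected for |x0 - mu0| > k standard deviations."""
    return 2.0 * (1.0 - std_normal.cdf(k))


def critical_z(alpha):
    """Two-sided critical value z_{1-alpha/2} for a chosen alpha."""
    return std_normal.inv_cdf(1.0 - alpha / 2.0)


alpha_2sd = type_i_error(2.0)   # rejection outside mu0 +/- 2 sigma0
z_10 = critical_z(0.10)         # critical value for alpha = 0.10
print(round(alpha_2sd, 4), round(z_10, 3))
```

The first value is close to the 4.56% quoted in the text (tables rounded to four places give slightly different last digits), and the second reproduces the 1.645 used for $\alpha = 0.10$.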

Fig. 8.3. Critical region (the area outside $\mu_0 - 2\sigma_0$ to $\mu_0 + 2\sigma_0$).

Fig. 8.4. Illustration of $\alpha$ and $\beta$.

In order to evaluate $\beta$, the true parameter values must be known. Again, consider selecting a single value $x_0$ from a normal population with variance $\sigma_0^2$ and an unknown mean. Let the hypothesis be that $\mu = \mu_0$ and the alternative be $\mu \ne \mu_0$. If $\mu$ actually equals $\mu_1$, then the situation depicted in figure 8.4 would exist and there is a $100\beta\%$ chance that $x_0$ will fall in the acceptance region of $N(\mu_0, \sigma_0^2)$ and thus a Type II error be committed. From figure 8.4 it can be seen that as $\alpha$ is increased, $\beta$ will decrease. It can also be seen that the nearer $\mu_1$ is to $\mu_0$, the greater will be $\beta$. This is because it is increasingly difficult to tell the difference between the two distributions. It is not possible to determine the magnitude of $\beta$ because it is a function of the unknown population mean $\mu_1$. Example 8.3 shows how $\beta$ can be evaluated if $\mu_1$ is known. Of course, $\mu_1$ would not be known or else one would not hypothesize $\mu = \mu_0$.

Example 8.3. Assume a single observation is selected from a normal distribution with mean $\mu_1 = 7$ and variance $\sigma_0^2 = 9$. It is hypothesized that $\mu = \mu_0 = 5$. If the test is conducted at the 10% significance level, what is $\beta$?

Solution: Reference should be made to figure 8.5.

Fig. 8.5. Illustration for example 8.3.

$\alpha = 0.10$

$\alpha/2 = 0.05$, which corresponds to $z_{1-\alpha/2} = 1.645$

$(x_u - \mu_0)/\sigma_0 = z$, where $x_u$ is the boundary of the upper critical region

$(x_u - 5)/3 = 1.645$

$x_u = 9.935$

$A_u$ = the area of a normal distribution with mean of 7 and variance of 9 to the left of 9.935. The standardized variate corresponding to $x_u = 9.935$ is

$$z_u = \frac{9.935 - 7}{3} = 0.978$$

The area to the left of $z_u = 0.978$ from a standard normal distribution is 0.8365. Similarly, if $x_l$ is the boundary of the lower critical region, we have $(x_l - 5)/3 = -1.645$, or $x_l = 0.065$. $A_l$ is the area of a normal distribution with mean 7 and variance 9 to the left of 0.065: $z_l = (0.065 - 7)/3$, or $z_l = -2.31$, so $A_l = 0.0104$. Now $\beta = A_u - A_l$, or $\beta = 0.8365 - 0.0104 = 0.8261$. Thus, the probability of accepting the hypothesis that $\mu = 5$ when in fact $\mu = 7$ is 0.8261 when $\alpha$ is 0.10. The probability of a Type II error is 0.8261.
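The calculation in example 8.3 can be sketched in Python as follows. The function name `type_ii_error` is a hypothetical helper. The result differs from the hand calculation in the third decimal place because the code does not round the intermediate table look-ups.

```python
from statistics import NormalDist


def type_ii_error(mu0, mu1, sigma, alpha):
    """beta for a two-sided test on a single observation from N(mu1, sigma^2).

    The acceptance region is mu0 +/- z_{1-alpha/2} * sigma; beta is the
    probability that an observation from the true distribution falls inside it.
    """
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    x_lower = mu0 - z * sigma            # boundary of the lower critical region
    x_upper = mu0 + z * sigma            # boundary of the upper critical region
    true_dist = NormalDist(mu1, sigma)   # distribution actually being sampled
    return true_dist.cdf(x_upper) - true_dist.cdf(x_lower)


beta = type_ii_error(mu0=5.0, mu1=7.0, sigma=3.0, alpha=0.10)
print(round(beta, 3))
```

Calling the function over a range of `mu1` values gives the points of an operating characteristic curve for the chosen $\alpha$.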

If calculations such as those contained in example 8.3 are carried out for various values of $\mu_1$, a curve relating $\beta$ to $\mu_1$ can be constructed. Such a curve is shown in figure 8.6. Figure 8.6 shows the $\beta$ curve for $\alpha = 0.05$ and $\alpha = 0.10$. Curves such as shown in figure 8.6 are often called operating characteristic (OC) curves. Figure 8.6 verifies the earlier statements that $\beta$ increases as $\alpha$ decreases and $\beta$ increases as the true mean, $\mu_1$, approaches the hypothesized mean, $\mu_0$. In fact, as $\mu_1$ gets close to $\mu_0$, the probability of accepting $\mu = \mu_0$ when $\mu = \mu_1$ is true gets very large. This may not be a serious problem in practice because we may not care, for instance, whether $\mu$ is 5 or 5.5.

Fig. 8.6. Probability of a Type II error as a function of the true mean, $\mu_1$, for example 8.3.

Fig. 8.7. Example power curve (power $= 1 - \beta$ plotted against $\mu_1$).

The quantity $1 - \beta$ is called the power of a test. Ideally, we would like the power to be large for all values of $\mu_1$. In fact, in testing a hypothesis, we would like $\alpha$ to be small and the power to be large. Figure 8.7 shows that the power of a test is a function of $\alpha$ and the true parameter values. The power of a test is also a function of the test itself. For instance, we could have chosen as our test statistic $x_0 + 3$ and then rejected the hypothesis if $x_0 + 3$ fell in the critical region. Figure 8.7 compares the power of this test with the test that rejected the hypothesis if $x_0$ fell in the critical region. Figure 8.7 shows that for certain values of $\mu_1$, the $X_0 + 3$ test is more powerful than the $X_0$ test.

Ideally, we would like to use the test that was the most powerful over the entire range of the unknown parameter. Such a test is known as a uniformly most powerful test. Unfortunately, uniformly most powerful tests do not exist in many situations. Selecting which test to use comes down to the purposes of the test and the consequences of making an error. In our example, if accepting the hypothesis $\mu = 5$ when in fact $\mu > 5$ is a very serious error, whereas accepting it if $\mu < 5$ is of little consequence, we might prefer the $X_0 + 3$ test because it is more powerful in the region $\mu > 5$. If the consequence of an error depended only on the magnitude of the error, the $X_0$ test might be preferred.

From the above discussion, it should be apparent that the selection of $\alpha$ and the type of test to be used depends on the problem at hand. Mood et al. (1974) discuss these concepts in more theoretical terms. The level of significance, $\alpha$, is usually chosen to be 0.10, 0.05, or 0.01. In theory, $\alpha$ should be based on the problem at hand. In practice, $\alpha$ is generally arbitrarily selected.
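The comparison of the $X_0$ and $X_0 + 3$ tests can be explored numerically. The sketch below assumes the setting of example 8.3 ($\mu_0 = 5$, $\sigma_0 = 3$, $\alpha = 0.10$) and reads the "$X_0 + 3$ test" as rejecting when $x_0 + 3$ falls outside $\mu_0 \pm z_{1-\alpha/2}\sigma_0$. That reading, the `shift` parameter, and the function name are assumptions made for illustration.

```python
from statistics import NormalDist


def power(mu1, mu0=5.0, sigma=3.0, alpha=0.10, shift=0.0):
    """Power (1 - beta) when H0 is rejected if x0 + shift lies outside mu0 +/- z*sigma."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    # Acceptance region rewritten in terms of x0 itself.
    lo = mu0 - shift - z * sigma
    hi = mu0 - shift + z * sigma
    true_dist = NormalDist(mu1, sigma)
    beta = true_dist.cdf(hi) - true_dist.cdf(lo)
    return 1.0 - beta


# Tabulate both power curves over a range of true means.
for mu1 in (-5, 0, 3, 5, 7, 10, 15):
    print(mu1, round(power(mu1), 3), round(power(mu1, shift=3.0), 3))
```

Under these assumptions the shifted test has the higher power for true means above 5 and the lower power for true means below 5. Note also that with `shift=3.0` the rejection probability at $\mu_1 = \mu_0$ is no longer equal to $\alpha$, which is one reason the choice of test depends on the consequences of each kind of error.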
Many tests of hypothesis are of the type $\theta = \theta_1$ versus the alternative $\theta \ne \theta_1$. Accepting such a hypothesis as true does not mean that one strictly feels that $\theta = \theta_1$ but rather that $\theta$ is not significantly different from $\theta_1$. For example, if we calculate the mean of a random sample and then accept the hypothesis that the true mean is 5, we may not believe that the true mean is exactly 5 but rather that the true mean is not significantly different from 5. What constitutes a significant difference has been defined by the type of test used and the level of significance. Furthermore, a statistically significant difference and a physically significant difference are not the same. For example, if $\hat{\theta} = 4.0$ is an estimate for $\theta$ and a test of hypothesis shows $\hat{\theta}$ is not significantly different from zero, it does not mean $\theta = 0$ should be used in some physical analysis if this physical analysis is sensitive to differences in $\theta$ of this order of magnitude. A physically significant difference depends on the problem being studied.

The following is a discussion of several common tests of hypotheses. The hypothesis to be tested is denoted by $H_0$ and the alternative hypothesis by $H_a$. For the tests that follow to be correct statistical tests, the assumptions involved in developing the test statistic must not be violated. A primary assumption is that the statistics are estimated based on a random sample. In practice, at least some of the assumptions are generally violated, with the result that the tests are only approximate tests. This approximation is manifest in the fact that the actual level of significance will not be equal to $100\alpha\%$. The fact that these tests are often approximate because of assumption violations does not render the tests of no value. It is the analyst who must make the decision, not a statistical test following some prescribed procedure. The analyst may, however, put less weight on a statistical test in arriving at a decision if the violations of the assumptions of the statistical test are of concern.

$H_0$: $\mu = \mu_1$, $H_a$: $\mu = \mu_2$, Normal Distribution, Known Variance

In this case, $H_0$ is a simple hypothesis and $H_a$ is a simple alternative hypothesis. The test statistic is developed by considering that

$$Z = \frac{\bar{X} - \mu_1}{\sigma_x/\sqrt{n}}$$

has a standard normal distribution. If $\mu_1 > \mu_2$, then $H_0$ is rejected if

$$\bar{x} \le \mu_1 - z_{1-\alpha}\,\sigma_x/\sqrt{n}$$

If $\mu_1 < \mu_2$, then $H_0$ is rejected if

$$\bar{x} \ge \mu_1 + z_{1-\alpha}\,\sigma_x/\sqrt{n}$$

In the preceding expressions, $z_{1-\alpha}$ represents the point on the standard normal distribution such that $\mathrm{prob}(Z \ge z_{1-\alpha}) = \alpha$.

$H_0$: $\mu = \mu_1$, $H_a$: $\mu = \mu_2$, Normal Distribution, Unknown Variance

The test statistic for this situation is

$$t = \frac{\bar{X} - \mu_1}{s_x/\sqrt{n}}$$

$H_0$ is rejected if

$$\bar{x} \le \mu_1 - t_{1-\alpha,\,n-1}\, s_x/\sqrt{n} \quad \text{for } \mu_1 > \mu_2$$

and

$$\bar{x} \ge \mu_1 + t_{1-\alpha,\,n-1}\, s_x/\sqrt{n} \quad \text{for } \mu_1 < \mu_2 \tag{8.12}$$

$H_0$: $\mu = \mu_0$, $H_a$: $\mu \ne \mu_0$, Normal Distribution, Known Variance

This hypothesis is a simple hypothesis with a compound alternative hypothesis. Again, the test statistic is

$$Z = \frac{\bar{X} - \mu_0}{\sigma_x/\sqrt{n}}$$

$H_0$ is rejected if

$$|Z| = \left|\frac{\bar{x} - \mu_0}{\sigma_x/\sqrt{n}}\right| > z_{1-\alpha/2}$$

$H_0$: $\mu = \mu_0$, $H_a$: $\mu \ne \mu_0$, Normal Distribution, Unknown Variance

Generally, a population variance is not known and must be estimated. In that case, $H_0$ is tested by using

$$t = \frac{\bar{X} - \mu_0}{s_x/\sqrt{n}}$$

$H_0$ is rejected if

$$|t| = \left|\frac{\bar{x} - \mu_0}{s_x/\sqrt{n}}\right| > t_{1-\alpha/2,\,n-1}$$

This test cannot be applied to every set of data. The assumption has been made that the observations are from a normal distribution.

Example 8.4. The annual runoff for Cave Creek near Fort Spring, Kentucky, for the period 1953 to 1970, has a mean of 14.65 inches and a standard deviation of 4.75 inches. Test the hypothesis that the mean annual runoff is 16.5 inches.

Solution: The testing procedures we have available to us are all based on the assumption of normality. If we assume the annual runoff is normally distributed, we can use equation 8.14 to test $H_0$: $\mu = 16.5$ versus $H_a$: $\mu \ne 16.5$. There are 18 observations. The test statistic is

$$t = \frac{14.65 - 16.5}{4.75/\sqrt{18}} = -1.65$$

Using a 5% level of significance, $\alpha = 0.05$ and $t_{1-\alpha/2,\,n-1} = t_{0.975,\,17} = 2.11$. Because $|t| = 1.65 < 2.11$, we do not reject the hypothesis that the mean is 16.5.
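The test statistic of example 8.4 is easily verified. In this sketch the function name `one_sample_t` is hypothetical, and the critical value 2.11 is the tabulated $t_{0.975,17}$ quoted in the example rather than something computed by the code.

```python
import math


def one_sample_t(xbar, mu0, s, n):
    """t statistic for H0: mu = mu0 with the variance estimated from the sample."""
    return (xbar - mu0) / (s / math.sqrt(n))


t_stat = one_sample_t(xbar=14.65, mu0=16.5, s=4.75, n=18)
t_crit = 2.11                      # tabulated t_{0.975,17} from example 8.4
reject = abs(t_stat) > t_crit      # two-sided rejection rule
print(round(t_stat, 2), reject)
```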

Comment: Some statisticians do not like to "accept" $H_0$. Their reasoning is that we have not proven $H_0$, only found strong evidence to support it. As a result of a statistical test, their conclusions would be either reject $H_0$ or fail to reject $H_0$. It should be kept in mind, however, that we have not proven $H_0$. For instance, in this example, we have calculated the sample mean to be 14.65 and accepted the hypothesis that the population mean is 16.5. This illustrates two points. First, the data and the test obviously do not prove that $\mu = 16.5$. Second, what we really have accepted is not that the mean is 16.5 but that, when sampling from this distribution using a sample of size 18, the difference between the sample mean of 14.65 and the hypothesized mean of 16.5 can reasonably be ascribed to chance variations due to the random sample. Our conclusion is that, based on this sample, we cannot say that the population mean is not 16.5 or, based on this sample, the population mean is not (statistically) significantly different from 16.5.

Test for Differences in Means of Two Normal Distributions

If the variances of the two normal distributions are known, then $H_0$: $\mu_1 - \mu_2 = \delta$ versus $H_a$: $\mu_1 - \mu_2 \ne \delta$ can be tested by calculating the test statistic

$$Z = \frac{\bar{x}_1 - \bar{x}_2 - \delta}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}$$

In this case, $Z$ has a standard normal distribution, so the rejection region is $|z| > z_{1-\alpha/2}$.

If the variances of the two normal distributions are equal but unknown, $H_0$: $\mu_1 - \mu_2 = \delta$ versus $H_a$: $\mu_1 - \mu_2 \ne \delta$ is tested by calculating the statistic

$$t = \frac{\bar{x}_1 - \bar{x}_2 - \delta}{\sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

which has a t distribution with $n_1 + n_2 - 2$ degrees of freedom. Thus, $H_0$ is rejected if

$$|t| > t_{1-\alpha/2,\,n_1+n_2-2}$$

Again, note that these two tests are based on the assumption of normality. For large samples, the Central Limit Theorem may enable us to use these tests as approximate tests for non-normal samples.

Gibra (1973), Ostle (1963), and others discuss testing $H_0$: $\mu_1 - \mu_2 = \delta$ versus $H_a$: $\mu_1 - \mu_2 \ne \delta$ when sampling from two normal populations with unknown and unequal variances. Ostle recommends the following approximate procedure. Compute the test statistic

$$T = \frac{\bar{x}_1 - \bar{x}_2 - \delta}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$$

The hypothesis is rejected if

$$|T| > \frac{w_1 t_1 + w_2 t_2}{w_1 + w_2}$$

where $w_1 = s_1^2/n_1$, $w_2 = s_2^2/n_2$, $t_1 = t_{1-\alpha/2,\,n_1-1}$, and $t_2 = t_{1-\alpha/2,\,n_2-1}$.
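The two-sample statistics for unknown variances can be sketched as below. The sample values in the usage lines are made up purely for illustration, the function names are hypothetical, and the formulas are the standard pooled-variance and separate-variance forms. The critical values still have to come from a t table.

```python
import math


def pooled_t(xbar1, xbar2, s1, s2, n1, n2, delta=0.0):
    """Pooled-variance t statistic (equal but unknown variances) and its degrees of freedom."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
    se = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (xbar1 - xbar2 - delta) / se, df


def separate_variance_t(xbar1, xbar2, s1, s2, n1, n2, delta=0.0):
    """Test statistic when the two variances are unknown and not assumed equal."""
    return (xbar1 - xbar2 - delta) / math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)


# Hypothetical samples: means 10 and 8, standard deviations 2, eight values each.
t_pooled, df = pooled_t(10.0, 8.0, 2.0, 2.0, 8, 8)
t_separate = separate_variance_t(10.0, 8.0, 2.0, 2.0, 8, 8)
print(t_pooled, df, t_separate)
```

With equal sample sizes and equal sample variances the two statistics coincide. They diverge as the sample variances or sample sizes become unequal.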

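Test of $H_0$: $\sigma^2 = \sigma_0^2$ versus $H_a$: $\sigma^2 \ne \sigma_0^2$, Normal Population

A test of $H_0$: $\sigma^2 = \sigma_0^2$ versus $H_a$: $\sigma^2 \ne \sigma_0^2$ when sampling from a normal distribution with sample size $n$ can be made by calculating the test statistic

$$\chi_c^2 = \frac{(n - 1)s_x^2}{\sigma_0^2}$$

and then accepting $H_0$ if

$$\chi^2_{\alpha/2,\,n-1} < \chi_c^2 < \chi^2_{1-\alpha/2,\,n-1}$$

Otherwise $H_0$ is rejected.

Test of $H_0$: $\sigma_1^2 = \sigma_2^2$ versus $H_a$: $\sigma_1^2 \ne \sigma_2^2$ for Two Normal Populations

To test the hypothesis that the variances of two normal populations are equal, the sample test statistic is

$$F = \frac{s_1^2}{s_2^2}$$

where $s_1^2$ is the larger sample variance. $F$ is distributed as an F distribution with $n_1 - 1$ and $n_2 - 1$ degrees of freedom, where $n_1$ is the sample size for the sample having the larger variance and $n_2$ is the sample size for the sample with the smaller variance. $H_0$ is rejected if $F$ exceeds the tabulated critical value of the F distribution with $n_1 - 1$ and $n_2 - 1$ degrees of freedom for the chosen level of significance.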

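Test for Equality of Variances from Several Normal Distributions

To test $H_0$: $\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_k^2$ for $k$ independent samples, each from a normal population with mean $\mu_i$ and variance $\sigma_i^2$, it is first necessary to calculate the $k$ sample variances $s_i^2$. The quantity $Q/h$ is approximately distributed as a chi-square distribution with $k - 1$ degrees of freedom, where

$$Q = (N - k)\ln\left[\frac{\sum_{i=1}^{k}(n_i - 1)s_i^2}{N - k}\right] - \sum_{i=1}^{k}(n_i - 1)\ln s_i^2$$

and

$$h = 1 + \frac{1}{3(k - 1)}\left[\sum_{i=1}^{k}\frac{1}{n_i - 1} - \frac{1}{N - k}\right]$$

with $n_i$ the size of the $i$th sample and $N = \sum n_i$. $H_0$ is rejected if

$$Q/h > \chi^2_{1-\alpha,\,k-1}$$

In this test, $H_a$ is that the $\sigma_i^2$ are not all equal. This means that at least one $\sigma_i^2$ is different from the other $\sigma_i^2$. The test is known as Bartlett's test for homogeneity of variances. Homogeneity of variance is also known as homoscedasticity.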
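Bartlett's test can be sketched as follows. The code uses the standard textbook form of Bartlett's statistic, which is assumed here to correspond to the quantities $Q$ and $h$ of the text. The function name and the sample variances in the usage lines are hypothetical.

```python
import math


def bartlett_statistic(variances, sizes):
    """Return (Q/h, degrees of freedom) for Bartlett's test of equal variances.

    variances: sample variances s_i^2; sizes: sample sizes n_i.
    Standard form of the statistic, assumed for this sketch.
    """
    k = len(variances)
    dfs = [n - 1 for n in sizes]
    df_total = sum(dfs)  # N - k
    pooled = sum(d * v for d, v in zip(dfs, variances)) / df_total
    q = df_total * math.log(pooled) - sum(d * math.log(v) for d, v in zip(dfs, variances))
    h = 1.0 + (sum(1.0 / d for d in dfs) - 1.0 / df_total) / (3.0 * (k - 1))
    return q / h, k - 1


# Hypothetical example: three samples of 10 observations each.
stat_equal, dof = bartlett_statistic([4.0, 4.0, 4.0], [10, 10, 10])
stat_unequal, _ = bartlett_statistic([1.0, 4.0, 9.0], [10, 10, 10])
print(round(stat_equal, 3), round(stat_unequal, 3), dof)
```

The statistic is zero when all sample variances are identical and grows as they spread apart. It is then compared with the tabulated chi-square value for $k - 1$ degrees of freedom.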

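Example 8.5. For the preceding example, test the hypothesis that the variance is 36.00.

Solution: The assumption of normality is used. The test is based on equation 8.18 using $\alpha = 0.05$.

$$\chi_c^2 = \frac{(n - 1)s_x^2}{\sigma_0^2} = \frac{17\,(4.75)^2}{36.00} = 10.65$$

From a chi-square table

$$\chi^2_{0.025,\,17} = 7.6, \qquad \chi^2_{0.975,\,17} = 30.2$$

Since 10.65 is between 7.6 and 30.2, $H_0$ is not rejected.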
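A short check of example 8.5 is given below. The function name is hypothetical, and the two critical values are the rounded table values quoted in the example rather than quantities computed by the code.

```python
def variance_chi_square(s, n, sigma0_sq):
    """Test statistic (n - 1) s^2 / sigma0^2 for H0: sigma^2 = sigma0^2."""
    return (n - 1) * s ** 2 / sigma0_sq


chi_sq = variance_chi_square(s=4.75, n=18, sigma0_sq=36.0)
lower_crit, upper_crit = 7.6, 30.2           # tabulated values from example 8.5
accept = lower_crit < chi_sq < upper_crit    # two-sided acceptance region
print(round(chi_sq, 2), accept)
```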

TESTING THE GOODNESS OF FIT OF DATA TO PROBABILITY DISTRIBUTIONS

Two ways of judging whether or not a particular distribution adequately describes a set of observations have already been discussed. Both of these methods required a visual judgment of goodness of fit. One method was to compare the observed relative frequency curve with the hypothesized relative frequency curve. The second method was to plot the data and the hypothesized distribution as a cumulative probability distribution on appropriate paper and judge whether or not the hypothesized distribution adequately describes the plotted points. Statistical tests corresponding to these visual tests will be discussed. In the following discussion, the hypothesis being tested is that the data are from a specified probability distribution.

Chi-square Goodness of Fit Test

One of the most commonly used tests for goodness of fit of empirical data to specified theoretical frequency distributions is the chi-square test. This test makes a comparison between the actual number of observations and the expected number of observations (expected according to the distribution under test) that fall in the class intervals. The expected numbers are calculated by multiplying the expected relative frequency by the total number of observations. The test statistic is calculated from the relationship

$$\chi_c^2 = \sum_{i=1}^{k}\frac{(O_i - E_i)^2}{E_i} \tag{8.21}$$

where $k$ is the number of class intervals, and $O_i$ is the observed and $E_i$ the expected (according to the distribution being tested) number of observations in the $i$th class interval. The distribution of $\chi_c^2$ is a chi-square distribution with $k - p - 1$ degrees of freedom, where $p$ is the number of parameters estimated from the data. The hypothesis that the data are from the specified distribution is rejected if

$$\chi_c^2 > \chi^2_{1-\alpha,\,k-p-1}$$
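The statistic is simple to compute once the observed and expected numbers are in hand. The sketch below uses the observed and expected numbers of the Kentucky River example (table 8.2). The function name is hypothetical, and the critical value must still be read from a chi-square table.

```python
def chi_square_gof(observed, expected):
    """Chi-square goodness of fit statistic: sum of (O - E)^2 / E over the classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Observed and expected numbers per class interval, as listed in table 8.2.
observed = [3, 6, 16, 16, 18, 13, 13, 7, 3, 4]
expected = [5.03, 6.57, 11.10, 15.39, 17.51, 16.35, 12.54, 7.89, 4.08, 2.55]

chi_c = chi_square_gof(observed, expected)
dof = len(observed) - 2 - 1   # k - p - 1 with p = 2 estimated parameters
print(round(chi_c, 2), dof)
```

Because each term is divided by $E_i$, classes with small expected numbers (usually the tails) can dominate the sum, which is the sensitivity that motivates combining sparse classes.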

Example 8.6. As an example of using the chi-square test, consider the Kentucky River data of table 2.1 and test the hypothesis that the data are from a normal distribution. The observed and expected numbers in each class interval are obtained by multiplying the relative frequency by 99, which is the number of observations. Table 8.2 shows the calculation of $\chi_c^2$.

Table 8.2. Chi-square test on Kentucky River data

Class mark    Observed number    Expected number    (O - E)^2/E
    25,000            3                5.03            0.820
    35,000            6                6.57            0.050
    45,000           16               11.10            2.162
    55,000           16               15.39            0.025
    65,000           18               17.51            0.014
    75,000           13               16.35            0.686
    85,000           13               12.54            0.017
    95,000            7                7.89            0.100
   105,000            3                4.08            0.284
   115,000            4                2.55            0.823
   Total             99               99               4.982

The degrees of freedom is $k - 3$, or 7, since two parameters ($\mu_x$ and $\sigma_x^2$) were estimated for the normal distribution. Comparing $\chi_c^2$ of 4.98 with $\chi^2_{0.90,\,7} = 12.0$, it is concluded that the normal distribution cannot be rejected for these data for $\alpha = 0.10$. If $\chi_c^2$ had exceeded $\chi^2_{1-\alpha,\,k-p-1}$, the hypothesis that the normal distribution describes the data would be rejected.

In constructing table 8.2 the expected number in a class interval is based on $n[P_X(x_i) - P_X(x_{i-1})]$ for all intervals except the first and last ones. For the first interval the expected number is $nP_X(x_1)$ and for the last interval $n[P_X(\infty) - P_X(x_{k-1})]$. In these expressions $x_i$ represents the right boundary of the $i$th class.

Comment: By examining table 8.2 and equation 8.21, it is apparent that the chi-square goodness of fit test is quite sensitive in the tails of the assumed distribution. Because of this, many statisticians recommend that classes be combined if the expected number in a class is less than 3 (or 5). If the criterion of 5 is used, the first two classes and the last two classes must be combined. The calculation of $\chi_c^2$ is then as shown in table 8.3, and the $\chi_c^2$ value is reduced to 3.62. The degrees of freedom are reduced to 5.

Table 8.3. Chi-square test on Kentucky River data (modified). Columns: class mark, observed number, expected number, $(O - E)^2/E$; total number of observations 99.

Perhaps a better way of conducting the chi-square goodness of fit test is to define the class intervals so that, under the hypothesis being tested, the expected number of observations in each class interval is the same. This means that the class intervals will be of unequal width and that the interval widths will be a function of the distribution being tested.

Example 8.7. A chi-square test for normality of Kentucky River data using 10 class intervals each having the same expected frequency can be conducted as follows. Ten class intervals means that the expected relative frequency or probability in each interval is 0.1. The class boundaries can be determined by solving the inverse of the cumulative distribution. For instance, the boundaries of the 4th class interval are given by the values of $x$ satisfying $P_X(x) = 0.3$ and $P_X(x) = 0.4$.
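Solving the inverse of the cumulative distribution is a one-line operation with `statistics.NormalDist`. The sketch below assumes the fitted normal distribution uses the sample mean and standard deviation of example 8.1 (66,540 and 22,322). The function names, and the convention that a value exactly on a boundary is counted in the lower class, are assumptions of this illustration.

```python
from bisect import bisect_left
from statistics import NormalDist


def equal_probability_boundaries(mean, sd, k):
    """Interior class boundaries that give k classes of probability 1/k each."""
    dist = NormalDist(mean, sd)
    return [dist.inv_cdf(i / k) for i in range(1, k)]


def count_in_classes(data, boundaries):
    """Observed number per class; a value equal to a boundary goes to the lower class."""
    counts = [0] * (len(boundaries) + 1)
    for x in data:
        counts[bisect_left(boundaries, x)] += 1
    return counts


bounds = equal_probability_boundaries(66540.0, 22322.0, 10)
print([round(b) for b in bounds])
```

The first two boundaries round to 37,933 and 47,753, and the fifth boundary is the mean itself. The observed numbers per class would follow from `count_in_classes` applied to the 99 peak flows of table 2.1, which are not reproduced here.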


Table 8.4. Chi-square test based on equal expected numbers per class interval

Class number   Lower boundary   Upper boundary   Observed number   Expected number   (O - E)^2/E
      1           -infinity          37933              8                9.9             0.365
      2             37933            47753             15                9.9             2.627
      3             47753            54834             13                9.9             0.971
      4             54834            60885              7                9.9             0.849
      5             60885            66540              7                9.9             0.849
      6             66540            72195             14                9.9             1.698
      7             72195            78246              5                9.9             2.425
      8             78246            85327             12                9.9             0.445
      9             85327            95147              8                9.9             0.365
     10             95147          +infinity           10                9.9             0.001
   Total                                               99               99              10.596

Table 8.4 contains the data for conducting the chi-square test based on 10 class intervals having equal expected numbers of observations (99/10 or 9.9) in each interval. In this case, $\chi_c^2$ is 10.60, which is less than $\chi^2_{0.90,\,7}$ of 12.02. The hypothesis is, again, not rejected.

Distributional Tests Based on Cumulative Distributions

Conover (1980) presents a good discussion of statistical tests based on cumulative distributions. The most commonly used of these tests is the Kolmogorov-Smirnov one sample test (also known as the Kolmogorov test). The hypothesis being tested is that a set of empirical observations come from a particular, known, and completely specified cumulative distribution. This test is conducted as follows:

1. Let $P_X(x)$ be the completely specified theoretical cumulative distribution function under the null hypothesis.
2. Let $S_n(x)$ be the sample cumulative distribution function based on $n$ observations. For any observed $x$, $S_n(x) = k/n$, where $k$ is the number of observations less than or equal to $x$.
3. Determine the maximum deviation, $D$, defined by

$$D = \max\,|P_X(x) - S_n(x)|$$

4. If, for the chosen significance level, the observed value of $D$ is greater than or equal to the critical tabulated value of the Kolmogorov-Smirnov (K-S) statistic, the hypothesis is rejected.

A table of the Kolmogorov-Smirnov test statistic is included in the appendix. This test can be conducted by calculating the quantities $P_X(x)$ and $S_n(x)$ at each observed point, or by plotting the data as in figures 7.3c and d and selecting the greatest deviation on the probability scale of a point from the theoretical line. If the latter approach is used, care must be taken to select the largest deviation on the probability scale, which is not necessarily linear. The largest deviation of the empirical distribution from the known distribution is sought. The empirical distribution gives $\mathrm{prob}(X \le x)$ and is thus a step function with steps at each data point. Rather than falling on a data point, the largest deviation may be at a point where the probability takes a step change. Example 8.8 and figure 8.8 illustrate the determination of this maximum deviation, which in this case occurs just prior to $X = 18$ and has a value of 0.29 on the probability scale. In this case, the known distribution is an exponential distribution. There are eight data points. The critical value for the K-S statistic with $\alpha = 0.10$ is 0.411. Thus, the hypothesis that the data are from this particular distribution cannot be rejected.

Fig. 8.8. Graphical determination of critical K-S value (horizontal axis: probability, 0 to 1).

When few observations are available, it is very difficult to use a statistical test to find an appropriate distribution for the data. In figure 8.8 only 8 observations are available. Obviously, the chi-square test cannot be used because adequate data for grouping are not available. As already seen, the K-S test is insensitive for a sample of this size since it requires a large deviation to reject the hypothesis with this small sample. If the K-S test is used to test the hypothesis that these data came from a uniform distribution or a normal distribution, these hypotheses could not be rejected either. With small samples, the power of the K-S test is not very great and the probability of failing to reject a false hypothesis, a Type II error, is great.

Example 8.8. Consider the data values 18, 29, 45, 56, 50, 40, 20, 10. Test the hypothesis that the data are from an exponential distribution with known mean of 33.5.

The tabulated quantities are the rank, the ranked data, $S_x$, $P_x$, $|S_x - P_x|$, and $|S_{x-1} - P_x|$.


The critical value is 0.411 for $n = 8$ and $\alpha = 0.10$. The hypothesis cannot be rejected.

Note that for the Kolmogorov-Smirnov test, $P_X(x)$ is a completely specified, cumulative probability distribution. That is, no parameters of the distribution may be estimated from the observed data. Crutcher (1975) points out that when parameters must be estimated to specify $P_X(x)$, the Kolmogorov-Smirnov test is conservative with respect to the Type I error. That is, if the critical value is exceeded by the test statistic obtained from the observed values, the hypothesis is rejected with considerable confidence. Crutcher (1975) presents a table of critical values for sample sizes of 25 and 30 as well as infinitely large samples for the exponential, gamma, normal, and extreme value distributions when parameters of these distributions must be estimated. In general, these critical values are smaller than the values given in the Kolmogorov-Smirnov table in the appendix.

Conover (1980) discusses Lilliefors's extension of the K-S test to the normal distribution with mean and variance estimated from the data (Lilliefors, 1967) and the exponential distribution with mean estimated from the data (Lilliefors, 1969). The tests are conducted as with the K-S test except that the critical values are smaller. Conover (1980) presents tables for the required critical values. Based on data in Conover, letting $KS$ represent the critical value of the Kolmogorov-Smirnov statistic and $L$ represent the critical value for the Lilliefors test, the approximation $L = a + b\,KS$ can be used, where $a$ and $b$ are given in the following table for 4 to 30 observations. For $n$ greater than 30 the approximation $L = c/\sqrt{n}$ from Conover (1980) yields reasonable estimates for the critical values.

Distribution    alpha      a        b        c
Normal          0.10     0.021    0.586    0.805
Normal          0.05     0.027    0.565    0.886
Normal          0.01     0.040    0.528    1.031
Exponential     0.10     0.003    0.780    0.977
Exponential     0.05     0.009    0.767    1.075
Exponential     0.01     0.016    0.744    1.274

Example 8.9. Repeat example 8.8 assuming the mean is unknown.

Solution: The calculated mean is 33.5, so the observed maximum deviation is 0.29, as before. If the calculated mean had been other than 33.5, the values for Px would change. The critical D based on the Lilliefors test using the exponential distribution and the approximations above is

$$L = a + b\,KS$$

with α = 0.10. The tabled value for KS is 0.411. Therefore L is found to be 0.003 + 0.780(0.411) or 0.324. The hypothesis cannot be rejected.
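The two approximations for the Lilliefors critical value can be collected into a small lookup. The sketch below is illustrative only: the function name `lilliefors_critical` and the dictionary layout are assumptions, while the coefficients a, b, and c are those tabulated above.

```python
import math

# (a, b, c) keyed by (distribution, significance level), from the table in the text.
COEFFS = {
    ("normal", 0.10): (0.021, 0.586, 0.805),
    ("normal", 0.05): (0.027, 0.565, 0.886),
    ("normal", 0.01): (0.040, 0.528, 1.031),
    ("exponential", 0.10): (0.003, 0.780, 0.977),
    ("exponential", 0.05): (0.009, 0.767, 1.075),
    ("exponential", 0.01): (0.016, 0.744, 1.274),
}

def lilliefors_critical(dist, alpha, n, ks_critical=None):
    """Approximate Lilliefors critical value L.

    For 4 <= n <= 30, L = a + b*KS, where KS is the tabled K-S critical value.
    For n > 30, L = c / sqrt(n).
    """
    a, b, c = COEFFS[(dist, alpha)]
    if n > 30:
        return c / math.sqrt(n)
    if n < 4:
        raise ValueError("approximation is given only for n >= 4")
    if ks_critical is None:
        raise ValueError("the tabled K-S critical value is needed for n <= 30")
    return a + b * ks_critical

# Exponential case with n = 8 and tabled KS = 0.411 (example 8.9).
print(round(lilliefors_critical("exponential", 0.10, 8, 0.411), 3))
# Normal case with n = 99, which uses the large-sample form c / sqrt(n).
print(round(lilliefors_critical("normal", 0.10, 99), 3))
```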


Example 8.10. Test the hypothesis that the Kentucky River peak flow data are normally distributed. Use the Kolmogorov-Smirnov test.

Solution: The data are plotted in figure 8.9. The maximum deviation between the best fitting line, Px(x), and the plotted points, Sx(x), on the probability scale is about 0.074 at X = 55,200 cfs (table 8.5). Because the test for normality is being done and the mean and variance are estimated from the data, Lilliefors's approach is used. For α = 0.10 and n = 99, the critical value for the Lilliefors statistic is 0.805/√99 or 0.081. Table 8.5 shows the calculations needed to find the maximum deviation. The maximum deviation is the maximum value in the columns under

Fig. 8.9a. Normal probability plot of Kentucky River data on annual flow.

Fig. 8.9b. Lognormal probability plot of Kentucky River data on annual flow.

Table 8.5. continued

Rank    Data      Sx      Px    Sx − Px    S(x − 1) − Px
        87100     0.82
        87200     0.83
        88900     0.84
        89400     0.85
        91500     0.86
        92500     0.87
        93700     0.88
        94300     0.89
        96100     0.90
        98400     0.91
        99100     0.92
        101000    0.93
        105000    0.94
        107000    0.95
        111000    0.96
        112000    0.97
        115000    0.98
        144000    0.99
Max dev



Fig. 8.10. Critical correlation values for normality test.

Sx − Px and S(x − 1) − Px. Because the largest value is less than the critical value, the hypothesis of a normal distribution cannot be rejected.

Other tests for normality include the Shapiro-Wilk test (Conover 1980) and a test based on the correlation between the standardized Z-value associated with the plotting position and the values of the observations (Helsel and Hirsch, 1992). The critical correlation values are shown in figure 8.10. The test is credited to Looney and Gulledge (1985). For this test the Weibull plotting position should not be used. Table 8.6 shows the Kentucky River data, the plotting positions calculated from the Cunnane (1978) relationship, and the Z-values associated with the plotting positions. The correlation between the flow and the Z-values is 0.989. Figure 8.10 shows that for 99 observations and α = 0.05, the critical correlation value is about 0.98. Thus, the hypothesis of normality cannot be rejected. It is interesting to note that the correlation between the logarithms of the values and the Z-values is 0.993, indicating that a hypothesis of lognormality cannot be rejected either.

Comparing Two Empirical Distributions

Occasionally, it is desired to determine if the distributions of two independent, random samples are the same. A two-sided Kolmogorov-Smirnov test can be used to assist in this determination. Available are two independent samples of size m and n. P1(x) and P2(y) represent the two unknown distributions. The hypothesis that P1(x) = P2(y) is tested by comparing the empirical distributions and finding the maximum value of |S1(x) − S2(y)| over all x and y. Let this maximum difference be T. The hypothesis is rejected if T exceeds a critical value. Critical values of T are given in Conover (1980), Beyer (1968), and other statistical handbooks.

Example 8.11. The following data are from two independent samples. Test the hypothesis that the underlying distribution is the same for both samples.

X = 2.25, 2.63, 3.09, 3.47, 3.76, 4.01, 4.14, 5.51, 6.10, 6.33

Y = 5.37, 5.60, 6.33, 8.90

Table 8.6. Kentucky River data

Flow


The maximum deviation is 0.70. From Conover (1980), with α = 0.10, the critical test value is 13/20, or 0.65. Thus, one can reject the hypothesis that the two samples are from the same distribution. Conover (1980) can be consulted for more details on this test and for a companion one-sided test.
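The statistic T of example 8.11 can be checked with a short sketch. The function name `two_sample_ks` is hypothetical. Both empirical distributions are step functions that change only at observed values, so it is enough to evaluate them at every pooled data point.

```python
def two_sample_ks(x, y):
    """Maximum absolute difference between two empirical distribution functions."""
    m, n = len(x), len(y)
    t = 0.0
    for v in sorted(set(x) | set(y)):
        s1 = sum(1 for a in x if a <= v) / m   # S1 evaluated at v
        s2 = sum(1 for b in y if b <= v) / n   # S2 evaluated at v
        t = max(t, abs(s1 - s2))
    return t

X = [2.25, 2.63, 3.09, 3.47, 3.76, 4.01, 4.14, 5.51, 6.10, 6.33]
Y = [5.37, 5.60, 6.33, 8.90]
T = two_sample_ks(X, Y)
critical = 13 / 20   # value quoted from Conover (1980) for alpha = 0.10
print(round(T, 2), "reject" if T > critical else "cannot reject")
```

The maximum is reached at 4.14, where seven of the ten X values but none of the Y values have been passed.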

General Comments on Goodness of Fit Tests

Many hydrologists discourage the use of the chi-square and Kolmogorov-Smirnov tests when testing hydrologic frequency distributions. The reason for this is the importance of the tails of hydrologic frequency distributions and the insensitivity of these statistical tests in the tails of the distributions. In the example above with 99 observations and α = 0.10, a critical value of the Kolmogorov-Smirnov statistic of 0.12 was obtained. It is nearly impossible to get a deviation of this magnitude in the tails of distributions when the procedures outlined in this chapter are followed.

The sensitivity of the chi-square test can be improved in the tails of the distribution if classes are not combined to get an expected frequency of 3 to 5 as recommended earlier. The disadvantage of this is that a single observation in a class with a low expectation can result in a χ² value in excess of the critical value. This single observation can lead to rejecting the hypothesis.

Unfortunately, no satisfactory alternate tests are presently available for making goodness of fit tests. Neither the chi-square test nor the Kolmogorov-Smirnov test is very powerful, in the sense that the probability of accepting the hypothesis when it is in fact false is very high when these tests are used. This is especially true for small samples. These criticisms of the goodness of fit tests can be illustrated in the exercises dealing with simulation, as shown in chapter 13.


Exercises

8.1. A sample of 20 random observations produced a mean of 145 and a variance of 30. What are the 95% confidence intervals on the mean assuming a normal distribution if (a) the true variance is estimated by 30; (b) the true variance is 30. Discuss the reason you feel that the confidence intervals computed for part (a) are wider than for part (b).

8.2. What are the 95% confidence intervals on the variance for the samples of exercise 8.1?

8.3. Test the hypothesis that the true mean of the data producing the sample whose properties are given in exercise 8.1 is 165.

8.4. Discuss any connection between hypothesis testing and confidence intervals that you can discern. What are the differences?


8.5. Assuming the data are normally distributed, test the hypothesis that the mean peak discharge on the Kentucky River near Salvisa (table 2.1) for the period 1895-1916 is different than it is for the period 1939-1960.

8.6. Repeat exercise 8.5, except test for equality of variances.

8.7. Using the data of table 2.1, test the hypothesis that the variances of the peak discharges are the same for the three periods 1895-1916, 1917-1938, 1939-1960.

8.8. Test the hypothesis that the mean monthly rainfall for September and October are the same on the Walnut Gulch watershed near Tombstone, Arizona. What assumptions did you make? Are these assumptions reasonable?

8.9. Repeat exercise 8.8 for equality of variances.

8.10. Test the hypothesis that the difference in the mean monthly rainfall on Walnut Gulch near Tombstone, Arizona, for September and October is 0.50 inches. Discuss the validity of the assumptions that are made.

8.11. Test the hypothesis that monthly rainfall in October on the Walnut Gulch watershed near Tombstone, Arizona, is normally distributed.

8.12. Test the hypothesis that annual rainfall on the Walnut Gulch watershed near Tombstone, Arizona, is normally distributed.

8.13. Comment on the results of exercises 8.11 and 8.12 in terms of the Central Limit Theorem.

8.14. Would the plotting position relationship used in exercise 7.6 have any effect on the results of a test for normality on the data set you selected?

8.15. Use the Kolmogorov-Smirnov test to answer exercise 7.10.

8.16. Use the Kolmogorov-Smirnov test to test for normality the three sets of data plotted in exercise 7.11.

8.17. Use the Kolmogorov-Smirnov test to test for lognormality of three sets of data plotted in exercise 7.12.

8.18. Work exercise 8.16 using the chi-square test.

8.19. Work exercise 8.17 using the chi-square test.

8.20. What distribution do you think would fit the data of exercise 2.2? Use the chi-square test to evaluate your assertion.


8.21. The following are experimentally determined values of Manning's n for plastic pipe as determined by Haan (1965). Test the hypothesis that the mean value of n is different from the recommended design value of 0.0090.

9. Simple Linear Regression

NOTATION

In this chapter an upper case letter will represent a variable, a lower case letter will represent the difference between a variable and its mean, and a subscript will be used to denote a particular value for the variable. Thus Y represents a variable which may take on values Y1, Y2, Y3, and so on. Ȳ is the mean of Y. y = Y − Ȳ and yi = Yi − Ȳ. Parameters are denoted by Greek letters and a corresponding English letter is used to denote an estimate for the parameter. Thus α is a parameter estimated by a (α̂ = a). The lower case letter e will be used to denote the difference between an observed value of Y and its predicted value Ŷ. Thus Y − Ŷ = e and Yi − Ŷi = ei. All summations in this chapter will run from 1 to n unless otherwise specified, where n is the number of observations on Y and X.

SIMPLE REGRESSION

Possibly the most common model used in hydrology is based on the assumption of a linear relationship between two variables. Generally, the objective of such a model is to provide a means of predicting or estimating one variable, the dependent variable, from knowledge of a second variable, the independent variable. The statistical procedure used for determining a linear relationship between two variables is known as regression. Often the term regression is reserved for use when all of the X variables being considered are random variables. In this book liberties will be taken and the term applied whether or not the X variables are random variables.

As used in this chapter, dependent and independent are not the same as dependence or independence of random variables. Here, dependent means that the variable may be expressed as a (linear)


[Figure 9.1 plot. Legend: data; mean; Ŷ = a + bX; 95% CI on regression line; 95% CI on individual predicted Y. Horizontal axis: annual precipitation (in.).]

Fig. 9.1. Annual rainfall-runoff relation for Cave Creek.

function of a second variable known as the independent variable. Obviously, if the variables are strictly independent in a statistical sense, one variable would give no information about the other.

Figure 9.1 shows a situation where it may be desirable to find a linear relationship between the annual runoff, Y, and the annual precipitation, X, for Cave Creek near Lexington, Kentucky. The annual runoff is the dependent variable and the rainfall the independent variable. The data used in constructing figure 9.1 are contained in table 9.1.

Table 9.1. Annual precipitation and runoff for Cave Creek, near Lexington, Kentucky

Year    Precip. (inches)    Runoff (inches)

Two questions are of immediate concern. Can a model of the form

$$Y = \alpha + \beta X + \varepsilon \tag{9.1}$$

adequately represent the relationship between Y and X? For what values of α and β is the representation the best? Here ε is the difference between Y and α + βX. In looking at the question of the "best" straight line, a criterion for judging "bestness" is needed. One intuitive criterion would be to estimate α and β by a and b so as to minimize the deviation ei between the observed values of Y, Yi, and the predicted values of Y, Ŷi. In this way, values for a and b would be sought that minimize the sum

$$\sum e_i = \sum (Y_i - \hat{Y}_i) \tag{9.2}$$

Closer scrutiny of equation 9.2 reveals that it is not desirable to minimize the sum in an algebraic sense because that would be equivalent to finding an a and b such that Σ ei is −∞. Another criterion might be to find an a and b such that Σ ei is zero. The fallacy with this can be seen by considering two points. If the line Ŷ = a + bX goes through the two points, then Σ ei would be zero; however, the sum is also zero for any line that over-predicts one point by the same amount that it under-predicts the second point. Thus, there is an infinity of lines such that Σ ei = 0, and an additional restriction or criterion is needed to select a single line. The ei may be positive or negative. A criterion that is not sign dependent is needed. Such a criterion might be to minimize Σ |ei| or to minimize Σ ei². Since absolute values are difficult to work with mathematically, the second criterion is generally selected. Thus it is desired to estimate α and β by a and b such that Σ ei² is a minimum. Denoting this sum by M, we have

$$M = \sum e_i^2 = \sum (Y_i - a - bX_i)^2 \tag{9.3}$$

This sum can be minimized with respect to a and b by taking the partial derivatives of M with respect to a and b and setting the resulting equations equal to zero.

$$\frac{\partial M}{\partial a} = -2\sum (Y_i - a - bX_i) = 0, \qquad \frac{\partial M}{\partial b} = -2\sum X_i (Y_i - a - bX_i) = 0 \tag{9.4}$$

These equations can then be written in the following form, known as the normal equations.

$$\sum Y_i = na + b\sum X_i, \qquad \sum X_i Y_i = a\sum X_i + b\sum X_i^2 \tag{9.5}$$

The solution of the normal equations in terms of a and b is

$$b = \frac{\sum X_i Y_i - \sum X_i \sum Y_i / n}{\sum X_i^2 - \left(\sum X_i\right)^2 / n} = \frac{\sum x_i y_i}{\sum x_i^2} \tag{9.6}$$

$$a = \bar{Y} - b\bar{X} \tag{9.7}$$


Equations 9.6 and 9.7 provide estimates for α and β such that Σ ei² is a minimum. Because the procedure is based on minimizing the error sum of squares, Σ ei², the estimates a and b are commonly called least squares estimates. Equation 9.4 indicates that this solution also satisfies Σ ei = 0. Equation 9.7 indicates that the line Ŷ = a + bX goes through the point Y = Ȳ and X = X̄.

The line Ŷ = a + bX is commonly known as the regression line of Y on X. The procedure of determining a and b is known as simple regression. The term "simple" regression is used when only one independent variable is involved, as opposed to multiple regression when several independent variables are involved. The parameter estimates, a and b, are known as the regression coefficients.

Equations 9.6 and 9.7 show that a and b are functions of the sample values of Y and X. If another sample of observations were obtained and a and b were estimated from this sample, different estimates would result. We have already seen that

$$e_i = Y_i - \hat{Y}_i = Y_i - a - bX_i$$

Similarly

$$\varepsilon_i = Y_i - \alpha - \beta X_i$$

Thus, ei represents the deviation between an observed Yi and its predicted value Ŷi based on the regression equation estimated from the particular sample of data at hand. εi represents the deviation between an observed Yi and the assumed true but unknown relation between Y and X given by Y = α + βX.

Example 9.1. Determine the regression coefficients for the data plotted in figure 9.1.

Solution: The data required for solving equations 9.6 and 9.7 are contained in table 9.2. The equation used to calculate b would depend on the method of calculation. If a small desk calculator is used, the first of equations 9.6 might be employed. If an electronic calculator or computer is used, the latter of equations 9.6 might be employed. Generally, less roundoff error will result if the latter form of equation 9.6 is used. In practice, readily available software would be used.

Therefore Ŷ = −13.1951 + 0.6480X. This line is plotted in figure 9.1.
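As a rough illustration of equations 9.6 and 9.7, the sketch below computes a and b from deviations about the means. The function name `least_squares` is hypothetical, and the six precipitation and runoff pairs are invented for illustration only; they are not the Cave Creek observations of table 9.1.

```python
def least_squares(X, Y):
    """Return (a, b) for the line Y_hat = a + b*X using sums of deviations."""
    n = len(X)
    x_bar = sum(X) / n
    y_bar = sum(Y) / n
    sum_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(X, Y))
    sum_x2 = sum((xi - x_bar) ** 2 for xi in X)
    b = sum_xy / sum_x2        # slope: sum of x*y over sum of x squared
    a = y_bar - b * x_bar      # intercept: line passes through the two means
    return a, b

# Hypothetical data (inches), NOT the values of table 9.1.
X = [30.0, 35.0, 40.0, 45.0, 50.0, 55.0]
Y = [7.1, 8.9, 13.2, 15.8, 19.5, 21.9]

a, b = least_squares(X, Y)
residuals = [yi - (a + b * xi) for xi, yi in zip(X, Y)]
print(f"Y_hat = {a:.4f} + {b:.4f} X")
print(f"sum of residuals = {sum(residuals):.2e}")
```

The printed sum of residuals is zero apart from rounding, which is the property Σ ei = 0 that follows from equation 9.4.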


Table 9.2. Calculations on data of table 9.1

13.26, 3.31, 15.17, 15.50, 14.22, 21.20, 7.70, 17.64, 22.91, 18.89, 12.82, 11.58, 15.17, 10.40, 18.02, 16.25
Total      234.04
Average    14.63

Comment: The last two columns of table 9.2 contain Ŷi and Yi − Ŷi. Note that except for rounding errors, the mean of the Ŷi equals Ȳ, Σ (Yi − Ŷi) = Σ ei = 0, and ē = 0.

EVALUATING THE REGRESSION

The second question is now considered. Can the data be adequately described by the regression line? Naturally, the answer to this query depends on the definition of adequate. The question will not be answered here, but methods for assessing the adequacy of the model will be explored.

One approach that does not involve any assumptions is to determine how much of the variability in the dependent variable is explained by the regression. The variability in the dependent variable is quantified as a sum of squares. From figure 9.2 it can be seen that Yi can be expressed as

$$Y_i = \bar{Y} + (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i)$$

or

$$Y_i - \hat{Y}_i = (Y_i - \bar{Y}) - (\hat{Y}_i - \bar{Y})$$

Through algebraic manipulations, it can be shown that

$$\sum (Y_i - \hat{Y}_i)^2 = \sum (Y_i - \bar{Y})^2 - \sum (\hat{Y}_i - \bar{Y})^2$$

Fig. 9.2. Components of Y.

Rearranging terms results in

$$\sum (Y_i - \bar{Y})^2 = \sum (Y_i - \hat{Y}_i)^2 + \sum (\hat{Y}_i - \bar{Y})^2$$

However, Σ (Yi − Ȳ)² is equal to Σ Yi² − nȲ², so we have

$$\sum Y_i^2 = n\bar{Y}^2 + \sum (Y_i - \hat{Y}_i)^2 + \sum (\hat{Y}_i - \bar{Y})^2$$

The total sum of squares, Σ Yi², has been partitioned into three components. These three components are:

1. nȲ², the sum of squares due to the mean
2. Σ (Yi − Ŷi)² = Σ ei², the sum of squares of deviations from regression or the residual sum of squares
3. Σ (Ŷi − Ȳ)², the sum of squares due to regression

The sum of squares about the mean or the sum of squares corrected for the mean is

$$\sum (Y_i - \bar{Y})^2 = \sum y_i^2 = \sum Y_i^2 - n\bar{Y}^2 = \sum (Y_i - \hat{Y}_i)^2 + \sum (\hat{Y}_i - \bar{Y})^2 \tag{9.11}$$

which may be written

$$\sum y_i^2 = \sum e_i^2 + b\sum x_i y_i \tag{9.12}$$


Therefore, the total sum of squares corrected for the mean is made up of two components: the sum of squares of deviation from regression (also known as the error or residual sum of squares) and the sum of squares due to regression. The larger the sum of squares due to regression in comparison to the residual sum of squares, the more of the total sum of squares corrected for the mean is explained by the regression equation. The ratio of the sum of squares due to regression to the total sum of squares corrected for the mean can be used as a measure of the ability of the regression line to explain variations in the dependent variable. This ratio is commonly denoted by r² and may be written in a number of ways.

$$r^2 = \frac{\text{sum of squares for regression}}{\text{sum of squares corrected for mean}} = \frac{\sum (\hat{Y}_i - \bar{Y})^2}{\sum y_i^2} = \frac{b\sum x_i y_i}{\sum y_i^2} \tag{9.13}$$

r² is called the coefficient of determination. If the regression equation perfectly predicts every value of Yi, then ei would be zero for every i and Σ ei² would be zero. Under these conditions, equation 9.11 states that Σ yi² = Σ (Ŷi − Ȳ)², so that from equation 9.13 r² is seen to be one. On the other hand, if the regression equation explains none of the variation in Y, then Σ ei² will equal Σ yi² and Σ (Ŷi − Ȳ)² will be zero. Under this condition r² will be zero. Thus, the range in possible values for r² is from 0 to 1. The closer r² is to 1, the better the regression equation "fits" the data. r² is the fraction of the total sum of squares about the mean that is explained by the regression equation. From equations 9.6 and 9.13 we can write

$$r^2 = \frac{b^2 \sum x_i^2}{\sum y_i^2} = \frac{b^2 s_x^2}{s_y^2} \quad \text{or} \quad r = \frac{b\,s_x}{s_y} \tag{9.14}$$

Because 0 ≤ r² ≤ 1, we have −1 ≤ r ≤ 1. The sign on r is identical to the sign on b because sx and sy are always positive. From equation 9.14 it can be seen that r may also be written as

$$r = \frac{\sum x_i y_i}{\left(\sum x_i^2 \sum y_i^2\right)^{1/2}}$$

which would be equal to the sample correlation coefficient if X and Y were both random variables. In fact, r is commonly called the correlation coefficient and can be shown to be equal to the correlation between Y and Ŷ. Correlation is discussed in more detail in chapter 11.


Example 9.2. What percent of the variation in Y is accounted for by the regression of example 9.1?

Solution:

Thus, 66% of the variation in Y is explained by the regression equation. The remaining 34% of the variation is due to unexplained causes.

CONFIDENCE INTERVALS AND TESTS OF HYPOTHESES

Thus far in the discussion of simple regression no assumptions have been made concerning the model. In order to use some well-developed theorems concerning hypothesis testing and confidence interval estimation, it is necessary to make the assumption that the εi are identically and independently distributed as a normal distribution with a mean of zero and a variance of σ². (A shorthand way of writing this is εi is i.i.d. N(0, σ²).) For further discussion of the assumptions involved in regression analysis, see the closing section of this chapter, General Considerations. Also see Johnston (1963) and Graybill (1961).

This assumption contains many implications. The fact that E(εi) = 0 has been guaranteed by our estimation procedures. The assumption of independence means that the correlation between εi and εj for any i ≠ j must be zero. The assumption that the εi are identically distributed with variance σ² means that the variance of εi must equal the variance of εj for all i and j. That is, the variance of εi cannot change as Xi changes. This is known as homoscedasticity. Finally, we must have the εi normally distributed.

The assumption of normality of the εi can be checked by the procedures of chapter 8. A rough check would be to note that, for the normal distribution, 95% of the values of εi should be within 2 standard deviations of the mean, or only about 5% of the residuals should lie outside the interval −2σ to 2σ. For a further discussion of examining the εi, reference should be made to Draper and Smith (1966).

Under the normality assumption, we have E(ε) = 0. The Var(ε) is given by

$$\mathrm{Var}(\varepsilon) = E(\varepsilon^2) = \sigma^2$$

The positive square root of the Var(ε) is known as the standard error of the regression equation. An unbiased estimate (Graybill 1961) for Var(ε) is s² calculated from

$$s^2 = \frac{\sum e_i^2}{n - 2} \tag{9.17}$$

The least squares estimation procedure produces estimates for α and β such that the standard error of the regression equation is a minimum. Another way to look at the coefficient of determination is to write equation 9.13 as

$$r^2 = \frac{\sum y_i^2 - \sum e_i^2}{\sum y_i^2} = 1 - \frac{\sum e_i^2}{\sum y_i^2} = 1 - \frac{(n-2)\,s^2}{(n-1)\,s_y^2} \tag{9.18}$$

where the last form follows because Σ ei² = (n − 2)s² and Σ yi² = (n − 1)sy².


Fig. 9.3. Variability in linear regression.

Therefore, if the estimated standard error of the regression equation is nearly equal to the standard deviation of Y, r² will be close to zero and the regression equation is of little value in explaining variation in Y. Figure 9.3 depicts the relationships among the pdfs of X, Y, and e in a linear regression. What is of interest is the spread or variance in the pdf of e, s², in comparison to that of Y, sy². The smaller s² is in comparison to sy², the greater r² is and the stronger the linear relationship between Y and X. This is stated mathematically by equation 9.18.

Example 9.3. Is there reason to believe the residuals of example 9.1 are not normally distributed?

Solution:

95% of the ei should be between −2s and 2s, or between −5.94 and +5.94. An inspection of table 9.2 shows that none of the 16 observations are outside this interval. The number of observations is not sufficient to determine if the εi are N(0, σ²); however, there is not sufficient evidence to reject this possibility.
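The figures quoted in examples 9.2 and 9.3 can be cross-checked from summary values that appear elsewhere in this chapter (n = 16, b = 0.6480, Σ x² = 570.0559, s = 2.97). The sketch below is only a consistency check under an assumption: it recovers the residual sum of squares as (n − 2)s² and the regression sum of squares as b²Σ x², which is not necessarily how the original solution was laid out.

```python
n = 16             # number of observations
b = 0.6480         # slope from example 9.1
sum_x2 = 570.0559  # sum of squared deviations of X, quoted in example 9.5
s = 2.97           # standard error of the regression, quoted in example 9.5

ss_regression = b ** 2 * sum_x2          # sum of squares due to regression
ss_residual = (n - 2) * s ** 2           # residual sum of squares
ss_total = ss_regression + ss_residual   # sum of squares corrected for the mean

r2 = ss_regression / ss_total            # coefficient of determination
band = 2 * s                             # rough 95% band for the residuals

print(f"r^2 = {r2:.2f}")
print(f"residuals expected within +/- {band:.2f}")
```

Under these assumptions the ratio comes out near 0.66 and the band near 5.94, matching the values stated in the two examples.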


Inferences on Regression Coefficients

In order to place confidence intervals on α and β and to test hypotheses concerning them, it is necessary to know the Var(a) and Var(b), which will be designated as σa² and σb² and estimated by sa² and sb². sb² and sa² can be estimated from

$$s_b^2 = \frac{s^2}{\sum x_i^2} \tag{9.19}$$

and

$$s_a^2 = s^2\left[\frac{1}{n} + \frac{\bar{X}^2}{\sum x_i^2}\right] \tag{9.20}$$

where s² is estimated from equation 9.17. If the model is correct, then the quantities (b − β)/sb and (a − α)/sa are distributed as a t distribution with n − 2 degrees of freedom. Thus the confidence limits on α can be estimated from

$$l = a - t_{1-\alpha/2,\,n-2}\,s_a, \qquad u = a + t_{1-\alpha/2,\,n-2}\,s_a \tag{9.21}$$

where sa is estimated from equation 9.20. Similarly, the confidence limits on β are estimated from

$$l = b - t_{1-\alpha/2,\,n-2}\,s_b, \qquad u = b + t_{1-\alpha/2,\,n-2}\,s_b \tag{9.22}$$

where sb is estimated from equation 9.19.

Tests of hypotheses concerning α and β can be made by noting that (a − α0)/sa and (b − β0)/sb both have t distributions with n − 2 degrees of freedom. Thus the hypothesis H0: α = α0 versus Ha: α ≠ α0 is tested by computing

$$t = \frac{a - \alpha_0}{s_a} \tag{9.23}$$

H0 is rejected if |t| > t1−α/2,n−2. Similarly, H0: β = β0 versus Ha: β ≠ β0 is tested by computing

$$t = \frac{b - \beta_0}{s_b} \tag{9.24}$$

H0 is rejected if |t| > t1−α/2,n−2.
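The interval and test formulas above can be sketched numerically with the summary values quoted for the Cave Creek regression in this chapter (a = −13.1951, b = 0.6480, s = 2.97, n = 16, Σ x² = 570.0559, X̄ = 42.94, t0.975,14 = 2.145). The helper names `conf_limits` and `t_stat` are hypothetical, and the printed figures are this sketch's own arithmetic rather than values copied from the text.

```python
import math

n, a, b = 16, -13.1951, 0.6480
s, sum_x2, x_bar = 2.97, 570.0559, 42.94
t_crit = 2.145   # t(0.975, 14), as quoted in example 9.5

s_b = s / math.sqrt(sum_x2)                        # standard error of the slope
s_a = s * math.sqrt(1.0 / n + x_bar ** 2 / sum_x2) # standard error of the intercept

def conf_limits(estimate, std_err):
    """Two-sided confidence limits: estimate -/+ t * standard error."""
    return estimate - t_crit * std_err, estimate + t_crit * std_err

def t_stat(estimate, hypothesized, std_err):
    """Test statistic for H0: parameter = hypothesized value."""
    return (estimate - hypothesized) / std_err

print("limits on intercept:", conf_limits(a, s_a))
print("limits on slope:    ", conf_limits(b, s_b))
for label, t in [("intercept = 0", t_stat(a, 0.0, s_a)),
                 ("slope = 0.5", t_stat(b, 0.5, s_b)),
                 ("slope = 0", t_stat(b, 0.0, s_b))]:
    verdict = "reject" if abs(t) > t_crit else "cannot reject"
    print(f"H0: {label}: t = {t:.2f}, {verdict}")
```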


The significance of the overall regression equation can be evaluated by testing the hypothesis that β = 0. The H0: β = 0 is equivalent to H0: r = 0. If this hypothesis is accepted, then Ŷ may be estimated by Ȳ. Note that if r = 0, equation 9.18 shows that s² ≈ sy², or the regression line does not explain a significant amount of the variation in Y. In this situation one would be as well off using Ȳ as an estimator for Y regardless of the value of X.

Example 9.4. Compute the 95% confidence intervals on α and β and test the hypothesis that α = 0 and the hypothesis that β = 0.500 for the regression of example 9.1.

Solution:

$$s_a = s\left[\frac{1}{n} + \frac{\bar{X}^2}{\sum x_i^2}\right]^{1/2}$$

The 95% confidence intervals on α are a ± t0.975,14 sa.

The 95% confidence intervals on β are b ± t0.975,14 sb.

To test H0: α = 0 versus Ha: α ≠ 0, compute t = (a − 0)/sa. Because |t| > t0.975,14, we reject H0: α = 0.

To test H0: β = 0.5 versus Ha: β ≠ 0.5, compute t = (b − 0.5)/sb. Since |t| < t0.975,14, we cannot reject H0. The slope is not significantly different from 0.5.

Comment: The significance of the overall regression can be evaluated by testing H0: β = 0. Under this hypothesis t = b/sb.

Because |t| > t0.975,14, we reject H0. The regression equation explains a significant amount of the variation in Y.

Confidence Intervals on Regression Line

Confidence intervals on the regression line can be determined by first calculating the variance of Ŷk, where Ŷk represents the predicted mean value of Y for a given Xk.

$$\hat{Y}_k = a + bX_k$$

From equation 3.56

$$\mathrm{Var}(\hat{Y}_k) = \mathrm{Var}(a) + X_k^2\,\mathrm{Var}(b) + 2X_k\,\mathrm{Cov}(a, b)$$

Mood and Graybill (1963) give Cov(a, b) = −σ²X̄/Σ xi². Therefore

$$\mathrm{Var}(\hat{Y}_k) = \sigma^2\left[\frac{1}{n} + \frac{(X_k - \bar{X})^2}{\sum x_i^2}\right] \tag{9.25}$$

The standard error of Ŷk could be estimated by sŶk calculated as

$$s_{\hat{Y}_k} = s\left[\frac{1}{n} + \frac{(X_k - \bar{X})^2}{\sum x_i^2}\right]^{1/2} \tag{9.26}$$

Equation 9.25 indicates that the variance of Ŷk depends on the particular value of X at which the variance is being determined. The Var(Ŷk) is a minimum when Xk = X̄ and increases as Xk deviates from X̄.

Confidence limits on the regression line are now given by

$$l = \hat{Y}_k - t_{1-\alpha/2,\,n-2}\,s_{\hat{Y}_k}, \qquad u = \hat{Y}_k + t_{1-\alpha/2,\,n-2}\,s_{\hat{Y}_k} \tag{9.27}$$

where Ŷk = a + bXk and sŶk is given by equation 9.26. Because sŶk increases as xk, or Xk − X̄, increases, the confidence intervals are the narrowest at Xk = X̄ and widen as Xk deviates from X̄.

The confidence limits on an individual predicted value of Y would be wider than the confidence interval on the regression line since, for an individual Y, the Var(ε) or σ² would have to be added to the Var(Ŷk). Thus the variance of an individual predicted value of Y would be Var(Ŷk) + σ². Confidence intervals on an individual predicted value of Y could then be estimated from equations 9.27 where the expression

$$s\left[1 + \frac{1}{n} + \frac{(X_k - \bar{X})^2}{\sum x_i^2}\right]^{1/2}$$

would be substituted for sŶk. The confidence limits on a future predicted value of Y are the same as those for an individual predicted value of Y.

Example 9.5. Calculate the 95% confidence limits for the regression line of example 9.1. Calculate the 95% confidence interval for an individual predicted value of Y for the same problem.

Solution: s = 2.97, n = 16, Σ x² = 570.0559, t0.975,14 = 2.145, and X̄ = 42.94. Therefore, from equations 9.27 we have for the 95% confidence intervals on the regression line

$$l, u = (-13.1951 + 0.6480X_k) \mp 2.145(2.97)\left[\frac{1}{16} + \frac{(X_k - 42.94)^2}{570.0559}\right]^{1/2}$$

where the − applies to the lower limit, l, and the + to the upper limit, u. Similarly, the 95% confidence interval on an individual predicted value of Y is given by

$$l, u = (-13.1951 + 0.6480X_k) \mp 2.145(2.97)\left[1 + \frac{1}{16} + \frac{(X_k - 42.94)^2}{570.0559}\right]^{1/2}$$

By substituting various values of Xk into these equations, the desired confidence limits are obtained. These intervals are plotted in figure 9.1.
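The substitution of various Xk values described in example 9.5 can be sketched as a short loop. The function name `limits` and the chosen Xk values are illustrative assumptions. The constants are those listed in the solution.

```python
import math

n, a, b = 16, -13.1951, 0.6480
s, sum_x2, x_bar, t_crit = 2.97, 570.0559, 42.94, 2.145

def limits(xk, individual=False):
    """95% limits at X = xk for the regression line or for an individual Y."""
    y_hat = a + b * xk
    core = 1.0 / n + (xk - x_bar) ** 2 / sum_x2
    if individual:
        core += 1.0   # add the variance of an individual observation
    half_width = t_crit * s * math.sqrt(core)
    return y_hat - half_width, y_hat + half_width

for xk in (30.0, 35.0, x_bar, 50.0, 55.0):   # illustrative X values
    line_lo, line_hi = limits(xk)
    ind_lo, ind_hi = limits(xk, individual=True)
    print(f"X = {xk:5.2f}: line ({line_lo:6.2f}, {line_hi:6.2f})  "
          f"individual ({ind_lo:6.2f}, {ind_hi:6.2f})")
```

The printed intervals are narrowest at Xk = X̄ and widen toward either end of the range, and the limits for an individual value are wider than those for the line at every Xk.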

Confidence Intervals on Standard Error

Confidence intervals may be placed on σ² by noting that the quantity (n − 2)s²/σ² is distributed as a chi-square distribution with n − 2 degrees of freedom. Thus, confidence limits on σ² are given by

$$l = \frac{(n-2)\,s^2}{\chi^2_{1-\alpha/2,\,n-2}}, \qquad u = \frac{(n-2)\,s^2}{\chi^2_{\alpha/2,\,n-2}}$$

where s² is determined from equation 9.17.
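As a hedged illustration, the limits on σ² can be evaluated for the Cave Creek regression using s = 2.97 and n = 16 from example 9.5. The two chi-square quantiles for 14 degrees of freedom are not given in this chapter; the values below are taken from a standard chi-square table and are an assumption of this sketch.

```python
import math

n = 16
s = 2.97             # standard error of the regression (example 9.5)
chi2_upper = 26.119  # assumed table value: 0.975 quantile, 14 degrees of freedom
chi2_lower = 5.629   # assumed table value: 0.025 quantile, 14 degrees of freedom

ss_residual = (n - 2) * s ** 2         # (n - 2) * s^2
var_lower = ss_residual / chi2_upper   # lower 95% limit on sigma^2
var_upper = ss_residual / chi2_lower   # upper 95% limit on sigma^2

print(f"s^2 = {s ** 2:.2f}")
print(f"95% limits on sigma^2: ({var_lower:.2f}, {var_upper:.2f})")
print(f"95% limits on sigma:   ({math.sqrt(var_lower):.2f}, {math.sqrt(var_upper):.2f})")
```

The interval is not symmetric about s² because the chi-square distribution is skewed.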

EXTRAPOLATION

The extrapolation of a regression equation beyond the range of X used in estimating α and β is discouraged for two reasons. First, as can be seen from figure 9.1 and equation 9.27, the confidence intervals on the regression line become very wide as the distance from X̄ is increased. Second, the relation between Y and X may be nonlinear over the entire range of X and only approximately linear for the range of X investigated. A typical example of this is shown in figure 9.4.

GENERAL CONSIDERATIONS Many authors discuss several different linear models depending on the assumptions made concerning Y, X, and E (Graybill 1961; Benjamin and Cornell 1970; Mood and Graybill 1963). These different models revolve around whether X (or X in multiple regression) is a random or nonrandom variable, whether measurement errors are made on Y and/or X, the distribution of X if X is a random variable, and the joint distribution of Y and X if X is a random variable.

Fig. 9.4. Effect on nonlinearity and extrapolation. (Curve label in figure: True relation.)

The most common assumptions are:

1. X is a nonrandom variable measured without error, Y is a random variable, and (Yi | X) is normally and independently distributed with mean α + βX and variance σ².
2. Y and X are both random variables having a joint distribution, the conditional distribution of Y is N(α + βX, σ²), and the marginal distribution of X is independent of α, β, and σ².

It turns out that under either of the above conditions, the procedures given in this chapter are valid for tests of hypotheses and confidence interval estimation at a specified level of significance. Graybill (1961) points out that the power of the tests is not the same for the two conditions.

If X is a fixed variable measured without error and εi is independently and identically distributed N(0, σ²); or Y and X are from a bivariate normal distribution and are measured without error; or Y and X are from a bivariate non-normal population with the conditional distribution of Y being N(α + βX, σ²) and the marginal distribution of X independent of α, β, and σ²; then the least squares estimates of α, β, and σ² are also maximum likelihood estimators. The least squares estimates for the regression coefficients are unbiased.

If significant measurement errors are made on the X variables, then complications arise. For this situation reference can be made to Graybill (1961) or Johnston (1963). Certainly, measurement errors are always present; however, if these errors are small relative to X, then the theory presented in this chapter and chapters 10, 11, and 12 may still be applied. The reason that measurement errors on X cause problems can be seen by considering the model Y = α + βX + ε. If Y and X contain measurement errors, then Y and X are not observed. What is observed is Y* and X*:

$$Y^* = Y + e_y, \qquad X^* = X + e_x$$

where ey and ex are the measurement errors on Y and X. Thus, the normal equations are solved in terms of Y* = α + βX* + ε, or Y + ey = α + β(X + ex) + ε = α + βX + βex + ε. Now if ex is small in comparison to X, this latter equation becomes Y = α + βX + ε − ey, or Y = α + βX + ε′, which can be handled by the methods outlined in this chapter. Recall that no distributional assumptions are required to get the least squares estimates for α and β. The assumptions are involved when confidence intervals and tests of hypotheses are of concern, or when it is desired to state that the least squares estimates for α and β are also maximum likelihood estimates. Johnston (1963) points out that the least squares estimates for α and β are biased if significant measurement errors are present on X.

One of the assumptions used in developing confidence intervals and tests of hypotheses was that the εi are independent. If εi is correlated with εi+1, the least squares estimates of α and β are unbiased; however, the sampling variance of a and b will be unduly large and will be underestimated by the least squares formulas for variances, rendering the level of significance of tests of hypotheses unknown. Also, the sampling variances on predictions made with the resulting equation will be needlessly large. Correlation between εi and εi+1 frequently arises when time series data are being analyzed. This type of correlation is known as autocorrelation or serial correlation.


Fig. 9.5. Illustration of situation where Var(εi) ≠ σ² for all i.

Johnston (1963) discusses least squares estimation procedures in the presence of autocorrelation. Autocorrelation of errors is discussed in more detail in the next chapter of this book.

In some situations the assumption of homoscedasticity [Var(εi) = σ² for all i] is violated. Quite commonly, Var(εi) increases as X increases. Such a situation is depicted in figure 9.5. Draper and Smith (1966) and Johnston (1963) discuss least squares estimation under this condition.

Another point to be made concerning hypothesis testing in general is that a statistically significant difference and a physically significant difference are two entirely different quantities. For example, when the H0: β = 0 was tested in example 9.4, the conclusion was that the regression line explained a significant amount of the variation in Y. This refers to a statistically significant amount of the variation at the chosen level of significance. It means that, recognizing an α% chance of an error, the relationship Ŷ = a + bX cannot be attributed to chance. It does not imply a cause and effect relationship between Y and X.

Looking at the confidence limits on the regression as plotted in figure 9.1 and the scatter of the data, it can be seen that this simple relationship Ŷ = a + bX leaves a lot to be desired in terms of predicting annual runoff. Whether or not the derived relationship is usable depends on the use to be made of the predicted values of Y and not on the fact that the H0: β = 0 is rejected. It may be that the standard error of the equation, s, is so large as to render the estimate made with the equation in some particular application too uncertain to be used, even though the equation is explaining a statistically significant portion of the variability in the dependent variable.

Exercises

9.1. The following data are the maximum air and soil temperatures (bare soil at 2-inch depth) recorded for the first 30 days of July 1973, at Lexington, Kentucky.
Derive a linear relationship via simple regression for predicting the maximum soil temperature from the maximum air temperature. Estimate $\alpha$ and $\beta$ for the resulting regression. Test the hypothesis that (a) the intercept is 0, (b) the slope is 1, (c) the regression explains a significant amount of the variation in the maximum soil temperature. Would you recommend using this relationship for predicting maximum soil temperature?

[Table: Max Temp, three pairs of columns headed Air and Soil]

9.2. The asterisks following the soil data in exercise 9.1 indicate days on which rainfall occurred. Using only these rainfall days, work exercise 9.1.

9.3. Calculate the regression coefficients in the relationship $Q_s = a + bQ$ where $Q_s$ is the annual suspended sediment load and Q is the annual water discharge for the Green River at Munfordville, Kentucky. Calculate the standard error of the regression equation and the correlation coefficient. Plot the data along with the 95% confidence intervals on the regression line. Is this a usable prediction equation?

9.4. Show that the correlation coefficient in simple regression is equivalent to the correlation between Y and $\hat{Y}$.

9.5. Calculate the regression equation for the data of table 9.1 considering the runoff as the independent variable and the precipitation as the dependent variable. Rearrange the resulting equation to be in the form of the prediction equation of example 9.1. Does the resulting regression equation agree with the regression equation in example 9.1? Should it agree? Why? Which equation should be used?

9.6. A technique used by hydrologists to detect changes in the hydrologic response of a watershed is to examine mass curves for changes in slope. A mass curve is a plot of the accumulation over time of one variable versus the accumulation over time of a second variable. The data below are the annual runoff and precipitation for Thorne Creek experimental watershed in Pulaski County, Virginia. It is thought that there was a change in the hydrologic characteristics of this watershed during the 11-year period of study.
Plot the accumulated precipitation as the abscissa and the accumulated runoff as the ordinate. Does there appear to be a change in the rainfall-runoff relationship? During what year? Calculate the slope of the regression lines describing the data both before and after the apparent change. Test the hypothesis that these slopes are not significantly different.

[Table: two sets of columns headed Year, Precipitation, Runoff]

9.7. Occasionally it is desirable to restrict the intercept of a simple regression to 0, thus requiring the regression line to pass through the origin. Derive the normal equation for the slope in this case. Use the resulting equation to calculate the slope of the line describing the data plotted for exercise 9.6. Neglect the apparent change in the slope for this problem (i.e., use all of the data to estimate b in the equation accumulated runoff = b [accumulated precipitation]).

9.8. Hydrologists frequently use watershed physical characteristics as an aid in studying watershed hydrology. The data below are the area (square miles) and length (miles) of several Colorado mountain watersheds (Julian et al. 1967). Derive a linear regression equation for predicting the area of similar watersheds as a function of the watershed length. Plot the data and the derived regression line. Plot the 95% confidence intervals on the regression line.

[Table: two sets of columns headed Area, Length]

10. Multiple Linear Regression

NOTATION

THE NOTATION set forth in chapter 9 will be followed in this chapter unless otherwise noted. Additionally, vectors and matrices will be denoted by underlined letters such as $\underline{Y}$, $\underline{X}$, or $\underline{\beta}$. The inverse of a matrix $\underline{X}$ will be denoted by $\underline{X}^{-1}$. The transpose of $\underline{X}$ will be denoted by $\underline{X}'$. The number of rows and columns in a matrix will be shown as $\underset{n \times p}{\underline{X}}$ if $\underline{X}$ has n rows and p columns. Thus $\underset{n \times 1}{\underline{Y}}$ represents a column vector with n elements. The element of $\underline{X}$ corresponding to the ith row and the jth column will be denoted by $X_{i,j}$. The expression $\underline{X} = [X_{i,j}]$ indicates that $\underline{X}$ is made up of elements $X_{i,j}$. A matrix made up of elements which are deviations from a mean will be denoted by a lower case, underlined letter $\underline{y}$. The i,jth element of $\underline{y}$ will be given as $y_{i,j}$. The ith element of a vector $\underset{n \times 1}{\underline{Y}}$ will be given by $Y_i$.

The concepts of chapter 9 must be understood before proceeding to this chapter. Calculations would normally be done on a computer for problems dealing with multiple regression. Standard programs are available, so the emphasis in this chapter is not on computing but on the principles involved in multiple regression.

GENERAL LINEAR MODEL

Quite often, a dependent variable may be expressed as a linear combination of several other quantities. For example, the peak rate of runoff from watersheds in a given region may be related to the watershed area, slope of the mainstream, rainfall, and so on. A linear regression model for predicting peak runoff would then contain all of these variables. This is an extension of the linear model discussed in chapter 9 to include several independent variables. A general linear model is of the form

$$Y = \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon \qquad (10.1)$$

where Y is a dependent variable, $X_1, X_2, \ldots, X_p$ are independent variables, $\beta_1, \beta_2, \ldots, \beta_p$ are unknown parameters, and $\varepsilon$ is an error component. This model is linear in the parameters, $\beta_j$.

is also linear in the parameters, $\beta_j$, whereas the models

and

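are not linear in the parameters.

In practice, n observations would be available on Y with the corresponding n observations on each of the p independent variables. Thus, n equations like equation 10.1 can be written, one for each observation. The p unknown parameters are estimated from the n equations. Thus, n must be equal to or greater than p. In practice, n should be at least 3 or 4 times as large as p. The n equations are

$$\begin{aligned}
Y_1 &= \beta_1 X_{1,1} + \beta_2 X_{1,2} + \cdots + \beta_p X_{1,p} + \varepsilon_1 \\
Y_2 &= \beta_1 X_{2,1} + \beta_2 X_{2,2} + \cdots + \beta_p X_{2,p} + \varepsilon_2 \\
&\;\;\vdots \\
Y_n &= \beta_1 X_{n,1} + \beta_2 X_{n,2} + \cdots + \beta_p X_{n,p} + \varepsilon_n
\end{aligned} \qquad (10.2)$$

where $Y_i$ is the ith observation on Y and $X_{i,j}$ is the ith observation on the jth independent variable. Equations 10.2 can be written

$$Y_i = \sum_{j=1}^{p} \beta_j X_{i,j} + \varepsilon_i$$

for i = 1 to n. In matrix notation the equations become

$$\underline{Y} = \underline{X}\,\underline{\beta} + \underline{\varepsilon} \qquad (10.4)$$

where $\underline{Y}$ is an n × 1 vector of observations, $\underline{X}$ is an n × p matrix made up of n observations on each of p independent variables, and $\underline{\beta}$ is a p × 1 vector of unknown parameters. If the matrices in equation 10.4 are written out, we get

$$\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} =
\begin{bmatrix} X_{1,1} & X_{1,2} & \cdots & X_{1,p} \\ X_{2,1} & X_{2,2} & \cdots & X_{2,p} \\ \vdots & \vdots & & \vdots \\ X_{n,1} & X_{n,2} & \cdots & X_{n,p} \end{bmatrix}
\begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix} +
\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix} \qquad (10.5)$$

When the model is written in the form of equation 10.5, it is easy to see that $\underline{Y}$ is an n × 1 vector of observations on the dependent variable, $\underline{X}$ is an n × p matrix made up of n observations on each of p independent variables, and $\underline{\beta}$ is a p × 1 vector of unknown parameters. For equation 10.4 to have an intercept term, it is necessary that $X_{i,1} = 1$ for all i. $\beta_1$ is then the intercept. In the following development, it is assumed that $X_{i,1} = 1$ for i = 1 to n. The model discussed in chapter 9,

$$Y_i = \alpha + \beta X_i + \varepsilon_i$$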

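is a special case of equation 10.5 with $X_{i,1} = 1$, $X_{i,2} = X_i$, $\beta_1 = \alpha$ and $\beta_2 = \beta$.

Following the pattern of chapter 9, the unknown parameters, $\underline{\beta}$, can be estimated by minimizing $\sum e_i^2$ where

$$\sum e_i^2 = \sum (Y_i - \hat{Y}_i)^2 = \sum \Big( Y_i - \sum_{j=1}^{p} \hat\beta_j X_{i,j} \Big)^2$$

In matrix notation

$$\sum e_i^2 = \underline{e}'\underline{e} = (\underline{Y} - \underline{X}\,\underline{\hat\beta})'(\underline{Y} - \underline{X}\,\underline{\hat\beta})$$

Differentiating this expression with respect to $\underline{\hat\beta}$ and setting the partial derivative equal to zero results in

$$\underline{X}'\underline{X}\,\underline{\hat\beta} = \underline{X}'\underline{Y} \qquad (10.7)$$

which represents the normal equations. The solution of equation 10.7 is obtained by premultiplying by $(\underline{X}'\underline{X})^{-1}$,

$$(\underline{X}'\underline{X})^{-1}\underline{X}'\underline{X}\,\underline{\hat\beta} = (\underline{X}'\underline{X})^{-1}\underline{X}'\underline{Y}$$

or we have the result that $\underline{\beta}$ can be estimated by

$$\underline{\hat\beta} = (\underline{X}'\underline{X})^{-1}\underline{X}'\underline{Y} \qquad (10.8)$$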
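The estimation of the parameters from the normal equations can be illustrated with a short computational sketch. The data below are hypothetical (generated from Y = 1 + 2X2 + 3X3 with no error term), and the helper names `transpose`, `matmul`, `solve`, and `fit` are illustrative only. In practice a standard regression program would normally be used.

```python
def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def solve(a, rhs):
    """Solve a x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [r] for row, r in zip(a, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular (rank less than p)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit(X, Y):
    """Form X'X and X'Y and solve the normal equations for the estimates."""
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    XtY = [sum(x * y for x, y in zip(row, Y)) for row in Xt]
    return solve(XtX, XtY)


# Hypothetical observations; the first column of 1's produces the intercept.
X = [[1.0, 0.0, 1.0],
     [1.0, 1.0, 0.0],
     [1.0, 2.0, 2.0],
     [1.0, 3.0, 1.0],
     [1.0, 4.0, 3.0]]
Y = [1.0 + 2.0 * row[1] + 3.0 * row[2] for row in X]

beta_hat = fit(X, Y)
print([round(b, 6) for b in beta_hat])
```

The sketch solves the normal equations directly rather than forming the inverse explicitly; both routes give the same estimates when X'X has rank p.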

The $\underline{X}'\underline{X}$ matrix plays an important role in estimating $\underline{\beta}$ and in the variance of the $\hat\beta_i$'s. The $\underline{X}'\underline{X}$ matrix is made up of the sum of squares and cross products of the independent variables. For the p × p matrix $\underline{X}'\underline{X}$ to be inverted, its rank must be p. That is, no row or column can be a linear function of any combination of the other rows and columns. If this occurs, it is known as multicolinearity.

If we define $z_{i,j}$ to be $(X_{i,j} - \bar{X}_j)/s_j$ and let $\underline{z} = [z_{i,j}]$, then $\underline{z}'\underline{z}/(n - 1)$ is a p × p correlation matrix, $\underline{R} = [R_{i,j}]$, where $R_{i,j}$ is the correlation coefficient between the ith and jth independent variables. By definition, $R_{i,j} = 1$ for i = j. If $|R_{i,j}| = 1$ for some i ≠ j, then the ith independent variable is a linear function of the jth independent variable and the rank of $\underline{X}'\underline{X}$ will be less than p. This means that an independent variable cannot be a (perfect) linear function of any other independent variable. Furthermore, for the rank of $\underline{X}'\underline{X}$ to be p, an independent variable cannot be linearly dependent on any linear function of the remaining independent variables. For example, if p is 4 and $X_2 = aX_1 + bX_3 + c$, then $X_2$ is a linear function of $X_1$ and $X_3$ so that the rank of $\underline{X}'\underline{X}$ would be at most 3. If there is near linear dependence in $\underline{X}$, the calculation of $(\underline{X}'\underline{X})^{-1}$ may involve roundoff errors and loss of significance leading to nonsensical estimates for $\underline{\beta}$ (Draper and Smith 1966).

As in the case of simple regression, the total sum of squares can be partitioned into three parts. Draper and Smith (1966) demonstrate that equation 9.10 can be written in matrix notation as

$$\underline{Y}'\underline{Y} = n\bar{Y}^2 + (\underline{\hat\beta}'\underline{X}'\underline{Y} - n\bar{Y}^2) + (\underline{Y}'\underline{Y} - \underline{\hat\beta}'\underline{X}'\underline{Y})$$

so that the three components of the total sum of squares, $\underline{Y}'\underline{Y}$ or $\sum Y_i^2$, are:

1. $n\bar{Y}^2$, the sum of squares due to the mean
2. $\underline{Y}'\underline{Y} - \underline{\hat\beta}'\underline{X}'\underline{Y} = (\underline{Y} - \underline{X}\,\underline{\hat\beta})'(\underline{Y} - \underline{X}\,\underline{\hat\beta}) = \underline{e}'\underline{e} = \sum e_i^2 = \sum (Y_i - \hat{Y}_i)^2$, the sum of squares of deviations from regression or the residual sum of squares
3. $\underline{\hat\beta}'\underline{X}'\underline{Y} - n\bar{Y}^2 = \sum (\hat{Y}_i - \bar{Y})^2$, the sum of squares due to regression

A multiple coefficient of determination, $R^2$, can now be defined from equation 9.13 as

$$R^2 = \frac{\text{Sum of squares due to regression}}{\text{Sum of squares corrected for the mean}} = \frac{\underline{\hat\beta}'\underline{X}'\underline{Y} - n\bar{Y}^2}{\underline{Y}'\underline{Y} - n\bar{Y}^2} \qquad (10.10)$$

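Table 10.1. ANOVA for multiple regression

Source        Degrees of freedom    Sum of squares                                                      Mean square
Mean          1                     $n\bar{Y}^2$
Regression    p - 1                 $\underline{\hat\beta}'\underline{X}'\underline{Y} - n\bar{Y}^2$          $(\underline{\hat\beta}'\underline{X}'\underline{Y} - n\bar{Y}^2)/(p - 1)$
Residual      n - p                 $\underline{Y}'\underline{Y} - \underline{\hat\beta}'\underline{X}'\underline{Y}$    $s^2$
Total         n                     $\underline{Y}'\underline{Y}$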

As in the case of $r^2$, the range of $R^2$ is from 0 to 1. The multiple correlation coefficient is defined as the positive square root of $R^2$. Again, $R^2$ is the fraction of the total sum of squares corrected for the mean that is explained by the regression equation $\underline{\hat{Y}} = \underline{X}\,\underline{\hat\beta}$. Quite frequently the partitioning of the sum of squares is shown as in table 10.1 in the form of an analysis of variance (ANOVA) table. A mean square in the ANOVA is simply a sum of squares divided by its degrees of freedom.

Continuing the analogy with simple regression, define $\underline{e}$ as $\underline{Y} - \underline{X}\,\underline{\hat\beta}$. The estimation procedure guarantees that $E(\underline{e}) = 0$. An unbiased estimate for the $\mathrm{Var}(\varepsilon_i)$ or $\sigma^2$ is $s^2$ where

$$s^2 = \frac{\underline{e}'\underline{e}}{n - p} = \frac{(\underline{Y} - \underline{X}\,\underline{\hat\beta})'(\underline{Y} - \underline{X}\,\underline{\hat\beta})}{n - p} = \frac{\underline{Y}'\underline{Y} - \underline{\hat\beta}'\underline{X}'\underline{Y}}{n - p} \qquad (10.11)$$

The standard error of the regression equation $\sigma$ is estimated by s. An expression for $R^2$ that is analogous to equation 9.18 is

$$R^2 = 1 - \frac{(n - p)\,s^2}{(n - 1)\,s_Y^2}$$

where $s_Y$ is the standard deviation of Y. Again, this shows that if the regression equation is explaining a large part of the variation in Y, the standard error of the equation will be significantly less than the standard deviation of Y.

Example 10.1. Benson (1962) studied flood frequencies on many streams in the northeastern United States. The following table contains a partial listing of some of Benson's data. Using these data:

(a) Estimate the regression coefficients for the model

$$Q = \beta_1 + \beta_2 A + \beta_3 I$$

where Q is the mean annual flood in thousands of cfs, A is the watershed area in thousands of square miles, and I is the average annual maximum 24-hour rainfall depth in inches.

(b) Calculate $R^2$.

(c) Calculate $\hat{Q}_i$ for each observation on the independent variables.

(d) Calculate $e_i$ for each $Q_i$.

[Table: columns headed Station No., Q, A, I, $\hat{Q}$, e]

Solution: To maintain consistency in notation, let $Y_i = Q_i$, $X_{i,1} = 1$, $X_{i,2} = A_i$, $X_{i,3} = I_i$. For this problem n = 14 and p = 3. The column of data under Q is the 14 × 1 vector $\underline{Y}$, a column of 1's along with the data under A and I is the 14 × 3 matrix $\underline{X}$, and the 3 × 1 vector $\underline{\hat\beta}$ is made up of $\hat\beta_1$, $\hat\beta_2$, and $\hat\beta_3$. From equation 10.8, we have

$$\underline{\hat\beta} = (\underline{X}'\underline{X})^{-1}\underline{X}'\underline{Y}$$

$(\underline{X}'\underline{X})^{-1}$ is found to be

$$\begin{bmatrix} 3.71678 & -0.18094 & -1.37537 \\ -0.18094 & 0.02028 & 0.06124 \\ -1.37537 & 0.06124 & 0.52329 \end{bmatrix}$$

The parameter estimates are $\hat\beta_1 = 1.6570$, $\hat\beta_2 = 13.1510$, and $\hat\beta_3 = 0.0112$. From equation 10.10, we get

$$R^2 = \frac{\underline{\hat\beta}'\underline{X}'\underline{Y} - n\bar{Y}^2}{\underline{Y}'\underline{Y} - n\bar{Y}^2} = 0.99$$

This means that 99% of the variation in Y is explained by the regression equation

$$\hat{Q} = 1.6570 + 13.1510A + 0.0112I$$

Values for $\hat{Q}$ contained in the above table were calculated from this relationship. Values for $e_i$ were computed from

$$e_i = Q_i - \hat{Q}_i$$

and are also contained in the above tabulation. The ANOVA table for this example would be

Source        d.f.    Sum of squares    Mean square
Mean          1        6,606.381
Regression    2       13,182.600        6,591.300
Residual      11         171.090           15.554
Total         14      19,960.071
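The arithmetic behind the ANOVA table of example 10.1 can be checked with a few lines of code. This is only a sketch: the sums of squares are those tabulated for the example, and the variable names are illustrative.

```python
import math

n, p = 14, 3                  # observations and parameters in example 10.1
ss_mean = 6606.381            # sum of squares due to the mean
ss_regression = 13182.600     # sum of squares due to regression
ss_residual = 171.090         # residual sum of squares

ss_total = ss_mean + ss_regression + ss_residual
ms_regression = ss_regression / (p - 1)   # mean square due to regression
ms_residual = ss_residual / (n - p)       # residual mean square, s^2
r_squared = ss_regression / (ss_total - ss_mean)
s = math.sqrt(ms_residual)                # standard error of the equation

print(round(ss_total, 3), round(ms_regression, 3), round(ms_residual, 3))
print(round(r_squared, 2), round(s, 2))
```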

From the ANOVA table $R^2$ = 13,182.600/(19,960.066 - 6,606.381) = 0.99 and $s^2$ = 15.554 or the estimated standard error of the regression equation is s = 3.94.

Comment: The purpose of this example is to demonstrate the meaning of the various matrices and to provide practice in their calculation. Hydrologic significance should not be attached to the high $R^2$ since the watersheds are all close to one another (Maine) and the units on Q are cfs and the watershed area is contained in the equation. Many of the gaging stations are located at various points along the same stream. The number of significant figures that are carried in the calculations should be as large as practical. In reporting the results, the number of significant figures should be reduced. Thus, the reported results on the above regression might be

If a large number of significant figures are not carried in computing the $(\underline{X}'\underline{X})^{-1}$ matrix, significant errors can result. To demonstrate this, the elements of the $\underline{X}'\underline{X}$ and $\underline{X}'\underline{Y}$ matrices were rounded to two decimal places resulting in estimates for $\underline{\beta}$ of $\hat\beta_1 = 1.10$, $\hat\beta_2 = 12.24$, and $\hat\beta_3 = 5.28$. Computational problems of this type are rarely a problem when using well-established computer routines unless there is near colinearity in the $\underline{X}$ matrix.


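CONFIDENCE INTERVALS AND TESTS OF HYPOTHESES

As was the case in simple regression, in order to use some well-developed theorems on confidence intervals and tests of hypotheses in multiple regression, some assumptions must be made. All of the comments of chapter 9 regarding the assumptions in simple regression remain valid in multiple regression. The assumption will now be made that the $\varepsilon_i$ are identically, independently, and normally distributed with mean 0 and variance $\sigma^2$. That is, the $\varepsilon_i$ are iid N(0, $\sigma^2$) (see General Comments section of chapter 9).

Confidence Intervals on Standard Error

Confidence intervals can be placed on $\sigma^2$ by noting that the quantity $(n - p)s^2/\sigma^2$ has a chi-square distribution with n - p degrees of freedom. Thus, the lower and upper confidence limits on $\sigma^2$ are

$$\frac{(n - p)s^2}{\chi^2_{1-\alpha/2,\,n-p}} \quad \text{and} \quad \frac{(n - p)s^2}{\chi^2_{\alpha/2,\,n-p}} \qquad (10.13)$$

Inferences on the Regression Coefficients

To make inferences concerning $\underline{\beta}$, the variance of $\underline{\hat\beta}$ must be estimated. The variance-covariance matrix of $\underline{\hat\beta}$ is given by

$$\mathrm{Var}(\underline{\hat\beta}) = E[(\underline{\hat\beta} - \underline{\beta})(\underline{\hat\beta} - \underline{\beta})']$$

which can be shown to be

$$\mathrm{Var}(\underline{\hat\beta}) = \sigma^2(\underline{X}'\underline{X})^{-1}$$

The variance of $\hat\beta_i$ is equal to the covariance of $\hat\beta_i$ with itself and is therefore $\sigma^2$ times the ith diagonal element of $(\underline{X}'\underline{X})^{-1}$. The covariance of $\hat\beta_i$ with $\hat\beta_j$ is $\sigma^2$ times the i,jth element of $(\underline{X}'\underline{X})^{-1}$. If we let $\underline{C} = \underline{X}'\underline{X}$ then $\underline{C}^{-1} = (\underline{X}'\underline{X})^{-1}$ and

$$\mathrm{Var}(\hat\beta_i) = c^{-1}_{ii}\,\sigma^2$$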

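where $c^{-1}_{ii}$ is the ith diagonal element of $(\underline{X}'\underline{X})^{-1}$. If the model is correct, then the quantity $(\hat\beta_i - \beta_i)/s_{\hat\beta_i}$ is distributed as a t distribution with n - p degrees of freedom where $s_{\hat\beta_i}$ is an estimate for $\sigma_{\hat\beta_i}$ and is calculated as the positive square root of $c^{-1}_{ii}s^2$. Confidence intervals on $\beta_i$ are given by

$$\hat\beta_i \pm t_{1-\alpha/2,\,n-p}\; s_{\hat\beta_i} \qquad (10.15)$$

A test of the hypothesis $\beta_i = \beta_0$ where $\beta_0$ is a known constant can be made by noting that $(\hat\beta_i - \beta_0)/s_{\hat\beta_i}$ has a t distribution. Thus, to test $H_0$: $\beta_i = \beta_0$ versus $H_a$: $\beta_i \neq \beta_0$, the test statistic

$$t = \frac{\hat\beta_i - \beta_0}{s_{\hat\beta_i}} \qquad (10.17)$$

is computed. $H_0$ is rejected if $|t| > t_{1-\alpha/2,\,n-p}$.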
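As a numerical sketch of the standard errors, confidence limits, and t statistics just described, the following uses values reported in example 10.1 ($s^2$ = 15.554, the diagonal elements of the inverse matrix, and the parameter estimates) together with the tabled t value of 2.201 for 11 degrees of freedom. The function names are illustrative only.

```python
import math

s2 = 15.554                              # residual mean square, example 10.1
c_inv_diag = {2: 0.02028, 3: 0.52329}    # diagonal elements of (X'X)^-1
beta_hat = {2: 13.1510, 3: 0.0112}       # parameter estimates
t_crit = 2.201                           # tabled t for 0.975 and 11 d.f.


def std_error(i):
    """Estimated standard error of the i-th coefficient."""
    return math.sqrt(c_inv_diag[i] * s2)


def t_stat(i, beta_0=0.0):
    """Test statistic for H0: beta_i = beta_0."""
    return (beta_hat[i] - beta_0) / std_error(i)


for i in (2, 3):
    half_width = t_crit * std_error(i)
    limits = (beta_hat[i] - half_width, beta_hat[i] + half_width)
    reject = abs(t_stat(i)) > t_crit
    print(i, round(std_error(i), 3), round(t_stat(i), 3), reject,
          tuple(round(v, 2) for v in limits))
```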

Because in general $\hat\beta_i$ is not independent of $\hat\beta_j$ (their covariance is given by $c^{-1}_{ij}\sigma^2$), repeated applications of equation 10.17 to test $H_0$: $\beta_i = \beta_{0i}$ and $H_0$: $\beta_j = \beta_{0j}$ are not independent tests.

A test of $H_0$: $\beta_i = 0$ versus $H_a$: $\beta_i \neq 0$ is equivalent to testing the hypothesis that the ith independent variable is not contributing significantly to explaining the variation in the dependent variable. If $H_0$: $\beta_i = 0$ is not rejected, it is often advisable to delete the ith independent variable from the model and recalculate the regression.

A test of the hypothesis that the entire regression equation is not explaining a significant amount of the variation in Y is equivalent to $H_0$: $\beta_2 = \beta_3 = \cdots = \beta_p = 0$ versus $H_a$: at least one of these $\beta$'s is not zero. Since $\hat\beta_i$ is not independent of $\hat\beta_j$, repeated application of equation 10.17 is not a valid way to test this hypothesis. Use can be made of the fact that the ratio of the mean square due to regression to the residual mean square has an F distribution with p - 1 and n - p degrees of freedom. To test $H_0$: $\beta_2 = \beta_3 = \cdots = \beta_p = 0$, calculate the test statistic

$$F = \frac{(\underline{\hat\beta}'\underline{X}'\underline{Y} - n\bar{Y}^2)/(p - 1)}{(\underline{Y}'\underline{Y} - \underline{\hat\beta}'\underline{X}'\underline{Y})/(n - p)} \qquad (10.18)$$

and reject $H_0$ if F exceeds $F_{1-\alpha,\,p-1,\,n-p}$.

A test of the hypothesis that k of the independent variables are not contributing significantly to explaining the linear variation in the dependent variable can be made by rearranging the model so that the last k variables are the ones to be tested. The hypothesis is that the last k independent variables are not contributing significantly to explaining the linear variation in Y. In practice, the model does not have to be so arranged. The order of the X's makes no difference. The assumption here is that the last k variables are under test. This makes the notation easier. This is equivalent to $H_0$: $\beta_{p-k+1} = \beta_{p-k+2} = \cdots = \beta_p = 0$ versus $H_a$: at least one of these $\beta$'s is not zero. To test $H_0$, denote the full model as the model containing all p of the independent variables. Denote as the reduced model the model obtained by deleting the last k independent variables. The reduced model contains p - k independent variables. Now let

$Q_2$ = sum of squares due to regression on the full model with p - 1 degrees of freedom
$Q_1$ = residual sum of squares on the full model with n - p degrees of freedom
$Q_2^*$ = sum of squares due to regression on the reduced model with p - k - 1 degrees of freedom

The quantity

$$F = \frac{(Q_2 - Q_2^*)/k}{Q_1/(n - p)} \qquad (10.19)$$

will have an F distribution with k and n - p degrees of freedom. $H_0$ is rejected if F exceeds $F_{1-\alpha,\,k,\,n-p}$.
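The two F statistics can be sketched in code as follows. The sums of squares are those reported for examples 10.1 and 10.2 of this chapter (full model with A and I, reduced model with A only), the critical values 3.98 and 4.84 are the tabled values quoted in example 10.2, and the function names are illustrative.

```python
def overall_f(ss_regression, ss_residual, n, p):
    """Ratio of the regression mean square to the residual mean square."""
    return (ss_regression / (p - 1)) / (ss_residual / (n - p))


def partial_f(q2_full, q2_reduced, q1_residual, k, n, p):
    """F statistic for testing the last k variables of the full model."""
    return ((q2_full - q2_reduced) / k) / (q1_residual / (n - p))


n, p = 14, 3
f_all = overall_f(13182.600, 171.090, n, p)
f_last = partial_f(13182.600, 13182.60, 171.090, k=1, n=n, p=p)

print(round(f_all, 1), f_all > 3.98)     # compare with tabled F for 2 and 11 d.f.
print(round(f_last, 3), f_last > 4.84)   # compare with tabled F for 1 and 11 d.f.
```

Because the reduced-model sum of squares is reported to only two decimal places, the partial F computed here is effectively zero; the point of the sketch is the form of the calculation.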

Note that $Q_2 - Q_2^*$ is the reduction in the sum of squares due to regression brought about by deleting k independent variables. If $Q_2^*$ nearly equals $Q_2$, then the deletion of the k variables has not greatly changed the ability of the model to explain the linear variation in Y. Under these conditions F will be small and $H_0$ will not be rejected, indicating that one might eliminate the last k variables from further consideration. Rejection of $H_0$ does not imply that all of the last k variables are important; it only implies that at least one of these variables is explaining a significant amount of the variation in Y.

Confidence Intervals on the Regression Line

To place confidence limits on $Y_h$, where $Y_h = \underline{X}_h\underline{\beta}$, it is necessary to have an estimate for the variance of $\hat{Y}_h = \underline{X}_h\underline{\hat\beta}$. In this discussion $\hat{Y}_h$ is an estimate of Y (a scalar) at the point $\underline{X}_h$ (a 1 × p vector) in p dimensional space. $\underline{\hat\beta}$ is a p × 1 vector consisting of the estimates for $\underline{\beta}$. The $\mathrm{var}(\hat{Y}_h)$ is given by (Draper and Smith, 1966)

$$\mathrm{var}(\hat{Y}_h) = \sigma^2\,\underline{X}_h(\underline{X}'\underline{X})^{-1}\underline{X}_h' \qquad (10.20)$$

which can be estimated by replacing $\sigma^2$ with $s^2$. The confidence limits on $Y_h$ are given by

$$\hat{Y}_h \pm t_{1-\alpha/2,\,n-p}\sqrt{\mathrm{var}(\hat{Y}_h)} \qquad (10.21)$$

The confidence intervals on an individual predicted value of $Y_h$ are given by equations 10.21 where $\mathrm{var}(\hat{Y}_h)$ is replaced by the variance of an individual predicted value of Y at $\underline{X}_h$, which is given by $\sigma^2(1 + \underline{X}_h(\underline{X}'\underline{X})^{-1}\underline{X}_h')$.

Other Inferences in Regression

Many other tests of hypotheses can be made and confidence intervals constructed relative to multiple regression. For example, one might make tests concerning linear relationships among the $\beta$'s or that the $\beta$'s obtained from one situation are equal to those obtained from another situation. Reference can be made to Graybill (1961), Johnston (1963), Draper and Smith (1966), or Neter et al. (1996) for these and other tests.

Example 10.2. For the regression equation of example 10.1:

(a) Test the hypothesis that the regression equation is not explaining a significant amount of the variation of Y.
(b) Test the $H_0$: $\beta_2 = 0$.
(c) Test the $H_0$: $\beta_3 = 0$.
(d) Calculate the 95% confidence limits on $\beta_2$.
(e) Calculate the 95% confidence limits on the regression line at the point A = 4,000 square miles and I = 2.0 inches.
(f) Calculate the 95% confidence intervals on $\sigma^2$.

Solution:

(a) This $H_0$ is equivalent to $H_0$: $\beta_2 = \beta_3 = 0$ versus $H_a$: at least one of $\beta_2$ or $\beta_3 \neq 0$. The test is conducted by calculating the test statistic from equation 10.18. The quantities in equation 10.18 are contained in the ANOVA table with the numerator being the mean square due to regression and the denominator being the residual mean square.

$$F = \frac{6591.300}{15.554} = 423.8$$

The tabulated $F_{.95,2,11}$ is 3.98. Therefore, $H_0$ is rejected. The regression equation does explain a significant amount of the variation in Y.

(b) $H_0$: $\beta_2 = 0$; $H_a$: $\beta_2 \neq 0$

The test statistic is from equation 10.17.

$$t = \frac{13.1510}{\sqrt{0.02028(15.554)}} = \frac{13.1510}{0.562} = 23.4$$

The tabled value of t is $t_{.975,11} = 2.201$, so we reject the $H_0$. Area does explain a significant amount of the variation in Y.

(c) $H_0$: $\beta_3 = 0$; $H_a$: $\beta_3 \neq 0$

The test statistic is again from equation 10.17

$$t = \frac{0.0112}{\sqrt{0.52329(15.554)}} = \frac{0.0112}{2.85} = 0.004$$

Because $|t| < t_{.975,11}$, we cannot reject $H_0$. The mean annual maximum 24-hour rainfall depth does not explain a significant amount of the variation in the mean annual peak flow.

(d) The 95% confidence limits on $\beta_2$ are calculated from equations 10.15 as

$$13.1510 \pm 2.201(0.562), \text{ or } 11.91 \text{ to } 14.39$$

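(e) The 95% confidence limits on the regression line at $X_{2,h} = 4.00$ and $X_{3,h} = 2.0$ are determined from equation 10.21. The $\mathrm{var}(\hat{Y}_h)$ is from equation 10.20.

$$\mathrm{var}(\hat{Y}_h) = 15.554\,\underline{X}_h(\underline{X}'\underline{X})^{-1}\underline{X}_h'$$

$(\underline{X}'\underline{X})^{-1}$ is given in example 10.1 and $\underline{X}_h = [1 \;\; 4.00 \;\; 2.0]$.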
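The quadratic form needed in part (e) can be evaluated with a short sketch. The inverse matrix, parameter estimates, $s^2$, and tabled t value are those given in examples 10.1 and 10.2; the variable and function names are illustrative. Because the inverse is reported to only five decimal places and the terms largely cancel, the result should be read as approximate.

```python
import math

# (X'X)^-1, parameter estimates, and s^2 as reported in example 10.1
C_INV = [[3.71678, -0.18094, -1.37537],
         [-0.18094, 0.02028, 0.06124],
         [-1.37537, 0.06124, 0.52329]]
beta_hat = [1.6570, 13.1510, 0.0112]
s2 = 15.554
t_crit = 2.201              # tabled t for 0.975 and 11 d.f.
x_h = [1.0, 4.0, 2.0]       # intercept term, A = 4.00, I = 2.0


def quad_form(x, m):
    """Evaluate x m x' for a row vector x and a square matrix m."""
    size = len(x)
    return sum(x[i] * m[i][j] * x[j] for i in range(size) for j in range(size))


y_hat = sum(b * x for b, x in zip(beta_hat, x_h))
q = quad_form(x_h, C_INV)
var_line = s2 * q                   # variance of the regression line at x_h
var_individual = s2 * (1.0 + q)     # variance of an individual predicted value

half_line = t_crit * math.sqrt(var_line)
half_individual = t_crit * math.sqrt(var_individual)
print(round(y_hat, 2), round(var_line, 3))
print(round(y_hat - half_line, 2), round(y_hat + half_line, 2))
print(round(y_hat - half_individual, 2), round(y_hat + half_individual, 2))
```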

(f) The 95% confidence intervals on $\sigma^2$ are calculated from equation 10.13
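A sketch of this calculation follows. The chi-square quantiles for 11 degrees of freedom are taken from a standard table (an assumption of this sketch, since they are not quoted in the text), and small differences in the last digit relative to hand calculation can arise from rounding of the tabled values.

```python
import math

n, p = 14, 3
s2 = 15.554                 # residual mean square from example 10.1

# Chi-square quantiles for n - p = 11 degrees of freedom (standard table values)
chi2_upper = 21.920         # 0.975 quantile
chi2_lower = 3.816          # 0.025 quantile

ss_residual = (n - p) * s2
var_low = ss_residual / chi2_upper     # lower limit on sigma^2
var_high = ss_residual / chi2_lower    # upper limit on sigma^2

print(round(var_low, 2), round(var_high, 2))
print(round(math.sqrt(var_low), 2), round(math.sqrt(var_high), 2))
```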

The 95% confidence intervals on $\sigma$ can be obtained by taking the square root of these limits to obtain 2.80-6.69.

Comment: The hypotheses $H_0$: $\beta_2 = 0$ and $H_0$: $\beta_3 = 0$ were both tested in this example as though the tests were independent. In fact, $\hat\beta_2$ and $\hat\beta_3$ are not independent. The $\mathrm{cov}(\hat\beta_2, \hat\beta_3)$ can be determined from $c^{-1}_{23}s^2$ as 0.0612(15.554) = 0.9519. The correlation between $\hat\beta_2$ and $\hat\beta_3$ can be estimated from $\mathrm{cov}(\hat\beta_2, \hat\beta_3)/(\sigma_{\hat\beta_2}\sigma_{\hat\beta_3})$ as 0.9519/(0.562 × 2.85) = 0.59. The test of $H_0$: $\beta_3 = 0$ is made relative to the full model that includes all of the $\beta$'s. The acceptance of $H_0$ implies that $\beta_3 = 0$ given that $\beta_1$ and $\beta_2$ are in the model. In general, if there are p $\beta$'s and $H_0$: $\beta_i = 0$ is tested for each of them, with the result that k of the hypotheses can be accepted, one cannot eliminate these k variables from the model on the basis of this test alone because each of the individual $H_0$: $\beta_i = 0$ assumes all of the other p - 1 $\beta$'s are still in the model. To eliminate k variables at once, the test must be based on equation 10.19.

As an example of the application of equation 10.19, the $H_0$: $\beta_3 = 0$ will be tested. The ANOVA for the full model is contained in example 10.1. The reduced model is simply $Y = b_1 + b_2X$ where X is the watershed area in thousands of square miles. Because this is a simple regression situation, we can compute the sum of squares due to regression from $b\sum x_iy_i$ where $b = \sum x_iy_i/\sum x_i^2$. The result of this calculation is the sum of squares due to regression for the reduced model, which is 13,182.60. The test statistic from equation 10.19 is

$$F = \frac{(13{,}182.600 - 13{,}182.60)/1}{171.090/11} \approx 0$$

The table value of $F_{.95,1,11}$ = 4.84, so we fail to reject $H_0$: $\beta_3 = 0$. Note that this test is identical to the test conducted in part (c) of this example. From F and t tables it can be seen that $F_{1-\alpha,1,n} = t^2_{1-\alpha/2,n}$, so for the special case where k = 1 variable is being tested, equations 10.17 and 10.19 produce identical results.

Because $H_0$: $\beta_3 = 0$ was not rejected, the next logical step is to eliminate I from the model and consider only A. In so doing the resulting regression equation is

The dependence of the $\hat\beta$'s again is evident because the intercept is not the same as was obtained when rainfall depth was included in the model. This is a somewhat special example in that $\beta_2$ accounts for nearly all of the variation in Y, leaving virtually none of the variation to be explained by $\beta_3$. Again, one reason for this unusual situation is the units on Y and A and the proximity of all of the watersheds to each other, resulting in similar rainfalls on all of the watersheds. Unless the relationship between the dependent variable and an independent variable is quite strong, variability in the dependent variable due to variability in the independent variable cannot be detected if there is little variability in the independent variable.

WHICH LINE IS BEST

A common situation in which multiple regression is used is when one dependent variable and several independent variables are available and it is desired to find a linear model for predicting unobserved values for the dependent variable. The model that is developed does not necessarily have to contain all of the independent variables. Thus, the points of concern are: 1) can a linear model be used and 2) what independent variables should be included?

A factor complicating the selection of the model is that in most cases the independent variables are not statistically independent at all but are correlated. One of the first steps that should be done in a regression analysis is to compute the correlation matrix $\underline{R}$ of the independent variables. The correlation matrix can be computed as follows. Let

$$z_{i,j} = \frac{X_{i,j} - \bar{X}_j}{s_j}$$

where $\bar{X}_j$ and $s_j$ are the mean and standard deviation of the jth independent variable. Then define $\underline{z} = [z_{i,j}]$ so that the correlation matrix is

$$\underline{R} = \frac{\underline{z}'\underline{z}}{n - 1}$$

$\underline{R}$ is a symmetric matrix because $R_{i,j} = R_{j,i}$, where $R_{i,j}$ is the correlation between the ith and jth independent variables. We have already seen that if $R_{i,j} = 1$ for i ≠ j, then either variable i or variable j must be omitted from the model or else the $\underline{X}'\underline{X}$ matrix cannot be inverted. If $R_{i,j}$ is close to unity (but not equal to unity), then $\underline{X}'\underline{X}$ can be inverted and $\underline{\beta}$ estimated. If $R_{i,j}$ is close to unity, then the $\mathrm{var}(\hat\beta_i)$ or $\mathrm{var}(\hat\beta_j)$ may be very large. Tests of hypothesis on $\beta_i$ and $\beta_j$ may indicate that neither is significantly different from zero when in fact either $\beta_i$ or $\beta_j$ when used alone may be significantly different from zero. The problem here is that since $X_i$ and $X_j$ are nearly linearly related, they both are attempting to explain the same thing in the linear model. By having both $X_i$ and $X_j$ in the model, the part of the variation in Y that either would explain if used alone may be split between them in such a fashion that neither is significant. In other words, the effect of one explanatory factor (which may be reflected in either $X_i$ or $X_j$) is being divided between two correlated variables.

Retaining variables in a regression equation that are highly correlated (multicolinearity) makes the interpretation of the regression coefficients difficult. Many times the sign of the regression coefficient may be the opposite of what is expected if the corresponding variable is highly correlated with another independent variable in the equation. Multicolinearity is discussed below.

A common practice in selecting a multiple regression model (and one that is not necessarily being advocated) is to perform several regressions on a given set of data using different combinations of the independent variables. The regression that "best" fits the data is then selected. A commonly used criterion for the "best" fit is to select the equation yielding the largest value of $R^2$. Looking at equations 10.21, another and perhaps better criterion is apparent. The confidence intervals on the regression line are a function of s, the estimated standard error. The line with the smallest standard error will have the narrowest confidence intervals. Often the two criteria of the largest $R^2$ and the smallest s give the same results, but not always.

As more variables are added to a regression equation, the $R^2$ value can never decrease. Thus, from the standpoint of the $R^2$ criterion, one should use all of the available variables. This, however, makes a clumsy equation and one in which it is extremely difficult to place a meaningful interpretation on the coefficients. As more variables are added to a regression equation, the standard error may get larger. This can be seen from equation 10.11. Every time a variable is added, n - p gets smaller as does $\underline{Y}'\underline{Y} - \underline{\hat\beta}'\underline{X}'\underline{Y}$. However, the numerator may not, and often does not, decrease proportionally to n - p, so that as variables are added s may actually increase. This is a tip-off that the added variables are not contributing significantly to the regression and can just as well be left out.

All of the variables retained in a regression should make a significant contribution to the regression unless there is an overriding reason (theoretical or intuitive) for retaining a nonsignificant variable. The variables retained should have physical significance. If two variables are equally significant when used alone but are not both needed, the one that is easiest to obtain or easiest to interpret should be used. The number of coefficients estimated should not exceed 25-35% of the number of observations. This is a rule of thumb used to avoid "over-fitting", whereby oscillations in the equation may occur between observations on the independent variables.

Thus far all decisions on which regression equation to use have been made by the investigator. In many cases this is the most reliable method of selecting a regression equation. Using computers, it is possible to perform many regressions on large sets of data. This has led to several formal procedures for selecting a regression equation. Two methods will be discussed here: all-possible-regressions and stepwise regression. For a discussion of some other techniques, reference should be made to Draper and Smith (1966) and Neter et al. (1996).

All-possible-regressions involves calculating regression equations having every possible combination of the X variables. If all of the equations are required to have an intercept term, $2^{p-1}$ regression equations would have to be calculated where p is the number of independent variables, one of which is always equal to one to produce the intercept term. Thus, if p = 4, 8 regression equations would be calculated (not an impossible task or a bad procedure); however, if p = 11, 1024 regressions would have to be calculated and examined. Thus, as p gets even moderately large, the number of regressions required becomes prohibitive and intelligent thought could eliminate many of them. When this many regressions are calculated, the probability of getting a significant regression by chance becomes large.

One of the most commonly used procedures for selecting the "best" regression equation is stepwise regression. This procedure consists of building the regression equation one variable at a time by adding at each step the variable that explains the largest amount of the remaining unexplained variation. After each step all the variables in the equation are examined for significance and discarded if they are no longer explaining a significant amount of the variation. Thus, the first variable added is the one with the highest simple correlation with the dependent variable. The second variable added is the one explaining the largest variation in the dependent variable that remains unexplained by the first variable added. At this point the first variable is tested for significance and retained or discarded depending on the results of this test. The third variable added is the one that explains the largest portion of the variation that is not explained by the variables already in the equation. The variables in the equation are then tested for significance. This procedure is continued until all of the variables not in the equation are found to be insignificant and all of the variables in the equation are significant. This is a very good procedure to use but care must be exercised to see that the resulting equation is rational.

An alternative stepwise procedure is to start with a full model and eliminate variables one at a time with the least significant variable being chosen for elimination. At each step the significance of all remaining variables is checked to ensure the retained variables are, in fact, more important than the eliminated ones.

Of course, the real test of how good the resulting regression model is depends on the ability of the model to predict the dependent variable for observations on the independent variables that were not used in estimating the regression coefficients. To make a comparison of this nature, it is necessary to randomly divide the data into two parts. One part of the data is then used to develop the model and the other part to test the model. Unfortunately, in hydrologic applications there are often not enough observations to carry out this procedure.
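The number of equations required by the all-possible-regressions procedure described above can be illustrated by enumerating the candidate models. The variable names below are hypothetical, and the sketch only lists the combinations; it does not fit the regressions.

```python
from itertools import combinations


def candidate_models(names):
    """Yield every subset of the candidate variables, always keeping the intercept."""
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            yield ("intercept",) + subset


# Hypothetical independent variables; with the intercept term, p = 4.
names = ["area", "slope", "rainfall"]
models = list(candidate_models(names))

print(len(models))
for model in models:
    print(model)

# Count of equations for the two cases mentioned in the text.
for p in (4, 11):
    print(p, 2 ** (p - 1))
```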

EXTRAPOLATION The comments on extrapolation contained in chapter 9 relative to simple regression are equally applicable to multiple regression. In multiple regression an additional problem arises. It is sometimes difficult to tell the range of the data. In example 10.1, A ranges from 0.091 to 8.27 and I ranges from 1.7 to 3.2. Is the point A = 6.0 and I = 2.7 in the range of the data? A plot of A and I is shown in figure 10.1. From this plot it is apparent that A and I do not cover the entire range defined by 0.091< A < 8.27 and 1.7 < I < 3.2. The point A = 6.0 and I = 2.7 does not appear to be in the range of the data. In more than 2 dimensions it is much more difficult to visualize the range of the data.
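In more than two dimensions, one possible numerical stand-in for the plot is the leverage of a candidate point, $h = x'(X'X)^{-1}x$, compared with the largest leverage among the observations. This diagnostic is not part of the discussion above; it is offered only as a minimal sketch. The predictor values below are hypothetical and are not the data of example 10.1.

```python
import numpy as np

def leverage(X, x_new):
    """h = x' (X'X)^-1 x for a candidate point; X includes the intercept column."""
    xtx_inv = np.linalg.inv(X.T @ X)
    return float(x_new @ xtx_inv @ x_new)

# Hypothetical predictors that are strongly related to each other.
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
i = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 6.8, 8.1])
X = np.column_stack([np.ones_like(a), a, i])

# Largest leverage among the points used to fit the regression.
h_max = max(leverage(X, row) for row in X)

inside = np.array([1.0, 4.5, 4.5])    # follows the pattern of the data
outside = np.array([1.0, 7.0, 2.0])   # each coordinate is within its own range

h_inside = leverage(X, inside)
h_outside = leverage(X, outside)
print(h_inside <= h_max, h_outside <= h_max)
```

A candidate point whose leverage exceeds the largest observed leverage lies outside the region covered by the data even though each coordinate is individually within range.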


Fig. 10.1. Range of data used in example 10.1.

AUTOCORRELATED ERRORS One of the assumptions that is made in linear regression is that the errors are independent. This means that there should be no correlation between the errors at successive observations. Correlation in the errors from one observation to the next is common in time series data, especially if the hydrologic system involves considerable storage. For example, if the dependent variable is the elevation of the ground water in a particular observation well on a monthly basis, it would not be uncommon that if this water level were under-predicted at a particular time step, it would tend to be under-predicted in the next time step. Correlation of this type is often called autocorrelation or serial correlation. The chapters in this book on Correlation and on Time Series Analysis deal with this topic as well. Neter et al. (1996) has a good treatment of regression when serial correlation is present. It is important to note that what is of concern is autocorrelation in the error term of the regression model, not in the dependent or the independent variables. Often, but not always, autocorrelation in the dependent variable leads to autocorrelation in the error term of the regression model. Time series data such as daily or monthly streamflow, monthly ground water levels, and monthly reservoir levels generally have significant serial correlation, and regressions using these as dependent variables often have serial correlation in the error terms. The error term represents deviations between the predicted and observed values of the dependent variable. Serial correlation in the predicted variables can arise because the model predicts similar values from one time step to the next. It is only when over-predictions at one time step tend to follow overpredictions at the previous time step and under-predictions tend to follow under-predictions that serial correlation in the error term exists. 
Serial correlation in the errors can be detected by examining a time series plot of the errors and noting any patterns. Random scattering of the errors indicates a lack of serial correlation or independence of the errors. Any pattern in the errors may be indicative of serial correlation. The correlogram (chapter 14) of the errors can also be computed. A large first order serial correlation indicates correlated errors. Estimated regression coefficients in the presence of serial correlation in the errors are unbiased but their variances are incorrectly estimated, and thus the level of significance of hypothesis tests regarding these coefficients is unknown. The standard error of the regression equation is also affected so that hypothesis tests involving the standard error are also at an unknown level of significance. Serial correlation may indicate that one or more important explanatory variables are missing from the regression equation. Serial correlation implies that

$$e_t = \rho e_{t-1} + \epsilon_t$$

where $e_t$ is the error at time t, $\rho$ is the serial correlation, and $\epsilon_t$ is independent with mean zero. Neter et al. (1996) indicate that if $\epsilon_t$ is iid $N(0, \sigma^2)$ then $e_t$ has a mean of 0 and a variance of $\sigma^2/(1 - \rho^2)$ where $\rho$ is the first order serial correlation between $e_t$ and $e_{t-1}$. This in turn implies that

Transforming all variables so that

we can perform a regression of $Y'_t$ versus $X'_{i,t}$ and eliminate the problem of serial correlation in the errors. Equation (10.25) requires that $\rho$ be known. It can be estimated by computing the first order serial correlation of the errors from the original equation involving $Y_t$ and the $X_{i,t}$'s from

where $\hat{\sigma}$ is the estimated standard error of the original regression equation. An alternative to this estimation of $\rho$ would be to include $Y_{t-1}$ and the $X_{i,t-1}$ variables as predictor variables so that equation (10.25) would become

In the event that $\epsilon_t$ turns out to be iid $N(0, \sigma^2)$, standard tests of hypotheses can then be used to eliminate nonsignificant $\beta$'s. In equation (10.28) the $Y_{t-1}$ and $X_{i,t-1}$ are known as lagged variables. Lagged variables can often represent changes in storage. We know from continuity

$$I - O = \Delta S$$

where I is inflow, O is outflow and $\Delta S$ is the change in storage for a particular hydrologic system. In many systems of areal extent A, $(Y_t - Y_{t-1})A$ may be proportional to the change in storage from time t - 1 to t. A prediction of $Y_t$ might be based on the difference in inflow and outflow from t - 1 to t and $Y_{t-1}$.

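$$\bar{I} - \bar{O} = (Y_t - Y_{t-1})A$$

$$\frac{I_t + I_{t-1}}{2}\,\Delta t - \frac{O_t + O_{t-1}}{2}\,\Delta t = (Y_t - Y_{t-1})A$$

which may be written

$$Y_t = Y_{t-1} + \frac{\Delta t}{2A}(I_t + I_{t-1}) - \frac{\Delta t}{2A}(O_t + O_{t-1})$$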

and is in the form of equation 10.28. Equation (10.28) may eliminate serial correlation in the error term but may introduce multicolinearity in the X matrix. Multicolinearity in X is discussed below. Multicolinearity may arise from serial correlation in one or more of the X variables. Further, if $Y_t$ is linearly related to the $X_t$'s, then $Y_{t-1}$ can be expected to be linearly related to the $X_{t-1}$'s.

Testing for Serial Correlation

One way to detect serial correlation in the errors is through the correlogram (chapter 14). Serial correlation in the errors would be indicated by a significant first order autocorrelation coefficient. A test for serial correlation is presented in chapter 11. The hypothesis is that $\rho(k) = 0$ versus $\rho(k) \ne 0$ where $\rho(k)$ is the kth order serial correlation. The test for k = 1 is of special concern. Possibly the most widely used test for serial correlation in the error term is the Durbin-Watson test. In hydrology any correlation tends to be positive rather than negative because the correlation often comes about due to storage in the system under analysis. Ground water levels change slowly because of ground water storage. Flows in major rivers change slowly because of storage in the watershed. Storage tends to promote positive serial correlation. Thus, the normal test for serial correlation would be for $\rho = 0$ versus $\rho > 0$. The test statistic is

$$D = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}$$

An exact test is not available, but Durbin and Watson have obtained lower and upper bounds $d_L$ and $d_U$ such that the hypothesis cannot be rejected if $D > d_U$ and the hypothesis is rejected if $D < d_L$. If $d_L < D < d_U$, the test is inconclusive. Tables of $d_L$ and $d_U$ are contained in the appendix for various values of n, p, and for levels of significance equal to 0.05 and 0.01.
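As a small illustration, the Durbin-Watson statistic in its usual form, $D = \sum_{t=2}^{n}(e_t - e_{t-1})^2 / \sum_{t=1}^{n} e_t^2$, can be computed directly from a residual series. The function name and the two residual series below are made up for the example; the bounds $d_L$ and $d_U$ must still be taken from tables.

```python
import numpy as np

def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))

# Residuals that persist in sign (positive serial correlation) give D well below 2.
d_persistent = durbin_watson([1, 1, 1, -1, -1, -1])

# Residuals that alternate in sign (negative serial correlation) give D well above 2.
d_alternating = durbin_watson([1, -1, 1, -1])

print(d_persistent, d_alternating)
```

Values of D near 2 are consistent with uncorrelated errors, which is why 4 - D serves as the statistic when testing for negative serial correlation.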


Neter et al. (1996) indicate that a test for negative serial correlation can be done by using as a test statistic 4 - D. The test is then the same as for positive serial correlation. Helsel and Hirsch (1992) indicate that the Durbin-Watson statistic requires the data to be evenly spaced in time. Corrective Action When serial correlation in the errors is detected, the first step should be to determine if some important explanatory variable is missing from the regression equation. Often in hydrology the serial correlation is a result of storage in the system. In this case, a measure of this storage may need to be included as a predictor variable. In other cases some function of time may correct the problem. Aggregating data over longer time periods may reduce or eliminate serial correlation. As the time between observations increases, the dependence of one observation on another can be expected to decrease. At large enough time intervals, independence may be achieved. As indicated earlier, the inclusion of lagged variables, both on the dependent and the independent variables, may help reduce serial correlation. The chapter on time series modeling should be consulted for more on this topic.
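The transformation approach to correlated errors can be sketched for a single predictor as below. This is an illustration under stated assumptions rather than the procedure of this chapter: $\rho$ is estimated with a common lag-one estimator, $\sum e_t e_{t-1} / \sum e_t^2$, which may differ from the estimator intended in the text, and the function names and simulated series are hypothetical.

```python
import numpy as np

def lag1_correlation(e):
    """A common estimate of the first order serial correlation of a series."""
    e = np.asarray(e, dtype=float)
    return float(np.sum(e[1:] * e[:-1]) / np.sum(e ** 2))

def ols(design, y):
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

def transformed_fit(x, y):
    """Fit y = b0 + b1*x, estimate rho from the residuals, then refit on
    y_t - rho*y_(t-1) versus x_t - rho*x_(t-1)."""
    design = np.column_stack([np.ones_like(x), x])
    b_first = ols(design, y)
    rho = lag1_correlation(y - design @ b_first)
    y_star = y[1:] - rho * y[:-1]
    x_star = x[1:] - rho * x[:-1]
    # The intercept column becomes (1 - rho), so b[0] estimates the original intercept.
    design_star = np.column_stack([np.full_like(x_star, 1.0 - rho), x_star])
    return rho, ols(design_star, y_star)

# Simulated series with first order serial correlation of 0.7 in the errors.
rng = np.random.default_rng(1)
n = 500
x_sim = rng.normal(0.0, 3.0, n)
err = np.zeros(n)
for t in range(1, n):
    err[t] = 0.7 * err[t - 1] + rng.normal()
y_sim = 2.0 + 0.5 * x_sim + err

rho_hat, b_hat = transformed_fit(x_sim, y_sim)
print(rho_hat, b_hat)
```

The residuals of the transformed fit can then be checked again for serial correlation before any hypothesis tests are trusted.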

MULTICOLINEARITY In multiple linear regression it is unfortunate that the predictor variables in X are called "independent" variables. This terminology reflects that Y is being predicted as a function of X. Thus Y has been termed the "dependent" variable because Y is thought to depend on X. By extension, the X's have become known as the "independent" variables because they are what Y is dependent upon. Independence has a special meaning in statistics that differs from the above, as we have seen. We know that if all of the X's in X are mutually independent, then the correlation matrix computed from X will be a diagonal matrix with ones on the diagonal and zeros elsewhere. In most natural sciences where "independent" variables are measured values from uncontrolled experimentation, it is rare to achieve true statistical independence. Some level of correlation almost always exists among the predictor variables. These correlations among the independent variables are often called multicolinearities. Much of the discussion on multicolinearity comes from Neter et al. (1996). Generally the term multicolinearity is reserved for the case when rather strong correlations exist within the X matrix.

As the name implies, multiple linear regression attempts to exploit linear relationships between Y and X to develop a prediction or descriptive equation for Y. If two X variables, say $X_1$ and $X_2$, are perfectly linearly related, then $|r_{1,2}| = 1$. Furthermore, all of the information relative to a linear relationship between Y and $X_2$ will be contained in the relationship between Y and $X_1$. In other words, nothing is gained by including both $X_1$ and $X_2$ in a linear regression with Y. As a matter of fact, there is not a unique linear relationship between Y and $X_1$ and $X_2$ if $X_1$ and $X_2$ are perfectly correlated. Each of the relationships will predict the same value of Y for all $X_1$ and $X_2$ pairs that follow the linear relationship between $X_1$ and $X_2$.

When $X_1$ and $X_2$ are perfectly correlated, the residuals of the regressions Y on $X_1$, Y on $X_2$, and Y on $X_1$ and $X_2$ will all be exactly the same since the same information, in a linear sense, will be contained in all three regressions. (Note the brief mention of multicolinearity following equation 10.8.)


If we now relax the requirement that $X_1$ and $X_2$ are perfectly correlated to requiring that they be "highly" correlated, an approximation to what is discussed above results. Now the residual sum of squares of regressions of Y on $X_1$, Y on $X_2$, and Y on $X_1$ and $X_2$ will be nearly the same depending on the strength of the linear relationship between $X_1$ and $X_2$. Thus, if a regression of Y on $X_1$ is performed followed by a regression of Y on $X_1$ and $X_2$, the reduction in the residual sum of squares brought about by the addition of $X_2$ will be small because very little information in a linear sense is added to the regression.

What may happen if both $X_1$ and $X_2$ are included is that the linear effects between Y and $X_1$ or $X_2$ may be split between $X_1$ and $X_2$ in such a fashion that the regression coefficients do not make physical sense. For example, they may have the wrong sign. Furthermore, the individual regression coefficients may test nonsignificant on both $X_1$ and $X_2$ even though the overall regression is significant. By splitting the importance of either $X_1$ or $X_2$ among both $X_1$ and $X_2$, the variances of the regression coefficients on $X_1$ and $X_2$ become larger, indicating increased sampling variability relative to these coefficients. Again, this is brought about by splitting the effect of one important linear relationship among two (or more) variables that are closely linearly related. Substantial changes in the values for the regression coefficients upon the addition or removal of a variable from a regression equation are an indication that multicolinearity may be present.

Having both $X_1$ and $X_2$ in the regression equation will not cause prediction problems as long as the predictions are confined to the region of $X_1$ and $X_2$ defined by the original data sets. This means that values used for $X_1$ and $X_2$ for prediction must exhibit the same near linear relationship as did the original values used in estimating the regression coefficients.
Multicolinearity is not restricted to correlations between pairs of X variables. It also includes correlation between any one of the X's and any linear combination of any of the remaining X's. Obviously, correlations between pairs of X's are easily detected from the correlation matrix of X. Correlations with linear functions of several X's are not always easily detected. One way to identify the possibility of an X being correlated with a linear combination of the other X's is to compute the regression of $X_i$ on X*, where X* is X with $X_i$ removed. The multiple R can be examined and used as an indication of multicolinearity. This procedure can be carried out for all of the $X_i$'s, i = 2, ..., p.

A summary of what has been indicated about the effect of multicolinearity is:

1. Multicolinearity in itself does not inhibit the predictive ability of a regression model provided the prediction is made within the regions of the independent variables used in deriving the regression coefficients.

2. Multicolinearity may contribute to an inflated variance in the estimated regression coefficients. The sampling error of the coefficients may be large resulting in individual coefficients being nonsignificant even though the overall regression is indicating a definite linear relationship exists between Y and X.

3. Individual regression coefficients may be hard to interpret in terms of their impact on Y. They may even have the wrong sign. Thus, even though the overall equation makes a valid prediction, the contributions of the individual X variables may not be decipherable.


4. The values for individual regression coefficients may change substantially upon the addition or deletion of an X variable that involves multicolinearity.

Detection of Multicolinearity

Some general indications of the possible presence of multicolinearity that have been identified are:

1. Large correlations in the correlation matrix of X.

2. Regression coefficients that do not make good physical sense.

3. Nonsignificant regression coefficients on important variables.

4. Large changes in the values of regression coefficients upon the addition or deletion of a variable from the regression equation.

Possibly the most commonly used formal method for detecting multicolinearity is through the use of the Variance Inflation Factor, VIF, defined as

$$\mathrm{VIF} = \frac{1}{1 - R_i^2}$$

where $R_i^2$ is the multiple coefficient of determination between $X_i$ and all of the other X's in the regression equation. When $R_i^2$ is zero, then $X_i$ is linearly independent of the other X's and the VIF is one. If $R_i^2 = 1$, then the $\mathrm{Var}(\hat{\beta}_i)$ and the VIF are unbounded. Large values of VIF indicate the presence of multicolinearity. The exact value of VIF at which multicolinearity is declared depends on the individual investigator. Some use a value of 5 and others 10. A VIF of 10 corresponds to an $R_i^2$ of 0.90 and a VIF of 5 corresponds to an $R_i^2$ equal to 0.80. Some will compute an average VIF over all p - 1 regression coefficients and declare that if this average VIF is "considerably" larger than one, multicolinearity is indicated. Some statistical packages will compute the VIF. Some statistical packages use an indicator called the tolerance, which is 1/VIF. Thus, a VIF of 10 corresponds to a tolerance of 0.1 and a VIF of 5 corresponds to a tolerance of 0.2.
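The VIF can be computed by regressing each predictor on the remaining predictors, as in the sketch below. The function name and the small arrays are illustrative assumptions; they are not data from this chapter.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X.

    X is an n x k array of predictor values WITHOUT the intercept column.
    Each column is regressed on the other columns plus an intercept, and
    VIF = 1 / (1 - R^2) of that regression.
    """
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Two uncorrelated predictors: both VIFs equal one.
X_orth = np.array([[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]])
vif_orth = vif(X_orth)

# Two nearly collinear predictors: both VIFs are large (here 37).
X_corr = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0], [5.0, 6.0]])
vif_corr = vif(X_corr)

print(vif_orth, vif_corr, 1.0 / vif_corr)   # the last term is the tolerance
```

With only two predictors the two VIFs are always equal, because each regression has the same $R^2$, the squared simple correlation.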

AN APPLICATION OF MULTIPLE REGRESSION The following illustration of using multiple regression is adapted from Haan and Read (1970). A part of their study was devoted to developing a prediction equation for the mean annual runoff for small watersheds in Kentucky. The data for the example are contained in table 10.2. The number of observations (13) is very small and does not permit splitting the sample and using a portion of the data for testing the resulting model. Table 10.3 contains definitions of the symbols used in table 10.2. The correlation matrix for the independent variables is contained in table 10.4.


Table 10.2. Data from Haan and Read (1970)
Columns: Watershed No.; Runoff; Precipitation; A; S; L; P; $d_i$; $R_s$; F; $R_r$

Table 10.3. Definition of symbols used by Haan and Read (1970)

Runoff          Mean annual runoff (inches)
Precipitation   Mean annual precipitation (inches)
A               Area (square miles)
S               Average land slope (%)
L               Axial length (miles)
P               Perimeter (miles)
$d_i$           Diameter of largest circle that can be drawn entirely within the basin (miles)
$R_s$           Shape factor, ratio of $d_i$ to $d_o$ where $d_o$ is the diameter of the smallest circle that can be drawn which entirely encloses the basin (-)
F               Stream frequency, ratio of number of streams in basin to total area of basin (square miles)
$R_r$           Relief ratio, ratio of total relief to largest dimension of basin generally parallel to main stream (feet per mile)

Table 10.4. Correlation matrix for data of Haan and Read (1970)

Table 10.5. Regression analysis of data of Haan and Read (1970) (10 independent variables)

Analysis of Variance
Columns: Source; Degrees of freedom; Sum of squares; Mean square
Rows: Regression; Residual; Total corrected for mean
R = 0.98   Std. Error = 0.69

Columns: Variable; b; $s_b$; t
Rows: Constant; Precipitation; A; S; L; P; $d_i$; $R_s$; F; $R_r$

Since the correlation matrix is symmetrical, it is customary to show only the diagonal elements and the elements either above or below the diagonal. The mean and standard deviation of runoff are 16.55 and 1.93 inches, respectively. Table 10.5 contains the results of the multiple regression of runoff on all 9 of the independent variables. Because an intercept term was included, p is equal to 10. In the ANOVA table, the sum of squares for the mean and the total sum of squares are not shown. Instead the total sum of squares corrected for the mean is given. The F that is given is the calculated F for the overall regression equation (from equation 10.18) used in testing the hypothesis that the regression does not explain a significant amount of the variation in Y. Because $F_{0.95,9,3}$ is 8.81, this hypothesis is rejected. The lower part of table 10.5 contains the estimated regression coefficients, the standard errors of the regression coefficients, and the calculated t (equation 10.17) used in testing $H_0$: $\beta_i = 0$. The only b's with calculated t's greater than 2.0 are those based on precipitation, P, and $R_r$. If all of the variables except these three and the intercept are eliminated at one time, the regression shown in table 10.6 results. In going to the second regression, $R^2$ has been reduced from 0.97 to 0.91, the F increased to 28.7, and the standard error has remained unchanged. All of the regression coefficients with the exception of the intercept are now significantly different from zero at the one percent level of significance since $t_{0.995,9}$ is 3.25.


Table 10.6. Regression analysis of data of Haan and Read (1970) (4 independent variables)

Analysis of Variance

Source        Degrees of freedom   Sum of squares   Mean square
Regression            3                40.64           13.55
Residual              9                 4.25            0.47
Total                12                44.89             -

R = 0.95   Std. Error = 0.69   $R^2$ = 0.91   F = 28.7

Variable            b        $s_b$      t
Constant         -9.65     4.440    -2.17
Precipitation     0.430    0.093     4.62
P                 0.620    0.075     8.25
$R_r$             0.010    0.002     5.19

The t test used to test the hypothesis that $\beta_i = 0$ makes the test assuming that all of the other $\beta$'s are still in the equation. Thus, when a decision is made to eliminate more than one variable, the t's are unreliable and the F test using equation 10.19 should be used. This test determines if several variables are simultaneously making a significant contribution to explaining the variation in the dependent variable. As an illustration of the use of equation 10.19, let the hypothesis that $\beta_A = \beta_S = \beta_L = \beta_{d_i} = \beta_{R_s} = \beta_F = 0$ be tested. For this example n = 13, p = 10, k = 6, $Q_2$ = 43.45, $Q_2^*$ = 40.64, and the residual sum of squares for the full model is 1.44. The F calculated from equation 10.19 is 0.98. Since $F_{0.95,6,3}$ = 8.94, it is concluded that the variables A, S, L, $d_i$, $R_s$, and F are not significant. The resulting prediction model is

Runoff = -9.65 + 0.43 Precipitation + 0.62 P + 0.010 $R_r$
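The calculated F of 0.98 can be checked from the sums of squares quoted above. Equation 10.19 is not reproduced in this passage, so the sketch assumes the usual form of the test for dropping k variables, $F = [(Q_2 - Q_2^*)/k] / [\text{residual SS}/(n - p)]$; the function and argument names are illustrative.

```python
def partial_f(q2_full, q2_reduced, resid_ss_full, n, p, k):
    """F for testing whether k variables can be dropped from the full model.

    q2_full       regression sum of squares of the full model
    q2_reduced    regression sum of squares of the reduced model
    resid_ss_full residual sum of squares of the full model
    n, p          number of observations and of coefficients in the full model
    """
    return ((q2_full - q2_reduced) / k) / (resid_ss_full / (n - p))

# Values quoted in the text for the data of Haan and Read (1970).
f_calc = partial_f(43.45, 40.64, 1.44, n=13, p=10, k=6)
print(round(f_calc, 2))
```

Because the calculated value is far below the tabulated 8.94, the six variables can be dropped together.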

The observed values of runoff and values predicted from the above equation are shown in the lower half of table 10.6. To demonstrate the behavior of s, $R^2$, and F, several regressions were run using various combinations of the data in table 10.2. The results of these regressions are summarized in table 10.7 and figure 10.2. This table illustrates that $R^2$ never increases as variables are removed from the equation, whereas s may decrease as some variables are removed and then increase as more variables are removed. $R^2$ approaches unity as the number of variables is increasing. If the number of variables were increased to 12, then p would be 13 (because the model has an intercept) and $R^2$ would be unity. In figure 10.2 the lines connect the best values of the quantities s, $R^2$, and F contained in table 10.7. This is because it is possible, for example, to have many combinations of 3 variables in the regression equation and each combination would produce a different s, $R^2$, and F.

Table 10.7. Some results of several regressions on s, $R^2$, and F
Columns: Eq. No.; variables included (Precipitation, A, S, L, P, $d_i$, $R_s$, F, $R_r$); s; $R^2$; F
* = Variable included and was significant.
x = Variable included and was not significant.

Fig. 10.2. Behavior of s, $R^2$, and F as a function of n for data contained in Table 10.2.

TRANSFORMING LINEAR MODELS Many models are not naturally linear models but can be transformed to linear models. For example


$$Y = \alpha X^{\beta} \qquad (10.31)$$

is not a linear model. It can be linearized by using a logarithmic transformation

$$\ln Y = \ln \alpha + \beta \ln X \qquad (10.32)$$

or

$$Y' = \alpha' + \beta' X' \qquad (10.33)$$

where

$$Y' = \ln Y, \quad \alpha' = \ln \alpha, \quad \beta' = \beta, \quad X' = \ln X \qquad (10.34)$$

Standard regression techniques can now be used to estimate $\alpha'$ and $\beta'$ for equation 10.33 and $\alpha$ and $\beta$ estimated from equations 10.34. Two important points should be noted. First, the estimates of $\alpha$ and $\beta$ obtained in this way will be such that $\sum (Y'_i - \hat{Y}'_i)^2$ is a minimum and not such that $\sum (Y_i - \hat{Y}_i)^2$ is a minimum. Second, the error term on equation 10.33 is additive ($Y' = \alpha' + \beta' X' + \epsilon'$) implying that it is multiplicative on equation 10.31 ($Y = \alpha X^{\beta} \epsilon$). These errors are related by $\epsilon' = \ln \epsilon$. The assumptions used in hypothesis testing and confidence intervals must now be valid for $\epsilon'$ and the tests and confidence intervals made relative to the transformed model.

In some situations the logarithmic transformation makes the data conform more closely to the regression assumptions. For example, if the data plot as in figure 10.3, a logarithmic transformation may make the assumption of constant variance on the error more realistic. The normal equations for a logarithmic transformation are based on a constant percentage error along the regression line, whereas the standard regression is based on a constant absolute error along the regression line. For example, the difference between $Y_i$ = 200 and $Y_i$ = 100 on an arithmetic scale is 100 times as large as the difference between $Y_i$ = 2 and $Y_i$ = 1. However, on a logarithmic scale ln 200 - ln 100 = 5.29832 - 4.60517 = 0.69315, which is the same as ln 2 - ln 1 = 0.69315 - 0.000 = 0.69315. In a situation of this type, the standard regression procedure would attempt to fit the point at Y = 100 in order to minimize $\sum (Y_i - \hat{Y}_i)^2$ at the expense of the point $Y_i$ = 1 because its contribution to $\sum (Y_i - \hat{Y}_i)^2$ is small. The logarithmically transformed model would give equal percentage weight to both points.

Fig. 10.3. Example of the effect of a logarithmic transformation on the error variance.

The above discussion can be extended to the model

through the transformation

Other models and transformations are available. For example

$$Y = \alpha e^{\beta X}$$

can be transformed to

$$\ln Y = \ln \alpha + \beta X$$

Yevjevich (1972a) lists several possible transformations. Whatever the transformation, it must be remembered that the principles of and assumptions regarding least squares apply to the transformed model, not the original model.
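A minimal sketch of fitting the power model by least squares on the logarithms is given below. The function name and the data are made up for illustration, and the data are noise-free so that the fit recovers the parameters exactly. With real data the fit minimizes squared errors in ln Y, not in Y.

```python
import math
import numpy as np

def fit_power_law(x, y):
    """Estimate alpha and beta in Y = alpha * X**beta by regressing ln Y on ln X."""
    slope, intercept = np.polyfit(np.log(x), np.log(y), 1)
    return float(np.exp(intercept)), float(slope)   # alpha = exp(alpha'), beta = beta'

# Hypothetical noise-free data generated from Y = 3 X^1.5.
x_obs = np.array([1.0, 2.0, 4.0, 8.0])
y_obs = 3.0 * x_obs ** 1.5
alpha_hat, beta_hat = fit_power_law(x_obs, y_obs)

# Equal ratios give equal differences on the logarithmic scale.
same_log_gap = math.isclose(math.log(200) - math.log(100), math.log(2) - math.log(1))

print(alpha_hat, beta_hat, same_log_gap)
```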

INDICATOR VARIABLES IN REGRESSION Consider a relationship between Y and X that may be a function of the year in which the data were collected, as shown in figure 10.4. If a single regression is performed using the model

$$Y = a + bX$$

the line labelled "overall" results. If two regressions are done, one on the 1991 data and one on the 1992 data, the two individually labeled lines result. It is possible using indicator variables to obtain the two individual lines with a single regression using the model

$$Y = a + bX + cI$$

where I is an indicator variable. Using this approach, the data would be coded such that I would be 0 for one of the years (say 1991) and 1 for the other year. The resulting equation would then effectively be

$$Y = a + bX \quad \text{for 1991}$$

$$Y = (a + c) + bX \quad \text{for 1992}$$

Fig. 10.4. Use of a simple indicator variable.

Thus, the slopes for the two regressions are the same, but the intercepts are a function of year. The advantage of using the indicator variable is that all of the data are used to estimate a common slope for the two lines. If two independent regressions were done, the slopes would likely be different. Indicator variables can be used to generate two lines having different slopes but a common intercept using the model

$$Y = a + bX + cIX$$

or having different slopes and intercepts using the model

$$Y = a + bX + cI + dIX$$

In this latter case the result would be

$$Y = a + bX \quad \text{for 1991}$$

$$Y = (a + c) + (b + d)X \quad \text{for 1992}$$
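A single regression with an indicator variable and its product with X can be sketched as follows. The data are hypothetical and noise-free, generated from Y = 1 + 2X for the first year and Y = 4 + 2.5X for the second, so the fitted coefficients can be checked by hand.

```python
import numpy as np

# Hypothetical observations: four from 1991 (I = 0) and four from 1992 (I = 1).
x = np.array([0.0, 1.0, 2.0, 3.0, 0.0, 1.0, 2.0, 3.0])
ind = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])
y = np.where(ind == 0.0, 1.0 + 2.0 * x, 4.0 + 2.5 * x)

# Design matrix for Y = a + bX + cI + dIX.
design = np.column_stack([np.ones_like(x), x, ind, ind * x])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
a, b, c, d = coef

print("1991 line: intercept", a, "slope", b)
print("1992 line: intercept", a + c, "slope", b + d)
```

Dropping the last column of the design matrix gives the common-slope model, and dropping the third column gives the common-intercept model.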

Obviously, the use of indicators uses extra degrees of freedom and thus requires more data for parameter estimation. The use of indicator variables can be extended to produce three lines. For three equally spaced lines having a common slope, the appropriate model is

$$Y = a + bX + cI$$

where values of -1, 0, and 1 are used for I. The resulting models are

$$Y = (a - c) + bX \quad \text{for } I = -1$$

$$Y = a + bX \quad \text{for } I = 0$$

$$Y = (a + c) + bX \quad \text{for } I = 1$$

Three unequally spaced lines can be generated using the model

$$Y = a + bX + cI_1 + dI_2 \qquad (10.38)$$

and assigning values to $I_1$ and $I_2$ as follows

          $I_1$   $I_2$   Line
Line 1      0       0     Y = a + bX
Line 2      0       1     Y = (a + d) + bX
Line 3      1       0     Y = (a + c) + bX

The resulting three equations are shown above. Three lines with different slopes and intercepts can be generated from the model

$$Y = a + bX + cI_1 + dI_2 + eI_1X + fI_2X$$

using the same values for the indicator variables as above, with the result

$$Y = a + bX \quad \text{for line 1}$$

$$Y = (a + d) + (b + f)X \quad \text{for line 2}$$

$$Y = (a + c) + (b + e)X \quad \text{for line 3}$$

Occasionally, it is desirable to fit a line through a set of data such that the line has a definite break in its slope at some fixed point X = C. Figure 10.5 shows such a situation. A regression of the form

$$Y = a + bX + cI(X - C)$$

will accomplish this where I = 0 for $X \le C$ and I = 1 for $X > C$. The resulting equations are

$$Y = a + bX \quad \text{for } X \le C$$

$$Y = (a - cC) + (b + c)X \quad \text{for } X > C$$

Fig. 10.5. Indicator variables and a change in slope.

Three slopes can be accomplished using a model of the form

where $C_1$ and $C_2$ ($C_1 < C_2$) are the values of X at which the slope changes and the indicator variables have values given by $I_1 = 0$ for X

$z_{1-\alpha/2}$ where $\alpha$ is the level of significance. Most logistic regression computer programs provide estimates of $s_{b_i}$. The significance of the overall model is tested by formulating the hypothesis that $\beta = 0$ and using a chi-square test based on p - 1 degrees of freedom where p is the number of coefficients estimated. The chi-square value is based on the likelihood ratio. Again, the calculated chi-square value is generally provided by a logistic regression program. If the calculated chi-square exceeds the table value, then the hypothesis that $\beta = 0$ is rejected.

Neter et al. (1996) and Helsel and Hirsch (1992) discuss other aspects of logistic regression including evaluating the overall ability of the model to correctly classify observed values of the dependent variable. The predicted value, $E(P_i)$, represents the predicted probability that $Y_i$ should be set equal to 1. One classification scheme is to use the logistic regression model to estimate $E(P_i)$ and set $Y_i$ = 0 if $E(P_i) < 0.5$ and $Y_i$ = 1 if $E(P_i) \ge 0.5$. Such a decision rule is appropriate if it is equally likely that the actual outcome is 0 or 1 and it is desired to have equal probability of incorrectly classifying the outcome. A decision rule can also be arrived at in other ways. For instance, if some observations had been set aside and not used to estimate the model, various decision rules might be evaluated and the one selected that performs the best in classifying these holdout observations. The same scheme could be used if there were no holdout observations by using the model to classify the observations used to develop the model. Obviously, this latter procedure would not independently evaluate the model because the same observations are used to evaluate as were used to develop the model.

Example 10.3. In a certain locality wetland areas are thought to be impacted by groundwater pumpage. By examining a wetland, ecologists can determine if a wetland is impacted or not. By looking at certain bio-indicators such as fungi lines on trees, the normal water level for a wetland may be determined. Water level records can be used to estimate the median water level. The distance to the nearest pumping water well is also known. It is desired to develop a model for classifying a wetland as impacted (Y = 0) or not impacted (Y = 1) based on the difference in the median water level and the normal water level, $X_2$, and the distance to the nearest pumping well, $X_3$.

Solution: A logistic regression model of the form of equation 10.49 is fit to the data shown in table 10.8. The results of the logistic regression are shown in table 10.9. Table 10.9 shows that the overall regression is significant ($\chi^2$ = 40.14) but that $\hat{\beta}_3$ on $X_3$ is not significant ($z_3$ = 0.022/0.180 = 0.12). A second logistic regression was computed eliminating $X_3$ with the results shown in table 10.10. The overall regression is significant ($\chi^2$ = 40.12) and both $\hat{\beta}_1$ and $\hat{\beta}_2$ are significantly different from zero. The resulting model is

$$E(Y) = \frac{1}{1 + \exp[-(7.09 + 3.44X_2)]} \qquad (10.53)$$

Table 10.8. Data for example problem
Columns: Obs. No.; L50; Impact; Dist; Predicted impact; E(Y); Residual
Residual: 0.004, 0.005, 0.007, 0.007, 0.014, 0.015, 0.016, 0.019, 0.021, 0.022, 0.025, 0.029, 0.037, 0.038, 0.045, 0.054, 0.062, 0.082, 0.207, -0.390, 0.658, -0.304, -0.123, -0.028, -0.008, -0.003, -0.001, -0.001, 0.000 (continued)

Table 10.9. First attempt logistic regression report

Parameter Estimation Section

Variable     Regression coefficient    Standard error    z (Beta = 0)    Prob. level
Intercept    7.014575                  2.638804          2.66            0.00
L50          3.489421                  1.355271          2.57            0.01
Dist         2.245529 x 10^-2          0.1801355         0.12            0.90

Model in transformation form: 7.014575 + 3.489421*L50 + 2.245529 x 10^-2*Dist
Note that this is XB. Prob(Y = 1) is 1/(1 + Exp(-XB)).

Model Summary Section
Columns: Model $R^2$; Model D.F.*; Model chi-square; Model prob.

Classification Table

                               Predicted
Actual                        0         1      Total
0      Count                16.00      2.00    18.00
       Row percent          88.89     11.11   100.00
       Column percent       88.89      9.52    46.15
1      Count                 2.00     19.00    21.00
       Row percent           9.52     90.48   100.00
       Column percent       11.11     90.48    53.85
Total  Count                18.00     21.00    39.00
       Row percent          46.15     53.85
       Column percent      100.00    100.00

Percent correctly classified = 89.74.
*D.F. = degrees of freedom.

Table 10.10. Second attempt logistic regression report

Parameter Estimation Section

Variable     Regression coefficient    Standard error    z (Beta = 0)    Prob. level
Intercept    7.094155                  2.553149          2.78            0.005460
L50          3.443749                  1.283279          2.68            0.007284

Model in transformation form: 7.094155 + 3.443749*L50
Note that this is XB. Prob(Y = 1) is 1/(1 + Exp(-XB)).

Model Summary Section
Columns: Model $R^2$; Model D.F.; Model chi-square; Model prob.

Classification Table
Rows: Actual (0, 1, Total); columns: Predicted (0, 1, Total); entries: Count, Row percent, Column percent

Percent correctly classified = 89.74.

Misclassified Rows Section
Columns: Row; Actual group; Predicted group; Score; Residual


Fig. 10.8. Logistic regression model for classifying wetlands.

This model correctly classified 35 of the 39 observations. Figure 10.8 shows the data and the resulting model. Using a probability level of 0.5, E(Y) = 0.5, for classification, the value of $X_2$ that the model equation 10.53 specifies as the division between impacted and unimpacted wetlands is

$$X_2 = \frac{-7.09}{3.44} = -2.06 \text{ feet}$$
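The classification rule of this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the coefficients are those printed in table 10.10, and the function names (`prob_not_impacted`, `boundary`, `classify`) are hypothetical helpers introduced here, not part of the regression package that produced the report.

```python
import math

# Coefficients as printed in the second-attempt report (table 10.10).
B0 = 7.094155   # intercept
B1 = 3.443749   # coefficient on L50 (X2, in feet)


def prob_not_impacted(l50):
    """E(Y) = 1 / (1 + exp(-XB)) for a given L50 value."""
    xb = B0 + B1 * l50
    return 1.0 / (1.0 + math.exp(-xb))


def boundary(cutoff=0.5):
    """L50 value at which E(Y) equals the chosen cutoff probability."""
    logit = math.log(cutoff / (1.0 - cutoff))
    return (logit - B0) / B1


def classify(l50, cutoff=0.5):
    """Predicted class: 1 (not impacted) if E(Y) >= cutoff, else 0 (impacted)."""
    return 1 if prob_not_impacted(l50) >= cutoff else 0


print(round(boundary(0.5), 2))   # division point for a 0.5 cutoff, in feet
print(round(boundary(0.78), 2))  # division point for a 0.78 cutoff, in feet
```

Raising the cutoff moves the division point toward larger values of L50, so more wetlands are classified as impacted.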

If an E(Y) = 0.78 is used as a cutoff for impact evaluation, only 2 of the wetlands would be misclassified. The true test of the model would be how it performs on an independent data set.

Exercises

10.1. Use the matrix methods of this chapter to work example 9.1.

10.2. Compute R for example 10.1.

10.3. Use the matrix methods of this chapter to work example 9.4.

10.4. Use the matrix methods of this chapter to work example 9.5. Calculate the confidence interval for the point X equals 50.0 inches of rainfall.

10.5. If $\underline{X}$ is an n × p matrix of n observations on p variables, and $\underline{Z}$ is the n × p matrix of deviations of the variables from their means, what is contained in the matrix $\underline{Z}'\underline{Z}/(n - 1)$? ($\underline{Z} = [z_{ij}] = [x_{ij} - \bar{x}_j]$)

10.6. Use the data in table 10.2 to develop a prediction equation for annual runoff using the model $Y = aX_1^{b_1}X_2^{b_2}\cdots X_p^{b_p}$. Would you prefer this equation over the one contained in table 10.6? Compare the equations in terms of the confidence interval on the regression lines at the mean values of the variables contained in the respective equations.

10.7. Show that $\sum (Y - \hat{Y})^2 = \underline{Y}'\underline{Y} - \hat{\underline{\beta}}'\underline{X}'\underline{Y}$.

10.8. Derive the normal equations that minimize $\sum (Y - \hat{Y})^2$ for the model $Y = aX^b$. Suggest a method for solving these equations.

10.9. The relationship between stage and discharge (rating curve) for many streams has been found to follow an equation of the type $Q = aS^b$, where Q is the discharge and S is the stage. Using the following data from the Cumberland River at Cumberland Falls, Kentucky, derive such a rating curve. Test the hypothesis that b = 1.5.

10.10. The data in table 10.11 is a partial listing of the data used by Benson (1964) in a study of floods in the Southwest. Derive a prediction equation for Q,, the mean annual flood, in terms of the remaining variables. Consider both the models given by equation 10.1 and by the multiple regression extension of equation 10.24.

Table 10.11. Independent variables, by station, in rain-flood area

A, contributing drainage area in square miles.
S, main-channel slope (85 to 10 percent points), in feet per mile.
St, percentage of area in lakes and ponds, increased by 1 percent.
E, altitude index (mean of 85 and 10 percent points), in feet above mean sea level.
L, basin length (total length of main channel), in miles.
H, basin rise (elevation difference between 85 and 10 percent points), in feet.
P, mean annual precipitation, in inches.
I, 10-year, 24-hour rainfall intensity in inches.
R, ratio of runoff to precipitation during months when annual peak discharges occur.
R,, mean annual runoff, in inches.
Q,, mean annual flood, cfs.

11. Correlation

IN CHAPTER 3 the population correlation coefficient between two random variables X and Y was defined in terms of the covariance of X and Y and the variances of X and Y as

$$\rho_{X,Y} = \frac{\sigma_{X,Y}}{\sigma_X \sigma_Y} \qquad (11.1)$$

The sample estimate $r_{X,Y}$ for $\rho_{X,Y}$ is similarly given by

$$r_{X,Y} = \frac{s_{X,Y}}{s_X s_Y} \qquad (11.2)$$

where $s_{X,Y}$ is the sample covariance between X and Y, and $s_X$ and $s_Y$ are the sample standard deviations of X and Y, respectively. Figure 3.5 and the accompanying description discussed some typical values for $r_{X,Y}$ and their meaning. There it was emphasized that 1) $r_{X,Y}$ can range from -1 to 1; 2) $r_{X,Y} = \pm 1$ implies a perfect linear relationship between X and Y; 3) $r_{X,Y} = 0$ implies linear independence but leaves room for other types of dependence; and 4) if X and Y are independent, then $r_{X,Y} = 0$.

In chapters 9 and 10 the concept of correlation was extended to give a measure of the strength of the linear relationship between a random variable Y and a second variable which was a linear function of one or more X variables, each of which may or may not be a random variable. Throughout the text many of the results that have been developed have included the assumption that the random variables were independent or that the sample being analyzed was composed of random observations. A random observation simply means that every possible element in the sample space has an equal chance of being selected during any trial.

Random variables may be either uncorrelated ($\rho_{X,Y} = 0$) or correlated ($\rho_{X,Y} \neq 0$). Even when sampling from uncorrelated populations, it would be rare for the sample correlation coefficient to be exactly zero. More likely it will deviate from zero due to chance. Thus, statistical tests are needed to evaluate whether the deviation of the sample correlation coefficient from zero may be ascribed to chance or whether the deviation is too large to attribute to chance.

If successive observations in a time series of hydrologic data are correlated, this must be taken into account in any inferences made about the data or in attempts to model the process that produced the data. Again, a procedure is required for determining if the sampled elements from a time series can be considered as random. These and other properties of correlation are the subject of this chapter.

INFERENCES ABOUT POPULATION CORRELATION COEFFICIENTS

Situations frequently arise where it is desired to test $H_0$: $\rho_{X,Y} = 0$ or $H_0$: $\rho_{X,Y} = \rho^*$ where $\rho^*$ is known. These and other tests about the population correlation coefficient will be discussed in this section. For a more detailed treatment, reference can be made to Graybill (1961).

As in the case of all hypothesis tests, certain assumptions are needed. In this section the assumption is made that X and Y are random variables from a bivariate normal distribution. The population correlation coefficient is given by $\rho$, and the sample estimate of $\rho$, given by r, is based on a random sample. If $\rho = 0$, then the quantity

$$t = r\left[\frac{n-2}{1-r^2}\right]^{1/2} \qquad (11.3)$$

has a t distribution with n - 2 degrees of freedom, where n is the sample size. Thus, to test $H_0$: $\rho = 0$, the test statistic is calculated from equation 11.3 and $H_0$ is rejected if $|t| > t_{1-\alpha/2,\,n-2}$.

If n is moderately large (n > 25), then the quantity W is approximately normally distributed with mean $\omega$ and variance $(n-3)^{-1}$ where

$$W = \frac{1}{2}\ln\left[\frac{1+r}{1-r}\right] = \operatorname{arctanh} r \qquad (11.4)$$

and

$$\omega = \frac{1}{2}\ln\left[\frac{1+\rho}{1-\rho}\right] = \operatorname{arctanh} \rho \qquad (11.5)$$

To test the hypothesis $H_0$: $\rho = \rho^*$ against the alternative $H_a$: $\rho \neq \rho^*$ for $\rho^*$, a known constant, the quantity

$$z = (W - \omega)(n-3)^{1/2} \qquad (11.6)$$

can be considered to be normally distributed with a mean of zero and a variance of one. If $|z| > z_{1-\alpha/2}$ (Z is the standard normal variable), $H_0$ is rejected. Confidence limits on $\rho$ can be estimated from

$$\tanh\left[W - \frac{z_{1-\alpha/2}}{(n-3)^{1/2}}\right] < \rho < \tanh\left[W + \frac{z_{1-\alpha/2}}{(n-3)^{1/2}}\right] \qquad (11.7)$$
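As a rough illustration of the single-sample tests just described, the sketch below computes the t statistic for $H_0$: $\rho = 0$, the arctanh-based z statistic for $H_0$: $\rho = \rho^*$, and approximate confidence limits on $\rho$. It assumes the usual forms $t = r[(n-2)/(1-r^2)]^{1/2}$ and $z = (\operatorname{arctanh} r - \operatorname{arctanh} \rho^*)(n-3)^{1/2}$; the function names are hypothetical, and the trial values (r = 0.34 and r = 0.21 with n = 30) are the ones used in example 11.1.

```python
import math
from statistics import NormalDist


def t_statistic(r, n):
    """t statistic for H0: rho = 0 (n - 2 degrees of freedom)."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


def fisher_z(r, rho_star, n):
    """Approximate standard normal statistic for H0: rho = rho_star."""
    return (math.atanh(r) - math.atanh(rho_star)) * math.sqrt(n - 3)


def rho_confidence_limits(r, n, alpha=0.05):
    """Approximate (1 - alpha) confidence limits on rho via arctanh r."""
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z_crit / math.sqrt(n - 3)
    w = math.atanh(r)
    return math.tanh(w - half_width), math.tanh(w + half_width)


print(fisher_z(0.34, 0.50, 30))         # compare |z| with 1.96
print(rho_confidence_limits(0.34, 30))  # lower and upper limits on rho
print(t_statistic(0.21, 30))            # compare |t| with the t table value
```

The critical t value is left to a t table here because the Python standard library does not provide the t distribution.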

Consider k bivariate normal populations having population correlation coefficients of $\rho_1, \rho_2, \ldots, \rho_k$ and sample correlation coefficients of $r_1, r_2, \ldots, r_k$ based on samples of size $n_1, n_2, \ldots, n_k$. Then the hypothesis $H_0$: $\rho_1 = \rho_2 = \cdots = \rho_k = \rho^*$ for $\rho^*$, a known constant, is tested by noting that

$$\chi^2 = \sum_{i=1}^{k} (\operatorname{arctanh} r_i - \operatorname{arctanh} \rho^*)^2 (n_i - 3) \qquad (11.8)$$

has a chi-square distribution with k degrees of freedom. $H_0$ is rejected if $\chi^2 > \chi^2_{1-\alpha,k}$. Rejection of the hypothesis implies that at least one of the $\rho_i$'s is not equal to $\rho^*$.

The hypothesis $H_0$: $\rho_1 = \rho_2 = \cdots = \rho_k$ (all correlation coefficients are equal) is tested by noting that

$$\chi^2 = \sum_{i=1}^{k} (n_i - 3)(W_i - \bar{W})^2 \qquad (11.9)$$

has a chi-square distribution with k - 1 degrees of freedom. In equation 11.9, $W_i$ is given by equation 11.4 as $W_i = \operatorname{arctanh} r_i$ and

$$\bar{W} = \frac{\sum_{i=1}^{k} (n_i - 3) W_i}{\sum_{i=1}^{k} (n_i - 3)} \qquad (11.10)$$

$H_0$ is rejected if $\chi^2 > \chi^2_{1-\alpha,k-1}$. Rejection of this hypothesis implies that at least one of the $\rho_i$'s is not equal to the other $\rho_j$'s for $i \neq j$.

If the hypothesis that all of the correlation coefficients are equal is not rejected, it may be desirable to calculate a "best" combined estimate $\bar{r}$ of the common correlation $\rho$ ("best" means weighted with inverse variance). Such an estimate is given by

where W is given by equation 11.10

and

Example 11.1. Burges and Johnson (1973) present the following sample correlation coefficients for monthly flow volumes for the Sauk River in Washington and Arroyo Seco in California. In the following table $r_j$ represents the sample correlation coefficient between the monthly flow volumes in months j and j - 1. Assume the coefficients are based on 30 observations each and that the parent populations are all bivariate normal (Burges and Johnson actually used the lognormal distribution in their study). 1) Test the hypothesis that $\rho_8$ for the Sauk River is equal to 0.50. 2) Compute the 95% confidence limits for $\rho_8$ of the Sauk River. 3) Test the hypothesis that $\rho_5$ on Arroyo Seco is zero. 4) Test the hypothesis that on each of the streams all of the monthly correlation coefficients are equal. 5) Assume the hypothesis in part 4 is accepted for the Sauk River and estimate an average correlation coefficient for the Sauk River.

Month        j     Sauk River    Arroyo Seco
October      1     0.61          0.00
November     2     0.58          0.00
December     3     0.50          0.00
January      4     0.31          0.45
February     5     0.38          0.21
March        6     0.37          0.70
April        7     0.44          0.60
May          8     0.34          0.75
June         9     0.17          0.98
July         10    0.65          0.97
August       11    0.93          0.96
September    12    0.51          0.00

Solution: 1) $H_0$: $\rho_8 = 0.5$ for Sauk River. From equation 11.6, $z = (W - \omega)(n - 3)^{1/2}$ where

W = arctanh r = arctanh(0.34) = 0.35409
$\omega$ = arctanh $\rho$ = arctanh(0.50) = 0.54931

so that $z = (0.35409 - 0.54931)(27)^{1/2} = -1.014$. Since $|z| < 1.96$, we cannot reject $H_0$: $\rho_8 = 0.5$ for the Sauk River.

2) The 95% confidence limits on $\rho_8$ for the Sauk River are calculated from equation 11.7 as

$$\tanh\left[0.35409 - \frac{1.96}{(27)^{1/2}}\right] < \rho_8 < \tanh\left[0.35409 + \frac{1.96}{(27)^{1/2}}\right] \quad\text{or}\quad -0.023 < \rho_8 < 0.624$$

3) $H_0$: $\rho_5 = 0$ for Arroyo Seco is tested by using equation 11.3.

$$t = 0.21\left[\frac{28}{1 - (0.21)^2}\right]^{1/2} = 1.137 < t_{0.975,28} = 2.048$$

Therefore we cannot reject $H_0$: $\rho_5 = 0$.

4) $H_0$: all $\rho_j$ are equal is tested by using equation 11.9.

           Sauk River                        Arroyo Seco
i     r_i     W_i = arctanh r_i     r_i     W_i = arctanh r_i
1     0.61    0.71                  0.00    0.00
2     0.58    0.66                  0.00    0.00
3     0.50    0.55                  0.00    0.00
4     0.31    0.32                  0.45    0.49
5     0.38    0.40                  0.21    0.21
6     0.37    0.39                  0.70    0.87
7     0.44    0.47                  0.60    0.69
8     0.34    0.35                  0.75    0.97
9     0.17    0.17                  0.98    2.30
10    0.65    0.78                  0.97    2.09
11    0.93    1.66                  0.96    1.95
12    0.51    0.56                  0.00    0.00

Sauk River $\bar{W} = 0.585$
Arroyo Seco $\bar{W} = 0.798$

Sauk River $\chi^2 = 27[5.707 - 12(0.585)^2] = 43.208$
Arroyo Seco $\chi^2 = 27[15.919 - 12(0.798)^2] = 223.489$

Therefore $H_0$ is rejected for both the rivers.

5) An average correlation coefficient for the Sauk River is calculated from equation 11.11.

where W = 0.585 m=

C [(i - 3 ) 2 (ni - 3)

1 ) -- 27/29 = 1/29 27

Comment: In parts 4 and 5 of this problem several simplifications were made in the summations since $n_i$ was equal to 30 for all i; in general this cannot be done. In part 5 an overall average correlation coefficient was calculated. Since in part 4 it was shown that the correlations for the various months are significantly different, the utility of an overall average correlation is suspect.

Graybill (1961) presents the exact probability distribution of r and states that for small samples, the exact distribution should be used in hypothesis testing. References to tables that aid in hypothesis testing for small samples and examples of their use are also given.

Again, it is emphasized that the above tests are based on a random sample from multivariate normal distributions. Even under these conditions, only the test of $H_0$: $\rho = 0$ conducted using equation 11.3 is "exact". The other tests are approximate, with the approximation improving as the sample size increases. For non-normal populations, it may be possible to transform the variables to a normal situation and then apply the above tests to the transformed data. If a transformation of a non-normal random variable is not possible or not desired, then the above tests must be considered as approximate, with the approximation becoming poorer as the coefficient of skew of the random variables increases.
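The test that k correlation coefficients are equal can be sketched as follows. The sketch assumes the weighted forms $\bar{W} = \sum (n_i - 3)W_i / \sum (n_i - 3)$ and $\chi^2 = \sum (n_i - 3)(W_i - \bar{W})^2$ with $W_i = \operatorname{arctanh} r_i$; the function name is hypothetical, and the data are the Sauk River coefficients of example 11.1 with the assumed 30 observations each.

```python
import math


def homogeneity_chi_square(r_values, n_values):
    """Return (W_bar, chi_square) for H0: all correlation coefficients equal.

    chi_square is compared with a chi-square table value for k - 1
    degrees of freedom, where k = len(r_values).
    """
    weights = [n - 3 for n in n_values]
    w = [math.atanh(r) for r in r_values]
    w_bar = sum(wt * wi for wt, wi in zip(weights, w)) / sum(weights)
    chi2 = sum(wt * (wi - w_bar) ** 2 for wt, wi in zip(weights, w))
    return w_bar, chi2


# Sauk River monthly coefficients from example 11.1, 30 observations each.
sauk = [0.61, 0.58, 0.50, 0.31, 0.38, 0.37, 0.44, 0.34, 0.17, 0.65, 0.93, 0.51]
w_bar, chi2 = homogeneity_chi_square(sauk, [30] * len(sauk))
print(w_bar, chi2)
```

Because the example rounds each $W_i$ to two decimals before summing, the chi-square value computed here should be close to, but not exactly, the value reported in the example.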

SERIAL CORRELATION

It is not uncommon to find in a time series of hydrologic data that an observation at one time period is correlated with the observation in the preceding time period. Such correlation is termed serial correlation or autocorrelation. By definition, the elements of a sample of data possessing serial correlation are not random elements. A serially correlated sample of size n contains less information about a process than a completely random sample of size n. In a serially correlated sample, part of the information contained in each observation is already known through its correlation with the preceding observation. Such correlation can also exist between an observation at one time period and an observation k time periods earlier for k = 1, 2, . . .

In this discussion of serial correlation, it is assumed that observations are equally spaced in time and that the statistical properties of the process do not change with time (stationary process). The population serial correlation coefficient is denoted by $\rho(k)$ (and frequently called the autocorrelation coefficient) where k is the lag or number of time intervals between the observations being considered. The sample serial correlation coefficient will be given by r(k). The sample serial correlation coefficient for a sample of size n is given by

$$r(k) = \frac{\sum_{i=1}^{n-k} x_i x_{i+k} - \dfrac{\sum_{i=1}^{n-k} x_i \sum_{i=1}^{n-k} x_{i+k}}{n-k}}{\left[\sum_{i=1}^{n-k} x_i^2 - \dfrac{\left(\sum_{i=1}^{n-k} x_i\right)^2}{n-k}\right]^{1/2}\left[\sum_{i=1}^{n-k} x_{i+k}^2 - \dfrac{\left(\sum_{i=1}^{n-k} x_{i+k}\right)^2}{n-k}\right]^{1/2}} \qquad (11.12)$$

From equation 11.12 it is seen that r(0) is unity. That is, the correlation of an observation with itself is 1. Equation 11.12 also demonstrates that as k increases, the number of pairs of observations used in estimating r(k) decreases because all of the summations contain n - k terms. Serial correlation should only be estimated for k considerably less than n.

If $\rho(k) = 0$ for all $k \neq 0$, the process is said to be a purely random process. This indicates that all of the observations in a sample will be independent of each other. Chapter 14, Yevjevich (1972b), Matalas (1966, 1967b), Julian (1967), and others treat hydrologic time series in more detail.

Anderson (1942) has proposed a test of significance for the serial correlation coefficient for a circular, normal, stationary time series. A circular series is one that closes on itself so that $x_n$ is followed by $x_1$. Under these assumptions

$$r(k) = \frac{\sum_{i=1}^{n} x_i x_{i+k} - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{\sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}} \qquad (11.13)$$

Although the assumption of a circular series is unrealistic, values of r(k) from equation 11.13 will not differ greatly from those calculated from equation 11.12 if n is large in comparison to k. Under these conditions r(k) will be approximately normally distributed with mean $-1/(n - 1)$ and variance $(n - 2)/(n - 1)^2$ if $\rho(k) = 0$. The confidence limits on $\rho(k)$ are then estimated by

$$\frac{-1 - z_{1-\alpha/2}(n-2)^{1/2}}{n-1} < \rho(k) < \frac{-1 + z_{1-\alpha/2}(n-2)^{1/2}}{n-1} \qquad (11.14)$$

If the calculated r(k) falls outside these confidence limits, the hypothesis that $\rho(k)$ is zero [$H_0$: $\rho(k) = 0$ versus $H_a$: $\rho(k) \neq 0$] is rejected.

Example 11.2. Frequently, in the analysis of runoff volumes, one finds there is significant serial correlation caused by storages on the watershed. Appendix C contains a listing of the monthly and annual runoff volumes for Cave Creek near Lexington, Kentucky. Test the hypothesis that $\rho(1) = 0$ for the annual runoff volumes.

Solution: This solution assumes $\alpha = 0.05$ and is based on equation 11.14, and therefore assumes that the annual runoff is normally distributed and is a stationary time series. Furthermore, $\rho(1)$ is estimated from equation 11.13 assuming that the series is circular [in this case this is equivalent to assuming $x_{n+1} = x_1$ in calculating r(1)].

Since $-0.520 < r(1) < 0.402$, $H_0$: $\rho(1) = 0$ is not rejected.

Comment: From the width of the confidence interval, it is apparent that the above test is not very powerful for small samples. A sample of around 400 observations would be required to reject $H_0$: $\rho(k) = 0$ if r(k) = 0.1.
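The serial correlation calculation and the limits used in the test can be sketched as below. This is an illustrative sketch only: `serial_correlation` uses the n - k overlapping pairs as described for equation 11.12, and `anderson_limits` builds limits from the stated mean $-1/(n-1)$ and variance $(n-2)/(n-1)^2$. Both function names are hypothetical, and the record length n = 18 used in the demonstration is an assumption inferred from the limits quoted in example 11.2, not a value stated in the text.

```python
import math
from statistics import NormalDist


def serial_correlation(x, k):
    """Lag-k sample serial correlation from the n - k overlapping pairs."""
    m = len(x) - k
    a, b = x[:m], x[k:k + m]
    sum_a, sum_b = sum(a), sum(b)
    num = sum(ai * bi for ai, bi in zip(a, b)) - sum_a * sum_b / m
    var_a = sum(ai * ai for ai in a) - sum_a * sum_a / m
    var_b = sum(bi * bi for bi in b) - sum_b * sum_b / m
    return num / math.sqrt(var_a * var_b)


def anderson_limits(n, alpha=0.05):
    """Approximate limits on r(k) when rho(k) = 0 for a series of length n."""
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    centre = -1.0 / (n - 1)
    half_width = z_crit * math.sqrt(n - 2) / (n - 1)
    return centre - half_width, centre + half_width


lower, upper = anderson_limits(18)   # n = 18 is an assumed record length
print(round(lower, 3), round(upper, 3))
print(anderson_limits(400))          # limits narrow slowly as n grows
```

Evaluating the limits for several values of n gives a quick check on the comment about how large a sample is needed before r(k) = 0.1 becomes significant.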


Matalas (1967b) has suggested that for hydrologic data r(1) tends to be greater than zero due to persistence caused by storage. If r(1) is found to be less than zero, it is in many cases difficult to explain hydrologically. In this case one might take r(1) as equal to zero.

Matalas and Langbein (1962) state that in an autocorrelated series, each observation represents part of the information contained in the previous observation. They discuss stationary time series having $r(1) \neq 0$ and r(i) = 0 for i = 2, 3, . . . They state that n observations of a nonrandom series having r(1) > 0 give only as much information (measured in terms of a variance) about the mean as some lesser number, $n_e$, of observations in a purely random time series. This lesser number of observations is called the effective number of observations and is given by

$$n_e = \frac{n}{\dfrac{1 + r(1)}{1 - r(1)} - \dfrac{2r(1)[1 - r(1)^n]}{n[1 - r(1)]^2}} \qquad (11.15)$$

If r(1) = 0, then $n_e = n$. If r(1) > 0, then $n_e < n$. Equation 11.15 is expressed graphically as figure 11.1. As an example, a 50-year record for which r(1) = 0.2 contains only as much information about the mean as a 33-year record with r(1) = 0. Note that if n is large or r(1) small, the second term in the denominator of equation 11.15 can be neglected with little loss in accuracy.
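The effective number of observations is easy to evaluate numerically. The sketch below assumes the form $n_e = n / \{[1 + r(1)]/[1 - r(1)] - 2r(1)[1 - r(1)^n]/(n[1 - r(1)]^2)\}$, which reproduces the 50-year example quoted above; the function name `effective_n` is hypothetical.

```python
def effective_n(n, r1):
    """Effective number of independent observations for lag-one correlation r1."""
    if r1 == 0:
        return float(n)
    first = (1.0 + r1) / (1.0 - r1)
    second = 2.0 * r1 * (1.0 - r1 ** n) / (n * (1.0 - r1) ** 2)
    return n / (first - second)


print(effective_n(50, 0.2))          # roughly a 33-year equivalent record
print(50 / ((1 + 0.2) / (1 - 0.2)))  # same, neglecting the second term
```

Comparing the two printed values shows how little the second term in the denominator matters for a record of this length.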

Fig. 11.1. Relation between n and $n_e$ for various values of $\rho(1)$ (after Matalas and Langbein 1962).

CORRELATION AND REGIONAL ANALYSIS

Matalas and Langbein (1962), Yevjevich (1972a), Alexander (1954), and others demonstrate that the information relative to estimating the regional mean contained in data from n stations in a region having an average interstation correlation of $\bar{\rho}$ is equivalent to the information contained in n' uncorrelated stations in the region where n' is given by

$$n' = \frac{n}{1 + \bar{\rho}(n - 1)} \qquad (11.16)$$

As n gets large, n' approaches $1/\bar{\rho}$. For a $\bar{\rho}$ of 0.2, the maximum information about the regional mean contained in n stations could not exceed the information contained in 5 uncorrelated stations. From a consideration of equation 11.16, it seems it would be logical to establish relatively few independent hydrologic stations in a region rather than several correlated stations. However, by the very concept of a hydrologic region, the hydrologic characteristics may be correlated.

Correlation within a region can be exploited to yield improved estimates of a particular hydrologic variable at a point through correlation with another hydrologic variable at that point or a similar characteristic at another point. For instance, let Y and X represent two random hydrologic variables having no serial correlation for which $n_1$ and $n_1 + n_2$ observations, respectively, are available. Also consider that Y and X are correlated with a correlation coefficient of $r_{yx}$. Now, the record on Y can be extended by using the correlation between Y and X. This relation is merely a simple regression considering Y as the dependent and X the independent variable. The relation is developed based on the $n_1$ common observations. From equation 9.15 it can be shown that the regression between Y and X is given by

$$\hat{y} = r_{yx}\frac{s_y}{s_x}x \qquad (11.17)$$

where $r_{yx}$ is the estimate for $\rho_{yx}$ and y and x are deviations from their respective means. Now $n_2$ estimates of Y can be computed from equation 11.17 based on the $n_2$ observations on X not common to the observations on Y. Let $\bar{Y}_1$ and $\bar{Y}_2$ represent the mean of Y based on the original $n_1$ observations and the $n_2$ estimated observations, respectively. A new weighted mean for Y based on $n_1 + n_2$ observations can now be computed from

$$\bar{Y} = \frac{n_1\bar{Y}_1 + n_2\bar{Y}_2}{n_1 + n_2} \qquad (11.18)$$

For the $n_2$ additional observations to improve the estimate of $\bar{Y}$, it is necessary that $r_{yx}$ be greater than $1/(n_1 - 2)$ (Matalas and Langbein 1962).

If the random variables Y and X contain significant serial correlation, the situation is somewhat more complex. Matalas and Langbein (1962), Matalas and Rosenblatt (1962), and Yevjevich (1972a) contain treatments of this case. In general, serial correlation serves to decrease the information relative to the mean while cross-correlation tends to improve information relative to the mean.
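Both ideas in this section can be sketched briefly. The first function assumes the form $n' = n/[1 + \bar{\rho}(n - 1)]$, which approaches $1/\bar{\rho}$ as n grows; the second extends a short record on Y by regressing Y on X over the common period and then forming the weighted mean of the observed and estimated parts. The function names and the small data set are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def equivalent_stations(n, rho_bar):
    """Number of uncorrelated stations carrying the same information about
    the regional mean as n stations with average correlation rho_bar."""
    return n / (1.0 + rho_bar * (n - 1))


def extended_mean(y_common, x_common, x_extra):
    """Weighted mean of Y after extending its record from X by regression.

    y_common and x_common are the n1 concurrent observations; x_extra holds
    the n2 observations on X with no matching observation on Y.
    """
    n1, n2 = len(y_common), len(x_extra)
    x_bar = sum(x_common) / n1
    y_bar = sum(y_common) / n1
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(x_common, y_common))
    sxx = sum((x - x_bar) ** 2 for x in x_common)
    slope = sxy / sxx  # algebraically equal to r_yx * s_y / s_x
    y_estimated = [y_bar + slope * (x - x_bar) for x in x_extra]
    y2_bar = sum(y_estimated) / n2
    return (n1 * y_bar + n2 * y2_bar) / (n1 + n2)


print(equivalent_stations(20, 0.2))     # already close to the limit of 5
print(extended_mean([2.0, 4.0, 6.0], [1.0, 2.0, 3.0], [4.0, 5.0]))
```

Whether the extended mean is actually an improvement depends on the strength of the correlation relative to the length of the common record, as noted in the text.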

CORRELATION AND CAUSE AND EFFECT

At this point it should be apparent that a high correlation between two variables does not necessarily imply that there is a cause-and-effect relation between the variables. The fact that the monthly flows on adjacent small streams are correlated does not mean that changes in the monthly flow of one stream cause a corresponding change in the other stream. More likely both changes are caused by the same external factors operating on both watersheds.

Again, it is emphasized that independent variables are uncorrelated and correlated variables are not necessarily related through cause and effect. The dependence in correlated variables is a stochastic dependence and not a physical or cause-and-effect dependence. Dependence and correlation are linear properties. Dependence among variables may be strong and nonlinear in the presence of a nonsignificant (linear) correlation coefficient.

SPURIOUS CORRELATION

Spurious correlation is any apparent correlation between variables that are in fact uncorrelated. Spurious correlation can arise due to clustering of data. For example, in figure 11.2, the correlation of Y with X within either of the data clusters is near zero. When the data from both clusters are used to calculate a single correlation coefficient, this correlation is found to be quite high. This is spurious correlation.

Figure 11.3 shows a plot of Y versus X where both $Y_i$ and $X_i$ are random variables obtained by adding 11 to a random observation from a standard normal distribution. For a sufficiently large sample $r_{X,Y}$ would be zero. If both $Y_i$ and $X_i$ are divided by yet a third random observation $Z_i$, obtained in the same manner as $X_i$ and $Y_i$, and the correlation between $Y_i/Z_i$ and $X_i/Z_i$ computed, for a sufficiently large sample the correlation will be near 0.5. Figure 11.4 is a plot of $Y_i/Z_i$ versus $X_i/Z_i$. Figure 11.4 indicates that $X_i$ furnishes information useful in estimating $Y_i$ when in fact $Y_i$ and $X_i$ are uncorrelated. The correlation between $Y_i/Z_i$ and $X_i/Z_i$ is spurious.
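The ratio experiment described above is simple to repeat by simulation. The following sketch is not the procedure used to draw the figures; it assumes a sample of 20,000 triples, a fixed random seed, and a hypothetical helper `pearson_r`, and it follows the text in generating each variable as 11 plus a standard normal deviate.

```python
import math
import random


def pearson_r(a, b):
    """Sample correlation coefficient between two equal-length lists."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    sab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    saa = sum((ai - a_bar) ** 2 for ai in a)
    sbb = sum((bi - b_bar) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)


random.seed(1)      # assumed seed, only so that the run is repeatable
N = 20000           # assumed sample size
x = [11.0 + random.gauss(0.0, 1.0) for _ in range(N)]
y = [11.0 + random.gauss(0.0, 1.0) for _ in range(N)]
z = [11.0 + random.gauss(0.0, 1.0) for _ in range(N)]

r_xy = pearson_r(x, y)
r_ratio = pearson_r([xi / zi for xi, zi in zip(x, z)],
                    [yi / zi for yi, zi in zip(y, z)])
print(r_xy)      # near zero: X and Y are independent
print(r_ratio)   # near 0.5: induced by the common divisor Z
```

The second correlation arises entirely from dividing both variables by the same Z, which is the point of the example.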

Fig. 11.2. Spurious correlation due to data clustering.

Fig. 11.3. Absence of correlation between two random variables.

Fig. 11.4. Spurious correlation introduced by dividing 2 random variables by a common third random variable.

Pearson (1896-1897) investigated the spurious correlation that can arise between ratios. Let $Y = X_1/X_2$ and $Z = X_3/X_4$. The correlation between Y and Z, $r_{YZ}$, was found to be a function of the variances, covariances, and means of the X's. Pearson's derivation assumed that the X's were normally distributed and that the coefficient of variation of each X was small enough so that its third and higher powers could be neglected. Reed (1921) arrived at the same results without specifying the parent distribution of the X's. Pearson's general formula is

$$r_{YZ} = \frac{r_{13}C_1C_3 - r_{14}C_1C_4 - r_{23}C_2C_3 + r_{24}C_2C_4}{\left(C_1^2 + C_2^2 - 2r_{12}C_1C_2\right)^{1/2}\left(C_3^2 + C_4^2 - 2r_{34}C_3C_4\right)^{1/2}} \qquad (11.19)$$

where $r_{ij}$ is the correlation between $X_i$ and $X_j$, and $C_i$ is the coefficient of variation of $X_i$. Chayes (1949) and Benson (1965) considered many special cases of equation 11.19. For example, if $X_2 = X_4$, $r_{12} = r_{13} = r_{34} = 0$, $r_{24} = 1$, and $C_1 = C_2 = C_3 = C_4$, equation 11.19 reduces to $r_{YZ} = 0.5$, which is the case shown in figure 11.4. Benson (1965) produced a table (Table 11.1) showing many special cases of ratio and product correlations.

Spurious correlation can arise in hydrology when dimensionless terms or standardized variables are used. Benson (1965) presents several examples of possible spurious correlation in hydrology.
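The special case quoted above can be checked numerically. The sketch assumes the commonly cited form of Pearson's approximation, $r_{YZ} = (r_{13}C_1C_3 - r_{14}C_1C_4 - r_{23}C_2C_3 + r_{24}C_2C_4)/[(C_1^2 + C_2^2 - 2r_{12}C_1C_2)^{1/2}(C_3^2 + C_4^2 - 2r_{34}C_3C_4)^{1/2}]$; the function name and argument order are hypothetical.

```python
import math


def ratio_correlation(c1, c2, c3, c4, r12, r13, r14, r23, r24, r34):
    """Approximate correlation between X1/X2 and X3/X4.

    c1..c4 are coefficients of variation; rij is the correlation
    between Xi and Xj.
    """
    num = r13 * c1 * c3 - r14 * c1 * c4 - r23 * c2 * c3 + r24 * c2 * c4
    den = (math.sqrt(c1 ** 2 + c2 ** 2 - 2.0 * r12 * c1 * c2)
           * math.sqrt(c3 ** 2 + c4 ** 2 - 2.0 * r34 * c3 * c4))
    return num / den


# Common divisor (X2 = X4, so r24 = 1), all other pairs uncorrelated,
# equal coefficients of variation.
c = 1.0 / 11.0
print(ratio_correlation(c, c, c, c, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0))
```

With a much less variable divisor (small $C_2 = C_4$ relative to $C_1$ and $C_3$) the same function gives a correlation close to zero, which indicates when the spurious effect is negligible.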

Exercises

11.1. Calculate the first-order serial correlation coefficients for the sediment load and annual discharge data for the Green River at Munfordville, Kentucky. Test the hypothesis that these two correlations are equal. Discuss the assumptions you have made and how they affect the validity of the tests you have made.

11.2. Calculate the correlation between the sediment load and annual discharge for the Green River at Munfordville, Kentucky. Test the hypothesis that this correlation is equal to 0.50.

11.3. Verify the "comment" of example 11.2.

11.4. Calculate the first-order serial correlation coefficient for the Spray River, Banff, Canada. Test the hypothesis that the first-order serial correlation is zero.

11.5. Work exercise 11.4 for the Piscataquis River near Dover-Foxcroft, Maine.

11.6. If the annual runoff from the Spray River, Banff, Canada, is normally distributed, how many independent observations would provide as much information relative to estimating the mean annual runoff as does the 45 years of actual record?

11.7. Work exercise 11.6 for the Piscataquis River near Dover-Foxcroft, Maine, and its 54 years of record.

Table 11.1. Special cases of ratio and product correlations (Benson 1965).

11.8. The following data were collected on two streams in southeastern Kentucky. Use the data to extend the peak flow record of Cave Branch through 1972. Estimate the average peak flow for the entire record plus estimated record for Cave Branch. Is this estimated average an improvement over an estimate based on the actual observed record of Cave Branch?

Peak Flow Data

Year    Cave Branch    Helton Branch

12. Multivariate Analysis

THERE ARE several multivariate data analysis techniques that may prove useful in working with hydrologic data. The treatment here is concerned with the principles and techniques of principal components analysis, cluster analysis, and multivariate regression analysis. Multivariate techniques have been around for some time but have found limited use in hydrologic applications. For a more complete treatment of multivariate analysis, especially the inferential aspects of multivariate analysis, reference should be made to the books by Morrison (1967), Press (1972), Cooley and Lohnes (1971), Harman (1967), Anderson (1958), Harris (1975), and Karson (1982). Snyder (1962) and Wong (1963) were among the first to apply multivariate analysis to hydrology.

NOTATION

In this chapter an uppercase underlined letter will denote a matrix and a lowercase underlined letter will denote a column vector. Thus $\underline{Z}$ could be an n × p matrix made up of p n × 1 column vectors $\underline{z}_j$ for j = 1, 2, ..., p.

PRINCIPAL COMPONENTS

Often, when data are collected on p variables, these p variables are correlated. This correlation indicates that some of the information contained in one variable is also contained in some of the other p - 1 variables. The objective of principal components analysis is to transform the p original correlated variables into p uncorrelated, or orthogonal, components. These components are linear functions of the original variables. Such a transformation can be written

$$\underline{Z} = \underline{X}\,\underline{A} \qquad (12.1)$$

where $\underline{X}$ is an n × p matrix of n observations on p variables. Because we will be dealing with variances and covariances, all X's will be assumed to be deviations from their respective means so that $\underline{X}$ is a matrix of deviations from means. $\underline{Z}$ is an n × p matrix of n values for each of p components, and $\underline{A}$ is a p × p matrix of coefficients defining the linear transformation.

Because the original p-variate set of observations $\underline{X}$ contains correlation, it might be possible to characterize the variance of $\underline{X}$ with q < p orthogonal components. Thus, it is desired to construct $\underline{Z}$ so that each component, $\underline{z}_j$ (an n × 1 column vector), explains the maximum amount of the variance of $\underline{X}$ left unexplained by the first j - 1 components. In this way it may be found that the first q components explain most of the system variance and that the last p - q components explain little of the system variance.

The total variance of $\underline{X}$ is defined to be the sum of the variances of the p variables contained in $\underline{X}$. The variance-covariance matrix of $\underline{X}$ is defined to be the p × p matrix $\underline{\Sigma}$ where $\underline{\Sigma} = [\sigma_{i,j}]$ and $\sigma_{i,j}$ is the covariance of the ith and jth variables in $\underline{X}$ for $i \neq j$ and $\sigma_{i,i}$ is the variance of the ith variable. $\underline{\Sigma}$ is estimated by $\underline{S}$ whose elements are given by

$$s_{i,j} = \frac{\sum_{k=1}^{n} x_{k,i}x_{k,j}}{n - 1} \qquad (12.2)$$

The total system variance, V, is defined as the sum of the variances of the original variables and can be estimated as

$$V = \text{Trace } \underline{S} = \sum_{i=1}^{p} s_{i,i} \qquad (12.3)$$

The jth principal component, $\underline{z}_j$, is the linear function

$$\underline{z}_j = \underline{X}\,\underline{a}_j \qquad (12.4)$$

where $\underline{z}_j$ is an n × 1 and $\underline{a}_j$ a p × 1 column vector. $\underline{z}_j$ can also be written $\underline{z}_j = [z_{i,j}]$ where

$$z_{i,j} = \sum_{k=1}^{p} x_{i,k}a_{k,j} \qquad (12.5)$$

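The variance of $\underline{z}_j$ is found from

$$\text{Var}(\underline{z}_j) = \underline{a}_j'\underline{\Sigma}\,\underline{a}_j \qquad (12.6)$$

and may be estimated by $\underline{a}_j'\underline{S}\,\underline{a}_j$. Note that this is simply a matrix equivalent of equation 3.58 for the variance of a linear function. The variance of the first principal component $\underline{z}_1$ is estimated by

$$\text{Var}(\underline{z}_1) = \underline{a}_1'\underline{S}\,\underline{a}_1 \qquad (12.7)$$

$\underline{z}_1$ is thus defined by the vector $\underline{a}_1$ that maximizes the variance of $\underline{z}_1$ subject to the constraint that $\underline{a}_1'\underline{a}_1 = 1$. This is a normalizing constraint without which there would be no unique solution. Equation 12.7 can be maximized by using the Lagrangian multiplier $\lambda_1$ to introduce the constraint $\underline{a}_1'\underline{a}_1 = 1$. Let

$$Q = \underline{a}_1'\underline{S}\,\underline{a}_1 - \lambda_1(\underline{a}_1'\underline{a}_1 - 1)$$

Q is maximized by differentiation.

$$\frac{\partial Q}{\partial \underline{a}_1} = 2\underline{S}\,\underline{a}_1 - 2\lambda_1\underline{a}_1 = \underline{0} \quad\text{or}\quad (\underline{S} - \lambda_1\underline{I})\underline{a}_1 = \underline{0} \qquad (12.8)$$

For the solution of equation 12.8 to be other than the trivial solution $\underline{a}_1 = \underline{0}$ we must have

$$|\underline{S} - \lambda_1\underline{I}| = 0 \qquad (12.9)$$

This is a classical characteristic value problem. $\lambda_1$ is called the characteristic root and $\underline{a}_1$ the characteristic vector of $\underline{S}$. Equation 12.9 has p solutions for $\lambda_1$. This is easily seen by considering the special case of $\underline{S}$ to be a 2 × 2 matrix, in which case equation 12.9 becomes

$$\begin{vmatrix} s_{1,1} - \lambda_1 & s_{1,2} \\ s_{2,1} & s_{2,2} - \lambda_1 \end{vmatrix} = 0$$

or $(s_{1,1} - \lambda_1)(s_{2,2} - \lambda_1) - s_{2,1}s_{1,2} = 0$. This is a quadratic equation in $\lambda_1$ having 2 solutions. Special properties of $\underline{S}$ guarantee that the p solutions will be real. Multiplying equation 12.8 by $\underline{a}_1'$ results in

$$\underline{a}_1'\underline{S}\,\underline{a}_1 - \lambda_1\underline{a}_1'\underline{a}_1 = 0 \quad\text{or}\quad \underline{a}_1'\underline{S}\,\underline{a}_1 = \lambda_1 \qquad (12.10)$$

Because our objective was to maximize $\underline{a}_1'\underline{S}\,\underline{a}_1$, the desired solution to equation 12.9 is the largest characteristic root (the largest value) for $\lambda$.

Equations 12.7 and 12.10 demonstrate the important point that

$$\text{Var}(\underline{z}_1) = \lambda_1$$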

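Having found the characteristic root, $\lambda_1$, of $\underline{S}$, the characteristic vector, $\underline{a}_1$, is found from equation 12.8 using the constraint that $\underline{a}_1'\underline{a}_1 = 1$, which is equivalent to $\sum_{i=1}^{p} a_{i,1}^2 = 1$.

The second principal component is found in a similar manner. Now it is desired to find $\underline{a}_2$ such that $\text{Var}(\underline{z}_2) = \underline{a}_2'\underline{S}\,\underline{a}_2$ is maximized subject to the constraints that $\underline{a}_2'\underline{a}_2 = 1$ and $\underline{a}_1'\underline{a}_2 = \underline{a}_2'\underline{a}_1 = 0$. This latter constraint guarantees that $\underline{z}_1$ and $\underline{z}_2$ are orthogonal (uncorrelated). Using a procedure similar to the above for $\underline{a}_1$, let Q be

$$Q = \underline{a}_2'\underline{S}\,\underline{a}_2 - \lambda_2(\underline{a}_2'\underline{a}_2 - 1) - \gamma\,\underline{a}_1'\underline{a}_2$$

Differentiation with respect to $\underline{a}_2$ gives

$$2\underline{S}\,\underline{a}_2 - 2\lambda_2\underline{a}_2 - \gamma\,\underline{a}_1 = \underline{0} \qquad (12.12)$$

Premultiplication by $\underline{a}_1'$ results in

$$2\underline{a}_1'\underline{S}\,\underline{a}_2 - \gamma = 0 \qquad (12.13)$$

Due to orthogonality, premultiplication of equation 12.8 by $\underline{a}_2'$ results in $\underline{a}_2'\underline{S}\,\underline{a}_1 = 0$. Substituting this into equation 12.13 results in $\gamma = 0$. Therefore, from equation 12.12 we have

$$(\underline{S} - \lambda_2\underline{I})\underline{a}_2 = \underline{0} \qquad (12.14)$$

from which it follows that $\underline{a}_2$, the coefficients of the second largest principal component, are the coefficients of the characteristic vector associated with the second largest characteristic root of $\underline{S}$. Premultiplying equation 12.14 by $\underline{a}_2'$ also results in $\lambda_2 = \underline{a}_2'\underline{S}\,\underline{a}_2 = \text{Var}(\underline{z}_2)$.

In general, the jth principal component of the p-variate sample $\underline{X}$ is the linear function $\underline{z}_j = \underline{X}\,\underline{a}_j$ where $\underline{a}_j$ are the elements of the characteristic vector associated with the jth largest characteristic root of $\underline{S}$.

From equation 12.1 we can find $\underline{Z}'\underline{Z}$ as $\underline{Z}'\underline{Z} = (\underline{X}\,\underline{A})'(\underline{X}\,\underline{A}) = \underline{A}'\underline{X}'\underline{X}\,\underline{A} = (n - 1)\underline{A}'\underline{S}\,\underline{A}$. It can be easily shown (see equations 12.24-12.28) that $\underline{A}'\underline{S}\,\underline{A}$ is a diagonal p × p matrix with the ith diagonal element equal to $\lambda_i$. This matrix may also be written as $\underline{A}'\underline{S}\,\underline{A} = \underline{D}_\lambda$ where $\underline{D}_\lambda$ is the diagonal matrix whose diagonal elements are the characteristic roots of $\underline{S}$. One property of matrices is that if $\underline{E}$ is an orthogonal matrix, then the trace of $\underline{E}'\underline{F}\,\underline{E}$ equals the trace of $\underline{F}$. Therefore

$$\text{Trace}(\underline{D}_\lambda) = \text{Trace}(\underline{A}'\underline{S}\,\underline{A}) = \text{Trace}(\underline{S}) = V$$

However

$$\text{Trace}(\underline{D}_\lambda) = \sum_{i=1}^{p} \lambda_i = \sum_{i=1}^{p} \text{Var}(\underline{z}_i)$$

The sum of the characteristic roots, which equals the sum of the variances of the principal components, also equals the total system variance.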

The covariance between $\underline{z}_i$ and $\underline{z}_j$ is $\text{Cov}(\underline{z}_i, \underline{z}_j) = \text{Cov}(\underline{X}\,\underline{a}_i, \underline{X}\,\underline{a}_j) = \underline{a}_i'\underline{S}\,\underline{a}_j = 0$ for $i \neq j$. Therefore, $\underline{z}_i$ and $\underline{z}_j$ are uncorrelated. Some important properties obtained thus far are:

1. $\underline{z}_i$ and $\underline{z}_j$ are uncorrelated for $i \neq j$

4. $\sum_{i=1}^{p} \text{Var}(\underline{z}_i) = \sum_{i=1}^{p} \lambda_i = \text{Trace } \underline{S} = V$

5. $\underline{Z} = \underline{X}\,\underline{A}$ where

From item 4 above, it can be seen that the fraction of the total variance accounted for by the jth principal component is $\lambda_j/V$. In many situations the first q components account for a large fraction (say 90% or more) of the system variance, indicating that the last p - q components are not needed in terms of explaining variance. Many times these last p - q components are discarded with the effect that the problem has been reduced from one of dealing with an n × p $\underline{X}$ matrix containing correlation to dealing with an n × q (q < p) $\underline{Z}$ matrix that is orthogonal.

The question of how many components are needed to satisfactorily explain the system variance or what part of the total system variance should be explained is an unresolved one. Morrison (1967) suggests that only the first 4 or 5 components should be extracted since later components will be difficult to physically interpret in terms of the problem at hand. Unfortunately, there are no statistical tests that can be used to determine the significance of a component. The sampling theory of principal components is not well developed, especially when the components are extracted from the correlation matrix rather than the covariance matrix as in later examples.

The covariance between the original variables, $\underline{X}$, and the principal components, $\underline{Z}$, is given by $\text{Cov}(\underline{X}, \underline{Z}) = \text{Cov}(\underline{X}, \underline{X}\,\underline{A}) = \underline{S}\,\underline{A}$. The covariance between the variables and the jth component is given by

$$\text{Cov}(\underline{X}, \underline{z}_j) = \underline{S}\,\underline{a}_j$$

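From $(S - \lambda_j I)a_j = 0$ we have $S a_j = \lambda_j a_j$. Therefore, $\mathrm{Cov}(X, z_j) = \lambda_j a_j$. The covariance between the ith variable and the jth component is given by $\lambda_j a_{ij}$. The correlation between the ith variable and the jth component is

$$\mathrm{Cor}(x_i, z_j) = \frac{\mathrm{Cov}(x_i, z_j)}{[\mathrm{Var}(x_i)\,\mathrm{Var}(z_j)]^{1/2}}$$

The $\mathrm{Var}(x_i) = s_i^2$ and $\mathrm{Var}(z_j) = \lambda_j$. Therefore

$$\mathrm{Cor}(x_i, z_j) = \frac{\lambda_j a_{ij}}{s_i \lambda_j^{1/2}} = \frac{a_{ij}\lambda_j^{1/2}}{s_i} \qquad (12.17)$$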

This equation can be used to transform A into a p X p matrix of correlations between the ith observed variable and the jth computed component. These correlations can then be used in an attempt to assess the physical meaning of the components. In some situations, some of the p variables can be eliminated from further consideration by examining the correlations defined by equation 12.17. If a variable has no significant correlation with a component, then that variable is not contributing much to the variance of the component. By eliminating the variable from the component, the fraction of the system variance explained by the component would be changed very little. The difficulty here is that this variable may be correlated with a second component, in which case its elimination would decrease the variance explained by the second component. For these reasons, variables are generally eliminated only if they are not correlated with any of the q components retained for analysis.
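The relationships developed so far can be checked numerically. The following is a minimal sketch, not the book's own procedure: it assumes NumPy, uses randomly generated stand-in data (not the data of table 10.2), and the function name `pca_from_covariance` is hypothetical. It extracts the characteristic roots and vectors of S, forms Z = XA, and computes the variable-component correlations of equation 12.17.

```python
import numpy as np

def pca_from_covariance(X):
    """Principal components of the columns of X based on the covariance matrix S."""
    Xc = X - X.mean(axis=0)                  # deviations from the column means
    n = Xc.shape[0]
    S = Xc.T @ Xc / (n - 1)                  # sample covariance matrix
    roots, A = np.linalg.eigh(S)             # eigh is appropriate because S is symmetric
    order = np.argsort(roots)[::-1]          # largest characteristic root first
    roots, A = roots[order], A[:, order]
    Z = Xc @ A                               # component values, Z = XA
    frac = roots / roots.sum()               # fraction of variance, lambda_j / V
    s = np.sqrt(np.diag(S))                  # standard deviations s_i
    corr = A * np.sqrt(roots) / s[:, None]   # a_ij * sqrt(lambda_j) / s_i (eq. 12.17)
    return roots, A, Z, frac, corr

rng = np.random.default_rng(0)
# Hypothetical data: 13 observations on three variables with very different scales
X = rng.normal(size=(13, 3)) * np.array([1.0, 12.0, 0.5])
roots, A, Z, frac, corr = pca_from_covariance(X)
```

Under these assumptions, `roots.sum()` should equal the trace of S, `Z.T @ Z` should be (n - 1) times the diagonal matrix of roots, and each entry of `corr` should match the ordinary sample correlation between the corresponding column of X and column of Z.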

Example 12.1. Consider the data in table 10.2. Let X be a 13 × 3 matrix made up of 13 observations on A(mi²), S(%), and L(ft). Compute the principal components of X based on the covariance matrix. Compute the correlation between the variables and the components.

Solution: S is computed from equation 12.2.

$|S - \lambda I|$ is computed from

which simplifies to


The solutions to this cubic equation are

Note that $\sum \lambda_i = \mathrm{Trace}\,S = 161.557$. The first principal component accounts for $100\lambda_1/\mathrm{Trace}\,S = 100(155.963)/161.557 = 96.54\%$ of the total system variance. The coefficients of the characteristic vectors can be computed from equation 12.8. For example, for the first principal component we have

Solving these three equations simultaneously for $a_{1,1}$, $a_{2,1}$, and $a_{3,1}$ results in $a_{2,1} = -51.43a_{3,1}$ and $a_{1,1} = 1.5503a_{3,1}$. Using the constraint that $a_{1,1}^2 + a_{2,1}^2 + a_{3,1}^2 = 1$, the solution is $a_{1,1} = .030$, $a_{2,1} = -.999$, and $a_{3,1} = .020$. Similarly, for $\lambda_2$ and $\lambda_3$ we get

Thus,

The values for the principal components can now be calculated from

$$Z = XA$$

where X is composed of deviations from the mean, that is, deviations of A, S, and L from their means.


The correlation matrix between the variables and the components can be computed from equation 12.17. For example, the correlation between $x_2$ and $z_1$ is

$$\mathrm{Cor}(x_2, z_1) = \lambda_1^{1/2} a_{2,1}/s_2 = 155.963^{1/2}(-0.999)/155.769^{1/2} = -0.9995$$

The resulting correlation matrix is

Example 12.1 illustrates that using the S matrix in a principal component analysis presents some problems if the units of the X variables differ greatly. In example 12.1, the magnitudes of the observations associated with the second variable were much greater than those associated with the other two variables. Consequently, the variance of $x_2$ was much greater than either $\mathrm{Var}(x_1)$ or $\mathrm{Var}(x_3)$. $x_2$ accounted for $100\,\mathrm{Var}(x_2)/\mathrm{Trace}\,S$ or 96.4% of the system variance. This means that the first principal component is merely a restatement of $x_2$. This can also be seen from the fact that the correlation between $x_2$ and $z_1$ is -1.000.

In most hydrologic studies the problem of noncommensurate units on the X's has been handled by standardizing the X's through the transformation $(X_{ij} - \bar{X}_j)/s_j$. The covariance matrix S of the standardized variables becomes the correlation matrix R, as can be seen from equation 12.2. The principal components analysis is then done on R. The total system "variance" now becomes Trace R = p because R has 1's on the diagonal. The characteristic roots and vectors are determined from

$$(R - \lambda_j I)a_j = 0$$

and the numerical value of the components is computed from

$$Z = XA$$


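The correlation between the ith standardized variable and the jth component (equation 12.17) reduces to

$$\mathrm{Cor}(x_i, z_j) = a_{ij}\lambda_j^{1/2}$$

because the standard deviation of each standardized variable is 1.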

These correlations are sometimes called factor loadings. The factor loadings can be used to attach physical significance to the components. If a particular component is highly correlated with 1, 2, or 3 variables, then the component is a reflection of these variables. For example, in a study of watershed geomorphic factors, it might be found that a component is highly correlated with the average stream slope and the basin relief ratio. This being the case, that particular component might be termed a measure of watershed steepness.
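A short sketch of the correlation-matrix version of the analysis may help here. It assumes NumPy and uses made-up watershed values (area, slope, length) rather than the data of table 10.2; all variable names are illustrative. The variables are standardized, R is formed, and the factor loadings are computed as $a_{ij}\lambda_j^{1/2}$.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical watershed data: area, slope, and a length loosely tied to area
area = rng.uniform(1.0, 50.0, size=13)
slope = rng.uniform(0.5, 10.0, size=13)
length = 2.0 * np.sqrt(area) + rng.normal(0.0, 0.5, size=13)
X_raw = np.column_stack([area, slope, length])

# Standardize each variable: (X_ij - mean_j) / s_j
X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0, ddof=1)
n, p = X.shape
R = X.T @ X / (n - 1)                    # correlation matrix of the original variables

roots, A = np.linalg.eigh(R)             # characteristic roots and vectors of R
order = np.argsort(roots)[::-1]          # largest root first
roots, A = roots[order], A[:, order]

loadings = A * np.sqrt(roots)            # factor loadings a_ij * sqrt(lambda_j)
percent = 100.0 * roots / p              # percent of "variance"; Trace R = p
```

Because R has 1's on its diagonal, the squared loadings in each row of `loadings` sum to 1, which is a convenient check on a hand or spreadsheet calculation.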

Example 12.2. Repeat example 12.1 using R instead of S.

Solution:

which has solutions

$\lambda_1 = 1.9692$, $\lambda_2 = 0.9273$, $\lambda_3 = 0.1035$

In this formulation, $z_1$ accounts for 100(1.9692)/3 or 65.64% of the system "variance" whereas $z_2$ and $z_3$ account for 30.91% and 3.45%, respectively. The corresponding characteristic vectors are

The factor loadings computed from $a_{ij}\lambda_j^{1/2}$ are

Since component 1 is highly correlated with both area and length, this component might be called a "size" component. Likewise, component 2 might be called a slope component. In terms of explaining the "variance" of R, component 3 could be eliminated because it explains only 3.45% of the variance and is not correlated with any of the variables. We cannot eliminate any variables, however, because component 1 is strongly dependent on $X_1$ and $X_3$ whereas component 2 depends on $X_2$. In terms of explaining the variance of R, we have reduced our problem from one of considering a 13 × 3 X matrix with correlations to a 13 × 2 Z matrix without correlations (assuming $z_3$ is discarded). The values for the components are computed from

where

thus


REGRESSION ON PRINCIPAL COMPONENTS

Often a principal components analysis is the first step in the development of a prediction model for some dependent variable, Y. Once the principal components are derived, they are used as the independent variables in a multiple regression analysis with the dependent variable, Y. Because of the differing units usually present in the original independent variables, the principal components are generally abstracted from the correlation matrix. The steps in performing a multiple regression on principal components are outlined here.

First, the independent variables are standardized and the dependent variable centered so that $X = [x_{ij}]$ and $Y = [y_i]$ where

$$x_{ij} = (X_{ij} - \bar{X}_j)/s_j \quad \text{and} \quad y_i = Y_i - \bar{Y} \qquad (12.21)$$

where $Y_i$ is the ith observation on Y, $\bar{Y}$ is the mean of Y, $X_{ij}$ is the ith observation on the jth variable, and $\bar{X}_j$ and $s_j$ are the mean and standard deviation of the jth variable. Centering Y is not necessary, but it eliminates the need for an intercept and simplifies notation. The matrix of principal components, Z, is determined from $Z = XA$ with A being a p × p matrix whose jth column is the characteristic vector $a_j$ computed from equation 12.18 with $R = X'X/(n - 1)$. The regression model is

$$Y = Z\beta + \varepsilon \qquad (12.22)$$

where Y is an n × 1 vector whose elements are the n observations of the centered dependent variable and Z is an n × p matrix whose elements, $z_{ij}$, represent the ith value of the jth principal component. $\beta$ is estimated from equation 10.8 as

$$\hat{\beta} = (Z'Z)^{-1}Z'Y \qquad (12.23)$$

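The expression for $\hat{\beta}$ can be simplified by writing Z as

$$Z = [z_1, z_2, \ldots, z_p]$$

where $z_j$ is an n × 1 vector whose elements are the n values of the jth principal component, so that $Z'Z = [z_i'z_j]$.

From equation 12.4 we have

$$z_i'z_j = a_i'X'Xa_j = (n - 1)\,a_i'Ra_j$$

Now $a_i'Ra_j$ is 0 for $i \ne j$ and is $\lambda_j$ for $i = j$. Thus, $Z'Z$ is a p × p matrix whose off-diagonal elements ($i \ne j$) are all zeros and whose jth diagonal element ($i = j$) is $(n - 1)\lambda_j$. $(Z'Z)^{-1}$ is therefore a diagonal matrix whose jth diagonal element is $1/[(n - 1)\lambda_j]$.

Equation 12.23 can now be written as

$$\hat{\beta}_j = \frac{a_j'X'Y}{(n - 1)\lambda_j} \qquad (12.31)$$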

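From equation 10.14 and the above results it is apparent that

$$\mathrm{Cov}(\hat{\beta}_i, \hat{\beta}_j) = 0 \ \text{ for } i \ne j, \qquad \mathrm{Var}(\hat{\beta}_j) = \frac{\sigma^2}{(n - 1)\lambda_j} \ \text{ for } i = j \qquad (12.32)$$

where $\sigma$ is the standard error of the regression equation.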

Thus $\hat{\beta}_i$ is independent of $\hat{\beta}_j$ for $i \ne j$. The independence of the $\hat{\beta}$'s is a result of the orthogonality of the principal components. Since the $\hat{\beta}$'s are independent, the t-test given by equation 10.17 can be repeatedly applied to test hypotheses on the $\hat{\beta}$'s from a single regression equation. Furthermore, the numerical value for the $\hat{\beta}$'s retained in the regression will not be altered by eliminating any number of the other $\hat{\beta}$'s. This is the distinct advantage of having an orthogonal matrix of independent variables.

A second advantage of having independent $\hat{\beta}$'s is that the interpretation of the $\hat{\beta}$'s in terms of the independent variables is greatly simplified. Thus, if some hydrologic meaning can be attached to a component through an examination of the factor loadings, hydrologic significance can also be attached to the $\hat{\beta}$'s. Unfortunately, in most hydrologic applications of principal components analysis, a clear and distinct interpretation of the principal components has not been possible. This, in turn, means the hydrologic significance of the $\hat{\beta}$'s is unclear as well.

Some authors (DeCoursey and Deal 1974) state that yet another advantage of using regression on principal components as compared to normal multiple regression is that the resulting regression coefficients are more stable when applied to a new set of data because the coefficients are fitted on the basis of only statistically significant orthogonal components. This could imply that an equation based on regression on principal components, when used for prediction on a sample not included in the equation development, would have a smaller standard error on this sample than would a normal multiple regression equation. If this is the case, it would be an important advantage for the regression on principal components technique. An adequate demonstration of this hypothesis needs to be developed, however.
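The orthogonality argument can be illustrated with a small numerical sketch. It assumes NumPy and synthetic data; the predictors, response, and coefficient values are invented for illustration. The coefficients from equation 12.31 are compared with an ordinary least squares fit on all components and with a fit that retains only the first two components.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 40, 4
# Hypothetical correlated predictors and a response built from them
X_raw = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
Y_raw = X_raw @ np.array([1.0, -0.5, 0.3, 0.0]) + rng.normal(size=n)

X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0, ddof=1)   # standardized X
y = Y_raw - Y_raw.mean()                                       # centered Y

R = X.T @ X / (n - 1)
roots, A = np.linalg.eigh(R)
order = np.argsort(roots)[::-1]
roots, A = roots[order], A[:, order]
Z = X @ A                                                      # principal components

# Equation 12.31: beta_j = a_j' X' y / ((n - 1) lambda_j), one coefficient at a time
beta = (A.T @ X.T @ y) / ((n - 1) * roots)

# Least squares on all p components, and on the first two components only
beta_full, *_ = np.linalg.lstsq(Z, y, rcond=None)
beta_two, *_ = np.linalg.lstsq(Z[:, :2], y, rcond=None)
```

In this sketch `beta` should agree with `beta_full`, and `beta_two` should agree with the first two entries of `beta`, showing that dropping components leaves the retained coefficients unchanged.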
A disadvantage of using principal components in a regression analysis is that even if all but one of the components is eliminated, all of the original variables (the X's) must still be measured because each component is a function of all of the X's (equation 12.4). In reporting the results of a regression on principal components, it is generally desirable to transform the resulting regression equation into an equation in terms of the original X variables. This can be done since $y_i = Y_i - \bar{Y}$, the $\hat{\beta}$'s are known constants, $z_{ij} = \sum_{k=1}^{p} a_{kj}x_{ik}$, and $x_{ij} = (X_{ij} - \bar{X}_j)/s_j$. Thus equation 12.22 becomes

$$\hat{Y}_i - \bar{Y} = \sum_{j=1}^{p} \hat{\beta}_j \sum_{k=1}^{p} a_{kj}\,\frac{X_{ik} - \bar{X}_k}{s_k} \qquad (12.33)$$

Equation 12.33 can then be simplified by collecting terms to be of the form

$$\hat{Y}_i = \beta_0^* + \sum_{k=1}^{p} \beta_k^* X_{ik} \qquad (12.34)$$

where the $\beta^*$'s are constants. If only q (q < p) components are retained in the final regression equation, and the components are rearranged so that the first q components are retained, the first summation in equation 12.33 would run from 1 to q; however, the second summation would still run from 1 to p. This means the summation in equation 12.34 would run from 1 to p. It also means that even though the equation contains only q components, all p of the original variables must be measured to predict Y.

Some of the original X variables can be eliminated from the analysis before any regressions are performed by examining the factor loadings and eliminating variables that are not highly correlated with any of the components. The remaining X variables are then resubmitted to a principal components analysis with the multiple regression being performed on the new components. This procedure has the advantage of reducing the number of variables that must be measured to use the resulting regression equation. It has the disadvantage of eliminating X variables rather arbitrarily (there is no statistical test for the significance of the factor loadings) without ever having them in a position to determine their usefulness in explaining the variation in the dependent variable, Y.

In many applications of regression on principal components, the last p - q components are discarded before the regression is performed. The number of retained components, q, is selected so that a large proportion of the variance of X is accounted for. This procedure reduces the number of coefficients that must be estimated but runs the risk of eliminating a component that may explain a significant amount of the variation in Y even though it explains little of the variance of X. Equation 12.31 gives $\hat{\beta}_j = a_j'X'Y/[(n - 1)\lambda_j]$, whereas equation 12.32 gives $\mathrm{Var}(\hat{\beta}_j) = \sigma^2/[(n - 1)\lambda_j]$. The statistical significance of $\hat{\beta}_j$ is tested using equation 10.17 with $\beta_0 = 0$. Thus the test statistic is

$$t = \frac{\hat{\beta}_j}{[\mathrm{Var}(\hat{\beta}_j)]^{1/2}} = \frac{a_j'X'Y}{\sigma[(n - 1)\lambda_j]^{1/2}} \qquad (12.35)$$

There is no reason to believe before the regression is performed that this test statistic will be nonsignificant for small values of $\lambda_j$ (i.e., for the last p - q components). Therefore, the regression should be performed on all of the components and then the components that prove to be nonsignificant can be eliminated. The value of the test statistic given by equation 12.35 can be shown to be proportional to the correlation between Y and $z_j$ as follows:

Therefore

or the significance of the jth component is directly proportional to its correlation with the dependent variable. Equation 12.38 can be used to test the significance of the jth component.

At this point, it should be noted that if a dependent variable Y is regressed on p principal components extracted from a p × p correlation matrix and then transformed via equation 12.33, the results are identical to those that would be obtained by a direct regression of Y on the original p variables. This is because multiple regression is a linear operation and the principal components are independent linear functions of the original variables that explain all of the variance of the variables.
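Both statements in this passage lend themselves to a numerical check. The sketch below assumes NumPy and synthetic data; the names `b_star` and `b0_star` are illustrative stand-ins for the $\beta^*$ constants of equation 12.34, and the residual degrees of freedom n - p - 1 used for the standard error is an assumption of this sketch. It back-transforms a regression on all p components and compares it with a direct regression on the original variables, then forms the ratio of each component's t statistic to its correlation with y.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 3
X_raw = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))   # hypothetical predictors
Y_raw = 5.0 + X_raw @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

xbar, s = X_raw.mean(axis=0), X_raw.std(axis=0, ddof=1)
X = (X_raw - xbar) / s                                       # standardized X
y = Y_raw - Y_raw.mean()                                     # centered Y

roots, A = np.linalg.eigh(X.T @ X / (n - 1))
Z = X @ A
beta = (Z.T @ y) / ((n - 1) * roots)                         # equation 12.31

# Back-transform to coefficients on the original variables (form of eq. 12.34)
b_star = (A @ beta) / s
b0_star = Y_raw.mean() - xbar @ b_star

# Direct least squares of Y on the original X with an intercept
D = np.column_stack([np.ones(n), X_raw])
ols, *_ = np.linalg.lstsq(D, Y_raw, rcond=None)

# t statistic (eq. 12.35) and correlation with y for each component
resid = y - Z @ beta
sigma = np.sqrt(resid @ resid / (n - p - 1))                 # assumed residual d.f.
t = beta * np.sqrt((n - 1) * roots) / sigma
r = np.array([np.corrcoef(y, Z[:, j])[0, 1] for j in range(p)])
ratio = t / r
```

Here `ols` should match `b0_star` followed by `b_star`, and every entry of `ratio` should equal the same constant, $(n - 1)^{1/2}s_y/\sigma$, which is what proportionality between the test statistic and the correlation means in practice.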


MULTIVARIATE MULTIPLE REGRESSION

Occasionally, it is desirable to predict several dependent variables from the same set of independent variables. Such a situation might be predicting the mean annual flood, 10-year peak flow, and 25-year peak flow for a setting where it is desirable to maintain the correlation among the dependent variables. This can be accomplished using a multivariate extension of multiple regression. The prediction model would be

$$Y = X\beta + \varepsilon$$

where Y is an n × q matrix of dependent variables, X is an n × p matrix of independent variables, and $\beta$ is a p × q matrix of coefficients. Press (1972) discusses this model in more detail. The coefficients, $\beta$, can be estimated in a manner similar to that employed in multiple regression as

$$\hat{\beta} = (X'X)^{-1}X'Y$$

This equation can be written as

$$[\hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_q] = (X'X)^{-1}X'[Y_1, Y_2, \ldots, Y_q]$$

where $\hat{\beta}$ and Y are partitioned into q column vectors, p × 1 for $\hat{\beta}$ and n × 1 for Y. Furthermore

$$\hat{\beta}_j = (X'X)^{-1}X'Y_j$$

demonstrating that the solution to equations 12.40 is equivalent to q multiple regressions each involving the same X but a different vector of dependent variables. Tests of hypothesis concerning $\beta_j$ can be made using the procedures set forth in chapter 10.

In multivariate regression as in multiple regression, one commonly has a large number of independent variables, not all of which are important in predicting the q dependent variables. If q separate multiple regressions are performed and independent variables eliminated using the procedures of chapter 10, it would be unlikely that the resulting equations would contain the same set of independent variables. If the multivariate regression model is used, all q of the prediction equations will contain the same set of independent variables. Press (1972) presents a procedure for testing the hypothesis that $\beta_i = \beta_i^*$ where $\beta_i$ is a 1 × q vector made up of the coefficients associated with the ith independent variable for each of the q dependent variables and $\beta_i^*$ is a 1 × q vector of constants. To test that the ith independent variable was not significant would be equivalent to the test that $\beta_i = 0$. Thus, a procedure is available for eliminating variables from the regression to produce a usable model.

One distinct advantage in using the same independent variables for estimating several dependent variables is that the correlation structure of the dependent variables is preserved. DeCoursey (1973) used such an approach to derive prediction equations for the 2-, 5-, 10-, and 25-year peak flows on watersheds in Oklahoma. In situations like this it is highly desirable to retain the observed correlations among the dependent variables in the resulting prediction equations. In the case of flood flows, if this is not done it might be possible to have equations that are inconsistent and predict, say, a 10-year peak to be greater than the 25-year peak flow.

Another place where retention of the correlation structure among a set of dependent variables is important is in estimating the parameters descriptive of runoff hydrographs. Rice (1967) discusses this application of multivariate multiple regression in simultaneously estimating the runoff volume, peak discharge, and a base time parameter for runoff hydrographs based on data presented by Reich (1962). Rice states that even though three separate regressions produce slightly better fits to the original pool of data, the multivariate solution might be more effective in predicting hydrographs for storms on watersheds not included in the original data sample.

CANONICAL CORRELATION

Canonical correlation examines the relationship between two sets of variables. Consider the n × p matrix X with covariance matrix $\Sigma$. Partition X and $\Sigma$ so that

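$$X = [Y, Z]$$

where Y is n × $p_1$ and Z is n × $p_2$ and

$$\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}$$

where $\Sigma$ is p × p, $\Sigma_{11}$ is $p_1$ × $p_1$, $\Sigma_{12}$ is $p_1$ × $p_2$, $\Sigma_{21}$ is $p_2$ × $p_1$, and $\Sigma_{22}$ is $p_2$ × $p_2$ with $p_1 + p_2 = p$ and $p_1 \le p_2$. In this formulation $\Sigma_{11} = \mathrm{Var}(Y)$, $\Sigma_{22} = \mathrm{Var}(Z)$, and $\Sigma_{12} = \Sigma_{21}' = \mathrm{Cov}(Y, Z)$.

Canonical correlation investigates the correlation between Y and Z. Linear functions of Y and Z are formed and then the correlation between these linear functions determined. Define

$$U_1 = a_1'Y' \quad \text{and} \quad V_1 = a_2'Z'$$

so that $a_1$ is $p_1$ × 1, $Y'$ is $p_1$ × n, $a_2$ is $p_2$ × 1, and $Z'$ is $p_2$ × n. The linear functions $U_1$ and $V_1$ are formed in such a way as to maximize the correlation between them.

The variances of $U_1$ and $V_1$ are

$$\mathrm{Var}(U_1) = \mathrm{Var}(a_1'Y') = a_1'\Sigma_{11}a_1 \quad \text{and} \quad \mathrm{Var}(V_1) = a_2'\Sigma_{22}a_2$$

Therefore

$$\mathrm{Cor}(U_1, V_1) = \frac{a_1'\Sigma_{12}a_2}{[(a_1'\Sigma_{11}a_1)(a_2'\Sigma_{22}a_2)]^{1/2}}$$

Our goal is to find the $a_1$ and $a_2$ that maximize $\mathrm{Cor}(U_1, V_1)$. Because correlation is not changed by linear operations, we must use a constraint to get a unique solution. We will use

$$a_1'\Sigma_{11}a_1 = a_2'\Sigma_{22}a_2 = 1$$

as a normalizing constraint. In this case $\mathrm{Cor}(U_1, V_1) = a_1'\Sigma_{12}a_2$. As was done in the case of principal components, Lagrangian multipliers are used to maximize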

$\Gamma$ comes from matrix multiplication of the partitions of $\Sigma$ so that

Unfortunately, $\Gamma$ is not symmetric, so the determination of the resulting $p_2$ values of the $\lambda$'s may require special computing techniques. $\lambda_i$ is numerically equal to the square of the correlation between $U_i$ and $V_i$. For convenience the $\lambda_i$ are arranged so that $\lambda_1 > \lambda_2 > \cdots > \lambda_{p_2}$.

The hope is that $\lambda_1$ is sufficiently large that other $\lambda$'s can be dropped so that attention can be focused on $U_1$ and $V_1$, which are vectors, rather than Y and Z, which are matrices. Of course, there will be a $U_i$ and a $V_i$ associated with each of the $\lambda_i$'s. If $\lambda_2$ is sufficiently large, two vectors on U and two vectors on V may have to be considered.

The vector $a_2$ used to transform Z to V is found by determining the eigenvectors of

Once $a_2$ is found, the vector $a_1$ transforming Y to U is found from

The partitioning of X into Y and Z has to be done up front by the investigator and is not a result of the analysis. Interpretation and use of canonical correlation in hydrology appear to have some of the same drawbacks associated with principal components. The problem might be reduced from considering a $p_1$-variate Y and a $p_2$-variate Z to single variates U and V, yet U is a function of all $p_1$ Y's and V is a function of all $p_2$ Z's. Some investigators eliminate some of the Y and Z variables if the coefficients in $a_1$ or $a_2$ are "small" on those particular variables.

CLUSTER ANALYSIS

The main objective of a regional flood frequency analysis is to develop regional regression models which can be used to estimate flow characteristics at ungaged stream sites. Hydrologic data from several gaging stations in hydrologically homogeneous regions are collected and analyzed to obtain estimates of the regression parameters. Identification of these hydrologically homogeneous regions is a vital component in any regional frequency analysis. One method used to identify these regions is a multivariate statistical procedure known as cluster analysis.

Cluster analysis is a method used to group objects with similar characteristics. Two types of clustering methods are used for this purpose. The first type is known as hierarchical methods, which attempt to group objects by a series of successive mergers. The most similar objects are grouped first and, as the similarity decreases, all subgroups are progressively merged into a single cluster. The second type is collectively referred to as nonhierarchical clustering techniques and, if required, can be used to group objects into a specified number of clusters. The clustering process starts from an initial set of seed points, which will form the nuclei of the final clusters.

The most commonly used similarity measure in cluster analysis is the Euclidean distance, defined by:

$$D_{ij} = \left[\sum_{k=1}^{p} (z_{ik} - z_{jk})^2\right]^{1/2} \qquad (12.47)$$

where $D_{ij}$ is the Euclidean distance from site i to site j, p is the number of variables included in the computation of the distance (i.e., the basin and climatic variables), and $z_{ik}$ is a standardized value for variable k at site i.

In many applications the variables describing the objects to be clustered (discharges, watershed areas, stream lengths, etc.) will not be measured in the same units. It would not be sensible to treat, say, discharge measured in cubic meters per second, area in square kilometers, and stream length in kilometers as equivalent in determining a measure of similarity. The solution suggested most often is to standardize each variable to unit variance prior to analysis. This is done by dividing the variables by the standard deviations calculated from the complete set of objects to be clustered. The standardization process eliminates the units from each variable and reduces any differences in the range of values among the variables.

To get a feel for how cluster analysis works, consider six precipitation stations and their associated annual precipitation in mm:

Station          1      2      3      4      5      6
Precipitation    1000   1200   600    700    500    1100
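To make the procedure concrete before the walk-through, here is a minimal sketch that standardizes the six precipitation values above and repeatedly merges the two closest clusters, taking the distance between clusters as the minimum station-to-station distance (the rule used to build successive sections of table 12.1). It assumes NumPy; the function names `euclidean` and `single_linkage` are illustrative, not from the text.

```python
import numpy as np

precip = np.array([1000.0, 1200.0, 600.0, 700.0, 500.0, 1100.0])  # stations 1-6, mm
z = (precip - precip.mean()) / precip.std(ddof=1)                  # standardized values
z = z.reshape(-1, 1)                                               # one column per variable

def euclidean(zi, zj):
    """Euclidean distance between two standardized observations (eq. 12.47)."""
    return float(np.sqrt(np.sum((zi - zj) ** 2)))

def single_linkage(z):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [(i + 1,) for i in range(len(z))]       # station labels start at 1
    history = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Distance between clusters = minimum member-to-member distance
                d = min(euclidean(z[i - 1], z[j - 1])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merged = tuple(sorted(clusters[a] + clusters[b]))
        history.append((merged, round(d, 2)))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return history

history = single_linkage(z)
```

Several station pairs are tied at the smallest distance, so the order of the early merges depends on how ties are broken, just as the text's choice of the first pair is arbitrary. The clusters (1, 2, 6) and (3, 4, 5) are formed regardless of that order before the final merge joins them.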

It is desired to see if these stations can be grouped into homogeneous groups based on the average annual precipitation. The first thing that is done is to standardize the precipitation values. For this set of data, the mean is 850 and the standard deviation is 288. Table 12.1 contains the data and results. Equation 12.47 is used to calculate $D_{ij}$. For example, $D_{1,2}$ is $\sqrt{(0.52 - 1.21)^2}$, which equals $|0.52 - 1.21|$, or 0.69. The results for all of the $D_{ij}$ are shown in Section A of table 12.1.

The next step is to find the minimum value of the similarity measure, $D_{ij}$. This value is seen to be 0.35. The value 0.35 appears several times. The pair (3, 4) was arbitrarily chosen as the first similar pair. Section B of table 12.1 contains the $D_{ij}$ values from Section A except for the (3, 4) row. This row contains the minimum of $D_{3,j}$ and $D_{4,j}$ for j = 1, 2, 5, and 6. For example, $D_{3,1}$ is 1.39 and $D_{4,1}$ is 1.04. Therefore, the (3, 4), 1 entry in Section B is 1.04. Other values in the (3, 4) row are similarly determined. Again, the minimum entry in Section B is found to be 0.35, corresponding to the (1, 6) pair. Thus (1, 6) is clustered as in Section C, and entries for Section C are determined from Section B in the same manner as entries in Section B were determined from Section A. The next step results in (1, 6) and 2 being clustered to form (1, 2, 6). This is followed by (3, 4) being clustered with 5 to form (3, 4, 5).

Table 12.2 is similar to table 12.1 except that the value of precipitation for the third station is changed from 600 to 1050 mm. Carrying through the analysis as was done for table 12.1 results in forming the clusters (4, 5) and (1, 2, 3, 6). In table 12.3, the third station value is changed to 1800 mm. The cluster results are (1, 2, 4, 5, 6) and 3.

In all of these analyses, the $D_{ij}$ entry is a measure of the similarity that exists. For example, in table 12.3, the $D_{ij}$ values of 0.22 indicate strong similarity. The value of 0.44 shows that stations 4 and 5 are not as similar as are stations 1, 2, and 6. The value 0.67 shows that the clusters (4, 5) and (1, 2, 6) are less similar than either 4 and 5 or 1, 2, and 6. Finally, the value 1.33 shows that 3 is not very similar to the cluster (1, 2, 4, 5, 6). Clustering may stop when there is a significant jump in the similarity measure. In table 12.3 one might conclude with three clusters, (1, 2, 6), (4, 5), and (3), or with two clusters, (1, 2, 4, 5, 6) and 3.

Table 12.1. First cluster analysis of rainfall data

Station          1      2      3      4      5      6      Mean   St dev
Precipitation    1000   1200   600    700    500    1100   850    288
z                0.52   1.21   -0.87  -0.52  -1.21  0.87   0      1

[Sections A through E: $D_{ij}$ values at successive clustering steps]

Table 12.3. Third cluster analysis of rainfall data

Station          1      2      3      4      5      6      Mean    St dev
Precipitation    1000   1200   1800   700    500    1100   1050    451
z                -0.11  0.33   1.66   -0.78  -1.22  0.11   -7E-18  1

[Sections A through E: $D_{ij}$ values at successive clustering steps]

Table 12.4 extends the analysis to consideration of two measures of the stations being considered, precipitation and potential evapotranspiration. Again, Section A was constructed from equation 12.47. For example, the $D_{1,2}$ entry is calculated from standardized values as $D_{1,2} = \sqrt{(-0.11 - 0.33)^2 + (-1.21 - 1.21)^2}$ or 2.47. The analysis is completed based on Section A in the same manner as for tables 12.1-12.3. Here a satisfactory clustering doesn't appear to exist. It looks as though 2 and 6 might be clustered but possibly the other stations cannot be clustered.

Table 12.5 is based on the ratio of precipitation to potential evapotranspiration. Using this system measure, 2, 4, and 6 certainly form a cluster. Depending on the purpose of the analysis, one might conclude that (1, 3) and (2, 4, 5, 6) represent the final clustering.

Exercises

12.1. Calculate the correlation matrix for the first two variables contained in the tables of exercise 10.8.

12.2. Calculate the characteristic values and characteristic vectors associated with the correlation matrix of exercise 12.1.

12.3. Compute the numerical values of the principal components of the data in the first two columns of the table in exercise 10.8 (based on the correlation matrix).

12.4. (a) Work exercise 12.1 using the first three variables. (b) Work exercise 12.2 based on the first three variables. (c) Work exercise 12.3 based on the first three variables.

12.5. (a) Work exercises 12.1, 12.2, and 12.3 based on the covariance matrix. (b) Work exercise 12.4 based on the covariance matrix.

12.6. Work exercise 12.4 using all of the variables in the table of exercise 10.8 except Q,. (Note: Don't try this without a computer; life is too short!)

12.7. Calculate the factor loadings for the data of (a) exercise 12.2, (b) exercise 12.4, or (c) exercise 12.5.

12.8. Show that $Z'Z = (n - 1)D_\lambda$ by using as an example the data of exercise (a) 12.2, (b) 12.4, or (c) 12.5.

Table 12.4. Fourth cluster analysis of rainfall data

Station          1      2      3      4      5      6      Mean    St dev
Precipitation    1000   1200   1800   700    500    1100   1050    451
z1               -0.11  0.33   1.66   -0.78  -1.22  0.11   -7E-18  1
PET              500    1200   600    700    1000   1100   850     288
z2               -1.21  1.21   -0.87  -0.52  0.52   0.87   2E-17   1

[Sections A through E: $D_{ij}$ values at successive clustering steps]

Table 12.5. Cluster analysis of rainfall-evaporation data

[Rows: Ratio, z. Columns: Stations 1 through 6, Mean, St dev. Sections A through E: $D_{ij}$ values at successive clustering steps]

13. Data Generation

CHAPTER 15 discusses several stochastic models that have been found useful in hydrology. Stochastic models contain random components. These random components contain random elements. If a stochastic model is to be used to generate hydrologic data, methods must be available for generating the random elements of the models.

A random element is usually thought of as an element selected in a fashion such that each element in the population has an equal chance of being selected. If the sample results from choosing a number at random from a population of numbers in such a fashion that each number in the population has an equal chance of being selected, the procedure is equivalent to sampling from a uniform distribution. More generally, a random element can be selected from any probability distribution as long as the elements are independent of each other.

This chapter first sets forth techniques for generating random samples from probability distributions. Next, a method for generating a multivariate random sample that preserves the correlations between the variates is presented. Finally, several possible areas of application for data generation methods are discussed.

In any application of data generation methods, it must be kept firmly in mind that data generation cannot improve or overcome faulty data. At best, one can generate a set of data having statistical properties equal to the properties of the sample used in estimating the population parameters. In addition to this, data generated stochastically are subject to the same sampling errors as natural data. As a matter of fact, data generation has been widely used to study sampling errors, an application discussed later in the chapter.

UNIVARIATE DATA GENERATION

A random number is defined as a number selected from a population of numbers in such a fashion that the probability of the number being in some interval is strictly governed by the probability density function of the population. If this probability is proportional to the length of the interval, the pdf is a uniform distribution. A random digit would be one of the numbers 0, 1, ..., 9 selected in such a fashion that any one of these numbers would have an equal probability of being selected. In a sample of 100 random digits, the expected result (but with a very low probability of occurrence) would be ten each of the digits 0, 1, ..., 9.

Tables of random numbers are available in many statistics books. Computer routines for generating random numbers are included as a part of the program libraries for most computers. Care must be exercised when using computer routines in that some generate biased samples. Many computer routines generate uniformly distributed random numbers in the interval (0, 1). A uniform random number, Y, in the interval (a, b) can be generated from a uniform random number in the interval (0, 1), $R_u$, by the relationship $Y = (b - a)R_u + a$.

Random observations may be generated from probability distributions by making use of the fact that the cumulative probability function for any continuous variate is uniformly distributed over the interval 0 to 1. Thus, for any random variable Y with probability density function $p_Y(y)$, the variate

$$P_Y(y) = \int_{-\infty}^{y} p_Y(t)\,dt \qquad (13.1)$$

is uniformly distributed over (0, 1). A procedure, illustrated in figure 13.1, for generating a random value y from $p_Y(y)$ is

1. Select a random number $R_u$ from a uniform distribution in the interval (0, 1).

2. Set $P_Y(y) = R_u$ in equation 13.1.

3. Solve for y.

Step 3 in this procedure is known as obtaining the inverse transform of the probability distribution.
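The three steps above can be sketched in a few lines. This is an illustration under stated assumptions rather than a reference implementation: it assumes NumPy's default generator as the source of $R_u$, the function names are hypothetical, and the Weibull parametrization (shape, scale, and location, with $P(y) = 1 - \exp\{-[(y - \text{loc})/\text{scale}]^{\text{shape}}\}$) is a common form chosen for this sketch and may differ from the notation used elsewhere in the text.

```python
import numpy as np

rng = np.random.default_rng(4)

def uniform_ab(a, b, size):
    """Uniform on (a, b) from a uniform on (0, 1): Y = (b - a) R_u + a."""
    return (b - a) * rng.random(size) + a

def exponential_inverse(lam, size):
    """Exponential variates: set 1 - exp(-lam * y) = R_u and solve for y."""
    r_u = rng.random(size)                     # step 1: R_u on [0, 1)
    return -np.log(1.0 - r_u) / lam            # steps 2-3: inverse transform

def weibull_inverse(shape, scale, loc, size):
    """Weibull variates under the assumed parametrization
    P(y) = 1 - exp(-((y - loc) / scale) ** shape), y >= loc."""
    r_u = rng.random(size)
    return loc + scale * (-np.log(1.0 - r_u)) ** (1.0 / shape)

u = uniform_ab(2.0, 5.0, 100_000)
e = exponential_inverse(0.5, 100_000)
w = weibull_inverse(2.0, 3.0, 1.0, 100_000)
```

With large samples, the mean of `e` should be close to 1/lam and the median of `w` close to loc + scale (ln 2)^(1/shape), which gives a quick sanity check on any inverse transform coded this way.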

Fig. 13.1. Procedure for generating a random observation from a probability distribution.


As an example, consider the Weibull distribution with

and

Solving for y results in the inverse transform

By substituting Ru for Py(y), random values of Y from the 3-parameter Weibull distribution can be generated from

For some distributions it is not possible to solve equation 13.1 explicitly for y. That is, an analytic inverse transform cannot be found. The normal and gamma distributions are examples of this. Fortunately, in the case of the normal distribution, numerically generated tables of standard random normal deviates are widely available. A standard random normal deviate is a random observation from a standard normal distribution. Random observations for any normal distribution can be generated from the relationship

where $R_N$ is a standard random normal deviate and $\mu$ and $\sigma$ are the parameters of the desired normal distribution of Y. Computer routines are available for generating standard random normal deviates.

For some distributions, relationships with other distributions can be used in the generating process. For example, a gamma variate with integer values for $\eta$ has been shown to be the sum of $\eta$ exponential variates each with parameter $\lambda$. Therefore, gamma variates with integer values for $\eta$ can be generated by summing $\eta$ values generated from an exponential distribution.

Whittaker (1973) discusses a method for generating random gamma variates with any shape parameter $\eta$. Because the gamma distribution is closed under addition, a gamma random variable with any shape parameter can be constructed if one with a shape parameter in the interval $0 < \eta < 1$ can be constructed. Let $R_{u1}$, $R_{u2}$, and $R_{u3}$ be independent uniform random variables on (0, 1). Define $S_1$ and $S_2$ by

$$S_1 = R_{u1}^{1/\eta}, \qquad S_2 = R_{u2}^{1/(1-\eta)}$$

then if $S_1 + S_2 \le 1$, define Y and Z as

$$Z = \frac{S_1}{S_1 + S_2} \qquad \text{and} \qquad Y = -\frac{Z \ln R_{u3}}{\lambda}$$

Then Y has a gamma distribution with shape parameter $\eta$ and scale parameter $\lambda$. This procedure requires the generation of at least 3 uniform random variables. If $S_1 + S_2 > 1$, then $R_{u1}$ and $R_{u2}$ are rejected and new values generated. The probability that $S_1 + S_2 \le 1$ is given as $\pi\eta(1 - \eta)\,\mathrm{cosec}(\pi\eta)$ and has a minimum of $\pi/4$ at $\eta = 1/2$ and is symmetric about this value. To generate a gamma variate with $\eta > 1$, a gamma variate with an integer shape parameter, and a shape parameter

TIME SERIES

$z_{1-\alpha/2}$, where $z_{1-\alpha/2}$ is the standardized normal z value with probability $z > z_{1-\alpha/2}$ equal to $\alpha/2$, the hypothesis of equality of means is rejected. Conover (1971, 1980) should be consulted if the data have many ties or groupings of equal values.

Example 14.3. Below are annual flow data for Beaver Creek in western Oklahoma. The data are plotted in figure 14.8. It has been hypothesized that after the 28th year, the flow regime has distinctly changed. Test the hypothesis that the flow for years 1-28 differs from the flow for years 29-46 using (a) a normality assumption and (b) a nonparametric approach.

Fig. 14.8. Beaver Creek annual flow (horizontal axis: year).

Solution: (a) If a normality assumption is made, the test statistic comes from equation 8.17. The mean and standard deviation of the first 28 observations are 83,684 and 75,046 and for observations 29-46 they are 33,046 and 31,759, respectively. Using equation 8.17, the t statistic is calculated as

which would indicate the two parts of the records have different means.

Year   Flow   Rank   Year   Flow   Rank   Year   Flow   Rank

(b) Based on the nonparametric approach, equations 14.15 and 14.16 are used. The sum S is computed as

and the test statistic is

indicating that the hypothesis of a difference would probably be rejected. Comment: Which of the two conclusions one would adopt would depend on the distributions of the annual flows. The flows for the two periods should be examined and a determination made as to the validity of a normality assumption.
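The calculation in part (a) can be sketched from the summary statistics quoted in the example. Equation 8.17 is not reproduced in this section, so the pooled-variance form of the two-sample t statistic used below is an assumption, as is the sample size of 18 for years 29-46; the function name is hypothetical.

```python
import math


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t statistic with a pooled variance (assumed form of equation 8.17)."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    std_error = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (mean1 - mean2) / std_error, n1 + n2 - 2


# Years 1-28 versus years 29-46 of the Beaver Creek record
t_stat, dof = pooled_t(83684.0, 75046.0, 28, 33046.0, 31759.0, 18)
```

The resulting statistic would be compared with the critical t value for `dof` degrees of freedom at the chosen significance level. The two standard deviations differ by more than a factor of two, which bears on the Comment above about checking the distributional assumptions.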

AUTOCORRELATION

One method of characterizing correlation within a time series over time is the autocorrelation function, $\rho(\tau)$, given by

$$\rho(\tau) = \frac{\mathrm{Cov}(X(t), X(t + \tau))}{\mathrm{Var}(X(t))} \qquad (14.17)$$

For $\tau = 0$, equation 14.17 indicates that $\rho(0) = 1$ because $\mathrm{Cov}(X(t), X(t + \tau)) = \mathrm{Cov}(X(t), X(t)) = \mathrm{Var}(X(t))$. From figure 14.2 it can be seen that for small values of $\tau$ the covariance term would be positive because for the most part like signs are being multiplied ($X(t) - \bar{X}$ and $X(t + \tau) - \bar{X}$ have the same sign for small $\tau$). As $\tau$ increases, a point is reached where the covariance, and thus $\rho(\tau)$, may become negative. Some authors call $\mathrm{Cov}(X(t), X(t + \tau))$ the autocorrelation function. In keeping with the terminology established earlier in this book, $\mathrm{Cov}(X(t), X(t + \tau))$ will be called the autocovariance.

A plot of the autocorrelation function against the lag $\tau$ is called a correlogram. For random data such as shown in figure 14.1a, the correlogram would appear as in figure 14.9a. In the case of data containing a cyclic and stochastic component such as shown in figure 14.1c, the correlogram would be cyclic as in figure 14.9b where p is the period of the cycle. Correlograms are useful in determining if successive observations are independent. If the correlogram indicates a correlation between $X(t)$ and $X(t + \tau)$, the observations cannot be assumed to be independent. The autocorrelation function may thus be said to indicate the "memory" of a stochastic process. When $\rho(\tau)$ becomes zero, the process is said to have no memory for what occurred prior to time $t - \tau$. In practice, $\rho(\tau)$ should be zero for large $\tau$ for most random processes. If $\rho(\tau)$ for large $\tau$ exhibits a pattern that is not zero, it may be an indication of a deterministic component. For example, if the correlogram appears as in figure 14.9b, it indicates the data contains a periodic component.

Fig. 14.9. Typical correlograms: (a) Random process. (b) Random process superimposed on a periodic process.

A hydrologic time series representing a process involving significant storage is likely to have values at time $t + 1$ that are correlated with values at the previous time t. The correlation may extend over several time increments so that $X(t + k)$ is correlated with $X(t)$ k time units earlier. Daily flows in a stream and daily, monthly, and possibly annual groundwater levels are examples of hydrologic time series that often exhibit correlation over time. Annual maximum peak flow is an example of a time series that is unlikely to exhibit correlation over time.

For a discrete time scale, the autocorrelation function becomes $\rho(k)$ where k is the lag or number of time intervals separating $X(t)$ and $X(t + \tau)$. The relationship between $\tau$ and k is given by

$$\tau = k\,\Delta t$$

where $\Delta t$ is the length of the time interval (e.g., 1 day, 1 month, 1 year, etc.). If $\rho(k) = 0$ for all $k \ne 0$, the process is said to be a purely random one. This indicates that the observations are linearly independent of each other. If $\rho(k) \ne 0$ for some $k \ne 0$, the observations k time increments apart are dependent in a statistical sense and the process is referred to as simply a random one. If a time series is nonstationary, $\rho(k)$ will not be zero for all $k \ne 0$ because of the deterministic element, even if the random element is itself a purely random time series (Matalas 1967b). Unless the deterministic element is removed, one cannot determine to what extent nonzero values of $\rho(k)$ are affected by the deterministic element.

The population autocorrelation function, $\rho(k)$, may be estimated by $r(k)$, which is given by

$$r(k) = \frac{\sum_{i=1}^{n-k} X_i X_{i+k} - \dfrac{\sum_{i=1}^{n-k} X_i \sum_{i=1}^{n-k} X_{i+k}}{n-k}}{\left[\sum_{i=1}^{n-k} X_i^2 - \dfrac{\left(\sum_{i=1}^{n-k} X_i\right)^2}{n-k}\right]^{1/2} \left[\sum_{i=1}^{n-k} X_{i+k}^2 - \dfrac{\left(\sum_{i=1}^{n-k} X_{i+k}\right)^2}{n-k}\right]^{1/2}}$$

with $X_i = X(t_i)$, $X_{i+k} = X(t_i + k\,\Delta t)$, and n is the total number of observations. Some authors use the terminology autocorrelation function for $\rho(k)$ or $\rho(\tau)$ and serial correlation function for $r(k)$ or $r(\tau)$. This distinction is not made in this text.
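A direct transcription of the estimator r(k) into Python is sketched below. The function name and the cosine test series are illustrative assumptions; the sums run over the n - k overlapping pairs $(X_i, X_{i+k})$ exactly as in the expression above.

```python
import math


def sample_autocorrelation(x, k):
    """Estimate rho(k) by r(k) using the n - k overlapping pairs (X_i, X_{i+k})."""
    m = len(x) - k                      # number of pairs, n - k
    if k < 0 or m < 2:
        raise ValueError("lag k must satisfy 0 <= k <= n - 2")
    head, tail = x[:m], x[k:]           # X_i and X_{i+k}
    s_xy = sum(a * b for a, b in zip(head, tail))
    s_x, s_y = sum(head), sum(tail)
    s_xx = sum(a * a for a in head)
    s_yy = sum(b * b for b in tail)
    numerator = s_xy - s_x * s_y / m
    denominator = math.sqrt(s_xx - s_x ** 2 / m) * math.sqrt(s_yy - s_y ** 2 / m)
    return numerator / denominator


# Illustrative series: a pure cosine with a period of 12 time units
series = [math.cos(2.0 * math.pi * t / 12.0) for t in range(120)]
correlogram = [sample_autocorrelation(series, k) for k in range(25)]
```

For this deterministic series the correlogram repeats the cosine itself, swinging between +1 and -1 with a period of 12 lags.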


CHAPTER 14

For any observed series, it is unlikely that $r(k)$ will be exactly zero. If $r(k)$ differs from zero by more than is expected by chance, then the observations k time periods apart cannot be assumed independent. Procedures are available for testing the hypothesis that $\rho(k) = 0$ and for placing confidence intervals on $\rho(k)$ (see chapter 11). Often, computer programs that generate correlograms also compute confidence intervals such that if the computed autocorrelation falls outside the confidence interval for a particular value of k, a hypothesis of $\rho(k) = 0$ would be rejected.

PERIODICITY

Autocorrelation analyzes a time series in the time domain. It provides information on the behavior of the series over time, especially with regard to the memory of the process or how the process at one instance of time is dependent on, or related to, the process at some prior time. An alternate analytic approach is to examine the series in the frequency domain. With this approach, an attempt is made to quantify the variability in the series in terms of repeating patterns having fixed periods or, what is equivalent, fixed frequencies. The variance of the process is partitioned among all possible frequencies so that the predominant frequencies can be identified. Let $X_t = X_i = X(t = ih)$ for i = 1 to n. That is, the X's are equally spaced in the time domain. We can express $X_t$ as a Fourier series

$$X_t = a_0 + \sum_{i=1}^{q} \left[a_i \cos(2\pi f_i t) + b_i \sin(2\pi f_i t)\right]$$

The maximum value for the number of terms in the series, q, is given by q = (n - 1)/2 if n is odd and q = n/2 if n is even. The frequency, $f_i = i/n$, represents the ith harmonic of the fundamental frequency 1/n. The coefficients a and b can be estimated from

$$a_0 = \frac{1}{n}\sum_{t=1}^{n} X_t = \bar{X}$$

$$a_i = \frac{2}{n}\sum_{t=1}^{n} X_t \cos(2\pi f_i t)$$

$$b_i = \frac{2}{n}\sum_{t=1}^{n} X_t \sin(2\pi f_i t)$$

for n odd and i = 1, 2, 3, ..., q. For n even, the same expressions apply for i = 1, 2, ..., q - 1, with

$$a_q = \frac{1}{n}\sum_{t=1}^{n} (-1)^t X_t$$

and $b_q = 0$.

The periodogram, $I(f_i)$, is defined as

$$I(f_i) = \frac{n}{2}\left(a_i^2 + b_i^2\right)$$

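For a discrete time series, the angular frequency, $\omega_i$, is equal to $2\pi f_i$ or $2\pi i/n$. The variance of $X_t$ is given by

$$\mathrm{Var}(X_t) = \mathrm{Var}\left[a_0 + \sum_{i=1}^{q} a_i \cos \omega_i t + \sum_{i=1}^{q} b_i \sin \omega_i t\right]$$

Because $a_0$ is a constant,

$$\mathrm{Var}(X_t) = \sum_{i=1}^{q} a_i^2\,\mathrm{Var}(\cos \omega_i t) + \sum_{i=1}^{q} b_i^2\,\mathrm{Var}(\sin \omega_i t)$$

By definition, the $\mathrm{Var}(\cos \omega_i t)$ is

$$\mathrm{Var}(\cos \omega_i t) = E(\cos^2 \omega_i t) - \left[E(\cos \omega_i t)\right]^2$$

A property of the cos function is that $E(\cos \omega_i t) = 0$. Therefore

$$\mathrm{Var}(\cos \omega_i t) = E(\cos^2 \omega_i t) = \begin{cases} 1/2 & i \ne 0,\ i \ne n/2 \\ 1 & \text{otherwise} \end{cases}$$

Similarly,

$$\mathrm{Var}(\sin \omega_i t) = \begin{cases} 1/2 & i \ne 0,\ i \ne n/2 \\ 0 & \text{otherwise} \end{cases}$$

The net result of these manipulations is

$$\mathrm{Var}(X_t) = \frac{1}{2}\sum_{i=1}^{q}\left(a_i^2 + b_i^2\right) \qquad n \text{ odd}$$

$$\mathrm{Var}(X_t) = \frac{1}{2}\sum_{i=1}^{q-1}\left(a_i^2 + b_i^2\right) + a_q^2 \qquad n \text{ even}$$

or, the variance of $X_t$ has been partitioned among the frequencies so that the variance associated with $f_i$ is $\tfrac{1}{2}(a_i^2 + b_i^2)$. $I(f_i)$ is n times the variance associated with $f_i$. By definition, $\mathrm{Var}(X_t) = \sigma^2$. Therefore

$$\sum_{i=1}^{q} I(f_i) = n\sigma^2$$

Let $g(f_i) = \dfrac{I(f_i)}{n\sigma^2}$ so that

$$\sum_{i=1}^{q} g(f_i) = 1$$

The function $g(f_i)$ is the spectral density function representing the fraction of the variance of $X_t$ associated with $f_i$. Recall that

Fig. 14.10. $\cos(2\pi k/12)$ (top) and its correlogram (middle) and spectral density (bottom).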

Some plot the spectral density function, $g(f_i)$, versus $f_i$ and some plot the periodogram, $I(i)$, versus i. Because p = 1/f = 1/(i/n) = n/i, the period associated with any i can be easily determined as n/i. Peaks or spikes in $g(f_i)$ or $I(i)$ indicate frequencies that predominate in determining the variance of $X_t$. Figures 14.10 and 14.11 show some correlograms and spectral density functions for well-behaved functions. In figure 14.10 the function is

$$\cos(2\pi k/12)$$

For this function f is 1/12 cycles per time unit and the wavelength is 12 time units. A software package, NCSS 2000 (1998), was used to make the calculations used in generating the plots of figure 14.10. The frequency axis of the periodogram is actually $2\pi/f$ in this plot. Note that for this deterministic function the correlogram reflects the function exactly. The function for figure 14.11 is

Fig. 14.11. Sum of 3 cosines (top) and its correlogram (middle) and spectral density (bottom).

Here again, the correlogram reflects the deterministic function. The three frequencies of 1/6, 1/12, and 1/24 can easily be seen in the periodogram.

Figure 14.12 is a similar analysis of the monthly stream flow on Cave Creek near Lexington, Kentucky. The correlogram reflects a cycle of 12 months but does not reproduce the flow record. The maximum correlations of $\pm 0.4$ are considerably less than the $\pm 1.0$ for the deterministic functions, reflecting a combination of deterministic and random components in the data. The periodogram clearly shows the periodic nature of monthly flow at this location with a period of 12 months.

Fig. 14.12. Cave Creek monthly stream flow (panels: flow versus month, autocorrelation plot, and periodogram versus frequency).
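The periodogram calculation described in this section can be sketched as follows. The function name is hypothetical, the treatment of the last harmonic for even n follows the usual Fourier-coefficient convention and is an assumption here, and the test series is the cosine of period 12 used for figure 14.10.

```python
import math


def periodogram(x):
    """Return (f_i, I(f_i)) pairs for the harmonics i = 1, ..., q of a series x_1, ..., x_n."""
    n = len(x)
    q = (n - 1) // 2 if n % 2 else n // 2
    result = []
    for i in range(1, q + 1):
        f = i / n
        a = (2.0 / n) * sum(x[t - 1] * math.cos(2.0 * math.pi * f * t) for t in range(1, n + 1))
        b = (2.0 / n) * sum(x[t - 1] * math.sin(2.0 * math.pi * f * t) for t in range(1, n + 1))
        if n % 2 == 0 and i == q:
            a_q = a / 2.0                 # a_q = (1/n) * sum of (-1)^t x_t, with b_q = 0
            intensity = n * a_q * a_q     # variance at f = 1/2 is a_q^2
        else:
            intensity = (n / 2.0) * (a * a + b * b)
        result.append((f, intensity))
    return result


n_obs = 120
series = [math.cos(2.0 * math.pi * t / 12.0) for t in range(1, n_obs + 1)]
pgram = periodogram(series)
peak_f, peak_i = max(pgram, key=lambda pair: pair[1])
period = 1.0 / peak_f                     # p = 1 / f = n / i

# Partition of variance: the intensities sum to n times the (population) variance
mean = sum(series) / n_obs
variance = sum((v - mean) ** 2 for v in series) / n_obs
total_intensity = sum(intensity for _, intensity in pgram)
```

For this series all of the variance falls on the single harmonic with f = 1/12, so the spectral density $g(f_i) = I(f_i)/(n\sigma^2)$ is essentially 1 there and 0 elsewhere.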

AUTOREGRESSIVE INTEGRATED MOVING AVERAGE MODELS (ARIMA)

Autoregressive integrated moving average models, ARIMA, are often known as Box-Jenkins models because of the early work of Box and Jenkins on this class of models (Box and Jenkins, 1976). ARMA models are a subclass of ARIMA time series models. ARMA stands for autoregressive moving average models. These models make an observation at time t a function of observations and errors at time t, t - 1, t - 2, and so on. Autoregressive (AR) implies that an observation $z_t$ is a linear function of $z_{t-1}, z_{t-2}, \ldots$. Moving average (MA) implies that an observation $z_t$ is a linear function of white noise at t, t - 1, t - 2, .... Several software packages are available for estimating ARIMA models.

ARMA models assume that the series is stationary in the mean. Nonstationarities such as periodicities, trends, jumps, and so forth should be removed prior to an ARMA analysis. If there is a trend or drift in the mean, this can be removed via differencing. For example, if $z_t = a + bt$ and $w_t = z_t - z_{t-1}$ then

$$w_t = a + bt - a - b(t - 1) = b$$

so that $w_t$ is stationary in the mean. $w_t$ is the first difference of $z_t$. If $z_t = a + bt + ct^2$ then

$$w_t = z_t - z_{t-1} = a + bt + ct^2 - a - b(t - 1) - c(t - 1)^2 = b - c + 2ct,$$

which is a linear trend. Taking the second difference

$$y_t = w_t - w_{t-1} = (b - c + 2ct) - (b - c + 2c(t - 1)) = 2c$$

Thus $y_t$ is stationary in the mean. The nth difference is denoted as $\nabla^n z_t$. For a linear trend, $\nabla z_t$ is stationary. For a quadratic trend, $\nabla^2 z_t$ is stationary.

Differencing is indicated by the term "integrated." An autoregressive integrated moving average model is denoted by ARIMA(p, d, q), where p is the autoregressive order, d is the order of differencing, and q is the order of the moving average components. If $z_t$ is ARIMA(p, 1, q) then $\nabla z_t$ is ARIMA(p, 0, q) or ARMA(p, q). If $z_t$ is ARIMA(p, 2, q), then $\nabla^2 z_t$ is ARIMA(p, 0, q) or ARMA(p, q). A general ARMA(p, q) model is

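$$z_t = \phi_1 z_{t-1} + \phi_2 z_{t-2} + \cdots + \phi_p z_{t-p} + a_t - \theta_1 a_{t-1} - \theta_2 a_{t-2} - \cdots - \theta_q a_{t-q}$$

where $\phi_1 z_{t-1} + \cdots + \phi_p z_{t-p}$ represents the pth order AR and $a_t - \theta_1 a_{t-1} - \cdots - \theta_q a_{t-q}$ represents the qth order MA. Most hydrologic applications never get any more complex than an ARIMA(2, 1, 2).

Moving Average Processes (MA)

This treatment follows Cryer (1986) chapters 4 and 7, which should be consulted for more complete coverage. A moving average process of order q, MA(q), is defined as

$$z_t = a_t - \theta_1 a_{t-1} - \theta_2 a_{t-2} - \cdots - \theta_q a_{t-q}$$

where $a_t$ represents an unobserved white noise series. The $a_t$ are identically and independently distributed random variables (iid rvs) with a mean $\mu_a = 0$ and variance $\sigma_a^2$, and $z_t$ is a stationary time series with zero mean. A mean term can be added later if necessary.

MA(1)

A first-order moving average process, MA(1), is given by

$$z_t = a_t - \theta_1 a_{t-1}$$

For notational convenience let $\gamma_k = \mathrm{Cov}(z_t, z_{t-k})$. Some properties of a MA(1) series are:

$$E(z_t) = 0$$

$$\gamma_0 = \mathrm{Var}(z_t) = \sigma_a^2(1 + \theta_1^2)$$

$$\gamma_1 = -\theta_1 \sigma_a^2$$

$$\gamma_k = 0 \quad \text{for } k \ge 2$$

$$\rho_1 = \frac{\gamma_1}{\gamma_0} = \frac{-\theta_1}{1 + \theta_1^2}$$

Note that this last equation presents a way of estimating $\theta_1$. We can estimate $\rho_1$ by $r_1$ and equate

$$r_1 = \frac{-\theta_1}{1 + \theta_1^2}$$

and solve for $\theta_1$. This is a moment estimator and is not very efficient.

$$\theta_1 = \frac{-1 \pm \sqrt{1 - 4r_1^2}}{2r_1}$$

For $\theta_1$ to be real, $4r_1^2$ must be less than 1. This implies that $r_1^2$ must be less than 1/4 or $-1/2 \le r_1 \le 1/2$. For a MA(1) process, $\rho_1$ must lie between $\pm 1/2$. When $\rho_1 = -0.5$, $\theta_1 = 1$. When $\rho_1 = +0.5$, $\theta_1 = -1$. Therefore, $-1 \le \theta_1 \le 1$. Because of the randomness of a sample, it is possible for $|r_1| > 1/2$, but if that occurs it brings into question the appropriateness of a MA(1) as a descriptor of the process.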
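The moment estimate of $\theta_1$ can be sketched in Python as below. The sketch assumes the sign convention $z_t = a_t - \theta_1 a_{t-1}$, so that $\rho_1 = -\theta_1/(1 + \theta_1^2)$, which agrees with the limiting values quoted above ($\rho_1 = -0.5$ gives $\theta_1 = 1$). The function name is hypothetical, and of the two roots of the quadratic the one with $|\theta_1| \le 1$ is returned.

```python
import math


def ma1_moment_estimate(r1):
    """Solve r1 = -theta / (1 + theta^2) for the root with |theta| <= 1."""
    if r1 == 0.0:
        return 0.0
    discriminant = 1.0 - 4.0 * r1 * r1
    if discriminant < 0.0:
        # |r1| > 1/2: no real solution, so an MA(1) description is questionable
        raise ValueError("no real moment estimate exists for |r1| > 0.5")
    return (-1.0 + math.sqrt(discriminant)) / (2.0 * r1)


theta_hat = ma1_moment_estimate(-0.4)
rho_check = -theta_hat / (1.0 + theta_hat ** 2)   # should reproduce r1 = -0.4
```

Raising an error for $|r_1| > 1/2$ mirrors the caution in the text rather than silently returning a value.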

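A second-order moving average process is given by

$$z_t = a_t - \theta_1 a_{t-1} - \theta_2 a_{t-2}$$

with the following properties:

$$\gamma_0 = \sigma_a^2(1 + \theta_1^2 + \theta_2^2)$$

$$\gamma_1 = \sigma_a^2(-\theta_1 + \theta_1\theta_2)$$

$$\gamma_2 = -\theta_2\sigma_a^2$$

$$\gamma_k = 0 \quad \text{for } k \ge 3$$

Moment estimates can be obtained by solving the following equations for $\theta_1$ and $\theta_2$.

$$r_1 = \frac{-\theta_1 + \theta_1\theta_2}{1 + \theta_1^2 + \theta_2^2} \qquad r_2 = \frac{-\theta_2}{1 + \theta_1^2 + \theta_2^2}$$

Certain restrictions apply to these results.

Once again, the expressions for $\rho_1$ and $\rho_2$ provide a means for estimating $\theta_1$ and $\theta_2$; however, the two simultaneous equations may be difficult to solve.

MA(q)

The general result for $\rho_k$ for a MA(q) can now be written as

$$\rho_k = \frac{-\theta_k + \theta_1\theta_{k+1} + \theta_2\theta_{k+2} + \cdots + \theta_{q-k}\theta_q}{1 + \theta_1^2 + \theta_2^2 + \cdots + \theta_q^2} \quad \text{for } k = 1, 2, \ldots, q$$

Note the numerator for $\rho_q$ is $-\theta_q$, and $\rho_k = 0$ for k > q. This can be used to identify the order of a MA(q) process.

Autoregressive Processes

An autoregressive process of order p is given by

$$z_t = \phi_1 z_{t-1} + \phi_2 z_{t-2} + \cdots + \phi_p z_{t-p} + a_t$$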

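$z_t$ has a mean of zero and $a_t$ is independent of $z_{t-1}, z_{t-2}, \ldots$

A first-order autoregressive process is written as

$$z_t = \phi_1 z_{t-1} + a_t$$

and has the properties that

$$\gamma_0 = \mathrm{Var}(z_t) = \phi_1^2\gamma_0 + \sigma_a^2 \qquad \text{or} \qquad \gamma_0 = \frac{\sigma_a^2}{1 - \phi_1^2}$$

This last equation shows that $|\phi_1| < 1$ or else $\gamma_0$ would be less than zero, which is not possible. $\phi_1$ must be inside the unit circle.

For k = 1

$$\gamma_1 = \phi_1\gamma_0$$

For k = 2

$$\gamma_2 = \phi_1\gamma_1 = \phi_1^2\gamma_0$$

For k = k

$$\gamma_k = \phi_1^k\gamma_0 \qquad \text{and} \qquad \rho_k = \frac{\gamma_k}{\gamma_0} = \phi_1^k$$

Because $|\phi_1| < 1$, $\rho_k$ exponentially decays toward zero. For $\phi_1$ positive, $\rho_k$ is positive. For $\phi_1$ negative, $\rho_k$ alternates between positive and negative values. For k = 1, $\phi_1 = \gamma_1/\gamma_0 = \rho_1$ and can be estimated by $r_1$. Another way to estimate $\phi_1$ is through linear regression using the model

$$z_t = \phi_1 z_{t-1} + a_t$$

which is analogous to $Y = a + bX + \varepsilon$. Because $z_t$ has a mean of 0 and $\mathrm{Var}(z_t) = \mathrm{Var}(z_{t-1})$, regression will result in a = 0 and b = $r_1$.

AR(2)

A second-order autoregressive process is given by

$$z_t = \phi_1 z_{t-1} + \phi_2 z_{t-2} + a_t$$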

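with the property

$$\gamma_k = E(z_t z_{t-k}) = E(\phi_1 z_{t-1} z_{t-k} + \phi_2 z_{t-2} z_{t-k} + a_t z_{t-k})$$

It can be noted that $E(z_{t-1} z_{t-k})$ is the same as $E(z_t z_{t-(k-1)})$, and $E(z_{t-2} z_{t-k})$ is the same as $E(z_t z_{t-(k-2)})$, so that

$$\gamma_k = \phi_1\gamma_{k-1} + \phi_2\gamma_{k-2}$$

Dividing by $\gamma_0$

$$\rho_k = \phi_1\rho_{k-1} + \phi_2\rho_{k-2}$$

With k = 1, $\rho_0 = 1$, and $\rho_{-1} = \rho_1$, $\rho_1$ is given by

$$\rho_1 = \phi_1 + \phi_2\rho_1 \qquad \text{or} \qquad \rho_1 = \frac{\phi_1}{1 - \phi_2}$$

Successive values of $\rho_k$ can be determined from $\rho_k = \phi_1\rho_{k-1} + \phi_2\rho_{k-2}$. The autocorrelation function can take on many shapes. In all cases, however, $\rho_k$ dies out exponentially fast as k gets large. This die-out may be strictly exponential or it may be in the form of a damped sine wave. Again, regression techniques can be used to estimate the $\phi$'s because

$$z_t = \phi_1 z_{t-1} + \phi_2 z_{t-2} + a_t$$

AR(p)

A general pth order autoregressive process is given by

$$z_t = \phi_1 z_{t-1} + \phi_2 z_{t-2} + \cdots + \phi_p z_{t-p} + a_t$$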

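with the property

$$\begin{aligned}
\gamma_k = E(z_t z_{t-k}) &= E(\phi_1 z_{t-1} z_{t-k} + \phi_2 z_{t-2} z_{t-k} + \cdots + \phi_p z_{t-p} z_{t-k} + a_t z_{t-k}) \\
&= \phi_1 E(z_{t-1} z_{t-k}) + \phi_2 E(z_{t-2} z_{t-k}) + \cdots + \phi_p E(z_{t-p} z_{t-k}) + E(a_t z_{t-k}) \\
&= \phi_1\gamma_{k-1} + \phi_2\gamma_{k-2} + \cdots + \phi_p\gamma_{k-p} \qquad \text{for } k \ge 1.
\end{aligned}$$

Dividing $\gamma_k$ by $\gamma_0$ with $\rho_0 = 1$ and $\rho_{-k} = \rho_k$ results in

$$\rho_k = \phi_1\rho_{k-1} + \phi_2\rho_{k-2} + \cdots + \phi_p\rho_{k-p}$$

This leads to the Yule-Walker equations

$$\begin{aligned}
\rho_1 &= \phi_1 + \phi_2\rho_1 + \cdots + \phi_p\rho_{p-1} \\
\rho_2 &= \phi_1\rho_1 + \phi_2 + \cdots + \phi_p\rho_{p-2} \\
&\ \,\vdots \\
\rho_p &= \phi_1\rho_{p-1} + \phi_2\rho_{p-2} + \cdots + \phi_p
\end{aligned}$$

which may be written as

$$\begin{bmatrix} \rho_1 \\ \rho_2 \\ \vdots \\ \rho_p \end{bmatrix} = \begin{bmatrix} 1 & \rho_1 & \cdots & \rho_{p-1} \\ \rho_1 & 1 & \cdots & \rho_{p-2} \\ \vdots & \vdots & & \vdots \\ \rho_{p-1} & \rho_{p-2} & \cdots & 1 \end{bmatrix} \begin{bmatrix} \phi_1 \\ \phi_2 \\ \vdots \\ \phi_p \end{bmatrix}$$

and has solution

$$\begin{bmatrix} \phi_1 \\ \phi_2 \\ \vdots \\ \phi_p \end{bmatrix} = \begin{bmatrix} 1 & \rho_1 & \cdots & \rho_{p-1} \\ \rho_1 & 1 & \cdots & \rho_{p-2} \\ \vdots & \vdots & & \vdots \\ \rho_{p-1} & \rho_{p-2} & \cdots & 1 \end{bmatrix}^{-1} \begin{bmatrix} \rho_1 \\ \rho_2 \\ \vdots \\ \rho_p \end{bmatrix}$$

For an AR(1) process, the Yule-Walker solution is $\rho_1 = \phi_1$. For an AR(2) process, the Yule-Walker relationships are

$$\rho_1 = \phi_1 + \phi_2\rho_1 \qquad \rho_2 = \phi_1\rho_1 + \phi_2$$

having as their solution

$$\phi_1 = \frac{\rho_1(1 - \rho_2)}{1 - \rho_1^2} \qquad \phi_2 = \frac{\rho_2 - \rho_1^2}{1 - \rho_1^2}$$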
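The AR(2) Yule-Walker solution, $\phi_1 = \rho_1(1 - \rho_2)/(1 - \rho_1^2)$ and $\phi_2 = (\rho_2 - \rho_1^2)/(1 - \rho_1^2)$, can be checked with a short round-trip sketch: generate $\rho_1$ and $\rho_2$ from chosen coefficients with the recursion $\rho_k = \phi_1\rho_{k-1} + \phi_2\rho_{k-2}$, then recover the coefficients. The function names and the coefficient values 0.5 and 0.3 are illustrative assumptions; in practice the sample values $r_1$ and $r_2$ would be supplied in place of $\rho_1$ and $\rho_2$.

```python
def ar2_autocorrelations(phi1, phi2, max_lag):
    """rho_0..rho_max_lag from rho_1 = phi1 / (1 - phi2) and the AR(2) recursion."""
    rho = [1.0, phi1 / (1.0 - phi2)]
    for k in range(2, max_lag + 1):
        rho.append(phi1 * rho[k - 1] + phi2 * rho[k - 2])
    return rho[: max_lag + 1]


def yule_walker_ar2(rho1, rho2):
    """Solve the two AR(2) Yule-Walker equations for phi1 and phi2."""
    denom = 1.0 - rho1 * rho1
    phi1 = rho1 * (1.0 - rho2) / denom
    phi2 = (rho2 - rho1 * rho1) / denom
    return phi1, phi2


rho = ar2_autocorrelations(0.5, 0.3, max_lag=10)
phi1_hat, phi2_hat = yule_walker_ar2(rho[1], rho[2])
```

With these positive coefficients the computed autocorrelations decay smoothly toward zero, one of the shapes described above.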

To get an estimate of 4, and 0

GEOSTATISTICS


The linear model is given by

$$\gamma_h = \frac{\sigma^2}{R}\,|h| \qquad \text{for } 0 < |h| \le R.$$

The linear model is not strictly positive definite, and positive definiteness should be checked when using this equation. Isaaks and Srivastava (1989) discuss these checks. A nugget model may also be used which is given by

The nugget model may be used when Z(s) and Z(t) are independent for all spacings greater than or equal to the smallest available spacing. The choice of models is usually dependent on the behavior of the sample semivariogram near the origin. If the sample semivariogram shows a parabolic behavior near the origin, the Gaussian model may be the most appropriate. If the semivariogram behaves linearly near the origin, the exponential or spherical model will be the best choice. If a straight line through the first few points on the sample semivariogram intersects the sill at approximately one-fifth the range, the exponential model will be a better fit. If the line intercepts at about two-thirds the range, the spherical model will likely fit better. Figure 17.4 illustrates these various models. Moser and Macchiavelli (1996) and Cressie (1991) discuss the estimation of the parameters of semivariogram models. Least squares, weighted least squares, and maximum likelihood are three methods that can be used. A variety of statistical software is available for carrying out the actual computations for most geostatistical procedures discussed in this book (Deutsch and Journel 1998).

Fig. 17.4. Semivariogram models (horizontal axis: distance, h).

Fig. 17.5. Fitted semivariograms (spherical and exponential) to example transect data (horizontal axis: distance, h, in miles).

The example elevation transect data is modeled with the spherical and exponential models in figure 17.5.

COMBINATION SEMIVARIOGRAM MODELS

Often, one of these simple models will not satisfactorily describe the structure of the data. Fortunately, any linear combination of positive definite semivariogram models with positive coefficients is also a positive definite model. This provides a large family of models

$$\gamma(h) = \sum_{i=1}^{n} w_i\gamma_i(h) \quad \text{for } w_i > 0 \text{ and } \sum_{i=1}^{n} w_i = 1 \qquad (17.22)$$

that will be positive definite as long as the n individual models are positive definite. This linear combination forms a model of nested structures, where each nested structure corresponds to a term in the linear combination in equation 17.22. Different model forms can be included in a single combination.

When selecting a model to represent the sample semivariogram, it is important to remember that simpler is better. That is to say, if the major features of the semivariogram can be satisfactorily characterized by either a basic model or a combination of models, use the basic model. Figure 17.6 shows the best fit model, which is a combination of two spherical models with range values of 0.8 and 3.5 and sill values of 980 and 1730. The coefficient of determination ($r^2$) for this combination model is 0.978, as opposed to 0.975 for the basic exponential model. Because both semivariograms satisfactorily characterize the data, the exponential model is a better choice to represent the structure in the semivariogram for estimation purposes because of its simplicity compared to the combination model.
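A nested model of this kind can be sketched as a sum of basic structures. The spherical and exponential forms below are the commonly used ones and are assumptions here, because equations 17.17 and 17.18 are not reproduced in this section; each structure carries its own sill, which is equivalent to weighting unit-sill models as in equation 17.22. All function names are hypothetical.

```python
import math


def spherical(h, sill, range_):
    """Spherical model (assumed form): rises to the sill at h = range and stays there."""
    h = abs(h)
    if h >= range_:
        return sill
    ratio = h / range_
    return sill * (1.5 * ratio - 0.5 * ratio ** 3)


def exponential(h, sill, range_):
    """Exponential model (assumed form): reaches about 95% of the sill at h = range."""
    return sill * (1.0 - math.exp(-3.0 * abs(h) / range_))


def nested(h, structures):
    """Sum of basic structures, each given as (model, sill, range)."""
    return sum(model(h, sill, range_) for model, sill, range_ in structures)


# Combination of two spherical models with the ranges and sills quoted in the text
combination = [(spherical, 980.0, 0.8), (spherical, 1730.0, 3.5)]
gamma_near = nested(0.25, combination)
gamma_far = nested(5.0, combination)      # beyond both ranges: total sill 980 + 1730
gamma_basic = exponential(5.0, 2755.0, 3.1)
```

Evaluating both candidates over the range of lags in the sample semivariogram is one way to see how little the extra structure changes the fit.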

Fig. 17.6. Fitted semivariogram to example transect data showing the combination of two spherical models (horizontal axis: distance, h, in miles).

ESTIMATION

Isaaks and Srivastava (1989) present an excellent introduction to estimation using geostatistics. An estimate, $\hat{V}(x_0)$, for the random variable, V, at a point $x_0$, can be made in terms of a linear combination of observed values of V at a number, n, of nearby points, $x_i$, for i = 1 to n from

$$\hat{V}(x_0) = \sum_{i=1}^{n} w_i V(x_i) \qquad (17.23)$$

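The error in the estimate, $R(x_0)$, is given by

$$R(x_0) = \hat{V}(x_0) - V(x_0) \qquad (17.24)$$

For the estimate to be unbiased, we must have $E(R(x_0)) = 0$.

$$E(R(x_0)) = E\left(\sum_{i=1}^{n} w_i V(x_i)\right) - E(V(x_0)) = \sum_{i=1}^{n} w_i E(V(x_i)) - E(V(x_0))$$

Because we are dealing with a stationary process,

$$E(V(x_i)) = E(V(x_0)) = E(V)$$

Therefore

$$E(R(x_0)) = E(V)\left(\sum_{i=1}^{n} w_i - 1\right) = 0 \qquad \text{or} \qquad \sum_{i=1}^{n} w_i = 1$$

For a stationary process and an unbiased estimate, the sum of the weights is 1. Our goal is to find values for the weights subject to the constraint that they sum to 1. There are several estimation procedures that can be used. All estimation procedures require one to select a pattern of spatial continuity. The arithmetic average of all relevant points is possibly the simplest method.

$$\hat{V}(x_0) = \frac{1}{n}\sum_{i=1}^{n} V(x_i)$$

This method requires one to define the n points that influence the value of V at $x_0$ and then assign equal importance to each of these n points. For the arithmetic average, all of the weights are equal to 1/n. A second common approach is to use a weighting scheme based on the inverse of the distance.

$$\hat{V}(x_0) = \frac{\sum_{i=1}^{n} \dfrac{1}{d_i} V(x_i)}{\sum_{i=1}^{n} \dfrac{1}{d_i}}$$

where $d_i$ is the distance from $x_i$ to $x_0$. Again, the points to be included in the calculations must be defined. Points closer to $x_0$ have more influence on the estimate of $V(x_0)$ than more distant points. The inverse distance weighting can be generalized using a power, p, on the distance. For inverse distance squared weighting, p = 2.

$$\hat{V}(x_0) = \frac{\sum_{i=1}^{n} d_i^{-p}\, V(x_i)}{\sum_{i=1}^{n} d_i^{-p}}$$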
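The averaging and inverse distance schemes can be sketched with one function, since the arithmetic average is the special case p = 0. The function name is hypothetical; the distances and elevations are borrowed from the transect example later in this chapter (six points, three on either side of the midpoint at 0.625 miles) purely for illustration.

```python
def inverse_distance_estimate(distances, values, p=1.0):
    """Weighted average with weights proportional to distance ** (-p); p = 0 gives the mean."""
    if len(distances) != len(values) or not distances:
        raise ValueError("distances and values must be non-empty and of equal length")
    for d, v in zip(distances, values):
        if d == 0.0:
            return v                      # estimating exactly at a measured point
    raw = [d ** (-p) for d in distances]
    total = sum(raw)                      # normalizing makes the weights sum to 1
    return sum(w * v for w, v in zip(raw, values)) / total


distances = [0.625, 0.375, 0.125, 0.125, 0.375, 0.625]      # miles
elevations = [1630.0, 1660.0, 1640.0, 1710.0, 1640.0, 1590.0]  # feet

average = inverse_distance_estimate(distances, elevations, p=0.0)
idw = inverse_distance_estimate(distances, elevations, p=1.0)
idw_squared = inverse_distance_estimate(distances, elevations, p=2.0)
```

Raising p pulls the estimate toward the mean of the two nearest elevations (1640 and 1710 feet), illustrating the shift of emphasis toward closer points.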

For p > 1, more emphasis is placed on points closer to $x_0$ and the importance of more distant points is diminished.

An estimation process known as kriging (after D. G. Krige, a South African mining engineer and pioneer in the use of statistical methods in mineral evaluation) produces estimation weights that minimize the variance of the estimation errors. The variance of the errors, $\sigma_R^2$, is estimated by $s_R^2$ which is given by

This method requires knowledge of the true values of V at all of the $x_i$. We have only sample values of V at $x_i$. Because we want to estimate points at locations where we have no measurements, we will assume a model for the variance-covariance structure. We do this by using a semivariogram. Since we cannot minimize $s_R^2$, we will minimize the model error variance, $\sigma_R^2$, by setting the partial derivatives of the model error variance with respect to the weights to zero. Equation 17.24 indicates that $R(x_0) = \hat{V}(x_0) - V(x_0)$. In general, the variance of a weighted linear combination $\sum_{i=1}^{m} a_i Y_i$ is given by

$$\mathrm{Var}\left[\sum_{i=1}^{m} a_i Y_i\right] = \sum_{i=1}^{m}\sum_{j=1}^{m} a_i a_j\,\mathrm{Cov}(Y_i, Y_j) \qquad (17.30)$$

Applying this to $R(x_0)$ with m = 2, $a_1 = +1$, and $a_2 = -1$ results in

$$\mathrm{Var}(R(x_0)) = \mathrm{Var}(\hat{V}(x_0)) - 2\,\mathrm{Cov}(\hat{V}(x_0), V(x_0)) + \mathrm{Var}(V(x_0))$$

At this point we introduce the notation $C_{ij} = \mathrm{Cov}(V_i, V_j)$. Then

$$\mathrm{Var}(\hat{V}(x_0)) = \mathrm{Var}\left(\sum_{i=1}^{n} w_i V_i\right) = \sum_{i=1}^{n}\sum_{j=1}^{n} w_i w_j C_{ij}$$

We also have $\mathrm{Var}(V(x_0)) = \sigma^2$ and

$$\begin{aligned}
2\,\mathrm{Cov}(\hat{V}(x_0), V(x_0)) &= 2\,\mathrm{Cov}\left(\sum w_i V_i,\ V_0\right) \\
&= 2E\left(\sum w_i V_i V_0\right) - 2E\left(\sum w_i V_i\right)E(V_0) \\
&= 2\sum w_i E(V_i V_0) - 2\sum w_i E(V_i)E(V_0) \\
&= 2\sum w_i\,\mathrm{Cov}(V_i, V_0) \\
&= 2\sum w_i C_{i0}
\end{aligned}$$

Combining these results in

$$\sigma_R^2 = \sum_{i=1}^{n}\sum_{j=1}^{n} w_i w_j C_{ij} - 2\sum_{i=1}^{n} w_i C_{i0} + \sigma^2 \qquad (17.33)$$

We want to find the n $w_i$'s that minimize $\sigma_R^2$. Normally, we would do this by taking n partial derivatives with respect to the n weights and setting them equal to zero. However, in this case we have imposed a constraint that $\sum w_i = 1$. Therefore, we have a constrained optimization which we will solve by using Lagrangian multipliers.

$$\sigma_R^2 = \sum_{i=1}^{n}\sum_{j=1}^{n} w_i w_j C_{ij} - 2\sum_{i=1}^{n} w_i C_{i0} + \sigma^2 + 2\lambda\left(\sum_{i=1}^{n} w_i - 1\right)$$

We now have n + 1 unknowns, the n $w_i$'s and $\lambda$, the Lagrangian parameter. Taking the partial derivative with respect to $\lambda$ and setting it equal to zero results in $0 = 2(\sum w_i - 1)$ or $\sum w_i = 1$, which is the desired unbiasedness condition. We now take partial derivatives with respect to the $w_i$'s.

CHAPTER 17


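Only the derivative with respect to $w_1$ is shown. The others are similar. The third term of $\sigma_R^2$ has no $w_1$. For the first term, dropping all terms without $w_1$ results in

$$\frac{\partial}{\partial w_1}\left(\sum_{i=1}^{n}\sum_{j=1}^{n} w_i w_j C_{ij}\right) = \frac{\partial}{\partial w_1}\left(w_1^2 C_{11} + 2w_1\sum_{j=2}^{n} w_j C_{1j}\right) = 2\sum_{j=1}^{n} w_j C_{1j}$$

The second term has only one component with $w_1$

$$\frac{\partial}{\partial w_1}\left(-2\sum_{i=1}^{n} w_i C_{i0}\right) = -2C_{10}$$

The last term has only one $w_1$ component

$$\frac{\partial}{\partial w_1}\left(2\lambda\left(\sum_{i=1}^{n} w_i - 1\right)\right) = 2\lambda$$

Therefore

$$\frac{\partial \sigma_R^2}{\partial w_1} = 2\sum_{j=1}^{n} w_j C_{1j} - 2C_{10} + 2\lambda = 0 \qquad \text{or} \qquad \sum_{j=1}^{n} w_j C_{1j} + \lambda = C_{10}$$

In general, for all $w_i$'s

$$\sum_{j=1}^{n} w_j C_{ij} + \lambda = C_{i0} \quad \text{for } i = 1, \ldots, n \qquad (17.35)$$

These equations can be written in matrix notation as

$$\begin{bmatrix} C_{11} & \cdots & C_{1n} & 1 \\ \vdots & & \vdots & \vdots \\ C_{n1} & \cdots & C_{nn} & 1 \\ 1 & \cdots & 1 & 0 \end{bmatrix} \begin{bmatrix} w_1 \\ \vdots \\ w_n \\ \lambda \end{bmatrix} = \begin{bmatrix} C_{10} \\ \vdots \\ C_{n0} \\ 1 \end{bmatrix} \qquad (17.36)$$

or $\mathbf{C}\mathbf{w} = \mathbf{D}$, with solution $\mathbf{w} = \mathbf{C}^{-1}\mathbf{D}$.

So we arrive at a solution for the weights of our estimation equation (equation 17.23). Note that we have weights as a function of the $C_{ij}$, which are covariances. In practice, the $C_{ij}$ are generally not known and must be estimated from an assumed model giving C(h). The set of weights we have determined minimizes the error variance $\sigma_R^2$. The error variance can be determined by multiplying equation 17.35 by $w_i$ and adding the n equations to get

$$\sum_{i=1}^{n} w_i\sum_{j=1}^{n} w_j C_{ij} + \sum_{i=1}^{n} w_i\lambda = \sum_{i=1}^{n} w_i C_{i0}$$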

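Because the weights sum to 1, the second term is $\lambda$.

$$\sum_{i=1}^{n}\sum_{j=1}^{n} w_i w_j C_{ij} = \sum_{i=1}^{n} w_i C_{i0} - \lambda$$

Substituting this into equation 17.33 results in

$$\sigma_R^2 = \sigma^2 - \left(\sum_{i=1}^{n} w_i C_{i0} + \lambda\right) = \sigma^2 - \mathbf{w}'\mathbf{D} \qquad (17.40)$$

$\sigma_R^2$ is referred to as the ordinary kriging variance.

Continuing to assume a second-order stationary process, we can develop some relationships between the model semivariogram and the model covariances. We have previously defined the semivariogram by

$$\gamma_{ij} = \tfrac{1}{2}E\left[(V_i - V_j)^2\right]$$

from which we can get

$$\gamma_{ij} = \sigma^2 - C_{ij}$$

Also, because $\rho_{ij} = C_{ij}/\sigma^2$ we have

$$\rho_{ij} = 1 - \frac{\gamma_{ij}}{\sigma^2}$$

In terms of the semivariogram, we have

$$\sum_{j=1}^{n} w_j\gamma_{ij} - \lambda = \gamma_{i0} \quad \text{for } i = 1, \ldots, n$$

and

$$\sigma_R^2 = \sum_{i=1}^{n} w_i\gamma_{i0} - \lambda$$

In terms of the correlogram, we have

$$\sum_{j=1}^{n} w_j\rho_{ij} + \frac{\lambda}{\sigma^2} = \rho_{i0} \quad \text{for } i = 1, \ldots, n$$

and

$$\sigma_R^2 = \sigma^2\left(1 - \sum_{i=1}^{n} w_i\rho_{i0}\right) - \lambda$$

The estimates obtained by kriging have the advantage that an estimate for the variance of $\hat{V}(x_0)$ is given as $\sigma_R^2$. This estimate takes into account the covariance structure of points used in the estimation process through the semivariogram. Kriging produces minimum variance estimates. Equation 17.40 shows that the variance of an estimate made from equation 17.23 depends on the weights assigned and the covariance structure of the system. The averaging and weighted averaging methods of equations 17.26 to 17.27 do not depend on this covariance structure, so an estimate of the variance is generally not available for them.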

AN EXAMPLE

Let us consider the example transect presented earlier in this chapter. The actual data are contained in table 17.1. Figure 17.3 is the semivariogram. Figure 17.5 shows how a spherical and exponential semivariogram model fits the example data. In the spherical model (equation 17.17), the best fit model has a sill of 2620 and range of 2.6. For the exponential model (equation 17.18), the sill ($\sigma^2$) is 2755 and the range (R) is 3.1. The exponential model displayed the best fit and was chosen to represent the semivariogram.

The first thing we will do is to use ordinary kriging to estimate the elevation at the midpoints between the measured points beginning with a point at X = 0.625 miles and ending at X = 7.125 miles. We will use the exponential model to estimate the covariances required for equation 17.38. The covariances actually come from equation 17.15 as

$$C_h = \sigma^2 - \gamma_h$$

with $\gamma_h$ from equation 17.18. The C matrix of equation 17.36 will be based on 6 points, 3 on either side of the midpoint being estimated. Thus $\gamma_h$ for h = 0, 0.25, 0.50, 0.75, 1.00, and 1.25 are required. For the D matrix, h = 0.125, 0.375, and 0.625.

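Table 17.1. Example transect data

Distance (mi)   Elevation (ft)   Distance (mi)   Elevation (ft)   Distance (mi)   Elevation (ft)

The distances involved in determining the C and D matrices are

$$\text{distance for C} = \begin{bmatrix} 0 & 0.25 & 0.50 & 0.75 & 1.00 & 1.25 \\ 0.25 & 0 & 0.25 & 0.50 & 0.75 & 1.00 \\ 0.50 & 0.25 & 0 & 0.25 & 0.50 & 0.75 \\ 0.75 & 0.50 & 0.25 & 0 & 0.25 & 0.50 \\ 1.00 & 0.75 & 0.50 & 0.25 & 0 & 0.25 \\ 1.25 & 1.00 & 0.75 & 0.50 & 0.25 & 0 \end{bmatrix} \qquad \text{distance for D} = \begin{bmatrix} 0.625 \\ 0.375 \\ 0.125 \\ 0.125 \\ 0.375 \\ 0.625 \end{bmatrix}$$

To illustrate the calculation of an element of the C or D matrix, the value for $C_{2,3}$ will be determined. The separation distance for the points is 0.25 miles since they are adjacent points.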

The resulting C and D matrices are

The inverse of C is

The weights, $\mathbf{w}$, are

The elevation at the midpoints can now be determined from equation 17.23 as

Using this estimation, the elevation at 0.625 miles is

$E_m$ = 0.002548(1630) + 0.001052(1660) + 0.496400(1640) + 0.496400(1710) + 0.001052(1640) + 0.002548(1590) = 1674.6 feet

Fig. 17.7. Example transect fitted data (horizontal axis: transect distance in miles).

The standard error of the elevation, $\sigma_R$, can be determined from equation 17.40 as

$\sigma_R$ = 18.22 feet
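The ordinary kriging calculation for the midpoint at 0.625 miles can be sketched in plain Python. Several things here are assumptions rather than statements from the text: the exponential semivariogram is taken as $\gamma_h = \sigma^2[1 - \exp(-3h/R)]$ with no nugget (equation 17.18 is not reproduced in this section), the six measured points are placed at 0, 0.25, ..., 1.25 miles (inferred from the quoted distances), and all function names are hypothetical. The small Gaussian-elimination solver stands in for whatever matrix routine is actually used.

```python
import math


def exp_covariance(h, sill, range_):
    """C(h) = sill - gamma(h), with gamma(h) = sill * (1 - exp(-3h / range)) (assumed form)."""
    return sill * math.exp(-3.0 * abs(h) / range_)


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def ordinary_kriging(xs, values, x0, sill, range_):
    """Return (estimate, kriging variance, weights) at x0 from points xs."""
    n = len(xs)
    # C matrix augmented with the unbiasedness row and column (equation 17.36)
    c = [[exp_covariance(xs[i] - xs[j], sill, range_) for j in range(n)] + [1.0]
         for i in range(n)]
    c.append([1.0] * n + [0.0])
    d = [exp_covariance(xs[i] - x0, sill, range_) for i in range(n)] + [1.0]
    solution = solve(c, d)
    weights, lam = solution[:n], solution[n]
    estimate = sum(w * v for w, v in zip(weights, values))
    variance = sill - sum(w * di for w, di in zip(weights, d[:n])) - lam  # equation 17.40
    return estimate, variance, weights


xs = [0.0, 0.25, 0.50, 0.75, 1.00, 1.25]                        # miles (assumed positions)
elevations = [1630.0, 1660.0, 1640.0, 1710.0, 1640.0, 1590.0]   # feet
estimate, variance, weights = ordinary_kriging(xs, elevations, 0.625, 2755.0, 3.1)
std_error = math.sqrt(variance)

# Estimating at a measured point returns the measured value with zero variance
exact_estimate, exact_variance, _ = ordinary_kriging(xs, elevations, 0.50, 2755.0, 3.1)
```

Under these assumptions the estimate and standard error can be compared with the 1674.6 feet and 18.22 feet reported above, and the weights with those in the elevation calculation.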

The estimates for elevation are shown in figure 17.7. The ordinary kriging estimates for the midpoint are nearly equal to the average of the values on either side of the midpoint. Ordinary kriging was also completed using the best fit combination semivariogram model shown in figure 17.6, which combines two spherical models. If this best fit combination model is used, the estimate at 0.625 miles is 1674.8 feet with a standard error of 19.27 feet, almost identical to the basic exponential model estimate. The basic model, therefore, is more desirable because it is much easier to use.

To get the elevation at some nonmidpoint location, the same procedure could be followed. The C matrix would be unchanged. If an estimate is desired 40% of the way from one point to another, or 0.1 mile, the distances in the D matrix would be 0.60, 0.35, 0.10, 0.15, 0.40, and 0.65 mile. The D matrix would be

and

$$\mathbf{w} = \mathbf{C}^{-1}\mathbf{D}$$

or

The estimate for the elevation is still largely dependent on the elevations immediately on either side of the point but more weight is given to the closer known elevation.

If we use this method to estimate the elevation right on a known point, the C matrix remains unchanged. The distances in D would be (0.50, 0.25, 0, 0.25, 0.50, 0.75), so D is identical to the third column of C, resulting in $w_3 = 1$ with all other weights and $\lambda$ equal to zero, or the elevation is predicted to be exactly itself. Ordinary kriging predicts exact values at measured points. The variance at this point is

$$\sigma_R^2 = \sigma^2 - \mathbf{w}'\mathbf{D} = \sigma^2 - \sigma^2 = 0$$

The variance at the estimated point is zero if the estimated point is also a point of measurement.

Finally, consider a three-dimensional situation with points as shown in figure 17.8 having attributes

Point   x   y   Elevation

Fig. 17.8. Three-dimensional example.

The distance between points is determined from equation 17.1, resulting in a distance matrix of

$$\begin{bmatrix} 0 & 1.03 & 1.12 \\ 1.03 & 0 & 0.90 \\ 1.12 & 0.90 & 0 \end{bmatrix}$$

for C and a vector of distances from the three points to P for D (0.64 is one of its entries).

Continuing to use the semivariogram from the example transect data, the resulting C and D matrices are

and

The elevation of the unknown point P is

$E_P$ = 0.497(1610) + 0.125(1700) + 0.378(1750) = 1674 feet

with a standard error determined from

$$\sigma_R^2 = \sigma^2 - \mathbf{w}'\mathbf{D} = 1370.4$$

$\sigma_R$ = 37.0 feet

These examples indicate that point estimation via kriging requires the use of a semivariogram model. The estimates obtained are dependent on the model chosen. Earlier it was indicated that the behavior of the semivariogram model near the estimation point has a large influence on the estimates that are made. The weights resulting from the kriging process can also be seen to be a function of distance from $x_0$. Also, if a basic model and a combination model both satisfactorily characterize the structure of the semivariogram, use the basic model.

ANISOTROPY

Thus far we have considered only isotropic situations, in which the semivariogram is a function of distance but not direction. For an anisotropic situation, both the distance between points and the angular direction from one point to another are of concern. Thus only points within the sector $h \pm \Delta h$ and $\theta \pm \Delta\theta$ would be included, where $h$ is the separation distance and $\theta$ the angular direction between points. The semivariogram should then be indicated as $\gamma_{h,\theta}$.

Fig. 17.9. Semivariograms demonstrating (a) geometric anisotropy, (b) zonal anisotropy due to stratification, and (c) zonal anisotropy due to areal trends. The panels plot semivariograms for the horizontal and vertical directions against distance, h.

It is possible that the nugget, sill, or range of a semivariogram model is directionally dependent. Isaaks and Srivastava (1989) discuss this situation. If the range changes with direction but not the sill or nugget, and a plot of range versus directional angle is an ellipse, the anisotropy is known as geometric anisotropy. If only the sill changes with direction, zonal anisotropy is said to exist (Goovaerts, 1997). Kupfersberger and Deutsch (1999) describe two types of zonal anisotropy. If the sill is greater in the vertical direction, the anisotropy is characterized as zonal due to stratification. If the sill is greater in the horizontal direction, the semivariogram is said to be zonal due to areal trends. Figure 17.9 shows the semivariograms for each of these types of anisotropies.

If anisotropy is present, the orientation of the axes of anisotropy must be identified. Physical characteristics of the area should be a great aid in this step. Geologic features, prevailing winds, and predominant direction of storm movements may contribute to anisotropy of different dependent variables and may be an aid in determining axes of anisotropy. Anisotropy can often be identified from contour plots. For an anisotropic situation, the contour lines tend to be ellipses, whereas for isotropic cases, the contours are circular. With sufficient data, one might compute directional semivariograms for several different directions in an effort to identify the required axes. If anisotropy is identified from semivariograms, supporting evidence from the field should be sought.

Isaaks and Srivastava (1989) and Goovaerts (1997) suggest that one might combine directional semivariograms into a single semivariogram by using a transformation that standardizes the range to one.
To do this the separation distances must be transformed so that the standardized model will provide a semivariogram value that is identical to any of the directional models for the given separation distance. Davis (1986) and Goovaerts (1997) discuss kriging in situations where there is nonstationarity in the form of a drift or trend in the mean value of the variable. That is, the mean is changing with location. One method to model this is to remove the drift to produce a variable stationary in the mean and then apply kriging to this stationary variable. For estimation, the drift has to be added back to the estimate based on the stationary semivariogram. Davis (1986) illustrates the treatment of drift by considering points in the near vicinity of the point where an estimate is to be made. In this way a "local" drift is considered.
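The range-standardizing transformation mentioned above for geometric anisotropy can be sketched as follows. This is a minimal illustration, not the procedure given in the cited references: the function name, the argument names, and the convention that the major axis angle is measured counterclockwise from the +x axis are all assumptions made here. The separation vector is rotated into the principal axes of the range ellipse and each component is divided by the range in that direction, so that a single model with a range of one can serve every direction.

```python
import math

def reduced_distance(dx, dy, range_major, range_minor, major_axis_deg):
    """Separation distance rescaled so that every direction has a range of one.

    dx, dy          components of the separation vector between two points
    range_major     semivariogram range along the major axis of anisotropy
    range_minor     semivariogram range along the minor axis
    major_axis_deg  direction of the major axis, degrees counterclockwise from +x
                    (an assumed convention)
    """
    theta = math.radians(major_axis_deg)
    # Rotate the separation vector into the principal axes of the range ellipse.
    u = dx * math.cos(theta) + dy * math.sin(theta)    # component along the major axis
    v = -dx * math.sin(theta) + dy * math.cos(theta)   # component along the minor axis
    # Scale each component by its directional range and recombine.
    return math.hypot(u / range_major, v / range_minor)

# Hypothetical ranges: 4.0 along an axis 30 degrees from +x, 2.0 across it.
angle = math.radians(30.0)
h_along = reduced_distance(4.0 * math.cos(angle), 4.0 * math.sin(angle), 4.0, 2.0, 30.0)
h_across = reduced_distance(-2.0 * math.sin(angle), 2.0 * math.cos(angle), 4.0, 2.0, 30.0)
print(h_along, h_across)
```

Both separations in the usage lines sit exactly at the directional range, so both reduced distances come out at one (to floating-point precision), which is the property the standardized model relies on.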

COKRIGING

Estimates based on kriging use only information contained in the variable of interest. If this variable is correlated with a second variable, information on this second variable, if incorporated into the estimation process, may improve the estimates. For example, in estimating soil nitrogen levels over a field, it may be found that soil nitrogen is correlated with soil phosphorus. It would be reasonable to use this correlation, along with sampled soil phosphorus values in addition to the soil nitrogen values that have been measured, to estimate soil nitrogen at unsampled locations. The estimation equation becomes

Here V(.) might represent soil nitrogen and U(.) soil phosphorus.
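As a sketch only, the cokriging estimator is commonly written as a weighted linear combination of the samples of both variables. The form below assumes n sampled values of V and m sampled values of U, and it assumes that the weights $a_i$ attach to V and the weights $b_j$ to U; the summation limits and that pairing are illustrative assumptions rather than notation confirmed by the text.

```latex
\hat{V}(x_0) = \sum_{i=1}^{n} a_i \, V(x_i) + \sum_{j=1}^{m} b_j \, U(x_j)
```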


As with kriging, in cokriging weights $a_i$ and $b_j$ that minimize the error variance in the estimate are sought. Isaaks and Srivastava (1989) should be consulted for additional explanation and details.

LOCAL AND GLOBAL ESTIMATION

Often, we need an estimate of some physical quantity that is representative of an area such as a field or catchment rather than at a point. Such an estimate is termed a global estimate if the area encompasses a large part of the data field. If only a part of the data field is under consideration, the estimate might be termed a local area estimate or a local estimate. One simple global estimate is the average of all the available data concerning the quantity of interest. Such an estimate might be very good if data from several points were available and each sample point represented approximately the same fraction of the total area. For this to be the case, the sample locations would have to have been carefully selected, most likely on a regular, gridded pattern.

In many sets of data, sample locations tend to be clustered in some areas and very sparse in others. In such a situation, a simple average would place more weight on the value of the property of interest where data clustering existed and little weight on the value(s) where only isolated data existed. If some degree of spatial continuity existed in the data, the area where clustering occurred would be over-represented in the resulting estimate of the average. Of course, if absolutely no spatial continuity existed (that is, if the data were completely independent of each other with respect to location), then a straight average would be as good as any other averaging process because information at one location would be independent of that at any other. In keeping with the purpose of this chapter, we will assume some spatial continuity exists in the data. This is equivalent to saying that the semivariogram is not purely a nugget model.

We will develop a global estimate as a weighted linear combination of the available sample values. Global estimation generally refers to estimation over a large part, possibly all, of the sample space. Local estimation refers to estimation over an area that is a part of the sample space. Global estimation is often done by polygon declustering or cell declustering. Local area estimation is generally done by using averages of point kriging or by using block kriging. Having said this, there are no hard and fast rules as to what is global and what is local in the sense used here. Thus, local area estimates may be done by defining the "local" area to be the desired global area of interest. In this case, large covariance matrices with many small or zero covariances may be involved. If there are many observations in the area of interest, global estimation techniques are often appropriate. If there are few data in the area of interest, local estimation techniques may be used.

Polygon Declustering

The first method we will consider for global estimation is familiar to hydrologists and climatologists who have estimated total rainfall over an area based on point rainfall measurements. Hydrologists call the method the Thiessen polygon method. In geostatistics it is often simply called polygon declustering. The global estimate is given by

where the weight, $w_i$, is the area that lies closer to the point $x_i$ than to any other sample point. This area is determined by drawing polygons whose sides are the perpendicular bisectors of lines connecting the various sample points.
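The polygon weights can be approximated numerically without constructing the polygons. The sketch below is an illustration under stated assumptions, not the Thiessen construction itself: it lays a fine grid over a rectangular study area, assigns each grid node to its nearest sample, and takes the fraction of nodes assigned to a sample as that sample's weight. The function names, the rectangular area, the grid resolution, and the sample data are all hypothetical.

```python
def polygon_declustering_weights(points, xmin, xmax, ymin, ymax, n=100):
    """Approximate area-of-influence weights, normalized to sum to one."""
    counts = [0] * len(points)
    dx = (xmax - xmin) / n
    dy = (ymax - ymin) / n
    for i in range(n):
        gx = xmin + (i + 0.5) * dx          # grid node at the cell center
        for j in range(n):
            gy = ymin + (j + 0.5) * dy
            # Index of the sample closest to this grid node.
            nearest = min(
                range(len(points)),
                key=lambda k: (points[k][0] - gx) ** 2 + (points[k][1] - gy) ** 2,
            )
            counts[nearest] += 1
    total = n * n
    return [c / total for c in counts]

def declustered_mean(values, weights):
    """Weighted linear combination of the sample values."""
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Hypothetical data: four clustered gauges and one isolated gauge in a 10 x 10 area.
points = [(1.0, 1.0), (1.5, 1.0), (1.0, 1.5), (1.5, 1.5), (8.0, 8.0)]
values = [10.0, 11.0, 12.0, 11.0, 30.0]

weights = polygon_declustering_weights(points, 0.0, 10.0, 0.0, 10.0)
simple_mean = sum(values) / len(values)
global_estimate = declustered_mean(values, weights)
print(weights)
print(simple_mean, global_estimate)
```

In this made-up configuration the isolated gauge receives a much larger weight than any one of the clustered gauges, so the declustered mean moves toward its value and away from the simple average.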


With this method, in areas where samples are closely spaced, the corresponding weights are small. In locations where data are sparse, the weights are large.

Cell Declustering

Cell declustering is a two-step process. The total area is divided into regular cells all of the same size. The average value of sample points within a cell is taken as the cell value. The average of the cell values is then taken as the global estimate of the property of interest. Thus, in cells where samples are clustered, each sample value enters the calculations for the cell mean and has a weight inversely proportional to the number of samples in the cell. For example, in a cell with 10 samples, each sample value would have a weight of 0.1 with respect to the cell average. In a cell with only one point, a weight of 1 would be assigned to that data point in computing the cell average. Because the global average is a simple average of cell averages, the relationship among the data point weights would carry over into the global average.

Point Kriging

A straightforward procedure that uses the covariance structure of the sample points but may require considerable calculation is to define many points within the global area, likely on a regular grid to ensure equal coverage; estimate the value of the quantity of interest by ordinary kriging at each point; and then compute the simple average of these kriged estimates.

Block Kriging

Block kriging is similar to point kriging except the matrix D in equation 17.36 contains the covariance values between the random variables $V_i$ at the sample locations and the blocks. These point-to-block covariances are given by (Isaaks and Srivastava, 1989)

\[
C_{iB} = \frac{1}{n_j} \sum_{j=1}^{n_j} \mathrm{Cov}(V_j, V_i)
\]

where the $V_j$'s are the sample values in the block and $n_j$ is the number of such points. Thus, the point-to-block covariance is the average of the $n_j$ covariances of the $V_j$ in the block and the sample point in question. Recall that $\mathrm{Cov}(V_j, V_i)$ is estimated using a semivariogram model. Block estimation is then done using

\[
\hat{V}_B = \sum_{i} w_i V_i
\]

where the $w_i$ are given by equation 17.39 using $C_{iB}$ in equation 17.37 rather than $C_{i0}$. Isaaks and Srivastava (1989) show that block estimation via averaging point kriged estimates as explained above and block kriging are the same.
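The equivalence of block kriging and averaged point kriging can be checked numerically. The sketch below is an illustration only. It assumes a hypothetical exponential covariance model, hypothetical sample locations and values, and a block represented by four points; none of these come from the text. The ordinary kriging system is solved with a small Gaussian elimination routine so that the example needs only the standard library.

```python
import math

def covariance(p, q, sill=1.0, practical_range=3.0):
    """Hypothetical exponential covariance model, C(h) = sill * exp(-3h / range)."""
    return sill * math.exp(-3.0 * math.dist(p, q) / practical_range)

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def kriging_weights(samples, d_vector):
    """Ordinary kriging weights for a given vector of covariances to the target."""
    n = len(samples)
    # Covariance matrix bordered by ones for the sum-to-one constraint.
    c = [[covariance(si, sj) for sj in samples] + [1.0] for si in samples]
    c.append([1.0] * n + [0.0])
    return solve(c, d_vector + [1.0])[:n]   # drop the Lagrange multiplier

samples = [(0.0, 0.0), (2.0, 0.5), (1.0, 2.0), (3.0, 3.0)]
values = [12.0, 15.0, 11.0, 18.0]
block = [(1.0, 1.0), (1.5, 1.0), (1.0, 1.5), (1.5, 1.5)]  # points representing the block

# Block kriging: each entry of D is the average covariance between a sample and the block.
d_block = [sum(covariance(s, b) for b in block) / len(block) for s in samples]
w_block = kriging_weights(samples, d_block)
block_estimate = sum(w * v for w, v in zip(w_block, values))

# Point kriging at each block point, followed by a simple average.
point_estimates = []
for b in block:
    w_point = kriging_weights(samples, [covariance(s, b) for s in samples])
    point_estimates.append(sum(w * v for w, v in zip(w_point, values)))
averaged_estimate = sum(point_estimates) / len(block)

print(block_estimate, averaged_estimate)
```

The two estimates agree to rounding error because the kriging weights are a linear function of D, so averaging the right-hand sides before solving gives the same result as averaging the solutions. Block kriging reaches that result with one solution of the system instead of one per block point.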

ESTIMATION OF CUMULATIVE DISTRIBUTIONS

Estimates of cumulative distributions and moments of distributions of properties can be made globally or over local areas using the techniques discussed in the previous section of this chapter. Let us consider an empirical estimate of the cdf, or the probability of a value, V(x), falling below some value $x_0$.


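Recall that $P_V(x_0) = \mathrm{prob}(V(x) \le x_0)$. Further, suppose we are looking for a global estimate of this quantity. The procedure is to replace all actual $V(x_i)$ with an indicator variable $I(x_i)$ such that

\[
I(x_i) = \begin{cases} 1 & \text{if } V(x_i) \le x_0 \\ 0 & \text{otherwise} \end{cases}
\]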

Using this transformation, the mean of I is the fraction of the values in the sample that are less than or equal to $x_0$. This estimate does not take into account clustering and is a rather crude estimate of $P_V(x_0)$, just as the mean of the sample values is a crude estimate of the global mean because sample location and clustering are not considered. We can decluster the estimate for $P_V(x_0)$ just as we used declustering to estimate a global mean. We do this by letting

where the weights are determined as before using the polygon or cell declustering technique. Similarly, block kriging can be used to get local area estimates using

where $w_i$ are the kriging weights and the summation is over all appropriate points for the block of interest. By repeatedly using this approach while letting $x_0$ range incrementally from the minimum to the maximum sample value, the entire empirical cdf can be estimated. From the definition of the pdf and the cdf, the pdf can then be estimated, as can the probabilities of estimates falling in various ranges. Similarly, we can define moment estimates other than the mean. For example, the variance is the average squared deviation from the mean. A global estimate for the variance is

where V(x) is the global mean. The skewness might be estimated from

Analogous results can be used for the estimation of local area moments using weights based on kriging.

UNCERTAINTY

We have discussed the estimation of point values, mean values, empirical probability distributions, and various sample moments. One might logically inquire as to how good these various estimates are. In statistics, we often quantify uncertainty in estimates in terms of confidence intervals, as previously discussed in this book. The width of a confidence interval is a measure of the uncertainty associated with a particular estimate. In a theoretical sense we need to know the underlying pdf of a process to determine confidence intervals. In a practical sense we never know the pdf, so we have to use an estimation procedure. We have already seen how we can estimate an empirical cdf for points, blocks, or local areas. If we desire a $100(1 - \alpha)\%$ confidence interval, we can determine it by finding the $C_1$ and $C_2$ satisfying

\[
\mathrm{prob}(Q \le C_1) = \alpha/2
\]

\[
\mathrm{prob}(Q \le C_2) = 1 - \alpha/2
\]

where Q is the quantity of interest. The values of $C_1$ and $C_2$ represent the lower and upper confidence limits such that

\[
\mathrm{prob}(C_1 \le Q \le C_2) = 1 - \alpha \qquad (17.55)
\]

$C_1$ and $C_2$ may be determined from the empirically determined cdf of Q. Alternatively, if one is willing to assume a pdf for Q, the parameters of this pdf can be estimated by the method of moments and $C_1$ and $C_2$ determined analytically from the pdf. Often, a normal distribution is assumed, requiring knowledge of a mean and variance. Equation 17.55 can be solved for $C_1$ and $C_2$ based on the assumed distribution. In either case, the confidence limits are approximate but convey considerable information in terms of the uncertainty associated with the estimated quantity, Q.
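Both routes to the confidence limits can be sketched in a few lines. The example below is illustrative and rests on several assumptions that are not in the text: the weights come from cell declustering with only occupied cells counted, the limits are read from the weighted empirical cdf as the smallest sample values reaching the two tail probabilities, the interval is described by a `coverage` argument (the probability that Q lies between the limits), and all function names and data are hypothetical.

```python
import math
from collections import defaultdict
from statistics import NormalDist

def cell_declustering_weights(points, cell_size):
    """Weight of each sample = 1 / (samples in its cell * occupied cells)."""
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        cells[(math.floor(x / cell_size), math.floor(y / cell_size))].append(idx)
    weights = [0.0] * len(points)
    for members in cells.values():
        for idx in members:
            weights[idx] = 1.0 / (len(members) * len(cells))
    return weights

def weighted_cdf(values, weights, x0):
    """Declustered estimate of prob(V <= x0): the weighted mean of the indicator."""
    return sum(w for v, w in zip(values, weights) if v <= x0) / sum(weights)

def weighted_quantile(values, weights, p):
    """Smallest sample value whose weighted cdf is at least p."""
    total = sum(weights)
    running = 0.0
    for v, w in sorted(zip(values, weights)):
        running += w
        if running >= p * total - 1e-12:
            return v
    return max(values)

def empirical_limits(values, weights, coverage):
    """Lower and upper limits read from the weighted empirical cdf."""
    tail = (1.0 - coverage) / 2.0
    return (weighted_quantile(values, weights, tail),
            weighted_quantile(values, weights, 1.0 - tail))

def normal_limits(values, weights, coverage):
    """Limits from a normal pdf fitted by the method of moments."""
    total = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total
    variance = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    dist = NormalDist(mean, math.sqrt(variance))
    tail = (1.0 - coverage) / 2.0
    return dist.inv_cdf(tail), dist.inv_cdf(1.0 - tail)

# Hypothetical data: four clustered samples in one cell, three isolated samples.
points = [(0.5, 0.5), (0.6, 0.4), (0.4, 0.7), (0.7, 0.6), (1.5, 0.5), (0.5, 1.5), (1.5, 1.5)]
values = [20, 22, 21, 23, 10, 12, 8]

weights = cell_declustering_weights(points, cell_size=1.0)
cdf_at_12 = weighted_cdf(values, weights, 12)
limits_empirical = empirical_limits(values, weights, coverage=0.5)
limits_normal = normal_limits(values, weights, coverage=0.5)
print(cdf_at_12, limits_empirical, limits_normal)
```

With these made-up data the unweighted fraction of values at or below 12 is 3/7, while the declustered estimate is 0.75, because the four clustered samples together carry no more weight than one isolated sample. Repeating `weighted_cdf` over a range of thresholds traces out the whole empirical cdf.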

MODELING USING GEOSTATISTICS

Many hydrologic properties can be treated as random variables by using the spatial variability identified with the use of geostatistics. Much research has been completed using field and laboratory data and well logs to model the spatial characteristics of hydrologic variables such as land surface (Oliver and Webster 1986), permeability (Kupfersberger and Deutsch 1999), fractures (Chilès and Gentier 1993; Tavchandjian et al. 1993), hydraulic conductivity (Kupfersberger and Deutsch 1999; Schafmeister and Pekdeger 1993), and geochemistry (Couto et al. 1997; Miesch 1975). Recent research has also explored the use of the geostatistics of ground-penetrating radar (Tercier et al. 2000), computer tomography (Grevers and de Jong 1994; Ioannidis et al. 1999), remote sensing (Jupp et al. 1989), and other image analysis data (Muge et al. 1993) to characterize soil properties.

Appendixes

A.1. COMMON DISTRIBUTIONS

Hypergeometric Distribution

Binomial Distribution

Geometric Distribution

\[
f_X(x) = p(1 - p)^{x - 1}, \qquad x = 1, 2, 3, \ldots
\]


Negative Binomial Distribution

Poisson Distribution

Uniform Distribution

Triangular Distribution


Exponential Distribution
