Sampling, Regression, Experimental Design and Analysis for Environmental Scientists, Biologists, and Resource Managers
C. J. Schwarz, Department of Statistics and Actuarial Science, Simon Fraser University
[email protected]
December 21, 2012
Contents

1 In the beginning...  15
  1.1 Introduction  15
  1.2 Effective note taking strategies  16
  1.3 It's all Γρεεκ to me  19
  1.4 Which computer package?  19
  1.5 FAQ - Frequently Asked Questions  28
    1.5.1 Accessing journal articles from home  28
    1.5.2 Downloading from the web  28
    1.5.3 Printing 2 pages per physical page and on both sides of the paper  28
    1.5.4 Is there an on-line textbook?  29
2 Introduction to Statistics  30
  2.1 TRRGET - An overview of statistical inference  31
  2.2 Parameters, Statistics, Standard Deviations, and Standard Errors  34
    2.2.1 A review  34
    2.2.2 Theoretical example of a sampling distribution  39
  2.3 Confidence Intervals  41
    2.3.1 A review  42
    2.3.2 Some practical advice  48
    2.3.3 Technical details  49
  2.4 Hypothesis testing  50
    2.4.1 A review  50
    2.4.2 Comparing the population parameter against a known standard  51
    2.4.3 Comparing the population parameter between two groups  58
    2.4.4 Type I, Type II and Type III errors  62
    2.4.5 Some practical advice  65
    2.4.6 The case against hypothesis testing  66
    2.4.7 Problems with p-values - what does the literature say?  69
      Statistical tests in publications of the Wildlife Society  69
      The Insignificance of Statistical Significance Testing  69
      Followups  71
  2.5 Meta-data  71
    2.5.1 Scales of measurement  71
    2.5.2 Types of Data  73
    2.5.3 Roles of data  74
  2.6 Bias, Precision, Accuracy  74
  2.7 Types of missing data  77
  2.8 Transformations  79
    2.8.1 Introduction  79
    2.8.2 Conditions under which a log-normal distribution appears  80
    2.8.3 ln() vs. log()  80
    2.8.4 Mean vs. Geometric Mean  81
    2.8.5 Back-transforming estimates, standard errors, and ci  82
      Mean on log-scale back to MEDIAN on anti-log scale  82
    2.8.6 Back-transforms of differences on the log-scale  83
    2.8.7 Some additional readings on the log-transform  84
  2.9 Standard deviations and standard errors revisited  95
  2.10 Other tidbits  104
    2.10.1 Interpreting p-values  104
    2.10.2 False positives vs. false negatives  104
    2.10.3 Specificity/sensitivity/power  104
3 Sampling  106
  3.1 Introduction  108
    3.1.1 Difference between sampling and experimental design  108
    3.1.2 Why sample rather than census?  109
    3.1.3 Principal steps in a survey  109
    3.1.4 Probability sampling vs. non-probability sampling  110
    3.1.5 The importance of randomization in survey design  111
    3.1.6 Model vs. Design based sampling  114
    3.1.7 Software  115
  3.2 Overview of Sampling Methods  115
    3.2.1 Simple Random Sampling  115
    3.2.2 Systematic Surveys  117
    3.2.3 Cluster sampling  120
    3.2.4 Multi-stage sampling  124
    3.2.5 Multi-phase designs  126
    3.2.6 Panel design - suitable for long-term monitoring  128
    3.2.7 Sampling non-discrete objects  129
    3.2.8 Key considerations when designing or analyzing a survey  129
  3.3 Notation  131
  3.4 Simple Random Sampling Without Replacement (SRSWOR)  131
    3.4.1 Summary of main results  132
    3.4.2 Estimating the Population Mean  133
    3.4.3 Estimating the Population Total  133
    3.4.4 Estimating Population Proportions  134
    3.4.5 Example - estimating total catch of fish in a recreational fishery  134
      What is the population of interest?  136
      What is the frame?  137
      What is the sampling design and sampling unit?  137
      Excel analysis  137
      SAS analysis  140
  3.5 Sample size determination for a simple random sample  141
    3.5.1 Example - How many angling-parties to survey  144
  3.6 Systematic sampling  147
    3.6.1 Advantages of systematic sampling  148
    3.6.2 Disadvantages of systematic sampling  148
    3.6.3 How to select a systematic sample  148
    3.6.4 Analyzing a systematic sample  149
    3.6.5 Technical notes - Repeated systematic sampling  149
      Example of replicated subsampling within a systematic sample  149
  3.7 Stratified simple random sampling  152
    3.7.1 A visual comparison of a simple random sample vs. a stratified simple random sample  154
    3.7.2 Notation  162
    3.7.3 Summary of main results  162
    3.7.4 Example - sampling organic matter from a lake  164
    3.7.5 Example - estimating the total catch of salmon  168
      What is the population of interest?  169
      What is the sampling frame?  169
      What is the sampling design?  170
      Excel analysis  170
      SAS analysis  173
      When should the various estimates be used?  175
    3.7.6 Sample Size for Stratified Designs  177
    3.7.7 Allocating samples among strata  180
    3.7.8 Example: Estimating the number of tundra swans  183
    3.7.9 Post-stratification  187
    3.7.10 Allocation and precision - revisited  189
  3.8 Ratio estimation in SRS - improving precision with auxiliary information  190
    3.8.1 Summary of Main results  192
    3.8.2 Example - wolf/moose ratio  192
      Excel analysis  194
      SAS Analysis  198
      Post mortem  201
    3.8.3 Example - Grouse numbers - using a ratio estimator to estimate a population total  201
      Excel analysis  203
      SAS analysis  206
      Post mortem - a question to ponder  209
  3.9 Additional ways to improve precision  210
    3.9.1 Using both stratification and auxiliary variables  210
    3.9.2 Regression Estimators  211
    3.9.3 Sampling with unequal probability - pps sampling  211
  3.10 Cluster sampling  212
    3.10.1 Sampling plan  213
    3.10.2 Advantages and disadvantages of cluster sampling compared to SRS  219
    3.10.3 Notation  219
    3.10.4 Summary of main results  220
    3.10.5 Example - estimating the density of urchins  221
      Excel Analysis  222
      SAS Analysis  225
      Planning for future experiments  227
    3.10.6 Example - estimating the total number of sea cucumbers  227
      SAS Analysis  231
  3.11 Multi-stage sampling - a generalization of cluster sampling  235
    3.11.1 Introduction  235
    3.11.2 Notation  236
    3.11.3 Summary of main results  237
    3.11.4 Example - estimating number of clams  238
      Excel Spreadsheet  240
      SAS Program  240
    3.11.5 Some closing comments on multi-stage designs  242
  3.12 Analytical surveys - almost experimental design  242
  3.13 References  246
  3.14 Frequently Asked Questions (FAQ)  247
    3.14.1 Confusion about the definition of a population  247
    3.14.2 How is N defined  248
    3.14.3 Multi-stage vs. Multi-phase sampling  248
    3.14.4 What is the difference between a Population and a frame?  249
    3.14.5 How to account for missing transects  249
4 Designed Experiments - Terminology and Introduction  250
  4.1 Terminology and Introduction  251
    4.1.1 Definitions  251
    4.1.2 Treatment, Experimental Unit, and Randomization Structure  252
    4.1.3 The Three R's of Experimental Design  255
    4.1.4 Placebo Effects  257
    4.1.5 Single and double blinding  257
    4.1.6 Hawthorne Effect  258
  4.2 Applying some General Principles of Experimental Design  258
    4.2.1 Experiment 1  259
    4.2.2 Experiment 2  259
    4.2.3 Experiment 3  259
    4.2.4 Experiment 4  260
    4.2.5 Experiment 5  260
  4.3 Some Case Studies  261
    4.3.1 The Salk Vaccine Experiment  261
    4.3.2 Testing Vitamin C - Mistakes do happen  262
  4.4 Key Points in Design of Experiments  262
    4.4.1 Designing an Experiment  263
    4.4.2 Analyzing the data  264
    4.4.3 Writing the Report  264
  4.5 A Road Map to What is Ahead  265
    4.5.1 Introduction  265
    4.5.2 Experimental Protocols  265
    4.5.3 Some Common Designs  267
5 Single Factor - Completely Randomized Designs (a.k.a. One-way design)  272
  5.1 Introduction  273
  5.2 Randomization  274
    5.2.1 Using a random number table  275
      Assigning treatments to experimental units  275
      Selecting from the population  276
    5.2.2 Using a computer  276
      Randomly assign treatments to experimental units  277
      Randomly selecting from populations  281
  5.3 Assumptions - the overlooked aspect of experimental design  285
    5.3.1 Does the analysis match the design?  286
    5.3.2 No outliers should be present  286
    5.3.3 Equal treatment group population standard deviations?  287
    5.3.4 Are the errors normally distributed?  288
    5.3.5 Are the errors independent?  289
  5.4 Two-sample t-test - Introduction  289
  5.5 Example - comparing mean heights of children - two-sample t-test  290
  5.6 Example - Fat content and mean tumor weights - two-sample t-test  297
  5.7 Example - Growth hormone and mean final weight of cattle - two-sample t-test  303
  5.8 Power and sample size  310
    5.8.1 Basic ideas of power analysis  311
    5.8.2 Prospective Sample Size determination  312
    5.8.3 Example of power analysis/sample size determination  313
      Using tables  314
      Using a package to determine power  314
    5.8.4 Further Readings on Power analysis  319
    5.8.5 Retrospective Power Analysis  320
    5.8.6 Summary  321
  5.9 ANOVA approach - Introduction  322
    5.9.1 An intuitive explanation for the ANOVA method  323
    5.9.2 A modeling approach to ANOVA  328
  5.10 Example - Comparing phosphorus content - single-factor CRD ANOVA  331
  5.11 Example - Comparing battery lifetimes - single-factor CRD ANOVA  343
  5.12 Example - Cuckoo eggs - single-factor CRD ANOVA  353
  5.13 Multiple comparisons following ANOVA  366
    5.13.1 Why is there a problem?  366
    5.13.2 A simulation with no adjustment for multiple comparisons  367
    5.13.3 Comparisonwise- and Experimentwise Errors  369
    5.13.4 The Tukey-Adjusted t-Tests  370
    5.13.5 Recommendations for Multiple Comparisons  372
    5.13.6 Displaying the results of multiple comparisons  373
  5.14 Prospective Power and sample size - single-factor CRD ANOVA  375
    5.14.1 Using Tables  376
    5.14.2 Using SAS to determine power  377
    5.14.3 Retrospective Power Analysis  378
  5.15 Pseudo-replication and sub-sampling  379
  5.16 Frequently Asked Questions (FAQ)  381
    5.16.1 What does the F-statistic mean?  381
    5.16.2 What is a test statistic - how is it used?  381
    5.16.3 What is MSE?  381
    5.16.4 Power - various questions  382
      What is meant by detecting half the difference?  382
      Do we use the std dev, the std error, or root MSE in the power computations?  382
      Retrospective power analysis; how is this different from regular (i.e., prospective) power analysis?  382
      What does power tell us?  383
      When to use retrospective and prospective power?  383
      When should power be reported  383
      What is done with the “total sample size” reported by JMP?  384
    5.16.5 How to compare treatments to a single control?  384
    5.16.6 Experimental unit vs. observational unit  384
    5.16.7 Effects of analysis not matching design  385
  5.17 Table: Sample size determination for a two-sample t-test  388
  5.18 Table: Sample size determination for a single factor, fixed effects, CRD  390
  5.19 Scientific papers illustrating the methods of this chapter  393
    5.19.1 Injury scores when trapping coyote with different trap designs  393
6 Single factor - pairing and blocking  395
  6.1 Introduction  396
  6.2 Randomization protocol  399
    6.2.1 Some examples of several types of block designs  399
      Completely randomized design - no blocking  400
      Randomized complete block design - RCB design  400
      Randomized complete block design - RCB design - missing values  401
      Incomplete block design - not an RCB  401
      Generalized randomized complete block design  402
  6.3 Assumptions  403
    6.3.1 Does the analysis match the design?  403
    6.3.2 Additivity between blocks and treatments  404
    6.3.3 No outliers should be present  406
    6.3.4 Equal treatment group standard deviations?  406
    6.3.5 Are the errors normally distributed?  407
    6.3.6 Are the errors independent?  408
  6.4 Comparing two means in a paired design - the Paired t-test  408
  6.5 Example - effect of stream slope upon fish abundance  409
    6.5.1 Introduction and survey protocol  409
    6.5.2 Using a Differences analysis  412
    6.5.3 Using a Matched paired analysis  415
    6.5.4 Using a General Modeling analysis  417
    6.5.5 Which analysis to choose?  420
    6.5.6 Comments about the original paper  420
  6.6 Example - Quality check on two laboratories  421
  6.7 Example - Comparing two varieties of barley  427
  6.8 Example - Comparing prep of mosaic virus  432
  6.9 Example - Comparing turbidity at two sites  437
    6.9.1 Introduction and survey protocol  437
    6.9.2 Using a Differences analysis  439
    6.9.3 Using a Matched paired analysis  442
    6.9.4 Using a General Modeling analysis  443
    6.9.5 Which analysis to choose?  446
  6.10 Power and sample size determination  447
  6.11 Single Factor - Randomized Complete Block (RCB) Design  449
    6.11.1 Introduction  449
    6.11.2 The potato-peeling experiment - revisited  449
    6.11.3 An agricultural example  450
    6.11.4 Basic idea of the analysis  451
  6.12 Example - Comparing effects of salinity in soil  453
    6.12.1 Model building - fitting a linear model  455
  6.13 Example - Comparing different herbicides  461
  6.14 Example - Comparing turbidity at several sites  468
  6.15 Power and Sample Size in RCBs  474
  6.16 Example - BPK: Blood pressure at presyncope  476
    6.16.1 Experimental protocol  476
    6.16.2 Analysis  479
    6.16.3 Power and sample size  484
  6.17 Final notes  487
  6.18 Frequently Asked Questions (FAQ)  488
    6.18.1 Difference between pairing and confounding  488
    6.18.2 What is the difference between a paired design and an RCB design?  489
    6.18.3 What is the difference between a paired t-test and a two-sample t-test?  489
    6.18.4 Power in RCB/matched pair design - what is root MSE?  490
    6.18.5 Testing for block effects  490
    6.18.6 Presenting results for blocked experiment  491
    6.18.7 What is a marginal mean?  491
    6.18.8 Multiple experimental units within a block?  492
    6.18.9 How does a block differ from a cluster?  492
7 Incomplete block designs  493
  7.1 Introduction  493
  7.2 Example: Investigate differences in water quality  494

8 Estimating an overall mean with subsampling  501
  8.1 Average flagellum length  502
    8.1.1 Average-of-averages approach  504
    8.1.2 Using the raw measurements  508
    8.1.3 Followup  512
9 Single Factor - Sub-sampling and pseudo-replication  513
  9.1 Introduction  514
  9.2 Example - Fat levels in fish - balanced data in a CRD  514
    9.2.1 Analysis based on sample means  516
    9.2.2 Analysis using individual values  519
  9.3 Example - fat levels in fish - unbalanced data in a CRD  524
  9.4 Example - Effect of UV radiation - balanced data in RCB  525
    9.4.1 Analysis on sample means  528
    9.4.2 Analysis using individual values  531
  9.5 Example - Monitoring Fry Levels - unbalanced data with sampling over time  535
    9.5.1 Some preliminary plots  538
    9.5.2 Approximate analysis of means  540
    9.5.3 Analysis of raw data  544
    9.5.4 Planning for future experiments  545
  9.6 Example - comparing mean flagella lengths  547
    9.6.1 Average-of-averages approach  550
    9.6.2 Analysis on individual measurements  562
    9.6.3 Followup  568
  9.7 Final Notes  568
10 Two Factor Designs - Single-sized Experimental units - CR and RCB designs  569
  10.1 Introduction  570
    10.1.1 Treatment structure  571
      Why factorial designs?  572
      Why not factorial designs?  573
      Displaying and interpreting treatment effects - profile plots  573
    10.1.2 Experimental unit structure  579
    10.1.3 Randomization structure  581
    10.1.4 Putting the three structures together  582
    10.1.5 Balance  582
    10.1.6 Fixed or random effects  583
    10.1.7 Assumptions  584
    10.1.8 General comments  585
  10.2 Example - Effect of photo-period and temperature on gonadosomatic index - CRD  586
    10.2.1 Design issues  587
    10.2.2 Preliminary summary statistics  588
    10.2.3 The statistical model  593
    10.2.4 Fitting the model  594
    10.2.5 Hypothesis testing and estimation  595
  10.3 Example - Effect of sex and species upon chemical uptake - CRD  603
    10.3.1 Design issues  605
    10.3.2 Preliminary summary statistics  606
    10.3.3 The statistical model  609
© 2012 Carl James Schwarz, December 21, 2012
    10.3.4 Fitting the model  609
  10.4 Power and sample size for two-factor CRD  619
  10.5 Unbalanced data - Introduction  624
  10.6 Example - Stream residence time - Unbalanced data in a CRD  626
      Design issues  627
    10.6.1 Preliminary summary statistics  628
    10.6.2 The Statistical Model  629
    10.6.3 Hypothesis testing and estimation  631
    10.6.4 Power and sample size  641
  10.7 Example - Energy consumption in pocket mice - Unbalanced data in a CRD  641
    10.7.1 Design issues  643
    10.7.2 Preliminary summary statistics  643
    10.7.3 The statistical model  646
    10.7.4 Fitting the model  646
    10.7.5 Hypothesis testing and estimation  648
    10.7.6 Adjusting for unequal variances?  656
  10.8 Example: Use-Dependent Inactivation in Sodium Channel Beta Subunit Mutation - BPK  656
    10.8.1 Introduction  656
    10.8.2 Experimental protocol  656
    10.8.3 Analysis  657
  10.9 Blocking in two-factor CRD designs  668
  10.10 FAQ  669
    10.10.1 How to determine sample size in two-factor designs  669
    10.10.2 What is the difference between a 'block' and a 'factor'?  669
    10.10.3 If there is evidence of an interaction, does the analysis stop there?  670
    10.10.4 When should you use raw means or LSmeans?  671

11 SAS CODE NOT DONE  673
12 Two-factor split-plot designs  674
  12.1 Introduction  674
  12.2 Example - Holding your breath at different water temperatures - BPK  675
    12.2.1 Introduction  675
    12.2.2 Standard split-plot analysis  677
    12.2.3 Adjusting for body size  685
    12.2.4 Fitting a regression to temperature  687
    12.2.5 Planning for future studies  691
  12.3 Example - Systolic blood pressure before presyncope - BPK  698
    12.3.1 Experimental protocol  698
    12.3.2 Analysis  701
    12.3.3 Power and sample size determination  707
13 Analysis of BACI experiments  709
  13.1 Introduction  710
  13.2 Before-After Experiments - prelude to BACI designs  714
    13.2.1 Analysis of stream 1 - yearly averages  717
    13.2.2 Analysis of Stream 1 - individual values  719
    13.2.3 Analysis of all streams - yearly averages  721
    13.2.4 Analysis of all streams - individual values  724
  13.3 Simple BACI - One year before/after; one site impact; one site control  726
  13.4 Example: Change in density in crabs near a power plant - one year before/after; one site impact; one site control  727
    13.4.1 Analysis  732
  13.5 Simple BACI design - limitations  737
  13.6 BACI with Multiple sites; One year before/after  737
  13.7 Example: Density of crabs - BACI with Multiple sites; One year before/after  739
    13.7.1 Converting to an analysis of differences  741
    13.7.2 Using ANOVA on the averages  744
    13.7.3 Using ANOVA on the raw data  748
    13.7.4 Model assessment  751
  13.8 BACI with Multiple sites; Multiple years before/after  752
  13.9 Example: Counting fish - Multiple years before/after; One site impact; one site control  754
    13.9.1 Analysis of the differences  757
    13.9.2 ANOVA on the raw data  761
    13.9.3 Model assessment  763
  13.10 Example: Counting chironomids - Paired BACI - Multiple-years B/A; One Site I/C  764
    13.10.1 Analysis of the differences  766
    13.10.2 ANOVA on the raw data  768
    13.10.3 Model assessment  771
  13.11 Example: Fry monitoring - BACI with Multiple sites; Multiple years before/after  771
    13.11.1 A brief digression  772
    13.11.2 Some preliminary plots  775
    13.11.3 Analysis of the averages  779
    13.11.4 Analysis of the raw data  783
    13.11.5 Power analysis  786
  13.12 Closing remarks about the analysis of BACI designs  787
  13.13 BACI designs power analysis and sample size determination  788
    13.13.1 Introduction  788
    13.13.2 Power: Before-After design  791
      Single Location studies  792
      Multiple Location studies  794
    13.13.3 Power: Simple BACI design - one site control/impact; one year before/after; independent samples  798
    13.13.4 Power: Multiple sites in control/impact; one year before/after; independent samples  803
    13.13.5 Power: One site in control/impact; multiple years before/after; no subsampling  808
    13.13.6 Power: General BACI: Multiple sites in control/impact; multiple years before/after; subsampling  811

14 Comparing proportions - Chi-square (χ2) tests  814
  14.1 Introduction  815
  14.2 Response variables vs. Frequency Variables  816
  14.3 Overview  818
  14.4 Single sample surveys - comparing to a known standard  820
    14.4.1 Resource selection - comparison to known habitat proportions  820
    14.4.2 Example: Homicide and Seasons  826
  14.5 Comparing sets of proportions - single factor CRD designs  830
    14.5.1 Example: Elk habitat usage - Random selection of points  830
    14.5.2 Example: Ownership and viability  834
    14.5.3 Example: Sex and Automobile Styling  839
    14.5.4 Example: Marijuana use in college  843
    14.5.5 Example: Outcome vs. cause of accident  847
    14.5.6 Example: Activity times of birds  851
  14.6 Pseudo-replication - Combining tables  853
  14.7 Simpson's Paradox - Combining tables  857
    14.7.1 Example: Sex bias in admissions  857
    14.7.2 Example: Twenty-year survival and smoking status  858
  14.8 More complex designs  859
  14.9 Final notes  859
  14.10 Appendix - how the test statistic is computed  860
  14.11 Fisher's Exact Test  862
    14.11.1 Sampling Protocol  864
    14.11.2 Hypothesis  864
    14.11.3 Computation  864
    14.11.4 Example: Relationship between Aspirin Use and MI  867
      Mechanics of the test  868
    14.11.5 Avoidance of cane toads by Northern Quolls  870
15 Correlation and simple linear regression  878
  15.1 Introduction  879
  15.2 Graphical displays  880
    15.2.1 Scatterplots  880
    15.2.2 Smoothers  882
  15.3 Correlation  886
    15.3.1 Scatter-plot matrix  887
    15.3.2 Correlation coefficient  889
    15.3.3 Cautions  891
    15.3.4 Principles of Causation  893
  15.4 Single-variable regression  895
    15.4.1 Introduction  895
    15.4.2 Equation for a line - getting notation straight (no pun intended)  895
    15.4.3 Populations and samples  896
    15.4.4 Assumptions  897
      Linearity  897
      Correct scale of predictor and response  897
      Correct sampling scheme  897
      No outliers or influential points  898
      Equal variation along the line  898
      Independence  898
      Normality of errors  899
      X measured without error  899
    15.4.5 Obtaining Estimates  900
    15.4.6 Obtaining Predictions  902
    15.4.7 Residual Plots  903
    15.4.8 Example - Yield and fertilizer  903
    15.4.9 Example - Mercury pollution  914
    15.4.10 Example - The Anscombe Data Set  923
    15.4.11 Transformations  924
    15.4.12 Example: Monitoring Dioxins - transformation  925
    15.4.13 Example: Weight-length relationships - transformation  937
      A non-linear fit  945
    15.4.14 Power/Sample Size  946
    15.4.15 The perils of R2  947
  15.5 A no-intercept model: Fulton's Condition Factor K  950
  15.6 Frequently Asked Questions - FAQ  957
    15.6.1 Do I need a random sample; power analysis  957
16 SAS CODE NOT DONE  959

17 SAS CODE NOT DONE  960

18 SAS CODE NOT DONE  961
19 Estimating power/sample size using Program Monitor  962
  19.1 Mechanics of MONITOR  963
  19.2 How does MONITOR work?  972
  19.3 Incorporating process and sampling error  977
  19.4 Presence/Absence Data  986
  19.5 WARNING about using testing for temporal trends  989
20 SAS CODE NOT DONE  991

21 SAS CODE NOT DONE  992

22 SAS CODE NOT DONE  993
23 Logistic Regression - Advanced Topics  994
  23.1 Introduction  994
  23.2 Sacrificial pseudo-replication  995
  23.3 Example: Fox-proofing mice colonies  996
    23.3.1 Using the simple proportions as data  997
    23.3.2 Logistic regression using overdispersion  999
    23.3.3 GLIMM modeling the random effect of colony  1000
  23.4 Example: Over-dispersed Seeds Germination Data  1002

24 SAS CODE NOT DONE  1010
25 A short primer on residual plots  1011
  25.1 Linear Regression  1012
  25.2 ANOVA residual plots  1013
  25.3 Logistic Regression residual plots - Part I  1015
  25.4 Logistic Regression residual plots - Part II  1016
  25.5 Poisson Regression residual plots - Part I  1017
  25.6 Poisson Regression residual plots - Part II  1019

26 SAS CODE NOT DONE  1021
27 Tables  1022
  27.1 A table of uniform random digits  1022
  27.2 Selected Binomial individual probabilities  1026
  27.3 Selected Poisson individual probabilities  1034
  27.4 Cumulative probability for the Standard Normal Distribution  1037
  27.5 Selected percentiles from the t-distribution  1039
  27.6 Selected percentiles from the chi-squared-distribution  1040
  27.7 Sample size determination for a two sample t-test  1041
  27.8 Power determination for a two sample t-test  1043
  27.9 Sample size determination for a single factor, fixed effects, CRD  1045
  27.10 Power determination for a single factor, fixed effects, CRD  1049
28 THE END!  1053
  28.1 Statisfaction - with apologies to Jagger/Richards  1053
  28.2 ANOVA Man - with apologies to Lennon/McCartney  1055

29 An overview of environmental field studies  1057
  29.1 Introduction  1058
    29.1.1 Survey Methods  1065
      Simple Random Sampling  1065
      Systematic Surveys  1067
      Cluster sampling  1070
      Multi-stage sampling  1074
      Multi-phase designs  1076
      Summary comparison of designs  1078
    29.1.2 Permanent or temporary monitoring stations  1079
    29.1.3 Refinements that affect precision  1080
      Stratification  1080
      Auxiliary variables  1083
      Sampling with unequal probability  1083
    29.1.4 Sample size determination  1084
  29.2 Analytical surveys  1084
  29.3 Impact Studies  1087
    29.3.1 Before/After contrasts at a single site  1088
    29.3.2 Repeated before/after sampling at a single site  1088
    29.3.3 BACI: Before/After and Control/Impact Surveys  1089
    29.3.4 BACI-P: Before/After and Control/Impact - Paired designs  1092
    29.3.5 Enhanced BACI-P: Designs to detect acute vs. chronic effects or to detect changes in variation as well as changes in the mean  1094
    29.3.6 Designs for multiple impacts spread over time  1096
    29.3.7 Accidental Impacts  1099
  29.4 Conclusion  1108
  29.5 References  1110
  29.6 Selected journal articles  1112
    29.6.1 Designing Environmental Field Studies  1112
    29.6.2 Beyond BACI  1113
    29.6.3 Environmental impact assessment  1113
  29.7 Examples of studies for discussion - good exam questions!  1114
    29.7.1 Effect of burn upon salamanders  1114
Chapter 1
In the beginning...

Contents
  1.1 Introduction  15
  1.2 Effective note taking strategies  16
  1.3 It's all Γρκ to me  19
  1.4 Which computer package?  19
  1.5 FAQ - Frequently Asked Questions  28
    1.5.1 Accessing journal articles from home  28
    1.5.2 Downloading from the web  28
    1.5.3 Printing 2 pages per physical page and on both sides of the paper  28
    1.5.4 Is there an on-line textbook?  29

1.1 Introduction
To many students, statistics is synonymous with sadistics and is not a subject that is "enjoyable". Obviously, I think this view is mistaken and hope to present some of the interesting things that can be done with statistics. Statistics is all about discovery - how to extract information in the face of uncertainty.

In the past, learning about statistics was tedious because of the enormous amount of arithmetic that needed to be done. Now we let the computer do the heavy lifting, but it is now vitally important that you UNDERSTAND what a computer package is doing - after all, these computer packages don't have a conceptual understanding of what the data are about. They will quite happily compute the average sex (where 0 codes males and 1 codes females) - only you can decide that this is a meaningless statistic to compute.

These notes try to operate at a conceptual level. There are many examples which show how a typical analysis might be performed using a statistical package. There is often no unique answer to a problem, with several good alternatives always available, so don't let my notes constrain your thinking.

Statistics is fun! Just ask my family.
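The "average sex" point can be made concrete with a short sketch. The snippet below is illustrative only - it uses Python rather than the SAS/JMP packages discussed in these notes, and the data are made up:

```python
# Hypothetical data: sex coded numerically, 0 = male, 1 = female.
sex = [0, 1, 1, 0, 1]

# Any package will happily "average" these codes without complaint.
average_sex = sum(sex) / len(sex)
print(average_sex)  # prints 0.6
```

The computer dutifully reports an "average sex" of 0.6. At best this number is the proportion of females in the sample; whether that is a sensible summary of the data is a decision only the analyst, not the software, can make.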
1.2 Effective note taking strategies
This section is taken from the Tomorrow's Professor Listserv - an email list server on topics of general interest to higher education, hosted at Stanford University at http://sll.stanford.edu/.

It will soon become apparent, if it hasn't already, that not all lectures are fascinating and stimulating, and that not all lecturers are born with a gift for public speaking. However, the information and ideas that they are trying to impart are just as important, and any notes that you take in the lectures must be understandable to you, not only five minutes after the lecture has finished,
but in several months' time, when you come to revise from them. The question, then, is how to retain your concentration and produce a good set of notes.

There are a few misconceptions on the part of students as to what can be expected of a lecture session. Firstly, that the responsibility for the success of the lecture is entirely the instructor's, and that the student's role is to sit and listen or to take verbatim notes. Secondly, that the purpose of a lecture is to impart information which will be needed for an exam question. And thirdly, that attending the lecture, and taking notes, is an individual, even competitive, activity. This page aims to correct these ideas, and to help you develop successful note-taking strategies.
BEFORE THE LECTURE

• If you know the subject of the lecture, do some background reading beforehand. This way, you will go into the lecture with a better understanding, and find it easier to distinguish the points worth noting.

• Read through the notes of the previous lecture in the series just before the present one begins. This helps orient your thoughts to the subject in hand, especially if you have just come from a lecture on a completely different subject.

DURING THE LECTURE

• Think of a lecture as an active, learning process, rather than a passive, secretarial exercise. Writing verbatim what the lecturer says, or copying everything down from overheads, does not involve much thought, and subsequent reading of these notes often makes little sense.

• Pages of continuous prose are the least helpful for revision. Some things said in the lecture are obviously more important than others, and the notes you take should reflect this. Try to give them some structure, by using headings and sub-headings, by HIGHLIGHTING or underlining key ideas and realizing the links between them. Alternative noting forms to linear notes, such as flow diagrams or star charts, can be used, although these are often more helpful to revise your notes (see After the Lecture).

• In some situations, you may be directed in the amount of note-taking necessary. For example, the lecturer may start off by giving you some references on the subject he/she is to lecture on. A good strategy to adopt in this case would be to note down carefully the references, then just LISTEN to the lecture, making brief notes about the main points or specific examples. Taking notes from books is far easier than in lectures, as you can go at your own pace, stop and think about something, re-read a section etc. Use the lecture to try and understand the concepts being explained.

• Or, the lecturer may give hand-outs to accompany the lecture. In this case, you don't need to make copious notes of your own.
Again, listen to what is being said, and annotate the hand-out with any extra information. It gives you more time to think, and perhaps raise questions of your own.

• On the subject of questions, it is commonly believed by students that lecturers are not to be interrupted when they are in full flow. You may find that this isn't always the case, and there is nothing wrong
with asking individual lecturers if they mind taking questions during the lecture. It is best to establish this at the beginning of the course of lectures.

• However, there is also the problem of speaking out in front of your peers, perhaps asking something foolish, or not having the time to frame your question well. In this case, write down the question in the margin of your notes, to ask the lecturer later, or check with friends or in a textbook. It is far easier to recall the question you wanted to ask in this way, rather than rely on remembering after the lecture has finished (or even when you come to revise from your notes!)

AFTER THE LECTURE

• The best time to review your lecture notes is immediately following the lecture, although this is not always possible if, for example, you have to go straight to another one. However, the sooner you do it, with the lecture still fresh in your mind, the better chance you have to produce a good set of notes for revision.

• Revising your notes does not mean writing them out neatly!

• Try swapping notes with a friend, to check the accuracy/omissions of your own, and your understanding of the key points.

• If you feel that your notes are incomplete, or if you jotted down any questions during the lecture, follow this up by asking your tutor, or by reading round the subject.

• Transforming your lecture notes by using a different noting form can sometimes make them clearer, e.g. a flow diagram.

• Highlight key points; produce summaries for revision purposes.

• Think how this topic relates to previous ones, or to other courses that you are studying, and begin to recognize themes and relationships.

• Meet with a few friends after lectures, to discuss the lecture topic and answer each other's questions. Discussion with your peers often leads to a better understanding of a subject, which in turn makes it easier to remember.
Your group could also establish a reading syndicate, whereby reading lists can be divided between members, who each take notes on their allotted texts and give copies to the rest of the group.

STORING YOUR NOTES

• A little time spent at this stage in organizing your notes will make life much easier when you come to revise from them some months later.

• Numbering pages, making a contents page, or using dividers in your file will all make your notes more accessible.
1.3 It's all Γρκ to me
There are several common Greek letters and special symbols that are used in Statistics. In this section, we illustrate the common Greek letters and notation used. Check that you can read the following symbols and small equations:

• α - the Greek letter alpha
• β - the Greek letter beta
• λ - the Greek letter lambda
• µ - the Greek letter mu (looks like a 'u' with a tail in front)
• σ - the Greek letter sigma (looks like an 'o' with a top line)
• X^2, X_2 - an X with a superscript 2 and then an X with a subscript 2
• Ȳ - Y-bar, a Y with a bar across the top
• Ŷ - Y-hat, a Y with a circumflex over it
• x/y - a fraction x over y, printed in vertical format
• Z = (X − µ)/σ - an equation with X − µ stacked above the Greek letter sigma
• √n - the square root of n
• p̂ - p-hat, a p with a circumflex over it
• ≠ - a not-equal sign
• ± - a plus/minus sign
• × - a multiplication sign
• ∑_{i=1}^{n} something - a summation of something with the index i ranging from 1 to n
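For readers writing up their own work, all of the symbols above can be produced with standard LaTeX math commands. The snippet below is a sketch of one way to typeset them (the exact macros used in the original source of these notes are an assumption):

```latex
% Illustrative only: standard math-mode commands for the symbols above.
\[
  \alpha,\ \beta,\ \lambda,\ \mu,\ \sigma, \qquad
  X^2,\ X_2, \qquad \bar{Y},\ \hat{Y}, \qquad \frac{x}{y}, \qquad
  Z = \frac{X-\mu}{\sigma}, \qquad \sqrt{n}, \qquad \hat{p}, \qquad
  \neq,\ \pm,\ \times, \qquad \sum_{i=1}^{n} \text{something}
\]
```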
1.4 Which computer package?
Modern Statistics relies heavily upon computing - many would say that modern statistical methods would be infeasible without modern software. Rather than spending time on tedious arithmetic, or on trying to reinvent the wheel, many people rely upon modern statistical packages. Here are some of the most common packages in use today.
• SAS. Available in Windoze and Unix flavours. Modern Macintoshes with the Intel chip can run SAS under Windoze.[1] One of the best packages around. SAS can handle nearly any type of data (dates, times, characters, numbers) with many possible analyses (over 100 different base analyses are currently available) and allows virtually arbitrary input formats and structures. SAS is extremely flexible and powerful but has a very steep learning curve. This is the premier statistical package - virtually all statistical analyses can be done with SAS. This is the package that I, as a Professional Statistician,[2] use regularly in my job. Not only does SAS have modern statistical procedures, but it is also a premier database management system. It is designed for heavy duty computing. Refer to http://www.sas.com for more details on this package. The SAS program includes a module SAS/INSIGHT that is virtually identical to JMP (see below).

• SPSS. This is a fairly powerful package (but not nearly as broad as SAS). It is very popular with Social Sciences researchers, but I personally prefer SAS. Refer to http://www.spss.com for more details on this package.

• JMP. JMP was originally developed by John Sall, one of the two SAS founders (who were the 68th and 138th richest people in USA/NA in 2003). He developed it originally for the Macintosh platform and called it John's Macintosh Product, ergo the name JMP. JMP runs on Macintosh, Linux, and Windoze platforms. JMP is an easy to use and fairly powerful package. You should be able to do most things in this course in JMP. JMP does not have the range of procedures of SAS, nor can it deal with as complex data structures. However, my guesstimate is that most people can do 80% of their statistical computing using JMP. Refer to http://www.jmp.com for more details on this package.

• SYSTAT.
This package has good graphical procedures, a fairly wide range of statistical procedures, but the package is showing its age. I find SYSTAT clumsy compared to using JMP and SAS and everytime I use it, I quickly get frustrated by its limitations and clumsy operations. A review of SYSTAT is available in Hilbe, J. M. (2008). Systat 12.2: An overview American Statistician, 62, 177-178 http://dx.doi.org/10.1198/000313008x299339 Refer to http://www.systat.com for more details. • STATA I have never used STATA but a nice review of the package is found in Hilbe, J.M. (2005). A review of Stata 9.0. The American Statistician, 59, 335-348. 1 There
is a VERY old version of SAS that runs under older Macintoshes. This is a very old version and should not be used. Statistical Society of Canada has undertaken a program to accredit statisticians in Canada. Visit http://www.ssc.ca for more details. Yours truly proudly bears the title P.Stat. 007 2 The
According to this review, Stata would be of interest to biostatisticians, medical/health outcomes, econometric, and social science research.

• S-PLUS/R. S-PLUS and R are based on the S programming language. As the name implies, S-PLUS is an extended version of S with a nice graphical interface. R is a freeware version of S-PLUS with basically the same functionality and can be freely downloaded from the WWW. It does not have the nice graphical interface. These packages are commonly used by statisticians when developing new statistical procedures. They are very flexible, but require a somewhat steep learning curve. Refer to http://www.insightful.com/ for information on S-PLUS and http://www.r-project.org/ for information about R.

• Excel. This is the standard spreadsheet program in the MS Office Suite. Excel comes with some basic statistics but nothing too fancy. While Excel has its uses, you will quickly find that it can't handle more complex analysis situations and gives wrong results without warning! Except for very simple statistics, I RECOMMEND AGAINST the use of Excel to do statistical analyses. People are wedded to Excel for often spurious reasons:

– "It's free." So is R, and you get a much superior product.
– "It is easy to use." Yes, and easy to get WRONG answers without warnings.
– "It has good graphs." Excel has the largest selection of BAD graphs in the world. Hardly any of them are useful!

The following articles discuss some of the problems with Excel. They can also be accessed directly from the web by clicking on their respective links.

– J. Cryer from the University of Iowa discusses some of the problems with using Excel for analyzing data at http://www.cs.uiowa.edu/~jcryer/JSMTalk2001.pdf.
– Yet more problems with Excel are discussed in: Practical Stats with Excel?, available at http://www.practicalstats.com/xlsstats/excelstats.html, a copy of which is included in these notes.
– Yet more problems with Excel: Using Excel in Statistics?, available at http://www.umass.edu/acco/statistics/handout/excel.html.
– An article by the Statistical Consulting Service at the University of Reading has a brief discussion of the pros and cons of using Excel for analyzing data at http://www.rdg.ac.uk/ssc/software/excel/home.html. Basically, the graphs presented by Excel are often inappropriate for data presentation and you quickly run into limitations of the analysis routines available.
– How to use the basic functions of Excel for Statistics. This page also has a link to a discussion about regression in Excel. Using Excel Functions in Statistics, available at http://physicsnt.clemson.edu/chriso/tutorials/excel/stats.html.
– Spreadsheet Addiction. Some of the problems in Excel and alternatives. Has a long bibliography on the problems with Excel. Very nice summary document - well worth the read. Available at http://www.burns-stat.com/pages/Tutor/spreadsheet_addiction.html.
There are a number of "add ons" available for Excel that seem to be reasonably priced and extend the analyses available. Nevertheless, the algorithms used in Excel to do the actual computations are flawed and can give INCORRECT results without any warning that something has gone wrong! For this reason, I generally use Excel only for simple problems - for anything more complex than a simple mean, I reach for a package such as JMP or SAS. Friends don't let friends do statistics in Excel!
Statistics With Excel
10/27/08 10:48
Practical Stats - Make sense of your data!

Is Microsoft Excel an Adequate Statistics Package?
It depends on what you want to do, but for many tasks, the answer is ‘No’. Excel is available to many people as part of Microsoft Office. It contains some statistical functions in its basic installation. It also comes with statistical routines in the Data Analysis Toolpak, an add-in found separately on the Office CD. You must install the Toolpak from the CD in order to get these routines on the Tools menu. Once installed, these routines are at the bottom of the Tools menu, in the "Data Analysis" command. People use Excel as their everyday statistics software because they have already purchased it. Excel’s limitations, and occasionally its errors, make this a problem. Below are some of the concerns with using Excel for statistics that are recorded in journals, on the web, and from personal experience. Limitations of Excel
1. Many statistical methods are not available in Excel. Excel's biggest problem. Commonly-used statistics and methods NOT available within Excel include:

* Boxplots
* p-values for the correlation coefficient
* Spearman's and Kendall's rank correlation coefficients
* 2-way ANOVA with unequal sample sizes (unbalanced data)
* Multiple comparison tests (post-hoc tests following ANOVA)
* p-values for two-way ANOVA
* Levene's test for equal variance
* Nonparametric tests, including rank-sum and Kruskal-Wallis
* Probability plots
* Scatterplot arrays or brushing
* Principal components or other multivariate methods
* GLM (generalized linear models)
* Survival analysis methods
* Regression diagnostics, such as Mallows' Cp and PRESS (it does compute adjusted r-squared)
* Durbin-Watson test for serial correlation
* LOESS smooths

Excel's lack of functionality makes it difficult to use for more than computing summary statistics and simple univariate regression. Third-party add-ins to Excel attempt to compensate for these limitations, adding new functionality to the program (see "A Partial Solution", below).

2. Several Excel procedures are misleading.

Probability plots are a standard way of judging the adequacy of the normality assumption in regression. In statistics packages, residuals from the regression are easily, or in some cases automatically, plotted on a normal probability plot. Excel's regression routine provides a Normal Probability Plot option. However, it produces a probability plot of the Y variable, not of the residuals, as would be expected.

Excel's CONFIDENCE function computes z intervals using 1.96 for a 95% interval. This is valid only if the population variance is known, which is never true for experimental data. Confidence intervals computed using this function on sample data will be too small. A t-interval should be used instead.

Excel is inconsistent in the type of p-values it returns. For most functions of probabilities, Excel acts like a lookup table in a textbook, and returns one-sided p-values. But in the TINV function, Excel returns a 2-sided p-value. Look carefully at the documentation of any Excel function you use, to be certain you are getting what you want. Tables of standard distributions such as the normal and t distributions return p-values for tests, or are used to construct confidence intervals.
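The size of the CONFIDENCE-function error can be sketched numerically. This is an illustrative Python snippet, not part of the original article; n = 10 and s = 35 are made-up numbers, and the t multiplier 2.262 for 9 degrees of freedom is taken from a standard table rather than computed.

```python
from statistics import NormalDist

# CONFIDENCE always uses the normal (z) multiplier, appropriate only when the
# population variance is known. With sample data and small n, the correct t
# multiplier is noticeably larger, so the z interval is too short.
s, n = 35, 10                         # hypothetical sample std dev and size
z_mult = NormalDist().inv_cdf(0.975)  # 1.96: the multiplier CONFIDENCE uses
t_mult = 2.262                        # t_{0.025, 9} from a standard table

z_margin = z_mult * s / n ** 0.5      # CONFIDENCE-style half-width: too short
t_margin = t_mult * s / n ** 0.5      # proper t-interval half-width
print(round(z_margin, 1), round(t_margin, 1))
```

The z-based half-width understates the t-based one by roughly 13% here, and the shortfall grows as the sample size shrinks.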
With Excel, the user must be careful about what is being returned. To compute a 95% t confidence interval around the mean, for example, the standard method is to look up the t-statistic in a textbook by entering the table at a value of alpha/2, or 0.025. This t-statistic is multiplied by the standard error to produce the length of the t-interval on each side of the mean. Half of the error (alpha/2) falls on each side of the mean. In Excel the TINV function is entered using the value of alpha, not alpha/2, to return the same number. For a one-sided t interval at alpha=0.05, standard practice would be to look up the t-statistic in a textbook for alpha=0.05. In Excel, the TINV function must be called using a value of 2*alpha, or 0.10, to get the value for alpha=0.05. This nonstandard entry point has led several reviewers to state that Excel's distribution functions are incorrect. If not incorrect, they are certainly nonstandard. Make sure you read the help menu descriptions carefully to know what each function produces.

3. Distributions are not computed with precision. NEW

In reference (1), the authors show that all problems found in Excel 97 are still there in Excel 2000 and
XP. They say that "Microsoft attempted to fix errors in the standard normal random number generator and the inverse normal function, and in the former case actually made the problem worse." From this, you can assume that the problems listed below are still there in the current versions of the software.

Statistical distributions used by Excel do not agree with better algorithms for those distributions at the third digit and beyond. So they are approximately correct, but not as exact as would be desired by an exacting statistician. This may not be harmful for hypothesis tests unless the third digit is of concern (a p-value of 0.056 versus 0.057). It is of most concern when constructing intervals (multiplying a std dev of 35 times 1.96 gives 68.6; times 1.97 gives 69.0).

As summarized in reference (2): "…the statistical distributions of Excel already have been assessed by Knusel (1998), to which we refer the interested reader. He found numerous defects in the various algorithms used to compute several distributions, including the Normal, Chi-square, F and t, and summarized his results concisely: So one has to warn statisticians against using Excel functions for scientific purposes. The performance of Excel in this area can be judged unsatisfactory."

4. Routines for handling missing data were incorrect.

This was the largest error in Excel, but a 'band-aid' has been added in Office 2000. In earlier versions of Excel, computations and tests were flat out wrong when some of the data cells contained missing values, even for simple summary statistics. See (3), (5), and page 4 of (6). Error messages are now displayed in Excel 2000 when there are missing values, and no result is given. Although this is still inferior to computing correct results, it is somewhat of an improvement.

In reference to pre-2000 versions: "Excel does not calculate the paired t-test correctly when some observations have one of the measurements but not the other." - E. Goldwater, ref. (5)

5.
Regression routines are incorrect for multicollinear data.

This affects multiple regression. A good statistics package will report errors due to correlations among the X variables. The Variance Inflation Factor (VIF) is one measure of collinearity. Excel does not compute collinearity measures, does not warn the user when collinearity is present, and reports parameter estimates that may be nonsensical. See (6) for an example on data from an experiment. Are multicollinear data of concern in 'practical' problems? I think so -- I find many examples of collinearity in environmental data sets.

Excel also requires the X variables to be in contiguous columns in order to input them to the procedure. This can be done with cut and paste, but is certainly annoying if many multiple regression models are to be built.

6. Ranks of tied data are computed incorrectly.

When ranking data, standard practice is to assign tied ranks to tied observations. The value of these ranks should equal the median of the ranks that the observations would have had, if they had not been tied. For example, three observations tied at a value of 14 would have had the ranks of 7, 8 and 9 had they not been tied. Each of the three values should be assigned the rank of 8, the median of 7, 8 and 9. Excel assigns the lowest of the three ranks to all three observations, giving each a rank of 7. This would result in problems if Excel computed rank-based tests. Perhaps it is fortunate none are available.

7. Many of Excel's charts violate standards of good graphics.
Use of perspective and glitz (donut charts?) violates basic principles of graphics. Excel's charts are more suitable to USA Today than to scientific reports. This bothers some people more than others.

"Good graphs should….[a list of traits]…However, Excel meets virtually none of these criteria. The vast majority of chart types produced by Excel should never be used!" -- Jon Cryer, ref (3).

"Microsoft Excel is an example of a package that does not allow enough user control to consistently make readable and concise graphs from tables." - A. Gelman et al., 2002, The American Statistician 56, p. 123.

A partial solution: Some of these difficulties (parts of 1, 2, 6 and 7) can be overcome by using a good set of add-in routines. One of the best is StatPlus, which comes with an excellent textbook, "Data Analysis with Microsoft Excel". With StatPlus, Excel becomes an adequate statistical tool, though still not in the areas of multiple regression and ANOVA for more than one factor. Without this add-in Excel is inadequate for anything beyond basic summary statistics and simple regression. Data Analysis with Microsoft Excel by Berk and Carey, published by Duxbury (2000). Opinion: Get this book if you're going to use Excel for statistics. (I have no connection with the authors of StatPlus and get no benefit from this recommendation. I'm just a satisfied user.)

Some advice from others:

"If you need to perform analysis of variance, avoid using Excel, unless you are dealing with extremely simple problems." - Statistical Services Centre, Univ. of Reading, U.K. (at A, below)

"Enterprises should advise their scientists and professional statisticians not to use Microsoft Excel for substantive statistical analysis. Instead, enterprises should look for professional statistical analysis software certified to pass the (NIST) Statistical Reference Datasets tests to their users' required level of accuracy."
- The Gartner Group

References:

(1) On the accuracy of statistical procedures in Microsoft Excel 2000 and Excel XP. B.D. McCullough and B. Wilson (2002), Computational Statistics & Data Analysis, 40, pp. 713-721.
(2) On the accuracy of statistical procedures in Microsoft Excel '97. B.D. McCullough and B. Wilson (1999), Computational Statistics & Data Analysis, 31, pp. 27-37.
(3) Problems with using Microsoft Excel for statistics. J.D. Cryer (2001), presented at the Joint Statistical Meetings, American Statistical Association, Atlanta, Georgia. [pdf download]
(4) Use of Excel for statistical analysis. Neil Cox (2000), AgResearch Ruakura.
(5) Using Excel for statistical data analysis. Eva Goldwater (1999), Univ. of Massachusetts Office of Information Technology. [pdf download]
(6) Statistical analysis using Microsoft Excel. Jeffrey Simonoff (2002). [pdf download]
(7) Spreadsheet addiction. Patrick Burns.
(8) On the Accuracy of Statistical Distributions in Microsoft Excel 97. Leo Knuesel. [pdf download]
(9) Statistical flaws in Excel. Hans Pottel. [pdf download]

Guides to Excel on the web:

(A) A Beginner's Guide to Excel - Univ. of Reading, UK
(B) An Intermediate Guide to Excel - Univ. of Reading, UK

Note: All opinions other than those cited as coming from others are my own.

© 2007 Practical Stats
1.5 FAQ - Frequently Asked Question

1.5.1 Accessing journal articles from home
I make reference to several journal articles in these notes. These are often available in e-journals, so the link should take you there directly IF you are authorized to access the journal. For example, if you try to access these articles from a computer with an SFU IP address, you will likely be granted permission without problems. However, if you are trying to access them from home, you must go through the SFU Library site and access the e-journal via the catalogue. You will then be prompted to enter your SFU ACS userid and password to grant you access to the journal.
1.5.2 Downloading from the web

Whenever I try to download an Excel file, it seems to be corrupted and can't be opened.
Throughout the notes, reference is made to spreadsheets or SAS programs available at my web site. The SAS programs and listings are simple text files and should transfer to your computer without much problem. If you are trying to download an Excel spreadsheet, be sure to specify that the file should be transferred as a source document rather than as text. If you transfer the sheets as text, you will find that the data are corrupted.
1.5.3 Printing 2 pages per physical page and on both sides of the paper

The notes look as if I could print 2 per page. Is this possible, and can I print on both sides of the paper?
Yes, it is possible to print two logical pages per physical page - the text is a bit small, but still readable. On a Macintosh system with a recent OS, when you select Print, it presents the standard print options menu. Under the pop-down menu is a Layout option. Select 2 logical pages per physical page. This will work with ALL applications that use the standard print dialogue. I'm not familiar enough with Windoze machines to offer any advice.

To print on both sides of the paper, you need a printer capable of duplex printing, i.e. printing on both sides of the paper. I believe that most printers in the public areas of campus are capable of this. You will have to consult
your own printer manual if you are printing at home. Otherwise, you have to print the odd pages first, then take the paper, reverse it, and print the even pages - a recipe for disaster.
1.5.4 Is there an on-line textbook?

Are there any online textbooks in statistics?
Yes, there are several - it is easiest to search the web using Google. Beware that some of the advice on the web may be less than perfect. StatSoft has a highly regarded statistical online textbook at http://www.statsoft.com/textbook/stathome.html.
Contents

1.1 Introduction . . . 15
1.2 Effective note taking strategies . . . 16
1.3 It's all Γρκ to me . . . 19
1.4 Which computer package? . . . 19
1.5 FAQ - Frequently Asked Question . . . 28
1.5.1 Accessing journal articles from home . . . 28
1.5.2 Downloading from the web . . . 28
1.5.3 Printing 2 pages per physical page and on both sides of the paper . . . 28
1.5.4 Is there an on-line textbook? . . . 29
Chapter 2
Introduction to Statistics

Statistics was spawned by the information age, and has been defined as the science of extracting information from data. Technological developments have demanded methodology for the efficient extraction of reliable statistics from complex databases. As a result, Statistics has become one of the most pervasive of all disciplines.

Theoretical statisticians are largely concerned with developing methods for solving the problems involved in such a process, for example, finding new methods for analyzing (making sense of) types of data that existing methods cannot handle. Applied statisticians collaborate with specialists in other fields in applying existing methodologies to real world problems. In fact, most statisticians are involved in both of these activities to a greater or lesser extent, and researchers in most quantitative fields of enquiry spend a great deal of their time doing applied statistics.

The public and private sectors rely on statistical information for such purposes as decision making, regulation, control and planning. Ordinary citizens are exposed to many 'statistics' on a daily basis. For example:

• "In a poll of 1089 Canadians, 47% were in favor of the constitutional accord. This result is accurate to within 3 percentage points, 19 times out of 20."

• "The seasonally adjusted unemployment rate in Canada was 9.3%."

• "Two out of three dentists recommend Crest."

What does this all mean? Our goal is not to make each student a 'professional statistician', but rather to give each student a subset of tools with which they can confidently approach many real world problems and make sense of the numbers.
CHAPTER 2. INTRODUCTION TO STATISTICS
2.1 TRRGET - An overview of statistical inference
Section summary:

1. Distinguish between a population and a sample
2. Why it is important to choose a probability sample
3. Distinguish among the roles of randomization, replication, and blocking
4. Distinguish between an 'estimate' or a 'statistic' and the 'parameter' of interest

Most studies can be broadly classified into either surveys or experiments. In surveys, the researcher is typically interested in describing some population - there is usually no attempt to manipulate units within the population. In experiments, units from the population are manipulated in some fashion and a response to the manipulation is observed.

There are four broad phases to the survey or the experiment. These phases define the paradigm of Statistical Inference. They will be illustrated in the context of a political poll of Canadians on some issue, as illustrated in the following diagram.
The four phases are:

1. What is the population of interest and what is the parameter of interest? This formulates the research question - what is being measured and what is of interest. In this case, the population of interest is likely all eligible voters in Canada and the parameter of interest is the proportion of all eligible voters in favor of the accord. It is conceivable, but certainly impractical, that every eligible voter could be contacted and their opinion recorded. You would then know the value of the parameter exactly and there would be no need to do any statistics. However, in most real world situations, it is impossible or infeasible to measure every unit in the population. Consequently, a sample is taken.

2. Selecting a sample. We would like our sample to be as representative as possible - how is this achieved? We would like our answer from our sample to be as precise as possible - how is this achieved? And we may like to modify our sample selection method to take into account known divisions of the population - how is this achieved? Three fundamental principles of Statistics are randomization, replication and blocking.

Randomization. This is the most important aspect of experimental design and surveys. Randomization "makes" the sample representative of the population by ensuring that, on average, the proportion of sample units with any given attribute is about equal to the proportion found in the population. If an experiment is not randomized or a survey is not randomly collected, it rarely (if ever) provides useful information. Many people confuse 'random' with 'haphazard'. The latter only means that the sample was collected without a plan or thought to ensure that the sample obtained is representative of the population. A truly 'random' sample takes a surprising amount of effort to collect! E.g. the Gallup poll uses random digit dialing to select at random from all households in Canada with a phone. Is this representative of the entire voting population? How does the Gallup Poll account for the different patterns of telephone use among genders within a household?

A simple random sample is an example of an equal probability sample, where every unit in the population has an equal chance of being selected for the sample. As you will see later in the notes, the assumption of equal probability of selection is not crucial. What is crucial is that every unit in the population have a known probability of selection, but this probability could vary among units. For example, you may decide to sample males with a higher probability than females.

Replication = Sample Size. This ensures that the results from the experiment or the survey will be precise enough to be of use. A large sample size does not imply that the sample is representative - only randomization ensures representativeness. Do not confuse replication with repeating the survey a second time. In this example, the Gallup poll interviews about 1100 Canadians. It chooses this number of people to get a certain precision in the results.
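The link between sample size and precision can be illustrated with a small simulation. This is a Python sketch with entirely hypothetical numbers (the notes themselves use SAS and JMP); it draws repeated simple random samples from a made-up population of voters and watches how the spread of the sample proportion shrinks as the sample size grows.

```python
import random
import statistics

# Hypothetical population: 100,000 voters, 47% in favour (made-up numbers).
random.seed(42)
population = [1] * 47_000 + [0] * 53_000

def sampling_spread(n, reps=500):
    """Std deviation of the sample proportion over repeated random samples of size n."""
    props = [sum(random.sample(population, n)) / n for _ in range(reps)]
    return statistics.stdev(props)

se_100 = sampling_spread(100)    # roughly 0.05
se_1100 = sampling_spread(1100)  # roughly 0.015: a Gallup-sized sample
print(round(se_100, 3), round(se_1100, 3))
```

The Gallup-sized sample of about 1100 pins the proportion down to within about 1.5 percentage points (one standard error), which is what makes the familiar "±3 points" (two standard errors) statement possible; a sample of 100 is more than three times noisier.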
Blocking (or stratification). In some experiments or surveys, the researcher knows of a variable that strongly influences the response. In the context of this example, there is a strong relationship between the region of the country and the response. Consequently, precision can be improved by first blocking or stratifying the population into more homogeneous groups. Then a separate randomized survey is done in each and every stratum and the results are combined together at the end. In this example, the Gallup poll often stratifies the survey by region of Canada. Within each region of Canada, a separate randomized survey is performed and the results are then combined appropriately at the end.

3. Data Analysis. Once the survey design is finalized and the survey is conducted, you will have a mass of information - statistics - collected from the population. This must be checked for errors, transcribed (usually into machine readable form), and summarized. The analysis is dependent upon BOTH the data collected (the sample) and the way the data was collected (the sample selection process). For example, if the data were collected using a stratified sampling design, they must be analyzed using the methods for stratified designs - you can't simply pretend after the fact that the data were collected using a simple random sampling design. We will emphasize this point continually in this course - you must match the analysis with the design! For example, consider a Gallup Poll where 511 out of 1089 Canadians interviewed were in favor of an issue. Then our statistic is that 47% of our sample respondents were in favor.

4. Inference back to the Population. Despite an enormous amount of money spent collecting the data, interest really lies in the population, not the sample. The sample is merely a device to gather information about the population. How should the information from the sample be used to make inferences about the population?
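The "combined appropriately at the end" step for a stratified survey can be sketched in a few lines. This is an illustrative Python snippet with made-up regional numbers, not real Gallup data: each stratum's sample proportion is weighted by that stratum's share of the population.

```python
# Hypothetical stratified poll: each region is surveyed separately and the
# regional results are combined, weighting each stratum by its population share.
strata = {
    # region: (population share, sample proportion in favour) - invented numbers
    "East":    (0.32, 0.52),
    "Central": (0.45, 0.44),
    "West":    (0.23, 0.47),
}

overall = sum(share * p for share, p in strata.values())
print(round(overall, 3))  # weighted national estimate
```

Note that the weights come from the population (e.g. census counts), not from the sample sizes in each region, which is exactly why the analysis must know how the data were collected.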
Graphing. A good graph is always preferable to a table of numbers or to numerical statistics. A graph should be clear, relevant, and informative. Beware of graphs that try to mislead, by design or accident, through misleading scales, chart junk, or three dimensional effects. There are a number of good books on effective statistical graphics - these should be consulted for further information. (A "perfect" thesis defense would be to place a graph of your results on the overhead and then sit down to thunderous applause!) Unfortunately, many people rely upon the graphical tools available in spreadsheet software such as Excel, which invariably leads to poor graphs. As a rule of thumb, Excel has the largest collection of bad graphical designs available in the free world! You may enjoy the article on "Using Microsoft Excel to obscure your data and annoy your readers" available at http://www.biostat.wisc.edu/~kbroman/presentations/graphs_uwpath08_handout.pdf.

Estimation. The number obtained from our sample is an estimate of the true, unknown, value of the population parameter. How precise is our estimate? Are we within 10 percentage points of the correct answer? A good survey or experiment will report a measure of precision for any estimate. In this example, 511 of 1089 people were in favor of the accord. Our estimate of the proportion of all Canadian voters in favor of the accord is 511/1089 = 47%. These results are 'accurate to
within 3 percentage points, 19 times out of 20', which implies that we are reasonably confident that the true proportion of voters in favor of the accord is between 47% − 3% = 44% and 47% + 3% = 50%. Technically, this is known as a 95% confidence interval - the details of which will be explored later in this chapter.

(Hypothesis) Testing. Suppose that in last month's poll (conducted in a similar fashion), only 42% of voters were in favor. Has the support increased? Because each percentage value is accurate to about 3 percentage points, it is possible that in fact there has been no change in support! It is possible to make a more formal 'test' of the hypothesis of no change. Again, this will be explored in more detail later in this chapter.
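The arithmetic behind the "3 percentage points, 19 times out of 20" statement can be reproduced directly (a short Python calculation for illustration; the formula is the standard large-sample standard error of a proportion):

```python
import math

# The poll from the text: 511 of 1089 respondents in favour.
n, x = 1089, 511
phat = x / n                           # estimate: 0.469, reported as 47%
se = math.sqrt(phat * (1 - phat) / n)  # standard error of a sample proportion
margin = 1.96 * se                     # 95% margin: "19 times out of 20"
ci = (phat - margin, phat + margin)    # roughly 44% to 50%
print(round(phat, 2), round(margin, 3))
```

The margin works out to about 0.030, which is exactly the "within 3 percentage points" claim in the poll report.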
2.2 Parameters, Statistics, Standard Deviations, and Standard Errors
Section summary:

1. Distinguish between a parameter and a statistic
2. What does a standard deviation measure?
3. What does a standard error measure?
4. How are estimated standard errors determined (in general)?
2.2.1 A review
DDT is a very persistent pesticide. Once applied, it remains in the environment for many years and tends to accumulate up the food chain. For example, birds which eat rodents which eat insects which ingest DDT-contaminated plants can have very high levels of DDT, and this can interfere with reproduction. [This is similar to what is happening in the Great Lakes, where herring gulls have very high levels of pesticides, or what is happening in the St. Lawrence River, where resident beluga whales have such high levels of contaminants that they are considered hazardous waste if they die and wash up on shore.] DDT has been banned in Canada for several years, and scientists are measuring the DDT levels in wildlife to see how quickly it is declining.

The Science of Statistics is all about measurement and variation. If there were no variation, there would be no need for statistical methods. For example, consider a survey to measure DDT levels in gulls on Triangle Island off the coast of British Columbia, Canada. If all the gulls on Triangle Island had exactly the same DDT level, then it would suffice to select a single gull from the island and measure its DDT level. Alas, the DDT level can vary by the age of the gull, by where it feeds, and a host of other unknown and uncontrollable variables. Consequently, the average DDT level over ALL gulls on Triangle Island seems like
a sensible measure of the pesticide load in the population. We recognize that some gulls may have levels above this average and some below, but feel that the average DDT level is indicative of the health of the population, and that changes in the population mean (e.g. a decline) are an indication of an improvement.

Population mean and population standard deviation. Conceptually, we can envision a listing of the DDT levels of each and every gull on Triangle Island. From this listing, we could conceivably compute the true population average and the (population) standard deviation of the DDT levels. [Of course, in practice these are unknown and unknowable.] Statistics often uses Greek symbols to represent the theoretical values of population parameters. In this case, the population mean is denoted by the Greek letter mu (µ) and the population standard deviation by the Greek letter sigma (σ). The population standard deviation measures the variation of individual measurements about the mean in the population. In this example, µ would represent the average DDT over all gulls on the island, and σ would represent the variation of values around the population mean. Both of these values are unknown.

Scientists took a random sample (how was this done?) of 10 gulls and found the following DDT levels (ppm): 100, 105, 97, 103, 96, 106, 102, 97, 99, 103. The data is available in the ddt.csv file in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The data are imported into SAS in the usual fashion:
data ddt;
   infile 'ddt.csv' dlm=',' dsd missover firstobs=2;
   input ddt;
run;
The raw data are shown below:
Obs   ddt
  1   100
  2   105
  3    97
  4   103
  5    96
  6   106
  7   102
  8    97
  9    99
 10   103
Sample mean and sample standard deviation

The sample average and sample standard deviation could be computed from these values using a spreadsheet, a calculator, or a statistical package. Proc Univariate provides basic plots and summary statistics in the SAS system.
ods graphics on;
proc univariate data=ddt plots cibasic;
   var ddt;
   histogram ddt /normal;
   qqplot ddt /normal;
   ods output Moments=DDTmoments;
   ods output BasicIntervals=DDTci;
run;
ods graphics off;
This gives the following plots and summary table:
VarName  Statistic        Value        Statistic         Value
ddt      N                10.000000    Sum Weights       10.000000
ddt      Mean             100.800000   Sum Observations  1008.000000
ddt      Std Deviation    3.521363     Variance          12.400000
ddt      Skewness         0.035116     Kurtosis          -1.429377
ddt      Uncorrected SS   101718       Corrected SS      111.600000
ddt      Coeff Variation  3.493416     Std Error Mean    1.113553
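The notes use SAS throughout; as a cross-check (not part of the original notes), the key entries of the summary table can be reproduced with a short Python sketch using only the standard library:

```python
# Sketch: re-computing the Proc Univariate summary statistics for the
# 10 DDT measurements with Python's statistics module.
import math
import statistics

ddt = [100, 105, 97, 103, 96, 106, 102, 97, 99, 103]

n = len(ddt)
mean = statistics.mean(ddt)                        # 100.8 ppm
s = statistics.stdev(ddt)                          # sample std deviation, 3.5214 ppm
se = s / math.sqrt(n)                              # estimated standard error, 1.1136 ppm
corrected_ss = sum((y - mean) ** 2 for y in ddt)   # Corrected SS, 111.6
```

Note that `statistics.stdev` uses the n − 1 divisor, matching the sample standard deviation reported by SAS.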
A different notation is used to represent sample quantities, to distinguish them from population parameters. In this case the sample mean, denoted Y and pronounced "Y-bar", has the value 100.8 ppm, and the sample standard deviation, denoted s, has the value 3.52 ppm. The sample mean is a measure of the middle of the sample data, and the sample standard deviation measures the variation of the sample data around the sample mean.

What would happen if a different sample of 10 gulls were selected? It seems reasonable that the sample mean and sample standard deviation would also change among samples, and we hope that if our sample is large enough, the change in the statistics would not be that large. Here is the data from an additional 8 samples, each of size 10:
Set   DDT levels in the gulls                  Sample mean   std
1     102 102 103  95 105  97  95 104  98 103        100.4   3.8
2     100 103  99  98  95  98  94 100  90 103         98.0   4.1
3     101  96 106 102 104  95  98 103 108 104        101.7   4.2
4     101 100  99  90 102  99 105  92 100 102         99.0   4.6
5     107  98 101 100 100  98 107  99 104  98        101.2   3.6
6     102 102 101 101  92  94 104 100 101  97         99.4   3.8
7      94 101 100 100  96 101 100  98  94  98         98.2   2.7
8     104 102  97 104  97  99 100 100 109 102        101.4   3.7
Note that the statistics (Y, the sample mean, and s, the sample standard deviation) change from sample to sample. This is not unexpected, as it is highly unlikely that two different samples would give identical results.

What does the variation in the sample mean over repeated samples from the same population tell us? For example, based on the values of the sample mean above, could the true population mean DDT over all gulls be 150 ppm? Could it be 120 ppm? Could it be 101 ppm? Why?

If more and more samples were taken, you would end up with a large number of sample means. A histogram of the sample means over the repeated samples could be drawn. This would be known as the sampling distribution of the sample mean.

The latter result is a key concept of statistical inference and is quite abstract because, in practice, you never see the sampling distribution. The distribution of individual values over the entire population can be visualized; the distribution of individual values in the particular sample can be examined directly because you have actual data; but the hypothetical distribution of a statistic over repeated samples from the population, while always present, remains one level of abstraction away from the actual data.

Because the sample mean varies from sample to sample, it is theoretically possible to compute a standard
deviation of the statistic as it varies over all possible samples drawn from the population. This is known as the standard error (abbreviated SE) of the statistic; in this case it would be the standard error of the sample mean.

Because we have repeated samples in the gull example, we can compute the actual standard deviation of the sample mean over the 9 replicates (the original sample plus the additional 8 samples). This gives an estimated standard error of 1.40 ppm. This measures the variability of the statistic (Y) over repeated samples from the same population.

But unless you take repeated samples from the same population, how can the standard error ever be determined? Now statistical theory comes into play. Every statistic varies over repeated samples. In some cases, it is possible to derive from statistical theory how much the statistic will vary from sample to sample. In the case of the sample mean from a simple random sample[2] from any population, the se is theoretically equal to:

   se(Y) = σ/√n

Note that every statistic will have a different theoretical formula for its standard error, and the formula will change depending on how the sample was selected. But this theoretical standard error depends upon an unknown quantity, the theoretical population standard deviation σ. It seems sensible to estimate the standard error by replacing σ by an estimate, the sample standard deviation s. This gives:

   Estimated Std Error Mean = s/√n = 3.5214/√10 = 1.1136 ppm.

This number is an estimate of the variability of Y in repeated samples of the same size selected at random from the same population. SAS reports the se in the lower right corner of the summary statistics seen earlier.

A summary of the crucial points:

• Parameter: A parameter is a numerical measure of the entire population.
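The empirical standard error quoted above can be verified directly. This Python sketch (a cross-check, not part of the original SAS analysis) takes the 9 sample means from the gull example and computes their standard deviation:

```python
# Sketch: the empirical standard error from the 9 gull samples (the mean
# of the original sample plus the 8 extra sample means listed above).
import statistics

sample_means = [100.8, 100.4, 98.0, 101.7, 99.0, 101.2, 99.4, 98.2, 101.4]

# Standard deviation of the sample means over the 9 replicates,
# i.e. the empirical standard error of Y: about 1.40 ppm.
empirical_se = statistics.stdev(sample_means)
```

This agrees closely with the theoretical estimate s/√n = 1.11 ppm from a single sample, as it should.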
Two common parameters are the population mean (denoted by µ) and the population standard deviation (denoted by σ). The population standard deviation measures the variation of individual values over all units in the population. Parameters always refer to the population, never to the sample.

• Statistic or Estimate: A statistic or an estimate is a numerical quantity computed from the SAMPLE. It is only a guess as to the true value of the population parameter. If you took a new sample, the estimate computed from the second sample would be different from the value computed from the first sample. Two common statistics are the sample mean (denoted Y) and the sample standard deviation (denoted s). The sample standard deviation measures the variation of individual values over the units in the sample. Statistics always refer to the sample, never to the population.

[2] A simple random sample implies that every unit in the population has the same chance of being selected and that every unit is selected independently of every other unit.
• Sampling distribution: Any statistic or estimate will change if a new sample is taken. The distribution of the statistic or estimate over repeated samples from the same population is known as the sampling distribution.

• Theoretical standard error: The variability of the estimate over all possible repeated samples from the population is measured by the standard error of the estimate. This is a theoretical quantity and could only be computed if you actually took all possible samples from the population.

• Estimated standard error: Now for the hard part: you typically take only a single sample from the population. But, based upon statistical theory, you know the form of the theoretical standard error, so you can use information from the sample to estimate it. Be careful to distinguish between the standard deviation of individual values in your sample and the estimated standard error of the statistic, as they refer to different types of variation. The formula for the estimated standard error is different for every statistic and also depends upon the way the sample was selected. Consequently it is vitally important that the method of sample selection and the type of estimate computed be determined carefully before using a computer package to blindly compute standard errors.

The concept of a standard error is the MOST DIFFICULT CONCEPT to grasp in statistics. The reason it is so difficult is that there is an extra layer of abstraction between what you observe and what is really happening. It is easy to visualize variation of individual elements in a sample because the values are there for you to see. It is easy to visualize variation of individual elements in a population because you can picture the set of individual units. But it is difficult to visualize the set of all possible samples, because typically you take only a single sample, and the set of all possible samples is so large.
As a final note, please do NOT use the ± notation for standard errors. The problem is that the ± notation is ambiguous and different papers in the same journal and different parts of the same paper use the ± notation for different meanings. Modern usage is to write phrases such as “the estimated mean DDT level was 100.8 (SE 1.1) ppm.”
2.2.2 Theoretical example of a sampling distribution
Here is a more detailed examination of a sampling distribution where the actual set of all possible samples can be constructed. It shows that the sample mean is unbiased and that its standard error computed from all possible samples matches that derived from statistical theory.

Suppose that a population consisted of five mice and we wish to estimate the average weight based on a sample of size 2. [Obviously, the example is hopelessly simplified compared to a real population and sampling experiment!] Normally, the population values would not be known in advance (because then why would you have to take a sample?). But suppose that the five mice had weights (in grams) of: 33, 28, 45, 43, 47.
The population mean weight and population standard deviation are found as:

• µ = (33 + 28 + 45 + 43 + 47)/5 = 39.20 g, and
• σ = 7.39 g.

The population mean is the average weight over all possible units in the population. The population standard deviation measures the variation of individual weights about the mean over the population units.

Now there are 10 possible samples of size two from this population. For each possible sample, the sample mean and sample standard deviation are computed, as shown in the following table.

Sample units   Sample mean (Y)   Sample std dev (s)
33, 28         30.50              3.54
33, 45         39.00              8.49
33, 43         38.00              7.07
33, 47         40.00              9.90
28, 45         36.50             12.02
28, 43         35.50             10.61
28, 47         37.50             13.44
45, 43         44.00              1.41
45, 47         46.00              1.41
43, 47         45.00              2.83
Average        39.20              7.07
Std dev         4.52              4.27
This table illustrates the following:

• This is a theoretical table of all possible samples of size 2. Consequently it shows the actual sampling distribution for the statistics Y and s. The sampling distribution of Y refers to the variation of Y over all the possible samples from the population. Similarly, the sampling distribution of s refers to the variation of s over all possible samples from the population.

• Some values of Y are above the population mean, and some values of Y are below the population mean. We don't know for any single sample if we are above or below the true value of the population parameter. Similarly, values of s (the sample standard deviation) also vary above and below the population standard deviation.

• The average (expected) value of Y over all possible samples is equal to the population mean. We say such estimators are unbiased. This is the hard concept! The extra level of abstraction is here: the
statistic computed from an individual sample has a distribution over all possible samples, hence the sampling distribution.

• The average (expected) value of s over all possible samples is NOT equal to the population standard deviation. We say that s is a biased estimator. This is a difficult concept: you are taking the average of an estimate of the standard deviation. The average is taken over possible values of s from all possible samples. The latter is an extra level of abstraction from the raw data. [There is nothing theoretically wrong with using a biased estimator, but most people would prefer to use an unbiased estimator. It turns out that the bias in s decreases very rapidly with sample size and so is not a concern.]

• The standard deviation of Y refers to the variation of Y over all possible samples. We call this the standard error of a statistic. [The term comes from an historical context that is not important at this point.] Do not confuse the standard error of a statistic with the sample standard deviation or the population standard deviation. The standard error measures the variability of a statistic (e.g. Y) over all possible samples. The sample standard deviation measures variability of individual units in the sample. The population standard deviation measures variability of individual units in the population.

• If the previous formula for the theoretical standard error were used in this example, it would fail to give the correct answer:

   se(Y) = 4.52 ≠ σ/√n = 7.39/√2 = 5.22

The reason this formula didn't work is that the sample size was an appreciable fraction of the entire population. A finite population correction needs to be applied in these cases. As you will see in later chapters, the se in this case is computed as:

   se(Y) = (σ/√n) √( (N/(N−1)) (1 − f) ) = (7.39/√2) √( (5/4) (1 − 2/5) ) = 4.52

where f = n/N is the sampling fraction.
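Because the mouse population is so small, the entire sampling distribution can be enumerated by brute force. The following Python sketch (a cross-check on the table above, not part of the original notes) lists all C(5, 2) = 10 samples and verifies both the unbiasedness of the sample mean and the finite-population standard error:

```python
# Sketch: enumerating all 10 possible samples of size 2 from the
# five-mouse population to check unbiasedness and the se with the
# finite population correction.
import itertools
import math
import statistics

weights = [33, 28, 45, 43, 47]
N, n = len(weights), 2

mu = statistics.mean(weights)        # population mean, 39.2 g
sigma = statistics.pstdev(weights)   # population sd (divide by N), 7.39 g

# Sample mean of every possible sample of size 2.
means = [statistics.mean(s) for s in itertools.combinations(weights, n)]

# Unbiasedness: the average of all 10 sample means equals mu.
avg_of_means = statistics.mean(means)   # 39.2 g

# Standard error over all possible samples, two ways.
se_empirical = statistics.pstdev(means)                          # 4.52 g
se_theory = sigma / math.sqrt(n) * math.sqrt((N - n) / (N - 1))  # 4.52 g
```

The empirical and theoretical standard errors agree exactly, as the finite population correction formula guarantees.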
Refer to the chapter on survey sampling for more details.
2.3 Confidence Intervals
Section summary:
1. Understand the general logic of why a confidence interval works.
2. How to graph a confidence interval for a single parameter.
3. How to interpret graphs of several confidence intervals.
4. Effect of sample size upon the size of a confidence interval.
5. Effect of variability upon the size of a confidence interval.
6. Effect of confidence level upon the size of a confidence interval.
2.3.1 A review
The basic premise of statistics is that every unit in a population cannot be measured; consequently, a sample is taken. But the statistics from a sample will vary from sample to sample, and it is highly unlikely that the value of the statistic will equal the true, unknown value of the population parameter. Confidence intervals are a way to express the level of certainty about the true population parameter value based upon the sample selected. The formulae for the various confidence intervals depend upon the statistic used and how the sample was selected, but all are derived from a general unified theory.

The following concepts are crucial and will be used over and over again in what follows:

• Estimate: The estimate is the quantity computed from the SAMPLE. It is only a guess as to the true value of the population parameter. If you took a new sample, the estimate computed from the second sample would be different from the value computed from the first sample. It seems reasonable that if you select your sample carefully, these estimates will sometimes be lower than the theoretical population parameter and sometimes higher.

• Standard error: The variability of the estimate over repeated samples from the population is measured by the standard error of the estimate. It again seems reasonable that if you select your sample carefully, the statistics should be 'close' to the true population parameters and the standard error should provide some information about the closeness of the estimate to the true population parameter.

Refer back to the DDT example considered in the last section. Scientists took a random sample of gulls from Triangle Island (off the coast of Vancouver Island, British Columbia) and measured the DDT levels in 10 gulls. The following values were obtained (ppm): 100, 105, 97, 103, 96, 106, 102, 97, 99, 103. What does the sample tell us about the true population average DDT level over all gulls on Triangle Island?
We again use Proc Univariate to compute summary statistics in the SAS system.
ods graphics on;
proc univariate data=ddt plots cibasic;
   var ddt;
   histogram ddt /normal;
   qqplot ddt /normal;
   ods output Moments=DDTmoments;
   ods output BasicIntervals=DDTci;
run;
ods graphics off;
VarName  Statistic        Value        Statistic         Value
ddt      N                10.000000    Sum Weights       10.000000
ddt      Mean             100.800000   Sum Observations  1008.000000
ddt      Std Deviation    3.521363     Variance          12.400000
ddt      Skewness         0.035116     Kurtosis          -1.429377
ddt      Uncorrected SS   101718       Corrected SS      111.600000
ddt      Coeff Variation  3.493416     Std Error Mean    1.113553
The sample mean, Y = 100.8 ppm, measures the middle of the sample data, and the sample standard deviation, s = 3.52 ppm, measures the spread of the sample data around the sample mean. Based on this sample information, is it plausible to believe that the average DDT level over ALL gulls could be as high as 150 ppm? Could it be as low as 50 ppm? Is it plausible that it could be as high as 110 ppm? As high as 101 ppm? Suppose you had the information from the other 8 samples.
Set   DDT levels in the gulls                  Sample mean   std
1     102 102 103  95 105  97  95 104  98 103        100.4   3.8
2     100 103  99  98  95  98  94 100  90 103         98.0   4.1
3     101  96 106 102 104  95  98 103 108 104        101.7   4.2
4     101 100  99  90 102  99 105  92 100 102         99.0   4.6
5     107  98 101 100 100  98 107  99 104  98        101.2   3.6
6     102 102 101 101  92  94 104 100 101  97         99.4   3.8
7      94 101 100 100  96 101 100  98  94  98         98.2   2.7
8     104 102  97 104  97  99 100 100 109 102        101.4   3.7
Based on this new information, what would you believe to be a plausible value for the true population mean? It seems reasonable that because the sample means, when taken over repeated samples from the same population, seem to lie between 98 and 102 ppm, this should provide some information about the true population value. For example, if you saw in the 8 additional samples that the range of the sample means varied between 90 and 110 ppm, would your plausible interval change?

Again statistical theory comes into play. A very famous and important (for statisticians!) theorem, the Central Limit Theorem, gives the theoretical sampling distribution of many statistics for most common sampling methods. In this case, the Central Limit Theorem states that the sample mean from a simple random sample from a large population should have an approximate normal distribution with se(Y) = σ/√n. The se of Y
measures the variability of Y around the true population mean when different samples of the same size are taken. Note that the sample mean is LESS variable than individual observations; does this make sense?

Using the properties of a Normal distribution, there is a 95% probability that Y will vary within about ±2 se of the true mean (why?). Conversely, there should be about a 95% probability that the true mean is within ±2 se of Y! This is the crucial step in statistical reasoning.

Unfortunately, σ, the population standard deviation, is unknown, so we can't find the se of Y. However, it seems reasonable to assume that s, the sample standard deviation, is a reasonable estimator of σ, the population standard deviation. So s/√n should be a reasonable estimator of σ/√n. This is what is reported in the above output, and we have that the Estimated Std Error Mean = s/√n = 3.5214/√10 = 1.1136 ppm. This number is an estimate of how variable Y is around the true population mean in repeated samples of the same size from the same population.

Consequently, it seems reasonable that there should be about a 95% probability that the true mean is within ±2 estimated se of the sample mean, or, we state that an approximate 95% confidence interval is computed as:

   Y ± 2(estimated se), or 100.8 ± 2(1.1136) = 100.8 ± 2.2272 = (98.6 → 103.0) ppm.

It turns out that we also have to account for the fact that s is only an estimate of σ (s can also vary from sample to sample), and so the estimated se may not equal the theoretical standard error. Consequently, the multiplier (2) has to be increased slightly to account for this. Proc Univariate also reports the 95% confidence interval if you specify the cibasic option on the Proc statement:
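The rough ±2 se interval above can be reproduced with a few lines of Python (a cross-check, not part of the original SAS analysis):

```python
# Sketch: the rough 95% interval, mean +/- 2 (estimated se), for the gull data.
import math
import statistics

ddt = [100, 105, 97, 103, 96, 106, 102, 97, 99, 103]

mean = statistics.mean(ddt)
se = statistics.stdev(ddt) / math.sqrt(len(ddt))

# Approximate 95% confidence interval: roughly (98.6, 103.0) ppm.
lower, upper = mean - 2 * se, mean + 2 * se
```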
VarName  Parameter      Estimate   Lower 95% Confidence Limit  Upper 95% Confidence Limit
ddt      Mean           100.80000  98.28097                    103.31903
ddt      Std Deviation  3.52136    2.42212                     6.42864
ddt      Variance       12.40000   5.86665                     41.32737
For large samples (typically greater than 30), the multiplier is very close to 2 (actually 1.96) and there is virtually no difference in the intervals, because then s is a very good estimator of σ and no additional correction is needed.

We say that we are 95% confident that the true population mean (whatever it is) is somewhere in the interval (98.3 → 103.3) ppm. What does this mean? We are pretty sure that the true mean DDT is not 110 ppm, nor is it 90 ppm. But we don't really know if it is 99 or 102 ppm. Plausible values for the true mean DDT for ALL gulls are any values in the range 98.3 → 103.3 ppm.

Note that the interval is NOT an interval for the individual values, i.e. it is NOT CORRECT to say that a 95% confidence interval includes 95% of the raw data. Rather, the confidence interval tells you a plausible
range for the true population mean µ. Also, it is not a confidence interval for the sample mean (which you know to be 100.8) but rather for the unknown population mean µ. These two points are the most common misinterpretations of confidence intervals.

To obtain a plot of the confidence interval, use Proc Ttest in SAS. Note that Proc Ttest could also have been used to obtain the basic summary statistics and graphs similar to those from Proc Univariate.
Notice that the upper and lower bars of a box-plot and the upper and lower limits of the confidence intervals are telling you different stories. Be sure that you understand the difference! Many packages and published papers don’t show confidence intervals, but rather simply show the mean and then either ±1se or ±2se from the mean as approximate 68% or 95% confidence intervals such as below:
There really isn't any reason to plot ±1 se, as these are approximate 68% confidence limits, which seems kind of silly. The reason this type of plot persists is because it is the default option in Excel, which has the largest collection of bad graphs in the world. [A general rule of thumb: DON'T USE EXCEL FOR STATISTICS!]

What are the likely effects of changing sample sizes, different amounts of variability, and different levels of confidence upon the confidence interval width? It seems reasonable that a larger sample size should be 'more precise', i.e. have less variation over repeated samples from the same population. This implies that a confidence interval based on a larger sample should be narrower for the same level of confidence, i.e. a 95% confidence interval from a sample with n = 100 should be narrower than a 95% confidence interval from a sample with n = 10 when taken from the same population. Also, if the elements in a population are more variable, then the variation of the sample mean should be larger and the corresponding confidence interval should be wider.

And why stop at a 95% confidence level; why not find a 100% confidence interval? In order to be 100% confident, you would have to sample the entire population, which is not practical in most cases. Again, it seems reasonable that interval widths will increase with the level of confidence, i.e. a 99% confidence
interval will be wider than a 95% confidence interval.

How are several groups of means compared if all were selected using a random sample? For now, one simple way to compare several groups is through the use of side-by-side confidence intervals. For example, consider a study that looked at the change in weight of animals when given one of three different drugs (A, D, or placebo), displayed as a side-by-side confidence interval plot. As you will see in later chapters, the procedure for comparing more than one group's mean depends on how the data are collected. In many cases, SAS produces plots similar to:
These are known as notched box-plots. The notches on the box-plot indicate the confidence intervals for the mean. If the notches from two groups do not overlap, then there is evidence that the population means could differ. What does this show? Because the 95% confidence intervals for drug A and drug D have considerable overlap, there doesn't appear to be much of a difference in the population means (the same value could be common to both groups). However, the small overlap in the confidence intervals of the Placebo and the other drugs provides evidence that the population means may differ. [Note the distinction between the
sample and population means in the above discussion.]

As another example, consider the following graph of barley yields for three years, along with 95% confidence intervals drawn on the graph. The data are from a study of crop yields downwind of a coal-fired generating plant that started operation in 1985. What does this suggest?
Because the 95% confidence intervals for 1984 and 1980 overlap considerably, there really isn’t much evidence that the true mean yield differs. However, because the 95% confidence interval for 1988 does not overlap the other two groups, there is good evidence that the population mean in 1988 is smaller than in the previous two years. In general, if the 95% confidence intervals of two groups do not overlap, then there is good evidence that the group population means differ. If there is considerable overlap, then the population means of both groups might be the same.
2.3.2 Some practical advice
• In order for confidence intervals to have any meaning, the data must be collected using a probability sampling method. No amount of statistical wizardry will give valid inference for data collected in a haphazard fashion. Remember, haphazard does not imply a random selection.

• If you consult statistical textbooks, they are filled with many hundreds of formulae for confidence intervals under many possible sampling designs. The formulae for confidence intervals are indeed different for various estimators and sampling designs, but they are all interpreted in a similar fashion.

• A rough-and-ready rule of thumb is that a 95% confidence interval is found as estimate ± 2 se and a 68% confidence interval is found as estimate ± 1 se. Don't worry too much about the exact formulae; if a study doesn't show clear conclusions based on these rough rules, then using more exact methods
won't improve things.

• The crucial part is finding the se. This depends upon the estimator and sampling design; pay careful attention that the computer package you are using, and the options within the computer package, match the actual data collection methods. I can't emphasize this too much! This is the most likely spot where you may inadvertently use an inappropriate analysis!

• Confidence intervals are sensitive to outliers because both the sample mean and standard deviation are sensitive to outliers.

• If the sample size is small, then you must also make a very strong assumption about the population distribution. This is because the central limit theorem only works for large samples. Recent work using the bootstrap and other resampling methods may be an alternative approach.

• The confidence interval only tells you the uncertainty in knowing the true population parameter because you only measured a sample from the population[3]. It does not cover potential imprecision caused by nonresponse, under-coverage, measurement errors, etc. In many cases, these can be orders of magnitude larger, particularly if the data were not collected according to a well-defined plan.
2.3.3 Technical details
The formula for a confidence interval for a single mean, when the data are collected using a simple random sample from a population with normally distributed data, is:

   Y ± t_{n−1} × se, or Y ± t_{n−1} × s/√n

where t_{n−1} refers to values from a t-distribution with (n − 1) degrees of freedom. Values of the t-distribution are tabulated in the tables located at http://www.stat.sfu.ca/~cschwarz/CourseNotes/. For the above example for gulls on Triangle Island, n = 10, so the multiplier for a 95% confidence interval is t9 = 2.2622, and the confidence interval was found as: 100.8 ± 2.262(1.1136) = 100.8 ± 2.5192 = (98.28 → 103.32) ppm, which matches the results provided by SAS earlier. Note that different sampling schemes may not use a t-distribution and most certainly will have different degrees of freedom for the t-distribution.

This formula is useful when the raw data are not given, and only the summary statistics (typically the sample size, the sample mean, and the sample standard deviation) are available and a confidence interval needs to be computed.

What is the effect of sample size? If the above formula is examined, the primary place where the sample size n comes into play is the denominator of the standard error. So as n increases, the se decreases. However, note that the se decreases as a function of √n, i.e. it takes 4× the sample size to reduce the standard error

[3] This is technically known as the sampling error.
by a factor of 2. This is sensible because as the sample size increases, Y should be less variable (and usually closer to the true population mean). Consequently, the width of the interval decreases. The confidence level doesn't change; we would still be roughly 95% confident, but the interval is smaller. The sample size also affects the degrees of freedom, which affects the t-value, but this effect is minor compared to the change in the se.

What is the effect of increasing the confidence level? If you wanted to be 99% confident, the t-value from the table increases. For example, the t-value for 9 degrees of freedom increases from 2.262 for a 95% confidence interval to 3.25 for a 99% confidence interval. In general, a higher confidence level will give a wider confidence interval.
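The exact t-based interval for the gull data can be checked with a Python sketch (a cross-check, not part of the original SAS analysis; the multiplier t9 = 2.2622 is taken from the t-table cited in the text, since the Python standard library has no t quantile function):

```python
# Sketch: the t-based 95% confidence interval for the gull DDT data,
# Y_bar +/- t_{n-1} * s/sqrt(n), using the tabulated multiplier for 9 df.
import math
import statistics

ddt = [100, 105, 97, 103, 96, 106, 102, 97, 99, 103]

n = len(ddt)
mean = statistics.mean(ddt)
se = statistics.stdev(ddt) / math.sqrt(n)
t9 = 2.2622                        # t multiplier, 9 df, 95% level (from tables)

# 95% confidence interval: (98.28, 103.32) ppm, matching Proc Univariate.
lower, upper = mean - t9 * se, mean + t9 * se
```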
2.4
Hypothesis testing
Section summary:
1. Understand the basic paradigm of hypothesis testing.
2. Interpret p-values correctly.
3. Understand Type I, Type II, and Type III errors.
4. Understand the limitations of hypothesis testing.
2.4.1
A review
Hypothesis testing is an important paradigm of Statistical Inference, but it has its limitations. In recent years, emphasis has moved away from formal hypothesis testing toward more inferential statistics (e.g. confidence intervals), but hypothesis testing still has an important role to play.

There are two common hypothesis testing situations encountered in ecology:

• Comparing the population parameter against a known standard. For example, environmental regulations may specify that the mean contaminant loading in water must be less than a certain fixed value.

• Comparing the population parameter among two or more groups. For example, is the average DDT loading the same for male and female birds?

The key steps in hypothesis testing are:

• Formulate the hypothesis of NO CHANGE in terms of POPULATION parameters.
• Collect data using a good sampling design or a good experimental design, paying careful attention to the RRRs.

• Using the data, compute the difference between the sample estimate and the standard, or the difference in the sample estimates among the groups.

• Evaluate if the observed change (or difference) is consistent with NO EFFECT. This is usually summarized by a p-value.
2.4.2
Comparing the population parameter against a known standard
Again consider the example of gulls on Triangle Island introduced in previous sections. Of interest is the population mean DDT level over ALL the gulls. Let µ represent the average DDT over all gulls on the island. The value of this population parameter is unknown because you would have to measure ALL gulls, which is logistically impossible to do.

Now suppose that the value of 98 ppm is a critical value for the health of the species. Is there evidence that the current population mean level is different from 98 ppm?

Scientists took a random sample (how was this done?) of 10 gulls and found the following DDT levels:

100, 105, 97, 103, 96, 106, 102, 97, 99, 103.

Proc Ttest in SAS can be used to examine if the true mean is 98. First examine the confidence interval for the mean based on the sample of 10 birds selected earlier. [Proc Univariate was used earlier, but the following output is from Proc Ttest.]
Variable   Mean    Lower Limit of Mean   Upper Limit of Mean
ddt        100.8   98.2810               103.3
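The summary statistics underlying this output can be reproduced by hand from the ten readings. The sketch below (in Python, purely for illustration; the notes themselves use SAS/JMP) computes the sample mean, sample standard deviation, and standard error:

```python
import math
import statistics

# the ten DDT readings from the random sample of gulls
ddt = [100, 105, 97, 103, 96, 106, 102, 97, 99, 103]

n = len(ddt)
mean = statistics.mean(ddt)     # sample mean
s = statistics.stdev(ddt)       # sample standard deviation (n - 1 divisor)
se = s / math.sqrt(n)           # standard error of the sample mean
print(round(mean, 1), round(s, 4), round(se, 4))
```

This reproduces the mean of 100.8 ppm and the se of about 1.11 quoted in the surrounding text.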
First examine the 95% confidence interval presented above. The confidence interval excludes the value of 98 ppm, so one is fairly confident that the population mean DDT level differs from 98 ppm. Furthermore, the confidence interval gives information about what the population mean DDT level could be. Note that the hypothesized value of 98 ppm is just outside the 95% confidence interval.

A hypothesis test is much more ‘formal’ and consists of several steps:

1. Formulate hypotheses. This is a formal statement of two alternatives. The null hypothesis (denoted
as H0 or H) indicates the state of ignorance or no effect. The alternate hypothesis (denoted as H1 or A) indicates the effect that is to be detected if present. Both the null and alternate hypotheses can be formulated before any data are collected and are always formulated in terms of the population parameters. They are NEVER formulated in terms of the sample statistics, as these would vary from sample to sample. In this case, the null and alternate hypotheses are:

H: µ = 98, i.e. the mean DDT level for ALL gulls is 98 ppm.

A: µ ≠ 98, i.e. the mean DDT level for ALL gulls is not 98 ppm.

This is known as a two-sided test because we are interested in whether the mean is either greater than or less than 98 ppm.4

2. Collect data. Again it is important that the data be collected using probability sampling methods and the RRRs. The form of the data collection will influence the next step.

3. Compute a test-statistic and p-value. The test-statistic is computed from the data and measures the discrepancy between the observed data and the null hypothesis, i.e. how far is the observed sample mean of 100.8 ppm from the hypothesized value of 98 ppm? The JMP output is:
We use the H0=98 option on the Proc

4 It is possible to construct what are known as one-sided tests, where interest lies ONLY in whether the population mean exceeds 98 ppm, or ONLY in whether the population mean is less than 98 ppm. These are rarely useful in ecological work.
Ttest statement in SAS:

ods graphics on;
proc ttest data=ddt dist=normal h0=98;
  title2 'Test if the mean is 98';
  var ddt;
  ods output TTests=Test98;
  ods output ConfLimits=CIMean1;
run;
ods graphics off;

giving:
Variable   t Value   DF   Pr > |t|
ddt        2.51      9    0.0331
The output examines if the data are consistent with the hypothesized value (98 ppm), followed by the estimate (the sample mean) of 100.8 ppm. From the earlier output, we know that the se of the sample mean is 1.11. How discordant is the sample mean of 100.8 ppm with the hypothesized value of 98 ppm? One discrepancy measure is known as a T-ratio and is computed as:

T = (estimate − hypothesized value) / estimated se = (100.8 − 98) / (3.52136/√10) = 2.5145
This implies the estimate is about 2.5 se different from the null hypothesis value of 98. This T-ratio is labelled as the Test Statistic in the output. Note that there are many measures of discrepancy of the estimate with the null hypothesis - JMP also provides a ‘non-parametric’ statistic suitable when the assumption of normality in the population may be suspect - this is not covered in this course.

How is this measure of discordance between the sample mean (100.8) and the hypothesized value of 98 assessed? The unusualness of the test statistic is measured by finding the probability of observing the current test statistic assuming the null hypothesis is true. In other words, if the hypothesis were true (and the true population mean is 98 ppm), what is the probability of finding a sample mean of 100.8 ppm?5 This is denoted the p-value.

Notice that the p-value is attached to the data - it measures the probability of the sample mean given the hypothesis is true. Probabilities CANNOT be attached to hypotheses - it would be incorrect to say that there is a 3% chance that the hypothesis is true. The hypothesis is either true or false; it can’t be “partially” true!6

5 In actual fact, the probability is computed of finding a value of 100.8 or more distant from the hypothesized value of 98. This will be explained in more detail later in the notes.

6 This is similar to asking a small child if they took a cookie. The truth is either “yes” or “no”, but often you will get the response of “maybe”, which really doesn’t make much sense!
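The arithmetic of the T-ratio is easy to verify directly. The sketch below (in Python, purely for illustration; the values are the summary statistics from the output above) reproduces the test statistic of 2.51:

```python
import math

mean = 100.8      # observed sample mean (ppm)
h0 = 98           # hypothesized population mean (ppm)
s = 3.52136       # sample standard deviation
n = 10            # sample size

se = s / math.sqrt(n)       # estimated standard error of the mean
T = (mean - h0) / se        # T-ratio: discrepancy measured in units of se
print(round(T, 4))
```

The result matches the t Value of 2.51 reported by Proc Ttest (to two decimals).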
It is possible to construct what are known as “one-sided” tests using Proc Ttest – these are not pursued in this course; contact me for more details.

4. Make a decision. How are the test statistic and p-value used? The basic paradigm of hypothesis testing is that unusual events provide evidence against the null hypothesis. Logically, rare events “shouldn’t happen” if the null hypothesis is true. This logic can be confusing! We will discuss it more in class.

In our case, the p-value of 0.0331 indicates there is an approximate 3.3% chance of observing a sample mean at least this far from the hypothesized value of 98 if the null hypothesis were true. Is this unusual? There are no fixed guidelines for the degree of unusualness expected before declaring it to be unusual. Many people use a 5% cut-off value, i.e. if the p-value is less than 0.05, then this is evidence against the null hypothesis; if the p-value is greater than 0.05, then this is not evidence against the null hypothesis. [This cut-off value is often called the α-level.] If we adopt this cut-off value, then our observed p-value of 0.0331 is evidence against the null hypothesis and we find that there is evidence that the true mean DDT level is different from 98 ppm.

The plot at the bottom of the output that is presented by JMP is helpful in trying to understand what is going on. [No such equivalent plot is readily available in R or SAS.] It tries to give a measure of how unusual the sample mean of 100.8 is relative to the hypothesized value of 98. If the hypothesis were true, and the true population mean was 98, then you would expect the sample means to be clustered around the value of 98. The bell-shaped curve shows the distribution of the SAMPLE MEANS if repeated samples are taken from the same population. It is centered over the true population mean (98) with a variability measured by the se of 1.11.
The small vertical tick mark just under the value of 101 represents the observed sample mean of 100.8. You can see that the observed sample mean of 100.8 is somewhat unusual compared to the population value of 98. The shaded areas in the two tails represent the probability of observing a value of the sample mean so far away from the hypothesized value (in either direction) if the hypothesis were true; this area represents the p-value.

If you repeated the same steps with a hypothesized value of 80, you would see that the observed sample mean of 100.8 is extremely unusual relative to the population value of 80. The JMP output is:
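The bell-shaped curve of sample means described above can be mimicked by simulation. The sketch below (in Python, purely for illustration) draws repeated samples of n = 10 from a Normal population with mean 98 and standard deviation 3.52 (the null-hypothesis world) and confirms that the sample means cluster around 98 with a spread close to the se of 1.11:

```python
import random
import statistics

random.seed(1)  # fixed seed so the simulation is repeatable

n, reps = 10, 2000
# draw `reps` samples of size n from the null-hypothesis population
sample_means = [
    statistics.mean(random.gauss(98, 3.52) for _ in range(n))
    for _ in range(reps)
]

center = statistics.mean(sample_means)   # should be near 98
spread = statistics.stdev(sample_means)  # should be near 3.52/sqrt(10) = 1.11
print(round(center, 2), round(spread, 2))
```

Against this simulated bell curve, an observed sample mean of 100.8 sits roughly 2.5 spreads above the center, which is why it lands in the tails of the plot.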
We use the H0=80 option on the Proc Ttest statement in SAS
ods graphics on;
proc ttest data=ddt dist=normal h0=80;
  title2 'Test if the mean is 80';
  var ddt;
  ods output TTests=Test80;
run;
ods graphics off;
giving:
Variable   t Value   DF   Pr > |t|
ddt        18.68     9    <.0001
• The two-sided p-value is computed as P(|T| > |T0|) and is found to be 0.0331.

• In some very rare cases, the population standard deviation σ is known. In these cases, the true standard error is known, and the test-statistic is compared to a normal distribution. This is extremely rare in practice.

• The assumption of normality of the population values can be relaxed if the sample size is sufficiently large. In those cases, the central limit theorem indicates that the distribution of the test-statistic is known regardless of the underlying distribution of population values.
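The two-sided p-value can be reproduced without statistical tables by numerically integrating the Student-t density. The sketch below (in Python, purely for illustration; the density formula is the standard t-distribution pdf) recovers the p-value of 0.0331 for T = 2.5145 with 9 degrees of freedom:

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t_obs, df, steps=4000):
    """P(|T| > |t_obs|) via Simpson's rule on [0, |t_obs|]; steps must be even."""
    a, b = 0.0, abs(t_obs)
    h = (b - a) / steps
    total = t_pdf(a, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(a + i * h, df)
    central_area = total * h / 3     # P(0 < T < |t_obs|)
    return 1 - 2 * central_area      # both tails, by symmetry of the t-density

p = two_sided_p(2.5145, 9)
print(round(p, 4))
```

In practice the p-value is of course read straight from the SAS or JMP output; the point of the sketch is only that the tail area behind the 0.0331 is an ordinary integral.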
2.4.3
Comparing the population parameter between two groups
A common situation in ecology is to compare a population parameter in two (or more) groups. For example, it may be of interest to investigate if the mean DDT levels in male and female birds could be the same. There are now two population parameters of interest. The mean DDT level of male birds is denoted as µm while the mean DDT level of female birds is denoted as µf. These would be the mean DDT levels of all birds of their respective sex – once again this cannot be measured as not all birds can be sampled.

The hypotheses of interest are:

• H: µm = µf, or equivalently H: µm − µf = 0

• A: µm ≠ µf, or equivalently A: µm − µf ≠ 0
Again note that the hypotheses are in terms of the POPULATION parameters. The alternate hypothesis indicates that a difference in means in either direction is of interest, i.e. we don’t have an a priori belief that male birds have a smaller or larger population mean compared to female birds.

A random sample is taken from each of the populations using the RRR. The raw data are read in the usual way:
data ddt2g;
  infile 'ddt2g.csv' dlm=',' dsd missover firstobs=2;
  input sex $ ddt;
run;
giving:
Obs   sex   ddt
  1    m    100
  2    m     98
  3    m    102
  4    m    103
  5    m     99
  6    f    104
  7    f    105
  8    f    107
  9    f    105
 10    f    103
Notice there are now two columns. One column identifies the group membership of each bird (the sex) and is nominal or ordinal in scale. The second column gives the DDT reading for each bird.7

We start by using Proc SGplot to create side-by-side dot-plots and box plots:
proc sgplot data=ddt2g;
  title2 'Plot of ddt vs. sex';
  scatter x=sex y=ddt;
  xaxis offsetmin=.05 offsetmax=.05;
run;
and
proc sgplot data=ddt2g;
  title2 'Box plots';
  vbox ddt / group=sex notches; /* the notches option creates an overlap region to compare if medians are equal */
run;
which gives
7 The columns can be in any order. As well, the data can be in any order and male and female birds can be interspersed.
Next, compute simple summary statistics for each group. Proc Tabulate is used to construct a table of means and standard deviations:

proc tabulate data=ddt2g;
  title2 'some basic summary statistics';
  class sex;
  var ddt;
  table sex, ddt*(n*f=5.0 mean*f=5.1 std*f=5.1 stderr*f=7.2 lclm*f=7.1 uclm*f=7.1);
run;
which gives:
            ddt
sex   N   Mean    Std   StdErr   95_LCLM   95_UCLM
f     5   104.8   1.5   0.66     103.0     106.6
m     5   100.4   2.1   0.93      97.8     103.0
The individual sample means and se for each sex are reported, along with 95% confidence intervals for the population mean DDT of each sex. The 95% confidence intervals for the two sexes have virtually no overlap, which implies that a single plausible value common to both sexes is unlikely to exist.

Because we are interested in comparing the two population means, it seems sensible to estimate the difference in the means. This can be done for this experiment using a statistical technique called (for historical reasons) a “t-test”.8 Proc Ttest is used to perform the test of the hypothesis that the two means are the same:
ods graphics on;
proc ttest data=ddt2g plot=all dist=normal;
  title2 'test of equality of ddts between the two sexs';
  class sex;
  var ddt;
  ods output ttests = TtestTest;
  ods output ConfLimits=TtestCL;
  ods output Statistics=TtestStat;
run;

8 The t-test requires a simple random sample from each group.
ods graphics off;

The output is voluminous, and selected portions are reproduced below:

Variable   Method          Variances   t Value   DF       Pr > |t|
ddt        Pooled          Equal       3.86      8        0.0048
ddt        Satterthwaite   Unequal     3.86      7.2439   0.0058

Variable   sex          Method          Variances   Mean     Lower Limit of Mean   Upper Limit of Mean
ddt        Diff (1-2)   Pooled          Equal       4.4000   1.7708                7.0292
ddt        Diff (1-2)   Satterthwaite   Unequal     4.4000   1.7222                7.0778

Variable   sex          N   Mean     Std Error   Lower Limit of Mean   Upper Limit of Mean
ddt        f            5   104.8    0.6633      103.0                 106.6
ddt        m            5   100.4    0.9274      97.8252               103.0
ddt        Diff (1-2)   _   4.4000   1.1402      1.7708                7.0292

and a final plot of:
The first part of the output estimates the difference in the population means. Because each sample mean is an unbiased estimator for the corresponding population mean, it seems reasonable that the difference in sample means should be unbiased for the difference in population means. Unfortunately, many packages do NOT make it obvious in which order the difference in means was computed. Many packages order the groups alphabetically, but this can often be changed. Here the Diff (1-2) line is the female mean minus the male mean, so the estimated difference in means is 4.4 ppm; the sample mean DDT for the females is larger than the sample mean DDT for the males.

As usual, a measure of precision (the se) should be reported for each estimate. The se for the difference in means is 1.14 (refer to later chapters on how the se is computed), and the 95% confidence interval for the difference in population means is from (1.77 → 7.03). Because the 95% confidence interval for the difference in population means does NOT include the value of 0, there is evidence that the mean DDT for all males could be different from the mean DDT for all females.

The t-ratio is again a measure of how far the difference in sample means is from the hypothesized value of 0 difference and is found as the observed difference divided by the se of the difference. The p-value of .0058 indicates that the observed difference in sample means of 4.4 is quite unusual if the hypothesis were true. Because the p-value is quite small, there is strong evidence against the hypothesis of no difference.

The comparison of means (and other parameters) will be explored in more detail in future chapters.
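The pooled (equal-variance) results in the output above can be verified by hand. The sketch below (in Python, purely for illustration; the data are the ten birds listed earlier, and 2.306 is the tabulated t-value for 8 df at 95% confidence) reproduces the difference, its se, the t-ratio, and the confidence interval:

```python
import math
import statistics

males   = [100, 98, 102, 103, 99]
females = [104, 105, 107, 105, 103]

nm, nf = len(males), len(females)
diff = statistics.mean(females) - statistics.mean(males)   # Diff (1-2) = f - m

# pooled variance combines the two sample variances, weighted by their df
sp2 = ((nm - 1) * statistics.variance(males) +
       (nf - 1) * statistics.variance(females)) / (nm + nf - 2)
se_diff = math.sqrt(sp2 * (1 / nm + 1 / nf))

t_ratio = diff / se_diff
t_crit = 2.306                  # t-table value for 8 df, 95% confidence
ci = (diff - t_crit * se_diff, diff + t_crit * se_diff)
print(round(diff, 1), round(se_diff, 4), round(t_ratio, 2),
      (round(ci[0], 4), round(ci[1], 4)))
```

This recovers the 4.4 ppm difference, the se of 1.1402, the t Value of 3.86, and the pooled interval (1.7708, 7.0292) shown in the SAS output.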
2.4.4
Type I, Type II and Type III errors
Hypothesis testing can be thought of as analogous to a courtroom trial. The null hypothesis is that the defendant is “innocent” while the alternate hypothesis is that the defendant is “guilty”. The role of the prosecutor is to gather evidence that is inconsistent with the null hypothesis. If the evidence is sufficiently unusual under the assumption of innocence, the null hypothesis is disbelieved.
Obviously the criminal justice system is not perfect. Occasionally mistakes are made (innocent people are convicted or guilty parties are not convicted). The same types of errors can also occur when testing scientific hypotheses. For historical reasons, the two possible types of errors that can occur are labeled as Type I and Type II errors:

• Type I error. Also known as a false positive. A Type I error occurs when evidence against the null hypothesis is erroneously found when, in fact, the hypothesis is true. How can this occur? Well, the p-value measures the probability that the data could have occurred, by chance, if the null hypothesis were true. We usually conclude that the evidence is strong against the null hypothesis if the p-value is small, i.e. a rare event. However, rare events do occur, and perhaps the data is just one of these rare events. The Type I error rate can be controlled by the cut-off value used to decide if the evidence against the hypothesis is sufficiently strong. If you believe that the evidence is strong enough when the p-value is less than the α = .05 level, then you are willing to accept a 5% chance of making a Type I error.

• Type II error. Also known as a false negative. A Type II error occurs when you believe that the evidence against the null hypothesis is not strong enough when, in fact, the hypothesis is false. How can this occur? The usual reason for a Type II error is that the sample size is too small to make a good decision. For example, suppose that the confidence interval for the gull example extended from 50 to 150 ppm. There is no evidence that any value in the range of 50 to 150 is inconsistent with the null hypothesis.

There are two types of correct decision:

• Power or Sensitivity. The power of a hypothesis test is the ability to conclude that the evidence is strong enough against the null hypothesis when in fact it is false, i.e. the ability to detect if the null hypothesis is false. This is controlled by the sample size.

• Specificity. The specificity of a test is the ability to correctly find no evidence against the null hypothesis when it is true.

In any experiment, it is never known if one of these errors or a correct decision has been made. The Type I and Type II errors and the two correct decisions can be placed into a summary table:
• Null hypothesis true:
  – p-value < α (evidence against the null hypothesis): Type I error = false positive. This is controlled by the α-level used to decide if the evidence is strong enough against the null hypothesis.
  – p-value > α (no evidence against the null hypothesis): Correct decision. Also known as the specificity of the test.

• Null hypothesis false:
  – p-value < α (evidence against the null hypothesis): Correct decision. This is known as the power of the test or the sensitivity of the test. Controlled by the sample size, with a larger sample size having greater power to detect a false null hypothesis.
  – p-value > α (no evidence against the null hypothesis): Type II error = false negative. Controlled by sample size, with a larger sample size leading to fewer Type II errors.
In the context of a monitoring design to determine if there is an environmental impact due to some action, the above table reduces to:

• Null hypothesis true (no environmental impact):
  – p-value < α (impact apparently observed): Type I error = false positive. An environmental impact is “detected” when, in fact, none occurred.
  – p-value > α (impact apparently not observed): Correct decision. No environmental impact detected.

• Null hypothesis false (environmental impact exists):
  – p-value < α (impact apparently observed): Correct decision. Environmental impact detected.
  – p-value > α (impact apparently not observed): Type II error = false negative. Environmental impact not detected.
Usually, a Type I error is more serious (convicting an innocent person; falsely detecting an environmental impact and fining an organization millions of dollars), and so we want good evidence before we conclude against the null hypothesis. We measure the strength of the evidence by the p-value. Typically, we want
the p-value to be less than about 5% before we believe that the evidence is strong enough against the null hypothesis, but this can be varied depending on the problem. If the consequences of a Type I error are severe, the evidence must be very strong before action is taken, so the α level might be reduced to .01 from the “usual” .05.

Most experimental studies tend to ignore power (and Type II error) issues. However, these are important – for example, should an experiment be run that only has a 10% chance of detecting an important effect? What are the consequences of failing to detect an environmental impact? What is the price tag of letting a species go extinct without detecting it? We will explore issues of power and sample size in later chapters.

What is a Type III error? This is more whimsical, as it refers to a correct answer to the wrong question! Too often, researchers get caught up in their particular research project and spend much time and energy in obtaining an answer, but the answer is not relevant to the question of interest.
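The claim that the α-level controls the Type I error rate can be checked by simulation. The sketch below (in Python, purely for illustration) generates many datasets from a world where the null hypothesis is true (gull-like samples of n = 10 from a Normal population with mean 98), tests each at α = 0.05 using the tabulated t cut-off of 2.262 for 9 df, and finds that roughly 5% of the tests wrongly find evidence against the null:

```python
import math
import random
import statistics

random.seed(7)  # fixed seed so the simulation is repeatable

n, reps, t_crit = 10, 4000, 2.262   # 2.262 = t-table value, 9 df, alpha = .05
false_positives = 0
for _ in range(reps):
    # the null hypothesis is TRUE here: the population mean really is 98
    sample = [random.gauss(98, 3.52) for _ in range(n)]
    se = statistics.stdev(sample) / math.sqrt(n)
    t = (statistics.mean(sample) - 98) / se
    if abs(t) > t_crit:             # "evidence against the null" at alpha = .05
        false_positives += 1

type1_rate = false_positives / reps  # should be close to 0.05
print(round(type1_rate, 3))
```

Lowering the cut-off (e.g. using the 9 df value for α = .01 instead) would reduce this false-positive rate, at the cost of more Type II errors.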
2.4.5
Some practical advice
• The p-value does NOT measure the probability that the null hypothesis is true. It measures the probability of observing the sample data assuming the null hypothesis were true. You cannot attach a probability statement to the null hypothesis, in the same way that you can’t be 90% pregnant! The hypothesis is either true or false – there is no randomness attached to a hypothesis. The randomness is attached to the data.

• A rough rule of thumb is that there is sufficient evidence against the hypothesis if the observed test statistic is more than 2 se away from the hypothesized value.

• The p-value is also known as the observed significance level. In the past, you chose a prespecified significance level (known as the α level) and if the p-value was less than α, you concluded against the null hypothesis. For example, α is often set at 0.05 (denoted α = 0.05). If the p-value is < α = 0.05, then you concluded that the evidence was strong against the null hypothesis; otherwise the evidence was not strong enough against the null hypothesis. Scientific papers often reported results using a series of asterisks, e.g. “*” meant that a result was statistically significant at α = .05; “**” meant that a result was statistically significant at α = .01; “***” meant that a result was statistically significant at α = .001. This practice reflects a time when it was quite impossible to compute exact p-values, and only tables were available. In this modern era, there is no excuse for failing to report the exact p-value. All scientific papers should report the actual p-value for a test so that the reader can use their own personal significance level.

• Some ‘traditional’ and recommended nomenclature for the results of hypothesis testing:

p-value                Traditional                            Recommended
p-value < .05          Reject the null hypothesis.            There is strong evidence against the null hypothesis.
.05 < p-value < .15    Fail to reject the null hypothesis.    There is weak evidence against the null hypothesis.
p-value > .15          Fail to reject the null hypothesis.    There is no evidence against the null hypothesis.
However, the point at which we conclude that there is sufficient evidence against the null hypothesis (the α level, which was .05 above) depends upon the situation at hand and the consequences of wrong decisions (see later in this chapter).

• It is not good form to state things like:
  – accept the null hypothesis;
  – accept the alternate hypothesis;
  – the null hypothesis is true;
  – the null hypothesis is false.

The reason is that you haven’t ‘proved’ the truthfulness or falseness of the hypothesis; rather, you do or do not have sufficient evidence that contradicts it. It is for the same reason that jury trials return verdicts of ‘guilty’ (evidence against the hypothesis of innocence) or ‘not guilty’ (insufficient evidence against the hypothesis of innocence). A jury trial does NOT return an ‘innocent’ verdict.

• If there is evidence against the null hypothesis, a natural question to ask is ‘well, what values of the parameter are plausible given this data?’ This is exactly what a confidence interval tells you. Consequently, I usually prefer to find confidence intervals, rather than doing formal hypothesis testing.

• Carrying out a statistical test of a hypothesis is straightforward with many computer packages. However, using tests wisely is not so simple. Hypothesis testing demands the RRR. Any survey or experiment that doesn’t follow the three basic principles of statistics (randomization, replication, and blocking) is basically useless. In particular, non-randomized surveys or experiments CANNOT be used in hypothesis testing or inference. Be careful that ‘random’ is not confused with ‘haphazard’. Computer packages do not know how you collected data. It is your responsibility to ensure that your brain is engaged before putting the package in gear. Each test is valid only in circumstances where the method of data collection adheres to the assumptions of the test. Some hesitation about the use of significance tests is a sign of statistical maturity.

• Beware of outliers or other problems with the data. Be prepared to spend a fair amount of time examining the raw data for spurious points.
2.4.6
The case against hypothesis testing
In recent years, there has been much debate about the usefulness of hypothesis testing in scientific research (see the next section for a selection of articles). There are a number of “problems” with the uncritical use of hypothesis testing:

• Sharp null hypothesis. The value of 98 ppm as a hypothesized value seems rather arbitrary. Why not 97.9 ppm or 98.1 ppm? Do we really think that the true DDT value is exactly 98.000000000 ppm? Perhaps it would be more reasonable to ask “How close is the actual mean DDT in the population to 98 ppm?”
• Choice of α. The choice of α-level (i.e. 0.05 significance level) is also arbitrary. The value of α should reflect the costs of Type I errors, i.e. the costs of false positive results. In a murder trial, the cost of sending an innocent person to the electric chair is very large - we require a very large burden of proof, i.e. the p-value must be very small. On the other hand, the cost of an innocent person paying a wrongfully issued parking ticket is not very large; a lesser burden of proof is required, i.e. a higher p-value can be used to conclude that the evidence is strong enough against the hypothesis. A similar analysis should be made for any hypothesis testing case, but rarely is. The tradeoffs between Type I and II errors, power, and sample size are rarely discussed in this context.

• Sharp decision rules. Traditional hypothesis testing says that if the p-value is less than α, you should conclude that there is sufficient evidence against the null hypothesis, and if the p-value is greater than α there is not enough evidence against the null hypothesis. Suppose that α is set at .05. Should different decisions be made if the p-value is 0.0499 or 0.0501? It seems unlikely that extremely minor differences in the p-value should lead to such dramatic differences in conclusions.

• Obvious tests. In many cases, hypothesis testing is used when the evidence is obvious. For example, why would you even bother testing if the true mean is 50 ppm? The data clearly show that it is not.

• Interpreting p-values. P-values are prone to misinterpretation, as they measure the plausibility of the data assuming the null hypothesis is true, not the probability that the hypothesis is true. There is also the confusion between selecting the appropriate p-value for one- and two-sided tests. Refer to the Ministry of Forests’ publication Pamphlet 30 on interpreting the p-value, available at http://www.stat.sfu.ca/~cschwarz/Stat-650/MOF/index.html.

• Effect of sample size. P-values are highly affected by sample size. With sufficiently large sample sizes every effect is statistically significant but may be of no biological interest.

• Practical vs. statistical significance. Just because you find evidence against the null hypothesis (e.g. p-value < .05) does not imply that the effect is very large. For example, if you were to test if a coin were fair and were able to toss it 1,000,000 times, you would find evidence against the null hypothesis of fairness if the observed proportion of heads was 50.1%. But for all intents and purposes, the coin is fair enough for real use. Statistical significance is not the same as practical significance. Other examples of this trap are the numerous studies that show cancerous effects of certain foods. Unfortunately, the estimated increase in risk from these studies is often less than 1/100 of 1%! The remedy for confusing statistical and practical significance is to ask for a confidence interval for the actual parameter of interest. This will often tell you the size of the purported effect.

• Failing to detect a difference vs. no effect. Just because an experiment fails to find evidence against the null hypothesis (e.g. p-value > .05) does not mean that there is no effect! A Type II error - a false negative error - may have been committed. These usually occur when experiments are too small (i.e. inadequate sample size) to detect effects of interest. The remedy for this is to ask for the power of the test to detect the effect of practical interest, or failing that, ask for the confidence interval for the parameter. Typically power will be low, or the confidence interval will be so wide as to be useless.
• Multiple testing. In some experiments, hundreds of statistical tests are performed. However, remember that the p-value represents the chance that these data could have occurred given that the hypothesis is true. So a p-value of 0.01 implies that this event could have occurred in about 1% of cases EVEN IF THE NULL IS TRUE. So finding one or two significant results out of hundreds of tests is not surprising! There are more sophisticated analyses available to control this problem, called ‘multiple comparison techniques’, which are covered in more advanced classes.
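The multiple-testing problem is easy to demonstrate by simulation. The sketch below (in Python, purely for illustration) runs 500 independent t-tests on data generated with NO real effect anywhere (again using the t cut-off of 2.262 for samples of n = 10) and still turns up a couple of dozen “significant” results, close to the expected 5% of 500 = 25 false positives:

```python
import math
import random
import statistics

random.seed(11)  # fixed seed so the simulation is repeatable

n, n_tests, t_crit = 10, 500, 2.262  # 2.262 = t cut-off for 9 df at alpha = .05
significant = 0
for _ in range(n_tests):
    # every dataset comes from the null: the true mean is exactly 98
    sample = [random.gauss(98, 3.52) for _ in range(n)]
    t = (statistics.mean(sample) - 98) / (statistics.stdev(sample) / math.sqrt(n))
    if abs(t) > t_crit:
        significant += 1

print(significant)  # typically around 25, despite there being no real effects
```

Any individual "discovery" in such a batch of tests is therefore unconvincing on its own; this is exactly the situation that multiple comparison techniques are designed to handle.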
On the other hand, a confidence interval for the population parameter gives much more information. The confidence interval shows you how precise the estimate is, and the range of plausible values that are consistent with the data collected. For example, consider the following illustration:
All three results on the left would be “statistically significant”, but your actions would be quite different. On the extreme left, you detected an effect and it is biologically important – you must do something. In the second case, you detected an effect, but can’t yet decide if it is biologically important – more data need to be collected. In the third case, you detected an effect, but it is small and not biologically important. The two right cases are both where the results are not “statistically significant.” In the fourth case you failed to detect an effect, but the experiment was so well planned that you are confident that if an effect were
real, it would be small. There actually is NO difference in your conclusions between the 3rd and 4th cases! The rightmost case is a poor experiment - you failed to detect anything because the experiment was so small and so poorly planned that you really don't know anything! Try not to be in the rightmost case! Modern statistical methodology is placing more and more emphasis upon the use of confidence intervals rather than blind adherence to hypothesis testing.
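The difference between a well-planned and a poorly planned study shows up directly in the width of the confidence interval. A small sketch (made-up measurements, large-sample z multiplier of 1.96) compares the two:

```python
import math
import statistics

def mean_with_ci(sample, z=1.96):
    """Point estimate and approximate 95% CI for a mean."""
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return m, (m - z * se, m + z * se)

# Two hypothetical studies of the same quantity with similar scatter:
well_planned = [5.1, 4.8, 5.3, 5.0, 4.9, 5.2, 5.1, 4.7, 5.0, 5.2,
                4.9, 5.1, 5.0, 4.8, 5.3, 5.1, 4.9, 5.0, 5.2, 4.8]
tiny_study = [5.1, 4.8, 5.3]

m1, ci1 = mean_with_ci(well_planned)
m2, ci2 = mean_with_ci(tiny_study)
print(f"n = 20: mean = {m1:.2f}, 95% CI = ({ci1[0]:.2f}, {ci1[1]:.2f})")
print(f"n = 3:  mean = {m2:.2f}, 95% CI = ({ci2[0]:.2f}, {ci2[1]:.2f})")
```

Both studies give nearly the same point estimate, but only the narrow interval lets you say anything about whether the effect is biologically important.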
2.4.7 Problems with p-values - what does the literature say?
Two influential papers in the Wildlife Society publications have affected how people view the use of p-values.
Statistical tests in publications of the Wildlife Society

Cherry, S. (1998) Statistical tests in publications of the Wildlife Society. Wildlife Society Bulletin, 26, 947-954. http://www.jstor.org/stable/3783574.
The 1995 issue of the Journal of Wildlife Management has > 2400 p-values. I believe that is too many. In this article, I will argue that authors who publish in the Journal and in the Wildlife Society Bulletin are overusing and misunderstanding hypothesis tests. They are conducting too many unnecessary tests, and they are making common mistakes in carrying out and interpreting the results of the tests they conduct. A major cause of the overuse of testing in the Journal and the Bulletin seems to be the mistaken belief that testing is necessary in order for a study to be valid or scientific.

• What are the problems in the analysis of habitat availability?
• What additional information do confidence intervals provide that significance levels do not provide?
• When is the assumption of normality critical in testing if the means of two populations are equal?
• What does Cherry recommend in lieu of hypothesis testing?
The Insignificance of Statistical Significance Testing

Johnson, D. H. (1999) The Insignificance of Statistical Significance Testing.
Journal of Wildlife Management, 63, 763-772. http://dx.doi.org/10.2307/3802789 or online at http://www.npwrc.usgs.gov/resource/methods/statsig/index.htm
Despite their wide use in scientific journals such as The Journal of Wildlife Management, statistical hypothesis tests add very little value to the products of research. Indeed, they frequently confuse the interpretation of data. This paper describes how statistical hypothesis tests are often viewed, and then contrasts that interpretation with the correct one. He discusses the arbitrariness of p-values, conclusions that the null hypothesis is true, power analysis, and distinctions between statistical and biological significance. Statistical hypothesis testing, in which the null hypothesis about the properties of a population is almost always known a priori to be false, is contrasted with scientific hypothesis testing, which examines a credible null hypothesis about phenomena in nature. More meaningful alternatives are briefly outlined, including estimation and confidence intervals for determining the importance of factors, decision theory for guiding actions in the face of uncertainty, and Bayesian approaches to hypothesis testing and other statistical practices.
This is a very nice, readable paper that discusses some of the problems with hypothesis testing. As in the Cherry paper above, Johnson recommends that confidence intervals be used in place of hypothesis testing. So why are confidence intervals not used as often as they should be? Johnson gives several reasons:

• hypothesis testing has become a tradition;
• the advantages of confidence intervals are not recognized;
• there is some ignorance of the procedures available;
• major statistical packages do not include many confidence interval estimates;
• sizes of parameter estimates are often disappointingly small even though they may be very significantly different from zero;
• the wide confidence intervals that often result from a study are embarrassing;
• some hypothesis tests (e.g., chi-square contingency table) have no uniquely defined parameter associated with them; and
• recommendations to use confidence intervals often are accompanied by recommendations to abandon statistical tests altogether, which is unwelcome advice.

These reasons are not valid excuses for avoiding confidence intervals in lieu of hypothesis tests in situations for which parameter estimation is the objective.
Followups

In Robinson, D. H. and Wainer, H. W. (2002). On the past and future of null hypothesis significance testing. Journal of Wildlife Management 66, 263-271. http://dx.doi.org/10.2307/3803158, the authors argue that there is some benefit to p-values in wildlife management, but then Johnson, D. H. (2002). The role of hypothesis testing in wildlife science. Journal of Wildlife Management 66, 272-276. http://dx.doi.org/10.2307/3803159, counters many of these arguments. Both papers are very easy to read and are highly recommended.
2.5 Meta-data
Meta-data are data about the data, e.g. how the data were collected, what the units of measurement are, what the codes used in the dataset represent, etc. It is good practice to store the meta-data as close as possible to the raw data. For example, some computer packages (e.g. JMP) allow the user to store information about each variable and about the data table. Data can be classified broadly by scale of measurement and by role.
2.5.1 Scales of measurement
Data come in various sizes and shapes and it is important to know about these so that the proper analysis can be used on the data. Some computer packages (e.g. JMP) use the scales of measurement to determine appropriate analyses of the data. For example, as you will see later in the course, if the response variable (Y) has an interval scale and the explanatory variable (X) has a nominal scale, then an ANOVA-type analysis comparing means is performed. If both the Y and X variables have a nominal scale, then a χ²-type analysis comparing proportions is performed. There are usually 4 scales of measurement that must be considered:
1. Nominal Data
• The data are simply classifications.
• The data have no ordering.
• The data values are arbitrary labels.
• An example of nominal data is sex, using codes m and f, or codes 0 and 1. Note that just because a numeric code is used for sex, the variable is still nominally scaled. The practice of using numeric codes for nominal data is discouraged (see below).
2. Ordinal Data
• The data can be ordered, but differences between values cannot be quantified.
• Some examples of ordinal data are:
  - Ranking political parties on a left-to-right spectrum using labels 0, 1, or 2.
  - Using a Likert scale to rank your degree of happiness on a scale of 1 to 5.
  - Giving restaurant ratings as terrific, good, or poor.
  - Ranking the size of animals as small, medium, large, or coded as 1, 2, 3. Again, numeric codes for ordinal data are discouraged (see below).

3. Interval Data
• The data can be ordered and have a constant scale, but have no natural zero.
• This implies that differences between data values are meaningful, but ratios are not.
• There are really only two common interval-scaled variables: temperature (°C, °F) and dates. For example, 30°C − 20°C = 20°C − 10°C, but 20°C is not twice as hot as 10°C!

4. Ratio Data
• The data can be ordered, have a constant scale, and have a natural zero.
• Examples of ratio data are height, weight, age, length, etc.

Some packages (e.g. JMP) make no distinction between interval and ratio data, calling them both 'continuous' scaled. However, this is, technically, not quite correct.

Only certain operations can be performed on certain scales of measurement. The following list summarizes which operations are legitimate for each scale. Note that you can always apply operations from a 'lesser scale' to any particular data, e.g. you may apply nominal, ordinal, or interval operations to an interval-scaled datum.

• Nominal Scale. You are only allowed to examine if a nominal-scale datum is equal to some particular value, or to count the number of occurrences of each value. For example, gender is a nominal-scale variable: you can examine if the gender of a person is F or count the number of males in a sample. Taking the average of nominally scaled data is not sensible (e.g. the average sex is not sensible).
In order to avoid problems with computer packages trying to take averages of nominal data, it is recommended that alphanumeric codes be used for nominally scaled data, e.g. use M and F for sex rather than 0 or 1. Most packages can accept alphanumeric data without problems.
• Ordinal Scale. You are also allowed to examine if an ordinal-scale datum is less than or greater than another value. Hence, you can 'rank' ordinal data, but you cannot 'quantify' differences between two ordinal values. For example, political party is an ordinal datum with the NDP to the left of the Conservative Party, but you can't quantify the difference. Another example is preference scores, e.g. ratings of eating establishments where 10 = good and 1 = poor; the difference between an establishment with a 10 rating and one with an 8 rating can't be quantified. Technically speaking, averages are not really allowed for ordinal data, e.g. taking the average of small, medium, and large as data values doesn't make sense. Again, alphanumeric codes are recommended for ordinal data. Some care should be taken with ordinal data and alphanumeric codes, as many packages sort values alphabetically, so the ordering of large, medium, small may not correspond to the ordering desired. JMP allows the user to specify the ordering of values in the Column Information of each variable. A simple trick to get around this problem is to use alphanumeric codes such as 1.small, 2.medium, 3.large as the data values, as an alphabetic sort then keeps the values in the proper order.

• Interval Scale. You are also allowed to quantify the difference between two interval-scale values, but there is no natural zero. For example, temperature scales are interval data, with 25°C warmer than 20°C, and a 5°C difference has some physical meaning. Note that 0°C is arbitrary, so it does not make sense to say that 20°C is twice as hot as 10°C. Values for interval-scaled variables are recorded using numbers, so averages can be taken.

• Ratio Scale. You are also allowed to take ratios among ratio-scaled variables. Physical measurements of height, weight, and length are typically ratio variables. It is now meaningful to say that 10 m is twice as long as 5 m.
This ratio holds true regardless of the scale in which the object is measured (e.g. meters or yards). This is because there is a natural zero. Values for ratio-scaled variables are recorded as numbers, so averages can be taken.
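Two of the coding recommendations above - avoid numeric codes for nominal data, and use prefixed alphanumeric codes for ordinal data - can be demonstrated in a few lines (the example data are hypothetical):

```python
# Nominal data: only equality checks and counts are meaningful.
sex = ["M", "F", "F", "M", "F"]
counts = {value: sex.count(value) for value in sorted(set(sex))}
print(counts)  # {'F': 3, 'M': 2}

# With numeric codes, software will happily compute a meaningless 'average sex'.
codes = [0, 1, 1, 0, 1]
print(sum(codes) / len(codes))  # 0.6 -- not a sex

# Ordinal data: a plain alphabetic sort scrambles the intended ordering ...
print(sorted(["small", "medium", "large"]))        # ['large', 'medium', 'small']
# ... but the prefixed-code trick keeps it intact.
print(sorted(["3.large", "1.small", "2.medium"]))  # ['1.small', '2.medium', '3.large']
```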
2.5.2 Types of Data
Data can also be classified by type. This is less important than the scale of measurement, as it usually does not imply a certain type of analysis, but it can have subtle effects.

Discrete data: Only certain specific values are valid; points between these values are not valid. For example, counts of people (only integer values allowed), or the grade assigned in a course (F, D, C-, C, C+, . . .).

Continuous data: All values in a certain range are valid. For example, height, weight, length, etc. Note that some packages label interval or ratio data as continuous. This is not always the case.

Continuous but discretized: Continuous data cannot be measured to infinite precision; they must be discretized, and consequently are technically discrete. For example, a person's height may be measured to the nearest cm. This can cause problems if the level of discretization is too coarse; for example, what would happen if a person's height were measured to the nearest meter? As a rule of thumb, if the discretization is less than 5% of the typical value, then a discretized continuous variable can be treated as continuous without problems.
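The rule of thumb about discretization can be checked by simulation. This sketch (simulated heights with an arbitrary seed) rounds the same data to the nearest cm and to the nearest meter:

```python
import random
import statistics

random.seed(3)
# simulated heights in cm, roughly Normal(175, 7)
heights = [random.gauss(175, 7) for _ in range(2000)]

to_cm = [round(h) for h in heights]             # 1 cm << 175 cm: harmless
to_m = [round(h / 100) * 100 for h in heights]  # 100 cm steps: far too coarse

print(statistics.mean(heights), statistics.stdev(heights))
print(statistics.mean(to_cm), statistics.stdev(to_cm))  # nearly unchanged
print(statistics.mean(to_m), statistics.stdev(to_m))    # information destroyed
```

Rounding to the cm (well under 5% of the typical value) barely changes the mean or standard deviation, while rounding to the meter collapses almost everyone to the same value.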
2.5.3 Roles of data
Some computer packages (e.g. JMP) also make distinctions about the role of a variable.

Label: A variable whose value serves as an identification of each observation - usually for plotting.

Frequency: A variable whose value indicates how many occurrences of this observation occur. For example, rather than having 100 lines in a data set to represent 100 females, you could have one line with a count of 100 in the Frequency variable.

Weight: This is rarely used. It indicates the weight that this observation is to have in the analysis. Usually used in advanced analyses.

X: Identifies a variable as a 'predictor' variable. [Note that the use of the term 'independent' variable is somewhat old fashioned and is falling out of favour.] This will be more useful when actual data analysis is started.

Y: Identifies a variable as a 'response' variable. [Note that the use of the term 'dependent' variable is somewhat old fashioned and is falling out of favour.] This will be more useful when actual data analysis is started.
2.6 Bias, Precision, Accuracy
The concepts of Bias, Precision and Accuracy are often used interchangeably in non-technical writing and speaking. However, these have very specific statistical meanings and it is important that they be carefully differentiated.

The first important point about these terms is that they CANNOT be applied to a single estimate from a single set of data. Rather, they are measurements of the performance of an estimator over repeated samples from the same population. Recall that a fundamental idea of statistics is that repeated samples from the same population will give different estimates, i.e. estimates will vary as different samples are selected.[9]

[9] The standard error of an estimator measures this variation over repeated samples from the same population.

Bias is the difference between the average value of the estimator over repeated sampling from the population and the true parameter value. If the estimates from repeated sampling vary above and below the true population parameter value so that the average over all possible samples equals the true parameter value, we say that the estimator is unbiased.

There are two types of bias - systemic and statistical. Systemic bias is caused by problems in the apparatus or the measuring device. For example, if a scale systematically gave readings that were 10 g too small, this would be a systemic bias. Or if snorkelers in a stream survey consistently saw only 50% of the available fish, this would also be an example of systemic bias. Statistical bias is related to the choice of
sampling design and estimator. For example, the usual sample statistics in a simple random sample give unbiased estimates of means, totals, and variances, but not of standard deviations. The ratio estimator of survey sampling (refer to later chapters) is also biased.

It is not possible to detect systemic biases using the data at hand. The researcher must examine the experimental apparatus and design very carefully. For example, if repeated surveys were made by snorkeling over sections of streams, estimates may be very reproducible (i.e. very precise) but could be consistently WRONG, e.g. if divers only see about 60% of the fish (i.e. biased). Systemic bias is controlled by careful testing of the experimental apparatus, etc. In some cases, it is possible to calibrate the method using "known" populations, e.g. mixing a solution of a known concentration and then having your apparatus estimate the concentration.

Statistical biases can be derived from statistical theory. For example, statistical theory can tell you that the sample mean of a simple random sample is unbiased for the population mean; that the sample VARIANCE is unbiased for the population variance; but that the sample standard deviation is a biased estimator for the population standard deviation. [Even though the sample variance is unbiased, the sample standard deviation is a NON-LINEAR function of the variance (i.e. a square root) and non-linear functions don't preserve unbiasedness.] The ratio estimator is also biased for the population ratio. In many cases, the statistical bias can be shown to essentially disappear with reasonably large sample sizes.

Precision of an estimator refers to how variable the repeated estimates will be over repeated sampling from the same population. Again, recall that every different sample from the same population will lead to a different estimate.
If these estimates have very little variation over repeated samples, we say that the estimator is precise. The standard error (SE) of the estimator measures the variation of the estimator over repeated sampling from the same population.

The precision of an estimator is controlled by the sample size. In general, a larger sample size leads to more precise estimates than a smaller sample size. The precision of an estimator is also determined by statistical theory. For example, the precision (standard error) of a sample mean selected using a simple random sample from a large population can be shown mathematically to equal pop std dev/√n. A common error is to use this latter formula for all estimators that look like a mean; however, the formula for the standard error of any estimator depends upon the way the data are collected (i.e. a simple random sample, a cluster sample, a stratified sample, etc.), the estimator of interest (e.g. different formulae are used for standard errors of means, proportions, totals, slopes, etc.) and, in some cases, the distribution of the population values (e.g. do elements of the population come from a normal distribution, or a Weibull distribution, etc.).

Finally, accuracy is a combination of precision and bias. It measures the "average distance" of the estimator from the population parameter. Technically, one measure of the accuracy of an estimator is the Root Mean Square Error (RMSE), computed as √((Bias)² + (SE)²). A precise, unbiased estimator will be accurate, but not all accurate estimators will be unbiased.

The relationship between bias, precision, and accuracy can be viewed graphically as shown below. Let * represent the true population parameter value (say the population mean), and periods (.) represent values of
the estimator (say the sample mean) over repeated samples from the same population.
Precise, Unbiased, Accurate Estimator Pop mean * ---------------------------------.. . .. Sample means
Imprecise, Unbiased, less accurate estimator Pop mean * ---------------------------------... .. . .. ... Sample means
Precise, Biased, but accurate estimator Pop mean * ---------------------------------... Sample means
Imprecise, Biased, less accurate estimator Pop mean * ---------------------------------.. ... ... Sample means
Precise, Biased, less accurate estimator Pop mean * ---------------------------------... Sample means
Statistical theory can tell you whether an estimator is statistically unbiased, and can give its precision and its accuracy, provided a probabilistic sample is taken. If data are collected haphazardly, the properties of an estimator cannot be determined. Systemic biases caused by poor instruments cannot be detected statistically.
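The definitions of bias, SE, and RMSE above can be made concrete with a small simulation (made-up population values and an arbitrary seed; the "scale reading 10 units low" mirrors the systemic-bias example earlier):

```python
import math
import random
import statistics

random.seed(7)

def performance(estimates, true_value):
    """Bias, SE (sd over repeated samples), and RMSE = sqrt(bias^2 + SE^2)."""
    bias = statistics.mean(estimates) - true_value
    se = statistics.stdev(estimates)
    rmse = math.sqrt(bias ** 2 + se ** 2)
    return bias, se, rmse

true_mean = 50.0
# Unbiased estimator: sample mean of a simple random sample of n = 25.
sample_means = [statistics.mean([random.gauss(true_mean, 10) for _ in range(25)])
                for _ in range(2000)]
# Systemically biased estimator: a 'scale' that always reads 10 units low.
biased_means = [m - 10 for m in sample_means]

print(performance(sample_means, true_mean))  # bias ~ 0, SE ~ 10/sqrt(25) = 2
print(performance(biased_means, true_mean))  # bias ~ -10, same SE, larger RMSE
```

Note that the SE of both estimators is identical (they are equally precise); only the bias, and hence the RMSE, differs.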
2.7 Types of missing data
Missing data happen frequently. There are three types of missing data, and an important step in any analysis is to think about the mechanisms that could have caused the data to be missing.

First, data can be Missing Completely at Random (MCAR). In this case, the missingness is related neither to the response variable nor to any other variable in the study. For example, in field trials, a hailstorm destroys a test plot. It is unlikely that the hailstorm location is related to the response variable of the experiment or any other variable of interest. In medical trials, a patient may leave the study because they won the lottery. It is unlikely that this is related to anything of interest in the study. If data are MCAR, most analyses proceed unchanged. The design may be unbalanced and the estimates will have poorer precision than if all data were present, but no biases are introduced into the estimates.

Second, data can be Missing at Random (MAR). In this case, the missingness is unrelated to the response variable, but may be related to other variables in the study. For example, suppose that in a drug study involving males and females, some females must leave the study because they became pregnant. Again, as long as the missingness is not related to the response variable, the design is unbalanced and the estimates have poorer precision, but no biases are introduced into the estimates.

Third, and the most troublesome case, is Informative Missing. Here the missingness is related to the response. For example, a trial was conducted to investigate the effectiveness of fertilizer on the regrowth of trees after clear cutting. The added fertilizer increased growth, which attracted deer, which ate all the regrowth![10]

The analyst must also carefully distinguish between values of 0 and missing values. They are NOT THE SAME! Here is a little example to illustrate the perils of missing data related to 0-counts.
The Department of Fisheries and Oceans has a program called ShoreKeepers which allows community groups to collect data on the ecology of ocean shores in a scientific fashion that could be used in later years as part of an environmental assessment study. As part of the protocol, volunteers randomly place 1 m² quadrats on the shore and count the number of species of various organisms. Suppose the following data were recorded for three quadrats:

[10] There is an urban legend about an interview with an opponent of compulsory seat belt legislation who compared the lengths of stays in hospital of auto accident victims who were or were not wearing seat belts. People who wore seat belts spent longer, on average, in hospital following the accident than people not wearing seat belts. The opponent felt that this was evidence for not making seat belts compulsory!
Quadrat   Species   Count
Q1        A           5
          C          10
Q2        B           5
          C           5
Q3        A           5
          B          10
Now, based on the above data, what is the average density of species A? At first glance, it would appear to be (5 + 5)/2 = 5 per quadrat. However, no data were recorded for species A in Q2. Does this mean that the density of species A was not recorded because people didn't look for species A, or that it was not recorded because the density was 0? In the first instance, the value of A is Missing at Random for Q2 and the correct estimated density of species A is indeed 5. In the second case, the missingness is informative, and the correct estimated density is (5 + 0 + 5)/3 = 3.33 per quadrat.

The above example may seem simplistic, but many database programs are set up in this fashion to "save storage space" by NOT recording zero counts. Unfortunately, one then cannot distinguish between a missing value implying that the count was zero and a missing value indicating that the data were not collected. Even worse, many database queries could erroneously treat the missing data as missing at random rather than as zeros, giving wrong answers for averages!

For example, the Breeding Bird Survey is an annual survey of birds that follows specific routes and records the number of each type of species encountered. According to the documentation about this survey,[11] only the non-zero counts are stored in the database, and some additional information, such as the number of actual routes run, is required to impute the missing zeroes: "Since only non-zero counts are included in the database, the complete list of years a route is run allows the times in which the species wasn't seen to be identified and included in the data analysis." If this extra step were not done, you would face exactly the problem described above for quadrat sampling.

The moral of the story is that 0 is a valid value and should be recorded as such! Computer storage costs are declining so quickly that the "savings" from not recording 0's soon vanish when people can't or don't remember to adjust for the unrecorded 0 values.
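The two interpretations of the absent record for species A in Q2 can be computed directly from the quadrat data above (encoded here as a small dictionary):

```python
quadrats = {
    "Q1": {"A": 5, "C": 10},
    "Q2": {"B": 5, "C": 5},
    "Q3": {"A": 5, "B": 10},
}

# Interpretation 1: species A was simply not looked for in Q2 (missing at random),
# so average only over the quadrats where A was recorded.
seen = [q["A"] for q in quadrats.values() if "A" in q]
mar_density = sum(seen) / len(seen)
print(mar_density)  # 5.0

# Interpretation 2: the absent record is really a zero count (informative missing),
# so treat the gap as 0.
all_counts = [q.get("A", 0) for q in quadrats.values()]
zero_density = sum(all_counts) / len(all_counts)
print(round(zero_density, 2))  # 3.33
```

The two answers differ substantially, and nothing in the stored data tells you which one is correct - that is exactly why zeros should be recorded explicitly.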
If your experiment or survey has informative missing values, you could have a serious problem in the analysis and expert help should be consulted.

[11] http://www.cws-scf.ec.gc.ca/nwrc-cnrf/migb/stat_met_e.cfm
2.8 Transformations

2.8.1 Introduction
Many of the procedures in this course have an underlying assumption that the data from each group are normally distributed with a common variance. In some cases this is patently false, e.g. the data are highly skewed with variances that change, often with the mean. The most common method to fix this problem is a transformation of the data, and the most common transformation in ecology is the logarithmic transform, i.e. analyze log(Y) rather than Y. Other transformations are possible; these will not be discussed in this course, but the material below applies equally well to them. If you are unsure of the proper transformation, there are a number of methods that can assist, including the Box-Cox transform and an application of Taylor's Power Law. These are beyond the scope of this course.

The logarithmic transform is often used when the data are positive and exhibit a pronounced long right tail. For example, the following are plots of (made-up) data before and after a logarithmic transformation:
There are several things to note in the two graphs.

• The distribution of Y is skewed with a long right tail, but the distribution of log(Y) is symmetric.

• The mean is to the right of the median in the original data, but the mean and median are the same in the transformed data.

• The standard deviation of Y is large relative to the mean (cv = std dev/mean = 131/421 = 31%), whereas the standard deviation is small relative to the mean in the transformed data (cv = std dev/mean = 0.3/6.0 = 5%).
• The box-plots show a large number of potential "outliers" in the original data, but only a few in the transformed data. It can be shown that in the case of a log-normal distribution, about 5% of observations are more than 3 standard deviations from the mean, compared to less than 1/2 of 1% of such observations for a normal distribution.

The form of the Y data above occurs quite often in ecology and is often called a log-normal distribution, given that a logarithmic transformation seems to "normalize" the data.
2.8.2 Conditions under which a log-normal distribution appears
Under what conditions would you expect to see a log-normal distribution? Normal distributions often occur when the observed variable is the "sum" of underlying processes. For example, heights of adults (within a sex) are fit very closely by a normal distribution. The height of a person is determined by the "sum" of the heights of the shin, thigh, trunk, neck, head and other portions of the body. A famous theorem of statistics (the Central Limit Theorem) says that data formed as the "sum" of other data will tend to have a normal distribution.

In some cases, the underlying processes act multiplicatively. For example, the distribution of household income is often log-normal. You can imagine that factors such as level of education, motivation, and parental support act to "multiply" income rather than simply adding a fixed amount of money. Similarly, data on animal abundance often have a log-normal distribution because factors such as survival act multiplicatively on the populations.
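The multiplicative mechanism can be simulated directly. This sketch (arbitrary seed, made-up "boost" factors) builds each observation as a product of many small random multipliers and shows that the raw data are skewed (mean > median) while their logarithms are roughly symmetric (mean ≈ median):

```python
import math
import random

random.seed(0)

def multiplicative_value(n_factors=20):
    """One observation built as a product of small random multiplicative 'boosts'."""
    value = 1.0
    for _ in range(n_factors):
        value *= random.uniform(0.9, 1.2)
    return value

data = [multiplicative_value() for _ in range(5000)]
logs = [math.log(v) for v in data]

mean = sum(data) / len(data)
median = sorted(data)[len(data) // 2]
log_mean = sum(logs) / len(logs)
log_median = sorted(logs)[len(logs) // 2]

print(f"raw data: mean = {mean:.2f}, median = {median:.2f}  (mean > median: right-skewed)")
print(f"log data: mean = {log_mean:.3f}, median = {log_median:.3f}  (nearly equal: symmetric)")
```

Taking logs turns the product of factors into a sum, so the Central Limit Theorem applies on the log scale - which is exactly why multiplicative processes generate log-normal data.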
2.8.3 ln() vs. log()
There is often much confusion about the form of the logarithmic transformation. For example, many calculators and statistical packages differentiate between the common logarithm (base 10, or log) and the natural logarithm (base e, or ln). Even worse, many packages actually use log to refer to natural logarithms and log10 to refer to common logarithms. IT DOESN'T MATTER which transformation is used, as long as the proper back-transformation is applied. When you compare the actual values after these transformations, you will see that ln(Y) = 2.3 log10(Y), i.e. the log-transformed values differ by a fixed multiplicative constant. When the anti-logs are applied, this constant will "disappear".
In accordance with common convention in statistics and mathematics, the use of log(Y) will refer to the natural, or ln(Y), transformation.
2.8.4 Mean vs. Geometric Mean
The simple mean of Y is called the arithmetic mean (or simply the mean) and is computed in the usual fashion. The anti-log of the mean of the log(Y) values is called the geometric mean. The geometric mean of a set of data is ALWAYS less than (or at most equal to) the mean of the original data. In the special case of log-normal data, the geometric mean will be close to the MEDIAN of the original data. For example, look at the data above. The mean of Y is 421. The mean of log(Y) is 5.999, and exp(5.999) = 403, which is close to the median of the original data.

This implies that when reporting results, you will need to be a little careful about how the back-transformed values are interpreted. It is possible to go from the mean on the log-scale to the mean on the anti-log scale and vice-versa. For log-normal data,[12] it turns out that

    Ȳ_antilog ≈ exp(Ȳ_log + s²_log/2)   and   Ȳ_log ≈ log(Ȳ_antilog) − s²_antilog/(2 Ȳ²_antilog)

In this case:

    Ȳ_antilog ≈ exp(5.999 + (.3)²/2) = 422   and   Ȳ_log ≈ log(421.81) − 131.2²/(2 × 421.8²) = 5.996

Unfortunately, the formulae for the standard deviations are not as straightforward. There are somewhat complicated formulae available in many reference books, but close approximations are:

    s_antilog ≈ s_log × exp(Ȳ_log)   and   s_log ≈ s_antilog/Ȳ_antilog

For the data above we see that:

    s_antilog ≈ .3 × exp(5.999) = 121   and   s_log ≈ 131.21/421.81 = 0.311

which are close, but not exactly on the money.

[12] Other transformations will have different formulae.
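These relationships can be checked by simulating log-normal data with roughly the same log-scale mean (6.0) and standard deviation (0.3) as the made-up data in the text (arbitrary seed; results will wobble slightly from run to run):

```python
import math
import random
import statistics

random.seed(11)
# Log-normal data roughly matching the text: log-scale mean 6.0 and sd 0.3.
y = [random.lognormvariate(6.0, 0.3) for _ in range(20_000)]
logs = [math.log(v) for v in y]

arith_mean = statistics.mean(y)
geo_mean = math.exp(statistics.mean(logs))
approx_mean = math.exp(statistics.mean(logs) + statistics.stdev(logs) ** 2 / 2)

print(f"arithmetic mean ~ {arith_mean:.0f}")   # near exp(6 + 0.3^2/2) = 422
print(f"geometric mean  ~ {geo_mean:.0f}")     # near exp(6) = 403, close to the median
print(f"approximation   ~ {approx_mean:.0f}")  # exp(mean_log + s_log^2/2) tracks the mean
```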
2.8.5 Back-transforming estimates, standard errors, and ci
Once inference is made on the transformed scale, it is often nice to back-transform and report results on the original scale. For example, a study of turbidity (measured in NTU) on a stream in BC gave the following results on the log-scale:

Statistic              value
Mean on log scale       5.86
Std Dev on log scale    0.96
SE on log scale         0.27
upper 95% ci Mean       6.4
lower 95% ci Mean       5.3
How should these be reported on the original NTU scale?
Mean on log-scale back to MEDIAN on anti-log scale

The simplest back-transform goes from the mean on the log-scale to the MEDIAN on the anti-log scale. The distribution is often symmetric on the log-scale, so the mean, median, and mode on the log-scale all coincide. However, when you take anti-logs, the upper tail grows much faster than the lower tail, and the anti-log transform re-introduces skewness into the back-transformed data. Hence, the center point on the log-scale gets back-transformed to the median on the anti-log scale. The estimated MEDIAN (or GEOMETRIC MEAN) on the original scale is found by the back-transform of the mean on the log-scale, i.e.

    median_antilog = exp(mean_log-scale)

    estimated median = exp(5.86) = 350 NTU.

The 95% confidence interval for the MEDIAN is found by back-transforming the 95% confidence interval for the mean on the log-scale, i.e. from exp(5.3) = 196 to exp(6.4) = 632 NTUs. Note that the confidence interval on the back-transformed scale is no longer symmetric about the estimate.
There is no direct back-transformation of the standard error from the log-scale to the original scale, but an approximate standard error on the back-transformed scale is found as

se_antilog = se_log × exp(5.86) = 95 NTUs.

If the MEAN on the anti-log scale is needed, recall from the previous section that

Mean_antilog ≈ exp(Mean_log + std dev_log^2 / 2)
            = exp(Mean_log) × exp(std dev_log^2 / 2)
            = Median × exp(std dev_log^2 / 2)

Mean_antilog ≈ Median × exp(0.96^2 / 2) = Median × 1.58.

Hence multiply the median, the standard error of the median, and the limits of the 95% confidence interval all by 1.58.
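Putting the standard-error and mean back-transforms together for the same example (a sketch, not from the original text; the log-normal correction factor exp(sd²/2) is the one derived above):

```python
import math

mean_log, sd_log, se_log = 5.86, 0.96, 0.27   # turbidity summary on the log scale

median = math.exp(mean_log)              # back-transformed MEDIAN, about 350 NTU
se_ntu = se_log * median                 # approximate SE on the NTU scale, about 95
correction = math.exp(sd_log ** 2 / 2)   # log-normal correction factor, about 1.58
mean_ntu = median * correction           # back-transformed MEAN, about 556 NTU
```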
2.8.6
Back-transforms of differences on the log-scale
Some care must be taken when back-transforming differences on the log-scale. The general rule of thumb is that a difference on the log-scale corresponds to a log(ratio) on the original scale. Hence a back-transform of a difference on the log-scale corresponds to a ratio on the original scale.13 For example, here are the results from a study to compare turbidity before and after remediation was completed on a stream in BC:

    Statistic        value on log-scale
    Difference              −0.8303
    Std Err Dif              0.3695
    Upper CL Dif            −0.0676
    Lower CL Dif            −1.5929
    p-value                  0.0341
A difference of −.83 units on the log-scale corresponds to a ratio of exp(−.83) = .44 in the NTU on the original scale. In other words, the median NTU after remediation was .44 times that of the median NTU before remediation. Or, the median NTU before remediation was exp(.83) = 2.29 times that of the median NTU after remediation. Note that 2.29 = 1/0.44.

13 Recall that log(Y/Z) = log(Y) − log(Z).
The 95% confidence intervals are back-transformed in a similar fashion. In this case the 95% confidence interval on the RATIO of median NTUs lies between exp(−1.59) = .20 and exp(−.068) = .93, i.e. the median NTU after remediation was between .20 and .93 of the median NTU before remediation. If necessary you could also back-transform the standard error to get a standard error for the ratio on the original scale, but this is rarely done.
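The whole difference-to-ratio back-transform can be sketched directly from the table above (natural logs assumed, as in the example):

```python
import math

# Difference (after - before) and its 95% CI on the log scale, from the table above
diff_log = -0.8303
ci_log = (-1.5929, -0.0676)

ratio = math.exp(diff_log)                      # ~0.44: after/before ratio of medians
ci_ratio = tuple(math.exp(v) for v in ci_log)   # ~ (0.20, 0.93)
inverse_ratio = 1 / ratio                       # ~2.29: before/after ratio
```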
2.8.7
Some additional readings on the log-transform
Here are some additional readings on the use of the log-transform, taken from the WWW. The source URL is presented with each article.
Stats: Log transformation (printed 2/18/05)
Dear Professor Mean, I have some data that I need help analyzing. One suggestion is that I use a log transformation. Why would I want to do this? -- Stumped Susan

Dear Stumped,

Think of it as employment security for us statisticians.

Short answer: If you want to use a log transformation, you compute the logarithm of each data value and then analyze the resulting data. You may wish to transform the results back to the original scale of measurement. The logarithm function tends to squeeze together the larger values in your data set and stretch out the smaller values. This squeezing and stretching can correct one or more of the following problems with your data:

1. Skewed data
2. Outliers
3. Unequal variation

Not all data sets will suffer from these problems. Even if they do, the log transformation is not guaranteed to solve them. Nevertheless, the log transformation works surprisingly well in many situations. Furthermore, a log transformation can sometimes simplify your statistical models. Some statistical models are multiplicative: factors influence your outcome measure through multiplication rather than addition. These multiplicative models are easier to work with after a log transformation.

If you are unsure whether to use a log transformation, here are a few things you should look for:

1. Is your data bounded below by zero?
2. Is your data defined as a ratio?
3. Is the largest value in your data more than three times larger than the smallest value?

http://www.cmh.edu/stats/model/linear/log.asp
Squeezing and stretching The logarithm function squeezes together big data values (anything larger than 1). The bigger the data value, the more the squeezing. The graph below shows this effect.
The first two values are 2.0 and 2.2. Their logarithms, 0.69 and 0.79, are much closer. The second two values, 2.6 and 2.8, are squeezed even more; their logarithms are 0.96 and 1.03. The logarithm also stretches small values apart (values less than 1). The smaller the values, the more the stretching. This is illustrated below.
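The missing graphs aside, the squeezing and stretching can be checked numerically. This small sketch uses natural logarithms, which match the values quoted on the page:

```python
import math

# Large values get squeezed together by the logarithm
for a, b in [(2.0, 2.2), (2.6, 2.8)]:
    gap, log_gap = b - a, math.log(b) - math.log(a)
    print(a, b, round(gap, 2), round(log_gap, 2))   # the log gap is smaller

# Small values (below 1) get stretched apart
for a, b in [(0.40, 0.45), (0.20, 0.25)]:
    gap, log_gap = b - a, math.log(b) - math.log(a)
    print(a, b, round(gap, 2), round(log_gap, 2))   # the log gap is larger
```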
The values of 0.4 and 0.45 have logarithms (−0.92 and −0.80) that are further apart. The values of 0.20 and 0.25 are stretched even further; their logarithms are −1.61 and −1.39, respectively.

Skewness

If your data are skewed to the right, a log transformation can sometimes produce a data set that is closer to symmetric. Recall that in a skewed-right distribution, the left tail (the smaller values) is tightly packed together and the right tail (the larger values) is widely spread apart.
The logarithm will squeeze the right tail of the distribution and stretch the left tail, which produces a greater degree of symmetry. If the data are symmetric or skewed to the left, a log transformation could actually make things worse. Also, a log transformation is unlikely to be effective if the data have a narrow range (if the largest value is not more than three times bigger than the smallest value).

Outliers

If your data has outliers on the high end, a log transformation can sometimes help. The squeezing of large values might pull that outlier back in closer to the rest of the data. If your data has outliers on the low end, the log transformation might actually make the outlier worse, since it stretches small values.

Unequal variation

Many statistical procedures require that all of your subject groups have comparable variation. If your data has unequal variation, then some of your tests and confidence intervals may be invalid. A log transformation can help with certain types of unequal variation. A common pattern of unequal variation is when the groups with the large means also tend to have large standard deviations. Consider housing prices in several different neighborhoods. In one part of town, houses might be cheap, and sell for 60 to 80 thousand dollars. In a different neighborhood, houses might sell for 120 to 180 thousand dollars. And in the snooty part of town, houses might sell for 400 to 600 thousand dollars. Notice that as the neighborhoods get more expensive, the range of prices gets wider. This is an example of data where groups with large means tend to have large standard deviations.
With this pattern of variation, the log transformation can equalize the variation. The log transformation will squeeze the groups with the larger standard deviations more than it will squeeze the groups with the smaller standard deviations. The log transformation is especially effective when the size of a group's standard deviation is directly proportional to the size of its mean.

Multiplicative models

There are two common statistical models, additive and multiplicative. An additive model assumes that factors that change your outcome measure change it by addition or subtraction. An example of an additive model would be when we increase the number of mail order catalogs sent out by 1,000, and that adds an extra $5,000 in sales. A multiplicative model assumes that factors that change your outcome measure change it by multiplication or division. An example of a multiplicative model would be when an inch of rain takes half of the pollen out of the air.

In an additive model, the changes that we see are the same size, regardless of whether we are on the high end or the low end of the scale. Extra catalogs add the same amount to our sales regardless of whether our sales are big or small. In a multiplicative model, the changes we see are bigger at the high end of the scale than at the low end. An inch of rain takes a lot of pollen out on a high-pollen day but proportionately less pollen out on a low-pollen day.

If you remember your high school algebra, you'll recall that the logarithm of a product is equal to the sum of the logarithms.
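A tiny illustration of the multiplicative idea (the pollen numbers are invented): on the raw scale, "rain halves the pollen" is a change whose size depends on the starting level, but on the log scale it is the same additive shift, log 2, everywhere:

```python
import math

halve = lambda pollen: pollen / 2   # "an inch of rain takes half the pollen out"

for pollen in (1000.0, 10.0):       # a high-pollen day and a low-pollen day
    raw_change = pollen - halve(pollen)
    log_change = math.log(pollen) - math.log(halve(pollen))
    print(raw_change, round(log_change, 4))   # raw changes differ; log changes don't
```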
Therefore, a logarithm converts multiplication/division into addition/subtraction. Another way to think about this: in a multiplicative model, large values imply large changes and small values imply small changes. The stretching and squeezing of the logarithm levels out the changes.

When should you consider a log transformation?

There are several situations where a log transformation should be given special consideration.

Is your data bounded below by zero? When your data are bounded below by zero, you often have problems with skewness. The bound of zero prevents outliers on the low end, and constrains the left tail of the distribution to be tightly packed. Also, groups with means close to zero are more constrained (hence less variable) than groups with means far away from zero. It does matter how close you are to zero. If your mean is within a standard deviation or two of zero, then expect some skewness. After all, the bell-shaped curve, which spreads out about three standard deviations on either side, would crash into zero and cause a traffic jam in the left tail.

Is your data defined as a ratio? Ratios tend to be skewed by their very nature. They also tend to have models that are multiplicative.

Is the largest value in your data more than three times larger than the smallest value? The
relative stretching and squeezing of the logarithm only has an impact if your data has a wide range. If the maximum of your data is not at least three times as big as your minimum, then the logarithm can't squeeze and stretch your data enough to have any useful impact.

Example

The DM/DX ratio is a measure of how rapidly the body metabolizes certain types of medication. A patient is given a dose of dextromethorphan (DM), a common cough medication. The patient's urine is collected for four hours, and the concentrations of DM and DX (a metabolite of dextromethorphan) are measured. The ratio of DM concentration to DX is a measure of how well the CYP 2D6 metabolic pathway functions. A ratio less than 0.3 indicates normal metabolism; larger ratios indicate slow metabolism. Genetics can influence CYP 2D6 metabolism. In this set of 206 patients, we have 15 with no functional alleles and 191 with one or more functional alleles. The DM/DX ratio is a good candidate for a log transformation since it is bounded below by zero. It is also obviously a ratio. The standard deviation for this data (0.4) is much larger than the mean (0.1).
Finally, the largest value is several orders of magnitude bigger than the smallest value.
Skewness

The boxplots below show the original (untransformed) data for the 15 patients with no functional alleles. The graph also shows the log-transformed data. Notice that the untransformed data show quite a bit of skewness. The lower whisker and the lower half of the box are packed tightly, while the upper whisker and the upper half of the box are spread widely. The log-transformed data, while not perfectly symmetric, do tend to have a better balance between the lower half and the upper half of the distribution.
Outliers

The graph below shows the untransformed and log-transformed data for the subset of patients with exactly two functional alleles (n=119). The original data has two outliers which are almost 7 standard deviations above the mean. The log-transformed data are not perfect, and perhaps there is now an outlier on the low end. Nevertheless, the worst outlier is still within 4 standard deviations of the mean. The influence of outliers is much less extreme with the log-transformed data.
Unequal variation

When we compute standard deviations for the patients with no functional alleles and the patients with one or more functional alleles, we see that the former group has a much larger standard deviation. This is not too surprising. The patients with no functional alleles are further from the lower bound and thus have much more room to vary.
After a log transformation, the standard deviations are much closer.
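This equalizing effect is easy to demonstrate with invented data (the numbers below are illustrative only; each group is an exact multiple of the first, so the standard deviations of the logs come out identical):

```python
import math
import statistics

cheap  = [60, 65, 70, 75, 80]       # hypothetical prices, in $1000s
mid    = [2 * x for x in cheap]     # same shape, twice the level
snooty = [8 * x for x in cheap]     # same shape, eight times the level

for group in (cheap, mid, snooty):
    sd_raw = statistics.stdev(group)
    sd_log = statistics.stdev(math.log(x) for x in group)
    print(round(sd_raw, 1), round(sd_log, 3))   # raw SDs grow; log SDs stay equal
```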
Summary

Stumped Susan wants to understand why she should use a log transformation for her data. Professor Mean explains that a log transformation is often useful for correcting problems with skewed data, outliers, and unequal variation. This works because the log function squeezes the large values of your data together and stretches the small values apart. The log transformation is also useful when you believe that factors have a multiplicative effect. You should consider a log transformation when your data are bounded below by zero, when your data are defined as a ratio, and/or when the largest value in your data is at least three times as big as the smallest value.

Related pages in Stats: Stats: Geometric mean

Further reading: Keene ON. The log transformation is special. Stat Med 1995: 14(8); 811-9.

This page was last modified on 01/10/2005.
Confidence Intervals Involving Logarithmically Transformed Data
2/18/05 10:52
Confidence Intervals Involving Data to Which a Logarithmic Transformation Has Been Applied

These data were originally presented in Simpson J, Olsen A, and Eden J (1975), "A Bayesian Analysis of a Multiplicative Treatment Effect in Weather Modification," Technometrics, 17, 161-166, and subsequently reported and analyzed by Ramsey FL and Schafer DW (1997), The Statistical Sleuth: A Course in Methods of Data Analysis, Belmont, CA: Duxbury Press. They involve an experiment performed in southern Florida between 1968 and 1972. An aircraft was flown through a series of clouds and, at random, seeded some of them with massive amounts of silver iodide. Precipitation after the aircraft passed through was measured in acre-feet.

The distribution of precipitation within each group (seeded or not) is positively skewed (long-tailed to the right). The group with the higher mean has a proportionally larger standard deviation as well. Both characteristics suggest that a logarithmic transformation be used to make the data more symmetric and homoscedastic (more equal spread). The second pair of box plots bears this out. This transformation will tend to make CIs more reliable, that is, the level of confidence is more likely to be what is claimed.

    Rainfall      N    Mean   Std. Deviation   Median
    Not Seeded   26   164.6            278.4     44.2
    Seeded       26   442.0            650.8    221.6

    LOG_RAIN      N     Mean   Std. Deviation   Geometric Mean
    Not Seeded   26   1.7330            .7130            54.08
    Seeded       26   2.2297            .6947           169.71

    95% Confidence Interval for the Mean Difference,
    Seeded - Not Seeded (logged data)        Lower     Upper
    Equal variances assumed                 0.1046    0.8889
    Equal variances not assumed             0.1046    0.8889
Researchers often transform data back to the original scale when a logarithmic transformation is applied to a set of data. Tables might include Geometric Means, which are the anti-logs of the mean of the logged data. When data are positively skewed, the geometric mean is invariably less than the arithmetic mean. This leads to questions of whether the geometric mean has any interpretation other than as the anti-log of the mean of the log-transformed data.

http://www.tufts.edu/~gdallal/ci_logs.htm
The geometric mean is often a good estimate of the original median. The logarithmic transformation is monotonic, that is, data are ordered the same way in the log scale as in the original scale. If a is greater than b, then log(a) is greater than log(b). Since the observations are ordered the same way in both the original and log scales, the observation in the middle in the original scale is also the observation in the middle in the log scale, that is, the log of the median = the median of the logs.

If the log transformation makes the population symmetric, then the population mean and median are the same in the log scale. Whatever estimates the mean also estimates the median, and vice-versa. The mean of the logs estimates both the population mean and median in the log-transformed scale. If the mean of the logs estimates the median of the logs, its anti-log (the geometric mean) estimates the median in the original scale! The median rainfall for the seeded clouds is 221.6 acre-feet. In the picture, the solid line between the two histograms connects the median in the original scale to the mean in the log-transformed scale.

One property of the logarithm is that "the difference between logs is the log of the ratio", that is, log(x) − log(y) = log(x/y). The confidence interval from the logged data estimates the difference between the population means of log-transformed data, that is, it estimates the difference between the logs of the geometric means. However, the difference between the logs of the geometric means is the log of the ratio of the geometric means. The anti-logarithms of the end points of this confidence interval give a confidence interval for the ratio of geometric means itself. Since the geometric mean is sometimes an estimate of the median in the original scale, it follows that a confidence interval for the ratio of the geometric means is approximately a confidence interval for the ratio of the medians in the original scale.
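A quick simulation (invented data, seed fixed for reproducibility) illustrates the chain of reasoning: for positively skewed log-normal data, the anti-log of the mean of the logs, the geometric mean, lands near the sample median and below the arithmetic mean:

```python
import math
import random
import statistics

random.seed(1)
# simulated positively skewed data: anti-logs of normal draws (log-normal)
data = [math.exp(random.gauss(2.0, 0.7)) for _ in range(10001)]

geometric_mean = math.exp(statistics.mean(math.log(x) for x in data))
median = statistics.median(data)
arithmetic_mean = statistics.mean(data)

print(round(geometric_mean, 1), round(median, 1), round(arithmetic_mean, 1))
# geometric mean ~ median; both well below the arithmetic mean
```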
In the (common) log scale, the mean difference between seeded and unseeded clouds is 0.4967. Our best estimate of the ratio of the median rainfall of seeded clouds to that of unseeded clouds is 10^0.4967 = 3.14. Our best estimate of the effect of cloud seeding is that it produces 3.14 times as much rain on average as not seeding.

Even when the calculations are done properly, the conclusion is often misstated. The difference of 0.4967 does not mean seeded clouds produce 0.4967 acre-feet more rain than unseeded clouds. It is also improper to say that seeded clouds produce 0.4967 log-acre-feet more than unseeded clouds. The 3.14 means 3.14 times as much. It does not mean 3.14 times more (which would be 4.14 times as much). It does not mean 3.14 acre-feet more. It is a ratio and has to be described that way.

The 95% CI for the population mean difference (Seeded - Not Seeded) is (0.1046, 0.8889). For reporting purposes, this CI should be transformed back to the original scale. A CI for a difference in the log scale becomes a CI for a ratio in the original scale. The anti-logarithms of the endpoints of the confidence interval are 10^0.1046 = 1.27 and 10^0.8889 = 7.74.
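The arithmetic, using common (base-10) logs as on this page, is a few lines:

```python
# Common (base-10) log means from the cloud-seeding table above
seeded, not_seeded = 2.2297, 1.7330

diff = seeded - not_seeded                 # 0.4967
ratio = 10 ** diff                         # ~3.14 times as much rain
ci_ratio = (10 ** 0.1046, 10 ** 0.8889)    # ~ (1.27, 7.74)
```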
Thus, the report would read: "The geometric mean of the amount of rain produced by a seeded cloud is 3.14 times as much as that produced by an unseeded cloud (95% CI: 1.27 to 7.74 times as much)." If the logged data have a roughly symmetric distribution, you might go so far as to say, "The median amount of rain ... is approximately ..."

Comment: The logarithm is the only transformation that produces results that can be cleanly expressed in terms of the original data. Other transformations, such as the square root, are sometimes used, but it is difficult to restate their results in terms of the original data.

Copyright © 2000 Gerard E. Dallal. Last modified: Mon Sep 30 2002 14:15:42.
2.9
Standard deviations and standard errors revisited
The use of standard deviations and standard errors in reports and publications can be confusing. Here are some typical questions asked by students about these two concepts:

I am confused about why different graphs in different publications display the mean ± 1 standard deviation; the mean ± 2 standard deviations; the mean ± 1 se; or the mean ± 2 se. When should each graph be used? What is the difference between a box-plot; ± 2 se; and ± 2 standard deviations?
The foremost distinction between the use of standard deviations and standard errors can be made as follows:

Standard deviations should be used when information about INDIVIDUAL observations is to be conveyed; standard errors should be used when information about the precision of an estimate is to be conveyed.

There are, in fact, several common types of graphs that can be used to display the distribution of the INDIVIDUAL data values. Common displays from "closest to raw data" to "based on summary statistics" are:

• dot plots
• stem and leaf plots
• histograms
• box plots
• mean ± 1 std dev. NOTE this is NOT the same as the estimate ± 1 se
• mean ± 2 std dev. NOTE this is NOT the same as the estimate ± 2 se

The dot plot is a simple plot of the actual raw data values (e.g. that seen in JMP when the Analyze->Fit Y-by-X platform is invoked). It is used to check for outliers and other unusual points. Often jittering is used to avoid overprinting any duplicate data points. It is useful for up to about 100 data points. Here is an example of a dot plot of air quality data in several years:
Stem and leaf plots and histograms are similar. Both first start by creating "bins" representing ranges of the data (e.g. 0→4.9999, 5→9.9999, 10→14.9999, etc.). Then the number of data points in each bin is tabulated. The display shows the count or the frequency in each bin. The general shape of the data is examined (e.g. is it symmetrical, or skewed, etc.). Here are two examples of histograms and stem-and-leaf charts:
The box-plot is an alternate method of displaying the individual data values. The box portion displays the 25th, 50th, and 75th percentiles14 of the data. The definition of the extent of the whiskers depends upon the statistical package, but generally they stretch to show the "typical" range to be expected from the data. Outliers may be "indicated" in some plots. The box-plot is an alternative (and in my opinion a superior) display to a graph showing the mean ± 2 standard deviations because it conveys more information. For example, a box plot will show if the data are symmetric (25th, 50th, and 75th percentiles roughly equally spaced) or skewed (the median much closer to one of the 25th or 75th percentiles). The whiskers show the range of the INDIVIDUAL data values. Here is an example of side-by-side box plots of the air quality data:

14 The pth percentile in a data set is the value such that at least p% of the data are less than the percentile, and at least (100−p)% of the data values are greater than the percentile. For example, the median = .5 quantile = 50th percentile is the value such that at least 50% of the data values are below the median and at least 50% of the data values are above the median. The 25th percentile = .25 quantile = 1st quartile is the value such that at least 25% of the data values are less than the value and at least 75% of the data values are greater than this value.
The mean ± 1 STD DEV shows a range where you would expect about 68% of the INDIVIDUAL data VALUES assuming the original data came from a normally distributed population. The mean ± 2 STD DEV shows a range where you would expect about 95% of INDIVIDUAL data VALUES assuming the original data came from a normally distributed population. The latter two plots are NOT RELATED to confidence intervals! This plot might be useful when the intent is to show the variability encountered in the sampling or the presence of outliers etc. It is unclear why many journals still accept graphs with ± 1 standard deviation as most people are interested in the range of the data collected so ± 2 standard deviations would be more useful. Here is a plot of the mean ± 1 standard deviation – plots of the mean ± 2 standard deviations are not available in JMP:
I generally prefer the use of dot plots and box-plots as these are much more informative than stem-and-leaf plots, histograms, or the mean ± some multiple of standard deviations.

Then there are displays showing the precision of estimates. Common displays are:

• mean ± 1 SE
• mean ± 2 SE
• lower and upper bounds of confidence intervals
• diamond plots

These displays do NOT have anything to do with the sample values - they are trying to show the location of plausible values for the unknown population parameter - in this case, the population mean. A standard error measures how variable an estimate would likely be if repeated samples/experiments from the same population were performed. Note that a se says NOTHING about the actual sample values! For example, it is NOT correct to say that a 95% confidence interval contains 95% of INDIVIDUAL data values.

The mean ± 1 SE display is not very informative as it corresponds to an approximate 68% confidence interval. The mean ± 2 SE corresponds to an approximate 95% confidence interval IN THE CASE OF
SIMPLE RANDOM SAMPLING. Here is a plot of the mean ± 1 se; a plot of the mean ± 2 se is not available directly in JMP except as a bar above and below the mean.
Alternatively, JMP can plot confidence interval diamonds.
Graphs showing ± 1 or 2 standard errors are showing the range of plausible values for the underlying population mean. It is unclear why many journals still publish graphs with ± 1 se as this corresponds to an approximate 68% confidence interval; I think that a 95% confidence interval, corresponding to ± 2 se, would be more useful.

Caution: Often these graphs (e.g. created by Excel) use the simple formula for the se of the sample mean collected under a simple random sample even if the underlying design is more complex! In this case, the graph is in error and should not be interpreted! Both the confidence interval and the diamond plots (if computed correctly for a particular sampling design and estimator) correspond to a 95% confidence interval.
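The SD-versus-SE distinction can be sketched numerically (simulated data under simple random sampling; none of these numbers are from the text):

```python
import math
import random
import statistics

random.seed(42)
sample = [random.gauss(100, 15) for _ in range(400)]   # simulated SRS, n = 400

mean = statistics.mean(sample)
sd = statistics.stdev(sample)          # describes INDIVIDUAL values
se = sd / math.sqrt(len(sample))       # describes the precision of the mean

# mean +/- 2 SD covers roughly 95% of the individual values...
covered = sum(mean - 2 * sd <= x <= mean + 2 * sd for x in sample) / len(sample)

# ...while mean +/- 2 SE is an approximate 95% CI for the population mean,
# a much narrower interval that says nothing about individual values
ci = (mean - 2 * se, mean + 2 * se)
```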
2.10
Other tidbits
2.10.1
Interpreting p-values
I have a question about p-values. I’m confused about the wording used when they explain the p-value. They say ‘with p = 0.03, in 3 percent of experiments like this we would observe sample means as different as or more different than the ones we got, if in fact the null hypothesis were true.’ The part that gets me is the ‘as different as or more different than’. I think I’m just having problems putting it into words that make sense to me. Do you have another way of saying it?
The p-value measures the ‘unusualness’ of the data assuming that the null hypothesis is true. The ‘confusing’ part is how to measure unusualness. For example; is a person 7 ft (about 2 m) unusually tall? Yes, because only a small fraction of people are AS TALL OR TALLER. Now if the hypothesis is 2-sided, both small and large values of the sample mean (relative to the hypothesized value) are unusual. For example, suppose that null hypothesis is that the mean amount in bottles of pop is 250 mL. We would be very surprised if the sample mean was very small (e.g. 150 mL) or very large (e.g. 350 mL). That is why, for a two-sided test, the unusualness is ‘as different or more different’. You aren’t just interested in the probability of getting exactly 150 or 350, but rather in the probability that the sample mean is < 150 or > 350 (analogous to the probability of being 7 ft or higher).
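For a test statistic that is approximately normal, the "as different or more different in either direction" idea is exactly a two-sided tail probability; a minimal sketch (the z-value 2.17 is an invented illustration, not from the text):

```python
import math

def normal_cdf(z: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def two_sided_p(z: float) -> float:
    """P(|Z| >= |z|): probability of a result 'as different or more different'
    in EITHER direction from the hypothesized value."""
    return 2 * (1 - normal_cdf(abs(z)))

# e.g. a sample mean about 2.17 standard errors away from the hypothesized 250 mL
print(round(two_sided_p(2.17), 3))   # ~0.030
```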
2.10.2
False positives vs. false negatives
What is the difference between a false positive and a false negative?
A false positive (Type I error) occurs if you conclude that the evidence against the hypothesis of interest is strong when, in fact, the hypothesis is true. For example, in a pregnancy test, the null hypothesis is that the person is NOT pregnant. A false positive reading would indicate that the test indicates a pregnancy when in fact the person is not pregnant. A false negative (Type II error) occurs if you find insufficient evidence against the null hypothesis when, in fact, the null hypothesis is false. In the case of a pregnancy test, a false negative would occur if the test indicates not pregnant when, in fact, the person is pregnant.
2.10.3
Specificity/sensitivity/power
Please clarify specificity/sensitivity/power of a test. Are they the same?
The power and sensitivity are two terms for the ability to find sufficient evidence against the null hypothesis when, in fact, the null hypothesis is false. For example, a pregnancy test with a sensitivity of 99% has a 99% chance of correctly identifying a pregnancy when in fact the person is pregnant. The specificity of a test indicates the ability to NOT find evidence against the null hypothesis when the null hypothesis is true - the opposite of a Type I error. A pregnancy test would have high specificity if it rarely declares a pregnancy for a non-pregnant person.
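In terms of counts from a 2×2 table of test result versus true state, the definitions reduce to two proportions (all counts below are invented for illustration):

```python
# Hypothetical pregnancy-test results; all counts are invented for illustration
true_pos, false_neg = 99, 1     # 100 truly pregnant people
true_neg, false_pos = 97, 3     # 100 truly not-pregnant people

sensitivity = true_pos / (true_pos + false_neg)   # = power: P(test + | pregnant)
specificity = true_neg / (true_neg + false_pos)   # P(test - | not pregnant)
print(sensitivity, specificity)   # 0.99 0.97
```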
Chapter 3
Sampling

Contents
3.1 Introduction . . . 108
    3.1.1 Difference between sampling and experimental design . . . 108
    3.1.2 Why sample rather than census? . . . 109
    3.1.3 Principle steps in a survey . . . 109
    3.1.4 Probability sampling vs. non-probability sampling . . . 110
    3.1.5 The importance of randomization in survey design . . . 111
    3.1.6 Model vs. Design based sampling . . . 114
    3.1.7 Software . . . 115
3.2 Overview of Sampling Methods . . . 115
    3.2.1 Simple Random Sampling . . . 115
    3.2.2 Systematic Surveys . . . 117
    3.2.3 Cluster sampling . . . 120
    3.2.4 Multi-stage sampling . . . 124
    3.2.5 Multi-phase designs . . . 126
    3.2.6 Panel design - suitable for long-term monitoring . . . 128
    3.2.7 Sampling non-discrete objects . . . 129
    3.2.8 Key considerations when designing or analyzing a survey . . . 129
3.3 Notation . . . 131
3.4 Simple Random Sampling Without Replacement (SRSWOR) . . . 131
    3.4.1 Summary of main results . . . 132
    3.4.2 Estimating the Population Mean . . . 133
    3.4.3 Estimating the Population Total . . . 133
    3.4.4 Estimating Population Proportions . . . 134
    3.4.5 Example - estimating total catch of fish in a recreational fishery . . . 134
3.5 Sample size determination for a simple random sample . . . 141
    3.5.1 Example - How many angling-parties to survey  144
3.6 Systematic sampling  147
    3.6.1 Advantages of systematic sampling  148
    3.6.2 Disadvantages of systematic sampling  148
    3.6.3 How to select a systematic sample  148
    3.6.4 Analyzing a systematic sample  149
    3.6.5 Technical notes - Repeated systematic sampling  149
3.7 Stratified simple random sampling  152
    3.7.1 A visual comparison of a simple random sample vs. a stratified simple random sample  154
    3.7.2 Notation  162
    3.7.3 Summary of main results  162
    3.7.4 Example - sampling organic matter from a lake  164
    3.7.5 Example - estimating the total catch of salmon  168
    3.7.6 Sample Size for Stratified Designs  177
    3.7.7 Allocating samples among strata  180
    3.7.8 Example: Estimating the number of tundra swans  183
    3.7.9 Post-stratification  187
    3.7.10 Allocation and precision - revisited  189
3.8 Ratio estimation in SRS - improving precision with auxiliary information  190
    3.8.1 Summary of main results  192
    3.8.2 Example - wolf/moose ratio  192
    3.8.3 Example - Grouse numbers - using a ratio estimator to estimate a population total  201
3.9 Additional ways to improve precision  210
    3.9.1 Using both stratification and auxiliary variables  210
    3.9.2 Regression Estimators  211
    3.9.3 Sampling with unequal probability - pps sampling  211
3.10 Cluster sampling  212
    3.10.1 Sampling plan  213
    3.10.2 Advantages and disadvantages of cluster sampling compared to SRS  219
    3.10.3 Notation  219
    3.10.4 Summary of main results  220
    3.10.5 Example - estimating the density of urchins  221
    3.10.6 Example - estimating the total number of sea cucumbers  227
3.11 Multi-stage sampling - a generalization of cluster sampling  235
    3.11.1 Introduction  235
    3.11.2 Notation  236
    3.11.3 Summary of main results  237
    3.11.4 Example - estimating number of clams  238
    3.11.5 Some closing comments on multi-stage designs  242
3.12 Analytical surveys - almost experimental design  242
3.13 References  246
3.14 Frequently Asked Questions (FAQ)  247
    3.14.1 Confusion about the definition of a population  247
    3.14.2 How is N defined  248
    3.14.3 Multi-stage vs. Multi-phase sampling  248
    3.14.4 What is the difference between a Population and a frame?  249
    3.14.5 How to account for missing transects  249

3.1 Introduction
Today the word "survey" is used most often to describe a method of gathering information from a sample of individuals, animals, or areas. This "sample" is usually just a fraction of the population being studied. You are exposed to survey results almost every day: election polls, the unemployment rate, and the consumer price index are all results of surveys. On the other hand, some common headlines are NOT the results of surveys but rather of experiments, for example, whether a new drug is just as effective as an old drug.

Not only do surveys have a wide variety of purposes, they can also be conducted in many ways, including over the telephone, by mail, or in person. Nonetheless, all surveys do have certain characteristics in common. All surveys require a great deal of planning in order that the results be informative. Unlike a census, where all members of the population are studied, surveys gather information from only a portion of the population of interest, the size of the sample depending on the purpose of the study. Surprisingly to many people, a survey can give better quality results than a census.

In a bona fide survey, the sample is not selected haphazardly. It is scientifically chosen so that each object in the population has a measurable chance of selection. This way, the results can be reliably projected from the sample to the larger population. Information is collected by means of standardized procedures. The survey's intent is not to describe the particular objects which, by chance, are part of the sample, but to obtain a composite profile of the population.
3.1.1 Difference between sampling and experimental design
There are two key differences between survey sampling and experimental design.

• In experiments, one deliberately perturbs some part of the population to see the effect of the action. In sampling, one wishes to see what the population is like without disturbing it.

• In experiments, the objective is to compare the mean response to changes in levels of the factors. In sampling, the objective is to describe the characteristics of the population.

However, refer to the section on analytical sampling later in this chapter for cases where sampling looks very similar to experimental design.
3.1.2 Why sample rather than census?
There are a number of advantages of sampling over a complete census:

• reduced cost
• greater speed - a much smaller scale of operations is performed
• greater scope - useful if highly trained personnel or equipment is needed
• greater accuracy - easier to train a small crew, supervise them, and reduce data entry errors
• reduced respondent burden
• in destructive sampling you can't measure the entire population - e.g. crash tests of cars
3.1.3 Principal steps in a survey
The principal steps in a survey are:

• formulate the objectives of the survey - a concise statement is needed
• define the population to be sampled - e.g. what is the range of animals or locations to be measured? Note that the population is the set of final sampling units that will be measured - refer to the FAQ at the end of the chapter for more information.
• establish what data is to be collected - collect a few items well rather than many poorly
• decide what degree of precision is required - examine the power needed
• establish the frame - this is a list of sampling units that is exhaustive and exclusive
  – in many cases the frame is obvious, but in others it is not
  – it is often very difficult to establish a frame - e.g. a list of all streams in the lower mainland
• choose among the various designs - will you stratify? There are a variety of sampling plans, some of which will be discussed in detail later in this chapter. Some common designs in ecological studies are:
  – simple random sampling
  – systematic sampling
  – cluster sampling
  – multi-stage designs
  All designs can be improved by stratification, so this should always be considered during the design phase.
• pre-test - very important to try out field methods and questionnaires
• organization of field work - training, pre-test, etc.
• summary and data analysis - the easiest part if the earlier steps were done well
• post-mortem - what went well, what went poorly, etc.
3.1.4 Probability sampling vs. non-probability sampling
There are two types of sampling plans: probability sampling, where units are chosen in a 'random fashion', and non-probability sampling, where units are chosen in some deliberate fashion.

In probability sampling:

• every unit has a known probability of being in the sample
• the sample is drawn with some method consistent with these probabilities
• these selection probabilities are used when making estimates from the sample

The advantages of probability sampling:

• we can study the biases of the sampling plans
• standard errors and measures of precision (confidence limits) can be obtained

Some types of non-probability sampling plans include:

• quota sampling - e.g. select 50 M and 50 F from the population
  – less expensive than a probability sample
  – may be the only option if no frame exists
• judgmental sampling - select 'average' or 'typical' values. This is a quick-and-dirty sampling method and can perform well if there are a few extreme points which should not be included.
• convenience sampling - select those units readily available. This is useful if it is dangerous or unpleasant to sample directly. For example, selecting blood samples from grizzly bears.
• haphazard sampling (not the same as random sampling). This is often useful if the sampling material is homogeneous and spread throughout the population, e.g. chemicals in drinking water.

The disadvantages of non-probability sampling include
• unable to assess biases in any rational way
• no estimates of precision can be obtained - in particular, the simple use of formulae from probability sampling is WRONG!
• experts may disagree on what is the "best" sample
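The point that known selection probabilities are used when making estimates (one of the defining features of probability sampling above) can be sketched with the Horvitz-Thompson estimator of a population total. This is a generic illustration in Python, not an example from the notes; the numbers are hypothetical:

```python
# Horvitz-Thompson idea: with known inclusion probabilities pi_i,
# an unbiased estimate of the population total is sum(y_i / pi_i).

def horvitz_thompson_total(values, inclusion_probs):
    return sum(y / p for y, p in zip(values, inclusion_probs))

# Hypothetical check: a SRS of n = 4 from N = 100 gives every unit
# inclusion probability n/N = 0.04, and the estimator reduces to N * ybar.
values = [2, 4, 6, 8]
total_hat = horvitz_thompson_total(values, [0.04] * 4)
print(total_hat)  # about 500, the same as N * ybar = 100 * 5
```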
3.1.5 The importance of randomization in survey design
[With thanks to Dr. Rick Routledge for this part of the notes.]

. . . I had to make a 'cover degree' study... This involved the use of a Raunkiaer's Circle, a device designed in hell. In appearance it was all simple innocence, being no more than a big metal hoop; but in use it was a devil's mechanism for driving sane men mad. To use it, one stood on a stretch of muskeg, shut one's eyes, spun around several times like a top, and then flung the circle as far away as possible. This complicated procedure was designed to ensure that the throw was truly 'random'; but, in the event, it inevitably resulted in my losing sight of the hoop entirely, and having to spend an unconscionable time searching for the thing.

Farley Mowat, Never Cry Wolf. McClelland and Stewart, 1963.

Why would a field biologist in the early post-war period be instructed to follow such a bizarre-looking scheme for collecting a representative sample of tundra vegetation? Could she not have obtained a typical cross-section of the vegetation by using her own judgment? Undoubtedly, she could have convinced herself that by replacing an awkward, haphazard sampling scheme with one dependent solely on her own judgment and common sense, she could have been guaranteed a more representative sample. But would others be convinced? A careful, objective scientist is trained to be skeptical. She would be reluctant to accept any evidence whose validity depended critically on the judgment and skills of a stranger. The burden of proof would then rest squarely with Farley Mowat to prove his ability to take representative, judgmental samples. It is typically far easier for a scientist to use randomization in her sampling procedures than it is to prove her judgmental skills.

Hovering and Patrolling Bees

It is often difficult, if not impossible, to take a properly randomized sample. Consider, e.g., the problem faced by Alcock et al.
(1977) in studying the behavior of male bees of the species Centris pallida in the deserts of the south-western United States. Females pupate in underground burrows. To maximize the presence of his genes in the next generation, a male of the species needs to mate with as many virgin females as possible. One strategy is to patrol the burrowing area at a low altitude, and nab an emerging female as soon as her presence is detected. This patrolling strategy seems to involve a relatively high risk of confrontation with other patrolling males. The other strategy reported by the authors is to hover farther above the burrowing area, and mate with those females who escape detection by the hoverers. These hoverers appear to be involved in fewer conflicts.

Because the hoverers tend to be less involved in aggressive confrontations, one might guess that they would tend to be somewhat smaller than the more aggressive patrollers. To assess this hypothesis, the
2012 Carl James Schwarz
111
December 21, 2012
authors took measurements of head widths for each of the two subpopulations. Of course, they could not capture every single male bee in the population. They had to be content with a sample. Sample sizes and results are reported in the table below.

How are we to interpret these results? The sampled hoverers obviously tended to be somewhat smaller than the sampled patrollers, although it appears from the standard deviations that some hoverers were larger than the average-sized patroller and vice-versa. Hence, the difference is not overwhelming, and may be attributable to sampling errors.

Table: Summary of head width measurements on two samples of bees.

  Sample        n     ybar      SD
  Hoverers      50    4.92 mm   0.15 mm
  Patrollers    100   5.14 mm   0.29 mm
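As a rough check on the comparison, a Welch two-sample t statistic can be computed directly from the summary figures in the table (illustrative Python; the raw data are not given in the notes, so only n, the mean, and the SD are used):

```python
import math

# Welch t statistic from summary statistics: difference in means divided by
# the standard error of that difference, with unpooled variances.
def welch_t(n1, mean1, sd1, n2, mean2, sd2):
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean2 - mean1) / se

# Hoverers: n=50, ybar=4.92, SD=0.15; Patrollers: n=100, ybar=5.14, SD=0.29
t = welch_t(50, 4.92, 0.15, 100, 5.14, 0.29)
print(round(t, 1))  # 6.1: far too large to attribute to chance error alone
```

This supports the statement in the text that the probable size of chance errors can be assessed by a standard t-test, assuming the sampling really was randomized.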
If the sampling were truly randomized, then the only sampling errors would be chance errors, whose probable size can be assessed by a standard t-test. Exactly how were the samples taken? Is it possible that the sampling procedure used to select patrolling bees might favor the capture of larger bees, for example? This issue is indeed addressed by the authors. They carefully explain how they attempted to obtain unbiased samples. For example, to sample the patrolling bees, they made a sweep across the sampling area, attempting to catch all the patrolling bees that they observed. To assess the potential for bias, one must in the end make a subjective judgment.

Why make all this fuss over a technical possibility? It is important to do so because lack of attention to such possibilities has led to some colossal errors in the past. Nowhere are they more obvious than in the field of election prediction. Most of us never find out the real nature of the population that we are sampling. Hence, we never know the true size of our errors. By contrast, pollsters' errors are often painfully obvious. After the election, the actual percentages are available for everyone to see.

Lessons from Opinion Polling

In the 1930's, political opinion polling was in its formative years. The pioneers in this endeavor were training themselves on the job. Of the inevitable errors, two were so spectacular as to make international headlines.

In 1936, an American magazine with a large circulation, The Literary Digest, attempted to poll an enormous segment of the American voting public in order to predict the outcome of the presidential election that autumn. Roosevelt, the Democratic candidate, promised to develop programs designed to increase opportunities for the disadvantaged; Landon, the candidate for the Republican Party, appealed more to the wealthier segments of American society.
The Literary Digest mailed out questionnaires to about ten million people whose names appeared in such places as subscription lists, club directories, etc. They received over 2.5 million responses, on the basis of which they predicted a comfortable victory for Landon. The election returns soon showed the massive size of their prediction error. The cumbersome design of this highly publicized survey provided a young, wily pollster with the chance of a lifetime. Between the time that the Digest announced its plans and released its predictions, George Gallup planned and executed a remarkable coup. By polling only a small fraction of these individuals, and a relatively small number of other voters, he correctly predicted not only the outcome of the election, but also
the enormous size of the error about to be committed by The Literary Digest.

Obviously, the enormous sample obtained by the Digest was not very representative of the population. The selection procedure was heavily biased in favor of Republican voters. The most obvious source of bias is the method used to generate the list of names and addresses of the people that they contacted. In 1936, only the relatively affluent could afford magazines, telephones, etc., and the more conservative policies of the Republican Party appealed to a greater proportion of this segment of the American public. The Digest's sample selection procedure was therefore biased in favor of the Republican candidate.

The Literary Digest was guilty of taking a sample of convenience. Samples of convenience are typically prone to bias. Any researcher who, either by choice or necessity, uses such a sample has to be prepared to defend his findings against possible charges of bias. As this example shows, bias can have catastrophic consequences.

How did Gallup obtain his more representative sample? He did not use randomization. Randomization is often criticized on the grounds that once in a while, it can produce absurdly unrepresentative samples. When faced with a sample that obviously contains far too few economically disadvantaged voters, it is small consolation to know that next time around, the error will likely not be repeated. Gallup used a procedure that virtually guaranteed that his sample would be representative with respect to such obvious features as age, race, etc. He did so by assigning quotas which his interviewers were to fill. One interviewer might, e.g., be assigned to interview 5 adult males with specified characteristics in a tough, inner-city neighborhood. The quotas were devised so as to make the sample mimic known features of the population.
This quota sampling technique suited Gallup's needs spectacularly well in 1936 even though he underestimated the support for the Democratic candidate by about 6%. His subsequent polls contained the same systematic error. In 1948, the error finally caught up with him. He predicted a narrow victory for the Republican candidate, Dewey. A newspaper editor was so confident of the prediction that he authorized the printing of a headline proclaiming the victory before the official results were available. It turned out that the Democrat, Truman, won by a narrow margin.

What was wrong with Gallup's sampling technique? He gave his interviewers the final decision as to whom would be interviewed. In a tough inner-city neighborhood, an interviewer had the option of passing by a house with several motorcycles parked out in front and sounds of a raucous party coming from within. In the resulting sample, the more conservative (Republican) voters were systematically over-represented.

Gallup learned from his mistakes. His subsequent surveys replaced interviewer discretion with an objective, randomized scheme at the final stage of sample selection. With the dominant source of systematic error removed, his election predictions became even more reliable.

Implications for Biological Surveys

The bias in samples of convenience can be enormous. It can be surprisingly large even in what appear to be carefully designed surveys. It can easily exceed the typical size of the chance error terms. To completely remove the possibility of bias in the selection of a sample, randomization must be employed. Sometimes this is simply not possible, as, for example, appears to be the case in the study on bees. When this happens and the investigators wish to use the results of a nonrandomized sample, then the final report should discuss
the possibility of selection bias and its potential impact on the conclusions. Furthermore, when reading a report containing the results of a survey, it is important to carefully evaluate the survey design, and to consider the potential impact of sample selection bias on the conclusions.

Should Farley Mowat really have been content to take his samples by tossing Raunkiaer's Circle to the winds? Definitely not, for at least two reasons. First, he had to trust that by tossing the circle, he was generating an unbiased sample. It is by no means certain that all types of vegetation would be selected with equal probability. For example, the taller shrubs would tend to intercept the hoop earlier in its descent than would the smaller herbs. Second, he had no guarantee that his sample would be representative with respect to the major habitat types. Leaving aside potential bias, it is possible that the circle could, by chance, land repeatedly in a snowbed community. It seems indeed foolish to use a sampling scheme which admits the possibility of including only snowbed communities when tundra bogs and fellfields may be equally abundant in the population.

In subsequent chapters, we shall look into ways of taking more thoroughly randomized surveys, and into schemes for combining judgment with randomization to eliminate both selection bias and the potential for grossly unrepresentative samples. There are also circumstances in which a systematic sample (e.g., taking transects every 200 meters along a rocky shore line) may be justifiable, but this subject is not discussed in these notes.
3.1.6 Model vs. Design based sampling
Model-based sampling starts by assuming some statistical model for the data in the population, and the goal is to select data to estimate the parameters of this model. For example, you may be willing to assume that the distribution of values in the population is log-normal. The data collected from the survey are then used along with a likelihood function to estimate the parameters of the distribution. Model-based sampling is very powerful precisely because you are willing to make strong assumptions about the process that generated the data. However, if your model is wrong, there are big problems. For example, what if you assume log-normality but the data are not log-normally distributed? In such cases, the estimates of the parameters can be extremely biased and inefficient.

Design-based sampling makes no assumptions about the distribution of data values in the population. Rather, it relies upon the randomization procedure to select representative elements of the population. Estimates from design-based methods are unbiased regardless of the distribution of values in the population, but in "strange" populations they can also be inefficient. For example, if a population is highly clustered, a random sample of quadrats will end up with mostly zero observations and a few large values, and the resulting estimates will have a large standard error.

Most of the results in this chapter on survey sampling are design-based, i.e. we don't need to make any assumptions about normality in the population for the results to be valid.
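The clustered-population point can be checked by simulation. The sketch below (illustrative Python, with a hypothetical population of mostly empty quadrats and a few large counts) shows that the design-based SRS mean is centred on the true mean but has a large spread:

```python
import random
import statistics

random.seed(42)

# Hypothetical clustered population: 950 empty quadrats, 50 quadrats with 200 animals.
population = [0] * 950 + [200] * 50
true_mean = statistics.mean(population)            # 10

# Repeatedly draw a SRS of 25 quadrats and record the sample mean.
estimates = [statistics.mean(random.sample(population, 25)) for _ in range(2000)]

# The average of the estimates is close to the true mean (design-unbiased),
# but the individual estimates are highly variable (large standard error).
print(true_mean, statistics.mean(estimates), statistics.stdev(estimates))
```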
3.1.7 Software
For a review of packages that can be used to analyze survey data please refer to the article at http://www.fas.harvard.edu/~stats/survey-soft/survey-soft.html.

CAUTIONS IN USING STANDARD STATISTICAL SOFTWARE PACKAGES

Standard statistical software packages generally do not take into account four common characteristics of sample survey data: (1) unequal probability selection of observations, (2) clustering of observations, (3) stratification, and (4) nonresponse and other adjustments.

Point estimates of population parameters are impacted by the value of the analysis weight for each observation. These weights depend upon the selection probabilities and other survey design features such as stratification and clustering. Hence, standard packages will yield biased point estimates if the weights are ignored.

The estimated standard errors based on sample survey data are impacted by clustering, stratification, and the weights. By ignoring these aspects, standard packages generally underestimate the standard error, sometimes substantially so.

Most standard statistical packages can perform weighted analyses, usually via a WEIGHT statement added to the program code. Use of standard statistical packages with a weighting variable may yield the same point estimates for population parameters as sample survey software packages. However, the estimated standard error often is not correct and can be substantially wrong, depending upon the particular program within the standard software package.

For further information about the problems of using standard statistical software packages in survey sampling please refer to the article at http://www.fas.harvard.edu/~stats/survey-soft/donna_brogan.html.

Fortunately, for simple surveys, we can often do the analysis using standard software, as will be shown in these notes. Many software packages also have specialized survey procedures and, if available, these will be demonstrated. SAS includes many survey design procedures as shown in these notes.
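The effect of ignoring the weights on point estimates can be sketched with a tiny hypothetical example (illustrative Python, not from the notes): stratum A has 800 units but only 3 observations, stratum B has 200 units and 5 observations, and the weights w = N_h / n_h restore the population structure.

```python
# Hypothetical two-stratum survey where the weights are w_i = N_h / n_h.
ys      = [10, 12, 14, 50, 52, 54, 56, 58]    # 3 obs from stratum A, 5 from stratum B
weights = [800 / 3] * 3 + [200 / 5] * 5

unweighted = sum(ys) / len(ys)
weighted   = sum(w * y for w, y in zip(weights, ys)) / sum(weights)

# The unweighted mean (38.25) badly overstates the weighted estimate (about 20.4)
# because the small stratum B is heavily over-represented in the sample.
print(unweighted, weighted)
```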
3.2 Overview of Sampling Methods

3.2.1 Simple Random Sampling
This is the basic method of selecting survey units. Each unit in the population is selected with equal probability and all possible samples are equally likely to be chosen. This is commonly done by listing all the members of the population (the set of sampling units) and then choosing units using a random number table.

An example of a simple random sample would be a vegetation survey in a large forest stand. The stand is divided into 480 one-hectare plots, and a random sample of 24 plots is selected and analyzed using aerial photos. The map of the units selected might look like:
[Figure (not reproduced): map of the forest stand showing the 24 randomly selected one-hectare plots.]
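The selection step for this example can be sketched in a few lines (illustrative Python, not part of the original notes): number the 480 plots and draw 24 of them without replacement.

```python
import random

random.seed(1)                       # any seed; fixed here only for reproducibility
plots = range(1, 481)                # the 480 one-hectare plots, numbered 1..480
sample = sorted(random.sample(plots, 24))   # SRS without replacement
print(len(sample))                   # 24 distinct plot numbers
```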
Units are usually chosen without replacement, i.e., each unit in the population can be chosen at most once. In some cases (particularly for multi-stage designs), there are advantages to selecting units with replacement, i.e. a unit in the population may potentially be selected more than once.

The analysis of a simple random sample is straightforward. The mean of the sample is an estimate of the population mean. An estimate of the population total is obtained by multiplying the sample mean by the number of units in the population. The sampling fraction, the proportion of units chosen from the entire population, is typically small. If it exceeds 5%, an adjustment (the finite population correction) will result in better estimates of precision (a reduction in the standard error) to account for the fact that a substantial fraction of the population was surveyed.

A simple random sample design is often 'hidden' in the details of many other survey designs. For example, many surveys of vegetation are conducted using strip transects where the initial starting point of the transect is randomly chosen, and then every plot along the transect is measured. Here the strips are the sampling units, and they are a simple random sample from all possible strips. The individual plots are sub-samples from each strip and cannot be regarded as independent samples. For example, suppose a rectangular stand is surveyed using aerial overflights. In many cases, random starting points along one edge are selected, and the aircraft then surveys the entire length of the stand starting at the chosen point. The strips are typically analyzed section-by-section, but it would be incorrect to treat the smaller parts as a simple random sample from the entire stand.

Note that a crucial element of simple random samples is that every sampling unit is chosen independently of every other sampling unit. In strip transects, plots along the same transect are not chosen independently: when a particular transect is chosen, all plots along the transect are sampled, and so the selected plots are not a simple random sample of all possible plots. Strip transects are actually examples of cluster samples. Cluster samples are discussed in greater detail later in this chapter.
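The finite population correction mentioned above multiplies the usual variance of the mean by (1 - n/N). A sketch (illustrative Python; the sample standard deviation s = 12 is a hypothetical value):

```python
import math

def se_mean_srswor(s, n, N):
    """Standard error of the sample mean under SRS without replacement,
    including the finite population correction (1 - n/N)."""
    return math.sqrt((1 - n / N) * s ** 2 / n)

# Hypothetical numbers: s = 12, n = 24 plots sampled from N = 480 (5% fraction).
with_fpc    = se_mean_srswor(12, 24, 480)
without_fpc = 12 / math.sqrt(24)     # the usual SE, ignoring the correction
print(with_fpc < without_fpc)        # True: the correction shrinks the SE
```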
3.2.2 Systematic Surveys
In some cases, it is logistically inconvenient to randomly select sample units from the population. An alternative is to take a systematic sample where every kth unit is selected (after a random starting point); k is chosen to give the required sample size. For example, if a stream is 2 km long and 20 samples are required, then k = 100 and samples are chosen every 100 m along the stream after a random starting point. A common alternative when the population does not naturally divide into discrete units is grid sampling. Here sampling points are located using a grid that is randomly located in the area, and all sampling points are a fixed distance apart.

An example of a systematic sample would be a vegetation survey in a large forest stand. The stand is divided into 480 one-hectare plots. As a total sample size of 24 is required, we need to sample every 480/24 = 20th plot. We pick a random starting point (the 9th plot) in the first row, and then select every 20th plot, reading across the rows. The final plan could look like:
[Figure (not reproduced): map of the forest stand showing the systematic sample of every 20th plot starting at plot 9.]
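The systematic selection described above can be sketched as (illustrative Python, not part of the original notes):

```python
import random

random.seed(9)                       # fixed only for reproducibility
N, n = 480, 24                       # 480 plots, sample of 24
k = N // n                           # sampling interval: every 20th plot
start = random.randint(1, k)         # random starting plot in 1..20
sample = list(range(start, N + 1, k))
print(len(sample))                   # 24 plots, exactly k apart
```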
If a known trend is present in the sample, this can be incorporated into the analysis (Cochran, 1977, Chapter 8). For example, suppose that the systematic sample follows an elevation gradient that is known to directly influence the response variable. A regression-type correction can be incorporated into the analysis. However, note that this trend must be known from external sources; it cannot be deduced from the survey.

Pitfall: A systematic sample is typically analyzed in the same fashion as a simple random sample. However, the true precision of an estimator from a systematic sample can be either worse or better than that of a simple random sample of the same size, depending on whether units within the systematic sample are positively or negatively correlated among themselves. For example, if a systematic sample's sampling interval happens to match a cyclic pattern in the population, values within the systematic sample are highly positively correlated (the sampled units may all hit the 'peaks' of the cyclic trend), and the true sampling precision is worse than a SRS of the same size. What is even more unfortunate is that because the units are positively correlated within the sample, the sample variance will underestimate the true variation in the population, and if the estimated precision is computed using the formula for a SRS, a double dose of bias in the estimated precision occurs (Krebs, 1989, p. 227). On the other hand, if the systematic sample is arranged 'perpendicular' to a known trend so as to incorporate additional variability into the sample, the units within a sample are negatively correlated, the true precision is now better than that of a SRS of the same size, but the sample variance now overestimates the population variance, and the formula for precision from a SRS will overstate the sampling error.
While logistically simpler, a systematic sample is only ‘equivalent’ to a simple random sample of the same size if the population units are ‘in random order’ to begin with (Krebs, 1989, p. 227). Even worse, there is no information in the systematic sample that allows the manager to check for hidden trends and cycles. Nevertheless, systematic samples do offer some practical advantages over SRS if some correction can be made for the bias in the estimated precision:

• it is easier to relocate plots for long-term monitoring;

• mapping can be carried out concurrently with the sampling effort because the ground is systematically traversed. This is less of an issue now with GPS, as the exact position can easily be recorded and the plots revisited later;

• it avoids the problem of poorly distributed sampling units which can occur with a SRS [but this can also be avoided by judicious stratification].

Solution: Because of the necessity for a strong assumption of ‘randomness’ in the original population, systematic samples are discouraged and statistical advice should be sought before starting such a scheme. If there are no other feasible designs, a slight variation of the systematic sample provides some protection from the above problems. Instead of taking a single systematic sample of every kth unit, take 2 or 3 independent systematic samples of every 2kth or 3kth unit, each with a different starting point. For example, rather than taking a single systematic sample every 100 m along the stream, two independent systematic samples can be taken, each selecting units every 200 m along the stream starting at two random starting points. The total sample effort is still the same, but now some measure of the large-scale spatial structure can be estimated. This technique is known as replicated sub-sampling (Kish, 1965, p. 127).
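Replicated sub-sampling can be sketched as follows; the function name and the illustrative population (a simple linear trend) are assumptions of mine, not from the text. The key point is that the spread between the independent replicates supplies an honest standard error.

```python
import math
import random

def replicated_systematic(pop, n_reps, k, rng):
    """Take n_reps independent systematic samples, each of every (n_reps*k)-th unit."""
    step = n_reps * k
    samples = []
    for _ in range(n_reps):
        start = rng.randrange(step)      # a different random start per replicate
        samples.append(pop[start::step])
    return samples

rng = random.Random(7)
pop = [float(i) for i in range(2000)]    # illustrative population with a trend
reps = replicated_systematic(pop, n_reps=2, k=100, rng=rng)

rep_means = [sum(r) / len(r) for r in reps]
est = sum(rep_means) / len(rep_means)
# Standard error of the overall mean from the between-replicate variability
m = len(rep_means)
se = math.sqrt(sum((x - est) ** 2 for x in rep_means) / (m * (m - 1)))
```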
3.2.3 Cluster sampling
In some cases, units in a population occur naturally in groups or clusters. For example, some animals congregate in herds or family units. It is often convenient to select a random sample of herds and then measure every animal in the herd. This is not the same as a simple random sample of animals because individual animals are not randomly selected; the herds are the sampling units.

The strip-transect example in the section on simple random sampling is also a cluster sample; all plots along a randomly selected transect are measured. The strips are the sampling units, while plots within each strip are sub-sampling units. Another example is circular-plot sampling; all trees within a specified radius of a randomly selected point are measured. The sampling unit is the circular plot, while trees within the plot are sub-samples.

The reason cluster samples are used is that costs can be reduced compared to a simple random sample giving the same precision. Because units within a cluster are close together, travel costs among units are reduced. Consequently, more clusters (and more total units) can be surveyed for the same cost as a comparable simple random sample.

For example, consider the vegetation survey of previous sections. The 480 plots can be divided into 60 clusters of size 8. A total sample size of 24 is obtained by randomly selecting three clusters from the 60 clusters present in the map, and then surveying ALL eight members of the selected clusters. A map of the design might look like:
Alternatively, clusters are often formed when a transect sample is taken. For example, suppose that the vegetation survey picked an initial starting point on the left margin, and then flew completely across the landscape in a straight line, measuring all plots along the route. A map of the design might look like:
In this case, there are three clusters chosen from a possible 30 clusters, and the clusters are of unequal size (the middle cluster has only 12 plots measured, compared to the 18 plots measured on each of the other two transects).

Pitfall: A cluster sample is often mistakenly analyzed using methods for simple random surveys. This is not valid because units within a cluster are typically positively correlated. The effect of this erroneous analysis is to come up with an estimate that appears to be more precise than it really is, i.e. the estimated standard error is too small and does not fully reflect the actual imprecision in the estimate.

Solution: In order to be confident that the reported standard error really reflects the uncertainty of the estimate, it is important that the analytical methods are appropriate for the survey design. The proper analysis treats the clusters as a random sample from the population of clusters. The methods of simple random samples are applied to the cluster summary statistics (Thompson, 1992, Chapter 12).
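For equal-sized clusters, applying the SRS formulas to the cluster means looks like the sketch below. The per-plot values are made up for illustration; the design (3 clusters of 8 plots selected from 60) follows the vegetation example in the text.

```python
import math

# Hypothetical per-plot measurements for the 3 sampled clusters of 8 plots each
clusters = [
    [12, 14, 13, 15, 12, 13, 14, 12],
    [20, 22, 21, 19, 23, 20, 21, 22],
    [8, 9, 7, 8, 10, 9, 8, 7],
]
M, m = 60, len(clusters)               # clusters in population / clusters sampled

# Correct analysis: SRS formulas applied to the cluster means, not the 24 plots
cluster_means = [sum(c) / len(c) for c in clusters]
est_mean = sum(cluster_means) / m
s2 = sum((x - est_mean) ** 2 for x in cluster_means) / (m - 1)
se = math.sqrt(s2 / m * (1 - m / M))   # fpc counts clusters, not plots
```

Treating the 24 plots as if they were an independent SRS would divide by 24 rather than 3 and understate the standard error, because plots within a cluster are correlated.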
3.2.4 Multi-stage sampling
In many situations, there are natural divisions of the population into several different sizes of units. For example, a forest management unit consists of several stands, each stand has several cutblocks, and each cutblock can be divided into plots. These divisions can be easily accommodated in a survey through the use of multi-stage methods. Selection of units is done in stages. For example, several stands could be selected from a management area; then several cutblocks are selected in each of the chosen stands; then several plots are selected in each of the chosen cutblocks. Note that in a multi-stage design, units at any stage are selected at random only from those larger units selected in previous stages.

Again consider the vegetation survey of previous sections. The population is again divided into 60 clusters of size 8. However, rather than surveying all units within a cluster, we decide to survey only two units within each cluster. Hence, we now sample, at the first stage, a total of 12 clusters out of the 60. In each cluster, we randomly sample 2 of the 8 units. A sample plan might look like the following, where the rectangles indicate the clusters selected and the checks indicate the sub-sample taken from each cluster:
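The two-stage selection just described (12 of 60 clusters, then 2 of 8 units per chosen cluster) can be sketched as below; the plot labels and function name are illustrative assumptions, not from the text.

```python
import random

def two_stage_sample(clusters, n1, n2, rng):
    """Stage 1: sample n1 clusters at random; stage 2: sample n2 units in each."""
    chosen = rng.sample(range(len(clusters)), n1)
    return {c: rng.sample(clusters[c], n2) for c in chosen}

rng = random.Random(3)
# 60 clusters of 8 plot labels each (labels are made up for illustration)
clusters = [["plot-%d-%d" % (c, p) for p in range(8)] for c in range(60)]
sample = two_stage_sample(clusters, n1=12, n2=2, rng=rng)
# 12 clusters x 2 plots = 24 plots in total, as in the text's example
```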
The advantage of multi-stage designs is that costs can be reduced compared to a simple random sample of the same size, primarily through improved logistics. The precision of the results is worse than that of an equivalent simple random sample, but because costs are lower, a larger multi-stage survey can often be done for the same cost as a smaller simple random sample. This often results in a more precise estimate for the same cost. However, due to the misuse of data from complex designs, simple designs are often highly preferred and end up being more cost-efficient when the costs associated with incorrect decisions are incorporated.

Pitfall: Although random selections are made at each stage, a common error is to analyze these types of surveys as if they arose from a simple random sample. The plots were not independently selected; if a particular cutblock was not chosen, then none of the plots within that cutblock can be chosen. As in cluster samples, the consequences of this erroneous analysis are that the estimated standard errors are too small and do not fully reflect the actual imprecision in the estimates. A manager will be more confident in the estimate than is justified by the survey.

Solution: Again, it is important that the analytical methods are suitable for the sampling design. The proper analysis of multi-stage designs takes into account that random sampling takes place at each stage (Thompson, 1992, Chapter 13). In many cases, the precision of the estimates is determined essentially by the number of first-stage units selected. Little is gained by extensive sampling at lower stages.
3.2.5 Multi-phase designs
In some surveys, multiple surveys of the same survey units are performed. In the first phase, a sample of units is selected (usually by a simple random sample) and every selected unit is measured on some variable. Then, in subsequent phases, samples are selected ONLY from those units selected in the first phase, not from the entire population.

For example, refer back to the vegetation survey. An initial sample of 24 plots is chosen in a simple random survey. Aerial flights are used to quickly measure some characteristic of the plots. A second-phase sample of 6 units (circled below) is then measured using ground-based methods.
Multi-phase designs are commonly used in two situations. First, it is sometimes difficult to stratify a population in advance because the values of the stratification variables are not known. The first phase is used to measure the stratification variable on a random sample of units. The selected units are then stratified, and further samples are taken from each stratum as needed to measure a second variable. This avoids having to measure the second variable on every unit when the strata differ in importance. For example, in the first phase, plots are selected and measured for the amount of insect damage. The plots are then stratified by the amount of damage, and the second-phase allocation of units concentrates on plots with low insect damage to measure the total usable volume of wood. It would be wasteful to measure the volume of wood on plots with much insect damage.

The second common occurrence is when it is relatively easy to measure a surrogate variable (related to the real variable of interest) on selected units, and then, in the second phase, the real variable of interest is measured on a subset of the units. The relationship between the surrogate and the desired variable in the smaller sample is used to adjust the estimate based on the surrogate variable in the larger sample. For example, managers need to estimate the volume of wood removed from a harvesting area. A large sample of logging trucks is weighed (which is easy to do), and weight serves as a surrogate variable for volume. A smaller sample of trucks (selected from those weighed) is scaled for volume, and the relationship between volume and weight in the second-phase sample is used to predict volume based on weight alone for the first-phase sample.

Another example is the count-plot method of estimating the volume of timber in a stand. A selection of plots is chosen and the basal area determined. Then a sub-selection of plots is re-chosen in the second phase, and volume measurements are made on the second-phase plots. The relationship between volume and basal area in the second phase is used to predict volume from the basal area measurements of the first phase.
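The weight-to-volume example can be sketched as a simple ratio estimator. All numbers below are invented for illustration, and a real analysis would also attach a standard error to the ratio.

```python
# Two-phase (double) sampling sketch: weight is cheap to measure on a large
# first-phase sample of trucks; volume is scaled on a second-phase subsample.
phase1_weights = [10.2, 11.5, 9.8, 12.1, 10.9, 11.0, 9.5, 10.7]  # all phase-1 trucks
phase2 = [(10.2, 8.1), (12.1, 9.9), (9.5, 7.4)]                  # (weight, volume) subsample

# Volume per unit weight from the second-phase subsample...
ratio = sum(v for _, v in phase2) / sum(w for w, _ in phase2)
# ...scaled up by the easily measured total weight of the first-phase sample
est_total_volume = ratio * sum(phase1_weights)
```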
3.2.6 Panel design - suitable for long-term monitoring
One common objective of long-term studies is to investigate changes over time in a particular population. There are three common designs.

First, separate independent surveys can be conducted at each time point. This is the simplest design to analyze because all observations are independent over time. For example, independent surveys can be conducted at five-year intervals to assess regeneration of cutblocks. However, the precision of the estimated change may be poor because of the additional variability introduced by having new units sampled at each time point.

At the other extreme, units are selected in the first survey, permanent monitoring stations are established, and the same units are remeasured over time. For example, permanent study plots can be established that are remeasured for regeneration over time. Ceteris paribus (all else being equal), this design is more efficient (i.e. has higher power) than the previous design. The advantage of permanent study plots occurs because, in comparisons over time, the effects of that particular monitoring site tend to cancel out, and so estimates of variability are free of the additional variability introduced when new units are measured at every time point. One possible problem is that survey units may become ‘damaged’ over time, and the sample size will tend to decline, resulting in a loss of power. Additionally, the analysis of these designs is more complex because of the need to account for the correlation over time of measurements on the same sample plot, and the need to account for possible missing values when units become ‘damaged’ and
are dropped from the study.

A compromise between these two designs is the partial replacement design or panel design. In these designs, a portion of the survey units is replaced with new units at each time point. For example, 1/5 of the units could be replaced by new units at each time point; units would then normally stay in the study for a maximum of 5 time periods. This design combines the advantages of repeatedly measuring semi-permanent monitoring stations with the ability to replace (or refresh) the sample if units become damaged or are lost. The analysis of these designs is non-trivial, but manageable with modern software.
3.2.7 Sampling non-discrete objects
In some cases, the population does not have natural discrete sampling units. For example, a large section of land may be arbitrarily divided into 1 m² plots, or 10 m² plots. A natural question is: what is the ‘best’ size of unit? This has no simple answer and depends upon several factors, which must be addressed for each survey:

• Cost. All else being equal, sampling many small plots may be more expensive than sampling fewer larger plots. The primary difference in cost is the overhead in traveling and setup to measure the unit.

• Size of unit. An intuitive feeling is that many smaller plots are better than a few large plots because the sample size is larger. This will be true if the characteristic of interest is ‘patchy’, but, surprisingly, it makes no difference if the characteristic is randomly scattered throughout the area (Krebs, 1989, p. 64). Indeed, if the characteristic shows ‘avoidance’, then larger plots are better. For example, competition among trees implies they are spread out more than expected if they were randomly located. Logistic considerations also often influence the plot size. For example, if trampling the soil affects the response, then sample plots must be small enough to measure without trampling the soil.

• Edge effects. Because the population does not have natural boundaries, decisions often have to be made about objects that lie on the edge of the sample plot. In general, larger square or circular plots are better because of their smaller edge-to-area ratio. [A long narrow rectangular plot can have more edge than a square plot of the same area.]

• Size of object being measured. Clearly a 1 m² plot is not appropriate when counting mature Douglas-fir, but may be appropriate for a lichen survey.

A pilot study should be carried out prior to a large-scale survey to investigate the factors that influence the choice of sampling unit size.
3.2.8 Key considerations when designing or analyzing a survey
Key considerations when designing a survey are:
• What are the objectives of the survey?

• What is the sampling unit? This should be carefully distinguished from the observational unit. For example, you may sample boats returning from fishing, but interview the individual anglers on the boat.

• What frame is available (if any) for the sampling units? If a frame is available, then direct sampling can be used where the units can be numbered and the randomization used to select the sampling units. If no frame is available, then you will need to figure out how to identify the units and how to select them on the fly. For example, there is no frame of boats returning to an access point, so perhaps a systematic survey of every 5th boat could be used.

• Are all the sampling units the same size? If so, then a simple random sample (or variant thereof) is likely a suitable design. If the units vary considerably in size, then an unequal probability design may be more suitable. For example, if your survey units are forest polygons (as displayed on a GIS), these polygons vary considerably in size, with many smaller polygons and fewer larger polygons. A design that selects polygons with probability proportional to the size of the polygon may be more suitable than a simple random sample of polygons.

• Decide upon the sampling design to be used (i.e. simple random sample, cluster sample, multi-stage design, etc.). The availability of the frame and the existence of different-sized sampling units will often dictate the type of design used.

• What precision is required for the estimate? This (along with the variability in the response) will determine the sample size needed.

• If you are not stratifying your design, then why not? Stratification is a low-cost or no-cost way to improve your survey.

When analyzing a survey, the key steps are:

• Recognize the design that was used to collect the data. Key pointers to help recognize various designs are:

– How were the units selected?
A true simple random sample makes a list of all possible items and then chooses from that list.

– Is there more than one size of sampling unit? For example, were transects selected at random, and then quadrats within transects selected at random? This is usually a multi-stage design.

– Is there a cluster? For example, transects are selected and divided into a series of quadrats, all of which are measured.

Any analysis of the data must use a method that matches the design used to collect the data!

• Are there missing values? How did they occur? If the missingness is MCAR (missing completely at random), then life is easy and the analysis proceeds with a reduced sample size. If the missingness is MAR (missing at random), then some reweighting of the observed data will be required. If the missingness is IM (informatively missing), seek help - this is a difficult problem.

• Use a suitable package to analyze the results (avoid Excel except for the simplest designs!).

• Report both the estimate and the measure of precision (the standard error).
3.3 Notation
Unfortunately, sampling theory has developed its own notation that is different from that used for design of experiments or other areas of statistics, even though the same concepts are used in both. It would be nice to adopt a general convention for all of statistics - maybe in 100 years this will happen. Even among sampling textbooks, there is no agreement on notation! (sigh). In the table below, I've summarized the "usual" notation used in sampling theory. In general, large letters refer to population values, while small letters refer to sample values.

Characteristic       Population value                        Sample value
number of elements   N                                       n
units                Y_i                                     y_i
total                τ = Σ_{i=1..N} Y_i
mean                 µ = (1/N) Σ_{i=1..N} Y_i                ȳ = (1/n) Σ_{i=1..n} y_i
proportion           π = τ/N                                 p = Σ_{i=1..n} y_i / n
variance             S² = Σ_{i=1..N} (Y_i − µ)² / (N − 1)    s² = Σ_{i=1..n} (y_i − ȳ)² / (n − 1)
variance of a prop.  S² = N π(1 − π) / (N − 1)               s² = n p(1 − p) / (n − 1)
Note:

• The population mean is sometimes denoted as Ȳ in many books.

• The population total is sometimes denoted as Y in many books.

• Again note the distinction between the population quantity (e.g. the population mean µ) and the corresponding sample quantity (e.g. the sample mean ȳ).
3.4 Simple Random Sampling Without Replacement (SRSWOR)
This forms the basis of many other more complex sampling plans and is the ‘gold standard’ against which all other sampling plans are compared. It often happens that more complex sampling plans consist of a series of simple random samples that are combined in a complex fashion. In this design, once the frame of units has been enumerated, a sample of size n is selected without replacement from the N population units.
Refer to the previous sections for an illustration of how the units will be selected.
3.4.1 Summary of main results
It turns out that for a simple random sample, the sample mean (ȳ) is the best estimator for the population mean (µ). The population total is estimated by multiplying the sample mean by the POPULATION size. And a proportion is estimated by simply coding results as 0 or 1 depending on whether the sampled unit belongs to the class of interest, and taking the mean of these 0/1 values. (Yes, this really does work - refer to a later section for more details.)

As with every estimate, a measure of precision is required. We saw in an earlier chapter that the standard error (se) is such a measure. Recall that the standard error measures how variable the results of our survey would be if the survey were to be repeated. The standard error for the sample mean looks very similar to that for a sample mean from a completely randomized design (refer to later chapters) with a common correction of a finite population factor (the (1 − f) term). The standard error for the population total estimate is found by multiplying the standard error for the mean by the POPULATION SIZE. The standard error for a proportion is found, again, by treating each data value as 0 or 1 and applying the same formula as the standard error for a mean. The following table summarizes the main results:

Parameter    Population value   Estimator                     Estimated se
Mean         µ                  µ̂ = ȳ                         √( s²/n × (1 − f) )
Total        τ                  τ̂ = N × µ̂ = N ȳ               N × se(µ̂) = N √( s²/n × (1 − f) )
Proportion   π                  π̂ = p (mean of 0/1 codes)     √( p(1 − p)/(n − 1) × (1 − f) )
Notes:

• Inflation factor: The term N/n is called the inflation factor, and the estimator for the total is sometimes called the expansion estimator or the simple inflation estimator.

• Sampling weight: Many statistical packages that analyze survey data will require the specification of a sampling weight. A sampling weight represents how many units in the population are represented by this unit in the sample. In the case of a simple random sample, the sampling weight is equal to N/n. For example, if you select 10 units at random from 150 units in the population, the sampling weight for each observation is 15, i.e. each unit in the sample represents 15 units in the population. The sampling weights are computed differently for various designs and so won't always be equal to N/n.
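The sampling-weight idea for a SRS can be checked directly with the numbers in the note above (10 units from a population of 150, so each weight is 15); the measurements themselves are invented for illustration.

```python
# Expansion (inflation) estimator via sampling weights for a simple random sample
N, n = 150, 10
sample_values = [4, 7, 5, 6, 3, 8, 5, 4, 6, 7]       # illustrative measurements
weight = N / n                                       # each sampled unit represents 15 units
est_total = sum(weight * y for y in sample_values)   # identical to N * (sample mean)
```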
• Sampling fraction: The term n/N is called the sampling fraction and is denoted as f.

• Finite population correction (fpc): The term (1 − f) is called the finite population correction factor and reflects the fact that if you sample a substantial part of the population, the standard error of the estimator is smaller than what would be expected from experimental design results. If f is less than 5%, this term is often ignored. In most ecological studies the sampling fraction is usually small enough that all of the fpc terms can be ignored.
3.4.2 Estimating the Population Mean
The first line of the above table shows the "basic" results, and all the remaining lines in the table can be derived from this line as will be shown later. The population mean (µ) is estimated by the sample mean (ȳ). The estimated se of the sample mean is

se(ȳ) = √( s²/n × (1 − f) ) = (s/√n) × √(1 − f)

Note that if the sampling fraction (f) is small, then the standard error of the sample mean can be approximated by:

se(ȳ) ≈ √( s²/n ) = s/√n

which is the familiar form seen previously. In general, the standard error formula changes depending upon the sampling method used to collect the data and the estimator used on the data. Every different sampling design has its own way of computing the estimator and se. Confidence intervals for parameters are computed in the usual fashion, i.e. an approximate 95% confidence interval would be found as: estimator ± 2se. Some textbooks use a t-distribution for smaller sample sizes, but most surveys are sufficiently large that this makes little difference.
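The estimator, fpc-corrected standard error, and approximate 95% confidence interval can be wrapped into one small function; the data below are invented for illustration.

```python
import math

def srs_mean_summary(sample, N):
    """Estimate a population mean from a SRS: returns (ybar, se, approx 95% CI)."""
    n = len(sample)
    ybar = sum(sample) / n
    s2 = sum((y - ybar) ** 2 for y in sample) / (n - 1)
    f = n / N                                  # sampling fraction
    se = math.sqrt(s2 / n * (1 - f))           # fpc-corrected standard error
    return ybar, se, (ybar - 2 * se, ybar + 2 * se)

# Illustrative data: 5 units measured from a population of N = 100
ybar, se, ci = srs_mean_summary([12, 15, 11, 14, 13], N=100)
```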
3.4.3 Estimating the Population Total
Many students find this part confusing because of the term population total. This does NOT refer to the total number of units in the population, but rather to the sum of the individual values over the units. For example, if you are interested in estimating total timber volume in an inventory unit, the trees are the sampling units. A sample of trees is selected to estimate the mean volume per tree. The total timber volume over all trees in the inventory unit is of interest, not the total number of trees in the inventory unit. As the population total is found as N µ (total population size times the population mean), a natural estimator is formed by the product of the population size and the sample mean, i.e. τ̂ = N ȳ. Note that you must multiply by the population size, not the sample size.
Its estimated se is found by multiplying the estimated se for the sample mean by the population size as well, i.e.

se(τ̂) = N √( s²/n × (1 − f) )

In general, estimates for population totals in most sampling designs are found by multiplying estimates of population means by the population size. Confidence intervals are found in the usual fashion.
3.4.4 Estimating Population Proportions
A “standard trick” used in survey sampling when estimating a population proportion is to replace the response variable by a 0/1 code and then treat this coded data in the same way as ordinary data.

For example, suppose you were interested in the proportion of fish in a catch that were of a particular species. A sample of 10 fish was selected (of course, in the real world a larger sample would be taken), and the following data were observed (S=sockeye, C=chum):

S C C S S S S C S S
Of the 10 fish sampled, 3 were chum, so the sample proportion of fish that were chum is 3/10 = 0.30. If the data are recoded using 1=chum, 0=sockeye, the sample values would be:

0 1 1 0 0 0 0 1 0 0
The sample average of these numbers gives ȳ = 3/10 = 0.30, which is exactly the proportion seen. It is not surprising, then, that by recoding the sample using 0/1 variables, the first line in the summary table reduces to the last line in the summary table. In particular, s² reduces to np(1 − p)/(n − 1), resulting in the se seen above. Confidence intervals are computed in the usual fashion.
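The claim that s² reduces to np(1 − p)/(n − 1) under 0/1 coding can be verified numerically with the recoded fish data:

```python
# 1 = chum, 0 = sockeye, from the example above
codes = [0, 1, 1, 0, 0, 0, 0, 1, 0, 0]
n = len(codes)
p = sum(codes) / n                                   # sample proportion = sample mean

# Ordinary sample variance of the 0/1 data...
s2_direct = sum((y - p) ** 2 for y in codes) / (n - 1)
# ...equals the proportion formula exactly
s2_formula = n * p * (1 - p) / (n - 1)
```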
3.4.5 Example - estimating total catch of fish in a recreational fishery
This section illustrates the concepts of the previous sections using a very small example.
For management purposes, it is important to estimate the total catch by recreational fishers. Unfortunately, there is no central registry of fishers, nor is there a central reporting station. Consequently, surveys are often used to estimate the total catch.

There are two common survey designs used in these types of surveys (generically called creel surveys). In access surveys, observers are stationed at access points to the fishery. For example, if fishers go out in boats to catch the fish, the access points are the marinas where the boats are launched and returned. From these access points, a sample of fishers is selected and interviews conducted to measure the number of fish captured and other attributes. Roving surveys are commonly used when there is no common access point and you can move among the fishers. In this case, the observer moves about the fishery and questions anglers as they are encountered. Note that in this last design, the chances of encountering an angler are no longer equal; there is a greater chance of encountering an angler who has a longer fishing episode. And you typically don't encounter the angler at the end of the episode, but somewhere in the middle of it. The analysis of roving surveys is more complex - seek help.

The following example is based on a real-life example from British Columbia. The actual survey is much larger, involving several thousand anglers and sample sizes in the low hundreds, but the basic idea is the same.

An access survey was conducted to estimate the total catch at a lake in British Columbia. Fortunately, access to the lake takes place at a single landing site, and most anglers use boats in the fishery. An observer was stationed at the landing site but, because of time constraints, could only interview a portion of the returning anglers; however, the observer was able to get a total count of the number of fishing parties on that day.
A total of 168 fishing parties arrived at the landing during the day, of which 30 were sampled. The decision to sample a fishing party was made using a random number table as the boat returned to the dock. The objectives are to estimate the total number of anglers and their catch, and to estimate the proportion of boat trips (fishing parties) that had sufficient life-jackets for the members on the trip. Here is the raw data
- each line gives the results for one fishing party:

Number of    Party    Sufficient
 Anglers     Catch    Life Jackets?
    1          1          yes
    3          1          yes
    1          2          yes
    1          2          no
    3          2          no
    3          1          yes
    1          0          no
    1          0          no
    1          1          yes
    1          0          yes
    2          0          yes
    1          1          yes
    2          0          yes
    1          2          yes
    3          3          yes
    1          0          no
    1          0          yes
    2          0          yes
    3          1          yes
    1          0          yes
    2          0          yes
    1          1          yes
    1          0          yes
    1          0          yes
    1          0          no
    2          0          yes
    2          1          no
    1          1          no
    1          0          yes
    1          0          yes
What is the population of interest?

The population of interest is NOT the fish in the lake. The Fisheries Department is not interested in estimating the characteristics of the fish, such as mean fish weight or the number of fish in the lake. Rather, the focus is on the anglers and fishing parties. Refer to the FAQ at the end of the chapter for more details.

It would be tempting to conclude that the anglers on the lake are the population of interest. However, note that information is NOT gathered on individual anglers. For example, the number of fish captured by each angler in the party is not recorded - only the total fish caught by the party. Similarly, it is impossible to
say if each angler had an individual life jacket - if there were 3 anglers in the boat and only two life jackets, which angler was without?¹

For this reason, the population of interest is taken to be the set of boats fishing at this lake. The fisheries agency doesn't really care about the individual anglers, because if a boat with 3 anglers catches one fish, the actual person who caught the fish is not recorded. Similarly, if there are only two life jackets, does it matter which angler didn't have the jacket? Under this interpretation, the design is a simple random sample of boats returning to the landing.
What is the frame?

The frame for a simple random sample is a listing of ALL the units in the population. This list is then used to randomly select which units will be measured. In this case, there is no physical list, and the frame is conceptual. A random number table was used to decide which fishing parties to interview.
What is the sampling design and sampling unit?

The sampling design will be treated as if it were a simple random sample of all boats (fishing parties) returning, but in actual fact it was likely a systematic sample or a variant. As you will see later, this may or may not be a problem.

In many cases, special attention should be paid to identifying the correct sampling unit. Here the sampling unit is a fishing party or boat, i.e. the boats were selected, not individual anglers. This mistake is often made when the data are presented on an individual basis rather than on a sampling-unit basis. As you will see in later chapters, this is an example of pseudo-replication.
Excel analysis

As mentioned earlier, Excel should be used with caution in statistical analysis. However, for very simple surveys, it is an adequate tool. A copy of a sample Excel worksheet called creel is available in the AllofData workbook in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. Here is a condensed view of the spreadsheet within the workbook:

¹ If data were collected on individual anglers, then the anglers could be taken as the population of interest. However, in this case, the design is NOT a simple random sample of anglers. Rather, as you will see later in the course, the design is a cluster sample where a simple random sample of clusters (boats) was taken and all members of the cluster (the anglers) were interviewed. A cluster sample can be viewed as a simple random sample if you define the population in terms of clusters.
© 2012 Carl James Schwarz, December 21, 2012
The analysis proceeds in a series of logical steps, as illustrated for the number-of-anglers-in-each-party variable.

Enter the data on the spreadsheet. The metadata (information about the survey) is entered at the top of the spreadsheet. The actual data are entered in the middle of the sheet. One row is used to list the variables recorded for each angling party.

Obtain the required summary statistics. At the bottom of the data, the summary statistics needed are computed using the Excel built-in functions. These include the sample size, the sample mean, and the sample standard deviation.

Obtain estimates of the population quantity. Because the sample mean is the estimator for the population mean if the design is a simple random sample, no further computations are needed. In order to estimate the total number of anglers, we multiply the average number of anglers in each fishing party (1.533 anglers/party) by the POPULATION SIZE (the number of fishing parties for the entire day = 168) to get the estimated total number of anglers (257.6).

Obtain estimates of precision - standard errors. The se for the sample mean is computed using the formula presented earlier. The estimated standard error OF THE MEAN is 0.128 anglers/party. Because we found the estimated total by multiplying the estimate of the mean number of anglers/boat trip by the number of boat trips (168), the estimated standard error of the POPULATION TOTAL is found by multiplying the standard error of the sample mean by the same value: 0.128 × 168 = 21.5 anglers. Hence, a 95% confidence interval for the total number of anglers fishing this day is found as 257.6 ± 2(21.5).

Estimating total catch. In the next column, a similar procedure is followed to estimate the total catch.

Estimating proportion of parties with sufficient life-jackets. First, the character values yes/no are translated into 0,1 variables using the IF statement of Excel.
Then the EXACT same formula as used for estimating the total number of anglers or the total catch is applied to the 0,1 data!
We estimate that 73.3% of boats have sufficient life-jackets with a se of 7.4 percentage points.
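The same arithmetic can be sketched outside Excel. The following is a minimal Python illustration, not the spreadsheet itself; the 0/1 life-jacket column is reconstructed as 22 "yes" and 8 "no" responses, which is consistent with the reported estimate of 73.3%:

```python
import math

# Reconstructed 0/1 life-jacket data: 22 of the 30 parties had enough jackets
# (consistent with the reported proportion of 73.3%).
data = [1] * 22 + [0] * 8

N = 168          # population size: boat trips for the day
n = len(data)    # sample size

mean = sum(data) / n                               # sample proportion
s2 = sum((y - mean) ** 2 for y in data) / (n - 1)  # sample variance
se_mean = math.sqrt(s2 / n * (1 - n / N))          # se with finite-population correction

total = N * mean       # estimated number of boat trips with enough jackets
se_total = N * se_mean

print(round(mean, 4), round(se_mean, 4))   # 0.7333 and about 0.0744
```

The finite-population correction (1 − n/N) is the same one used throughout this chapter for simple random samples.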
SAS analysis SAS (Version 8 or higher) has procedures for analyzing survey data. Copies of the sample SAS program called creel.sas and the output called creel.lst are available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The program starts with the Data step that reads in the data and creates the metadata so that the purpose of the program, how the data were collected, etc., are not lost.
data creel;   /* read in the survey data */
   input angler catch lifej $;
   enough = 0;
   if lifej = 'yes' then enough = 1;
The first section of code reads the data and computes the 0,1 variable from the life-jacket information. A Proc Print (not shown) lists the data so that it can be verified that the data were read correctly. Most programs for dealing with survey data require that sampling weights be available for each observation.
data creel;
   set creel;
   sampweight = 168/30;
run;
A sampling weight is the weighting factor representing how many units in the population this observation represents. In this case, each of the 30 parties represents 168/30 = 5.6 parties in the population. Finally, Proc SurveyMeans is used to estimate the quantities of interest.
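The way sampling weights enter an estimate can be sketched in a few lines. This is a Python illustration of the weighted (Horvitz-Thompson-style) total, not the SAS internals; the observed total of 46 anglers is reconstructed from the reported mean of 1.533 anglers/party over 30 parties:

```python
# Each of the 30 sampled parties carries weight 168/30 = 5.6,
# i.e. it "stands in" for 5.6 parties in the population.
n, N = 30, 168
sampweight = N / n

observed_anglers = 46   # reconstructed: 30 parties x 1.533 anglers/party

# Weighted total: each observed angler count is scaled up by its weight.
est_total = sampweight * observed_anglers

print(round(est_total, 1))   # 257.6, matching the spreadsheet estimate
```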
proc surveymeans data=creel total=168   /* total population size */
      mean clm    /* find estimates of mean, its se, and a 95% confidence interval */
      sum clsum   /* find estimates of total, its se, and a 95% confidence interval */
      ;
   var angler catch lifej;   /* estimate mean and total for numeric variables; proportions for character variables */
   weight sampweight;
2012 Carl James Schwarz
140
December 21, 2012
   /* Note that it is not necessary to use the coded 0/1 variables in this procedure */
run;
It is not necessary to code any formulae as these are built into the SAS procedure. So how does the SAS program know this is a simple random sample? This is the default analysis - more complex designs require additional statements (e.g. a CLUSTER statement) to indicate a more complex design. As well, equal sampling weights indicate that all items were selected with equal probability. Here are portions of the SAS output:

Data Summary
Number of Observations    30
Sum of Weights           168

Class Level Information
Class Variable   Levels   Values
lifej                 2   no yes

Statistics
                                Std Error
Variable  Level       Mean       of Mean         95% CL for Mean           Sum      Std Dev          95% CL for Sum
angler            1.533333      0.128419   1.27068638  1.79598029   257.600000    21.574442   213.475312  301.724688
catch             0.666667      0.139688   0.38097171  0.95236162   112.000000    23.467659    64.003248  159.996752
lifej     no      0.266667      0.074425   0.11444970  0.41888363    44.800000    12.503462    19.227550   70.372450
lifej     yes     0.733333      0.074425   0.58111637  0.88555030   123.200000    12.503462    97.627550  148.772450

All of the results match those from the Excel spreadsheet.
3.5 Sample size determination for a simple random sample
I cannot emphasize too strongly the importance of planning in advance of the survey. There are many surveys where the results are disappointing. For example, a survey of anglers may show that the mean catch per angler is 1.3 fish but that the standard error is .9 fish. In other words, a 95% confidence interval stretches from below 0 to over 3 fish per angler - something that was known with near certainty even before the survey was conducted. In many cases, a back-of-the-envelope calculation would have shown that the
2012 Carl James Schwarz
141
December 21, 2012
precision obtained from a survey would be inadequate at the proposed sample size even before the survey was started. In order to determine the appropriate sample size, you will need to first specify some measure of precision that must be obtained. For example, a policy decision may require that the results be accurate to within 5% of the true value. This precision requirement usually occurs in one of two formats:

• an absolute precision, i.e. you wish to be 95% confident that the sample mean will not vary from the population mean by a pre-specified amount. For example, a 95% confidence interval for the total number of fish captured should be ± 1,000 fish.
• a relative precision, i.e. you wish to be 95% confident that the sample mean will be within 10% of the true mean.

The latter is more common than the former, but the two are equivalent and interchangeable. For example, if the actual estimate is around 200, with a se of about 50, then the 95% confidence interval is ± 100 and the relative precision is within 50% of the true answer (± 100/200). Conversely, a 95% confidence interval that is within ± 40% of the estimate of 200 turns out to be ± 80 (40% of 200), and consequently the se is around 40 (= 80/2).

A common question is: What is the difference between se/est and 2se/est? When is the relative standard error divided by 2? Does se/est have anything to do with a 95% CI? Precision requirements are stated in different ways (replace blah below by mean/total/proportion, etc.).
Expression                                               Mathematics
- within xxx of the blah                                 se = xxx
- margin of error of xxx                                 2se = xxx
- within xxx of the true value 19 times out of 20        2se = xxx
- within xxx of the true value 95% of the time           2se = xxx
- the width of the 95% confidence interval is xxx        4se = xxx
- within 10% of the blah                                 se/est = .10
- a rse of 10%                                           se/est = .10
- a relative error of 10%                                se/est = .10
- within 10% of the blah 95% of the time                 2se/est = .10
- within 10% of the blah 19 times out of 20              2se/est = .10
- margin of error of 10%                                 2se/est = .10
- width of 95% confidence interval = 10% of the blah     4se/est = .10
As a rough rule of thumb, the following are often used as survey precision guidelines:

• For preliminary surveys, the 95% confidence interval should be ± 50% of the estimate. This implies that the target rse is 25%.
• For management surveys, the 95% confidence interval should be ± 25% of the estimate. This implies that the target rse is 12.5%.
• For scientific work, the 95% confidence interval should be ± 10% of the estimate. This implies that the target rse is 5%.

Next, some preliminary guess for the standard deviation of individual items in the population (S) is needed, along with an estimate of the population size (N) and possibly the population mean (µ) or population total (τ). These need not be precise and can be obtained from:

• a pilot study;
• previous sampling of similar populations;
• expert opinion.
A very rough estimate of the standard deviation can be found as the usual range of the data divided by 4. If the population proportion is unknown, the value of 0.5 is often used, as this leads to the largest sample size requirement and is therefore a conservative guess.

These are then used with the formulae for the confidence interval to determine the relevant sample size. Many textbooks have complicated formulae to do this - it is much easier these days to simply code the formulae in a spreadsheet (see examples) and use either trial and error to find an appropriate sample size, or use the “GOAL SEEKER” feature of the spreadsheet to find the appropriate sample size. This will be illustrated in the example.

As an approximate answer, recall that the se varies as 1/√n. Suppose that the present rse is .075. A rse of 5% is smaller by a factor of .075/.05 = 1.5, which will require an increase of 1.5² = 2.25 in the sample size.

If the raw data are available, you can also do a “bootstrap” selection (with replacement) to investigate the effect of sample size upon the se. For each different bootstrap sample size, estimate the parameter and the rse, and then increase the sample size until the required rse is obtained. This is relatively easy to do in SAS using Proc SurveySelect, which can select samples of arbitrary size. In some packages, such as JMP, sampling is without replacement, so a direct sampling of 3x the observed sample size is not possible. In this case, create a pseudo-data set by pasting 19 copies of the raw data after the original data. Then use Table → Subset → Random Sample Size to get the approximate bootstrap sample. Again compute the estimate and its rse, and increase the sample size until the required precision is obtained.

The final sample size is not to be treated as the exact sample size but more as a guide to the amount of effort that needs to be expended.
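The spreadsheet's trial-and-error search is easy to sketch in code. Here is a minimal Python version (the function name is illustrative, not from any package), using the creel-survey planning figures from the example in the next section: N = 168 boat trips, mean catch µ ≈ 0.667, S ≈ 0.844, and z = 2:

```python
import math

def rel_ci_halfwidth(n, N, S, mu, z=2):
    """Half-width of the z-interval for the mean, with the
    finite-population correction, relative to the mean
    (the 'within x% of the true value' quantity)."""
    se = S / math.sqrt(n) * math.sqrt(1 - n / N)
    return z * se / mu

N, S, mu = 168, 0.844, 0.667

# Trial and error: smallest n whose 95% CI is within 10% of the mean.
n = 2
while rel_ci_halfwidth(n, N, S, mu) > 0.10:
    n += 1
print(n)   # 134 parties - consistent with the "almost 135" found by Goal Seeker

# The closed-form solution from the technical notes below gives the same answer.
eps = 0.10
n_exact = N / (1 + N * (eps * mu / (2 * S)) ** 2)
print(math.ceil(n_exact))
```

The loop and the closed-form formula agree because they are the same inequality solved two different ways.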
Remember that “guesses” are being made for the standard deviation, the required precision, the approximate value of the estimate, etc. Consequently, there really isn’t a defensible difference between a required sample size of 30 and 40. What really is of interest is the order of magnitude of effort required. For example, if your budget allows for a sample size of 20, and the sample size computations show that a sample size of 200 is required, then doing the survey with a sample size of 20 is a waste of time and money. If the required sample size is about 30, then you may be OK with an actual sample size of 20.

If more than one item is being surveyed, these calculations must be done for each item. The largest sample size needed is then chosen. This may lead to conflict, in which case some response items must be dropped or a different sampling method must be used for the other response variables.
Precision essentially depends only on the absolute sample size, not on the fraction of the population sampled. For example, a sample of 1000 people taken from Canada (population of 33,000,000) is just as precise as a sample of 1000 people taken from the US (population of 330,000,000)! This is highly counter-intuitive and will be explored more in class.
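This claim is easy to check numerically. A short Python sketch (the 40% figure mirrors the poll comparison suggested in the example below; the function name is illustrative):

```python
import math

def se_proportion(n, N, p):
    """SE of a sample proportion under SRSWOR, with the
    finite-population correction."""
    return math.sqrt(p * (1 - p) / n) * math.sqrt(1 - n / N)

p, n = 0.40, 1000
se_canada = se_proportion(n, 33_000_000, p)    # Canada
se_us = se_proportion(n, 330_000_000, p)       # United States

# The two standard errors agree to roughly four significant digits:
# the finite-population corrections are 0.99997 and 0.999997.
print(round(se_canada, 6), round(se_us, 6))
```

The only place N enters is the finite-population correction, and with n/N this tiny the correction is essentially 1 in both countries.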
3.5.1 Example - How many angling-parties to survey
We wish to repeat the angler creel survey next year.
• How many angling-parties should be interviewed to be 95% confident of being within 10% of the true mean catch?
• What sample size would be needed to estimate the proportion of boats with sufficient life-jackets to within 3 percentage points 19 times out of 20? In this case we are asking that the 95% confidence interval be ± 0.03, or that the se = 0.015.

The sample size spreadsheet is available in an Excel workbook called SurveySampleSize.xls which can be downloaded from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. A SAS program to compute sample size is also available but, in my opinion, is neither user-friendly nor as flexible for the general user. The code and output are also available in the Sample Program Library referred to above. Here is a condensed view of the spreadsheet:
First note that the computations for sample size require some PRIOR information about the population size, the population mean, or the population proportion. We will use information from the previous survey to help plan future studies.

For example, about 168 boats returned to the landing last year. The mean catch per angling party was about .667 fish/boat. The standard deviation of the catch per party was .844. These values are entered in the spreadsheet in column C. A preliminary sample size of 40 (in green in Column C) was tried. This led to a 95% confidence interval of ± 35%, which did not meet the precision requirements. Now vary the sample size (in green) in column C until the 95% confidence interval (in yellow) is below ± 10%. You will find that you will need to interview almost 135 parties - a very high sampling fraction indeed. The problem for this variable is the very high variation of individual data points. If you are familiar with Excel, you can use the Goal Seeker function to speed the search.

Similarly, the proportion of boats with sufficient life-jackets last year was around 73%. Enter this in the blue areas of Column E. The initial sample size of 20 is too small, as the 95% confidence interval is ± .186 (18 percentage points). Now vary the sample size (in green) until the 95% confidence interval is ± .03. Note that you need to be careful in dealing with percentages - confidence limits are often specified in terms of percentage points rather than percents to avoid problems where percents are taken of percents. This will be explained further in class.

Try using the spreadsheet to compare the precision of a poll of 1000 people taken from Canada (population 33,000,000) and 1000 people taken from the US (population 330,000,000) if both polls have about 40% in favor of some issue.

Technical notes

If you really want to know how the sample size numbers are determined, here is the lowdown.
Suppose that you wish to be 95% sure that the sample mean is within 10% of the true mean. We must solve

z (S/√n) √((N − n)/N) ≤ εµ

for n, where z is the term representing the multiplier for a particular confidence level (for a 95% c.i. use z = 2) and ε is the ‘closeness’ factor (in this case ε = 0.10). Rearranging this equation gives

n = N / (1 + N (εµ/(zS))²)

3.6 Systematic sampling
Sometimes, logistical considerations make a true simple random sample not very convenient to administer. For example, in the previous creel survey, a true random sample would require that a random number be
generated for each boat returning to the marina. In such cases, a systematic sample could be used to select elements. For example, every 5th angler could be selected after a random starting point.
3.6.1 Advantages of systematic sampling
The main advantages of systematic sampling are:

• it is easier to draw units because only one random number is chosen;
• it can be used when a sampling frame is not available but there is a convenient method of selecting items, e.g. the creel survey where every 5th angler is chosen;
• instructions are easier for untrained staff;
• if the population is in random order relative to the variable being measured, the method is equivalent to a SRS. For example, it is unlikely that the number of anglers in each boat changes dramatically over the period of the day. This is an important assumption that should be investigated carefully in any real life situation!
• it distributes the sample more evenly over the population. Consequently, if there is a trend, you will get items selected from all parts of the trend.
3.6.2 Disadvantages of systematic sampling
The primary disadvantages of systematic sampling are:

• Hidden periodicities or trends may cause biased results. In such cases, estimates of the mean and standard errors may be severely biased! See Section 4.2.2 for a detailed discussion.
• Without making an assumption about the distribution of population units, there is no estimate of the standard error. This is an important disadvantage of a systematic sample! Many studies very casually assume that the systematic sample is equivalent to a simple random sample without much justification.
3.6.3 How to select a systematic sample
There are several methods, depending on whether you know the population size, etc. Suppose we need to choose every kth record, where k is chosen to meet sample size requirements - an example of choosing k will be given in class. All of the following methods are equivalent if k divides N exactly. These are the two most common methods.
• Method 1. Choose a random number j from 1, . . . , k. Then choose records j, j + k, j + 2k, . . . . One problem is that different samples may be of different sizes - an example will be given in class where k doesn’t divide N exactly. This causes problems in sampling theory, but not too much of a problem if n is large.
• Method 2. Choose a random number from 1, . . . , N. Choose every kth item, continuing in a circle when you reach the end, until you have selected n items. This will always give you the same sized sample; however, it requires knowledge of N.
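The two selection schemes can be sketched in Python (an illustration, not from the course software; the N = 168 and k = 5 figures echo the creel example):

```python
import random

def systematic_method1(N, k, rng=random):
    """Method 1: random start j in 1..k, then every kth unit.
    The sample size varies when k does not divide N exactly."""
    j = rng.randint(1, k)
    return list(range(j, N + 1, k))

def systematic_method2(N, n, k, rng=random):
    """Method 2: random start anywhere in 1..N, every kth unit,
    wrapping around the end of the list until n units are chosen."""
    start = rng.randint(1, N)
    return [(start - 1 + i * k) % N + 1 for i in range(n)]

# With N = 168 boats and k = 5, Method 1 yields 33 or 34 units
# depending on the random start; Method 2 always yields exactly n.
print(len(systematic_method1(168, 5)))      # 33 or 34
print(len(systematic_method2(168, 34, 5)))  # 34
```

In Method 2 the wrap-around never revisits a unit here because k = 5 and N = 168 share no common factor.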
3.6.4 Analyzing a systematic sample
Most surveys casually assume that the population was sorted in random order when the systematic sample was selected, and so treat the results as if they had come from a SRSWOR. This is theoretically not correct, and if the assumption is false, the results may be biased - and there is no way of examining the biases from the data at hand. Before implementing or analyzing a systematic survey, please consult with an expert in sampling theory to avoid problems. This is a case where an hour or two of consultation before spending lots of money could potentially turn a survey where nothing can be estimated into a survey that has justifiable results.
3.6.5 Technical notes - Repeated systematic sampling
To avoid many of the potential problems with systematic sampling, a common device is to use repeated systematic samples on the same population. For example, rather than taking a single systematic sample of size 100 from a population, you can take 4 systematic samples (with different starting points) of size 25.

An empirical method of obtaining a standard error from a systematic sample is to use this repeated sampling. Rather than choosing one systematic subsample of every kth unit, choose m independent systematic subsamples, each of size n/m. Then estimate the mean of each subsample. Treat these means as a simple random sample from the population of possible systematic samples and use the usual sampling theory. The variation of the estimate among the subsamples provides an estimate of the standard error (after an appropriate adjustment). This will be illustrated in an example.
Example of replicated subsampling within a systematic sample

A yearly survey has been conducted in the Prairie Provinces to estimate the number of breeding pairs of ducks. One breeding area has been divided into approximately 1000 transects of a certain width, i.e. the breeding area was divided into 1000 strips.
What is the population of interest? As noted in class, the definition of a population depends, in part, upon the interest of the researcher. Two possible definitions are:

• The population is the set of individual ducks on the study area. However, no frame exists for the individual birds. But a frame can be constructed based on the 1000 strips that cover the study area. In this case, the design is a cluster sample, with the clusters being strips.
• The population consists of the 1000 strips that cover the study area, and the number of ducks in each strip is the response variable. The design is then a simple random sample of the strips.

In either case, the analysis is exactly the same and the final estimates are exactly the same.

Approximately 100 of the transects are flown by an aircraft, and spotters on the aircraft count the number of breeding pairs visible from the aircraft. For administrative convenience, it is easier to conduct systematic sampling. However, there is structure to the data; it is well known that ducks do not spread themselves randomly throughout the breeding area. After discussions with our Statistical Consulting Service, the researchers flew 10 sets of replicated systematic samples; each set consisted of 10 transects. As each transect is flown, the scientists also classify it as ‘prime’ or ‘non-prime’ breeding habitat. Here are the raw data reporting the number of nests in each set of 10 transects:
       Prime        Non-Prime       ALL     Prime   Non-prime
       Habitat      Habitat         Total   mean    mean      Diff
Set    Total   n    Total     n     (a)     (c)     (d)       (e)
                    (b)
 1      123    3      345     7      468    41.0      49.3     -8.3
 2       57    2       36     8       93    28.5       4.5     24.0
 3       85    5       46     5      131    17.0       9.2      7.8
 4       97    2      131     8      228    48.5      16.4     32.1
 5       34    5       43     5       77     6.8       8.6     -1.8
 6       85    3       67     7      152    28.3       9.6     18.8
 7       56    7       64     3      120     8.0      21.3    -13.3
 8       46    2       65     8      111    23.0       8.1     14.9
 9       37    4       43     6       80     9.3       7.2      2.1
10       93    2      104     8      197    46.5      13.0     33.5

            Prime     ALL
Avg          71.3    165.7      Est mean (Diff)   10.97
s            29.5    117.0      s                 16.38
n              10       10      n                    10
Est total    7130    16570      Est se             4.91
Est se        885     3510
Several different estimates can be formed.

1. Total number of nests in the breeding area (refer to column (a) above). The total number of nests in the breeding area for all types of habitat is of interest. Column (a) in the above table is the data that will be used. It represents the total number of nests in the 10 transects of each set. The principle behind the estimator is that the 1000 total transects can be divided into 100 sets of 10 transects, of which a random sample of size 10 was chosen. The sampling unit is the set of transects - the individual transects are essentially ignored. Note that this method assumes that the systematic samples are all of the same size. If the systematic samples had been of different sizes (e.g. some sets had 15 transects, other sets had 5 transects), then a ratio estimator (see later sections) would have been a better estimator.

• Compute the total number of nests for each set. This is found in column (a).
• Then the sets selected are treated as a SRSWOR sample of size 10 from the 100 possible sets. An estimate of the mean number of nests per set of 10 transects is found as

  µ̂ = (468 + 93 + · · · + 197)/10 = 165.7

  with an estimated se of

  se(µ̂) = √( s²/n (1 − n/100) ) = √( 117.0²/10 (1 − 10/100) ) = 35.1

• The average number of nests per set is expanded to cover all 100 sets: τ̂ = 100 µ̂ = 16570, with se(τ̂) = 100 se(µ̂) = 3510.

2. Total number of nests in the prime habitat only (refer to column (b) above). This is formed in exactly the same way as the previous estimate. This is technically known as estimation in a domain. The number of elements of the domain in the whole population (i.e. how many of the 1000 transects are in prime habitat) is unknown, but is not needed. All that you need is the total number of nests in prime habitat in each set - you essentially ignore the non-prime habitat transects within each set.

• The average number of nests per set in prime habitats is found as before:

  µ̂ = (123 + · · · + 93)/10 = 71.3

  with an estimated se of

  se(µ̂) = √( s²/n (1 − n/100) ) = √( 29.5²/10 (1 − 10/100) ) = 8.85

• Because there are 100 sets of transects in total, the estimate of the population total number of nests in prime habitat and its estimated se are τ̂ = 100 µ̂ = 7130 with se(τ̂) = 100 se(µ̂) = 885.
• Note that the total number of transects of prime habitat is not known for the population, and so an estimate of the density of nests in prime habitat cannot be computed from this estimated total. However, a ratio estimator (see later in the notes) could be used to estimate the density.

3. Difference in mean density between prime and non-prime habitats. The scientists suspect that the density of nests is higher in prime habitat than in non-prime habitat. Is there evidence of this in the data? (Refer to columns (c)-(e) above.) Here everything must be transformed to the density of nests per transect (assuming that the transects were all the same size). Also, pairing (refer to the section on experimental design) is taking place, so a difference must be computed for each set and the differences analyzed, rather than trying to treat the prime and non-prime habitats as independent samples. Again, this is an example of what is known as domain estimation.
• Compute the domain means for each type of habitat for each set (columns (c) and (d)). Note that the totals are divided by the number of transects of each type in each set.
• Compute the difference in the means for each set (column (e)).
• Treat this difference as a simple random sample of size 10 taken from the 100 possible sets of transects. What do the final estimated mean difference and its se imply?
3.7 Stratified simple random sampling
A simple modification to a simple random sample can often lead to dramatic improvements in precision. This is known as stratification. All survey methods can potentially benefit from stratification (also known as blocking in the experimental design literature). Stratification will be beneficial whenever variability in the response variable among the survey units can be anticipated and strata can be formed that are more homogeneous than the original set of survey units. All stratified designs will have the same basic steps as listed below regardless of the underlying design.
• Creation of strata. Stratification begins by grouping the survey units into homogeneous groups (strata) where survey units within strata should be similar and strata should be different. For example, suppose you wished to estimate the density of animals. The survey region is divided into a large number of quadrats based on aerial photographs. The quadrats can be stratified into high and low quality habitat because it is thought that the density within the high quality quadrats may be similar but different from the density in the low quality habitats. The strata do not have to be physically contiguous - for example, the high quality habitats could be scattered throughout the survey region and can be grouped into one single stratum.

• Determine total sample size. Use the methods in previous sections to determine the total sample size (number of survey units) to select. At this stage, some sort of “average” standard deviation will be used to determine the sample size.

• Allocate effort among the strata. There are several ways to allocate the total effort among the strata.
  – Equal allocation. In equal allocation, the total effort is split equally among all strata. Equal allocation is preferred when equally precise estimates are required for each stratum. [2]
  – Proportional allocation. In proportional allocation, the total effort is allocated to the strata in proportion to stratum importance. Stratum importance could be related to stratum size (e.g. when allocating effort between the U.S. and Canada, then because the U.S. is 10 times larger than Canada, more effort should be allocated to surveying the U.S.). But if density is your measure of importance, allocate more effort to higher density strata. Proportional allocation is preferred when more precise estimates are required in more important strata.
  – Neyman allocation. Neyman determined that if you also have information on the variability within each stratum, then more effort should be allocated to strata that are more important and more variable to give you the most precise overall estimate for a given sample size. This is rarely done in ecology because information on intra-stratum variability is often unknown. [3]
  – Cost allocation. In general, effort should be allocated to more important strata, more variable strata, or strata where sampling is cheaper to give the best overall precision for the entire survey. As in the previous allocation method, ecologists rarely have sufficiently detailed cost information to do this allocation method.

• Conduct separate surveys in each stratum. Separate independent surveys are conducted in each stratum. It is not necessary to use the same survey method in all strata. For example, low density quadrats could be surveyed using aerial methods, while high density strata may require ground based methods. Some strata may use simple random samples, while other strata may use cluster samples. Many textbooks show examples where the same survey method is used in all strata, but this is NOT required. The ability to use different sampling methods in the different strata often leads to substantial cost savings and is a very good reason to use stratified sampling!

• Obtain stratum specific estimates. Use the appropriate estimators to estimate stratum means and the se for EACH stratum. Then expand the estimated mean to get the estimated total (and se) in the usual way.

[2] Recall from previous sections that the absolute sample size is one of the drivers for precision.
[3] However, in many cases, higher means per survey unit are accompanied by greater variances among survey units, so allocations based on stratum means often capture this variation as well.
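The allocation rules above can be sketched numerically. A hedged Python illustration with made-up stratum sizes and guessed standard deviations (none of these numbers come from the text):

```python
def proportional_allocation(n, stratum_sizes):
    """Split total effort n in proportion to stratum size N_h."""
    total = sum(stratum_sizes)
    return [round(n * N_h / total) for N_h in stratum_sizes]

def neyman_allocation(n, stratum_sizes, stratum_sds):
    """Split effort in proportion to N_h * S_h: larger, more variable
    strata receive more of the sample."""
    weights = [N_h * S_h for N_h, S_h in zip(stratum_sizes, stratum_sds)]
    total = sum(weights)
    return [round(n * w / total) for w in weights]

# Hypothetical example: three strata of 500, 300, and 200 quadrats,
# with guessed standard deviations of 2, 8, and 4.
sizes, sds = [500, 300, 200], [2.0, 8.0, 4.0]
print(proportional_allocation(60, sizes))   # [30, 18, 12]
print(neyman_allocation(60, sizes, sds))    # weights 1000, 2400, 800 -> [14, 34, 11]
```

Note how Neyman allocation shifts effort from the large but homogeneous first stratum to the highly variable second one; rounding means the allocations need not sum exactly to n.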
• Rollup. The individual stratum estimates of the TOTAL are then combined to give an overall Grand Total value for the entire survey region. The se of the Grand Total is found as

se(GT̂) = √( se(τ̂₁)² + se(τ̂₂)² + · · · )

Finally, if you want the overall grand average, simply divide the grand total (and its se) by the appropriate divisor.
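The rollup step in code form - a small sketch with hypothetical stratum estimates (the numbers are illustrative only):

```python
import math

def rollup(stratum_totals, stratum_ses):
    """Combine independent stratum estimates: the totals add, and the se of
    the grand total is the square root of the sum of squared stratum se's."""
    grand_total = sum(stratum_totals)
    se_grand = math.sqrt(sum(se ** 2 for se in stratum_ses))
    return grand_total, se_grand

# Hypothetical estimated totals (and se's) from three independent strata.
gt, se = rollup([7130, 4200, 2500], [885, 300, 150])
print(gt, round(se, 1))   # 13830 and about 946.4
```

The squared-se addition is valid because the stratum surveys are independent; the combined se is dominated by the least precise stratum.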
Stratification is normally carried out prior to the survey (pre-stratification), but can also be done after the survey (post-stratification) - refer to a later section for details. Stratification can be used with any type of sampling design - the concepts introduced here deal with stratification applied to simple random samples but are easily extended to more complex designs. The advantages of stratification are:

• standard errors of the mean or of the total will be smaller (i.e. estimates are more precise) when compared to the respective standard errors from an unstratified design, if the units within strata are more homogeneous (i.e., less variable) than the entire unstratified population.
• different sampling methods may be used in each stratum for cost or convenience reasons. [In the detail below we assume that each stratum has the same sampling method used, but this is only for simplification.] This can often lead to reductions in cost, as the most appropriate and cost-effective sampling method can be used in each stratum.
• because randomization occurs independently in each stratum, corruption of the survey design due to problems experienced in the field may be confined.
• separate estimates for each stratum with a given precision can be obtained.
• it may be more convenient to take a stratified random sample for administrative reasons. For example, the strata may refer to different district offices.
3.7.1 A visual comparison of a simple random sample vs. a stratified simple random sample

You may find it useful to compare a simple random sample of 24 units with a stratified random sample of 24 units using the following visual plans. Select a sample of 24 in each case.
Simple Random Sampling

Describe how the sample was taken.
Stratified Simple Random Sampling

Suppose that there is a gradient in habitat quality across the population. Then a more efficient (i.e. leading to smaller standard errors) sampling design is a stratified design. Three strata are defined, consisting of the first 3 rows, the next 5 rows, and finally, the last two rows. In many cases, the same sampling design is used in all strata. For example, suppose it was decided to conduct a simple random sample within each stratum, with sample sizes of 8, 10, and 6 in the three strata respectively. [The decision process for allocating samples to strata will be covered later.]
Stratified Sampling with a different method in each stratum

It is quite possible, and often desirable, to use different methods in the different strata. For example, it may be more efficient to survey desert areas using a fixed-wing aircraft, while ground surveys are needed in heavily forested areas. Consider the following design: in the first (top-most) stratum, a simple random sample was taken; in the second stratum a cluster sample was taken; in the third stratum a cluster sample (via transects) was also taken.
3.7.2 Notation

Common notation is to use h as a stratum index and i or j as unit indices within each stratum.

Characteristic        Population quantity                       Sample quantity
number of strata      H                                         H
stratum sizes         N_1, N_2, ..., N_H                        n_1, n_2, ..., n_H
units                 Y_hj, h = 1,...,H, j = 1,...,N_h          y_hj, h = 1,...,H, j = 1,...,n_h
stratum totals        τ_h                                       y_h
stratum means         µ_h                                       ȳ_h
population total      τ = N Σ_{h=1}^{H} W_h µ_h, where W_h = N_h/N
population mean       µ = Σ_{h=1}^{H} W_h µ_h
standard deviation    S_h² (variance)                           s_h² (variance)

3.7.3 Summary of main results

It is assumed that from each stratum a SRSWOR of size n_h is selected independently of ALL OTHER STRATA! The results below summarize the computations, which can be thought of as occurring in four steps:

1. Compute the estimated mean and its se for each stratum. In this chapter we use an SRS design in each stratum, but this is not necessary and each stratum could have a different design. In the case of an SRS, the estimate of the mean for each stratum is found as

   µ̂_h = ȳ_h   with standard error   se(µ̂_h) = sqrt( s_h²/n_h (1 − f_h) ),

where the subscript h refers to each stratum.

2. Compute the estimated total and its se for each stratum. In many cases this is simply the estimated mean for the stratum multiplied by the STRATUM POPULATION size. In the case of an SRS in each stratum this gives:

   τ̂_h = N_h × µ̂_h = N_h × ȳ_h   with   se(τ̂_h) = N_h × se(µ̂_h) = N_h × sqrt( s_h²/n_h (1 − f_h) ).

3. Compute the grand total and its se over all strata. The grand total is the sum of the individual totals; the se is computed in a special way:

   τ̂ = τ̂_1 + τ̂_2 + ...   with   se(τ̂) = sqrt( se(τ̂_1)² + se(τ̂_2)² + ... ).

4. Occasionally, the grand mean over all strata is needed. This is found by dividing the estimated grand total by the total POPULATION size:

   µ̂ = τ̂ / (N_1 + N_2 + ...)   with   se(µ̂) = se(τ̂) / (N_1 + N_2 + ...).
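The four steps can be collected into one small function. This is a minimal Python sketch (Python is not part of the original Excel/SAS workflow; the example strata in the comment are illustrative planning guesses):

```python
import math

def stratified_srs_estimates(strata):
    """Steps 1-4 for a stratified design with SRSWOR in every stratum.

    strata: list of (N_h, n_h, ybar_h, s_h) tuples, one per stratum.
    Returns (grand total, its se, grand mean, its se).
    """
    total = var_total = N = 0.0
    for N_h, n_h, ybar_h, s_h in strata:
        f_h = n_h / N_h                                  # sampling fraction
        se_mean = math.sqrt(s_h**2 / n_h * (1 - f_h))    # step 1: se of stratum mean
        se_tau = N_h * se_mean                           # step 2: se of stratum total
        total += N_h * ybar_h
        var_total += se_tau**2
        N += N_h
    se_total = math.sqrt(var_total)                      # step 3: rollup of totals
    return total, se_total, total / N, se_total / N      # step 4: grand mean

# e.g. two strata of 250 units, 15 sampled each, guessed means/sd's:
# stratified_srs_estimates([(250, 15, 322, 226.8), (250, 15, 205, 135.7)])
```

With the example strata shown, the function returns a grand total of 131,750 with se of roughly 16,540, and a grand mean of 263.5 with se of roughly 33.1.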
This can be summarized in a succinct form as follows. Note that the stratum weights W_h are formed as N_h/N and are often used to derive weighted means etc.:

Quantity   Pop value                             Estimator                                      se
Mean       µ = Σ_h W_h µ_h                       µ̂_str = Σ_h W_h ȳ_h                            sqrt( Σ_h W_h² se²(ȳ_h) ) = sqrt( Σ_h W_h² s_h²/n_h (1 − f_h) )
Total      τ = N Σ_h W_h µ_h = Σ_h τ_h           τ̂_str = N Σ_h W_h ȳ_h = Σ_h N_h ȳ_h            sqrt( Σ_h N_h² se²(ȳ_h) ) = sqrt( Σ_h N_h² s_h²/n_h (1 − f_h) )
             = Σ_h N_h µ_h

Notes

• The estimator for the grand population mean is a weighted average of the individual stratum means using the POPULATION weights rather than the sample weights. This is NOT the same as the simple unweighted average of the estimated stratum means unless the n_h/n equal the N_h/N – such a design is known as proportional allocation in stratified sampling.

• The estimated standard error for the grand total is found as sqrt( se_1² + se_2² + ... + se_H² ), i.e. the square root of the sum of the individual se² of the strata TOTALS.

• The estimators for a proportion are IDENTICAL to those for the mean except that the variable of interest is replaced by 0/1, where 1 = character of interest and 0 = character not of interest.

• Confidence intervals. Once the se has been determined, the usual ±2se will give approximate 95% confidence intervals if the sample sizes are relatively large in each stratum. If the sample sizes are small in each stratum, some authors suggest using a t-distribution with degrees of freedom determined using a Satterthwaite approximation – this will not be covered in this course.
3.7.4 Example - sampling organic matter from a lake

[With thanks to Dr. Rick Routledge for this example.]

Suppose that you were asked to estimate the total amount of organic matter suspended in a lake just after a storm. The first scheme that might occur to you could be to cruise around the lake in a haphazard fashion and collect a few sample vials of water which you could then take back to the lab. If you knew the total volume of water in the lake, then you could obtain an estimate of the total amount of organic matter by taking the product of the average concentration in your sample and the total volume of the lake. The accuracy of your estimate of course depends critically on the extent to which your sample is representative of the entire lake. If you used the haphazard scheme outlined above, you have no way of objectively evaluating the accuracy of the sample. It would be more sensible to take a properly randomized sample. (How might you go about doing this?)

Nonetheless, taking a randomized sample from the entire lake would still not be a totally sensible approach to the problem. Suppose that the lake were to be fed by a single stream, and that most of the organic matter were concentrated close to the mouth of the stream. If the sample were indeed representative, then most of the vials would contain relatively low concentrations of organic matter, whereas the few taken from around the mouth of the stream would contain much higher concentration levels. That is, there is a real potential for outliers in the sample. Hence, confidence limits based on the normal distribution would not be trustworthy.

Furthermore, the sample mean is not as reliable as it might be. Its value will depend critically on the number of vials sampled from the region close to the stream mouth. This source of variation ought to be controlled. Finally, it might be useful to estimate not just the total amount of organic matter in the entire lake, but the extent to which this total is concentrated near the mouth of the stream.
You can simultaneously overcome all three deficiencies by taking what is called a stratified random sample. This involves dividing the lake into two or more parts called strata. (These are not the horizontal strata that naturally form in most lakes, although these natural strata might be used in a more complex sampling scheme than the one considered here.) In this instance, the lake could be divided into two parts, one consisting roughly of the area of high concentration close to the stream outlet, the other comprising the remainder of the lake.
Then if a simple random sample of fixed size were to be taken from within each of these “strata”, the results could be used to estimate the total amount of organic matter within each stratum. These subtotals could then be added to produce an estimate of the overall total for the lake. This procedure, because it involves constructing separate estimates for each stratum, permits us to assess the extent to which the organic matter is concentrated near the stream mouth. It also permits the investigator to control the number of vials sampled from each of the two parts of the lake. Hence, the chance variation in the estimated total ought to be sharply reduced. Finally, we shall soon see that the confidence limits that one can construct are free of the outlier problem that invalidated the confidence limits based on a simple random sampling scheme.

A randomized sample is to be drawn independently from within each stratum. How can we use the results of a stratified random sample to estimate the overall total? The simplest way is to construct an estimate of the totals within each of the strata, and then to sum these estimates. A sensible estimate of the average within the h’th stratum is ȳ_h. Hence, a sensible estimate of the total within the h’th stratum is τ̂_h = N_h ȳ_h, and the overall total can be estimated by τ̂ = Σ_{h=1}^{H} τ̂_h = Σ_{h=1}^{H} N_h ȳ_h.

If we prefer to estimate the overall average, we can merely divide the estimate of the overall total by the size of the population, N. The resulting estimator is called the stratified random sampling estimator of the population average, and is given by µ̂ = Σ_{h=1}^{H} N_h ȳ_h / N. This can be expressed as a fancy average if we adjust the order of operations in the above expression. If, instead of dividing the sum by N, we divide each term by N and then sum the results, we shall obtain the same result. Hence,

   µ̂_stratified = Σ_{h=1}^{H} (N_h/N) ȳ_h = Σ_{h=1}^{H} W_h ȳ_h,

where W_h = N_h/N. These W_h-values can be thought of as weighting factors, and µ̂_stratified can then be viewed as a weighted average of the within-stratum sample averages. The estimated standard error is found as:

   se(µ̂_stratified) = se( Σ_{h=1}^{H} W_h ȳ_h ) = sqrt( Σ_{h=1}^{H} W_h² [se(ȳ_h)]² ),

where the estimated se(ȳ_h) is given by the formula for simple random sampling: se(ȳ_h) = sqrt( s_h²/n_h (1 − f_h) ).
A Numerical Example

Suppose that for the lake sampling example discussed earlier the lake were subdivided into two strata, and that the following results were obtained. (All readings are in mg per litre.)

Stratum   N_h         n_h   ȳ_h     s_h    Sample observations
1         7.5 × 10⁸   5     41.52   4.23   37.2  46.6  45.3  38.1  40.4
2         2.5 × 10⁷   5     369.4   25.7   365   344   388   347   403
We begin by computing the estimated mean for each stratum and its associated standard error. The sampling fraction n_h/N_h is so close to 0 that it can be safely ignored. For example, the standard error of the mean for stratum 1 is found as:

   se(µ̂_1) = sqrt( s_1²/n_1 (1 − f_1) ) = sqrt( 4.23²/5 ) = 1.89.

This gives the summary table:

Stratum   n_h   µ̂_h     se(µ̂_h)
1         5     41.52   1.8935
2         5     369.4   11.492

Next, we estimate the total organic matter in each stratum. This is found by multiplying the mean concentration and se of each stratum by the total volume:

   τ̂_h = N_h × µ̂_h,   se(τ̂_h) = N_h × se(µ̂_h).

For example, the estimated total organic matter in stratum 1 is found as:

   τ̂_1 = N_1 × µ̂_1 = 7.5 × 10⁸ × 41.52 = 311.4 × 10⁸
   se(τ̂_1) = N_1 × se(µ̂_1) = 7.5 × 10⁸ × 1.89 = 14.175 × 10⁸

This gives the summary table:

Stratum   n_h   µ̂_h     se(µ̂_h)   τ̂_h           se(τ̂_h)
1         5     41.52   1.8935    311.4 × 10⁸   14.175 × 10⁸
2         5     369.4   11.492    92.3 × 10⁸    2.873 × 10⁸

Next, we total the organic content of the two strata and find the se of the grand total as sqrt( 14.175² + 2.873² ) × 10⁸ to give the summary table:

Stratum   n_h   µ̂_h     se(µ̂_h)   τ̂_h           se(τ̂_h)
1         5     41.52   1.8935    311.4 × 10⁸   14.175 × 10⁸
2         5     369.4   11.492    92.3 × 10⁸    2.873 × 10⁸
Total                             403.7 × 10⁸   14.46 × 10⁸
Finally, the overall grand mean is found by dividing by the total volume of the lake, 7.75 × 10⁸, to give:

   µ̂ = 403.7 × 10⁸ / 7.75 × 10⁸ = 52.09 mg/L
   se(µ̂) = 14.46 × 10⁸ / 7.75 × 10⁸ = 1.87 mg/L
The calculations required to compute the stratified estimate can also be done using the method of weighted averages, as shown in the following table:

Stratum   N_h          W_h (= N_h/N)   ȳ_h     W_h ȳ_h   se(ȳ_h)   W_h² [se(ȳ_h)]²
1         7.5 × 10⁸    0.9677          41.52   40.180    1.8935    3.3578
2         2.5 × 10⁷    0.0323          369.4   11.916    11.492    0.1374
Totals    7.75 × 10⁸   1.0000                  52.097              3.4952
                                                                   se = sqrt(3.4952)
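As a check on the arithmetic, the weighted-average route can be reproduced in a few lines of Python (a sketch, not part of the original notes; the inputs are the stratum summaries from this example):

```python
import math

# Stratum volumes (litres), sample means and their se's (mg/L)
vol = [7.5e8, 2.5e7]
ybar = [41.52, 369.4]
se_ybar = [1.8935, 11.492]

# Stratum weights W_h = N_h / N
W = [v / sum(vol) for v in vol]

# Weighted average of the stratum means, and its se
mu_str = sum(w * y for w, y in zip(W, ybar))
se_mu = math.sqrt(sum((w * s) ** 2 for w, s in zip(W, se_ybar)))
# mu_str ≈ 52.097 mg/L and se_mu ≈ 1.870 mg/L, matching the table
```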
Hence the estimate of the overall average is 52.097 mg/L, the associated estimated standard error is sqrt(3.4952) = 1.870 mg/L, and an approximate 95% confidence interval is then found in the usual fashion. As expected, these match the previous results.

This discussion swept a number of practical difficulties under the carpet. These include (a) estimating the volume of each of the two portions of the lake, (b) taking properly randomized samples from within each stratum, (c) selecting the appropriate size of each water sample, (d) measuring the concentration for each water sample, and (e) choosing the appropriate number of water samples from each stratum. None of these difficulties is simple to resolve. Estimating the volume of a portion of a lake, for example, typically involves taking numerous depth readings and then applying a formula for approximating integrals. This problem is beyond the scope of these notes.

The standard error in the estimator of the overall average is markedly reduced in this example by the stratification. The standard error for the stratified estimator was just estimated to be around 2, for a sample of total size 10. By contrast, for an estimator based on a simple random sample of the same size, the standard error can be found to be about 20. [This involves methods not covered in this class.] Stratification has reduced the standard error by an order of magnitude.

It is also possible that we could reduce the standard error even further without increasing our sampling
effort by somehow allocating this effort more efficiently. Perhaps we should take fewer water samples from the region far from the outlet, and take more from the other stratum. This will be covered later in this course.

One can also read in more comprehensive accounts how to construct estimates from samples that are stratified after the sample is selected. This is known as post-stratification. These methods are useful if, e.g., you are sampling a population with a known sex ratio. If you observe that your sample is biased in favor of one sex, you can use this information to build an improved estimate of the quantity of interest through stratifying the sample by sex after it is collected. It is not necessary that you start out with a plan for sampling some specified number of individuals from each sex (stratum).

Nonetheless, in any survey work, it is crucial that you begin with a plan. There are many examples of surveys that produced virtually useless results because the researchers failed to develop an appropriate plan. This should include a statement of your main objective, and detailed descriptions of how you plan to generate the sample, collect the data, enter them into a computer file, and analyze the results. The plan should contain discussion of how you propose to check for and correct errors at each stage. It should be tested with a pilot survey, and modified accordingly. Major, ongoing surveys should be reassessed continually for possible improvements. There is no reason to expect that the survey design will be perfect the first time that it is tried, nor that flaws will all be discovered in the first round. On the other hand, one should expect that after many years experience, the researchers will have honed the survey into a solid instrument. George Gallup’s early surveys were seriously biased.
Although it took over a decade for the flaws to come to light, once they did, he corrected his survey design promptly, and continued to build a strong reputation. One should also be cautious in implementing stratified survey designs for long-term studies. An efficient stratification of the Fraser Delta in 1994, e.g., might be hopelessly out of date 50 years from now, with a substantially altered configuration of channels and islands. You should anticipate the need to revise your stratification periodically.
3.7.5 Example - estimating the total catch of salmon
DFO needs to monitor the catch of sockeye salmon as the season progresses so that stocks are not overfished. The season in one statistical sub-area in a year was a total of 2 days (!) and 250 vessels participated in the fishery on these 2 days. A census of the catch of each vessel at the end of each day is logistically difficult. In this particular year, observers were randomly placed on selected vessels and at the end of each day the observers contacted DFO managers with a count of the number of sockeye caught on that day.

Here is the raw data – each count is the observer’s count for one vessel on that day. On the second day, a new random sample of vessels was selected. On both days, 250 vessels participated in the fishery.

Date        Sockeye
29-Jul-98   337  730  458   98   82   28  544  415  285  235  571  225   19  623  180
30-Jul-98    97  311   45   58   33  200  389  330  225  182  270  138   86  496  215
What is the population of interest? The population of interest is the set of vessels participating in the fishery on the two days. [The fact that each vessel likely participated in both days is not really relevant.] The population of interest is NOT the salmon captured - this is the response variable for each boat whose total is of interest.
What is the sampling frame? It is not clear how the list of fishing boats was generated. It seems unlikely that the aerial survey actually had a picture of the boats on the water from which DFO selected some boats. More likely, the observers were taken onto the water in some systematic fashion, and then the observer selected a boat at random from those seen at this point. Hence the sampling frame is the set of locations chosen to drop off the observers and the set of boats visible from these points.
What is the sampling design? The sampling unit is a boat on a day. The strata are the two days. On each day, a random sample was selected from the boats participating in the fishery. This is a stratified design with a simple random sample selected each day. Note that in this survey it is logistically impossible to take a simple random sample over both days, as the number of vessels participating really isn’t known for any day until the fishery starts. Here, stratification takes the form of administrative convenience.
Excel analysis

A copy of an Excel spreadsheet is available in the sockeye tab of the AllofData workbook available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. A summary of the page appears below:
The data are listed on the spreadsheet on the left.

Summary statistics. The Excel built-in functions are used to compute the summary statistics (sample size, sample mean, and sample standard deviation) for each stratum. Some caution needs to be exercised that the range of each function covers only the data for that stratum.⁴ You will also need to specify the stratum size (the total number of sampling units in each stratum), i.e. 250 vessels on each day.

Find estimates of the mean catch for each stratum. Because the sampling design in each stratum is a simple random sample, the same formulae as in the previous section can be used. The mean and its estimated se for each day of the opening are reported in the spreadsheet.

Find the estimates of the total catch for each stratum. The estimated total catch is found by multiplying the average catch per boat by the total number of boats participating in the fishery. The estimated standard error for the total for that day is found by multiplying the standard error for the mean by the stratum size, as in the previous section. For example, in the first stratum (29 July), the estimated total catch is found by multiplying the estimated mean catch per boat (322) by the number of boats participating (250) to give an estimated total catch of 80,500 salmon for the day. The se for the total catch is found by multiplying the se of the mean (56.8) by the number of boats participating (250) to give a se of the total catch for the day of about 14,200 salmon.

Find estimate of grand total. Once an estimated total is found for each stratum, the estimated grand total is found by summing the individual stratum estimated totals. The estimated standard error of the grand total is found as the square root of the sum of the squares of the standard errors in each stratum – the Excel function sumsq is useful for this computation.

Estimates of the overall grand mean. This was not done in the spreadsheet, but is easily computed by dividing the total catch by the total number of boat-days in the fishery (250+250=500). The se is found by dividing the se of the total catch also by 500. Note this is interpreted as the mean number of fish captured per day per boat.

⁴ If you are proficient with Excel, Pivot-Tables are an ideal way to compute the summary statistics for each stratum. An application of Pivot-Tables is demonstrated in the analysis of a cluster sample where the cluster totals are needed for the summary statistics.
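The spreadsheet steps can also be reproduced directly from the raw observer counts; here is a Python sketch (Python is not part of the original Excel/SAS analysis):

```python
import math
from statistics import mean, stdev

# Observer counts of sockeye per monitored vessel, by day
catch = {
    "29-Jul-98": [337, 730, 458, 98, 82, 28, 544, 415, 285, 235,
                  571, 225, 19, 623, 180],
    "30-Jul-98": [97, 311, 45, 58, 33, 200, 389, 330, 225, 182,
                  270, 138, 86, 496, 215],
}
N_h = 250  # vessels fishing on each day (stratum size)

grand_total = var_total = 0.0
for day, y in catch.items():
    n_h = len(y)
    # se of the stratum mean under SRSWOR, with finite population correction
    se_mean = math.sqrt(stdev(y) ** 2 / n_h * (1 - n_h / N_h))
    grand_total += N_h * mean(y)          # stratum total = N_h * ybar_h
    var_total += (N_h * se_mean) ** 2     # accumulate se(tau_h)^2

se_total = math.sqrt(var_total)
grand_mean = grand_total / (2 * N_h)      # 131,750 / 500 = 263.5 fish per boat-day
se_grand_mean = se_total / (2 * N_h)
```

This reproduces the spreadsheet results: a grand total of 131,750 sockeye with se of roughly 16,500, and a grand mean of 263.5 fish per boat-day.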
SAS analysis

As noted earlier, some care must be used when standard statistical packages are used to analyze survey data, as many packages ignore the design used to select the data. A sample SAS program for the analysis of the sockeye example, called sockeye.sas, and its output, called sockeye.lst, are available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The program starts with reading in the raw data and the computation of the sampling weights.
data sockeye;            /* read in the data */
   length date $8.;
   input date $ sockeye;
   /* compute the sampling weight. In general, these will be
      different for each stratum */
   if date = '29-Jul' then sampweight = 250/15;
   if date = '30-Jul' then sampweight = 250/15;
Because the population size and sample size are the same for each stratum, the sampling weights are common to all boats. In general, this is not true, and a separate sampling weight computation is required for each stratum. A separate file is also constructed with the population sizes for each stratum so that estimates of the population total can be constructed.
data n_boats;   /* you need to specify the stratum sizes if you want stratum totals */
   length date $8.;
   date = '29-Jul'; _total_ = 250; output;   /* the stratum sizes must be in the variable _total_ */
   date = '30-Jul'; _total_ = 250; output;
run;
Proc SurveyMeans then uses the STRATA statement to identify that this is a stratified design. The default analysis in each stratum is again a simple random sample.
proc surveymeans data=sockeye
      N = n_boats   /* dataset with the stratum population sizes present */
      mean          /* average catch/boat along with standard error */
      sum;          /* request estimates of total */
   strata date / list;   /* identify the stratum variable */
   var sockeye;          /* which variable to get estimates for */
   weight sampweight;
run;
The SAS output is:

Data Summary
   Number of Strata         2
   Number of Observations   30
   Sum of Weights           500

Stratum Information
   Stratum Index   date     Population Total   Sampling Rate   N Obs   Variable   N
   1               29-Jul   250                6.00%           15      sockeye    15
   2               30-Jul   250                6.00%           15      sockeye    15

Statistics
   Variable   Mean         Std Error of Mean   Sum      Std Dev
   sockeye    263.500000   33.082758           131750   16541
The results are the same as before. The only thing of “interest” is to note that SAS labels the precision of the estimated grand mean as a standard error while it labels the precision of the estimated total as a standard deviation! Both are correct: a standard error is a standard deviation – not of individual units in the population, but of the estimates over repeated sampling from the same population. I think it is clearer to label both as standard errors to avoid any confusion.

If separate analyses are wanted for each stratum, the SURVEYMEANS procedure has to be run twice, the second time with a BY statement to estimate the means and totals in each stratum. Again, it is likely easiest to do planning for future experiments in an Excel spreadsheet rather than using SAS.
When should the various estimates be used?

In a stratified sample, there are many estimates that are obtained with different standard errors. It can sometimes be confusing as to which estimate is used for which purpose. Here is a brief review of the four possible estimates and the level of interest in each estimate.
Parameter       Estimator         se                                             Example and Interpretation                                                                                                                                                      Who would be interested in this quantity?

Stratum mean    µ̂_h = ȳ_h         sqrt( s_h²/n_h (1 − f_h) )                     Stratum 1: estimate is 322; se of 56.8 (not shown). The estimated average catch per boat was 322 fish (se 56.8 fish) on 29 July.                                                     A fisher who wishes to fish ONLY the first day of the season and wants to know if it will meet expenses.

Stratum total   τ̂_h = N_h ȳ_h     N_h se(µ̂_h) = N_h sqrt( s_h²/n_h (1 − f_h) )   Stratum 1: estimate is 80,500 = 250 × 322; se of 14,195 = 250 × 56.8. The estimated total catch over all boats on 29 July was 80,500 (se 14,195).                                 DFO, which wishes to estimate the TOTAL catch over ALL boats on this single day so that the quota for the next day can be set.

Grand total     τ̂ = τ̂_1 + τ̂_2     sqrt( se(τ̂_1)² + se(τ̂_2)² )                    Estimate is 131,750 = 80,500 + 51,250; se is sqrt( 14195² + 8492² ) = 16,541. The estimated total catch over all boats over all days is 132,000 fish (se 17,000 fish).               DFO, which wishes to know the total catch over the entire fishing season so that impacts on the stock can be examined.

Grand average   µ̂ = τ̂/N           se(τ̂)/N                                        (Not shown.) N = 500 vessel-days. Estimate is 131,750/500 = 263.5; se is 16,541/500 = 33.0. The estimated catch per boat per day over the entire season was 263 fish (se 33 fish).   A fisher who wants to know the average catch per boat per day for the entire season to see if it will meet expenses.
3.7.6 Sample Size for Stratified Designs

As before, the question arises of how many units should be selected in stratified designs. Two questions need to be answered: first, what total sample size is required? Second, how should this total be allocated among the strata?

The total sample size can be determined using the same methods as for a simple random sample. I would suggest that you initially ignore the fact that the design will be stratified when finding the initial required total sample size. If stratification proves to be useful, then your final estimate will be more precise than you anticipated (always a nice thing to happen!), but seeing as you are making guesses as to the standard deviations and necessary precision required, I wouldn’t worry about the extra cost in sampling too much.

If you must, it is possible to derive formulae for the overall sample sizes when accounting for stratification, but these are relatively complex. It is likely easier to build a general spreadsheet where a single cell is the total sample size and all other cells depend upon this quantity through the allocation used. Then the total sample size can be manipulated to obtain the desired precision. The following information will be required:

• The sizes (or relative sizes) of each stratum (i.e. the N_h or W_h).
• The standard deviation of measurements in each stratum. This can be obtained from past surveys, a literature search, or expert opinion.
• The desired precision – overall – and, if needed, for each stratum.

Again refer to the sockeye worksheet.
The standard deviations from this survey will be used as ‘guesses’ for what might happen next year. As in this year’s survey, the total sample size will be allocated evenly between the two days.

In this case, the total sample size must be allocated to the two strata. You will see several methods in a later section to do this, but for now, assume that the total sample will be allocated equally among both strata. Hence the proposed sample size of 75 is split in half to give a proposed sample size of 37.5 in each stratum. Don’t worry about the fractional sample size – this is only a planning exercise.

We create one cell that holds the total sample size, and then use formulae to allocate the total sample size equally to the two strata. The total and the se of the overall total are found as before, and the relative precision (denoted the relative standard error (rse), and, unfortunately, in some books the coefficient of variation (cv)) is found as the estimated standard error divided by the estimated total. Again, this portion of the spreadsheet is set up so that changes in the total sample size are propagated throughout the sheet. If you change the total sample size from 75 to some other number, this is automatically split among the two strata, which then affects the estimated standard error for each stratum, which then affects the estimated standard error for the total, which then affects the relative standard error. Again, the proposed total sample size can be varied using trial and error, or the Excel Goal-Seek option can be used.

Here is what happens when a sample size of 75 is used. Don’t be alarmed by the fractional sample sizes in each stratum – the goal is again to get a rough feel for the required effort for a certain precision.

Total n = 75

Stratum   n      Mean   std dev   vessels   Est total   se Est total
29-Jul    37.5   322    226.8     250       80,500      8,537
30-Jul    37.5   205    135.7     250       51,250      5,107
Total                                       131,750     9,948
                                                        rse = 7.6%
A sample size of 75 is too small. Try increasing the sample size until the rse is 5% or less. Alternatively, one could use the GOAL SEEK feature of Excel to find the sample size that gives a relative standard error of 5% or less, as shown below:
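The trial-and-error search that Goal Seek automates can also be sketched in Python (not part of the original spreadsheet; it uses the same planning guesses – means, standard deviations, and 250 vessels per day – with equal allocation between the two days):

```python
import math

# Planning guesses from this year's survey: (vessels, mean catch, sd) per day
STRATA = [(250, 322, 226.8), (250, 205, 135.7)]

def rse_equal_allocation(n_total):
    """Relative standard error of the estimated total catch when n_total
    boats are split equally between the two days."""
    n_h = n_total / len(STRATA)
    total = var_total = 0.0
    for N_h, ybar, s in STRATA:
        # se of the estimated stratum total under SRSWOR
        se_tau = N_h * math.sqrt(s**2 / n_h * (1 - n_h / N_h))
        total += N_h * ybar
        var_total += se_tau**2
    return math.sqrt(var_total) / total

# Trial and error in place of Excel's Goal Seek:
# rse_equal_allocation(75)  -> about 7.6%
# rse_equal_allocation(145) -> about 5.0%
```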
Total n = 145

Stratum   n      Mean   std dev   vessels   Est total   se Est total
29-Jul    72.5   322    226.8     250       80,500      5,611
30-Jul    72.5   205    135.7     250       51,250      3,357
Total                                       131,750     6,539
                                                        rse = 5.0%

3.7.7 Allocating samples among strata
There are number of ways of allocating a sample of size n among the various strata. For example, 1. Equal allocation. Under an equal allocation scheme, all strata get the same sample size, i.e. nh = n/H This allocation is best if variances of strata are roughly equal, equally precise estimates are required for each stratum, and you wish to test for differences in means among strata (i.e. an analytical survey discussed in previous sections). 2. Proportional allocation. Under proportional allocation, sample sizes are allocated to be proportional Ni i PNi = n × to the number of sampling units in the strata, i.e ni = n × N N = n× Nh N1 +N2 +···+NH = n × Wi This allocation is simple to plan and intuitively appealing. However, it is not the best design. This design may waste effort because large strata get large sample sizes but precision is determined by sample size not the ratio of sample size to population size. For example, if one stratum is 10 times larger than any other stratum, it is not necessary to allocate 10 times the sampling effort to get the same precision in that stratum. 3. Neyman allocation In Neyman allocation (named after the statistician Neyman), the sample is allocated to minimize the overall standard error for a given total sample size. Tedious algebra gives that the sample should be allocated proportional to the product of the stratum size and the stratum standard i Si deviation, i.e. ni = n × PWWi Sh Si h = n × PNNi Sh Si h = n × N1 S1 +N2N S2 +···+NH SH . This allocation will be appropriate if the costs of measuring units are the same in all strata. Intuitively, the strata that have the most of sampling units should be weighted larger; strata with larger standard deviations must have more samples allocated to them to get the se of the sample mean within the stratum down to a reasonable level. A key assumption of this allocation is that the cost to sample a unit is the same in all strata. 4. 
Optimal allocation when costs are involved. In some cases, the costs of sampling differ among the strata. Suppose that it costs C_h to sample each unit in stratum h. Then the total cost of the survey is C = Σ_h n_h C_h. The allocation rule is that sample sizes should be proportional to the product of the stratum size, the stratum standard deviation, and the inverse of the square root of the cost of sampling, i.e.

   n_h = n × (W_h S_h/√C_h) / Σ_h (W_h S_h/√C_h) = n × (N_h S_h/√C_h) / (N_1 S_1/√C_1 + N_2 S_2/√C_2 + ... + N_H S_H/√C_H)

This implies that larger samples are allocated to strata that are larger, more variable, or cheaper to sample.
© 2012 Carl James Schwarz, December 21, 2012
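The allocation rules above can be sketched in a few lines of code. This is an illustrative Python translation (the notes themselves do these computations in an Excel workbook); the stratum sizes N_h and standard deviations S_h below are taken from the caribou example that follows.

```python
import math

def allocate(n, N, S=None, C=None):
    """Split a total sample size n among strata.

    N: stratum sizes N_h; S: stratum std devs S_h (None -> proportional);
    C: per-unit sampling costs C_h (None -> Neyman when S is given).
    Equal allocation is simply n / len(N) for every stratum.
    """
    if S is None:                       # proportional: n_h proportional to N_h
        w = list(N)
    elif C is None:                     # Neyman: n_h proportional to N_h * S_h
        w = [Nh * Sh for Nh, Sh in zip(N, S)]
    else:                               # cost-optimal: n_h prop. to N_h * S_h / sqrt(C_h)
        w = [Nh * Sh / math.sqrt(Ch) for Nh, Sh, Ch in zip(N, S, C)]
    total = sum(w)
    return [n * wh / total for wh in w]

# Stratum sizes and standard deviations from the caribou example below
N = [400, 40, 100, 40, 70, 120]
S = [74.7, 63.7, 589.5, 151.0, 351.5, 99.0]

prop   = allocate(211, N)      # proportional: about 109.6, 11.0, 27.4, ...
neyman = allocate(211, N, S)   # Neyman: about 47.1, 4.0, 92.9, ...
```

Cost-optimal allocation is obtained by also passing the per-unit costs C_h.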
CHAPTER 3. SAMPLING

In practice, most of the gain in precision occurs in moving from equal to proportional allocation, while often only small improvements in precision are gained by moving from proportional allocation to Neyman allocation. Similarly, unless cost differences are enormous, there isn't much of an improvement in precision in moving to an allocation based on costs.

Example - estimating the size of a caribou herd

This section is based on the paper:

Siniff, D.B. and Skoog, R.O. (1964). Aerial Censusing of Caribou Using Stratified Random Sampling. The Journal of Wildlife Management, 28, 391-401. http://dx.doi.org/10.2307/3798104

Some of the values have been modified slightly for illustration purposes.

The authors wished to estimate the size of a caribou herd. The density of caribou differs dramatically based on the habitat type, so the survey area was divided into six strata based on habitat type. The survey design divides each stratum into 4 km² quadrats, from which a random sample will be selected. The number of caribou in each selected quadrat will be counted from an aerial photograph.

The computations are available in the caribou tab in the Excel workbook ALLofData.xls, available in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The key point in examining different allocations is to make a single cell represent the total sample size and then make each stratum sample size a formula that is a function of this total. The total sample size can then be varied until the desired precision is found.

Results from previous year's survey: Here are the summary statistics from the survey in a previous year (n_h is the number of map-squares sampled):

Stratum   N_h   n_h       ȳ       s   Est total   se(total)
      1   400    98    24.1    74.7        9640        2621
      2    40    10    25.6    63.7        1024         698
      3   100    37   267.6   589.5       26760        7693
      4    40     6   179.0   151.0        7160        2273
      5    70    39   293.7   351.5       20559        2622
      6   120    21    33.2    99.0        3984        2354
  Total   770   211                       69127        9172
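The rollup in the table above can be verified with a short script. This is an illustrative Python sketch of the Excel computations, using the standard stratified-expansion formulas: the estimated stratum total is N_h ȳ_h, and its se includes the finite population correction.

```python
import math

# (N_h, n_h, mean, sd) for the six caribou strata from the table above
strata = [
    (400, 98, 24.1, 74.7),
    (40, 10, 25.6, 63.7),
    (100, 37, 267.6, 589.5),
    (40, 6, 179.0, 151.0),
    (70, 39, 293.7, 351.5),
    (120, 21, 33.2, 99.0),
]

est_total = sum(N * ybar for N, n, ybar, s in strata)
se_total = math.sqrt(sum(
    N**2 * (1 - n / N) * s**2 / n      # fpc included, per stratum
    for N, n, ybar, s in strata))

print(est_total, round(se_total))      # about 69127 and 9172
```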
The estimated size of the herd is 69,127 animals with an estimated se of 9,172 animals.

Equal allocation
What would happen if an equal allocation were used? We now split the 211 total sample size equally among the 6 strata. In this case, the sample sizes are 'fractional', but this is OK as we are interested only in planning to see what would have happened. Notice that the estimate of the overall population would NOT change, but the se changes.

Stratum   N_h    n_h       ȳ       s   Est total   se(total)
      1   400   35.2    24.1    74.7        9640        4810
      2    40   35.2    25.6    63.7        1024         149
      3   100   35.2   267.6   589.5       26760        8005
      4    40   35.2   179.0   151.0        7160         354
      5    70   35.2   293.7   351.5       20559        2927
      6   120   35.2    33.2    99.0        3984        1684
  Total   770    211                       69127        9938
An equal allocation gives rise to worse precision than the original survey. Examining the table in more detail, you see that an equal allocation puts far too many samples in strata 2 and 4 and not enough in strata 1 and 3.

Proportional allocation

What about proportional allocation? Now the sample size is proportional to the stratum population sizes. For example, the sample size for stratum 1 is found as 211 × 400/770. The following results are obtained:

Stratum   N_h     n_h       ȳ       s   Est total   se(total)
      1   400   109.6    24.1    74.7        9640        2431
      2    40    11.0    25.6    63.7        1024         656
      3   100    27.4   267.6   589.5       26760        9596
      4    40    11.0   179.0   151.0        7160        1554
      5    70    19.2   293.7   351.5       20559        4787
      6   120    32.9    33.2    99.0        3984        1765
  Total   770     211                       69127       11263
This has an even worse standard error! It looks like not enough samples are placed in strata 3 and 5.

Optimal allocation

What if both the stratum sizes and the stratum variances are used in allocating the sample? We create a new column (at the extreme right) equal to N_h S_h. The sample sizes are now proportional to these values, i.e. the sample size for the first stratum is found as 211 × 29866.4/133893.8. Again the estimate of the total doesn't change, but the se is reduced.
Stratum   N_h    n_h       ȳ       s   Est total   se(total)    N_h S_h
      1   400   47.1    24.1    74.7        9640        4089    29866.4
      2    40    4.0    25.6    63.7        1024        1206     2550.0
      3   100   92.9   267.6   589.5       26760        1629    58953.9
      4    40    9.5   179.0   151.0        7160        1709     6039.6
      5    70   38.8   293.7   351.5       20559        2639    24607.6
      6   120   18.7    33.2    99.0        3984        2522    11876.4
  Total   770    211                       69127        6089   133893.8
3.7.8 Example: Estimating the number of tundra swans
The Tundra Swan (Cygnus columbianus), formerly known as the Whistling Swan, is a large bird with white plumage and black legs, feet, and beak.[5] The USFWS is responsible for conserving and protecting tundra swans as a migratory bird under the Migratory Bird Treaty Act and the Fish and Wildlife Conservation Act of 1980. As part of these responsibilities, it conducts regular aerial surveys at one of their prime breeding areas in Bristol Bay, Alaska. The Bristol Bay population of tundra swans is of particular interest because suitable habitat for nesting is available earlier than in most other nesting areas. This example is based on one such survey.[6]

Tundra swans are highly visible on their nesting grounds, making them easy to monitor during aerial surveys. The Bristol Bay refuge has been divided into 186 survey units, each being a quarter section. These survey units have been divided into three strata based on density, and previous years' data provide the following information about the strata:

Density      Total           Past      Past
Stratum      Survey Units    Density   Std Dev
High               60             20        10
Medium             68             10         6
Low                58              2         3
Total             186

Based on past years' results and budget considerations, approximately 30 survey units can be sampled. The three strata are all approximately the same total area (number of survey units), so allocations based on stratum area would be approximately equal across strata. However, that would place about 1/3 of the effort into the low density stratum, which typically has fewer birds.

[5] Additional information about the tundra swan is available at http://www.hww.ca/hww2.asp?id=78&cid=7
[6] Doster, J. (2002). Tundra Swan Population Survey in Bristol Bay, Northern Alaska Peninsula, June 2002.
It is felt that stratum density is a suitable measure of stratum importance (notice the close relationship between stratum density and stratum standard deviation, which is often found in biological surveys). Consequently, an allocation based on stratum density was used. The sum of the density values is 20 + 10 + 2 = 32. A proportional allocation would then place about 30 × 20/32 ≈ 18 units in the high density stratum; about 30 × 10/32 ≈ 9 units in the medium density stratum; and the remainder (3 units) in the low density stratum.

The survey was conducted with the following results:

Survey                 Area    Swans in   Single            Total
Unit       Stratum    (km²)      flocks    Birds    Pairs   birds
dilai2        h         148                   12       6       24
naka41        h         137                   13      15       43
naka43        h         137                    6      16       38
naka51        h          16                    3       2       17
nakb32        h         137                   10      10       30
nakb44        h         135          6        18      12       48
nakc42        h          83          4         5       6       21
nakc44        h         109                   17      15       47
nakd33        h         134                   11      11       33
ugac34        h          65                    2      10       22
ugac44        h         138                   28      15       58
ugad5/63      h         159                    9      20       49
dugad56/4     m         102                    7       4       15
guad43        m         137                    6       4       14
ugad42        m         137                   11      15       46
low1          l         143                    2                2
low3          l         138                    1                1
The first thing to notice from the table above is that not all survey units could be surveyed because of poor weather. As always with missing data, it is important to determine if the data are Missing Completely at Random (MCAR). In this case, it seems reasonable that the swans did not adjust their behavior knowing that certain survey units would be sampled on the poor-weather days, and so there is no impact of the missing data other than a loss of precision compared to a survey with the full 30 survey units.

Also notice that "blanks" in the table (missing values) represent zeros and are not really missing data.

Finally, not all of the survey units have the same area. This could introduce additional variation into our data, which may affect our final standard errors. Even though the survey units are of different areas, the survey units were chosen as a simple random sample, so ignoring the area will NOT introduce bias into the estimates (why?). You will see in later sections how to compute a ratio estimator, which could take the area
of each survey unit into account and potentially lead to more precise estimates.

SAS analysis

A copy of the SAS program (tundra.sas) is available in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

The data are read into SAS in the usual fashion with the code fragment:

data swans;
   infile datalines firstobs=3;
   length survey_unit $10 stratum $1;
   input survey_unit $ stratum $ area num_flocks num_single num_pairs;
   num_swans = num_flocks + num_single + 2*num_pairs;
   datalines4;
... datalines inserted here ...
The total number of survey units in each stratum is also read into SAS using the code fragment:

data total_survey_units;
   length stratum $1.;
   input stratum $ _total_;   /* must use _total_ as variable name */
   datalines;
h 60
m 68
l 58
;;;;
Notice that the variable holding the number of stratum units must be called _total_, as required by the SurveyMeans procedure. Next the data are sorted by stratum (not shown), and the number of survey units actually surveyed in each stratum is found using Proc Means:

proc means data=swans noprint;
   by stratum;
   var num_swans;
   output out=n_units n=n;
run;
Most survey procedures in SAS require sampling weights. These are the reciprocal of the probability of selection; in this case, simply the number of units in the stratum divided by the number sampled in that stratum:
data swans;
   merge swans total_survey_units n_units;
   by stratum;
   sampling_weight = _total_ / n;
run;
Now the individual stratum estimates are obtained using the code fragment:
/* first estimate the numbers in each stratum */
proc surveymeans data=swans
        total=total_survey_units   /* inflation factors */
        sum clsum mean clm;
   by stratum;                     /* separate estimates by stratum */
   var num_swans;
   weight sampling_weight;
   ods output statistics=IndivEst;
run;
This gives the output:

Obs   stratum    Mean   StdErr   LowerCLMean   UpperCLMean
  1         h   35.83     3.43         28.28         43.38
  2         l    1.50     0.49         −4.74          7.74
  3         m   25.00    10.27        −19.19         69.19

Obs   stratum    Sum   StdDev   LowerCLSum   UpperCLSum
  1         h   2150      206         1697         2603
  2         l     87       28         −275          449
  3         m   1700      698        −1305         4705
The estimates in the l and m strata are not very precise because of the small number of survey units selected. SAS has incorporated the finite population correction factor when estimating the se for the individual stratum estimates. We estimate about 2150 swans in the h stratum, about 1700 in the m stratum, but fewer than 100 in the l stratum.

The grand total is found by adding the estimated totals from the strata, 2150 + 87 + 1700 = 3937, and the standard error of the grand total is found in the usual way: √(206² + 28² + 698²) = 729.
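As a cross-check on the SAS output, the stratum totals and the grand total can be recomputed from the raw counts. This is an illustrative Python sketch, not part of tundra.sas; the per-unit values are the "Total birds" column of the survey table above.

```python
import math

# stratum: (N_h, observed per-unit swan totals from the survey table)
strata = {
    "h": (60, [24, 43, 38, 17, 30, 48, 21, 47, 33, 22, 58, 49]),
    "m": (68, [15, 14, 46]),
    "l": (58, [2, 1]),
}

def expand(Nh, ys):
    """Expansion estimate of the stratum total and its se (fpc included)."""
    nh = len(ys)
    mean = sum(ys) / nh
    s2 = sum((y - mean)**2 for y in ys) / (nh - 1)
    se = Nh * math.sqrt((1 - nh / Nh) * s2 / nh)
    return Nh * mean, se

totals = {h: expand(*v) for h, v in strata.items()}
grand = sum(t for t, se in totals.values())                    # 2150 + 1700 + 87 = 3937
grand_se = math.sqrt(sum(se**2 for t, se in totals.values()))  # about 729
```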
Proc SurveyMeans can also be used to estimate the grand total over all strata using the code fragment:
/* now estimate the grand total */
proc surveymeans data=swans
        total=total_survey_units   /* inflation factors for each stratum */
        sum clsum mean clm;        /* want to estimate grand totals */
   title2 'Estimate total number of swans';
   strata stratum /list;           /* which variable defines the strata */
   var num_swans;                  /* which variable to analyze */
   weight sampling_weight;         /* sampling weight for each obs */
   ods output statistics=FinalEst;
run;
This gives the output:

Obs    Mean   StdErr   LowerCLMean   UpperCLMean
  1   21.17     3.92         12.77         29.57

Obs    Sum   StdDev   LowerCLSum   UpperCLSum
  1   3937      729         2374         5500
The standard error is larger than desired, mostly because of the very small sample size in the M stratum where only 3 of the 9 proposed survey units could be surveyed.
3.7.9 Post-stratification
In some cases, it is inconvenient or impossible to stratify the population elements into strata before sampling because the value of the variable used for stratification is only available after the unit is sampled. For example:

• we wish to stratify a sample of baby births by birth weight to estimate the proportion of birth defects;
• we wish to stratify by family size when looking at day care costs;
• we wish to stratify by soil moisture, but this can only be measured when the plot is actually visited.

We don't know the birth weight, the family size, or the soil moisture until after the data are collected.
There is nothing formally wrong with post-stratification, and it can lead to substantial improvements in precision.

How would post-stratification work in practice? Suppose that 20 quadrats (each 1 m²) were sampled out of a 100 m² survey area using a simple random sample, and the number of insect grubs counted in each quadrat. When the units were sampled, the soil was classified into high (h) or low (l) quality habitat for these grubs:

Grubs   Post-strat
   10   h
    2   l
    3   l
    8   h
    1   l
    3   l
   11   h
    2   l
    2   l
   11   h
   17   h
    1   l
    0   l
   11   h
   15   h
    2   l
    2   l
    4   l
    2   l
    1   l
The overall mean density is estimated to be 5.40 insects/m² with a se of 1.17 insects/m² (ignoring any fpc). The estimated total number of insects over all 100 m² of the study area is 100 × 5.40 = 540 insects with a se of 100 × 1.17 = 117 insects.

Now suppose we look at the summary statistics by the post-stratification variable. If the areas of the post-strata are known (and this is NOT always possible), you can use the standard rollup for a stratified design. Suppose that there were 30 m² of high quality habitat and 70 m² of low quality habitat. Then the rollup proceeds as before (again ignoring the fpc) and is summarized as:

Post-stratum   Area (m²)    n    Mean      s   Est total   se(total)
h                    30     7   11.86   3.08       355.7        34.9
l                    70    13    1.92   1.04       134.6        20.1
Total               100    20                      490.3        40.3
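The post-stratified rollup can be checked directly from the 20 observations. This is an illustrative Python sketch (the notes use an Excel worksheet for this computation); like the non-stratified analysis above, it ignores the fpc.

```python
import math

grubs = {
    "h": [10, 8, 11, 11, 17, 11, 15],                # high quality quadrats
    "l": [2, 3, 1, 3, 2, 2, 1, 0, 2, 2, 4, 2, 1],    # low quality quadrats
}
area = {"h": 30, "l": 70}   # m^2 of each habitat type in the study area

est = se2 = 0.0
for stratum, ys in grubs.items():
    n = len(ys)
    mean = sum(ys) / n
    s2 = sum((y - mean)**2 for y in ys) / (n - 1)
    est += area[stratum] * mean              # expand density to a stratum total
    se2 += (area[stratum]**2) * s2 / n       # fpc ignored, as in the text

print(round(est), round(math.sqrt(se2)))     # about 490 and 40
```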
Now the estimated total is 490 grubs with a se of 40, a substantial improvement over the non-stratified analysis. The difference between the estimates (540 vs. 490) is well within the range of uncertainty summarized by the standard errors.

There are several potential problems when using post-stratification.

• The sample size in each post-stratum cannot be controlled. This implies that it is not possible to use any of the allocation methods discussed earlier to improve precision. As well, the survey may end up with a very small sample size in some strata.

• The reported se should be increased to account for the fact that the sample size in each stratum is no longer fixed. This introduces an additional source of variation for the estimate: estimates will vary from sample to sample not only because a new sample is drawn each time, but also because the sample size within each stratum changes. In practice this is rarely a problem because the actual increase in the se is usually small, and this additional adjustment is rarely ever done.

• In the above example, the area of each stratum in the ENTIRE study area could be found after the fact. In some cases, however, it is impossible to find the area of each stratum in the entire study area, and so the rollup cannot be done. In these cases, you could use the results from the post-stratification to also estimate the area of each stratum, but now the expansion factor for each stratum also has a se, and this must be taken into account. Please consult a standard book on sampling theory for details.
3.7.10 Allocation and precision - revisited
A student wrote:

I'm a little confused about sample allocation in stratified sampling. Earlier in the course, you stated that precision is independent of population size, i.e. a sample of 1000 gave estimates that were equally precise for Canada and the US (assuming a simple random sample). Yet in stratified sampling, you also said that precision is improved by proportional allocation, where larger strata get larger sample sizes.

Both statements are correct. If you are interested in estimates for the individual populations, then the absolute sample size is what matters. If you wanted equally precise estimates for BOTH Canada and the US, then you would take equal sample sizes from both populations, say 1000 from each, even though their overall population sizes differ by a factor of 10:1.

However, in stratified sampling designs, you may also be interested in the OVERALL estimate over both populations. In this case, a proportional allocation, where sample size is allocated proportional to population size, often performs better. Here, an overall sample of 2000 people would be allocated proportional to the population sizes as follows:
Stratum   Population      Fraction of total population   Sample size
US        300,000,000     91%                            91% × 2000 = 1818
Canada     30,000,000      9%                             9% × 2000 = 181
Total     330,000,000     100%                           2000
Why does this happen? If you are interested in the overall population, then the US results essentially drive everything, and Canada has little effect on the overall estimate. Consequently, it doesn't matter that the Canadian estimate is not as precise as the US estimate.
3.8 Ratio estimation in SRS - improving precision with auxiliary information
An association between the measured variable of interest and a second variable can be exploited to obtain more precise estimates. For example, suppose that growth in a sample plot is related to soil nitrogen content. A simple random sample of plots is selected, and the height of trees in each sample plot is measured along with the soil nitrogen content of the plot. A regression model is fit (Thompson, 1992, Chapters 7 and 8) between the two variables to account for some of the variation in tree height as a function of soil nitrogen content. This can be used to make precise predictions of the mean height in stands if the soil nitrogen content can be easily measured. This method will be successful if there is a direct relationship between the two variables, and the stronger the relationship, the better it will perform. This technique is often called ratio-estimation or regression-estimation.

Notice that multi-phase designs often use an auxiliary variable as well, but there the second variable is measured on only a subset of the sample units; this should not be confused with the ratio estimators of this section.

Ratio estimation has two purposes. First, in some cases you are interested in the ratio of two variables itself, e.g. what is the ratio of wolves to moose in a region of the province? Second, a strong relationship between two variables can be used to improve precision without increasing sampling effort. This is an alternative to stratification when you can measure two variables on each sampling unit.

We define the population ratio as R = τ_Y/τ_X = µ_Y/µ_X. Here Y is the variable of interest and X is a secondary variable not really of interest. Note that notation differs among books - some books reverse the roles of X and Y.
Why is the ratio defined in this way? There are two common ratio estimators, traditionally called the mean-of-ratios and the ratio-of-means estimators. Suppose you had the following data for Y and X, which represent the counts of animals of species 1 and 2 taken on 3 different days:

Sample     Y    X
     1    10    3
     2   100   20
     3    20    1

The mean-of-ratios estimator would compute the estimated ratio between Y and X as:

R_mean-of-ratios = (10/3 + 100/20 + 20/1) / 3 = 9.44

while the ratio-of-means would be computed as:

R_ratio-of-means = ((10 + 100 + 20)/3) / ((3 + 20 + 1)/3) = (10 + 100 + 20) / (3 + 20 + 1) = 5.42
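The two estimators are easy to check numerically; the following is an illustrative Python sketch (not from the original text):

```python
Y = [10, 100, 20]   # counts of species 1 on the three days
X = [3, 20, 1]      # counts of species 2 on the three days

mean_of_ratios = sum(y / x for y, x in zip(Y, X)) / len(Y)
ratio_of_means = sum(Y) / sum(X)

print(round(mean_of_ratios, 2), round(ratio_of_means, 2))  # 9.44 and 5.42
```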
Which is "better"? The mean-of-ratios estimator should be used when you wish to give equal weight to each pair of numbers regardless of the magnitude of the numbers. For example, you may have three plots of land, and you measure Y and X on each plot, but because observer efficiency differs among plots, the raw numbers cannot be compared directly. For instance, on a cloudy, rainy day it is hard to see animals (the first case), but on a clear, sunny day it is easy to see animals (the second case). The actual numbers themselves cannot be combined directly.

The ratio-of-means estimator (considered in this chapter) gives every value of Y and X equal weight. Here the fact that unit 2 has 10 times the number of animals as unit 1 is important, as we are interested in the ratio over the entire population of animals. By adding the values of Y and X first, each animal is given equal weight.

When is a ratio estimator better - what other information is needed? The higher the correlation between X_i and Y_i, the better the ratio estimator performs compared to a simple expansion estimator. It turns out that the ratio estimator is the 'best' linear estimator if

• the relation between Y_i and X_i is linear through the origin;
• the variation around the regression line is proportional to the X value, i.e. the spread around the regression line increases as X increases, unlike an ordinary regression line where the spread is assumed to be constant along the line.

In practice, plot y_i vs. x_i from the sample and see what type of relation exists.

When can a ratio estimator be used? A ratio estimator requires that another variable (the X variable) be measured on the selected sampling units. Furthermore, if you are estimating the overall mean or total, the total value of the X variable over the entire population must also be known. For example, as seen in the examples to come, the total area must be known to estimate the total number of animals once the density (animals/ha) is known.
3.8.1 Summary of main results

Quantity   Population value         Sample estimate       se of the estimate
Ratio      R = τ_Y/τ_X = µ_Y/µ_X    r = ȳ/x̄              √( (1/µ_X²) (s²_diff/n) (1−f) )
Total      τ_Y = R τ_X              τ̂_ratio = r τ_X       τ_X × √( (1/µ_X²) (s²_diff/n) (1−f) )
Mean       µ_Y = R µ_X              µ̂_Y,ratio = r µ_X     µ_X × √( (1/µ_X²) (s²_diff/n) (1−f) )

Notes

Don't be alarmed by the apparent complexity of the formulae above. They are relatively simple to implement in spreadsheets.

• The term s²_diff = Σ_{i=1}^{n} (y_i − r x_i)² / (n−1) is computed by creating a new column y_i − r x_i and finding the (sample standard deviation)² of this new derived variable. This will be illustrated in the examples.

• In some cases µ²_X in the denominator may or may not be known, and either it or its estimate x̄² can be used in its place. There doesn't seem to be any empirical evidence that either is better.

• In the se of the total, the term τ²_X/µ²_X reduces to N².

• Confidence intervals. Confidence limits are found in the usual fashion. In general, the distribution of r is positively skewed, and so the upper bound is usually too small. This skewness is caused by the variation in the denominator of the ratio. For example, suppose that a random variable Z has a uniform distribution between 0.5 and 1.5, centered on 1. The inverse 1/Z then ranges between 0.667 and 2 - no longer symmetric around 1. So if a symmetric confidence interval is created, its width will tend not to match the true distribution. This skewness is not generally a problem if the sample size is at least 30 and the relative standard errors of ȳ and x̄ are both less than 10%.

• Sample size determination. The appropriate sample size to obtain a specified size of confidence interval can be found by inverting the formula for the se of the ratio. This can be done on a spreadsheet using trial and error or the goal seek feature of the spreadsheet, as illustrated in the examples that follow.
3.8.2 Example - wolf/moose ratio

[This example was borrowed from Krebs, 1989, p. 208. Note that Krebs interchanges the use of x and y in the ratio.]
Wildlife ecologists interested in measuring the impact of wolf predation on moose populations in BC obtained estimates, by aerial counting, of the population sizes of wolves and moose on 11 sub-areas (all roughly equal in size) selected as a SRSWOR from a total of 200 sub-areas in the game management zone. In this example, the actual ratio of wolves to moose is of interest. Here are the raw data:

Sub-area   Wolves   Moose
       1        8     190
       2       15     370
       3        9     460
       4       27     725
       5       14     265
       6        3      87
       7       12     410
       8       19     675
       9        7     290
      10       10     370
      11       16     510
What are the population and parameter of interest? As in previous situations, there is some ambiguity:

• The population of interest is the 200 sub-areas in the game management zone. The sampling units are the 11 sub-areas. The response variables are the wolf and moose populations in each sub-area. We are interested in the wolf/moose ratio.

• The populations of interest are the moose and the wolves. If individual measurements were taken on each animal, then this definition would be fine. However, only the total numbers of wolves and moose within each sub-area are counted - hence a more proper description of this design would be a cluster design. As you will see in a later section, the analysis of a cluster design starts by summing to the cluster level and then treating the clusters as the population and sampling units, as is done in this case.

Having said this, do the numbers of moose and wolves measured on each sub-area include young moose and young wolves, or just adults? How will immigration and emigration be taken care of?

What was the frame? Was it complete? The frame consists of the 200 sub-areas of the game management zone. Presumably these 200 sub-areas cover the entire zone, but what about emigration and immigration? Moose and wolves may move into and out of the zone.

What was the sampling design?
It appears to be a SRSWOR design - the sampling units are the sub-areas of the zone. How did they determine the counts in the sub-areas? Perhaps they simply looked for tracks in the snow in winter - it seems difficult to get estimates from the air in summer when there is lots of vegetation blocking the view.
Excel analysis

A copy of the worksheet to perform the analysis of these data is called wolf and is available in the Allofdata workbook from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. Here is a summary shot of the spreadsheet:
Assessing conditions for a ratio estimator
The ratio estimator works well if the relationship between Y and X is linear through the origin, with variance increasing with X. Begin by plotting Y (wolves) vs. X (moose).

The data appear to satisfy the conditions for a ratio estimator.

Compute summary statistics for both Y and X. Refer to the screenshot of the spreadsheet. The Excel built-in functions are used to compute the sample size, sample mean, and sample standard deviation for each variable.

Compute the ratio. The ratio is computed using the formula for a ratio estimator in a simple random sample, i.e. r = ȳ/x̄.

Compute the difference variable.
Then for each observation, the difference between the observed Y (the actual number of wolves) and the predicted Y based on the number of moose (Ŷ_i = r X_i) is found. Notice that the sum of the differences must equal zero. The standard deviation of the differences will be needed to compute the standard error of the estimated ratio.

Estimate the standard error of the estimated ratio. Use the formula given at the start of the section.

Final estimate. Our final result is that the estimated ratio is 0.03217 wolf/moose with an estimated se of 0.00244 wolf/moose. An approximate 95% confidence interval would be computed in the usual fashion.

Planning for future surveys. Our final estimate has an approximate rse of 0.00244/0.03217 = 7.5%, which is pretty good. You could try different values of n to see what sample size would be needed to get a rse of better than 5%, or perhaps this is too precise and you only want a rse of about 10%.

As an approximate answer, recall that the se varies inversely with √n. A rse of 5% is smaller than the current 7.5% by a factor of 0.075/0.05 = 1.5, which will require an increase of 1.5² = 2.25 in the sample size, or about n_new = 2.25 × 11 = 25 units (ignoring the fpc).

If the raw data are available, you can also do a "bootstrap" selection (with replacement) to investigate the effect of sample size upon the se. For each bootstrap sample size, estimate the ratio and its se, and increase the sample size until the required se is obtained. This is relatively easy to do in SAS using Proc SurveySelect, which can select samples of arbitrary size. In some packages, such as JMP, sampling is without replacement, so directly sampling 3× the observed sample size is not possible. In this case, create a pseudo-data set by pasting 19 copies of the raw data after the original data, then use Tables → Subset → Random Sample Size to get the approximate bootstrap sample.
Again compute the ratio and its se, and increase the sample size until the required precision is obtained.

If you want to be more precise about this, notice that the formula for the se of a ratio is

se(r) = √( (1/µ_X²) (s²_diff/n) (1−f) )

From the spreadsheet we extract various values and find that the se of the ratio is

se(r) = √( (1/395.64²) (3.29²/n) (1 − n/200) )

Different values of n can be tried until the rse is 5%. This gives a sample size of about 24 units.

If the actual raw data are not available, all is not lost. You would require the approximate MEAN of X (µ_X), the standard DEVIATION of Y, the standard DEVIATION of X, the CORRELATION between Y
and X, the approximate ratio (R), and the approximate total number of sample units (N). The correlation determines how closely Y can be predicted from X and essentially determines how much better you will do using a ratio estimator. If the correlation is zero, there is NO gain in precision using a ratio estimator over a simple mean. The se of r is then found as:

se(r) = √( (1/µ_X²) × ( V(y) + R² V(x) − 2 R corr(y,x) √(V(x) V(y)) ) / n × (1 − n/N) )

Different values of n can be tried to obtain the desired rse. This is again illustrated on the spreadsheet.
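The spreadsheet computations, including the trial-and-error search for the sample size giving a 5% rse, can be condensed into a short script. This is an illustrative Python sketch, not part of wolf.sas:

```python
import math

wolves = [8, 15, 9, 27, 14, 3, 12, 19, 7, 10, 16]
moose  = [190, 370, 460, 725, 265, 87, 410, 675, 290, 370, 510]
n, N = len(wolves), 200

r = sum(wolves) / sum(moose)                      # ratio-of-means estimator
mean_x = sum(moose) / n
diffs = [y - r * x for y, x in zip(wolves, moose)]
s2_diff = sum(d * d for d in diffs) / (n - 1)
se_r = math.sqrt((1 / mean_x**2) * (s2_diff / n) * (1 - n / N))
# r is about 0.0322 with se about 0.0024, as in the text

# invert the se formula: smallest sample size m giving rse <= 5%
needed = next(m for m in range(2, N)
              if math.sqrt((1/mean_x**2) * (s2_diff/m) * (1 - m/N)) / r <= 0.05)
# needed is about 24 units, matching the goal-seek result above
```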
SAS analysis

The above computations can also be done in SAS with the program wolf.sas, available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. It uses Proc SurveyMeans, which gives the output contained in wolf.lst. The SAS program again starts with a DATA step to read in the data.
data wolf;
   input subregion wolf moose;
Because the sampling weights are equal for all observations, it is not necessary to include them when estimating a ratio (the weights cancel in the formula used by SAS). The Gplot procedure creates a plot similar to that in the Excel spreadsheet.
proc gplot data=wolf;
   title2 'plot to assess assumptions';
   plot wolf*moose;
Finally, the SurveyMeans procedure does the actual computation:
proc surveymeans data=wolf ratio clm N=200;
   title2 'Estimate of wolf to moose ratio';
   /* ratio clm - request a ratio estimator with confidence intervals */
   /* N=200 specifies the total number of units in the population */
   var moose wolf;
   ratio wolf/moose;   /* this statement asks for the ratio estimator */
   ods output Ratio=ratio;
The RATIO statement in the SurveyMeans procedure requests the computation of the ratio estimator. Here is the output:

Obs   Numerator   Denominator      Ratio    StdErr     LowerCL     UpperCL
  1        wolf         moose   0.032169  0.002438  0.02673676  0.03760148
The results are identical to those from the spreadsheet. Again, it is easier to do planning in the Excel spreadsheet rather than in the SAS program.

CAUTION. Ordinary regression estimation from standard statistical packages provides only an APPROXIMATION to the correct analysis of survey data. There are three problems in using standard statistical packages for regression and ratio estimation of survey data:

• Assumes a simple random sample. If your data are NOT collected using a simple random sample, then ordinary regression methods should NOT be used.

• Unable to use a finite population correction factor. This is usually not a problem unless the sample size is large relative to the population size.

• Wrong error structure. Standard regression analyses assume that the variance around the regression or ratio line is constant. In many survey problems this is not true. This can be partially alleviated through the use of weighted regression, but this still does not completely fix the problem.

For further information about the problems of using standard statistical software packages in survey sampling please refer to the article at http://www.fas.harvard.edu/~stats/survey-soft/donna_brogan.html.

Using ordinary regression

Because the ratio estimator assumes that the variance of the response increases with the value of X, a new column representing the inverse of the X variable (i.e. 1/the number of moose) has been created. We start by plotting the data to assess if the relationship is linear and through the origin. The Y variable is the number of wolves; the X variable is the number of moose. If the relationship is not through the origin, then a more complex analysis called regression estimation is required. The graph looks like it is linear through the origin, which is one of the assumptions of the ratio estimator. Now we wish to fit a straight line THROUGH THE ORIGIN. By default, most computer packages include the intercept, which we want to force to zero.
We must also specify that the inverse of the X variable (1/X) is the weighting variable. We see that the estimated ratio (.032 wolves/moose) matches the Excel output, but the estimated standard error (.0026) does not quite match Excel. The difference is a bit larger than can be accounted for by not using the finite population correction factor. As a matter of interest, if you repeat the analysis WITHOUT using the inverse of the X variable as the weighting variable, you obtain an estimated ratio of .0317 (se .0022). All of these estimates are similar and it likely makes very little difference which is used. Finding the required sample size is trickier because of the weighted regression approach used by the packages, the slightly different way the se is computed, and the lack of a fpc. The latter two issues are
usually not important in determining the approximate sample size, but the first issue is crucial. Start by REFITTING Y vs. X WITHOUT using the weighting variable. This will give you roughly the same estimate and se, but now it is much easier to extract the necessary information for sample size determination. When the UNWEIGHTED model is fit, you will see that the Root Mean Square Error has the value of 3.28. This is the value of \(s_{\mathit{diff}}\) that is needed. The approximate se for r (ignoring the fpc) is
\[ se(r) \approx \frac{s_{\mathit{diff}}}{\mu_X\sqrt{n}} \approx \frac{3.28}{395.64\sqrt{n}} \]
Again, different values of n can be tried to get the appropriate rse. This gives an n of about 25 or 26, which is sufficient for planning purposes.
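As an aside, the reason the weighted fit reproduces the ratio estimator can be checked numerically: a least-squares line through the origin with weights w = 1/x has slope sum(wxy)/sum(wx²) = sum(y)/sum(x), which is exactly the ratio estimator. A quick sketch (the (x, y) values below are hypothetical, not the wolf/moose data):

```python
import numpy as np

# Hypothetical (x, y) pairs standing in for (moose, wolf) counts
x = np.array([272.0, 356.0, 413.0, 298.0, 440.0])
y = np.array([8.0, 11.0, 14.0, 10.0, 15.0])

# Weighted least squares through the origin with weights w = 1/x:
# slope = sum(w*x*y) / sum(w*x*x) = sum(y) / sum(x)
w = 1.0 / x
slope_wls = np.sum(w * x * y) / np.sum(w * x * x)

ratio = y.sum() / x.sum()   # the ratio estimator r
print(np.isclose(slope_wls, ratio))  # True - the two estimators coincide
```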
Post mortem No population numbers can be estimated using the ratio estimator in this case because of a lack of suitable data. In particular, if you had wanted to estimate the total wolf population, you would have to use the simple inflation estimator that we discussed earlier unless you had some way of obtaining the total number of moose that are present in the ENTIRE management zone. This seems unlikely. However, refer to the next example, where the appropriate information is available.
3.8.3 Example - Grouse numbers - using a ratio estimator to estimate a population total
In some cases, a ratio estimator is used to estimate a population total. In these cases, the improvement in precision comes from the close relationship between the two variables. Note that the population total of the auxiliary variable must be known in order to use this method.

Grouse Numbers

A wildlife biologist has estimated the grouse population in a region containing isolated areas (called pockets) of bush as follows. She selected 12 pockets of bush at random and attempted to count the numbers of grouse in each of these. (One can assume that the grouse are almost all found in the bush and, for the purpose of this question, that the counts were perfectly accurate.) The total number of pockets of bush in the region is 248, comprising a total area of 3015 hectares. Results are as follows:
Area (ha)   Number of Grouse
     8.9                  24
     2.7                   3
     6.6                  10
    20.6                  36
     3.7                   8
     4.1                   8
    25.8                  60
     1.8                   5
    20.1                  35
    14.0                  34
    10.1                  18
     8.0                  22
What is the population of interest and parameter to be estimated? As before, there is some ambiguity:

• The population of interest is the pockets of bush in the region. The sampling unit is the pocket of bush. The number of grouse in each pocket is the response variable.

• The population of interest is the grouse. These happen to be clustered into pockets of bush. This leads back to the previous case.

What is the frame? Here the frame is explicit - the set of all pockets of bush. It isn't clear if all grouse will be found in these pockets - will some be itinerant and hence missed? What about movement of grouse between visits to the pockets of bush?

Summary statistics

Variable     n    mean   std dev
area        12   10.53      7.91
grouse      12   21.92     16.95
Simple inflation estimator ignoring the pocket areas

If we wish to adjust for the sampling fraction, we can use our earlier results for the simple inflation estimator,
our estimate of the total number of grouse is
\[ \hat{\tau} = N\bar{y} = 248 \times 21.92 = 5435.33 \]
with an estimated se of
\[ se(\hat{\tau}) = N\sqrt{\frac{s^2}{n}(1-f)} = 248\sqrt{\frac{16.95^2}{12}\left(1-\frac{12}{248}\right)} = 1183.4 \]
The estimate isn't very precise, with a rse of 1183.4/5435.3 = 22%.

Ratio estimator - why?

Why did the inflation estimator do so poorly? Part of the reason is the relatively large standard deviation of the number of grouse in the pockets. Why does this number vary so much? It seems reasonable that larger pockets of bush will tend to have more grouse. Perhaps we can do better by using the relationship between the area of the bush and the number of grouse through a ratio estimator.
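The inflation-estimator arithmetic above can be verified directly from the grouse counts in the table; a quick check in Python (the survey's own computations are done in Excel and SAS):

```python
import math

# Grouse counts from the 12 sampled pockets (see the table above)
grouse = [24, 3, 10, 36, 8, 8, 60, 5, 35, 34, 18, 22]
N = 248                       # total pockets in the region
n = len(grouse)               # sampled pockets

ybar = sum(grouse) / n
s2 = sum((y - ybar) ** 2 for y in grouse) / (n - 1)   # sample variance

tau_hat = N * ybar                                    # simple inflation estimate
se_tau = N * math.sqrt(s2 / n * (1 - n / N))          # with the fpc
rse = se_tau / tau_hat
print(round(tau_hat, 1), round(se_tau, 1), round(rse, 2))
# matches the text's 5435.3 and (to rounding) 1183.4, rse about 22%
```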
Excel analysis An Excel worksheet is available in the grouse tab in the AllofData workbook from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. Preliminary plot to assess if ratio estimator will work First plot numbers of grouse vs. area and see if this has a chance of succeeding.
The graph shows a linear relationship through the origin. There is some evidence that the variance increases with X (the area of the pocket).

Find the ratio between grouse numbers and area

The spreadsheet is set up similarly to the previous example:
The total of the X variable (area) will need to be known. As before, you find summary statistics for X and Y, compute the ratio estimate, find the difference variable, find the standard deviation of the difference variable, and find the se of the estimated ratio. The estimated ratio is:
\[ r = \bar{y}/\bar{x} = 21.92/10.53 = 2.081 \text{ grouse/ha} \]
The se of r is found as
\[ se(r) = \sqrt{\frac{1}{\bar{x}^2}\,\frac{s^2_{\mathit{diff}}}{n}\,(1-f)} = \sqrt{\frac{1}{10.533^2}\times\frac{4.7464^2}{12}\times\left(1-\frac{12}{248}\right)} = 0.1269 \text{ grouse/ha} \]

Expand ratio by total of X

In order to estimate the population total of Y, you now multiply the estimated ratio by the population total of X. We know the pockets cover 3015 ha, and so the estimated total number of grouse is found by
\[ \hat{\tau}_Y = \tau_X \times r = 3015 \times 2.081 = 6273.3 \text{ grouse} \]
To estimate the se of the total, multiply the se of r by 3015 as well:
\[ se(\hat{\tau}_Y) = \tau_X \times se(r) = 3015 \times 0.1269 = 382.6 \text{ grouse} \]
The precision is much improved compared to the simple inflation estimator. This improvement is due to the very strong relationship between the number of grouse and the area of the pockets.

Sample size for future surveys

If you wish to investigate different sample sizes, the simplest way would be to modify the cell corresponding to the count of the differences. This will be left as an exercise for the reader. The final ratio estimate has a rse of about 6% - quite good. It is relatively straightforward to investigate the sample size needed for a 5% rse. We find this to be about 17 pockets.
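The same ratio-estimator computations can be checked directly from the data table; a Python sketch mirroring the spreadsheet steps (the book does this in Excel and SAS):

```python
import math

# Area (ha) and grouse counts for the 12 sampled pockets (from the table)
area   = [8.9, 2.7, 6.6, 20.6, 3.7, 4.1, 25.8, 1.8, 20.1, 14.0, 10.1, 8.0]
grouse = [24, 3, 10, 36, 8, 8, 60, 5, 35, 34, 18, 22]
N, n, tau_x = 248, 12, 3015      # pockets in region, sample size, total area (ha)

xbar = sum(area) / n
ybar = sum(grouse) / n
r = ybar / xbar                  # estimated grouse per hectare

# Difference variable d_i = y_i - r*x_i; its mean is exactly 0,
# so the sample variance is sum(d_i^2)/(n-1)
d = [y - r * x for x, y in zip(area, grouse)]
s2_diff = sum(di ** 2 for di in d) / (n - 1)
se_r = math.sqrt((1 / xbar ** 2) * (s2_diff / n) * (1 - n / N))

tau_y = tau_x * r                # expand the ratio by the known total area
se_tau_y = tau_x * se_r
print(round(r, 3), round(se_r, 4), round(tau_y, 1), round(se_tau_y, 1))
# 2.081 grouse/ha (se 0.1269); total 6273.3 grouse (se 382.6)
```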
SAS analysis The analysis is done in SAS using the program grouse.sas from the Sample Program Library http:// www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The SAS program starts in the usual fashion:
data grouse;
  input area grouse;   /* sampling weights not needed */
The Data step reads in the data. It is not necessary to include a computation of the sampling weight if the data are collected in a simple random sample for a ratio estimator - the weights will cancel out in the formulae used by SAS. Proc Gplot creates the standard plot of the number of grouse vs. the area of each pocket.
proc gplot data=grouse;
  title2 'plot to assess assumptions';
  plot grouse*area;
run;
The SurveyMeans procedure can estimate the ratio of grouse/ha but cannot directly estimate the population total.
proc surveymeans data=grouse ratio clm N=248;
  /* the ratio clm keywords request a ratio estimator and a confidence interval */
  title2 'Estimation using a ratio estimator';
  var grouse area;
  ratio grouse / area;
  ods output ratio=outratio;   /* extract information so that total can be estimated */
run;
The ODS statement redirects the results from the RATIO statement to a new dataset that is processed further to multiply by the total area of the pockets.
data outratio;   /* compute estimates of the total */
  set outratio;
  Est_total = ratio   * 3015;
  Se_total  = stderr  * 3015;
  UCL_total = uppercl * 3015;
  LCL_total = lowercl * 3015;
  format est_total se_total ucl_total lcl_total 7.1;
  format ratio stderr lowercl uppercl 7.3;
run;
The output is as follows:

Data Summary
Number of Observations    12

Statistics
Variable       Mean        Std Error of Mean    95% CL for Mean
grouse         21.916667   4.772130             11.4132790   32.4200543
area           10.533333   2.227746              5.6300968   15.4365699

Ratio Analysis
Numerator   Denominator      Ratio     Std Err    95% CL for Ratio
grouse      area          2.080696    0.126893    1.80140636   2.35998605

Obs   Ratio   StdErr   LowerCL   UpperCL   Est_total   Se_total   LCL_total   UCL_total
  1   2.081    0.127     1.801     2.360      6273.3      382.6      5431.2      7115.4
The results are exactly the same as before. Again, it is easiest to do the sample size computations in Excel. We must first estimate the ratio (grouse/hectare), and then expand this to estimate the overall number of grouse.

CAUTION. Ordinary regression estimation from standard statistical packages provides only an APPROXIMATION to the correct analysis of survey data. There are three problems in using standard statistical packages for regression and ratio estimation of survey data:

• Assumes that a simple random sample was taken. If the sampling design is not a simple random sample, then regular regression cannot be used.

• Unable to use a finite population correction factor. This is usually not a problem unless the sample size is large relative to the population size.

• Wrong error structure. Standard regression analyses assume that the variance around the regression or ratio line is constant. In many survey problems this is not true. This can be partially alleviated through the use of weighted regression, but this still does not completely fix the problem.

For further information about the problems of using standard statistical software packages in survey sampling please refer to the article at http://www.fas.harvard.edu/~stats/survey-soft/donna_brogan.html.

Because the ratio estimator assumes that the variance of the response increases with the value of X, a new column representing the inverse of the X variable (i.e. 1/area of pocket) has been created. The graph looks like it is linear through the origin, which is one of the assumptions of the ratio estimator. The estimated density is 2.081 (se .123) grouse/hectare. The point estimate is bang on, and the estimated se is within 1% of the correct se. This now needs to be multiplied by the total area of the pockets (3015 ha), which gives an estimated total number of grouse of 6274 (se 371). [Again the estimated se is slightly smaller because of the lack of a finite population correction.]
The ratio estimator is much more precise than the inflation estimator because of the strong relationship between the number of grouse and the area of the pocket.
Post mortem - a question to ponder

What if it were to turn out that grouse population size tended to be proportional to the perimeter of a pocket of bush rather than its area? Would using the above ratio estimator based on a relationship with area introduce serious bias into the ratio estimate, increase the standard error of the ratio estimate, or do both?
3.9 Additional ways to improve precision
This section will not be examined on the exams or term tests
3.9.1 Using both stratification and auxiliary variables
It is possible to use both methods to improve precision. However, this comes at a cost of increased computational complexity. There are two ways of combining ratio estimators in stratified simple random sampling.

1. combined ratio estimate: Estimate the numerator and denominator using stratified random sampling and then form the ratio of these two estimates:
\[ r_{\text{stratified,combined}} = \frac{\hat{\mu}_{Y,\text{stratified}}}{\hat{\mu}_{X,\text{stratified}}} \quad\text{and}\quad \hat{\tau}_{Y,\text{stratified,combined}} = \frac{\hat{\mu}_{Y,\text{stratified}}}{\hat{\mu}_{X,\text{stratified}}}\,\tau_X \]
We won't consider the estimates of the se in this course, but they can be found in any textbook on sampling.

2. separate ratio estimator: make a ratio estimate for each stratum, and form a grand ratio by taking a weighted average of these estimates. Note that we weight by the covariate totals rather than the stratum sizes. We get the following estimators for the grand ratio and grand total:
\[ r_{\text{stratified,separate}} = \frac{1}{\tau_X}\sum_{h=1}^{H} \tau_{X_h} r_h \quad\text{and}\quad \hat{\tau}_{Y,\text{stratified,separate}} = \sum_{h=1}^{H} \tau_{X_h} r_h \]
Again, we won't worry about the estimates of the se. Why use one over the other?

• You need the stratum totals of X for the separate estimate, but only the population total for the combined estimate.

• The combined ratio is less subject to risk of bias (see Cochran, p. 165 and following). In general, the biases in the separate estimator are added together, and if they all fall in the same direction there is trouble. In the combined estimator these biases are reduced through the stratification of the numerator and denominator.
• When the ratio estimate is appropriate (regression through the origin and variance proportional to the covariate), the last term vanishes. Consequently, the combined ratio estimator will have greater standard error than the separate ratio estimator unless R is relatively constant from stratum to stratum. However, as noted above, the bias may be more severe for the separate ratio estimator. You must consider the combined effects of bias and precision, i.e. the MSE.
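To make the two estimators concrete, here is a minimal numerical sketch (all stratum means, totals, and weights below are invented for illustration; the se formulas are deliberately omitted, as in the text):

```python
# Hypothetical two-stratum example contrasting the combined and separate
# ratio estimators of a population total.
xbar  = {"h1": 10.0,  "h2": 40.0}     # sample means of X by stratum
ybar  = {"h1": 21.0,  "h2": 78.0}     # sample means of Y by stratum
tau_x = {"h1": 500.0, "h2": 2000.0}   # known stratum totals of X
W     = {"h1": 0.2,   "h2": 0.8}      # stratum weights N_h / N (assumed)

tau_X = sum(tau_x.values())

# Combined: ratio of the stratified means, then expand by tau_X
mu_y_strat = sum(W[h] * ybar[h] for h in W)
mu_x_strat = sum(W[h] * xbar[h] for h in W)
r_combined = mu_y_strat / mu_x_strat
tau_combined = r_combined * tau_X

# Separate: per-stratum ratios r_h, weighted by the stratum totals of X
r_sep = sum(tau_x[h] * (ybar[h] / xbar[h]) for h in W) / tau_X
tau_separate = r_sep * tau_X
print(round(tau_combined, 1), round(tau_separate, 1))
```

Note that the two estimators generally disagree unless the per-stratum ratios are equal.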
3.9.2 Regression Estimators
A ratio estimator works well when the relationship between Yi and Xi is linear, through the origin, with the variance of observations about the ratio line increasing with X. In some cases, the relationship may be linear, but not through the origin. In these cases, the ratio estimator is generalized to a regression estimator where the linear relationship is no longer constrained to go through the origin. We won't be covering this in this course. Regression estimators are also useful if there is more than one X variable. Whenever you use a regression estimator, be sure to plot y vs. x to assess if the assumptions for a ratio estimator are reasonable. CAUTION: If ordinary statistical packages are used to do regression analysis on survey data, you could obtain misleading results because the usual packages ignore the way in which the data were collected. Virtually all standard regression packages assume you've collected data under a simple random sample. If your sampling design is more complex, e.g. a stratified design, cluster design, multi-stage design, etc., then you should use a package specifically designed for the analysis of survey data, e.g. SAS and the Proc SurveyReg procedure.
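Although regression estimation is not covered in this course, the usual form of the estimator of a total is \(\hat{\tau} = N(\bar{y} + b(\mu_X - \bar{x}))\), where b is the fitted slope and \(\mu_X\) is the known population mean of X. A hypothetical sketch (all data and population figures below are invented for illustration):

```python
import numpy as np

# Hypothetical sample and population figures, for illustration only
x = np.array([4.2, 7.9, 3.1, 6.5, 5.0])
y = np.array([11.0, 19.5, 9.2, 16.1, 13.0])
N = 100        # population size (assumed)
mu_x = 5.6     # known population mean of X (assumed)

b = np.polyfit(x, y, 1)[0]   # least-squares slope; intercept NOT forced to zero
tau_hat = N * (y.mean() + b * (mu_x - x.mean()))
print(round(tau_hat, 1))
```

The adjustment term b(mu_x - xbar) corrects the sample mean for how far the sample happened to fall from the known population mean of X.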
3.9.3 Sampling with unequal probability - pps sampling
All of the designs discussed in previous sections have assumed that each sample unit was selected with equal probability. In some cases, it is advantageous to select units with unequal probabilities, particularly if they differ in their contribution to the overall total. This technique can be used with any of the sampling designs discussed earlier. An unequal probability sampling design can lead to smaller standard errors (i.e. better precision) for the same total effort compared to an equal probability design. For example, forest stands may be selected with probability proportional to the area of the stand (i.e. a stand of 200 ha will be selected with twice the probability of a stand of 100 ha) because large stands contribute more to the overall population, and it would be wasteful of sampling effort to spend much effort on smaller stands. The variable used to assign the probabilities of selection to individual study units does not need to have an exact relationship with an individual unit's contribution to the total. For example, in probability proportional
to prediction (3P sampling), all trees in a small area are visited. A simple, cheap characteristic is measured and used to predict the value of each tree. A sub-sample of the trees is then selected with probability proportional to the predicted value and remeasured using a more expensive measuring device, and the relationship between the cheap and expensive measurements in the second phase is used with the simple measurements from the first phase to obtain a more precise estimate for the entire area. This is an example of two-phase sampling with unequal probability of selection. Please consult with a sampling expert before implementing or analyzing an unequal probability sampling design.
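To illustrate why pps selection pays off (a sketch going slightly beyond what the text computes): under pps sampling with replacement, the Hansen-Hurwitz estimator \(\hat{\tau} = \frac{1}{n}\sum y_i/p_i\) is unbiased for the total, and when y is exactly proportional to the size measure every possible sample gives the same answer - zero variance. All stand areas and responses below are hypothetical:

```python
import random

# Hypothetical stand areas (ha); selection probability proportional to area
areas = [200, 100, 150, 50, 300, 120, 80]
total_area = sum(areas)                      # 1000 ha
p = [a / total_area for a in areas]

# Suppose the response is exactly proportional to area: y_i = 2 * area_i,
# so the true population total is 2 * 1000 = 2000
y = [2.0 * a for a in areas]

random.seed(42)
n = 3
sample = random.choices(range(len(areas)), weights=p, k=n)  # pps, with replacement

# Hansen-Hurwitz estimator of the population total
tau_hat = sum(y[i] / p[i] for i in sample) / n
print(round(tau_hat, 6))  # hits the true total for any draw
```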
3.10 Cluster sampling
In some cases, units in a population occur naturally in groups or clusters. For example, some animals congregate in herds or family units. It is often convenient to select a random sample of herds and then measure every animal in the herd. This is not the same as a simple random sample of animals because individual animals are not randomly selected; the herds are the sampling units. The strip-transect example in the section on simple random sampling is also a cluster sample; all plots along a randomly selected transect are measured. The strips are the sampling units, while plots within each strip are sub-sampling units. Another example is circular plot sampling; all trees within a specified radius of a randomly selected point are measured. The sampling unit is the circular plot, while trees within the plot are sub-samples.

Some examples of cluster samples are:

• urchin estimation - transects are taken perpendicular to the shore and a diver swims along the transect and counts the number of urchins in each m² along the line.

• aerial surveys - a plane flies along a line and observers count the number of animals they see in a strip on both sides of the aircraft.

• forestry surveys - often circular plots are located on the ground and ALL trees within each plot are measured.

Pitfall: A cluster sample is often mistakenly analyzed using methods for simple random surveys. This is not valid because units within a cluster are typically positively correlated. The effect of this erroneous analysis is to come up with an estimate that appears to be more precise than it really is, i.e. the estimated standard error is too small and does not fully reflect the actual imprecision in the estimate.

Solution: You will be pleased to know that, in fact, you already know how to design and analyze cluster samples! The proper analysis treats the clusters as a random sample from the population of clusters, i.e.
treat the cluster as a whole as the sampling unit, and deal only with the cluster totals as the response measure.
3.10.1 Sampling plan
In simple random sampling, a frame of all elements was required in order to draw a random sample. Individual units are selected one at a time. In many cases this is impractical: it may not be possible to list all of the individual units, or it may be logistically impossible to do so. In many cases the individual units appear together in clusters. This is particularly true if the sampling unit is a transect - almost always you measure things at the individual quadrat level, but the actual sampling unit is the cluster. This problem is analogous to pseudo-replication in experimental design - breaking the transect into individual quadrats is like having multiple fish within the same tank.

A visual comparison of a simple random sample vs. a cluster sample

You may find it useful to compare a simple random sample of 24 vs. a cluster sample of 24 using the following visual plans. Select a sample of 24 in each case.
Simple Random Sampling
Describe how the sample was taken.
Cluster Sampling

First, the clusters must be defined. In this case, the units are naturally clustered in blocks of size 8. The following units were selected.
Describe how the sample was taken. Note the differences between stratified simple random sampling and cluster sampling!
3.10.2 Advantages and disadvantages of cluster sampling compared to SRS
• Advantage: It may not be feasible to construct a frame for every elemental unit, but it may be possible to construct a frame for larger units; e.g. it is difficult to locate individual quadrats on the sea floor, but easy to lay out transects from the shore.

• Advantage: Cluster sampling is often more economical. Because all units within a cluster are close together, travel costs are much reduced.

• Disadvantage: Cluster sampling has a higher standard error than an SRSWOR of the same total size because units are typically homogeneous within clusters. The cluster itself serves as the sampling unit. For the same number of units, cluster sampling almost always gives worse precision. This is the problem of pseudo-replication that we have seen earlier.

• Disadvantage: A cluster sample is more difficult to analyze, but with modern computing equipment this is less of a concern. The difficulties are not arithmetic but rather being forced to treat the clusters as the survey unit - there is a natural tendency to think that data are being thrown away.

The perils of ignoring a cluster design

The cluster design is frequently used in practice, but often analyzed incorrectly. For example, whenever the quadrats have been gathered along a transect of some sort, you have a cluster sampling design. The key thing to note is that the sampling unit is the cluster, not the individual quadrats. The biggest danger of ignoring the clustering and treating the individual quadrats as if they came from an SRS is that, typically, your reported se will be too small. That is, the true standard error of your design may be substantially larger than the estimated standard error obtained from an SRS analysis. The precision is (erroneously) thought to be far better than is justified by the survey results. This has been seen before - refer to the paper by Underwood where the dangers of estimation with positively correlated data were discussed.
3.10.3 Notation
The key thing to remember is to work with the cluster TOTALS. Traditionally, the cluster size is denoted by M rather than by X but, as you will see in a few moments, estimation in cluster sampling is nothing more than ratio estimation performed on the cluster totals.
Attribute              Population value   Sample value
Number of clusters     N                  n
Cluster totals         τ_i                y_i
Cluster sizes          M_i                m_i
Total area             M

NOTE: τ_i and y_i are the TOTALS for cluster i.

3.10.4 Summary of main results
The key concept in cluster sampling is to treat the cluster TOTAL as the response variable and ignore all the individual values within the cluster. Because the clusters are a simple random sample from the population of clusters, simply apply all the results you had before for a SRS to the CLUSTER TOTALS. The analysis of a cluster design will require the size of each cluster - this is simply the number of sub-units within each cluster.

If the clusters are roughly equal in size, a simple inflation estimator can be used. But in many cases there is a strong relationship between the size of the cluster and the cluster total - in these cases a ratio estimator would likely be more suitable (i.e. will give you a smaller standard error), where the X variable is the cluster size. If there is no relationship between cluster size and cluster total, a simple inflation estimator can be used even in the case of unequal cluster sizes. You should do a preliminary plot of the cluster totals against the cluster sizes to see if this relationship holds.

Extensions of cluster analysis - unequal size sampling

In some cases, the clusters are of quite unequal sizes. A better design choice may be to select clusters with an unequal probability design rather than using a simple random sample. In this case, clusters that are larger typically contribute more to the population total and would be selected with a higher probability.

Computational formulae

\[ \text{Overall mean:}\quad \mu = \frac{\sum_{i=1}^{N}\tau_i}{\sum_{i=1}^{N}M_i}, \qquad \hat{\mu} = \frac{\sum_{i=1}^{n}y_i}{\sum_{i=1}^{n}m_i}, \qquad se(\hat{\mu}) = \sqrt{\frac{1}{\bar{m}^2}\,\frac{s^2_{\mathit{diff}}}{n}\,(1-f)} \]
\[ \text{Overall total:}\quad \tau = M\times\mu, \qquad \hat{\tau} = M\times\hat{\mu}, \qquad se(\hat{\tau}) = \sqrt{M^2\times\frac{1}{\bar{m}^2}\,\frac{s^2_{\mathit{diff}}}{n}\,(1-f)} \]
(where \(\bar{m}\) is the average size of the sampled clusters)
• You never use the mean per unit within a cluster.
• The term
\[ s^2_{\mathit{diff}} = \frac{\sum_{i=1}^{n}\left(y_i - \hat{\mu} m_i\right)^2}{n-1} \]
is again found in the same fashion as in ratio estimation - create a new variable which is the difference \(y_i - \hat{\mu} m_i\), find the sample standard deviation of it, and then square the standard deviation.

• Sometimes the ratio of two variables measured within each cluster is required; e.g. you conduct aerial surveys to estimate the ratio of wolves to moose - this has already been done in an earlier example! In these cases, the actual cluster length is not used.

Confidence intervals

As before, once you have an estimator for the mean and for the se, use the usual ±2se rule. If the number of clusters is small, then some textbooks advise using a t-distribution for the multiplier - this is not covered in this course.

Sample size determination

Again, this is no real problem - except that you will get a value for the number of CLUSTERS, not the individual quadrats within the clusters.
3.10.5 Example - estimating the density of urchins
Red sea urchins are considered a delicacy and the fishery is worth several million dollars to British Columbia. In order to set harvest quotas and to monitor the stock, it is important that the density of sea urchins be determined each year. To do this, the managers lay out a number of transects perpendicular to the shore in the urchin beds. Divers then swim along each transect, rolling a 1 m² quadrat along the transect line and counting the number of legal sized and sub-legal sized urchins in the quadrat. The number of possible transects is so large that the correction for finite population sampling can be ignored. The dataset contains variables for the transect, the quadrat within each transect, and the number of legal and sub-legal sized urchins counted in that quadrat.

What is the population of interest and the parameter?

The population of interest is the sea urchins in the harvest area. These happen to be (artificially) "clustered" into transects which are sampled. All sea urchins within the cluster are measured. The parameter of interest is the density of legal sized urchins.
What is the frame?

The frame is conceptual - there is no predefined list of all the possible transects. Rather, the managers pick random points along the shore and then lay the transects out from those points.

What is the sampling design?

The sampling design is a cluster sample - the clusters are the transect lines, while the quadrats measured within each cluster are similar to pseudo-replicates. The measurements within a transect are not independent of each other and are likely positively correlated (why?). As the points along the shore were chosen using a simple random sample, the analysis proceeds as a SRS design on the cluster totals.
Excel Analysis

An Excel worksheet with the data and analysis is called urchin and is available in the AllofData workbook from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. A reduced view appears below:
Summarize to cluster level

The key first step in any analysis of a cluster survey is to summarize the data to the cluster level. You will need the cluster total and the cluster size (in this case the length of the transect). The Pivot Table feature of Excel is quite useful for doing this automatically. Unfortunately, you still have to play around with the final table in order to get the data displayed in a nice format. In many transect studies, there is a tendency NOT to record quadrats with 0 counts as they don't affect the cluster sum. However, you still have to know the correct size of the cluster (i.e. how many quadrats), so you can't simply ignore these 'missing' values. In this case, you could examine the maximum of the quadrat number and the number of listed quadrats to see if these agree (why?).

Preliminary plot

Plot the cluster totals vs. the cluster size to see if a ratio estimator is appropriate, i.e. a linear relationship through the origin with variance increasing with cluster size. The plot (not shown) shows a weak relationship between the two variables.

Summary Statistics

Compute the summary statistics on the cluster TOTALS. You will need the totals over all sampled clusters of both variables.

sum(legal)   sum(quad)   n(transect)
1507         1120        28
Compute the ratio

The estimated density is then
\[ \widehat{\mathit{density}} = \frac{\mathit{sum(legal)}}{\mathit{sum(quad)}} = 1507/1120 = 1.345536 \text{ urchins/m}^2 \]
Compute the difference column To compute the se, create the diff column as in the ratio estimation section and find its standard deviation. Compute the se of the ratio estimate r d The estimated se is then found as: se(density) =
s2diff ntransects
×
1 2 quad
=
q
48.099332 28
×
1 402
= 0.2272
urchins/m2 . (Optional) Expand final answer to a population total In order to estimate the total number of urchins in the harvesting area, you simply multiply the estimated ratio and its standard error by the area to be harvested. c
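The density and its standard error can be reproduced from the cluster-level summary statistics quoted above. A minimal Python sketch (s_diff = 48.09933 is taken from the worksheet's diff column):

```python
import math

sum_legal = 1507      # total legal-sized urchins over all sampled transects
sum_quad = 1120       # total number of 1 m^2 quadrats (sum of cluster sizes)
n_transects = 28
s_diff = 48.09933     # std deviation of the diff column (from the worksheet)

density = sum_legal / sum_quad                  # urchins per m^2
quad_bar = sum_quad / n_transects               # mean quadrats per transect
se_density = math.sqrt(s_diff**2 / n_transects * 1 / quad_bar**2)

print(round(density, 6), round(se_density, 4))  # 1.345536 0.2272
```

To expand to a population total, both density and se_density would simply be multiplied by the area to be harvested.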
SAS Analysis

SAS v.8 has procedures for the analysis of survey data taken in a cluster design. A program to analyze the data is urchin.sas and is available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The SAS program starts by reading in the data at the individual transect level:
data urchin;
   infile urchin firstobs=2 missover;  /* the first record has the variable names */
   input transect quadrat legal sublegal;
   /* no need to specify sampling weights because transects are an SRS */
run;
The transect totals for the urchin counts, along with the transect lengths, are computed using Proc Means:
proc sort data=urchin; by transect;
proc means data=urchin noprint;
   by transect;
   var quadrat legal;
   output out=check min=min max=max n=n sum(legal)=tlegal;
run;
and then plotted:
proc gplot data=check;
   title2 'plot the relationship between the cluster total and cluster size';
   plot tlegal*n=1;   /* use the transect number as the plotting character */
   symbol1 v=plus pointlabel=("#transect");
run;
Because we are computing a ratio estimator from a simple random sample of transects, it is not necessary to specify the sampling weights. The key feature of the SAS program is the use of the CLUSTER statement to identify the clusters in the data.
proc surveymeans data=urchin;   /* do not specify a pop size as the fpc is negligible */
   cluster transect;
   var legal;
run;
The population number of transects was not specified as the finite population correction is negligible. Here are the results:
Data Summary

   Number of Clusters          28
   Number of Observations    1120

Statistics

   Variable      N       Mean       Std Error of Mean   95% CL for Mean
   legal        1120     1.345536   0.227248            0.87926137   1.81181006
The results are identical to those above.

When using standard computer packages, the first step in the analysis is again to summarize up to the cluster level: you need the total for each cluster and the size of each cluster. Note that there was no transect numbered 5, 12, 17, 19, or 32. Why are these transects missing? According to the records of the survey, inclement weather caused cancellation of the missing transects. It seems reasonable to treat the missing transects as missing completely at random (MCAR). In this case, there is no problem in simply ignoring the missing data - all that happens is that the precision is reduced compared to the design with all data present. We compare the maximum quadrat number to the number of quadrat values actually recorded and see that they all match, indicating that no empty quadrats went unrecorded.

Now we are back to the case of a ratio estimator with the Y variable being the number of legal-sized urchins measured on the transect, and the X variable being the size of the transect. As in the previous examples of a ratio estimator, we create a weighting variable equal to 1/X = 1/(size of transect). The estimated density is 1.346 (se 0.226) urchins/m². The se is a bit smaller because of the lack of a finite population correction factor, but is within 1% of the correct se.
Planning for future experiments

The rse of the estimate is 0.2274/1.3455 = 17%, which is not terrific. The determination of sample size is done in the same manner as in the ratio estimator case dealt with in earlier sections, except that it is the number of CLUSTERS that is found. If we wanted an rse near 5%, we would need almost 320 transects - likely too costly.
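The cluster count needed for a target rse follows from the usual 1/sqrt(n) scaling used in the earlier sample-size sections; a quick sketch:

```python
import math

rse_current = 0.2272 / 1.345536   # current rse, about 17%
n_current = 28                    # transects in the current survey
rse_target = 0.05                 # desired rse

# The rse scales roughly as 1/sqrt(number of clusters), so
n_needed = n_current * (rse_current / rse_target) ** 2
print(math.ceil(n_needed))        # -> 320
```

This ignores any finite population correction, which is negligible here.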
3.10.6 Example - estimating the total number of sea cucumbers
Sea cucumbers are considered a delicacy by some, and the fishery is of growing importance.
In order to set harvest quotas and to monitor the stock, it is important that the number of sea cucumbers in a certain harvest area be estimated each year. The following is an example taken from Griffith Passage in BC, 1994.

To do this, the managers lay out a number of transects across the cucumber harvest area. Divers then swim along the transect and, while carrying a 4 m wide pole, count the number of cucumbers within the width of the pole during the swim.

The number of possible transects is so large that the correction for finite population sampling can be ignored. Here is the summary information at the transect level (the preliminary raw data are unavailable):

   Transect     Sea
   Area (m²)    Cucumbers
      260         124
      220          67
      200           6
      180          62
      120          35
      200           3
      200           1
      120          49
      140          28
      400           1
      120          89
      120         116
      140          76
      800          10
     1460          50
     1000         122
      140          34
      180         109
       80          48

The total harvest area is 3,769,280 m² as estimated by a GIS system. The transects were laid out from one edge of the bed, and the length of the edge is 51,436 m. Note that because each transect was 4 m wide, the number of possible transects is 1/4 of this value.
What is the population of interest and the parameter?

The population of interest is the sea cucumbers in the harvest area. These happen to be (artificially) "clustered" into transects, which are the sampling units. All sea cucumbers within a transect (cluster) are measured. The parameter of interest is the total number of cucumbers in the harvest area.

What is the frame?

The frame is conceptual - there is no predefined list of all the possible transects. Rather, a random point is picked along the edge of the harvest area, and the transect is laid out from there.

What is the sampling design?

The sampling design is a cluster sample - the clusters are the transect lines, while the quadrats measured within each cluster are similar to pseudo-replicates. The measurements within a transect are not independent of each other and are likely positively correlated (why?).

Analysis - abbreviated

As the analysis is similar to the previous example, a detailed description of the Excel, SAS, R, or JMP versions will not be given. The worksheet cucumber, available in the AllofData workbook from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms, illustrates the computations in Excel. There are three different surveys illustrated. It also computes the two estimators when two potential outliers are deleted, and for a second harvest area.

Summarize to cluster level

The key first step in any analysis of a cluster survey is to summarize the data to the cluster level. You will need the cluster total and the cluster size (in this case the area of the transect). This has already been done in the above data. Now this summary table is simply an SRSWOR from the set of all transects. We first estimate the density, and then multiply by the area to estimate the total. Note that after summarizing up to the transect level, this example proceeds in an analogous fashion to the grouse-in-pockets-of-brush example examined earlier.
Preliminary Plot

A plot of the cucumber totals vs. the transect sizes shows a very poor relationship between the two variables. It will be interesting to compare the results from the simple inflation estimator and the ratio estimator.
Simple Inflation Estimator

First, estimate the number ignoring the areas of the transects by using a simple inflation estimator. The summary statistics that we need are:

   n          19 transects
   Mean       54.21 cucumbers/transect
   Std Dev    42.37 cucumbers/transect

We compute an estimate of the total as

   τ-hat = N ȳ = (51,436/4) × 54.21 = 697,093 sea cucumbers.

[Why did we use 51,436/4 rather than 51,436?]

We compute an estimate of the se of the total as:

   se(τ-hat) = sqrt( N² s²/n × (1 − f) ) = sqrt( (51,436/4)² × 42.37²/19 ) = 124,981 sea cucumbers.

The finite population correction factor is so small we simply ignore it. This gives a relative standard error (se/est) of 18%.

Ratio Estimator

We use the methods outlined earlier for ratio estimators from an SRSWOR to get the following summary table:

           area      cucumbers
   Mean    320.00    54.21 per transect

The estimated density of sea cucumbers is then

   density-hat = mean(cucumbers)/mean(area) = 54.21/320.00 = 0.169 cucumbers/m².

To compute the se, create the diff column as in the ratio estimation section and find its standard deviation, s_diff = 73.63. The estimated se of the ratio is then found as:

   se(density-hat) = sqrt( s_diff²/n_transects × 1/area-bar² ) = sqrt( 73.63²/19 × 1/320² ) = 0.053 cucumbers/m².

We once again ignore the finite population correction factor.

In order to estimate the total number of cucumbers in the harvesting area, simply multiply the above by the area to be harvested:

   τ-hat_ratio = area × density-hat = 3,769,280 × 0.169 = 638,546 sea cucumbers.

The se is found as:

   se(τ-hat_ratio) = area × se(density-hat) = 3,769,280 × 0.053 = 198,983 sea cucumbers,

for an overall rse of 31%.
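Both estimators can be computed directly from the transect summary table. The Python sketch below reproduces the hand computations (up to rounding):

```python
import math

# (transect area m^2, sea cucumber count) from the summary table
data = [(260, 124), (220, 67), (200, 6), (180, 62), (120, 35),
        (200, 3), (200, 1), (120, 49), (140, 28), (400, 1),
        (120, 89), (120, 116), (140, 76), (800, 10), (1460, 50),
        (1000, 122), (140, 34), (180, 109), (80, 48)]
areas = [a for a, _ in data]
cukes = [c for _, c in data]
n = len(data)
N = 51436 / 4                      # number of possible 4 m wide transects

# Simple inflation estimator (fpc ignored)
ybar = sum(cukes) / n
s = math.sqrt(sum((y - ybar) ** 2 for y in cukes) / (n - 1))
total_infl = N * ybar
se_infl = math.sqrt(N**2 * s**2 / n)

# Ratio estimator of density, expanded by the harvest area
xbar = sum(areas) / n
r = ybar / xbar                    # estimated cucumbers per m^2
diffs = [y - r * x for x, y in data]
s_diff = math.sqrt(sum(d**2 for d in diffs) / (n - 1))
se_r = math.sqrt(s_diff**2 / n / xbar**2)
total_ratio = 3_769_280 * r
se_ratio = 3_769_280 * se_r

print(round(total_infl), round(se_infl))    # about 697,093 and 124,981
print(round(total_ratio), round(se_ratio))  # about 638,546 and 198,983
```

Note that the diff column sums to zero by construction, so its variance is just the sum of squared diffs divided by n − 1, matching the hand method above.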
SAS Analysis

The SAS program is available in cucumber.sas and the relevant output in cucumber.lst in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. However, because only the summary data are available, you cannot use the CLUSTER statement of Proc SurveyMeans. Rather, as noted earlier in the notes, you form a ratio estimator based on the cluster totals. The data are read in the Data step, already summarized to the cluster level:
data cucumber;
   input area cucumbers;
   transect = _n_;   /* number the transects */
Do a plot to see the relationship between transect area and numbers of cucumbers:
proc gplot data=cucumber;
   title2 'plot the relationship between the cluster total and cluster size';
   plot cucumbers*area=1;   /* use the transect number as the plotting character */
   symbol1 v=plus pointlabel=("#transect");
run;
Because the relationship between the number of cucumbers and the transect area is not very strong, a simple inflation estimator will be tried first. The sample weights must be computed. Each weight is equal to the total number of possible transects divided by the number of transects sampled:
/* First compute the sampling weight and add to the dataset */
/* The sampling weight is simply the total pop size / # sampling units in an SRS */
/* In this example, transects were an SRS from all possible transects */
proc means data=cucumber n mean std;
   var cucumbers;   /* get the total number of transects */
   output out=weight n=samplesize;
run;

data cucumber;
   merge cucumber weight;
   retain samplingweight;
   if samplesize > . then samplingweight = 51436/4 / samplesize;
run;
And then the simple inflation estimator is used via Proc SurveyMeans:
proc surveymeans data=cucumber mean clm sum clsum cv;
   /* N not specified as we ignore the fpc in this problem */
   /* mean clm  - find estimate of mean and confidence intervals */
   /* sum clsum - find estimate of grand total and confidence intervals */
   title2 'Simple inflation estimator using cluster totals';
   var cucumbers;
   weight samplingweight;
run;
Data Summary

   Number of Observations      19
   Sum of Weights           12859

Statistics

   Variable    Mean        Std Error of Mean   95% CL for Mean            Coeff of Variation
   cucumbers   54.210526   9.719330            33.7909719   74.6300807    0.179289

               Sum      Std Dev   95% CL for Sum
               697093   124981    434518   959668
Now for the ratio estimator. First use Proc SurveyMeans to compute the density, and then inflate the density by the total area of the cucumber bed:
proc surveymeans data=cucumber ratio clm;
   /* the ratio clm keywords request a ratio estimator and a confidence interval */
   title2 'Estimation using a ratio estimator';
   var cucumbers area;
   ratio cucumbers / area;
   ods output ratio=outratio;   /* extract information so that total can be estimated */
run;

data outratio;   /* compute estimates of the total */
   set outratio;
   cv = stderr / ratio;   /* the relative standard error of the estimate */
   Est_total = ratio * 3769280;
   Se_total  = stderr * 3769280;
   UCL_total = uppercl * 3769280;
   LCL_total = lowercl * 3769280;
   format est_total se_total ucl_total lcl_total 7.1;
   format cv 7.2;
   format ratio stderr lowercl uppercl 7.3;
run;
This gives the final results:

   Obs   Ratio   StdErr   cv     LowerCL   UpperCL
   1     0.169   0.053    0.31   0.058     0.280

   Obs   Est_total   Se_total   LCL_total   UCL_total
   1     638546      198983     220498      1056593
Comparing the two approaches Why did the ratio estimator do worse in this case than the simple inflation estimator in Griffiths Passage? The plot the number of sea cucumbers vs. the area of the transect shows virtually no relationship between the two - hence there is no advantage to using a ratio estimator. In more advanced courses, it can be shown that the ratio estimator will do better than the inflation estimator if the correlation between the two variables is greater than 1/2 of the ratio of their respective relative variation (std dev/mean). Advanced computations shows that half of the ratio of their relative variations is 0.732, while the correlation between the two variables is 0.041. Hence the ratio estimator will not do well. The Excel worksheet also repeats the analysis for Griffith Passage after dropping some obvious outliers. This only makes things worse! As well, at the bottom of the worksheet, a sample size computation shows that substantially more transects are needed using a ratio estimator than for a inflation estimator. It appears that in Griffith Passage, that there is a negative correlation between the length of the transect and the number of cucumbers found! No biological reason for this has been found. This is a cautionary example to illustrate the even the best laid plans can go astray - always plot the data. A third worksheet in the workbook analyses the data for Sheep Passage. Here the ratio estimator outperforms the inflation estimator, but not by a wide factor.
3.11 Multi-stage sampling - a generalization of cluster sampling

Not part of Stat 403/650. Please consult with a sampling expert before implementing or analyzing a multi-stage design.
3.11.1 Introduction

All of the designs considered above select a sampling unit from the population and then take a complete measurement on that unit. In the case of cluster sampling, this is facilitated by dividing the sampling unit into smaller observational units, but all of the observational units within the sampled cluster are measured.

If the units within a cluster are fairly homogeneous, then it seems wasteful to measure every unit. In the extreme case, if every observational unit within a cluster were identical, only a single observational unit from the cluster would need to be selected in order to estimate (without any error) the cluster total.

Suppose then that the observational units within a cluster are not identical, but have some variation. Why not take a sub-sample from each cluster, e.g. in the urchin survey, count the urchins in every second or third quadrat rather than every quadrat on the transect?

This method is called two-stage sampling. In the first stage, larger sampling units are selected using some probability design. In the second stage, smaller units within the selected first-stage units are selected according to a probability design. The design used at each stage can differ, e.g. first-stage units selected using a simple random sample, but second-stage units selected using a systematic design as proposed for the urchin survey above. This sampling design can be generalized to multi-stage sampling.

Some examples of multi-stage designs are:

• Vegetation Resource Inventory. The forest land mass of BC has been mapped using aerial methods and divided into a series of polygons representing homogeneous stands of trees (e.g. a stand dominated by Douglas-fir). In order to estimate timber volumes in an inventory unit, a sample of polygons is selected using a probability-proportional-to-size design. In the selected polygons, ground measurement stations are selected on a 100 m grid and crews measure standing timber at these selected ground stations.
• Urchin survey. Transects are selected using a simple random sample design. Every second or third quadrat is measured after a random starting point.

• Clam surveys. Beaches are divided into 1 ha sections. A random sample of sections is selected and a series of 1 m² quadrats are measured within each section.

• Herring spawn biomass. Schweigert et al. (1985, CJFAS, 42, 1806-1814) used a two-stage design to estimate herring spawn in the Strait of Georgia.
• Georgia Strait Creel Survey. The Georgia Strait Creel Survey uses a multi-stage design to select landing sites within strata, times of day to interview at these selected sites, and which boats to interview in a survey of angling effort on the Georgia Strait.

Some consequences of simple two-stage designs are:

• If the selected first-stage units are completely enumerated, then complete cluster sampling results.
• If every first-stage unit in the population is selected, then a stratified design results.
• A complete frame is required for all first-stage units. However, a frame of second-stage and lower-stage units need only be constructed for the selected upper-stage units.
• The design is very flexible, allowing (in theory) different selection methods to be used at each stage, and even different selection methods within each first-stage unit.
• A separate randomization is done within each first-stage unit when selecting the second-stage units.
• Multi-stage designs are less precise than a simple random sample of the same number of final sampling units, but more precise than a cluster sample of the same number of final sampling units. [Hint: think of what happens if the second-stage units are very similar.]
• Multi-stage designs are cheaper than a simple random sample of the same number of final sampling units, but more expensive than a cluster sample of the same number of final sampling units. [Hint: think of the travel costs in selecting more transects or measuring quadrats within a transect.]
• As in all sampling designs, stratification can be employed at any level, and ratio and regression estimators are available. As expected, the theory becomes more and more complex the more "variations" are added to the design.

The primary incentives for multi-stage designs are that

1. frames of the final sampling units are typically not available, and
2. it often turns out that most of the variability in the population occurs among first-stage units.
Why spend time and effort measuring lower-stage units that are relatively homogeneous within the first-stage unit?
3.11.2 Notation

A sample of n first-stage units (FSUs) is selected from a total of N first-stage units. Within the i-th first-stage unit, m_i second-stage units (SSUs) are selected from the M_i units available.
   Item                   Population value    Sample value
   First-stage units      N                   n
   Second-stage units     M_i                 m_i
   SSUs in population     M = Σ M_i
   Value of SSU           Y_ij                y_ij
   Total of FSU           τ_i                 τ-hat_i = (M_i/m_i) Σ_{j=1}^{m_i} y_ij
   Total in population    τ = Σ τ_i
   Mean in population     µ = τ/M

3.11.3 Summary of main results
We will only consider the case when simple random sampling occurs at both stages of the design. The intuitive explanation for the results is that a total is estimated for each FSU selected (based on the SSU selected). These estimated totals are then used in a similar fashion to a cluster sample to estimate the grand total. Parameter
Population value
Total Mean
τ=
P
µ=
τ M
Estimated Estimate
τi
N n
n P i=1
µ b=
se s
τbi
τb M
se (ˆ τ) = se (ˆ µ) =
N 2 (1 − f1 ) q
s21 n
+
N 2 f1 n2
n P i=1
se2 (ˆ τ) M2
where n P
s21 =
i=1
s22i =
c
2012 Carl James Schwarz
237
2 τbi − τb
n−1 mi P
s2
Mi2 (1 − f2 ) m2ii
2
(yij − yi )
j=1
mi − 1
December 21, 2012
CHAPTER 3. SAMPLING
n
τb =
1X τbi n i=1
f1 = n/N and f2i = mi /Mi Notes: • There are two contributions to the estimated se - variation among first stage totals (s21 ) and variation 2 among second stage units (S2i ). • If the FSU vary considerably in size, a ratio estimator (not discussed in these notes) may be more appropriate. Confidence Intervals The usual large sample confidence intervals can be used.
3.11.4 Example - estimating the number of oysters

A First Nations group wished to develop a wild oyster fishery. As a first stage in the development of the fishery, a survey was needed to establish the current stock in a number of oyster beds. This example looks at the estimate of oyster numbers from a survey conducted in 1994.

The survey was conducted by running a line through the oyster bed - the total length was 105 m. Several random locations were selected along the line. At each randomly chosen location, the width of the bed was measured, and about 3 random locations along the perpendicular transect at that point were taken. A 1 m² quadrat was placed at each location, and the number of oysters of various sizes was counted in the quadrat.
   Location   transect   width (m)   quadrat   seed   xsmall   small   med   large   total count   net weight (kg)
   Lloyd       5         17           3         18     18       41      48    14      139           14.6
   Lloyd       5         17           5          6      4       30       9     4       53            5.2
   Lloyd       5         17          10         15     21       44      13    11      104            8.2
   Lloyd       7         18           5          8     10       14       5     3       40            6.0
   Lloyd       7         18          12         10     38       36      16     4      104           10.2
   Lloyd       7         18          13          0     15       12       3     3       33            4.6
   Lloyd      18         14           1         11      8        5       9    19       52            7.8
   Lloyd      18         14           5         13     23       68      18    11      133           12.6
   Lloyd      18         14           8          1     29       60       2     1       93           10.2
   Lloyd      30         11           3         17      1       13      13     2       46            5.4
   Lloyd      30         11           8         12     16       23      22    14       87            6.6
   Lloyd      30         11          10         23     15       19      17     1       75            7.0
   Lloyd      49          9           3         10     27       15       1     0       53            2.0
   Lloyd      49          9           5         13      7       14      11     4       49            6.8
   Lloyd      49          9           8         10     25       17      16    11       79            6.0
   Lloyd      76         21           4          3      3       11       7     0       24            4.0
   Lloyd      76         21           7         15      4       32      26    24      101           12.4
   Lloyd      76         21          11          2     19       14      19     0       54            5.8
   Lloyd      79         18           1         14     13        7       9     0       43            3.6
   Lloyd      79         18           4          0     32       32      27    16      107           12.8
   Lloyd      79         18          11         16     22       43      18     8      107           10.6
   Lloyd      84         19           1         14     32       25      39     7      117           10.2
   Lloyd      84         19           8         25     43       42      17     3      130            7.2
   Lloyd      84         19          15          5     22       61      30    13      131           14.2
   Lloyd      86         17           8          1     19       32      10     8       70            8.6
   Lloyd      86         17          11          8     17       13      10     3       51            4.8
   Lloyd      86         17          12          7     22       55      11     4       99            9.8
   Lloyd      95         20           1         17     12       20      18     4       71            5.0
   Lloyd      95         20           8         32      4       26      29    12      103           11.6
   Lloyd      95         20          15          3     34       17      11     1       66            6.0
These multi-stage designs are complex to analyze. Rather than trying to implement the various formulae by hand, I would suggest that a proper sampling package (such as SAS or R) be used.

If using simple packages, the first step is to move everything up to the primary sampling unit level. We need to estimate the total for each primary sampling unit, and to compute some components of the variance from the second stage of sampling.

First, estimate the total for each FSU, i.e. the estimated total weight for the entire transect. This is simply the average weight per quadrat times the width of the strip. Second, compute the contribution of the second-stage sampling to the overall variance. [Typically, if the first-stage sampling fraction is small, this component can be ignored.]

Now summarize up to the population level: compute the mean transect total and expand by the number of transects. The variance component from the first stage of sampling is then found, and the final overall se is found by combining the first-stage variance component with the second-stage variance components from each transect.

Our final estimate is a total biomass of 14,070 kg with an estimated se of 1484 kg. A similar procedure can be used for the other variables.
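The two-stage hand computation just described can be sketched in Python. The net weights and bed widths come from the data table; following the sampling weights used in the SAS program below, the 105 m baseline is treated as N = 105 possible transects (an assumption of this sketch):

```python
import math

# Net weight (kg) of the 3 sampled 1 m^2 quadrats on each transect,
# with the bed width (m) at that transect = M_i available quadrats.
transects = {
     5: (17, [14.6, 5.2, 8.2]),
     7: (18, [6.0, 10.2, 4.6]),
    18: (14, [7.8, 12.6, 10.2]),
    30: (11, [5.4, 6.6, 7.0]),
    49: (9,  [2.0, 6.8, 6.0]),
    76: (21, [4.0, 12.4, 5.8]),
    79: (18, [3.6, 12.8, 10.6]),
    84: (19, [10.2, 7.2, 14.2]),
    86: (17, [8.6, 4.8, 9.8]),
    95: (20, [5.0, 11.6, 6.0]),
}
N, n = 105, len(transects)           # possible vs sampled first-stage units
f1 = n / N

# Stage 1: estimated total per transect, tau_i = (M_i/m_i) * sum(y_ij)
tau = {t: M / len(y) * sum(y) for t, (M, y) in transects.items()}
tau_bar = sum(tau.values()) / n
total = N * tau_bar                  # estimated total biomass (kg)

# First-stage variance component: N^2 (1 - f1) s1^2 / n
s1_sq = sum((v - tau_bar) ** 2 for v in tau.values()) / (n - 1)
var1 = N**2 * (1 - f1) * s1_sq / n

# Second-stage variance component: (N^2 f1 / n^2) * sum of M_i^2 (1 - f2i) s2i^2 / m_i
var2 = 0.0
for t, (M, y) in transects.items():
    m = len(y)
    ybar = sum(y) / m
    s2_sq = sum((v - ybar) ** 2 for v in y) / (m - 1)
    var2 += M**2 * (1 - m / M) * s2_sq / m
var2 *= N**2 * f1 / n**2

se_total = math.sqrt(var1 + var2)
print(round(total), round(se_total))   # about 14,070 and 1484
```

The square root of var1 alone is about 1445 kg, which is why the SAS output below (first-stage variance only) reports a slightly smaller standard error than the full 1484 kg.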
Excel Spreadsheet The above computations can also be done in Excel as shown in the wildoyster worksheet in the ALLofData.xls workbook from the Sample Program Library. As in the case of a pure cluster sample, the PivotTable feature can be used to compute summary statistics needed to estimate the various components.
SAS Program SAS can also be used to analyze the data as shown in the program wildoyster.sas and output wildoyster.lst in the Sample Program Library.
The data are read in the usual way:
data oyster;
   infile datalines firstobs=3;
   input loc $ transect width quad seed xsmall small med large total weight;
   sampweight = 105/10 * width/3;   /* sampling weight = product of inverse sampling fractions */
The sampling weight is computed as the product of the inverse sampling fractions at the first and second stages. Proc SurveyMeans is used directly with the two-stage design. The cluster statement identifies the first stage of the sampling.
/* estimate the total biomass on the oyster bed */
/* Note that SurveyMeans only uses a first-stage variance in its
   computation of the standard error. As the first-stage sampling
   fraction is usually quite small, this will tend to give only a
   slight underestimate of the true standard error of the estimate */
proc surveymeans data=oyster
      total=105   /* length of the first reference line */
      sum;        /* interested in the total biomass estimate */
   cluster transect;   /* identify the perpendicular transects */
   var weight;
   weight sampweight;
run;
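The sampling weights can be sanity-checked by hand: each quadrat's weight is (105/10) × (width/3), the inverse transect fraction times the inverse quadrat fraction on that transect. A quick Python check (widths taken from the data table):

```python
# Width (m) of the bed at each of the 10 sampled transects (from the table);
# each of the 3 sampled 1 m^2 quadrats on a transect gets the same weight.
widths = [17, 18, 14, 11, 9, 21, 18, 19, 17, 20]

weights = [105 / 10 * w / 3 for w in widths for _ in range(3)]
print(len(weights), sum(weights))   # 30 1722.0
```

The count (30 quadrats) and the sum of weights (1722) match the Proc SurveyMeans data summary shown below.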
Note that Proc SurveyMeans computes the se using only the first-stage standard errors. As the first-stage sampling fraction is usually quite small, this will tend to give only a slight underestimate of the true standard error of the estimate. The final results are:

Data Summary

   Number of Clusters          10
   Number of Observations      30
   Sum of Weights            1722

Statistics

   Variable   Sum     Std Dev
   weight     14070   1444.919931

3.11.5 Some closing comments on multi-stage designs
The above example barely scratches the surface of multi-stage designs. Multi-stage designs can be quite complex, and the formulae for the estimates and estimated standard errors fearsome. If you have to analyze such a design, it is likely better to invest some time in learning one of the statistical packages designed for surveys (e.g. SAS v.8) rather than trying to program the tedious formulae by hand.

There are also several important design decisions for multi-stage designs.

• Two-stage designs have reduced costs of data collection because units within an FSU are easier to collect, but they also have poorer precision compared to a simple random sample with the same number of final sampling units. However, because of the reduced cost, it often turns out that more units can be sampled under a multi-stage design, leading to improved precision for the same cost as a simple random sample design. There is a tradeoff between sampling more first-stage units and taking a smaller sub-sample at the secondary stage. An optimal allocation strategy can be constructed to decide upon the best strategy - consult some of the reference books on sampling for details.

• As with ALL sampling designs, stratification can be used to improve precision. The stratification usually takes place at the first-stage unit level, but can take place at all stages. The details of estimation under stratification can be found in many sampling texts.

• Similarly, ratio or regression estimators can also be used if auxiliary information is available that is correlated with the response variable. This leads to very complex formulae!

One very nice feature of multi-stage designs is that if the first stage is sampled with replacement, then the formulae for the estimated standard errors simplify considerably to a single term, regardless of the design used in the lower stages!
If there are many first stage units in the population and if the sampling fraction is small, the chances of selecting the same first stage unit twice are very small. Even if this occurs, a different set of second stage units will likely be selected so there is little danger of having to measure the same final sampling unit more than once. In such situations, the design at second and lower stages is very flexible as all that you need to ensure is that an unbiased estimate of the first-stage unit total is available.
3.12 Analytical surveys - almost experimental design

In descriptive surveys, the objective was simply to obtain information about one large group. In observational studies, two deliberately selected sub-populations are selected and surveyed, but no attempt is made
to generalize the results to the whole population. In analytical studies, sub-populations are selected and sampled in order to generalize the observed differences among the sub-populations to this and other similar populations. As such, there are similarities between analytical and observational surveys and experimental design. The primary difference is that in experimental studies the manager controls the assignment of the explanatory variables while measuring the response variables, whereas in analytical and observational surveys neither set of variables is under the control of the manager. [Refer back to Examples B, C, and D in the earlier chapters.] The analysis of complex surveys for analytical purposes can be very difficult (Kish 1987; Kish 1984; Rao 1973; Sedransk 1965a, 1965b, 1966).

As in experimental studies, the first step in analytical surveys is to identify potential explanatory variables (similar to factors in experimental studies). At this point, analytical surveys can usually be further subdivided into three categories depending on the type of stratification:

• the population is pre-stratified by the explanatory variables and surveys are conducted in each stratum to measure the outcome variables;
• the population is surveyed in its entirety, and post-stratified by the explanatory variables;
• the explanatory variables are used as auxiliary variables in ratio or regression methods.

[It is possible that all three types of stratification take place - these are very complex surveys.]

The choice among the categories is usually made by the ease with which the population can be pre-stratified and the strength of the relationship between the response and explanatory variables. For example, sample plots can easily be pre-stratified by elevation or by exposure to the sun, but it would be difficult to pre-stratify by soil pH.
Pre-stratification has the advantage that the manager has control over the number of sample points collected in each stratum, whereas in post-stratification the numbers are not controllable and may lead to very small sample sizes in certain strata simply because they form only a small fraction of the population.

For example, a manager may wish to investigate the difference in regeneration (as measured by the density of new growth) as a function of elevation. Several cut blocks will be surveyed. In each cut block, the sample plots will be pre-stratified into three elevation classes, and a simple random sample will be taken in each elevation class. The allocation of effort in each stratum (i.e. the number of sample plots) will be equal. The density of new growth will be measured on each selected sample plot.

On the other hand, suppose that regeneration is a function of soil pH. This cannot be determined in advance, and so the manager must take a simple random sample over the entire stand, measure the density of new growth and the soil pH at each sampling unit, and then post-stratify the data based on the measured pH. The number of sampling units in each pH class is not controllable; indeed, it may turn out that certain pH classes have no observations.

If explanatory variables are treated as auxiliary variables, then there must be a strong relationship between the response and explanatory variables. Additionally, we must be able to measure the auxiliary variable precisely for each unit. Then methods like multiple regression can also be used to investigate
© 2012 Carl James Schwarz. December 21, 2012.
the relationship between the response and the explanatory variable. For example, rather than classifying elevation into three broad elevation classes or soil pH into broad pH classes, the actual elevation or soil pH must be measured precisely to serve as an auxiliary variable in a regression of regeneration density vs. elevation or soil pH.

If the units have been selected using a simple random sample, then the analysis of analytical surveys proceeds along lines similar to the analysis of designed experiments (Kish, 1987; also refer to Chapter 2). In most analyses of analytical surveys, the observed results are postulated to have been taken from a hypothetical super-population of which the current conditions are just one realization. In the above example, cut blocks would be treated as a random blocking factor; elevation class as an explanatory factor; and sample plots as samples within each block and elevation class. Hypothesis testing about the effect of elevation on mean density of regeneration occurs as if this were a planned experiment.

Pitfall: Any one of the sampling methods described in Section 2 for descriptive surveys can be used for analytical surveys. Many managers incorrectly use the results from a complex survey as if the data were collected using a simple random sample. As Kish (1987) and others have shown, this can lead to substantial underestimates of the true standard error, i.e., the precision is thought to be far better than is justified by the survey results. Consequently the manager may erroneously detect differences more often than expected (i.e., make a Type I error) and make decisions based on erroneous conclusions.

Solution: As in experimental design, it is important to match the analysis of the data with the survey design used to collect it. The major difficulties in the analysis of analytical surveys are:

1. Recognizing and incorporating into the analysis the sampling method used to collect the data.
The survey design used to obtain the sampling units must be taken into account in much the same way as the analysis of experimental data is governed by the actual experimental design used. A table of ‘equivalences’ between terms in a sample survey and terms in experimental design is provided in Table 1.
Table 1: Equivalences between terms used in surveys and terms used in experimental design.

• Simple random sample: completely randomized design.

• Cluster sampling: (a) clusters are random effects, with units within a cluster treated as sub-samples; or (b) clusters are treated as main plots, with units within a cluster treated as sub-plots in a split-plot analysis.

• Multi-stage sampling: (a) nested designs, with units at each stage nested in units at higher stages and the effects of units at each stage treated as random effects; or (b) split-plot designs, with factors operating at higher stages treated as main-plot factors and factors operating at lower stages treated as sub-plot factors.

• Stratification: fixed factor or random block, depending on the reasons for stratification.

• Sampling unit: experimental unit or treatment unit.

• Sub-sample: sub-sample.
There is no quick, easy method for the analysis of complex surveys (Kish, 1987). The super-population approach seems to work well if the selection probabilities of each unit are known (these are used to weight each observation appropriately) and if random effects corresponding to the various strata or stages are employed. The major difficulty caused by complex survey designs is that the observations are not independent of each other.

2. Unbalanced designs (e.g. unequal numbers of sample points in each combination of explanatory factors). This typically occurs if post-stratification is used to classify units by the explanatory variables, but can also occur in pre-stratification if the manager decides not to allocate equal effort to each stratum. The analysis of unbalanced data is described by Milliken and Johnson (1984).

3. Missing cells, i.e., certain combinations of explanatory variables may not occur in the survey. The analysis of such surveys is complex; refer to Milliken and Johnson (1984).

4. If the range of the explanatory variable is naturally limited in the population, then extrapolation outside of the observed range is not recommended.

More sophisticated techniques can also be used in analytical surveys. For example, correspondence analysis, ordination methods, factor analysis, multidimensional scaling, and cluster analysis all search for post-hoc associations among measured variables that may give rise to hypotheses for further investigation. Unfortunately, most of these methods assume that units have been selected independently of each other using a simple random sample; extensions to units selected via a complex sampling design have not yet been developed. Simpler designs are often highly preferred, to avoid erroneous conclusions based on inappropriate analysis of data from complex designs.
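The pitfall noted above (analyzing a cluster sample as if it were a simple random sample) can be illustrated with a small simulation. All numbers here are invented; the point is only that when units within a cluster are similar, the naive s/√n formula understates the true standard error of the sample mean.

```python
# Simulation sketch: the SRS standard-error formula applied to clustered data
# is too small, so differences are "detected" more often than they should be.
import random
import statistics

random.seed(1)

# Hypothetical population: 200 clusters of 5 units; units in a cluster share
# a cluster effect, so they are positively correlated within clusters.
pop = []
for _ in range(200):
    cluster_effect = random.gauss(0, 3)
    pop.append([cluster_effect + random.gauss(0, 1) for _ in range(5)])

means, naive_ses = [], []
for _ in range(1000):
    clusters = random.sample(pop, 20)           # sample 20 clusters (100 units)
    units = [y for cl in clusters for y in cl]
    means.append(statistics.mean(units))
    # Naive SRS formula s/sqrt(n), which wrongly ignores the clustering:
    naive_ses.append(statistics.stdev(units) / len(units) ** 0.5)

true_se = statistics.stdev(means)               # empirical SE of the estimator
print(statistics.mean(naive_ses) < true_se)     # True: the naive SE is too small
```

With a strong cluster effect, the naive standard error here is roughly half the empirical one, which is exactly the kind of over-optimism about precision described in the pitfall.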
Pitfall: While the analyses of analytical surveys and designed experiments are similar, the strength of the conclusions is not. In general, causation cannot be inferred without manipulation. An observed relationship in an analytical survey may be the result of a common response to a third, unobserved variable.

For example, consider the two following experiments. In the first experiment, the explanatory variable is elevation (high or low). Ten stands are randomly selected at each elevation. The amount of growth is measured, and it appears that stands at higher elevations have less growth. In the second experiment, the explanatory variable is the amount of fertilizer applied. Ten stands are randomly assigned to each of two doses of fertilizer. The amount of growth is measured, and it appears that stands that receive the higher dose of fertilizer have greater growth.

In the first experiment, the manager is unable to say whether the differences in growth are a result of differences in elevation, amount of sun exposure, or soil quality, as all three may be highly related. In the second experiment, all uncontrolled factors are present in both groups and their effects will, on average, be equal. Consequently, the assignment of cause to the fertilizer dose is justified because it is the only factor that differs (on average) between the groups.

As noted by Eberhardt and Thomas (1991), there is a need for a rigorous application of survey-sampling techniques when conducting analytical surveys; otherwise, they are likely to be subject to biases of one sort or another. Experience and judgment are very important in evaluating the prospects for bias and in finding ways to control and account for these biases. The most common source of bias is the selection of survey units, and the most common pitfall is to select units based on convenience rather than on a probabilistic sampling design.
The potential problems that this can lead to are analogous to those that arise when callers to a radio phone-in show are assumed to be representative of the entire population.
3.13 References
• Cochran, W.G. (1977). Sampling Techniques. New York: Wiley. One of the standard references for survey sampling. Very technical.

• Gillespie, G.E. and Kronlund, A.R. (1999). A manual for intertidal clam surveys. Canadian Technical Report of Fisheries and Aquatic Sciences 2270. A very nice summary of using sampling methods to estimate clam numbers.

• Keith, L.H. (1988), Editor. Principles of Environmental Sampling. New York: American Chemical Society. A series of papers on sampling, mainly for environmental contaminants in ground and surface water, soils, and air. A detailed discussion on sampling for pattern.

• Kish, L. (1965). Survey Sampling. New York: Wiley. An extensive discussion of descriptive surveys, mostly from a social science perspective.

• Kish, L. (1984). On analytical statistics from complex samples. Survey Methodology, 10, 1-7. An overview of the problems in using complex surveys in analytical surveys.

• Kish, L. (1987). Statistical Design for Research. New York: Wiley. One of the more extensive discussions of the use of complex surveys in analytical surveys. Very technical.
• Krebs, C. (1989). Ecological Methodology. A collection of methods commonly used in ecology, including a section on sampling.

• Kronlund, A.R., Gillespie, G.E., and Heritage, G.D. (1999). Survey methodology for intertidal bivalves. Canadian Technical Report of Fisheries and Aquatic Sciences 2214. An overview of how to use surveys for assessing intertidal bivalves - more technical than Gillespie and Kronlund (1999).

• Myers, W.L. and Shelton, R.L. (1980). Survey Methods for Ecosystem Management. New York: Wiley. A good primer on how to measure common ecological data using direct survey methods, aerial photography, etc. Includes a discussion of common survey designs for vegetation, hydrology, soils, geology, and human influences.

• Sedransk, J. (1965b). Analytical surveys with cluster sampling. Journal of the Royal Statistical Society, Series B, 27, 264-278.

• Thompson, S.K. (1992). Sampling. New York: Wiley. A good companion to Cochran (1977). Has many examples of using sampling for biological populations. Also has chapters on mark-recapture, line-transect methods, spatial methods, and adaptive sampling.
3.14 Frequently Asked Questions (FAQ)
3.14.1 Confusion about the definition of a population
What is the difference between the "population total" and the "population size"?
Population size normally refers to the number of “final sampling” units in the population. Population total refers to the total of some variable over these units. For example, if you wish to estimate the total family income of families in Vancouver, the “final” sampling units are families, the population size is the number of families in Vancouver, the response variable is the income of each family, and the population total is the total family income over all families in Vancouver.

Things become a bit confusing when the sampling units differ from the “final” units, i.e., when the final units are clustered and you are interested in estimates of the number of “final” units. For example, in the grouse/pockets-of-brush example, the population consists of the grouse, which are clustered into 248 pockets of brush. The grouse is the final sampling unit, but the sampling unit is a pocket of brush. In cluster sampling, you must expand the estimator by the number of CLUSTERS, not by the number of final units. Hence the expansion factor is the number of pockets (248), the variable of interest for a cluster is the number of grouse in that pocket, and the population total is the number of grouse over all pockets.
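The grouse calculation just described can be sketched in a few lines. The per-pocket counts below are invented; only the frame size of 248 pockets comes from the example.

```python
# Cluster-expansion estimate of a population total: expand the mean count
# per CLUSTER by the number of clusters in the frame, not by the (unknown)
# number of final units. Counts are illustrative.
import statistics

N_pockets = 248                    # number of clusters (pockets) in the frame
counts = [0, 3, 1, 0, 5, 2, 0, 1]  # grouse counted in 8 sampled pockets

total_hat = N_pockets * statistics.mean(counts)
print(total_hat)  # 248 * 1.5 = 372.0
```

The same pattern applies to the oyster example: the expansion factor would be the number of quadrats on the lease, and the per-cluster variable the count of oysters in a quadrat.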
Similarly for the oysters on the lease: the population is the oysters on the lease, but you don’t randomly sample individual oysters – you randomly sample quadrats, which are clusters of oysters. The expansion factor is now the number of quadrats.

In the salmon example, the boats are surveyed. The fact that the number of salmon was measured is incidental - you could have measured the amount of food consumed, etc. In the angling survey problem, the boats are the sampling units. The anglers they contain, or the fish they caught, is what is measured, but it is the set of boats that were at the lake that day that is of interest.
3.14.2 How is N defined?
How is N (the expansion factor) defined? What is the best way to find this value?

This can get confusing in the case of cluster or multi-stage designs, as there are different N’s at each stage of the design. It might be easier to think of N as an expansion factor. The expansion factor will be known once the frame is constructed. In some cases, this can only be done after the fact - for example, when surveying angling parties, the total number of parties returning in a day is unknown until the end of the day. For planning purposes, some reasonable guess may have to be made in order to estimate the sample size. If this is impossible, just choose some arbitrarily large number - the estimated sample size will be a slight overestimate, but close enough. Of course, once the survey is finished, you would use the actual value of N in all computations.
3.14.3 Multi-stage vs. multi-phase sampling
What is the difference between Multi-stage sampling and multi-phase sampling?
In multi-stage sampling, the selection of the final sampling units takes place in stages. For example, suppose you are interested in sampling angling parties as they return from fishing. The region is first divided into different landing sites. A random selection of landing sites is made, and at each selected landing site, a random selection of angling parties is taken.

In multi-phase sampling, the units are NOT divided into larger groups. Rather, a first phase selects some units, and they are measured quickly. A second phase takes a sub-sample of the first phase and measures it more intensively. Returning to the angling survey: a multi-phase design would select angling parties, all of which fill out a brief questionnaire. A week later, a sample of the questionnaires is selected, and those angling parties are RECONTACTED for more details.

The key difference is that in multi-phase sampling, some units are measured TWICE, whereas in multi-stage sampling there are different sizes of sampling units (landing sites vs. angling parties) but each sampling unit is selected only once.
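The two-stage selection for the angling survey can be sketched as below. The site names and counts are invented purely for illustration; the point is the nested selection (sites first, then parties within selected sites).

```python
# Two-stage selection sketch: stage 1 samples landing sites, stage 2 samples
# returning angling parties within each selected site. Frame is hypothetical.
import random

random.seed(42)
# Frame: 10 landing sites, each with 20 returning angling parties.
sites = {f"site_{i}": [f"party_{i}_{j}" for j in range(20)] for i in range(10)}

stage1 = random.sample(sorted(sites), 3)                  # 3 of 10 sites
stage2 = {s: random.sample(sites[s], 5) for s in stage1}  # 5 parties per site

for s in stage1:
    print(s, stage2[s])
```

A multi-phase design, by contrast, would not nest the units: it would select parties directly, then re-contact a sub-sample of those same parties for the second phase.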
3.14.4 What is the difference between a population and a frame?
Frame = the list of sampling units from which a sample will be taken. The sampling units may not be the same as the “final” units that are measured. For example, in cluster sampling, the frame is the list of clusters, but the final units are the objects within the clusters.

Population = the list of all “final” units of interest. Usually the “final units” are the actual things measured in the field, i.e. the final objects upon which a measurement is taken.

In some cases, the frame doesn’t match the population, which may cause biases; in ideal cases, the frame covers the population.
3.14.5 How to account for missing transects
What do you do if an entire cluster is “missing”?
Missing data can occur at various points in a survey and for various reasons. The easiest data to handle are data ‘missing completely at random’ (MCAR). In this situation, the missing data provide no information about the problem that is not already captured by other data points, and the ‘missingness’ is also non-informative. In this case, and if the design was a simple random sample, the missing data points are simply ignored. So if you wanted to sample 80 transects but were only able to get 75, only the 75 transects are used. If some of the data are missing within a transect, the problem changes from a cluster sample to a two-stage sample, so the estimation formulae change slightly. If data are not MCAR, you have a real problem - welcome to a Ph.D. in statistics on how to deal with it!
Chapter 4
Designed Experiments - Terminology and Introduction

Contents

4.1 Terminology and Introduction
  4.1.1 Definitions
  4.1.2 Treatment, Experimental Unit, and Randomization Structure
  4.1.3 The Three R’s of Experimental Design
  4.1.4 Placebo Effects
  4.1.5 Single and double blinding
  4.1.6 Hawthorne Effect
4.2 Applying some General Principles of Experimental Design
  4.2.1 Experiment 1
  4.2.2 Experiment 2
  4.2.3 Experiment 3
  4.2.4 Experiment 4
  4.2.5 Experiment 5
4.3 Some Case Studies
  4.3.1 The Salk Vaccine Experiment
  4.3.2 Testing Vitamin C - Mistakes do happen
4.4 Key Points in Design of Experiments
  4.4.1 Designing an Experiment
  4.4.2 Analyzing the data
  4.4.3 Writing the Report
4.5 A Road Map to What is Ahead
  4.5.1 Introduction
  4.5.2 Experimental Protocols
  4.5.3 Some Common Designs
4.1 Terminology and Introduction
This chapter contains definitions and a general introduction that forms a foundation for the rest of the course work. These concepts may seem abstract at first, but they will become more meaningful as they are applied in later chapters.
4.1.1 Definitions
Experimental design and analysis has a standardized terminology that is, unfortunately, different from that used in survey sampling (refer to the section on Analytical Surveys in the chapter on Sampling for an equivalence table).

• factor - one of the variables under the control of the experimenter that is varied over different experimental units. If a variable is kept constant over all experimental units, then it is not a factor, because we cannot discern its influence on the response (e.g. if the water temperature in an experiment is held constant and equal for all tanks, it is impossible to determine a temperature effect). Factors are sometimes called explanatory variables. A factor has 2 or more levels.

• levels - the values of the factor used in the experiment. For example, in an experiment to assess the effects of different amounts of UV radiation upon the growth rate of smolt, the UV radiation was held at normal, 1/2 normal, and 1/5 normal levels. These would be the three levels of this factor.

• treatment - the combination of factor levels applied to an experimental unit. If an experiment has a single factor, then each treatment corresponds to one of its levels. If an experiment has two or more factors, then the combination of levels from each factor applied to an experimental unit is the treatment. For example, with two factors having 2 and 3 levels respectively, there are 6 possible treatment combinations.

• response variable - the outcome being measured. For example, in an experiment to measure smolt growth in response to UV levels, the response variable for each smolt could be its final weight after 30 days.

• experimental unit - the unit to which the treatment is applied. For example, several smolt could be placed into a tank, and the tank exposed to a particular amount of UV radiation. The tank is the experimental unit.

• observational unit - the unit on which the response is measured.
In some cases, the observational unit may be different from the experimental unit - be careful!

CAUTION: A common mistake in the analysis of experimental data is to confuse the experimental and observational units. This leads to pseudo-replication, as discussed in the very nice paper by Hurlbert (1984). For example, consider an experiment to investigate the effects of UV levels on the growth of smolt. Two tanks are prepared; one tank has high levels of UV light, the second tank has no UV light. Many fish are placed in each tank. At the end of the experiment, the individual fish are measured. In this
experiment, the observational unit is the smolt, but the experimental unit is the tank. The treatments are NOT individually administered to single fish - a whole group of fish is simultaneously exposed to the UV radiation. Here any tank effect, e.g. closeness of the tank to a window, is completely confounded with the experimental treatment and cannot be separated. You CANNOT analyze these data at the individual-fish level.

Identify the factor, its levels, the treatments, the response variable, the experimental unit, and the observational unit in the following situations:

• An agricultural experimental station is going to test two varieties of wheat. Each variety will be planted on 3 fields, and the yield from each field will be measured.

• An agricultural experimental station is going to test two varieties of wheat. Each variety will be tested with two types of fertilizer. Each combination will be applied to two plots of land. The yield will be measured for each plot.

• Fish farmers want to study the effect of an anti-bacterial drug on the amount of bacteria in fish gills. The drug is administered at three dose levels (none, 20, and 40 mg/100 L). Each dose is administered to a large controlled tank through the filtration system. Each tank has 100 fish. At the end of the experiment, the fish are killed, and the amount of bacteria in the gills of each fish is measured.
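As a quick check of the treatment-combination arithmetic for the second exercise above (two varieties crossed with two fertilizer types, each combination on two plots), the treatments can be enumerated mechanically. The labels below are invented placeholders.

```python
# Enumerating a factorial treatment structure: a treatment is one level from
# each factor, so 2 varieties x 2 fertilizers = 4 treatment combinations.
from itertools import product

varieties = ["variety_1", "variety_2"]
fertilizers = ["fert_1", "fert_2"]

treatments = list(product(varieties, fertilizers))
print(len(treatments))      # 4 treatment combinations
print(len(treatments) * 2)  # 8 experimental units (2 plots per combination)
```

The same enumeration with factors of 2 and 3 levels yields the 6 combinations mentioned in the definition of "treatment" above.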
4.1.2 Treatment, Experimental Unit, and Randomization Structure
Every experiment can be decomposed into three components:

• Treatment Structure. This describes the relationships among the factors. In this course, you will only see factorial experiments, where every treatment combination appears in the experiment.

• Experimental Unit Structure. This describes how the experimental units are arranged among themselves. In this course, you will see three types of experimental unit structures - independent, blocked, or split-plot.

• Randomization Structure. This describes how treatments are assigned to experimental units. In this course, you will see completely randomized designs and blocked designs.

By looking at these three structures, it is possible to correctly and reliably analyze any experimental design without resorting to cookbook methods. This philosophy is known as No-Name Experimental Design and is exemplified by the book of the same name by Lorenzen and Anderson (1993).

The raw data cannot, by themselves, provide sufficient information to decide on the experimental design used to collect the data. For example, consider an experiment to investigate the influence of lighting level (High or Low) and moisture level (Wet or Dry) upon the growth of plants grown in pots.¹ Four possible experimental designs are shown below:

¹ This is a popular pastime in British Columbia!
In each case, plants are potted, the treatment applied, and the resulting growth of the plant’s leaves (say, final total biomass) is measured.

Treatment Structure. In all four designs, the treatment structure is the same. There are 2 factors (lighting level and moisture level), each with 2 levels, giving a total of 4 treatments. All treatments appear
in all of the experiments (giving what is known as a factorial treatment structure).

Experimental Unit Structure. In Design A, the experimental and observational units are the pots. The treatments (HD, HW, LD, LW) are assigned to individual pots, and a single measurement is obtained on the single plant in each pot. In Design B, the pots are first grouped into two houses. As before, the experimental and observational units are the pots. In Design C, the experimental unit is the pot (which contains two plants) but the observational unit is the plant. In Design D, four growth chambers are obtained. The lighting level (H or L) is assigned to a growth chamber. Two pots are placed in each growth chamber. The moisture level is assigned to the individual pot. The observational unit is the single plant in each pot.

Randomization Structure. In Design A, there is complete randomization of treatments to experimental units. In Design B, there is a restricted randomization. Each house receives all four treatments (exactly once per house), and the treatments are randomized to pots independently in each house. In Design C, there is complete randomization of treatments to the pots; both plants in a pot receive the same treatment. Finally, in Design D, the lighting levels are randomized to the growth chambers, while the moisture levels are randomized to the individual pots.

All four designs give rise to 8 observations (the growth of each plant). However, the four designs MUST BE ANALYZED DIFFERENTLY(!) because the experimental design is different.

Design A is known as a completely randomized design (CRD). This is the default design assumed by most computer packages.

Design B is an example of a Randomized Complete Block (RCB) design. Each house is a block (stratum) and each block has every treatment (is complete). It would be incorrect to analyze Design B using the default settings of most computer packages.
Design C is an example of a pseudo-replicated design (Hurlbert, 1984). While there are two plants in each pot, each plant did not receive its own treatment; the plants received treatments in pairs. There is no valid analysis of this experiment because there are no replicated treatments, i.e. there is only one pot for each combination of lighting level and moisture level. It would be incorrect to analyze Design C using the default settings of most computer packages.

Finally, Design D is a combination of two different designs operating at two different sizes of experimental units. The assignment of lighting levels to growth chambers looks like a completely randomized design at the growth-chamber level. The assignment of moisture levels to the two pots within a growth chamber looks like a blocked design, with growth chambers serving as blocks and the individual pots serving as the experimental units for moisture level. This is an illustration of what is known as a split-plot design (from its agricultural heritage), with main plots (growth chambers) being assigned to lighting levels and split-plots (the pots) being assigned to moisture levels. You could, of course, reverse the roles of lighting level and moisture level to
get yet another split-plot design. It would be incorrect to analyze Design D using the default settings of most computer packages.

The moral of the story is that you must carefully consider how the data were collected before using a computer package. This is the most common error in the analysis of experimental data - the unthinking use of a computer package where the actual design of the experiment does not match the analysis being done. It is ALWAYS helpful to draw a picture of the actual experimental design to try and decide upon the treatment, experimental unit, and randomization structures. Be sure that the brain is engaged before putting a package into gear!
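To see numerically why Design A and Design B demand different analyses, consider a stripped-down sketch with invented numbers and only two treatments. This is not a full ANOVA; it only shows that, with blocking, the house effect cancels out of within-block comparisons, whereas an unblocked analysis leaves the house-to-house spread in the error term.

```python
# Blocked vs. unblocked error terms: a minimal sketch with invented data.
import statistics

# growth[house] = (response under treatment 1, response under treatment 2);
# houses differ a lot, but the treatment difference within a house is steady.
growth = {"house_1": (10.1, 12.3), "house_2": (15.0, 17.4),
          "house_3": (7.9, 10.0), "house_4": (12.2, 14.1)}

# RCB-style view: within-house differences; the house effect cancels.
diffs = [t2 - t1 for t1, t2 in growth.values()]
print(round(statistics.stdev(diffs), 2))    # 0.21 -- small error term

# CRD-style view: raw responses for one treatment pooled across houses;
# house-to-house variation now sits in the error term.
t1_vals = [t1 for t1, _ in growth.values()]
print(round(statistics.stdev(t1_vals), 2))  # 3.03 -- much larger spread
```

An analysis that ignored the houses would judge the treatment effect against the much larger spread, losing most of the power that the blocked design was built to provide.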
4.1.3 The Three R’s of Experimental Design
It is important to:

• randomize, because it averages out the effects of all other lurking variables. Note that randomization doesn’t remove their effects, but makes their effects, on average, equal in all groups.

• replicate, because (1) some estimate of natural variation is required in order to know whether any observed difference can be attributed to the factor or just to random chance; and (2) the experiment should have sufficient power to detect an effect of a biologically meaningful size.

• stratify (block), to account for and remove the effects of a known extraneous variable.

What do the above mean in practice? Consider the very simple agricultural problem of comparing two varieties of tomatoes. The purpose of the comparison is to find the variety that produces the greater quantity of marketable-quality fruit from a given area, for large-scale commercial planting. What should we do?

A simple approach would be to plant a tract of land in each variety and measure the total weight of marketable fruit produced. However, there are some obvious difficulties. The variety that cropped most heavily may have done so simply because it was growing in better soil. A number of factors affect growth: soil fertility, soil acidity, irrigation and drainage, wind exposure, exposure to sunlight (e.g. shading, north-facing or south-facing hillside). Unfortunately, no one knows exactly to what extent changes in these factors affect growth. So unless the two tracts of land are comparable with respect to all of these features, we won’t be able to conclude that the more heavily producing variety is better, as it may just have been planted in a tract that is better suited to growth.
If it were possible (and it never will be) to find two tracts of land that were identical in these respects, using just those two tracts would give a fair comparison, but the differences found might be so special to that particular combination of growing conditions that the results would not be a good guide to full-scale agricultural production anyway.
Why randomize? Let us think about it another way. Suppose we took a large tract of land and subdivided it into smaller plots by laying down a rectangular grid. If we used some sort of systematic design to decide which variety to plant in each plot, we may come unstuck if there is a feature of the land like an unknown fertility gradient: we may still end up giving one variety the better plots on average. Instead, let’s do it randomly, by numbering the plots and randomly choosing half of them to receive the first variety; the rest receive the second variety. We might expect the random assignment to ensure that both varieties were planted in roughly the same numbers of high-fertility and low-fertility plots, high-pH and low-pH plots, well-drained and poorly drained plots, etc. In that sense we might expect the comparison of yields to be fair. Moreover, although we have thought of some factors affecting growth, there will be many more that we, and even the specialist, will not have thought of. And we can expect the random assignment of treatments to ensure some rough balancing of those as well!

Why replicate? Random sampling gives representative samples, on average. However, in small samples it may occur, just by chance, that your sample is a ‘bit weird’. Unfortunately, we can only expect the random allocation of treatments to lead to balanced samples (e.g. a fair division of the more and less fertile plots) if we have a large number of experimental units to randomize. In many experiments this is not true (e.g. using 6 plots to compare 2 varieties), so that in any particular experiment there may well be a lack of balance on some important factor. Random assignment still leads to fair or unbiased comparisons, but only in the sense of being fair or unbiased when averaged over a whole sequence of experiments. This is one of the reasons why there is such an emphasis in science on results being repeatable.
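The plot randomization just described (number the plots, randomly choose half for the first variety) can be sketched in a few lines; the grid size of 20 plots is arbitrary.

```python
# Complete randomization sketch: assign half of the numbered plots at random
# to variety 1; the remaining plots receive variety 2.
import random

random.seed(7)
plots = list(range(1, 21))                       # 20 numbered plots on the grid
variety_1 = sorted(random.sample(plots, 10))     # half chosen at random
variety_2 = sorted(set(plots) - set(variety_1))  # the rest get variety 2
print(variety_1)
print(variety_2)
```

Each run of the randomization (with a different seed) produces a different layout, which is exactly what protects against an unnoticed fertility gradient lining up with one variety.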
A more important reason for adequate replication is that large experiments have a greater chance of detecting important differences. This seems self-evident, but in practice can be hard to achieve. One common error is to confuse pseudo-replication with real replication - refer to your assignments for examples of this. Why replicate at all - why not just measure one unit under each treatment? Without replication, it is impossible to determine whether a difference is caused by random chance or by the treatments.

Why block?

Partly because random assignment of treatments does not necessarily ensure a fair comparison when the number of experimental units is small, more complicated experimental designs are available to ensure fairness with respect to those factors which we believe to be very important. Suppose, with our tomato example, that because of the small variation in the fertility of the land we were using, the only thing we thought mattered greatly was drainage. We could then try to divide the land into two blocks, one well drained and one badly drained. These would then be subdivided into smaller plots, say 10 plots per block. Then in each block, 5 plots are assigned at random to the first variety and the remaining 5 plots to the second variety. We would then only compare the two varieties within each block, so that well drained plots are only compared with well drained plots, and similarly for badly drained plots. This idea is called blocking. By allocating varieties to plots within a block at random, we would provide some protection against other extraneous factors. Another application of this idea is in the comparison of the effects of two pain relief drugs on people. If the situation allowed, we would like to try both drugs on each person and look at the differential effects within people.
To protect ourselves against biases caused by a drift in day to day pain levels or other physical factors within a person that might affect the response, the order in which each person received the drugs would be randomly chosen. Here the block is the person and we have two treatments (drugs) per person
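The order randomization within person-blocks can be sketched as follows. This is a hypothetical illustration: the subject ids and seed are invented, and the balance (equal numbers receiving each drug first) is one common way of doing it.

```python
import random

def randomize_order(subjects, seed=None):
    """Each subject (a block) gets both drugs; half, chosen at random,
    receive Drug A first so the design is balanced with regard to order."""
    rng = random.Random(seed)
    ids = list(subjects)
    rng.shuffle(ids)
    half = len(ids) // 2
    order = {}
    for s in ids[:half]:
        order[s] = ("Drug A", "Drug B")
    for s in ids[half:]:
        order[s] = ("Drug B", "Drug A")
    return order

orders = randomize_order(range(1, 11), seed=7)
for subject in sorted(orders):
    print(subject, orders[subject])
```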
© 2012 Carl James Schwarz
applied in a random order. The randomization could be done in such a way that the same number of subjects received each drug first, so that the design is “balanced” with regard to order. In the words of Box, Hunter and Hunter [1978, page 103], the best general strategy for experimentation is to “block what you can and randomize what you cannot.”
4.1.4
Placebo Effects
There are psychological effects which complicate the effects of treatments on people. In medicine, there is a tendency for people to get better because they think that something is being done about their complaint. So when you give someone a pain relief drug, there are two effects at work: a chemical effect of the drug and a psychological boost. To give you some idea of the strength of these psychological effects, approximately 35% of patients have been shown to respond favorably to a placebo (an inert dummy treatment) for a wide range of conditions including alleviation of pain, depression, blood pressure, infantile asthma, angina and gastro-intestinal problems (Grimshaw and Jaffe [1982]). Consequently, any control treatment must look and feel as similar as possible to any of the active treatments being applied.
4.1.5
Single and double blinding
In evaluating the effect of the drug, we want to filter out the real effect of the drug from the effect of the patient’s own psychology. The standard method is to compare what happens to patients who are given the drug (the treatment group) with what happens to patients given an inert dummy treatment called a placebo (the control group). This will only work if the patients do not know whether they are getting the real drug or the placebo. Similarly, when we are comparing two or more treatments, the subjects (patients) should not know which treatment they are receiving, if at all possible. This idea is called single blinding the subjects. The results are then not contaminated by any preconceived ideas about the relative effectiveness of the treatments. For example, consider a comparative study of two asthma treatments, one in liquid form and one in powder form. To blind the subjects, each subject had to receive both a powder and a liquid; one was a real treatment, the other a placebo.

The idea of using a control or control group to evaluate the effect of an experimental intervention is basic to all experimentation. Blinding, to whatever extent is possible, tends to be desirable in any experiment involving human responses (e.g. medical, psychological and educational experiments). It is often advisable to have the experimenter blinded as to the treatment applied as well. An experiment was performed in England to evaluate the effect of providing free milk to school children. There was a random allocation of children to the group who received milk (the treatment group) and the control group which received no milk. However, because the study designers were afraid that random assignment might not necessarily have balanced the groups on general health and social background at the classroom level, teachers were allowed to switch children between treatment and control to equalize the groups. It is easy to imagine the effect this must have had.
Most teachers are caring people who would be unhappy to
watch affluent children drinking free school milk while malnourished children go without. We would expect that the sort of interchanging of students between groups that went on would result in too many malnourished children receiving the milk, thus biasing the study, perhaps severely. To protect against this sort of thing, medical studies are made double blind whenever possible. In addition to the subjects not knowing what treatment they are getting, the people administering the treatments don’t know either! The people evaluating the results, e.g. deciding whether tissue samples are cancerous or not, should also be “blinded”.
4.1.6
Hawthorne Effect
In studies of people, the actual process of measuring or observing people changes their behavior. According to Wikipedia², the “Hawthorne effect” was not named after a researcher but rather refers to the factory where the effect was first thought to be observed and described: the Hawthorne works of the Western Electric Company in Chicago, 1924-1933. The phrase was coined by Landsberger in 1955. One definition of the Hawthorne effect is:

• An experimental effect in the direction expected but not for the reason expected; i.e., a significant positive effect that turns out to have no causal basis in the theoretical motivation for the intervention, but is apparently due to the effect on the participants of knowing themselves to be studied in connection with the outcomes measured.

For example, would you change your TV viewing habits if you knew that your viewing was being monitored?
4.2
Applying some General Principles of Experimental Design
Does taking Vitamin C reduce the incidence of colds? This is a popular theory - how would you design an experiment to test this research question using students as “guinea pigs”? [We will ignore for a moment the whole problem of how representative students are of the population in general. This is a common problem with many drug studies, which are often conducted on a single gender and age group, after which the drug company wishes to generalize the results to the other gender and other age groups.] Assume that you have a group of 50 students available for the experiment.

First: what are the experimental units, the factor, its levels, and the treatments?

² http://en.wikipedia.org/wiki/Hawthorne_effect. Accessed on 2007-10-01.
4.2.1
Experiment 1
Have all 50 students take Vitamin C supplements for 6 months and record the number of colds. After 6 months, the data were collected, and on average, the group of students had 1.4 colds per subject. What is the flaw in this design?

• there is no ‘control group’ against which the results can be compared. Is an average of 1.4 colds in 6 months unusually low?
4.2.2
Experiment 2
The students are divided into two groups. All of the males receive Vitamin C supplements for 6 months while all of the females receive nothing. Each student records the number of colds incurred in the 6 month period. At the end of the 6 months, the males had an average of 1.4 colds per subject; the females had an average of 1.9 colds per subject. What is the flaw in this design?

• there is a ‘control group’, but any effect of the Vitamin C is confounded with gender, i.e., we can’t tell if the difference in the average number of colds is due to the Vitamin C or to the gender of the students.
4.2.3
Experiment 3
The students are randomly assigned to the two groups. This is accomplished by putting 25 slips of paper marked ‘Vitamin C’ and 25 slips of paper marked ‘Control’ into a hat. The slips of paper are mixed. Each student then selects a slip of paper. Those with slips marked ‘Vitamin C’ are given Vitamin C supplements for 6 months, while those with slips marked ‘Control’ are not given anything. Each student records the number of colds incurred in the 6 month period. At the end of the 6 months, the Vitamin C group had an average of 1.4 colds per subject; the control group had an average of 1.9 colds per subject. What is the flaw in this design?

• the study is not ‘double-blinded’. Because the students knew whether or not they were taking Vitamin C, they may have modified their behavior accordingly. For example, those taking Vitamin C may feel that they don’t have to wash their hands as often as those not taking Vitamin C, under the belief that the Vitamin C will ‘protect’ them.
4.2.4
Experiment 4
The researcher makes up 50 identical numbered pill bottles. In half of them, the researcher places Vitamin C tablets. In the other half, the researcher puts identical looking and tasting sugar pills (a placebo). The researcher then asks an associate to randomly assign the jars and to keep track of which jar has which tablet, but the associate is not to tell the researcher. The key is put into a sealed envelope. The numbered jars are then placed in a box and mixed up. Each student then selects a jar and tells the researcher the number on the jar. Each student takes the pills for the next six months. Each student records the number of colds incurred in the 6 month period. After the data are recorded, the associate opens the sealed envelope and each student is then labeled as having taken Vitamin C or a placebo. The Vitamin C group had an average of 1.4 colds per subject; the control group had an average of 1.9 colds per subject. What is the flaw in this design?

• None! This is a properly conducted experiment. It has a control group so that comparisons can be made between the two groups. Students are randomly assigned to each group so that any lurking variable will be roughly equally present in both groups and its effects will be roughly equal. The control group gets an identical tasting placebo so the subjects don’t know which treatment they are receiving. The researcher doesn’t know what the students are taking until after the experiment is complete, so can’t unconsciously influence the students’ behavior. [This is known as double-blinding: neither the subject nor the researcher knows the treatment until the experiment is over.]
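The associate’s labelling step might be sketched as follows; the assignment key plays the role of the sealed envelope, and the researcher never looks at it until the data are in. The bottle count and seed are hypothetical.

```python
import random

def make_blinding_key(n_bottles=50, seed=None):
    """Randomly assign half the numbered bottles to Vitamin C, half to
    placebo. The returned key stays 'sealed' until the trial ends."""
    rng = random.Random(seed)
    bottles = list(range(1, n_bottles + 1))
    rng.shuffle(bottles)
    half = n_bottles // 2
    key = {b: "Vitamin C" for b in bottles[:half]}
    key.update({b: "Placebo" for b in bottles[half:]})
    return key

key = make_blinding_key(seed=42)   # held by the associate only
```

Only after all cold counts are recorded is `key` used to label each student’s bottle as Vitamin C or placebo.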
4.2.5
Experiment 5
The researcher was unable to recruit enough students from a single class, and so has to recruit students from two different classes. Because the contact patterns of students may differ between the classes (e.g. one class may be in the morning, and students have to cram together on the busses to get to class on time), the researcher would like to “control” for any class effect. Consequently, the two classes are treated as separate blocks of students and, within each class, the researcher makes up identical pill bottles. In half of the bottles, the researcher places Vitamin C tablets. In the other half, the researcher puts identical looking and tasting sugar pills (a placebo). The researcher then asks an associate to randomly label the jars and to keep track of which jar has
which tablet, but the associate is not to tell the researcher. The key matching the bottles to the pill type is put into a sealed envelope. The numbered jars are then placed in a box and mixed up. Each student then selects a jar and tells the researcher the number on the jar. Each student takes the pills for the next six months. Each student records the number of colds incurred in the 6 month period. After the data are recorded, the associate opens the sealed envelope and each student is then labeled as having taken Vitamin C or a placebo. The Vitamin C group from Class A had an average of 1.4 colds per subject; the control group in Class A had an average of 1.9 colds per subject. The Vitamin C group from Class B had an average of 1.2 colds per subject; the control group had an average of 1.5 colds per subject. What is the flaw in this design?

• None! This is an even better experiment! There are two mini-experiments being conducted simultaneously in the two ‘blocks’ or ‘strata’ - the two classes. Blocking or stratification is useful when there is a known lurking variable that is strongly suspected to affect the results. It is not necessary to stratify or to block, but statistical theory can show that in most cases you can get a more powerful experiment, i.e. one able to detect smaller differences for the same sample size, by suitably chosen blocking variables.
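The blocked randomization in this design amounts to running the hat draw separately within each class, so each class contributes equal numbers to both groups. A sketch (the class sizes, subject ids, and seed here are hypothetical):

```python
import random

def blocked_assignment(blocks, seed=None):
    """blocks: dict mapping block name -> list of subject ids.
    Randomize to Vitamin C / Placebo separately within each block."""
    rng = random.Random(seed)
    assignment = {}
    for block, subjects in blocks.items():
        ids = list(subjects)
        rng.shuffle(ids)
        half = len(ids) // 2
        for s in ids[:half]:
            assignment[s] = (block, "Vitamin C")
        for s in ids[half:]:
            assignment[s] = (block, "Placebo")
    return assignment

classes = {"Class A": [f"A{i}" for i in range(1, 25)],   # 24 students
           "Class B": [f"B{i}" for i in range(1, 27)]}   # 26 students
assign = blocked_assignment(classes, seed=3)
```

Because the split is done within each block, any class effect is balanced across treatments by construction, not just on average.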
4.3

Some Case Studies

4.3.1

The Salk Vaccine Experiment
Many of the issues above arise in the famous story of the 1954 Salk vaccine trial. Polio plagued many parts of the world in the first half of the 20th century. It struck in epidemics causing deformity and death, particularly to children. In 1954 the US Public Health Service began a field trial of a vaccine developed by Jonas Salk. Two million children were involved; of these, roughly half a million were vaccinated, half a million were left unvaccinated, and the parents of a million refused the vaccination. The trial was conducted on the most vulnerable group (children in grades 1, 2 and 3) in some of the highest risk areas of the country. The National Foundation for Infantile Paralysis (NFIP) put up a design in which all grade 2 children whose parents would
consent were vaccinated, while all grade 1 and 3 children then formed the controls. What flaws are apparent here?

To begin with, polio is highly infectious, being passed on by contact. It could therefore easily sweep through grade 2 in some districts, thus biasing the study against the vaccine, or it could appear in much greater numbers in grades 1 or 3, making the vaccine look better than it is. Also, the treatment group consisted only of children whose parents agreed to the vaccination, whereas no consent was required for the control group. At that time, well-educated, higher-income parents were more likely to agree to vaccination, so that the treatment and control groups were unbalanced on social status. Such an imbalance could be important because higher income children tended to live in more hygienic surroundings and, surprisingly, this increases the incidence of polio. Children from less hygienic surroundings were more likely to contract mild cases while still young enough to be protected by antibodies from their mothers. They then generated their own antibodies, which protected them later on. We might expect this effect to bias the study against the vaccine.

Some school districts saw these problems with the NFIP design and decided to do it a better way. You guessed it: a double-blind, controlled, randomized study. The placebo was an injection of slightly salty water; the treatment and control groups were obtained by randomly assigning to the vaccine or placebo only those children whose parents consented to the vaccination. Whose results would you believe?
4.3.2
Testing Vitamin C - Mistakes do happen
Even the best laid plans can go awry, however. Mosteller et al. [1983, page 242] describe a Toronto study conducted to test the theory of two-time Nobel Prize winner Linus Pauling that large doses of vitamin C tend to prevent the common cold. Patients were randomized with regard to receiving vitamin C or the placebo and blinded as to the treatment they were receiving. However, some participants tried to find out whether they were getting vitamin C by opening their capsules and tasting! Ostensibly, the trial showed significantly fewer colds in the vitamin group than in the control group. However, when those who tasted and correctly guessed what they were getting were eliminated from the analysis, any difference between the groups was small enough to be due to chance variation.
4.4
Key Points in Design of Experiments
This section is adapted from the article: Tingley, M.A. (1996). Some Practical Advice for Experimental Design and Analysis. Liaison 10(2).
4.4.1
Designing an Experiment
1. Write down the single most important result that you expect to conclude from this experiment. The statement should be clear and concise.

2. Describe the experimental protocol. It is often difficult to describe, in simple language, how the data will be collected. The following questions may help in describing the experiment:

(a) Demonstrate the experimental apparatus. How does it work? Where will the experiment be run?

(b) What will the data look like (e.g. continuous between 10 and 50; integers between 0 and 10; categories such as “low”, “medium”, or “high”)?

(c) What environmental conditions could affect the measurements?

(d) Which factors will be controlled? At what levels?

(e) There must be a control group, otherwise the results can’t be compared against anything. The control group should receive treatment as identical as possible to that of the experimental groups to avoid the placebo effect.³ Also be aware of the Hawthorne effect, where the simple act of measurement causes a change in the response.

(f) Avoid experimenter bias. The experiment, if possible, should be double-blinded. This is to avoid introducing bias from the researcher into the experiment. For example, the researcher may discourage students who are in the control group from walking outside on cold days. In some cases, the experimental treatment is so different from the control that it is impossible to blind it. Great care must be taken in these experiments.

• in a trial of a post-heart attack drug, the researcher assigned the drug to the more severe cases so that they would have a better chance of survival!

(g) Which external influences are not controlled? Can they be blocked? Consider ‘blocking’ or ‘stratifying’ the experiment if you suspect there is an external lurking variable that will influence the results. The distinction between a blocking variable and a second experimental factor is not always clear cut.
Some typical blocking or stratification factors are:

• location - is the climate the same in different locations in the province?
• cage position - experimental animals may respond differently on the edges.
• plots of land

(h) Which combination of controlled factors will produce the lowest response? Which will produce the highest response?

(i) What would be a statistically significant result?

(j) What would be a practically significant result? This is often HARD to determine.

(k) Make a guess as to the noise level, i.e., the variation in the data.

(l) What are the sampling costs, i.e. the cost to take one measurement?

(m) Estimate the sample size requirements to detect important differences, and compare the required sample size to the sample size afforded by your budget.

³ Recall that a placebo effect occurs when the actual process involved in the experiment causes the results regardless of the treatment. For example, if students know they are in a study and are receiving Vitamin C, they may change their eating habits, which could affect the results regardless of whether they are receiving the treatment or not.
(n) Describe the randomization. The experimental units should be randomly assigned to the treatments. Lurking variables are always present and, in many cases, their effects are unknown. By randomly assigning experimental units to treatments, both the experimental and control groups should have roughly the same ‘distribution’ of lurking variables, and their effects should be roughly equal in both groups. Hence, in any comparison between groups, the effects of the lurking variables should cancel out. Note that randomization doesn’t eliminate the effect of lurking variables on the individual subjects - it only tries to ensure that their total effects on both groups are about the same. Some typical problems caused by failure to randomize:

• cage position having an effect on rats’ behavior. Cages on the edge get more light.
• fields differ in fertility, and so variety trials may be confounded with field fertility differences.
4.4.2
Analyzing the data
1. Look at the data. Use plots and basic descriptive statistics to check that the data “look” sensible. Are there any outliers? Decide what to do with outliers.

2. Can you see the main result directly from the data?

3. Draw a picture of how the data were actually collected - it may differ from the plan that was proposed.

4. Think about transformations. In some cases, the correct form of a variable is not obvious, e.g. should fuel consumption be specified in km/L or L/100 km?

5. Try a preliminary analysis. The analysis MUST match the design, i.e. an RCB analysis must be used to analyze data that were collected using a blocked design. This is the most crucial step of the analysis!

6. Plot residuals from the fit. Plot the residuals against fitted values, all predictors, time, or any other measurement that you can get your hands on (e.g. “change in lab technicians”). Check for outliers.

7. Which factors appear to be unimportant? Which appear to be most important?

8. Fit a simple sub-model that describes the basic characteristics of the data.

9. Check the residuals again.

10. Check that the final model is sensible, e.g. if interactions are present, all main effects contained in the interaction must also be present.

11. Multiple comparisons?
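Steps 1, 5, and 6 can be sketched numerically for a completely randomized design, where the fitted value for each observation is simply its group mean. The cold-count data and the 3-standard-deviation outlier rule below are hypothetical choices for illustration.

```python
from statistics import mean, stdev

# hypothetical cold counts per group
data = {"Vitamin C": [1, 2, 0, 1, 3, 1],
        "Placebo":   [2, 2, 1, 3, 2, 1]}

residuals = []
for group, ys in data.items():
    fit = mean(ys)                       # CRD fitted value = group mean
    residuals += [y - fit for y in ys]   # residual = observed - fitted

s = stdev(residuals)
outliers = [r for r in residuals if abs(r) > 3 * s]  # crude outlier flag
print(round(s, 3), outliers)
```

In practice you would plot these residuals against fitted values and any other available covariates rather than just flagging large ones.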
4.4.3
Writing the Report
1. State clearly the major findings of the analysis.

2. Construct a suitable graph to display the results.
3. Never report “naked” p-values. Remember, statistical significance is not the same as biological importance. Conversely, failure to detect a difference does not mean there is no difference.

4. Estimate effect sizes. Never report “naked” estimates - be sure to report a measure of precision (a standard error or a confidence interval).

5. Put the results in a biological context. A statistically significant result may not be biologically important.

6. Put relevant technical stuff into an appendix. Then halve the size of the appendix.
4.5

A Road Map to What is Ahead

4.5.1

Introduction
Experiments are usually done to compare the average response or the proportion of responses among two or more groups. For example, you may wish to compare the mean weight gain of two different feeds given to livestock in order to decide which is better. In these cases, there is no particular value of interest for each feed type; rather, the difference in performance is of interest. Other examples of experimental situations:

• has the level of popular support for a particular political party changed since the last poll?
• is a new drug more effective in curing a disease than the standard drug?
• is a new diet more effective in weight loss than a standard diet?
• is a new method of exercise more beneficial than a standard method?
4.5.2
Experimental Protocols
There are literally hundreds of different experimental designs that can be used in research. One of the jobs of a statistician is to be able to recognize the wide variety of designs and to understand how to apply very general methods to help analyze these experiments. We shall start with a particular type of experiment - namely, a single factor experiment with two or more levels. For example:

• we may be testing types of drugs (the factor) at three levels (a placebo, a standard drug, and a new drug).
• we may be examining the level of support over time (the factor) at three particular points in a campaign.

• we may be interested in comparing brands of batteries (the factor) and looking at two particular brands (the levels) - Duracell or Eveready.
The same methods can also be used when surveys are taken from two different populations. For example, is the level of support for a political party the same for males and females? Clearly, gender cannot be randomized (not yet) to individual people, so you must take a survey of males and females and then make a decision about the levels of support for each gender. Here gender would be the factor with 2 levels (male and female).

When the experiment is performed, you must pay attention to the RRR of statistical design (randomization, replication, and blocking). Randomization ensures that the influence of other, uncontrollable factors will be roughly equal in all treatment groups. Adequate replication ensures that the results will be precise (confidence intervals narrow) and that hypothesis tests are powerful. Blocking is a method of trying to control the suspected effects of a particular variable in the experiment. Blocking is a very powerful tool. For example:

• you know that the resting heart rate varies considerably among people. Consequently, you may decide to measure a person before and after an exercise to see the change in heart rate, rather than measuring one person before and a second person after the exercise. By taking both measurements on the same person, the difference in heart rate should be less variable than the difference between two different people.

• you know that driving habits vary considerably among drivers. Consequently, you may decide to compare the durability of two brands of tires by mounting both brands on the same car and doing a direct comparison under the same driving conditions, rather than using different cars with different drivers and (presumably) different driving conditions for each brand.

It is important that you recognize when such pairing or blocking takes place. In general, you can recognize it when repeated measurements are taken on the same experimental unit (a person, a car) rather than measurements on different experimental units.
There is a restriction in the randomization because you have to ensure that every treatment occurs on every unit (both brands of tire; both time points on a person). If blocking has not taken place, we say that the experiment was completely randomized or that it is a case of independent samples. The latter term is used because no unit is measured twice and, consequently, the measurements are independent of each other. For example:

• separate surveys of males and females
• people randomized into one and only one of placebo or vitamin C.
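The heart-rate example above can be sketched numerically to show why pairing helps: the person-to-person spread is large, but the within-person before/after differences are much less variable. The readings below are hypothetical.

```python
from statistics import stdev

# hypothetical resting and after-exercise heart rates for 5 people
before = [58, 72, 65, 80, 61]
after  = [70, 85, 76, 93, 72]

# within-person differences (the paired analysis works with these)
diffs = [a - b for b, a in zip(before, after)]

print(round(stdev(before), 2), round(stdev(diffs), 2))
```

The difference removes the between-person component of variation, which is exactly what a paired (blocked) analysis exploits.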
If blocking has taken place, we say that the experiment was blocked and, in the case of two levels (e.g. before/after measurements on a person; two brands of tire on each car), we use the term paired experiment. For example:

• measuring every person using two different walking styles
• measuring every person before and after a drug is administered.
• panel surveys where the same set of people is surveyed more than once.

Johnson, D. H. (2002). The importance of replication in wildlife studies. Journal of Wildlife Management 66, 919-932, has a very nice discussion of the role of replication in wildlife studies, particularly the role of metareplication, where entire studies are repeated. It is recommended that this paper be added to your pile to be read!
4.5.3
Some Common Designs
The most common analyses gather information on population means, proportions, and slopes. The analyses for each of these types of parameters can be summarized as follows (not all will be covered in this course):
Table 4.1: Some common experimental designs and analyses

Type of Parameter | Type of Experimental Design or Survey | Number of levels | Name of Analysis | Example
Mean | Completely randomized design or independent surveys | Two | Two independent sample t-test assuming equal variances | 20 patients are randomized to placebo or new drug. Time until death is measured.
Mean | Completely randomized design or independent surveys | Two or more | ‘One-way’ Analysis of Variance; ‘Single factor CRD - ANOVA’ | 30 patients are randomized to placebo or 2 other drugs. Time until death is measured.
Mean | Blocked design or panel surveys | Two | Paired t-test | Two brands of tires are randomly assigned to either the left or right of each car. Every car has both tires.
Mean | Blocked design or panel surveys | Two or more | ‘Blocked’ Analysis of Variance; ‘Single factor RCB - ANOVA’ | Four brands of tires are randomly assigned to the 4 tire positions of each car. Every car has all four brands.
Mean | Two factors in a completely randomized design | Two or more for each factor | ‘Two-way ANOVA’; Multi-factor CRD - ANOVA | The effects of both UV radiation and water temperature are randomly assigned to tanks to measure the growth of smolt.
Mean | Two factors in a blocked design | Two or more for each factor | ‘Two-way RCB’; Multi-factor RCB - ANOVA | The effects of both UV radiation and water temperature are randomly assigned to tanks to measure the growth of smolt. The experiment is performed at two different laboratories.
Mean | Two factors in a split-plot design | Two or more for each factor | Split-plot design | The effects of both UV radiation and water temperature are assigned to tanks to measure the growth of smolt. Each tank has a UV bulb suspended above it, but several tanks in series are connected to water of the same temperature.
Proportion | Completely randomized design or independent surveys | Two or more | Chi-square test for independence | Political preferences for three parties are measured at two points in time in two separate polls. Different people are used in both polls.
Proportion | Blocked design or panel surveys | Two or more | VERY COMPLEX ANALYSIS - not covered in this class | Political preferences for three parties are measured at two points in time by asking the same people in both polls (a panel study).
Slopes | Completely randomized design or independent surveys | Two or more | Analysis of Covariance (not covered in this class) | Plots of land are randomized to two brands of fertilizer. The dose is varied, and interest lies in comparing the increase in yield per change in fertilizer for both brands.
Slopes | Blocked design or panel surveys | Two or more | VERY COMPLEX ANALYSIS! | Two drugs are to be compared to see how the heart rate reduction varies by dose. Each person is given multiple doses of the same drug, one week apart, and the heart rate is determined.
Notice that the two-sample t-test and the paired t-test are special cases of more general methods called the ‘Analysis of Variance’ or ANOVA for short, to be explored in the following chapter.
Chapter 5

Single Factor - Completely Randomized Designs (a.k.a. One-way design)

Contents
5.1 Introduction . . . 273
5.2 Randomization . . . 274
5.2.1 Using a random number table . . . 275
5.2.2 Using a computer . . . 276
5.3 Assumptions - the overlooked aspect of experimental design . . . 285
5.3.1 Does the analysis match the design? . . . 286
5.3.2 No outliers should be present . . . 286
5.3.3 Equal treatment group population standard deviations? . . . 287
5.3.4 Are the errors normally distributed? . . . 288
5.3.5 Are the errors independent? . . . 289
5.4 Two-sample t-test - Introduction . . . 289
5.5 Example - comparing mean heights of children - two-sample t-test . . . 290
5.6 Example - Fat content and mean tumor weights - two-sample t-test . . . 297
5.7 Example - Growth hormone and mean final weight of cattle - two-sample t-test . . . 303
5.8 Power and sample size . . . 310
5.8.1 Basic ideas of power analysis . . . 311
5.8.2 Prospective Sample Size determination . . . 312
5.8.3 Example of power analysis/sample size determination . . . 313
5.8.4 Further Readings on Power analysis . . . 319
5.8.5 Retrospective Power Analysis . . . 320
5.8.6 Summary . . . 321
5.9 ANOVA approach - Introduction . . . 322
5.9.1 An intuitive explanation for the ANOVA method . . . 323
5.9.2 A modeling approach to ANOVA . . . 328
5.10 Example - Comparing phosphorus content - single-factor CRD ANOVA . . . 331
5.11 Example - Comparing battery lifetimes - single-factor CRD ANOVA . . . 343
5.12 Example - Cuckoo eggs - single-factor CRD ANOVA . . . 353
5.13 Multiple comparisons following ANOVA . . . 366
5.13.1 Why is there a problem? . . . 366
5.13.2 A simulation with no adjustment for multiple comparisons . . . 367
5.13.3 Comparisonwise- and Experimentwise Errors . . . 369
5.13.4 The Tukey-Adjusted t-Tests . . . 370
5.13.5 Recommendations for Multiple Comparisons . . . 372
5.13.6 Displaying the results of multiple comparisons . . . 373
5.14 Prospective Power and sample size - single-factor CRD ANOVA . . . 375
5.14.1 Using Tables . . . 376
5.14.2 Using SAS to determine power . . . 377
5.14.3 Retrospective Power Analysis . . . 378
5.15 Pseudo-replication and sub-sampling . . . 379
5.16 Frequently Asked Questions (FAQ) . . . 381
5.16.1 What does the F-statistic mean? . . . 381
5.16.2 What is a test statistic - how is it used? . . . 381
5.16.3 What is MSE? . . . 381
5.16.4 Power - various questions . . . 382
5.16.5 How to compare treatments to a single control? . . . 384
5.16.6 Experimental unit vs. observational unit . . . 384
5.16.7 Effects of analysis not matching design . . . 385
5.17 Table: Sample size determination for a two sample t-test . . . 388
5.18 Table: Sample size determination for a single factor, fixed effects, CRD . . . 390
5.19 Scientific papers illustrating the methods of this chapter . . . 393
5.19.1 Injury scores when trapping coyote with different trap designs . . . 393

5.1 Introduction
This is the most basic experimental design and the default design that most computer packages assume you have conducted. A single factor is varied over two or more levels. Levels of the factor (treatments) are completely randomly assigned to experimental units, and a response variable is measured on each unit. Interest lies in determining whether the mean response differs among the treatment levels in the respective populations.

Despite its 'simplicity', this design can be used to illustrate a number of issues common to all experiments. It is also a key component of more complex designs (such as the split-plot design).

NOT ALL EXPERIMENTS ARE OF THIS TYPE! Virtually all computer packages (even Excel) can correctly analyze experiments of this type. Unfortunately, not all experiments are single-factor CRDs. In my experience in reviewing reports, it often occurs that a single-factor CRD analysis is applied to designs where it is inappropriate. Therefore, don't just blindly apply a computer package to experimental data before verifying that you understand the design. Be sure to draw a picture of the design as illustrated in previous chapters.

An experiment MUST have the following attributes before it should be analyzed using single-factor CRD methods:

1. Single factor with two or more levels. Is there a single experimental factor that is being manipulated over the experimental units? [In more advanced courses, this can be relaxed somewhat.]

2. Experimental unit = observational unit. Failure to have the observational and experimental units be the same is the most common error made in the design and analysis of experiments - refer to the publication by Hurlbert (1984). Look at the physical unit being measured in the experiment - could each individual observational unit be individually randomized to a treatment level? Common experiments that fail this test are sub-sampling designs (e.g. fish in a tank).

3. Complete randomization. Could each experimental unit be randomized without restriction to any of the treatment levels? In many designs there are restrictions on randomization (e.g. blocking) that prevent the complete randomization of experimental units to treatments.

As noted in the chapter on Survey Sampling, analytical surveys can be conducted with similar aims. In this case, there must be a separate and complete randomization in the selection of experimental units, with equal probability of selection for each unit, before using these methods.

These designs are commonly analyzed using two seemingly different, but equivalent, procedures:

• Two-sample t-test - used with exactly two levels in the factor.

• Single-factor CRD ANOVA - used with two or more levels in the factor.

The two-sample t-test is a special case of the more general single-factor CRD ANOVA, and the two analyses give identical results in this case. Because of the large number of experiments that fall into this design, both types of analyses will be explored.
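The equivalence of the two procedures can be checked numerically. The notes themselves use SAS and JMP; the following is only an illustrative sketch in Python with made-up data, showing that a pooled two-sample t-test and a one-way ANOVA on the same two groups give the same p-value, with F = t².

```python
# Sketch (not from the notes): two-sample t-test vs. one-way ANOVA on
# the same two groups. The data below are hypothetical values.
import numpy as np
from scipy import stats

group1 = [23.1, 25.4, 22.8, 24.9, 26.0]
group2 = [27.2, 28.9, 26.5, 29.1, 27.8]

t, p_t = stats.ttest_ind(group1, group2)   # pooled-variance t-test
F, p_F = stats.f_oneway(group1, group2)    # single-factor CRD ANOVA

# For two groups the two analyses are the same test: F = t^2, p identical.
print(f"t^2 = {t**2:.4f}, F = {F:.4f}, p (t-test) = {p_t:.4f}, p (ANOVA) = {p_F:.4f}")
```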
5.2 Randomization

There are two "types" of experiments or surveys that are treated as Completely Randomized Designs (CRD). First, in a true experiment, there is a complete randomization of treatments to the experimental units. Second, in some cases assignment of treatments to experimental units is not feasible, e.g. it is currently impossible to randomly assign sex to an animal! These latter "experiments" are more properly called Analytical Surveys, and units need to be chosen at random from the populations forming each treatment group.
The basic method of randomizing the assignment of treatments to experimental units is analogous to placing slips of paper with the treatment levels into one hat, placing slips of paper listing the experimental units into another hat, mixing the slips in both hats, and then sequentially drawing slips from each hat to match a treatment with an experimental unit.

In the case where treatments cannot be randomly assigned to experimental units (e.g. you can't randomly assign sex to rats), you must select experimental units from the relevant population using a simple random sample. For example, if the factor is Sex with levels male and female, you must select the experimental units (people) from the population of males and the population of females using a simple random sample, as seen in the chapter on Survey Sampling.

In practice, randomization is done using either a random number table or random numbers generated on the computer.
5.2.1 Using a random number table

Many textbooks contain random number tables. Some tables are also available on the web, e.g. http://ts.nist.gov/WeightsAndMeasures/upload/AppenB-HB133-05-Z.pdf. Most tables are arranged in a similar fashion. They contain a list of random one-digit numbers (from 0 to 9 inclusive) that have been arbitrarily grouped into sets of 5 digits and arbitrarily divided into groups of 10 rows. [The row number is NOT part of the table.] Each entry in the table is equally likely to be one of the values from 0 to 9; each pair of digits in the table is equally likely to be one of the values from 00 to 99; each triplet in the table is equally likely to be one of the values 000 to 999; and so on.
Assigning treatments to experimental units

Suppose that you wanted to randomly assign 50 experimental units to two treatments. Consequently, each treatment must be assigned to 25 experimental units.

1. Label the experimental units from 1 to 50.

2. Enter the table at an arbitrary row and position in the row, and pick off successive two-digit groups. Each two-digit group will select one of the experimental units. [Ignore 00 and 51-99.] For example, suppose that you enter the table at row 48. The random digits from the table are:

   Row 48: 46499 94631 17985 09369 19009 51848 58794 48921 22845 55264

   and so the two-digit groups from this line are: 46 49 99 46 31 17 98 50 93 69 19 00 95 18 48 ...
The first 25 distinct two-digit groups that are between 01 and 50 (inclusive) are used to select the units for the first treatment. From the above table, experimental units 46, 49, 31, 17, 50, 19, 18, 48, . . . belong to the treatment 1 group. Note that the value 46 occurred twice - it is only used once.

3. The remainder of the experimental units belong to the second treatment.
This can be extended to many treatment groups by choosing first those experimental units that belong to the first group, then those that belong to the second group, and so on. An experimental unit cannot be assigned to more than one treatment group, so it belongs to the first group to which it is assigned.
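The same randomization can be done on a computer instead of with a table. A minimal sketch in Python (illustrative only; the notes do this step in JMP, as shown in a later section):

```python
# Sketch (not from the notes): randomly split 50 experimental units into
# two treatment groups of 25, the computer analogue of the table method.
import random

random.seed(48)                  # any seed; fixed here so the run is repeatable
units = list(range(1, 51))       # experimental units labelled 1..50
random.shuffle(units)            # complete randomization, no restrictions

treatment1 = sorted(units[:25])  # first 25 shuffled labels -> treatment 1
treatment2 = sorted(units[25:])  # remaining 25 labels -> treatment 2
print(treatment1)
print(treatment2)
```

Because the labels are shuffled rather than drawn with replacement, no unit can appear in both groups, which mirrors the "only used once" rule above.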
Selecting from the population

Suppose you wish to select 10 units from each of two populations of animals representing males and females. There are 50 animals of each sex. The following is repeated twice, once for males and once for females.

1. Label the units in the population from 1 to 50.

2. Enter the table at an arbitrary row and position in the row, and pick off successive two-digit groups. Each two-digit group will select one of the units from the population. Continue until 10 are selected. For example, suppose that you enter the table at row 48. The random digits from the table are:

   Row 48: 46499 94631 17985 09369 19009 51848 58794 48921 22845 55264

   and so the two-digit groups from this line are: 46 49 99 46 31 17 98 50 93 69 19 00 95 18 48. The first 10 distinct two-digit groups that are between 01 and 50 (inclusive) are used to select the units. From the above table, animals 46, 49, 31, 17, 50, 19, 18, 48, . . . are selected from the first treatment population. Note that the value 46 occurred twice - it is only used once.

This can be extended to many treatment groups by choosing first the units for the first group, then the units for the second group, and so on.
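The selection step also has a direct computer analogue. A sketch in Python (illustrative only, with the same assumed labels 1-50; the notes do this with JMP's Table → Subset, shown later):

```python
# Sketch (not from the notes): draw a simple random sample of 10 animals
# from each sex's population of 50 labelled animals.
import random

random.seed(1)
males = list(range(1, 51))     # animal labels 1..50 within the male population
females = list(range(1, 51))   # and likewise within the female population

# random.sample() draws without replacement, so no label is used twice
sample_m = sorted(random.sample(males, 10))
sample_f = sorted(random.sample(females, 10))
print(sample_m)
print(sample_f)
```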
5.2.2 Using a computer

A computer can be used to speed the process.
Randomly assign treatments to experimental units

JMP has a Design of Experiments module that is helpful in doing the randomization of treatments to experimental units. Start by selecting the Full Factorial Design under the DOE menu item:
For a single factor CRD, click on the Categorical button in the Factors part of the dialogue box and specify the number of levels in the factor. For example for a factor with 3 levels, you would specify:
If you have more than one factor, you would add the new factors in the same way. For each factor, click and change the name of the factor (default name for the first factor is X1) and the names of the levels (default labels for the levels of the first factors are L1, L2, etc.). Suppose the factor of interest is the dose of the drug and three levels are Control (0 dose), 1 mg/kg, and 2 mg/kg. The final dialogue box would look like:
Press the Continue button when all factors and their levels have been specified. It is not necessary that all factors have the same number of levels (e.g. one factor could have three levels and a second factor could have 2 levels).
Finally, specify the total number of experimental units available by changing the Number of replicates box. JMP labels the second set of experimental units as the first replicate, etc. For example, if you have 12 experimental units to be assigned to the 3 levels of the factor, there are 4 experimental units to be assigned to each level, which JMP treats as 3 replicates. Then press the Make Table button to generate an experimental randomization:
Assign a dose of 2 mg/kg to the first animal, treat the second animal as a control, assign a dose of 2 mg/kg to the 3rd animal, and so on. You may wish to change the name of the response to something more meaningful than simply Y. Once the experiment is done, enter the data in the Y column. The above randomization assumes that you want equal numbers of experimental units for each level. This is NOT a requirement for a valid design - there are cases where you may want more experimental units for certain levels of the experiment (e.g. it may be important to have smaller standard errors for the 2 mg/kg dose than the 1 mg/kg dose). Such designs can also be developed on the computer - ask me for details.
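For readers without JMP, the randomization that Make Table produces can be sketched in a few lines of Python. The dose labels below follow the example above; the code itself is only an illustration, not the notes' method.

```python
# Sketch (not from the notes): 12 experimental units assigned at random
# to 3 dose levels (Control, 1 mg/kg, 2 mg/kg), 4 units per level.
import random

random.seed(0)
levels = ["Control", "1 mg/kg", "2 mg/kg"]
assignments = levels * 4      # 4 replicates of each level -> 12 labels
random.shuffle(assignments)   # complete randomization over the 12 units

for unit, dose in enumerate(assignments, start=1):
    print(f"Unit {unit:2d}: {dose}")
```

An unbalanced design (more units at some levels) is obtained simply by changing the multiplicities in the `assignments` list before shuffling.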
Randomly selecting from populations

As noted earlier, it is sometimes impossible to randomize treatments to experimental units (e.g. it is not feasible to randomly assign sex to animals). In these Analytical Surveys, the key assumption is that the experimental units used in the experiment are a random sample from the relevant population. While the random selection of units from a population is commonly used in survey sampling, and there are plenty of methods to ensure this selection is done properly in a survey-sampling context, the experimental design context is fraught with difficulties that make much of the randomization moot. For example, in experiments with animals and sex as a factor, the experimenter has no way to ensure that the animals available are a random sample from all animals of each sex. Typically, for small rodents such as mice, the animals are ordered from a supply company, housed in an animal care facility not under the direct
control of the experimenter, and are supplied on an as-needed basis. All that can typically be done is hope for the best. What is the actual population of interest? All animals of that sex? All animals of that sex born in a particular year? All animals of that sex born in that year for that particular supply company?

Suppose you have an experiment that will examine differences in growth rates between the two sexes of animals. You clearly cannot assign sex to individual animals. But you have a supply of 20 animals of each sex that are numbered from 1 to 20, and you need to select 10 animals of each sex. You would repeat the following procedure twice. Here is the list of male animals, which are housed in separate cages:
Use the Table → Subset option:
and specify the number of animals to be selected:
This will generate a random sample of size 10 from the original table.
Repeat the above procedure for the female animals.
5.3 Assumptions - the overlooked aspect of experimental design

Each and every statistical procedure makes a number of assumptions about the data that should be verified as the analysis proceeds. Some of these assumptions can be examined using the data at hand. Other assumptions, often the most important ones, can only be assessed using the meta-data about the experiment.
The set of assumptions for the single-factor CRD are also applicable, for the most part, to most other ANOVA situations. In subsequent chapters these will be revisited, and those assumptions that are specific to a particular design will be highlighted. The assumptions for the single-factor CRD are as follows. The reader should refer to the examples in each chapter for details on how to assess each assumption in actual practice using your statistical package.
5.3.1 Does the analysis match the design?

THIS IS THE MOST CRUCIAL ASSUMPTION! The default assumption of most computer packages is that the data were collected under a single-factor Completely Randomized Design (CRD). It is not possible to check this assumption by simply looking at the data; you must spend some time examining exactly how the treatments were randomized to experimental units, and whether the observational unit is the same as the experimental unit (i.e. the meta-data about the experiment). This comes down to the RRR's of statistics - how were the experimental units randomized, what are the numbers of experimental units, and are there groupings of experimental units (blocks)? Typical problems are lack of randomization and pseudo-replication.

Was randomization complete? If you are dealing with an analytical survey, verify that the samples are true random samples (not merely haphazard samples). If you are dealing with a true experiment, ensure that there was a complete randomization of treatments to experimental units.

What is the true sample size? Are the experimental units the same as the observational units? In pseudo-replication (to be covered later), the experimental and observational units are different. An example of pseudo-replication is an experiment with fish in tanks where the tank is the experimental unit (e.g. chemicals are added to the tank) but the fish are the observational units.

No blocking present? This is similar to the question about complete randomization. The experimental units should NOT be grouped into more homogeneous units with restricted randomizations within each group. The distinction between CRD and blocked designs will become more apparent in later chapters. The simplest case of a blocked design which is NOT a CRD is a paired design where each experimental object gets both treatments (e.g. both doses of a drug in random order).
5.3.2 No outliers should be present

As you will see later in the chapter, the idea behind the tests for equality of means is, ironically, to compare the relative variation among the group means to the variation within each group. Outliers can severely distort estimates of the within-group variation and so severely distort the results of the statistical test.
Construct side-by-side scatterplots of the individual observations for each group. Check for any outliers - are there observations that appear to be isolated from the majority of observations in the group? Try to find a cause for any outliers. If the cause is easily corrected and not directly related to the treatment effects (like a data-recording error), then alter the value. Otherwise, include a discussion of the outliers and their potential significance to the interpretation of the results in your report. One direct way to assess the potential impact of an outlier is to do the analysis with and without the outlier. If there is no substantive difference in the results - be happy!

A demonstration of the effect of outliers in a completely randomized design is available in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
5.3.3 Equal treatment group population standard deviations?

Every procedure for comparing means that is a variation of ANOVA assumes that all treatment groups have the same population standard deviation.1 This can be informally assessed by computing the sample standard deviation for each treatment group to see if they are approximately equal. Because the sample standard deviation is quite variable over repeated samples from the same population, exact equality of the sample standard deviations is not expected. In fact, unless the ratio of the sample standard deviations is extreme (e.g. more than a 5:1 ratio between the smallest and largest value), the assumption of equal population standard deviations is likely satisfied. More formal tests of the equality of population standard deviations can be constructed (e.g. Levene's test is recommended), but these are not covered in this course.

Often you can anticipate an increase in the amount of chance variation with an increase in the mean. For example, traps with an ineffective bait will typically catch very few insects; the numbers caught may typically range from 0 to under 10. By contrast, a highly effective bait will tend to pull in more insects, but also with a greater range. Both the mean and the standard deviation will tend to be larger.

If you have equal or approximately equal numbers of replicates in each group, and not too many groups, heteroscedasticity (unequal population standard deviations) will not cause serious problems for an Analysis of Variance. However, heteroscedasticity does cause problems for multiple comparisons (covered later in this section). By pooling the data from all groups to estimate a common σ, you can introduce serious bias into the denominator of the t-statistics that compare the means of the groups with larger standard deviations. In fact, you will underestimate the true standard errors of these means, and could easily mistake a large chance error for a real, systematic difference.

I recommend that you start by constructing side-by-side dot plots comparing the observations for each group. Does the scatter seem similar for all groups? Then compute the sample standard deviation of each group. Is there a wide range in the standard deviations? [I would be concerned if the ratio of the largest to the smallest standard deviation is 5x or more.] Plot the standard deviation of each treatment group against the mean of each treatment group. Does there appear to be a relationship between the standard deviation and the mean?2

Sometimes transformations can be used to alleviate some of the problems. For example, if the response variable is a count, a log or square-root transform often makes the standard deviations approximately equal in all groups. If all else fails, procedures are available that relax this assumption (e.g. the two-sample t-test using the Satterthwaite approximation, or bootstrap methods).3

CAUTION: despite their name, non-parametric methods often make similar assumptions about equal variation in the populations. It is a common fallacy that non-parametric methods have NO assumptions - they just have different assumptions.4

Special case for CRD with two groups. It turns out that it is not necessary to make the assumption of equal group standard deviations in the special case of a single-factor CRD design with exactly 2 levels. In this special case, a variant of the standard t-test can be used which is robust against inequality of standard deviations. This will be explored in the examples.

1 The sample standard deviations are estimates of the population standard deviations, but you don't really care about the sample standard deviations, and testing whether the sample standard deviations are equal is nonsensical.
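The checks described above can be sketched for the two-group special case. The notes perform these steps in JMP/SAS; the Python below, with hypothetical data, is only an illustration of the informal SD-ratio check, Levene's test, and the robust (Welch/Satterthwaite) variant of the t-test.

```python
# Sketch (not from the notes): equal-SD checks and the robust t-test
# variant for a two-group CRD. Data are hypothetical illustrative values.
import numpy as np
from scipy import stats

a = np.array([10.2, 11.5, 9.8, 10.9, 11.1])
b = np.array([14.0, 18.2, 12.5, 20.1, 16.3])   # visibly more variable group

# Informal check: ratio of largest to smallest sample SD (worry at 5:1 or more)
sds = sorted([a.std(ddof=1), b.std(ddof=1)])
ratio = sds[1] / sds[0]
print(f"SD ratio (largest/smallest): {ratio:.1f}")

# Formal check: Levene's test of equal population variation
W, p_lev = stats.levene(a, b)
print(f"Levene p-value: {p_lev:.3f}")

# Robust comparison of means: Welch's t-test (Satterthwaite approximation),
# which does not assume equal population standard deviations
t_w, p_w = stats.ttest_ind(a, b, equal_var=False)
print(f"Welch t = {t_w:.2f}, p = {p_w:.4f}")
```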
5.3.4 Are the errors normally distributed?

The procedures in this and later chapters for testing hypotheses about equality of means in the respective populations assume that observations WITHIN each treatment group are normally distributed. It is a common misconception that the assumption of normality applies to the pooled set of observations - the assumption of normality applies WITHIN each treatment group. However, because ANOVA estimates treatment effects using sample averages, the assumption of normality is less important when sample sizes within each treatment group are reasonably large. Conversely, when sample sizes are very small in each treatment group, any formal test for normality will have low power to detect non-normality. Consequently, this assumption is most crucial in the cases where you can do least to detect violations of it!

I recommend that you construct side-by-side dot plots or boxplots of the individual observations for each group. Does the distribution about the mean seem skewed? Find the residuals after the model is fit and examine normal probability plots. Sometimes problems can be alleviated by transformations. If the sample sizes are large, non-normality really isn't a problem. Again, if all else fails, a bootstrap procedure or a non-parametric method (but see the cautions above) can be used.

2 Taylor's Power Law is an empirical rule that relates the standard deviation and the mean. By fitting Taylor's Power Law to these plots, the appropriate transform can often be determined. This is beyond the scope of this course.
3 These will not be covered in this course.
4 For example, in rank-based methods, where the data are ranked and the ranks used in the analysis, the populations are still assumed to have equal standard deviations.
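The residual-based check described above can be sketched as follows. This Python snippet, with hypothetical data, is only an illustration of the idea (the notes examine normal probability plots in JMP/SAS): subtract each group's own mean from its observations, then assess the pooled residuals.

```python
# Sketch (not from the notes): residuals within each treatment group and
# a formal normality check. Data are hypothetical illustrative values.
import numpy as np
from scipy import stats

groups = [
    np.array([5.1, 4.8, 5.5, 5.0, 4.7]),
    np.array([7.2, 6.9, 7.8, 7.1, 7.4]),
]

# residual = observation minus its own group's mean, so normality is
# assessed WITHIN groups, not on the pooled raw observations
residuals = np.concatenate([g - g.mean() for g in groups])

W, p = stats.shapiro(residuals)   # formal test; low power at small n
print(f"Shapiro-Wilk p-value: {p:.3f}")
# stats.probplot(residuals) returns the points for a normal probability plot
```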
5.3.5 Are the errors independent?
Another key assumption is that experimental units are independent of each other; for example, the response of one experimental animal does not affect the response of another. This is often violated by not paying attention to the details of the experimental protocol. For example, technicians get tired over time and give less reliable readings; or the temperature in the lab increases during the day because of sunlight pouring in from a nearby window, and this affects the response of the experimental units; or multiple animals are housed in the same pen, and the dominant animals affect the responses of the sub-dominant animals.

If the chance errors (residual variations) are not independent, then the reported standard errors of the estimated treatment effects will be incorrect and the results of the analysis will be INVALID! In particular, if different observations from the same group are positively correlated (as would be the case if the "replicates" were all collected from a single location, and you wanted to extend your inference to other locations), then you could seriously underestimate the standard error of your estimates, and generate artificially significant p-values. This sin is an example of a type of spatial pseudo-replication (Hurlbert, 1984).

I recommend that you plot the residuals in the order the experiment was performed. The residual plot should show a random scatter about 0. A non-random-looking pattern in the residual plot should be investigated. If your experiment has large non-independence among the experimental units, seek help.
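One simple numerical companion to the run-order plot is the lag-1 correlation of the residuals, which should be near 0 when the errors are independent. This is an illustrative sketch with hypothetical residuals, not a procedure from the notes.

```python
# Sketch (not from the notes): lag-1 correlation of residuals taken in
# the order the experiment was performed. A drifting pattern (as in the
# hypothetical values below) shows up as correlation well away from 0.
import numpy as np

residuals = np.array([0.5, 0.7, 0.4, -0.2, -0.5, -0.6, 0.1, 0.3, 0.6, -0.4])

# correlate each residual with the one that immediately follows it
r1 = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]
print(f"lag-1 correlation of residuals: {r1:.2f}")
```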
5.4 Two-sample t-test - Introduction
This is the famous "two-sample t-test", first developed in the early 1900's. It is likely the most widely used methodology in research studies, followed (perhaps) by the slightly more general single-factor CRD ANOVA (Analysis of Variance) methodology.

The basic approach in hypothesis testing is to formulate a hypothesis in terms of population parameters, collect some data, and then see if the data are unusual under the assumption that the hypothesis is true. If so, there is evidence against the null hypothesis. However, as seen earlier in this course, the role of hypothesis testing in research studies is debatable; modern approaches play down hypothesis testing in favor of estimation (confidence intervals).

Many books spend an inordinate amount of time worrying about one- or two-sided hypothesis tests. My personal view on this matter is that one of three outcomes usually occurs:

1. The results are so different in the two groups that using a one- or two-sided test makes no difference.

2. The results are so similar in the two groups that neither test will detect a statistically significant difference.

3. The results are borderline, and I'd worry about other problems in the experiment such as violations of assumptions.
© 2012 Carl James Schwarz, December 21, 2012
Consequently, I don’t worry too much about one- or two-sided hypotheses - I’d much rather see a confidence interval, more attention spent on verifying the assumptions of the design, and good graphical methods displaying the results of the experiment. Accordingly, we will always conduct two-sided tests, i.e. our alternate hypothesis will always be looking for differences in the two means in either direction.
5.5 Example - comparing mean heights of children - two-sample t-test
It is well known that adult males and females are, on average, different heights. But is this true at 12 years of age? A sample of 63 children (12 years old) was measured in a school and the height and weight recorded. The data is available in the htwt12.csv file in the Sample Program Library at http://www.stat. sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The data are imported into SAS in the usual way:
data height;
   length gender $4;
   infile 'htwt12.csv' dlm=',' dsd missover firstobs=2;
   input gender $ height weight;
run;
Part of the raw data are shown below:

Obs   gender   height   weight
  1   f          62.3    105.0
  2   f          63.3    108.0
  3   f          58.3     93.0
  4   f          58.8     89.0
  5   f          59.5     78.5
  6   f          61.3    115.0
  7   f          56.3     83.5
  8   f          64.3    110.5
  9   f          61.3     94.0
 10   f          59.8     84.5
The first question to be resolved before any analysis is attempted is to verify that this indeed is a single-factor CRD. The single factor is sex with two levels (males and females). The treatments are the sexes. This is clearly NOT a true experiment as sex cannot be randomized to children - it is an analytical survey. What is the population of interest and are children (the experimental units) randomly selected from each population? The population is presumably all 12 year-olds in this part of the city. Do children that come to this school represent a random selection from all students in the area? The experimental unit and the observational unit are equal (there is only one measurement per child).⁵ There doesn’t seem to be any grouping of children into more homogeneous sets (blocks, as discussed in later chapters) that could account for some of the height differences (again ignoring twins). Hence, it appears that this is indeed a single-factor CRD design. Let:
• µm and µf represent the population mean height of males and females, respectively.
• nm and nf represent the sample sizes for males and females, respectively.
• Y m and Y f represent the sample mean height of males and females, respectively.
• sm and sf represent the sample standard deviation of heights for males and females, respectively.
The two-sample test proceeds as follows.
1. Specify the hypotheses. We are not really interested in the particular value of the mean heights for the males or the females. What we are really interested in is comparing the mean heights of the two populations, or, equivalently, in examining the difference in the mean heights between the two populations. At the moment, we don’t really know if males or females are taller and we are interested in detecting differences in either direction.
Consequently, the hypotheses of interest are:
H: µf = µm or µf − µm = 0
A: µf ≠ µm or µf − µm ≠ 0
Note again that the hypotheses are in terms of population parameters and we are interested in testing if the difference is 0. [A difference of 0 would imply no difference in the mean heights.] We are interested both in whether males are taller (on average) and in whether they are shorter (on average) than females.⁶
2. Collect data. Because the samples are independent and no person is measured twice, each person has a separate row in the table.
⁵ We shall ignore the small number of twin children. In these cases what is the experimental unit? The family or the child?
⁶ This is technically called a two-sided test.
The data does not need to be sorted in any particular order, but must be entered in this “stacked” format with one column representing the factor and one column representing the data. We start by using Proc SGplot to create side-by-side dot plots and box plots:

proc sgplot data=height;
   title2 'Plot of height vs. gender';
   scatter x=gender y=height;
   xaxis offsetmin=.05 offsetmax=.05;
run;

which gives
And then:
proc sgplot data=height;
   title2 'Box plots';
   vbox height / group=gender notches; /* the notches option creates an overlap region to compare if the medians are equal */
run;
which gives
Proc Tabulate is used to construct a table of means and standard deviations:

proc tabulate data=height;
   title2 'some basic summary statistics';
   class gender;
   var height;
   table gender, height*(n*f=5.0 mean*f=5.1 std*f=5.1 stderr*f=7.2 lclm*f=7.1 uclm*f=7.1);
run;

which gives:

          height
gender      N   Mean   Std   StdErr   95_LCLM   95_UCLM
f          30   59.5   3.0     0.55      58.4      60.6
m          33   58.9   3.1     0.54      57.8      60.0
The side-by-side dot plots show roughly comparable scatter for each group, and the sample standard deviations of the two groups are roughly equal. The assumption of equal standard deviations in each treatment group appears to be tenable. [Note that in the two-sample CRD experiment, the assumption of equal standard deviations is not required if the unequal-variance t-test is used; as noted earlier, in this special case of a single-factor CRD with two levels, the assumption can be relaxed.] There are no obvious outliers in either group. The basic summary statistics show that there doesn’t appear to be much of a difference between the means of the two groups. You could compute a confidence interval for the mean of each group using each group’s own data (the sample mean and estimated standard error) using methods described earlier.
3. Compute a test-statistic, a p-value and make a decision. Proc Ttest is used to perform the test of the hypothesis that the two means are the same:

ods graphics on;
proc ttest data=height plot=all dist=normal;
   title2 'test of equality of heights between males and females';
   class gender;
   var height;
   ods output ttests = TtestTest;
   ods output ConfLimits=TtestCL;
   ods output Statistics=TtestStat;
run;
ods graphics off;

The output is voluminous, and selected portions are reproduced below:

Variable   Method          Variances   t Value       DF   Pr > |t|
height     Pooled          Equal          0.82       61     0.4171
height     Satterthwaite   Unequal        0.82   60.714     0.4165

Variable   gender       Method          Variances     Mean   Lower Limit of Mean   Upper Limit of Mean
height     Diff (1-2)   Pooled          Equal       0.6252               -0.9049                2.1552
height     Diff (1-2)   Satterthwaite   Unequal     0.6252               -0.9029                2.1532

Variable   gender        N      Mean   Std Error   Lower Limit of Mean   Upper Limit of Mean
height     f            30   59.5100      0.5455               58.3943               60.6257
height     m            33   58.8848      0.5351               57.7950               59.9747
height     Diff (1-2)    _    0.6252      0.7652               -0.9049                2.1552

and a final plot of:
This table has LOTS OF GOOD STUFF!
• The unequal-variance t-test (also known as the Welch test, or the Satterthwaite test) does NOT assume equal standard deviations in both groups. Most statisticians recommend ALWAYS using this procedure rather than the traditional two-sample equal-variance (also known as the pooled-variance) t-test, even if the two sample standard deviations are similar, which would indicate that the latter procedure would be valid.
• The estimated difference in the population mean heights (male average - female average) is estimated by Y m − Y f = −.625 inches with a standard error of the estimated difference in the means of 0.764 inches. [You don’t have to worry about the formula for the estimated standard error, but you can get this by dividing the difference in means by the t-statistic.] A 95% confidence interval for the difference in the population mean heights is also shown and ranges from (−2.16 → 0.90) inches. [An approximate 95% confidence interval can be computed as estimate ± 2 se of estimate.] Because the 95% c.i. for the difference in the population mean heights includes zero, there is no evidence that the population means⁷ are unequal. Depending on the package used, the difference in the means may be computed in the other direction with the corresponding confidence interval reversed appropriately. At this point, the confidence interval really provides all the information you need to make a decision about your hypothesis. The confidence interval for the difference in mean heights includes zero, so there is no evidence of a difference in the population mean heights. Notice that the confidence intervals (see the plot) cover the value of 0, indicating no evidence of a difference. We continue with the formal hypothesis testing:
• The test-statistic is T = −.818. This is a measure of how unusual the data are compared to the (null) hypothesis of no difference in the population means, expressed as a fraction of standard errors. In this case, the observed difference in the means is about 0.8 standard errors away from the value of 0 (representing no difference in the means). [It is not necessary to know how to compute this value.] Test statistics are “hold-overs” from the BC (Before Computers) era, when the test statistic was compared to a statistical table to see if it was “statistically significant” or not. In modern days, the test-statistic really doesn’t serve any useful purpose and can usually be ignored.
Similarly, the line labeled “DF” (degrees of freedom) is not really needed when computers are doing the heavy lifting.
• The p-value is 0.416. The p-value is a measure of how consistent the data are with the null hypothesis. It DOES NOT MEASURE the probability that the hypothesis is true!⁸ The p-value is attached to the data, not to the hypothesis. Because the p-value is large (a rough rule of thumb is to compare the p-value to the value of 0.05), we conclude that there is no evidence against the hypothesis that the average height of male and female children is equal. Of course, we haven’t proven that both genders have the same mean height. All we have shown is that, based on our sample of size 63, there is not enough evidence to conclude that the mean heights in the population are different. Maybe our experiment was too small? Most good statistical packages have extensive facilities to help you plan future experiments. This will be discussed in the section on Statistical Power later in this chapter. SAS also provided information for the equal-variance two-sample t-test in the above output. Because the two sample standard deviations are so similar, the results are virtually identical between the two variants of the t-test.
⁷ Why do we say the population means? Why is the sentence in terms of sample means?
⁸ Hypotheses must be true or false; they cannot have a probability of being true. For example, suppose you ask a child if he/she took a cookie. It makes no sense to say that there is a 47% chance the child took the cookie – either the child took the cookie or the child didn’t take the cookie.
Modern statistical practice recommends that you ALWAYS use the unequal-variance t-test (the first test) as it works properly regardless of whether the standard deviations are approximately equal or not. The “equal-variance” t-test is of historical interest, but is a special case of the more general Analysis of Variance methods which will be discussed later in this chapter. The formulas to compute the test statistic and df are available in many textbooks and on the web, e.g. http://en.wikipedia.org/wiki/Student’s_t-test, and are not repeated here as they provide little insight into the logic of the process. Similarly, many textbooks show how to look up the test statistic in a table to find the p-value, but this is pointless now that most computers can compute the p-value directly.
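For readers who want to check the arithmetic anyway, the Welch statistic and Satterthwaite df printed by SAS can be reproduced from the summary statistics alone. The following Python sketch is illustrative only (the helper function is mine, and the standard deviations are back-computed from the reported standard errors):

```python
import math

def welch_t(n1, mean1, sd1, n2, mean2, sd2):
    """Welch (unequal-variance) t statistic and Satterthwaite df."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2   # squared standard errors
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Summary statistics reported above for the height example.  The standard
# deviations are back-computed from the reported SEs (sd = se * sqrt(n)).
sd_f = 0.5455 * math.sqrt(30)
sd_m = 0.5351 * math.sqrt(33)

t, df = welch_t(30, 59.5100, sd_f, 33, 58.8848, sd_m)
print(f"t = {t:.2f}, df = {df:.1f}")  # compare: SAS Satterthwaite row gives 0.82, 60.714
```

Note that only the group sizes, means, and standard deviations are needed; this is handy for checking results reported in papers that do not publish raw data.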
5.6 Example - Fat content and mean tumor weights - two-sample t-test
Recent epidemiological studies have shown that people who consume high-fat diets have higher cancer rates and more severe cancers than people who consume low-fat diets. Rats were randomized to one of two diets, one low in fat and the other high in fat. [Why and how was randomization done?] At the end of the study, the rats were sacrificed, the tumors excised, and the weight of the tumors found. Here are the raw data:

Low fat   High fat
  12.2       12.3
   9.7       10.2
   9.2       11.8
   8.2       11.7
  11.2       11.1
   9.5       14.6
   8.4       11.9
   9.3        9.8
  11.1       11.3
  10.8       10.3
The data is available in the fattumor.csv file in the Sample Program Library at http://www.stat. sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The data are imported into SAS in the usual way:
data weight;
   length diet $10.;
   infile 'fattumor.csv' dlm=',' dsd missover firstobs=2;
   input diet $ weight;
run;
Part of the raw data are shown below:

Obs   diet       weight
  1   High Fat     12.3
  2   High Fat     10.2
  3   High Fat     11.8
  4   High Fat     11.7
  5   High Fat     11.1
  6   High Fat     14.6
  7   High Fat     11.9
  8   High Fat      9.8
  9   High Fat     11.3
 10   High Fat     10.3
First verify that a single-factor CRD analysis is appropriate. What is the factor? How many levels? What are the treatments? How were treatments assigned to experimental units? Is the experimental unit the same as the observational unit? Let • µL and µH represent the true mean weight of tumors in all rats under the two diets • Y L and Y H , etc. represent the sample statistics 1. Formulate the hypotheses. H: µH = µL or µH − µL = 0 A: µH 6= µL or µH − µL 6= 0 Note that we have formulated the alternate hypothesis as a two-sided hypothesis (we are interested in detecting differences in either direction). It is possible to formulate the hypothesis so that changes in a single direction only (i.e. does higher fat lead to larger (on average) differences in tumor weight), but this is not done in this course. c
2. Collect data and look at summary statistics. The data should be entered into most packages in a case-by-variable structure, i.e. each row should contain data for a SINGLE experimental unit and each column represents a different variable. Most packages will require one column to be the experimental factor and a second column to be the response variable. The order of the data rows is not important, nor do the data for one group have to be entered before the data for the second group. It is important that the factor variable have the factor or character attribute. The response variable should be a continuous scaled variable. We start by using Proc SGplot to create side-by-side dot plots and box plots:

proc sgplot data=weight;
   title2 'Plot of weight vs. diet';
   scatter x=diet y=weight;
   xaxis offsetmin=.05 offsetmax=.05;
run;

which gives
And then:
proc sgplot data=weight;
   title2 'Box plots';
   vbox weight / group=diet notches; /* the notches option creates an overlap region to compare if the medians are equal */
run;

which gives
Proc Tabulate is used to construct a table of means and standard deviations:

proc tabulate data=weight;
   title2 'some basic summary statistics';
   class diet;
   var weight;
   table diet, weight*(n*f=5.0 mean*f=5.1 std*f=5.1 stderr*f=7.2 lclm*f=7.1 uclm*f=7.1);
run;

which gives:

           weight
diet         N   Mean   Std   StdErr   95_LCLM   95_UCLM
High Fat    10   11.5   1.4     0.43      10.5      12.5
Low Fat     10   10.0   1.3     0.41       9.0      10.9
From the dot plot, we see that there are no obvious outliers in either group. We notice that the sample standard deviations are about equal in both groups, so the assumption of equal population standard deviations is likely tenable. There is some, but not a whole lot of, overlap between the confidence intervals for the individual group means, which would indicate some evidence that the population means may differ.
3. Find the test-statistic, the p-value and make a decision. Proc Ttest is used to perform the test of the hypothesis that the two means are the same:

ods graphics on;
proc ttest data=weight plot=all dist=normal;
   title2 'test of equality of weights between the two diets';
   class diet;
   var weight;
   ods output ttests = TtestTest;
   ods output ConfLimits=TtestCL;
   ods output Statistics=TtestStat;
run;
ods graphics off;

The output is voluminous, and selected portions are reproduced below:

Variable   Method          Variances   t Value       DF   Pr > |t|
weight     Pooled          Equal          2.58       18     0.0190
weight     Satterthwaite   Unequal        2.58   17.967     0.0190

Variable   diet         Method          Variances     Mean   Lower Limit of Mean   Upper Limit of Mean
weight     Diff (1-2)   Pooled          Equal       1.5400                0.2844                2.7956
weight     Diff (1-2)   Satterthwaite   Unequal     1.5400                0.2843                2.7957

Variable   diet          N      Mean   Std Error   Lower Limit of Mean   Upper Limit of Mean
weight     High Fat     10   11.5000      0.4315               10.5238               12.4762
weight     Low Fat      10    9.9600      0.4134                9.0247               10.8953
weight     Diff (1-2)    _    1.5400      0.5976                0.2844                2.7956

and a final plot of:
As noted in the previous example, there are two variants of the t-test, the equal- and unequal-variance t-tests. Modern statistical practice is to use the unequal-variance t-test (the one selected above) as it performs well under all circumstances without worrying whether the standard deviations are equal among groups. The estimated difference (low − high) in the true (unknown) mean weights is −1.54 g (se .60 g), with a 95% confidence interval that doesn’t cover 0. [Depending on your package, the signs may be reversed in the estimates and the confidence intervals will also be reversed.] There is evidence then that the low fat diet has a lower mean tumor weight than the high fat diet. The confidence interval provides sufficient information to answer the research question, but a formal hypothesis test can also be conducted. Notice that the confidence interval for the difference (see the plot) does not cover the value of 0, indicating evidence of a difference.
The formal test-statistic has the value of −2.577. In order to compute the p-value, the test-statistic is compared to a t-distribution with 18 df. [It is not necessary to know how to compute this statistic, nor the df.] The two-sided p-value is 0.0190. If the alternate hypothesis was one-sided, i.e. if we were interested only in whether the high fat diet increased tumor weights (on average) over low fat diets, then the sided=L or sided=U option on the Proc Ttest statement could be used. Because the p-value is small, we conclude that there is evidence against the hypothesis that the mean weights of the tumors from the two diets are equal. Furthermore, there is evidence that the high fat diet gives tumors with a larger average weight than the low fat diet. We have not proved that the high fat diet gives heavier (on average) tumors than the low fat diet. All that we have shown is that if there was no difference in the means, then the observed data would be very unusual. Note that while it is possible to conduct a one-sided test of the hypothesis, these are rarely useful. The paper:
Hurlbert, S. H. and Lombardi, C. M. (2012). Lopsided reasoning on lopsided tests and multiple comparisons. Australian and New Zealand Journal of Statistics, **, ***-****. http://dx.doi.org/10.1111/j.1467-842X.2012.00652.x
discusses the problems with one-sided tests and recommends that they be rarely used. The basic problem is what do you do if you happen to find a result that is in the opposite direction from the alternative hypothesis? Do you simply ignore this “interesting finding”? About the only time that a one-tailed test is justified is when you are testing for compliance against a known standard. For example, in quality control, you want to know if the defect rate is more than an acceptable value.
Another example would be water quality testing where you want to ensure that the level of a chemical is below an acceptable maximum value. In all other cases, two-sided tests should be used. For the rest of this chapter (and the entire set of notes available on-line) we only use two-sided tests. Note that the whole question of one- or two-sided tests is irrelevant once you have more than two treatment groups, as will be noted later.
SAS also provided information for the equal-variance two-sample t-test in the above output. In this experiment, the sample standard deviations are approximately equal, so the equal-variance t-test gives virtually the same results and either could be used. Because the unequal-variance t-test can be used in both circumstances, it is the recommended test to perform for a two-sample CRD experiment.
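Since the raw tumor weights are listed in full above, the Satterthwaite line of the SAS output can be verified directly from the data. The following Python sketch (the helper function is my own, not part of the SAS analysis) computes the Welch t statistic and approximate df:

```python
import math
import statistics

# Raw tumor weights transcribed from the table above.
low_fat = [12.2, 9.7, 9.2, 8.2, 11.2, 9.5, 8.4, 9.3, 11.1, 10.8]
high_fat = [12.3, 10.2, 11.8, 11.7, 11.1, 14.6, 11.9, 9.8, 11.3, 10.3]

def welch_t(x, y):
    """Welch (unequal-variance) two-sample t statistic and Satterthwaite df."""
    nx, ny = len(x), len(y)
    vx = statistics.variance(x) / nx   # squared SE of each group mean
    vy = statistics.variance(y) / ny
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t, df

# Diff (1-2) = High Fat - Low Fat, matching the ordering in the SAS output.
t, df = welch_t(high_fat, low_fat)
print(f"diff = {statistics.mean(high_fat) - statistics.mean(low_fat):.2f}")
print(f"t = {t:.2f}, df = {df:.2f}")  # compare: SAS Satterthwaite row gives 2.58, 17.967
```

The sign of t depends only on which group is subtracted from which; the two-sided p-value is unaffected.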
5.7 Example - Growth hormone and mean final weight of cattle - two-sample t-test
Does feeding growth hormone increase the final weight of cattle prior to market?
Cattle were randomized to one of two groups - either a placebo (control) group or a group that received injections of the hormone. In this experiment, the sample sizes were not equal (there was a reason, but it is not important for this example). Here are the raw data:

Hormone   Placebo (Control)
   1784        2055
   1757        2028
   1737        1691
   1926        1880
   2054        1763
   1891        1613
   1794        1796
   1745        1562
   1831        1869
   1802        1796
   1876           ·
   1970           ·
The data is available in the hormone.csv file in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The data are imported into SAS and then stacked so that one column is the treatment variable and one column is the response variable:
data hormone;
   infile 'hormone.csv' dlm=',' dsd missover firstobs=2;
   input hormone control;
   /* now convert to the usual structure for analysis - stack the variables */
   trt = 'hormone'; weight = hormone; output;
   trt = 'control'; weight = control; output;
   keep trt weight;
run;
Part of the raw data are shown below:

Obs   trt       weight
  1   hormone     1784
  2   control     2055
  3   hormone     1757
  4   control     2028
  5   hormone     1737
  6   control     1691
  7   hormone     1926
  8   control     1880
  9   hormone     2054
 10   control     1763
Does this experiment satisfy the criteria for a single-factor CRD? What is the factor? What are the levels? What are the treatments? How were treatments assigned to the experimental units? Are the experimental units the same as the observational units? Was there some grouping of experimental units that we should be aware of (e.g. pairs of animals kept in pens)? Let
1. µH and µC represent the true mean weight of cattle receiving hormone or placebo (control) injections.
2. Y H and Y C , etc. represent the sample statistics.
1. Formulate the hypotheses: Our hypotheses are:
H: µC = µH or µC − µH = 0
A: µC ≠ µH or µC − µH ≠ 0
As in a previous example, we have formulated the alternate hypothesis in terms of a two-sided alternative - it is possible (but not part of this course) to express the alternate hypothesis as a one-sided alternative, i.e. interest may lie only in cases where the weight has increased (on average) after injection of the hormone.
2. Collect data and look at summary statistics. Notice that the data format is different from the previous examples. The raw data file has two columns, one corresponding to the Hormone group and the second corresponding to the Placebo group. We first notice that there are two missing values for the Placebo group. SAS uses a period (.) to represent missing values. Whenever data are missing, it is important to consider why the data are missing. If the data are Missing Completely at Random, then the missingness is completely unrelated to the treatment or the response and there is usually no problem in ‘ignoring’ the missing values. All that happens is that the precision of estimates is reduced and the power of your experiment is also reduced. If data are Missing at Random, then the missingness may be related to the treatment, but not the response, i.e. for some reason, only animals in one group are missing, but within the group, the missingness occurs at random. This is usually again not a problem.
If data are not missing at random, you may have a serious problem on your hands.⁹ In such cases, seek experienced help – it is a difficult problem. We start by using Proc SGplot to create side-by-side dot plots and box plots:

proc sgplot data=hormone;
   title2 'Plot of weight vs. trt';
   scatter x=trt y=weight;
   xaxis offsetmin=.05 offsetmax=.05;
run;

which gives
And then:

proc sgplot data=hormone;
   title2 'Box plots';
   vbox weight / group=trt notches; /* the notches option creates an overlap region to compare if the medians are equal */
run;

which gives

⁹ An interesting case of data not missing at random occurs if you look at the length of hospital stays after car accidents for people wearing or not wearing seat belts. It is quite evident that people who wear seat belts spend more time, on average, in hospital than people who do not wear seat belts.
Proc Tabulate is used to construct a table of means and standard deviations:

proc tabulate data=hormone;
   title2 'some basic summary statistics';
   class trt;
   var weight;
   table trt, weight*(n*f=5.0 mean*f=5.1 std*f=5.1 stderr*f=7.2 lclm*f=7.1 uclm*f=7.1);
run;

which gives:

          weight
trt         N   Mean     Std   StdErr   95_LCLM   95_UCLM
control    10   1805   160.8    50.86    1690.3    1920.3
hormone    12   1847    98.5    28.43    1784.7    1909.8
The sample standard deviations appear to be quite different. This does NOT cause a problem in the two-sample CRD case, as the unequal-variance t-test performs well in these circumstances. A formal statistical test for the equality of population standard deviations could be performed, but my recommendation is that unless the ratio of the sample standard deviations is more than 5:1, the equal-variance t-test also performs reasonably well. Formal tests for equality of standard deviations have very poor performance characteristics, i.e. poor power and poor robustness against failure of the underlying assumptions. The confidence intervals for the respective population means appear to have considerable overlap, so it would be surprising if a statistically significant difference were to be detected.
3. Find the test-statistic, the p-value and make a decision. Proc Ttest is used to perform the test of the hypothesis that the two means are the same:

ods graphics on;
proc ttest data=hormone plot=all dist=normal;
   title2 'test of equality of weights between the two trts';
   class trt;
   var weight;
   ods output ttests = TtestTest;
   ods output ConfLimits=TtestCL;
   ods output Statistics=TtestStat;
run;
ods graphics off;

The output is voluminous, and selected portions are reproduced below:
Variable   Method          Variances   t Value       DF   Pr > |t|
weight     Pooled          Equal         -0.75       20     0.4608
weight     Satterthwaite   Unequal       -0.72   14.355     0.4831
Variable   trt          Method          Variances       Mean   Lower Limit of Mean   Upper Limit of Mean
weight     Diff (1-2)   Pooled          Equal       -41.9500                -158.3               74.4078
weight     Diff (1-2)   Satterthwaite   Unequal     -41.9500                -166.6               82.7209

Variable   trt           N       Mean   Std Error   Lower Limit of Mean   Upper Limit of Mean
weight     control      10     1805.3     50.8575                1690.3                1920.3
weight     hormone      12     1847.3     28.4256                1784.7                1909.8
weight     Diff (1-2)    _   -41.9500     55.7813                -158.3               74.4078

and a final plot of:
The estimated difference in mean weight between the two groups is −41.95 (se 58) lbs. Here the estimate of the difference in the means is negative, indicating that the hormone group had a larger sample mean than the control group. But the confidence interval contains zero, so there is no evidence that the means are unequal. The confidence interval provides all the information we need to make a decision, but a formal hypothesis test can still be computed. Notice that the confidence intervals (see the plot) cover the value of 0, indicating no evidence of a difference. The test-statistic is −.72, to be compared to a t-distribution with 14.4 df, but in this age of computers, these values don’t have much use. The two-sided p-value is 0.48. The one-sided p-value could be
computed using the sided=L or sided=U option on the Proc Ttest statement, but is not of interest in this experiment. Because the p-value is large, there is no evidence (based on our small experiment) against our hypothesis of no difference.
In this experiment, the standard deviations are not as similar as in the previous examples. In this case, the equal-variance t-test gives slightly different answers, but the overall conclusion is identical. Either test could be used, but modern practice is to always use the unequal-variance t-test shown earlier. SAS also provided information for the equal-variance two-sample t-test in the above output.
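The same hand-check works here once the two missing control values are dropped, which is justified only under the Missing Completely at Random assumption discussed above. A Python sketch for illustration (the helper function is mine, not from the notes):

```python
import math
import statistics

# Raw weights transcribed from the table above; None marks the two
# missing control values.  Dropping them is justified only under the
# Missing Completely at Random assumption discussed in the text.
hormone = [1784, 1757, 1737, 1926, 2054, 1891,
           1794, 1745, 1831, 1802, 1876, 1970]
control = [2055, 2028, 1691, 1880, 1763, 1613,
           1796, 1562, 1869, 1796, None, None]
control = [w for w in control if w is not None]

def welch_t(x, y):
    """Welch (unequal-variance) two-sample t statistic and Satterthwaite df."""
    nx, ny = len(x), len(y)
    vx = statistics.variance(x) / nx
    vy = statistics.variance(y) / ny
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t, df

# Diff (1-2) = control - hormone, matching the ordering in the SAS output.
t, df = welch_t(control, hormone)
print(f"t = {t:.2f}, df = {df:.1f}")  # compare: SAS Satterthwaite row gives -0.72, 14.355
```

Note how the Satterthwaite df (about 14.4) is pulled well below the pooled value of 20 because the two sample standard deviations differ.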
5.8 Power and sample size
Would you do an experiment that had less than a 50% chance of succeeding? Yet many researchers embark upon an experimental plan using inadequate sample sizes to detect important, biologically meaningful results. The statistical analysis of any experimental data usually involves a test of some (null) hypothesis that is central to the investigation. For example, is there a difference in the mean final weight of cattle between a control group that receives a placebo and a treatment group that is injected with growth hormone? Because experimental data are invariably subject to random error, there is always some uncertainty about any decision about the null hypothesis made on the basis of a statistical test. There are two reasons for this uncertainty. First, there is the possibility that the data might, by chance, be so unusual that we believe we have evidence against the null hypothesis even though it is true. For example, there may be no effect of the growth hormone, but the evidence against the hypothesis occurs by chance. This is a Type I error and is controlled by the α level of the test. In other words, if a statistical test is performed and the hypothesis will only be doubted if the observed p-value is less than 0.05 (the α level), then the researcher is willing to accept a 5% chance that this statistically significant result is an error (a false positive result). The other source of uncertainty is often not even considered. It is the possibility that, given the available data, we may fail to find evidence against the null hypothesis (a false negative result). For example, the growth hormone may give a real increase in the mean weight, but the variability of the data is so large that we fail to detect this change. This is a Type II error and is controlled by the sample size. Related to the Type II error rate is the power of a statistical test.
The power of a statistical test is the probability that, when the null hypothesis is false, the test will find sufficient evidence against the null hypothesis. A powerful test is one that has a high success rate in detecting even small departures from the null hypothesis.

In general, the power of a test depends on the adopted level of significance, the inherent variability of the data, the degree to which the true state of nature departs from the null hypothesis, and the sample size. Computation of this probability for one or more combinations of these factors is referred to as a power analysis. Considerations of power are important at two stages of an experiment.
2012 Carl James Schwarz
310
December 21, 2012
CHAPTER 5. SINGLE FACTOR - COMPLETELY RANDOMIZED DESIGNS (A.K.A. ONE-WAY DESIGN)

First, at the design stage, it seems silly to waste time and effort on an experiment that doesn't have a fairly good chance of detecting a difference that is important to detect. Hence, a power analysis is performed to give the researcher some indication of the likely sample sizes needed to be relatively certain of detecting a difference that is important to the research hypothesis.

Second, after the analysis is finished, it often occurs that you failed to find sufficient evidence against the null hypothesis. Although a retrospective power analysis is fraught with numerous conceptual difficulties, it is often helpful to try and figure out why things weren't detected. For example, if a retrospective power analysis showed that the experiment had reasonably large power to detect small differences, and you failed to detect a difference, then one has some evidence that the actual effect must be fairly small. However, this is no substitute for a consideration of power before the experiment is started.
5.8.1 Basic ideas of power analysis
The power of a test is defined as the probability that you will find sufficient evidence against the null hypothesis when the null hypothesis is false and an effect exists. The power of a test will depend upon the following:

• α level. This is the largest value for the p-value of the test at which you will decide that the evidence is sufficiently strong to raise doubts about the null hypothesis. Usually, most experiments use α = 0.05, but this is not an absolute standard. The smaller the α level, the more difficult it is to declare that the evidence is sufficiently strong against the null hypothesis, and hence the lower the power.

• Effect size. The effect size is the actual size of the difference that is to be detected. This will depend upon economic and biological criteria. For example, in the growth hormone example, there is an extra cost associated with administering the hormone, and hence there is a minimum increase in the mean weight that will be economically important to detect. It is easier to detect a larger difference, and hence power increases with the size of the difference to be detected. THIS IS THE HARDEST DECISION IN CONDUCTING A POWER ANALYSIS. There is no easy way to decide what effect size is biologically important; it needs to be based on the consequences of failing to detect an effect, the variability in the data, etc. Many studies use a rough rule of thumb that a one standard deviation change in the mean is a biologically important difference, but this has no scientific basis.

• Natural variation (noise). All data have variation. If there is a large amount of natural variation in the response, then it will be more difficult to detect a shift in the mean, and power will decline as variability increases. When planning a study, some estimate of the natural variation may be obtained from pilot studies, literature searches, etc.
In retrospective power analysis, this is available from the statistical analysis in the Root Mean Square Error term of the output. The MSE term is the estimate of the VARIANCE within groups in the experiment, and the estimated standard deviation is simply the square root of the estimated variance.

• Sample size. It is easier to detect differences with larger sample sizes, and hence power increases with sample size.
5.8.2 Prospective sample size determination
Before a study is started, interest is almost always on the necessary sample size required to be reasonably certain of detecting an effect of biological or economic importance. There are five key elements required to determine the appropriate sample size:

• Experimental design. The estimation of power/sample size depends upon the experimental design in the sense that the computations for a single-factor completely randomized design are different than for a two-factor split-plot design. Fortunately, a good initial approximation to the proper sample size/power can often be found by using the computations designed for a single-factor completely randomized design.

• α level. The accepted "standard" is to use α = .05, but this can be changed in certain circumstances. Changing the α level would require special justification.

• Biologically important difference. This is hard! Any experiment has some goal to detect some meaningful change over the current state of ignorance. The size of the meaningful change is often hard to quantify but is necessary in order to determine the sample size required. It is not sufficient to simply state that any difference is important. For example, is a .000002% difference in the means a scientifically meaningful result? The biologically important difference can be expressed either as an absolute number (e.g. a difference of 0.2 cm in the means) or as a relative percentage (e.g. a 5% change in the mean). In the latter case, some indication of the absolute mean is required in order to convert the relative change to an absolute change (e.g. a 5% change in the mean when the mean is around 50 cm implies an absolute change of 5% × 50 cm = 2.5 cm).

• Variation in individual results. If two animals were exposed to exactly the same experimental treatment, how variable would their individual results be? Some measure of the standard deviation of results when repeated on replicate experimental units (e.g. individual animals) is required.
This can be obtained from past studies or from expert opinion on the likely variation to be expected. Note that the standard ERROR is NOT the correct measure of variability from previous experiments, as this does NOT measure individual variation.

• Desired power. While a 50% chance of success seems low, is 70% sufficient? Is 90%? This is a bit arbitrary, but a general consensus is that power should be at least 80% before attempting an experiment – even then, it implies that the researcher is willing to accept a 1/5 chance of not detecting a biologically important difference! The higher the power desired, the greater the sample size required. Two common choices for the desired power are an 80% power when testing at α = .05 or a 90% power when testing at α = .10. These are customary values and have been "chosen" so that a standardized assessment of power can proceed.

The biologically important difference and the standard deviation of individual animal results will require some documentation when preparing a research proposal.

There are a number of ways of determining the necessary sample sizes:
• Computational formulae such as presented in Zar (Biostatistical Analysis, Prentice Hall).

• Tables that can be found in some books or on the web at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/PDF/Tables.pdf and are attached at the end of this document. Two sets of tables are attached. The first table is appropriate for a single-factor completely randomized design with only two levels (such data would be analyzed using a two-sample t-test). The second table is appropriate for a single-factor completely randomized design with two or more levels (such data are often analyzed using a "one-way ANOVA").

• Computer programs such as in JMP, R, SAS, or those available on the web. For example, the Java applets by Russ Lenth at http://www.cs.uiowa.edu/~rlenth/Power/ provide nice interactive power computations for a wide variety of experimental designs. Lenth also has good advice on power computations in general on his web page.

Unfortunately, there is no standard way of expressing the quantities needed to determine sample size, and so care must be taken whenever a new table, program, or formula is used to be sure that it is being used correctly. All should give you the same results.

How to increase power

This is just a brief note to remind you that power can be increased not only by increasing the sample size, but also by decreasing the unexplained variation (the value of σ) in the data. This can often be done by a redesign of an experiment, e.g. by blocking, or by more careful experimentation.
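Why reducing σ is so effective can be seen from the normal-approximation sample-size formula for comparing two means (a sketch of the relationship only; the exact computations behind the tables use the noncentral t distribution):

```latex
n_{\text{per group}} \;\approx\; \frac{2\,\sigma^{2}\,\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}}{\Delta^{2}}
```

Here Δ is the biologically important difference, σ is the within-group standard deviation, and 1 − β is the desired power. Because n scales with (σ/Δ)², halving σ (say, by blocking out a major source of variation) cuts the required sample size by a factor of four – just as effective as doubling the detectable difference.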
5.8.3 Example of power analysis/sample size determination
When planning a single-factor CRD experiment with two levels, you will need to decide upon the α level (usually 0.05), the approximate size of the difference of the means to be detected (µ1 − µ2) (either from expert opinion or past studies), and some guess at the standard deviation (σ) of units in the population (from expert opinion or past studies). A very rough guess for a standard deviation can be formed by thinking of the range of values to be seen in the population and dividing by 4. This rule of thumb occurs because many populations have an approximate normal distribution of the variable of interest, and in a normal distribution, about 95% of observations are within ± 2 standard deviations of the mean. Consequently, the approximate range of observations is about 4 standard deviations.

Suppose that a study is to be conducted to investigate the effects of injecting growth hormone into cattle. A set of cattle will be randomized either to the control group or to the treatment group. At the end, the increase in weight will be recorded. Because of the additional costs of the growth hormone, the experimental results are only meaningful if the increase is at least 50 units. The standard deviation of the individual cattle changes in weight is around 100 units (i.e. two identical cattle given the same treatment could have weight gains that are quite variable).
Using tables

The first table is indexed on the left margin by the ratio of the biological effect to the standard deviation, i.e.

δ = |µ1 − µ2| / σ = 50/100 = 0.5
Reading across the table at δ = 0.5 in the middle set of columns corresponding to α = .05 and the specific column labeled 80% power, the required sample size is 64 in EACH treatment group. Note the effect of decreasing values of δ, i.e. as the biologically important difference becomes smaller, larger sample sizes are required. The table can be extended to cases where the required sample size is greater than 100, but these are often impractical to run – expert help should be sought to perhaps redesign the experiment.
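The table lookup can be reproduced in code. The sketch below (Python, stdlib only; not part of the original notes) uses a normal approximation rather than the exact noncentral-t computation behind the tables, so it slightly understates the answer: it returns 63 per group here, where the exact table value is 64.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided, two-sample
    comparison of means, where delta = |mu1 - mu2| / sigma."""
    z = NormalDist().inv_cdf   # standard normal quantile function
    return ceil(2 * ((z(1 - alpha / 2) + z(power)) / delta) ** 2)

# Hormone example: difference of 50 units, standard deviation 100.
print(n_per_group(50 / 100))   # 63 per group (exact t-based tables give 64)
```

Note how quickly the requirement grows as δ shrinks: halving δ roughly quadruples the per-group sample size, which is the pattern seen when reading up the table.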
Using a package to determine power

The standard deviation chosen is between the two individual standard deviations that we saw in the previous example; the difference to detect was specified as 50 lbs. The choice of α level (0.05) and the target power (0.80 = 80%) are "traditional" choices made to balance the chance of a Type I error (the α level) and the ability to detect biologically important differences (the power). Another popular choice is to use α = .10 and aim for a target power of .90. These choices are used to reduce the amount of arguing among the various participants in a study.

The sample size required to detect a 50 lb difference in the means when the standard deviation is 100 is found as follows. SAS has several methods to determine power. Proc Power computes power for relatively simple designs with a single random error (such as the ones in this chapter). SAS also has a stand-alone program for power analysis. Refer to http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms for links to many examples of power analysis in SAS. For the hormone example in the previous section, we use Proc Power:
proc power;
   title 'Power analysis for hormone example';
   twosamplemeans
      test=diff    /* indicates that you wish to test for differences in the mean */
      meandiff=50  /* size of difference to be detected */
      stddev=100   /* the standard deviation within each group */
      power=.80    /* target power of 80% */
      alpha=.05    /* alpha level for the test */
      sides=2      /* a two-sided test for a difference in the mean should be done */
      ntotal=.     /* solve for the total sample size assuming equal sample sizes in both groups */
   ;               /* end of the twosamplemeans statement - DON'T FORGET THIS */
   ods output Output=Power10;
run;
This gives the output:

Alpha   Mean Diff   Std Dev   Sides   Null Diff   Nominal Power   Actual Power   N Total
0.05    50          100       2       0           0.8             0.801          128
So almost 130 animals (i.e. 65 in each group) would be needed! Depending on the power program used, the results may give the sample size for EACH group, or the TOTAL sample over both groups. So a reported sample size of 128 in TOTAL or 64 PER GROUP are equivalent.

Most power packages assume that you want equal sample sizes in both groups. You can show mathematically that this maximizes the power to detect effects. There are many resources available on the web and for purchase that allow you the flexibility of having unequal sample sizes. For example, the power and sample size pages available from Russ Lenth at http://www.stat.uiowa.edu/~rlenth/Power/index.html are very flexible in specifying the sample sizes in each group.

It is often of interest to plot the power as a function of the sample size or effect size, or in general to plot how two of the four variables in a power analysis trade off. We can generate tables and plots to show how power varies over different sample sizes. For example, here I use Proc Power to investigate power for a range of differences in the population means:
ods graphics on;
proc power;
   title 'Power analysis for hormone example with various sized differences';
   /* We vary the size of the difference to see what sample size is needed */
   twosamplemeans
      test=diff                 /* indicates that you wish to test for differences in the mean */
      meandiff=30 to 150 by 10  /* sizes of difference to be detected */
      stddev=100                /* the standard deviation within each group */
      power=.80                 /* target power of 80% */
      alpha=.05                 /* alpha level for the test */
      sides=2                   /* a two-sided test for a difference in the mean should be done */
      ntotal=.                  /* solve for the total sample size assuming equal sample sizes in both groups */
   ;                            /* end of the twosamplemeans statement - DON'T FORGET THIS */
   plot x=effect xopts=(ref=50 crossref=yes);  /* plot the sample size vs effect size */
   ods output output=power20;
run;
ods graphics off;
This gives the output:

Obs   Alpha   MeanDiff   StdDev   Sides   NullDiff   NominalPower   Power   NTotal
1     0.05    30         100      2       0          0.8            0.801   352
2     0.05    40         100      2       0          0.8            0.804   200
3     0.05    50         100      2       0          0.8            0.801   128
4     0.05    60         100      2       0          0.8            0.804   90
5     0.05    70         100      2       0          0.8            0.812   68
6     0.05    80         100      2       0          0.8            0.807   52
7     0.05    90         100      2       0          0.8            0.812   42
8     0.05    100        100      2       0          0.8            0.807   34
9     0.05    110        100      2       0          0.8            0.828   30
10    0.05    120        100      2       0          0.8            0.802   24
11    0.05    130        100      2       0          0.8            0.826   22
12    0.05    140        100      2       0          0.8            0.841   20
13    0.05    150        100      2       0          0.8            0.848   18
which can be plotted as the required total sample size against the size of the difference to be detected (with reference lines at the 50-unit difference of interest).
Now we use Proc Power to investigate power for a range of different sample sizes:
ods graphics on;
proc power;
   title 'Power analysis for hormone example with various sample sizes';
   /* We vary the total sample size to see what power is obtained */
   twosamplemeans
      test=diff               /* indicates that you wish to test for differences in the mean */
      meandiff=50             /* size of difference to be detected */
      stddev=100              /* the standard deviation within each group */
      power=.                 /* solve for power */
      alpha=.05               /* alpha level for the test */
      sides=2                 /* a two-sided test for a difference in the mean should be done */
      ntotal=50 to 200 by 10  /* total sample sizes assuming equal sample sizes in both groups */
   ;                          /* end of the twosamplemeans statement - DON'T FORGET THIS */
   plot x=n yopts=(ref=.80 crossref=yes);  /* plot the power as a function of sample size */
   ods output output=power30;
run;
ods graphics off;
This gives the output:

Obs   Alpha   MeanDiff   StdDev   Sides   NTotal   NullDiff   Power
1     0.05    50         100      2       50       0          0.410
2     0.05    50         100      2       60       0          0.478
3     0.05    50         100      2       70       0          0.541
4     0.05    50         100      2       80       0          0.598
5     0.05    50         100      2       90       0          0.650
6     0.05    50         100      2       100      0          0.697
7     0.05    50         100      2       110      0          0.738
8     0.05    50         100      2       120      0          0.775
9     0.05    50         100      2       130      0          0.808
10    0.05    50         100      2       140      0          0.836
11    0.05    50         100      2       150      0          0.860
12    0.05    50         100      2       160      0          0.882
13    0.05    50         100      2       170      0          0.900
14    0.05    50         100      2       180      0          0.916
15    0.05    50         100      2       190      0          0.929
16    0.05    50         100      2       200      0          0.940
which can be plotted as power against the total sample size (with a reference line at the target power of 0.80).
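The same power-versus-sample-size curve can be sketched with a normal approximation to the noncentral-t computation that Proc Power performs (a Python illustration, not part of the original notes); the values agree with the table above to within about 0.02.

```python
from statistics import NormalDist

def power_two_sample(n_total, meandiff=50, sd=100, alpha=0.05):
    """Approximate power of a two-sided, two-sample test of means with
    equal group sizes (normal approximation to the noncentral t)."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    # Standardized difference times sqrt(n_per_group / 2):
    shift = (meandiff / sd) * (n_total / 4) ** 0.5
    return nd.cdf(shift - z_crit)

for n_total in range(50, 201, 30):
    print(n_total, round(power_two_sample(n_total), 3))
```

The curve flattens as power approaches 1, which is why pushing the design target from 80% to 95% power costs far more animals than moving from 50% to 80%.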
5.8.4 Further Readings on Power Analysis
The following papers have a good discussion of the role of power analysis in wildlife research.

• Steidl, R. J., Hayes, J. P., and Shauber, E. (1997). Statistical power analysis in wildlife research. Journal of Wildlife Management 61, 270-279. Available at: http://dx.doi.org/10.2307/3802582. Questions to consider:

1. What are the four interrelated components of statistical hypothesis testing?
2. What is the difference between biological and statistical significance?
3. What are the advantages of a paired (blocked) design over that of a completely randomized design? What implications does this have for power analysis?
4. What is the most serious problem with retrospective power analyses?
5. Under what conditions could a retrospective power analysis be useful?
6. What are the advantages of confidence intervals?
7. What are the consequences of Type I and Type II errors?

• Nemec, A.F.L. (1991). Power Analysis Handbook for the Design and Analysis of Forestry Trials. Biometrics Information Handout 02. Available at: http://www.for.gov.bc.ca/hfd/pubs/Docs/Bio/Bio02.htm.

• Peterman, R. M. (1990). Statistical power analysis can improve fisheries research and management. Canadian Journal of Fisheries and Aquatic Sciences 47, 1-15. Available at: http://dx.doi.org/10.1139/f90-001. The Peterman paper is a bit technical, but has good coverage of the following issues:

1. Why are Type II errors often more of a concern in fisheries management?
2. What four variables affect the power of a test? Be able to explain their intuitive consequences.
3. What is the difference between an a-priori and an a-posteriori power analysis?
4. What are the implications of ignoring power in impact studies?
5. What are some of the costs of Type II errors in fisheries management?
6. What are the implications of reversing the "burden of proof"?
5.8.5 Retrospective Power Analysis
This is, unfortunately, often conducted as a post mortem – the experiment failed to detect anything and you are trying to salvage anything possible from it. There are serious limitations to a retrospective power analysis!

A discussion of some of these issues is presented by Gerard, P., Smith, D.R., and Weerakkody, G. (1998). Limits of retrospective power analysis. Journal of Wildlife Management 62, 801-807. Available at: http://dx.doi.org/10.2307/3802357. The paper is a bit technical and repeats the advice in Steidl, Hayes, and Shauber (1997) discussed in the previous section. Their main conclusions are:

• Estimates of retrospective power are usually biased (e.g. if you fail to find sufficient evidence against the null hypothesis, the calculated retrospective power using the usual power formulae can never
exceed 50%) and are usually very imprecise. This is not to say that the actual power must always be less than 50% – rather that the usual prospective power/sample size formulae are not appropriate for estimating the retrospective power and give incorrect estimates. Some packages have tried to implement the "corrected" formulae for retrospective power, but you have to be sure to select the proper options.

• The proper role of power analysis is in research planning. It is sensible to use the results of a current study (e.g. estimates of variability and standard deviations) for help in planning future studies, but be aware that typically estimates of variation are very imprecise. Use a range of standard deviation estimates when examining prospective power.

• A confidence interval on the final effect size will tell you much more than a retrospective power analysis. It indicates where the estimate is relative to biologically significant effects, and its width gives an indication of the precision of the estimate.
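The confidence-interval recommendation can be made concrete with a small sketch (Python; the numbers are hypothetical: an estimated weight gain of 20 units with a standard error of 18, set against the 50-unit biologically important difference from the hormone example; a normal-approximation interval is used).

```python
from statistics import NormalDist

def effect_ci(estimate, se, level=0.95):
    """Normal-approximation confidence interval for an estimated effect."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return (estimate - z * se, estimate + z * se)

lo, hi = effect_ci(20, 18)   # hypothetical non-significant result
print(round(lo, 1), round(hi, 1))
# The interval covers 0, so there is no evidence of an effect; but it also
# extends well past the biologically important difference of 50 units, so
# an important effect has not been ruled out -- the study is simply too
# imprecise, which is exactly the information a retrospective power number
# tends to obscure.
```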
5.8.6 Summary
As can be seen from the past examples, the actual determination of the sample size required to detect biologically important differences can be relatively painless. However, the hard part of the process lies in determining the size of a biologically important difference. This requires a good knowledge of the system being studied and of past work. A statement that "any difference is important" really is not that meaningful, because a simple retort of "Is a difference of .0000000001% biologically and/or scientifically meaningful?" exposes the fallacy of believing that any effect size is relevant.

Similarly, determining the variation in response among experimental units exposed to the same experimental treatment is also difficult. Often past studies can provide useful information. In some cases, expert opinion can be sought, and questions such as "what are some typical values that you would expect to see over replicated experimental units exposed to the same treatment?" will provide enough information to get started.

It should be kept in mind that because the biologically meaningful difference and the variation over replicated experimental units are NOT known with absolute certainty, sample sizes are only approximations. Don't get hung up on whether the proper sample size is 30 or 40 or 35. Rather, the point of the exercise is to know whether the sample size required is 30, 300, or 3000! If the required sample size is in the 3000 area and there are only sufficient resources to use a sample size of 30, why bother doing the experiment – it has a high probability of failure.

The above are simple examples of determining sample size in simple experiments that look at changes in means. Often the computations will be sufficient for planning purposes. However, in more complex designs, the sample size computations are more difficult and expert advice should be sought. Similarly, sample size/power computations can be done for other types of parameters, e.g.
proportions live/dead, LD50s, survival rates from capture-recapture studies, etc. Expert help should be sought in these cases.
5.9 ANOVA approach - Introduction
ANOVA is a generalization of the "two-sample t-test assuming equal population standard deviations" to the case of two or more populations. [It turns out, as you saw in the earlier example, that an ANOVA on two groups under a CRD provides the same information (p-values and confidence intervals) as the 'two-sample t-test assuming equal population standard deviations'.] The formal name for this procedure is 'Single factor - completely randomized design - Analysis of Variance'. While the name ANOVA conjures up analyzing variances, the technique is a test of the equality of population means through a comparison of sources of variation.

The ANOVA method is one of the most powerful and general techniques for the analysis of data. It can be used in a variety of experimental situations. We are only going to look at it applied to a few experimental designs. It is extremely important that you understand the experimental design before applying the appropriate ANOVA technique. The most common problem that we see as statistical consultants is the inappropriate use of a particular analysis method for the experiment at hand because of a failure to recognize the experimental setup.

The Single-Factor Completely Randomized Design (CRD) ANOVA is also often called the 'one-way ANOVA'. This is the generalization of the 'two independent samples' experiment that we saw previously. Data can be collected in one of two ways:

1. Independent surveys are taken from two or more populations. Each survey must follow the RRR outlined earlier and should be a simple random sample from the corresponding population. For example, a survey could be conducted to compare the average household incomes among the provinces of Canada. A separate survey would be conducted in each province to select households.

2. A set of experimental units is randomized to one of the treatments in the experiment. Each experimental unit receives one and only one experimental treatment.
For example, an experiment could be conducted to compare the mean yields of several varieties of wheat. The field plots (experimental units) are randomly assigned to one of the varieties of wheat.

Here are some examples of experiments or surveys which should NOT be analyzed using the Single-Factor-CRD-ANOVA methods:

• Animals are given a drug and measured at several time points in the future. In this experiment, each animal is measured more than once, which violates the assumptions of a simple CRD. This experiment should be analyzed using a Repeated-Measures-ANOVA.

• Large plots of land are prepared using different fertilizers. Each large plot is divided into smaller plots which receive different varieties of wheat. In this experiment, there are two sizes of experimental units – large plots receiving fertilizers and smaller plots receiving varieties. This violates the assumption of a CRD that there is only one size of experimental unit. This experiment should be analyzed using a Split-Plot-ANOVA (which is discussed in a later chapter).
• Honey bee colonies are arranged on pallets, three per pallet. Interest lies in comparing methods of killing bee mites. Three methods are used, and each pallet receives all three methods. In this experiment, there was not complete randomization, because each pallet has to receive all three treatments, which violates one of the assumptions of a CRD. This experiment should be analyzed using a Randomized-Block-ANOVA (which is discussed in a later chapter).

• Three different types of honey bees (hygienic, non-hygienic, or a cross) are to be compared for sweetness of the honey. Five hives of each type are sampled and two samples are taken from each hive. In this experiment, two sub-samples are taken from each hive. This violates the assumption of a CRD that a single observation is taken from each experimental unit. This experiment should be analyzed using a Sub-sampling ANOVA (which is discussed in a later chapter).
The key point is that there are many thousands of experimental designs. Every design can be analyzed using a particular ANOVA model designed for that experimental design. One of the jobs of a statistician is to be able to recognize these various experimental designs and to help clients analyze the experiments using appropriate methods.
5.9.1 An intuitive explanation for the ANOVA method
Consider the following two experiments to examine the yields of three different varieties of wheat. In both experiments, nine plots of land were randomized to three different varieties (three plots for each variety) and the yield was measured at the end of the season. The two experiments are being used just to illustrate how ANOVA works and to compare two possible outcomes where, in one case, you find evidence of a difference in the population means and, in the other case, you fail to find evidence of a difference in the population means. In an actual experiment, you would only do a single experiment. These are not real data – they were designed to show you how the method works! Here are the raw data:

            Experiment I        Experiment II
            Method              Method
            A    B    C         A    B    C
            -------------       -------------
            65   84   75        80   100  60
            66   85   76        65   85   75
            64   86   74        50   70   90
            -------------       -------------
  Average   65   85   75        65   85   75
The data are available in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. Which experiment has 'better' evidence of a difference in the true mean yield among the varieties? Let's look at dot plots for both experiments:
It seems that in Experiment I, it is easy to tell differences among the means of the three levels (A, B, or C) of the factor (variety) because the results are so consistent. In Experiment II, it is not so easy to tell the difference among the means of the three levels because the results are less consistent.

In fact, what people generally look at is the variability within each group as compared to the variability among the group means to ascertain if there is evidence of a difference in the group population means. In Experiment I, the variability among the group means is much larger than the variability of individual observations within each single group. In Experiment II, the variability among the group means is not very different than the variability of individual observations within each single group.

This is the basic idea behind the Analysis of Variance (often abbreviated as ANOVA). The technique examines the data for evidence of differences in the corresponding population means by looking at the ratio of the variation among the group sample means to the variation of individual data points within the groups. If this ratio is large, there is evidence against the hypothesis of equal group population means.
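This among-to-within comparison can be computed from first principles for the two toy experiments above. The sketch below (Python, stdlib only; not part of the original notes, and it computes the F ratio but not the p-value, which requires the F distribution) shows the contrast: the same among-group signal is overwhelming in Experiment I and unremarkable in Experiment II.

```python
from statistics import mean

def f_ratio(groups):
    """One-way ANOVA F statistic: among-group MS over within-group MS."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand = mean(x for g in groups for x in g)
    ss_model = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_error = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_model = ss_model / (k - 1)       # among-group mean square
    ms_error = ss_error / (n_total - k) # within-group mean square
    return ms_model / ms_error

exp1 = [[65, 66, 64], [84, 85, 86], [75, 76, 74]]
exp2 = [[80, 65, 50], [100, 85, 70], [60, 75, 90]]
print(f_ratio(exp1))  # 300.0 -- signal dwarfs the noise
print(f_ratio(exp2))  # about 1.33 -- signal comparable to the noise
```

Both experiments have identical group means (so the among-group variation is the same); only the within-group noise differs, and that alone flips the conclusion.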
This ratio (called the F-ratio) can be thought of as a signal-to-noise ratio. Large ratios imply the signal (difference among the means) is large relative to the noise (variation within groups), and so there is evidence of a difference in the means. Small ratios imply the signal is small relative to the noise, and so there is no evidence that the means differ.

Let's look at those two experiments in more detail and apply an analysis.

1. Formulate the hypothesis: The null and alternate hypotheses are:

H: µ1 = µ2 = µ3, or all means are equal
A: not all the means are equal, or at least one mean is different from the rest

This is a generalization of the two-sample t-test hypothesis to the case of two or more groups. Note that the null hypothesis is that all of the population means are equal, while the alternate hypothesis is very vague – at least one of the means is different from the others, but we don't know which one. The following specifications for the alternate hypothesis are NOT VALID:

• A: µ1 ≠ µ2 ≠ µ3. This implies that every mean is unequal to every other mean. It may turn out that the first two means are equal but the third unequal to the first two.
• A: every mean is different. Same as above.

The concept of a one-sided or two-sided hypothesis does not exist when there are three or more groups, unlike when there are only two groups.

2. Collect some data and compute summary statistics. Here are the dot plots for both experiments and summary statistics:
This confirms our earlier impression that the variation among the sample means in Experiment I is much larger than the variation (standard deviation) within each group, but in Experiment II, the variation among the sample means is about the same magnitude as the variation within each group.
3. Find a test statistic and p-value. The computations in ANOVA are at best ‘tedious’ and at worst impossible to do by hand. Virtually no-one computes them by hand anymore, nor should they. As well, don't be tempted to program a spreadsheet to do the computations yourself - this is a waste of time, and many of the numerical methods in spreadsheets are not stable and will give wrong results! There are many statistical packages available at a very reasonable cost (e.g. JMP, R, or SAS) that can do all of the tedious computations. What statistical packages cannot do is apply the correct model to your data! It is critical that you understand the experimental design before doing any analysis!
The idea behind the ANOVA is to partition the total variation in the data (why aren't all of the numbers from your experiment identical?) into various sources. In this case, we will partition the total variation into variation due to different treatments (the varieties) and variation within each group (often called ‘error’ for historical reasons).
These are arranged in a standard fashion called the ANOVA table. Here are the two ANOVA tables from the two experiments.
The actual computation of the quantities in the above tables is not important - let the computers do the arithmetic. In fact, for more complex experiments, many of the concepts such as sums of squares are an old-fashioned way to analyze the experiment, and better methods (e.g. REML) are used!
The first three columns (entitled Source, DF, and Sum of Squares) are a partitioning of the total variation (the C Total row) into two components - the variation due to treatments (varieties), entitled ‘Model’, and the within-group variation, entitled ‘Error’.
The DF (degrees of freedom) column measures the ‘amount’ of information available. There are a total of 9 observations, and the df for total is always the total number of observations − 1. The df for ‘Model’ is the number of treatments − 1 (in this case 3 − 1 = 2). The df for ‘Error’ are obtained by subtraction (in this case 8 − 2 = 6). The df can be fractional in some complex experiments.
The Sum of Squares column (the SS) measures the variation present in the data. The total SS is partitioned into two sources. In both experiments, the variation among sample means (among the means for each variety) was the same, and so the SSModel for both experiments is identical. The SSError measures the variation of individual values within each group. Notice that the variation of individual values within groups for Experiment I is much smaller than the variation of individual values within groups for Experiment II.
The Mean Square column is an intermediate step in finding the test statistic. Each mean square is the ratio of the corresponding sum of squares to its df. For example:
MSModel = SSModel / dfModel
MSError = SSError / dfError
MSTotal = SSTotal / dfTotal
Finally, the test statistic, denoted the F-statistic (named after the famous statistician R. A. Fisher), is computed as:
F = MSModel / MSError
This is the signal-to-noise ratio which is used to examine if the data are consistent with the hypothesis of equal means. In Experiment I, the F-statistic is 300. This implies that the variation among sample means is much larger than the variation within groups. In Experiment II, the F-statistic is only 1.333. The variation among sample means is of the same order of magnitude as the variation within groups.
Unfortunately, there is no simple rule of thumb to decide if the F-ratio is sufficiently large to provide evidence against the hypothesis. The F-ratio is compared to an F-distribution (which we won't examine in this course) to find the p-value. The p-value for Experiment I is < .0001 while that for Experiment II is 0.332.
4. Make a decision. The p-value is interpreted in exactly the same way as in previous chapters, i.e. small p-values are strong evidence that the data are NOT consistent with the hypothesis, and so you have evidence against the hypothesis. Once the p-value is determined, the decision is made as before. In Experiment I, the p-value is very small. Hence, we conclude that there is evidence against all population means being equal. In Experiment II, the p-value is large. We conclude there is no evidence that the population means differ.
If the hypothesis is rejected, you still don't know where the differences in the population means could have occurred. All that you know (at this point) is that not all of the means are equal. You will need to use multiple comparison procedures (see below) to examine which population means appear to differ from which other population means.
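As an illustration of how these quantities fit together, here is a short Python sketch of the F-ratio computation. The data are hypothetical (the actual Experiment I values appear only in the dot plots), but they are chosen so the arithmetic gives F = 300, matching the Experiment I result quoted above.

```python
# Sketch of the one-way ANOVA F-ratio computation.
# NOTE: hypothetical data - chosen so that F = 300, as in Experiment I.
groups = {
    "a": [10, 11, 12],
    "b": [20, 21, 22],
    "c": [30, 31, 32],
}

n_total = sum(len(v) for v in groups.values())             # 9 observations
grand_mean = sum(sum(v) for v in groups.values()) / n_total

# Partition the total variation: 'Model' = among groups, 'Error' = within groups.
ss_model = sum(len(v) * (sum(v) / len(v) - grand_mean) ** 2
               for v in groups.values())
ss_error = sum((x - sum(v) / len(v)) ** 2
               for v in groups.values() for x in v)

df_model = len(groups) - 1        # 3 - 1 = 2
df_error = n_total - len(groups)  # 9 - 3 = 6

ms_model = ss_model / df_model    # the signal
ms_error = ss_error / df_error    # the noise
F = ms_model / ms_error

print(F)  # 300.0
```

Comparing F = 300 to an F-distribution with (2, 6) degrees of freedom gives the tiny p-value (< .0001) quoted for Experiment I.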
5.9.2
A modeling approach to ANOVA
[The following description has been fabricated solely for the entertainment and education of the reader. Any resemblance between the characters described and real individuals is purely coincidental.]
Just before Thanksgiving, Professor R. decided to run a potato-peeling experiment in his class. The nominal purpose was to compare the average speeds with which students could peel potatoes with a specialized potato peeler vs. a paring knife, and with the peeler held in the dominant hand vs. the potato held in the dominant hand. In the jargon of experimental design, these different methods are called “treatments”. Here, there were three treatment groups: those using the peeler in the dominant hand (PEELERS), the knife in the dominant hand (KNIFERS), and the potato in the dominant hand (ODDBALLS).
Twelve “volunteers” were selected to peel potatoes. These 12 were randomly divided into three groups of 4 individuals. Groups were labeled as above. The experimental subjects were then each given a potato in turn and asked to peel it as fast as they could. The only restrictions were that individuals in Group PEELERS were to use the potato peeler in their dominant hand, etc. Why was randomization used?
Times taken to peel each potato were recorded as follows:
                Replicate
Group        1    2    3    4
PEELERS     44   69   37   38
KNIFERS     42   49   32   37
ODDBALLS    50   58   78  102
These data were analyzed with the SAS system using the program potato.sas with output potato.pdf, available in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.
Do these times demonstrate that the average time taken to peel a potato depends on the tool used and the hand in which it is held? Obviously, we ought to begin by computing the average for each group:

                Replicate
Group        1    2    3    4   Mean
PEELERS     44   69   37   38     47
KNIFERS     42   49   32   37     40
ODDBALLS    50   58   78  102     72
The mean for the group holding the potato in the dominant hand is over one and one half times as great as the mean for each of the other two. This strategy appears to take the longest; the other two strategies seem more comparable, but with the knife having a slight advantage over the peeler.
In an intuitive sense, some of the variation among the 12 times needed to peel the potatoes comes from the treatments applied, i.e. the three methods of peeling. This experimental design leaves open the possibility that the observed differences are attributable to chance fluctuations - chance fluctuations generated by the random assignment of individuals to groups, and of potatoes to individuals, etc. This possibility can be assessed by a statistical test of significance. The null hypothesis is that there are no systematic differences in the population mean time taken to peel potatoes with the three different methods. The observed differences are then just chance differences.
To perform the statistical test, you must consider what any good scientist would consider. You must imagine repeating the experiment to see if the results are reproducible. You would not, of course, expect to obtain identical results. However, if the differences were real, you would expect to see equally convincing results quite often. If not, then you would expect to see such substantial differences as these only rarely. The p-value that we are about to calculate will tell us how frequently we ought to expect to see such apparently strong evidence of differences between the group means when they are solely chance differences.
To proceed, we shall need a model for the variation in the time to peel a potato seen in the data. This is the key aspect of any statistical analysis - formulating a mathematical model that we hope is a reasonable approximation to reality. Then we apply the rules of probability and statistics to determine if our model is
a reasonable fit to the data and then, based upon our fitted model, to examine hypotheses about the model parameters which match a real-life hypothesis of interest.
The models are developed by examining the treatment, experimental unit, and restricted randomization structures in the experiment. In this course, it is always assumed that complete randomization is done as much as possible, and so the effects of restricted randomization are assumed not to exist. The treatment structure consists of the factors in the experiment and any interactions among them if there is more than one factor (to be covered in later chapters). The experimental unit structure consists of variation among identical experimental units within the same treatment group. If there was no difference in the mean time to peel a potato, then there would be NO treatment effect. It is impossible to get rid of experimental unit effects, as this would imply that different experimental units would behave identically in all respects. The standard deviations for the groups are assumed to be identical. [It is possible to relax this assumption, but this is beyond the scope of this course.]
A highly simplified syntax is often used to specify models for experimental designs. To the left of the equals sign,10 the response variable is specified. To the right of the equals sign, the treatment and experimental units are specified. For this example, the model is
Time = Method PEOPLE(R).
This expression is NOT a mathematical equality - it has NO meaning as a mathematical expression. Rather, it is interpreted as: the variation in the response variable Time is affected by the treatments (Method) and by the experimental units (People). The effect of experimental units is random and cannot be predicted in advance (the (R) term).
In general, there will be a random component for each type of experimental unit in the study - in this case there is only one type of experimental unit - a potato. It turns out that most statistical packages and textbooks “drop” the experimental unit terms UNLESS THERE IS MORE THAN ONE SIZE OF EXPERIMENTAL UNIT (such as in split-plot designs to be covered later in this course). Hence, the model is often expressed as:
Time = Method
with the effect of the experimental unit “dropped” for convenience.
You may have noticed that these models are similar in form to the regression models you have seen in a previous course. This is not an accident - regression and analysis of variance are all part of a general method called ‘Linear Models’. If you look at the SAS program potato.sas11 you will see that the MODEL statement in PROC GLM follows closely the syntax above. If you look at the R program potato.r12 you will see the formula in the aov() function closely follows the syntax above.
10 This syntax is not completely uniform among packages. For example, in R the equals sign is replaced by the ∼; in JMP (Analyze > Fit Model platform), the response variable is entered in a different area from the effects.
11 Available from the Sample Program Library http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms
12 Available from the Sample Program Library http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms
The ANOVA methods must partition the total variation in the data into its constituent parts. First, consider the average response for each treatment. In statistical jargon, these are called the treatment means, corresponding to the three different “treatments” applied to the three groups. These can be estimated by the corresponding group sample means. [This will not always be true - so please don't memorize this rule.] Similarly, the overall mean can be estimated by the overall mean of the observed results. Here, this is 53. [This will not always be true - so please don't memorize this rule - it only works here because the design is balanced, i.e. has an equal number of observations in each treatment group.]
The treatment effects are estimated by the difference between the sample mean for each group and the overall grand mean. The estimates of the experimental unit effects are found from the deviations of the individual observations from their corresponding group sample means. For example, the effect of the first potato in the first treatment group is found as 44 − 47 = −3, i.e. this experimental unit was peeled faster than the average potato in this group. You may recognize that these terms look very similar to the residuals from a regression setting. This is no accident. These residuals are important for assessing model fit and adequacy, as will be explored later.
At this point, tedious arithmetic takes place as outlined earlier and will not be covered in this class. The end product is the ANOVA table, where each term in the model is represented by a line in the table. The total variation (the left of the equals sign) appears at the bottom of the table. The variation attributable to the treatment structure appears in a separate line in the table.
The variation attributable to experimental unit variation is represented by the “Error” line in the table.13 The column representing degrees of freedom is a measure of the information available, the column labeled sums of squares represents a measure of the variation attributable to each effect, and the columns labeled Mean Square and F-statistic are the intermediate computations to arrive at the final p-value.
Most computer packages will compute the various sums of squares automatically and correctly, and so we won't spend too much time on these. Similarly, most computer packages will automatically compute p-values for the test statistic, and we again won't spend much time on this. It is far more important for you to get a feeling for the rationale behind the method than to worry about the details of the computations.
The p-value is found to be 0.0525. This implies that if the null hypothesis were true, there is only about a 5% chance of observing this set of data (or a more extreme set) by chance. Hence, the differences between the treatment means cannot reasonably be attributed to chance alone.
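The analysis in the notes is done in SAS (potato.sas), but the partitioning just described can be checked by hand from the times in the table. Here is an illustrative Python sketch (not part of the original analysis):

```python
# Cross-check of the potato-peeling ANOVA using the times from the text.
times = {
    "PEELERS":  [44, 69, 37, 38],
    "KNIFERS":  [42, 49, 32, 37],
    "ODDBALLS": [50, 58, 78, 102],
}

n = sum(len(v) for v in times.values())                  # 12 subjects
grand_mean = sum(sum(v) for v in times.values()) / n     # 53.0

# Treatment effects: group mean minus grand mean
# (e.g. PEELERS: 47 - 53 = -6; ODDBALLS: 72 - 53 = 19).
effects = {g: sum(v) / len(v) - grand_mean for g, v in times.items()}

# Residual for the first PEELERS potato: 44 - 47 = -3.
resid_first = times["PEELERS"][0] - sum(times["PEELERS"]) / 4

ss_model = sum(len(v) * (sum(v) / len(v) - grand_mean) ** 2
               for v in times.values())                  # 2264.0
ss_error = sum((x - sum(v) / len(v)) ** 2
               for v in times.values() for x in v)       # 2448.0

df_model, df_error = len(times) - 1, n - len(times)      # 2 and 9
F = (ss_model / df_model) / (ss_error / df_error)
print(round(F, 3))  # 4.162
```

Comparing this F-ratio to an F-distribution with (2, 9) degrees of freedom gives the p-value of 0.0525 reported in the text.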
5.10
Example - Comparing phosphorus content - single-factor CRD ANOVA
A horticulturist is examining differences in the phosphorus content of tree leaves from three varieties.
13 The term “Error” to represent experimental unit variation is an historical artifact and does NOT represent mistakes in the data.
She randomly selects five trees from each variety within a large orchard, and takes a sample of leaves from each tree. The phosphorus content is determined for each tree. Here are the raw data:

Variety
Var-1   Var-2   Var-3
 0.35    0.65    0.60
 0.40    0.70    0.80
 0.58    0.90    0.75
 0.50    0.84    0.73
 0.47    0.79    0.66

The data are available in the phosphor.csv file in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. The data are imported into SAS in the usual way:
data phosphor;
  infile ’phosphor.csv’ dlm=’,’ dsd missover firstobs=2;
  input phosphor variety $;
run;
Part of the raw data are shown below:

Obs   phosphor   variety
  1       0.35   var1
  2       0.40   var1
  3       0.58   var1
  4       0.50   var1
  5       0.47   var1
  6       0.65   var2
  7       0.70   var2
  8       0.90   var2
  9       0.84   var2
 10       0.79   var2
1. Think about the design aspects. What is the factor? What are the levels? What are the treatments? Can treatments be randomized to experimental units? If not, how were experimental units selected?
What are the experimental and observational units? Why is only one value obtained for each tree? Why were five trees of each variety taken - why not just take 5 samples of leaves from one tree? Is the design a single-factor CRD?
2. Statistical Model. The statistical model must contain effects for the treatment structure, the experimental unit structure, and the randomization structure. As this is a CRD, the last component does not exist. The treatment structure consists of a single factor, Variety. The experimental units are the Trees. As this is a CRD, the effect of non-complete randomization does not exist. Hence our statistical model says that the variation in the response variable (Phosphorus) depends upon the effects of the treatments and variation among individual trees within each level of the treatment.
Most statistical packages require that you specify only the treatment effects unless there is more than one size of experimental unit (e.g. a split-plot design, to be covered later in the course). They assume that any left-over variation after accounting for the effects specified must be experimental unit variation. Hence, a simplified syntax for the model that represents the treatment, experimental unit, and randomization structure for this experiment could be written as:
Phosphorus = Variety
which indicates that the variation in phosphorus levels can be attributed to the different varieties (treatment structure) and any variation left over must be experimental unit variation.
3. Formulate the hypothesis of interest. We are interested in examining if all three varieties have the same mean phosphorus content. The hypotheses are:
H: µVar-1 = µVar-2 = µVar-3
A: not all means are equal, i.e. at least one mean is different from the others.
4. Collect data and summarize.
The data must be entered in ‘stacked column format’ with two columns, one for the factor (the variety) and one for the response (the phosphorus level). Each line must represent an individual subject. Note that every subject has one measurement – the multiple leaves from a single sample are composited into one sample and one concentration is found. This is a common data format for complex experimental designs where each observation is in a different row and the different columns represent different variables. The dataset contains two variables, one of which is the factor variable and the other the response variable. You will find it easiest to code factor variables with alphanumeric codes, as was done in this study.
We start by using Proc SGplot to create a side-by-side dot plot to check for outliers:

proc sgplot data=phosphor;
  title2 ’dot plot of rawdata’;
  scatter x=variety y=phosphor;
run;
Then Proc Tabulate is used to construct a table of means and standard deviations:

proc tabulate data=phosphor;
  title2 ’summary statistics’;
  class variety;
  var phosphor;
  table variety, phosphor*(n*f=5.0 mean*f=6.2 std*f=6.2) / rts=20;
run;

which gives:
           phosphor
variety    N   Mean   Std
var1       5   0.46   0.09
var2       5   0.78   0.10
var3       5   0.71   0.08
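The table of means and standard deviations can be reproduced directly from the raw data. A short Python sketch (the analysis in the notes uses Proc Tabulate; this is only an illustrative cross-check):

```python
# Reproducing the summary table (n, mean, sample standard deviation)
# from the raw phosphorus data.
data = {
    "var1": [0.35, 0.40, 0.58, 0.50, 0.47],
    "var2": [0.65, 0.70, 0.90, 0.84, 0.79],
    "var3": [0.60, 0.80, 0.75, 0.73, 0.66],
}

summary = {}
for variety, vals in data.items():
    n = len(vals)
    mean = sum(vals) / n
    # Sample standard deviation: divide by n - 1, then take the square root.
    std = (sum((x - mean) ** 2 for x in vals) / (n - 1)) ** 0.5
    summary[variety] = (n, round(mean, 2), round(std, 2))
    print(variety, summary[variety])
```

This gives, to two decimals, the values in the table: var1 (0.46, 0.09), var2 (0.78, 0.10), var3 (0.71, 0.08).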
We note that the sample standard deviations are similar in each group and there do not appear to be any outliers or unusual data values. The assumption of equal standard deviations in each treatment group appears to be tenable.
5. Find the test-statistic and compute a p-value. A single-factor CRD can be analyzed using either Proc GLM or Proc Mixed. Both will give the same results, and it is a matter of personal preference which is used.14 We will demonstrate the output from both procedures.
First, Proc GLM:

ods graphics on;
proc glm data=phosphor plots=all;
  title2 ’ANOVA using GLM’;
  class variety;
  model phosphor = variety;
  lsmeans variety / adjust=tukey pdiff cl stderr lines;
  ods output LSmeanDiffCL = GLMdiffs;
  ods output LSmeans = GLMLSmeans;
  ods output LSmeanCL = GLMLSmeansCL;
  ods output LSMlines = GLMlines;
  ods output ModelANOVA = GLManova;
run;
ods graphics off;

Proc GLM computes various test-statistics which, in the case of a single-factor CRD, are all the same. In general, you are interested in the Type III tests from GLM.15
Dependent   Hypothesis Type   Source    DF   Type III SS   Mean Square   F Value   Pr > F
phosphor    3                 variety    2   0.27664000    0.13832000      16.97   0.0003

14 I prefer Proc Mixed because it is more general than GLM.
15 Contact me for details about the Type I, II, III, and IV sums-of-squares and tests.
Second, Proc Mixed:

ods graphics on;
proc mixed data=phosphor plots=all;
  title2 ’ANOVA using Mixed’;
  class variety;
  model phosphor = variety;
  lsmeans variety / adjust=tukey diff cl;
  ods output tests3 = MixedTest;
  ods output lsmeans = MixedLsmeans;
  ods output diffs = MixedDiffs;
run;
ods graphics off;

which gives:
Effect    Num DF   Den DF   F Value   Pr > F
variety        2       12     16.97   0.0003
In Proc Mixed, the concept of Type I, II, and III tests does not exist, and there is only one table of test statistics produced.16 The F-statistic is 16.97. The p-value is 0.0003.
6. Make a decision. Because the p-value is small, we conclude that there is evidence that not all the population means are equal. At this point, we still don't know which means may differ from each other, but the summary statistics give us a good indication of which varieties appear to have means that differ from the rest. Once again, we have not proved that the means are not all the same. We have only collected good evidence against them being the same. We may have made a Type I error, but the chances of it are rather small (this is what the p-value measures).
We start by finding the estimates of the marginal means along with the standard errors and confidence intervals. These are obtained from the LSmeans statement in both Proc GLM and Proc Mixed, but beware of the slightly different syntaxes in the two procedures. The corresponding outputs for Proc GLM are:
16 Indeed, Proc Mixed does away with Sums-of-Squares and all that jazz and follows a REML procedure – contact me for more details.
Effect    Dependent   variety   phosphor LSMEAN   Standard Error   Pr > |t|   LSMEAN Number
variety   phosphor    var1           0.46000000       0.04037326