
STAT 231 (Winter 2012 - 1121) Honours Statistics Prof. R. Metzger University of Waterloo LATEXer: W. Kong Last Revision: April 3, 2012

Table of Contents

1 PPDAC
   1.1 Problem
   1.2 Plan
   1.3 Data
   1.4 Analysis
   1.5 Conclusion
2 Measurement Analysis
   2.1 Measurements of Spread
   2.2 Measurements of Association
3 Probability Theory
4 Statistical Models
   4.1 Generalities
   4.2 Types of Models
5 Estimates and Estimators
   5.1 Motivation
   5.2 Maximum Likelihood Estimation (MLE) Algorithm
   5.3 Estimators
   5.4 Biases in Statistics
6 Distribution Theory
   6.1 Student t-Distribution
   6.2 Least Squares Method
7 Intervals
   7.1 Confidence Intervals
   7.2 Predicting Intervals
   7.3 Likelihood Intervals
8 Hypothesis Testing
9 Comparative Models
10 Experimental Design
11 Model Assessment
12 Chi-Squared Test
References
Appendix A

These notes are currently a work in progress, and as such may be incomplete or contain errors. [Source: lambertw.com]

List of Figures

1.1 Population Relationships
1.2 Box Plots (from Wikipedia)


Acknowledgments: Special thanks to Michael Baker and his LaTeX-formatted notes. They were the inspiration for the structure of these notes.


Abstract

The purpose of these notes is to provide a guide to the second year honours statistics course. The contents of this course are designed to satisfy the following objectives:

• To provide students with a basic understanding of probability, the role of variation in empirical problem solving and statistical concepts (to be able to critically evaluate, understand and interpret statistical studies reported in newspapers, internet and scientific articles).

• To provide students with the statistical concepts and techniques necessary to carry out an empirical study to answer relevant questions in any given area of interest.

The recommended prerequisites are Calculus I, II (one of Math 137, 147 and one of Math 138, 148), and Probability (Stat 230). Readers should have a basic understanding of single-variable differential and integral calculus as well as a good understanding of probability theory.

1 PPDAC

PPDAC is a process or recipe used to solve statistical problems. It stands for: Problem / Plan / Data / Analysis / Conclusion

1.1 Problem

The problem step's job is to clearly define the

1. Goal or Aspect of the study
2. Target Population and Units
3. Unit's Variates
4. Attributes and Parameters

Definition 1.1. The target population is the set of animals, people or things about which you wish to draw conclusions. A unit is a singleton of the target population.

Definition 1.2. The sample population is a specified subset of the target population. A sample is a singleton of the sample population and a unit of the study population.

Example 1.1. If I were interested in the average age of all students taking STAT 231, then:

Unit = student of STAT 231
Target Population (T.P.) = students of STAT 231

Definition 1.3. A variate is a characteristic of a single unit in a target population and is usually one of the following:

1. Response variates - of interest in the study
2. Explanatory variates - why responses vary from unit to unit
   (a) Known - variates that are known to cause the responses
       i. Focal - known variates that divide the target population into subsets
   (b) Unknown - variates that cause responses but cannot be explained

Using Ex. 1.1 as a guide, we can think of the response variate as the age of a student and the explanatory variates as factors such as the age of the student, when and where they were born, their familial situations, educational background and intelligence. For an example of a focal variate, we could think of something along the lines of domestic vs. international students or male vs. female.

Definition 1.4. An attribute is a characteristic of a population which is usually denoted by a function of the response variate. It can have two other names, depending on the population studied:

• Parameter is used when studying populations
• Statistic is used when studying samples
• Attribute can be used interchangeably with the above

Definition 1.5. The aspect is the goal of the study and is generally one of the following:

1. Descriptive - describing or determining the value of an attribute
2. Comparative - comparing the attribute of two (or more) groups
3. Causative - trying to determine whether a particular explanatory variate causes a response to change
4. Predictive - to predict the value of a response variate using your explanatory variate

1.2 Plan

The job of the plan step is to accomplish the following:

1. Define the Study Protocol - this is the population that is actually studied and is NOT always a subset of the T.P.
2. Define the Sampling Protocol - the sampling protocol is used to draw a sample from the study population
   (a) Some types of sampling protocol include
       i. Random sampling (ad verbatim)
       ii. Judgment sampling (e.g. gender ratios)
       iii. Volunteer sampling (ad verbatim)
       iv. Representative sampling
           A. a sample that matches the sample population (S.P.) in all important characteristics (i.e. the proportion of a certain important characteristic in the S.P. is the same in the sample)
   (b) Generally, statisticians prefer random sampling
3. Define the Sample - ad verbatim
4. Define the measurement system - this defines the tools/methods used

A visualization of the relationship between the populations is below. In the diagram, T.P. is the target population and S.P. is the sample population.

Example 1.2. Suppose that we wanted to compare the common sense of maths and arts students at the University of Waterloo. An experiment that we could do is take 50 arts students and 33 maths students from the University of Waterloo taking an introductory statistics course this term, have them write a (non-mathematical) statistics test, and average the results for each group. In this situation we have:

Figure 1.1: Population Relationships (diagram labels: T.P., S.P., Sample, Study Error, Sample Error)

• T.P. = All arts and maths students
• S.P. = Arts and maths students from the University of Waterloo taking an introductory statistics course this term
• Sample = 50 arts students and 33 maths students from the University of Waterloo taking an introductory statistics course this term
• Aspect = Comparative study
• Attribute(s): Average grade of arts and maths students (for T.P., S.P., and sample)
  – Parameter for T.P. and S.P.
  – Statistic for Sample

There are, however, some issues:

1. Sample units differ from S.P. units
2. S.P. units differ from T.P. units
3. We want to measure common sense, but the test is measuring statistical knowledge
4. Is it fair to put arts students in a maths course?

Example 1.3. Suppose that we want to investigate the effect of cigarettes on the incidence of lung cancer in humans. We can do this by purchasing mice online, randomly selecting 43 mice and letting them smoke 20 cigarettes a day. We then conduct an autopsy at the time of death to check if they have lung cancer. In this situation, we have:

• T.P. = All people who smoke
• S.P. = Mice bought online
• Sample = 43 selected online mice
• Aspect = Not Comparative
• Attribute:
  – T.P. - proportion of smoking people with lung cancer
  – S.P. - proportion of mice bought online with lung cancer
  – Sample - proportion of sample with lung cancer

Like the previous example, there are some issues:

1. Mice and humans may react differently to cigarettes
2. We do not have a baseline (i.e. what does X% mean?)
3. Is having the mice smoke 20 cigarettes a day realistic?

Definition 1.6. Let $a(x)$ be defined as an attribute as a function of some population or sample $x$. We define the study error as

$$a(\text{T.P.}) - a(\text{S.P.}).$$

Unfortunately, there is no way to directly calculate this error and so its value must be argued. In our previous examples, one could say that the study error in Ex. 1.3 is higher than that of Ex. 1.2 due to the S.P. in Ex. 1.3 being drastically different from the T.P.

Definition 1.7. Similar to above, we define the sample error as

$$a(\text{S.P.}) - a(\text{sample}).$$

Although we may be able to calculate $a(\text{sample})$, $a(\text{S.P.})$ is not computable and, like the above, this value must be argued.

Remark 1.1. Note that if we use a random sample, we hope it is representative and that the above errors are minimized.
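In a simulation, unlike in a real study, the whole sample population is known, so the sample error of Definition 1.7 can be computed directly. The sketch below is only an illustration under that assumption: the population of ages is synthetic, the attribute is taken to be the mean, and the names `sample_error` and `study_population` are hypothetical.

```python
import random


def mean(values):
    """Attribute a(x): the average of a list of numbers."""
    return sum(values) / len(values)


def sample_error(study_population, sample_size, seed=0):
    """Return a(S.P.) - a(sample) for one random sample drawn without replacement."""
    rng = random.Random(seed)
    sample = rng.sample(study_population, sample_size)
    return mean(study_population) - mean(sample)


# Synthetic S.P. (an assumption for illustration): 1000 student ages between 18 and 25.
population_rng = random.Random(231)
study_population = [population_rng.randint(18, 25) for _ in range(1000)]

error = sample_error(study_population, sample_size=100)
print(f"sample error = {error:.3f}")
```

Sampling the entire population gives a sample error of exactly zero, which is one way to see why larger random samples tend to shrink this error.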

1.3 Data

This step involves the collecting and organizing of data and statistics.

Data Types

• Discrete Data: Simply put, there are "holes" between the numbers
• Continuous (CTS) Data: We assume that there are no "holes"
• Nominal Data: No order in the data
• Ordinal Data: There is some order in the data
• Binary Data: e.g. Success/failure, true/false, yes/no
• Counting Data: Used for counting the number of events

1.4 Analysis

This step involves analyzing our data set and making well-informed observations and analyses.

Data Quality

There are 3 factors that we look at:

1. Outliers: Data that is more extreme than their counterparts.
   (a) Reasons for outliers?
       i. Typos or data recording errors
       ii. Measurement errors
       iii. Valid outliers
   (b) Without having been involved in the study from the start, it is difficult to tell which is which
2. Missing Data Points
3. Measurement Issues

Characteristic of a Data Set

Outliers could be found in any data set but these 3 always are:

1. Shape
   (a) Numerical methods: Skewness and Kurtosis (not covered in this course)
   (b) Empirical methods:
       i. Bell-shaped: Symmetrical about a mean
       ii. Skewed left (negative): densest area on the right
       iii. Skewed right (positive): densest area on the left
       iv. Uniform: even all around; straight line
2. Center (location)
   (a) The "middle" of our data
       i. Mode: statistic that asks which value occurs the most frequently
       ii. Median (Q2): The middle data value
           A. The definition in this course is an algorithm
           B. We denote $n$ data values by $x_1, x_2, \ldots, x_n$ and the sorted data by $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ where $x_{(1)} \le x_{(2)} \le \ldots \le x_{(n)}$. If $n$ is odd then $Q2 = x_{\left(\frac{n+1}{2}\right)}$ and if $n$ is even, then $Q2 = \frac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}}{2}$

       iii. Mean: The sample mean is $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
           A. The sample mean moves in the direction of the outlier if an outlier is added (the median does as well, but with less of an effect)
   (b) Robustness: The median is less affected by outliers and is thus robust.
3. Spread (variability)
   (a) Range: By definition, this is $x_{(n)} - x_{(1)}$
   (b) IQR (Interquartile range): The middle half of your data
       i. We first define posn(a) as the function which returns the index of the statistic a in a data set. The value of Q1 is defined as the median of the data set $x_{(1)}, x_{(2)}, \ldots, x_{(\text{posn}(Q2)-1)}$ and Q3 is the median of the data set $x_{(\text{posn}(Q2)+1)}, x_{(\text{posn}(Q2)+2)}, \ldots, x_{(n)}$
       ii. The IQR of a data set is defined to be the difference between Q3 and Q1. That is, $IQR = Q3 - Q1$
       iii. A box plot is a visual representation of this. The whiskers of a box plot represent values that are some value below or above the data set, according to the upper and lower fences, which are the upper and lower bounds of the whiskers respectively. The lower fence, LL, is bounded by a value of $LL = Q1 - 1.5(IQR)$ and the upper fence, UL, a value of $UL = Q3 + 1.5(IQR)$. We say that any value that is less than LL or greater than UL is an outlier.

Here, we have a visualization of a box plot, courtesy of Wikipedia:

Figure 1.2: Box Plots (from Wikipedia)

   (c) Variance: For a sample population, the variance is defined to be

   $$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$

   which is called the sample variance.
   (d) Standard Deviation: For a sample population, the standard deviation is just the square root of the variance:

   $$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$$

   which is called the sample standard deviation.
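The measures of center and spread above can be sketched in a few lines of Python. This is a minimal illustration, not the course's official algorithm: in particular, when $n$ is even Q2 is not a data value, so the sketch assumes Q1 and Q3 are the medians of the lower and upper halves of the sorted data. The function names and the data set are hypothetical.

```python
import math


def median(sorted_values):
    """Median of already-sorted data, following the odd/even rule above."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]  # x_((n+1)/2) in 1-indexed notation
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def summarize(data):
    """Compute center, spread and outliers for a list of numbers."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2  # size of each half; the median itself is excluded when n is odd
    q1 = median(xs[:half])
    q2 = median(xs)
    q3 = median(xs[n - half:])
    iqr = q3 - q1
    mean = sum(xs) / n
    variance = sum((x - mean) ** 2 for x in xs) / (n - 1)
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return {
        "mean": mean,
        "q1": q1,
        "q2": q2,
        "q3": q3,
        "range": xs[-1] - xs[0],
        "iqr": iqr,
        "variance": variance,
        "sd": math.sqrt(variance),
        "outliers": [x for x in xs if x < lower_fence or x > upper_fence],
    }


summary = summarize([2, 4, 4, 5, 7, 9, 30])
print(summary)
```

In this made-up data set the value 30 pulls the mean well above the median, and it is the only value beyond the fences.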

1.5 Conclusion

In the conclusion, there are only two aspects of the study that you need to be concerned about:

1. Did you answer your problem?
2. Talk about limitations (i.e. study errors, sample errors)

2 Measurement Analysis

The goal of measures is to explain how far our data is spread out and the relationship of data points.

2.1 Measurements of Spread

The goal of the standard deviation is to approximate the average distance a point is from the mean. Here are some other methods that we could use for standard deviation and why they fail:

1. $\frac{\sum_{i=1}^{n} (x_i - \bar{x})}{n}$ does not work because it is always equal to 0

2. $\frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$ works but we cannot do anything mathematically significant to it

3. $\sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$ is a good guess, but note that $\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \le \sum_{i=1}^{n} |x_i - \bar{x}|$
   (a) To fix this, we use $n - 1$ instead of $n$ (proof comes later on)

Proposition 2.1. An interesting identity is the following:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1}}$$

Proof. Exercise for the reader.

Definition 2.1. Coefficient of Variation (CV). This measure provides a unit-less measurement of spread:

$$CV = \frac{s}{\bar{x}} \times 100\%$$
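As a quick numerical check of Proposition 2.1 and Definition 2.1, the sketch below computes the sample standard deviation both ways and then the coefficient of variation. The data values and function names are made up for illustration.

```python
import math


def sample_sd(xs):
    """Sample standard deviation from squared deviations about the mean."""
    n = len(xs)
    x_bar = sum(xs) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))


def sample_sd_shortcut(xs):
    """Same quantity using the identity sum(x_i^2) - n * x_bar^2."""
    n = len(xs)
    x_bar = sum(xs) / n
    return math.sqrt((sum(x * x for x in xs) - n * x_bar ** 2) / (n - 1))


def coefficient_of_variation(xs):
    """CV = s / x_bar * 100, a unit-less percentage."""
    return sample_sd(xs) / (sum(xs) / len(xs)) * 100


heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]  # hypothetical data
print(sample_sd(heights_cm), sample_sd_shortcut(heights_cm))
print(coefficient_of_variation(heights_cm))
```

Because the CV is unit-less, converting the same heights to metres leaves it unchanged, while the standard deviation itself shrinks by a factor of 100.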

2.2 Measurements of Association

Here, we examine a few interesting measures which test the relationship between two random variables X and Y.

1. Covariance: In theory (a population), the covariance is defined as

$$\text{Cov}(X, Y) = E((X - \mu_X)(Y - \mu_Y))$$

but in practice (in samples) it is defined as

$$s_{XY} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1}.$$

Note that $\text{Cov}(X, Y), s_{XY} \in \mathbb{R}$ and both give us an idea of the direction of the relationship but not the magnitude.

2. Correlation: In theory (a population), the correlation is defined as

$$\rho_{XY} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}$$

but in practice (in samples) it is defined as

$$r_{XY} = \frac{s_{XY}}{s_X s_Y}.$$

Note that $-1 \le \rho_{XY}, r_{XY} \le 1$ and both give us an idea of the direction of the relationship AND the magnitude.

   (a) An interpretation of the values is as follows: …
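The sample formulas for $s_{XY}$ and $r_{XY}$ translate directly into code. The sketch below uses hypothetical data (hours studied versus grade) and shows why covariance alone does not convey magnitude: rescaling one variable changes $s_{XY}$ but leaves $r_{XY}$ untouched.

```python
import math


def sample_covariance(xs, ys):
    """s_XY with the n - 1 denominator."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def sample_correlation(xs, ys):
    """r_XY = s_XY / (s_X * s_Y); s_X^2 is the covariance of xs with itself."""
    s_x = math.sqrt(sample_covariance(xs, xs))
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)


hours = [1.0, 2.0, 3.0, 4.0, 5.0]        # hypothetical explanatory variate
grades = [52.0, 60.0, 61.0, 70.0, 77.0]  # hypothetical response variate
grades_scaled = [10 * g for g in grades]  # same data in different units

print(sample_covariance(hours, grades), sample_covariance(hours, grades_scaled))
print(sample_correlation(hours, grades), sample_correlation(hours, grades_scaled))
```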

… $> 30$, the Student's t-distribution $t_v$ is almost identical to the normal distribution with mean 0 and variance 1

• For $v \le 30$, T is very close to a uniform distribution with thick tails and a very even, unpronounced center

(From here on out, the content will be supplemental to the course notes and will only serve as a review)

6.2 Least Squares Method

There are two ways to use this method. First, for a given model $Y$ and parameter $\theta$, suppose that we get a best fit $\hat{y}$ and define the residuals $\hat{\epsilon}_i = \hat{y}_i - y_i$. The least squares approach is through either of the two:

1. (Algebraic) Define $W = \sum_{i=1}^{n} \hat{\epsilon}_i^2$. Calculate $\frac{\partial W}{\partial \theta}$ and set it to zero to determine the $\theta$ that minimizes $W$.

2. (Geometric) Define $W = \sum_{i=1}^{n} \hat{\epsilon}_i^2 = \hat{\epsilon}^t \hat{\epsilon}$. Note that $\hat{\epsilon} \perp \text{span}\{\vec{1}, \vec{x}\}$ and so $\hat{\epsilon}^t \vec{1} = 0$ and $\hat{\epsilon}^t \vec{x} = 0$. Use these equations to determine $\theta$.
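To make the two views concrete, the sketch below assumes a simple linear model $y = \alpha + \beta x$ (the notes do not fix a specific model here, so this choice and the data are illustrative). Setting the partial derivatives of $W$ to zero gives the closed-form estimates used in `fit_line`, and the fitted residuals then satisfy the two orthogonality conditions of the geometric approach.

```python
def fit_line(xs, ys):
    """Least squares estimates (alpha, beta) for the assumed model y = alpha + beta * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    beta = s_xy / s_xx            # from dW/d(beta) = 0
    alpha = y_bar - beta * x_bar  # from dW/d(alpha) = 0
    return alpha, beta


xs = [1.0, 2.0, 3.0, 4.0]  # hypothetical explanatory values
ys = [2.1, 3.9, 6.2, 7.8]  # hypothetical responses
alpha, beta = fit_line(xs, ys)

# Residuals: fitted value minus observed value.
residuals = [(alpha + beta * x) - y for x, y in zip(xs, ys)]

# Geometric conditions: residuals are orthogonal to the vectors 1 and x.
dot_with_ones = sum(residuals)
dot_with_x = sum(r * x for r, x in zip(residuals, xs))
print(alpha, beta, dot_with_ones, dot_with_x)
```

Both dot products come out as zero up to floating-point rounding, which is the geometric statement that the residual vector is perpendicular to the span of $\vec{1}$ and $\vec{x}$.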

7 Intervals

Here, we deviate from the order of lectures and focus on the various types of constructed intervals.

7.1 Confidence Intervals

Suppose that $\tilde{\theta} \sim N(\theta, Var(\tilde{\theta}))$. Find $L, U$, equidistant from $\theta$, such that $P(L < \tilde{\theta} < U) = 1 - \alpha$ where $\alpha$ is known as our margin of error. Usually this value is 0.05 by convention, unless otherwise specified. Normalizing $\tilde{\theta}$, we get that

$$(L, U) = \left(\theta - c\sqrt{Var(\tilde{\theta})},\ \theta + c\sqrt{Var(\tilde{\theta})}\right)$$

where $c$ is the value such that $1 - \alpha = P(-c < Z < c)$ and $Z \sim N(0, 1)$. We call this interval, $\theta \pm c\sqrt{Var(\tilde{\theta})}$, a probability interval.

Usually, though, we don't know $\theta$, so we replace it with our best estimate $\hat{\theta}$ to get $\hat{\theta} \pm c\sqrt{Var(\tilde{\theta})}$, which is called a $100(1 - \alpha)\%$ confidence interval. If $Var(\tilde{\theta})$ is known, then $C \sim N(0, 1)$ and if it is unknown, we replace $Var(\tilde{\theta})$ with $\widehat{Var}(\tilde{\theta})$ and $C \sim t_{n-q}$. Another compact notation for confidence intervals is $EST \pm c\,SE$. Note that the confidence interval says that we are 95% confident in our result, but does not say anything significant. When we say that we are 95% confident, it means that given 100 randomly selected trials, we expect that in 95 of the 100 confidence intervals constructed from the samples, the population parameter will be in those intervals.
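For the known-variance case, the $EST \pm c\,SE$ recipe can be sketched with Python's standard library, where `NormalDist().inv_cdf` supplies the value $c$ with $P(-c < Z < c) = 1 - \alpha$. The estimate and variance below are hypothetical (a sample mean of 20 from $n = 25$ observations with known $\sigma^2 = 4$), and the unknown-variance case would need a $t_{n-q}$ quantile instead, which the standard library does not provide.

```python
import math
from statistics import NormalDist


def confidence_interval(estimate, variance, alpha=0.05):
    """Return (lower, upper) for EST +/- c * SE, assuming Var(theta~) is known."""
    c = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 when alpha = 0.05
    half_width = c * math.sqrt(variance)
    return estimate - half_width, estimate + half_width


# Hypothetical numbers: sample mean 20 with Var(theta~) = sigma^2 / n = 4 / 25.
lower, upper = confidence_interval(estimate=20.0, variance=4.0 / 25)
print(f"95% confidence interval: ({lower:.3f}, {upper:.3f})")
```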

7.2 Predicting Intervals

A predicting interval is an extension of a model $Y$, obtained by adding the subscript $p$. The model is usually of the form $Y_p = f(\tilde{\theta}) + \epsilon_p$ and the interval is of the form $EST \pm c\,SE = f(\hat{\theta}) \pm c\sqrt{Var(Y_p)}$. Note that $Y_p$ is different from a standard model $Y_i = f(\theta) + \epsilon_i$ in that the first component is random.

7.3 Likelihood Intervals

Recall the likelihood function $L(\theta, y_i) = \prod_{i=1}^{n} f(y_i, \theta)$ and note that $L(\hat{\theta})$ is the largest value of the likelihood function by MLE. We define the relative likelihood function as

$$R(\theta) = \frac{L(\theta)}{L(\hat{\theta})}, \quad 0 \le R(\theta) \le 1.$$

From a theorem in a future statistics course (STAT 330), if $R(\theta) \approx 0.1$, then solving for $\theta$ (usually using the quadratic formula) will form an approximate 95% confidence interval. This interval is called the likelihood interval and one of its main advantages over the confidence interval is that it does not require that the model be normal.
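As an illustration, the sketch below builds a likelihood interval for a binomial proportion by scanning a grid of $\theta$ values instead of solving $R(\theta) = 0.1$ algebraically. The binomial model, the data (30 successes in 100 trials), the grid search, and the reading of the interval as the set where $R(\theta) \ge 0.1$ are all assumptions made for this example.

```python
import math


def relative_likelihood(theta, successes, trials):
    """R(theta) = L(theta) / L(theta_hat) for binomial data, with 0 < successes < trials."""
    theta_hat = successes / trials  # the MLE of a binomial proportion

    def log_likelihood(t):
        return successes * math.log(t) + (trials - successes) * math.log(1 - t)

    return math.exp(log_likelihood(theta) - log_likelihood(theta_hat))


def likelihood_interval(successes, trials, level=0.1, steps=10000):
    """Smallest and largest grid values of theta with R(theta) >= level."""
    grid = [i / steps for i in range(1, steps)]
    inside = [t for t in grid if relative_likelihood(t, successes, trials) >= level]
    return inside[0], inside[-1]


low, high = likelihood_interval(successes=30, trials=100)
print(f"10% likelihood interval: ({low:.4f}, {high:.4f})")
```

The resulting interval is not symmetric about $\hat{\theta} = 0.3$, which reflects the fact that no normality assumption was imposed on the model.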

8 Hypothesis Testing

While our conﬁdence interval does not tell us in a yes or no way whether or not a statistical estimate is true, a hypothesis test does. Here are the steps:

1. State the hypothesis, $H_0: \theta = \theta_0$ (this is only an example), called the null hypothesis ($H_1$ is called the alternative hypothesis and states a statement contrary to the null hypothesis).

2. Calculate the discrepancy (also called the test statistic), denoted by

$$d = \frac{\hat{\theta} - \theta_0}{\sqrt{Var(\tilde{\theta})}} = \frac{\text{estimate} - H_0 \text{ value}}{SE}$$

assuming that $\tilde{\theta}$ is unbiased. Here $d$ is the realization of the random variable $D$, which is $N(0, 1)$ if $Var(\tilde{\theta})$ is known and $t_{n-q}$ otherwise. Note that $d$ is the number of standard deviations $\theta_0$ is from $\hat{\theta}$.

3. Calculate a p-value given by $p = 2P(D > |d|)$. It is also the probability that one sees a value worse than $\hat{\theta}$, given that the null hypothesis is true. The smaller the p-value, the more evidence against the model in order to reject.

4. Reject or not reject (note that we do not "accept" the model)

The following table subjectively describes interpretations for p-values:

P-value                  Interpretation
p-value < …              A ton of evidence against $H_0$
… ≤ p-value ≤ …          A lot of evidence against $H_0$
… < p-value ≤ 10%        Some evidence against $H_0$
p-value > 10%            There is virtually no evidence against $H_0$
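Steps 2 and 3 can be sketched as follows for the case where $D \sim N(0, 1)$. The numbers are hypothetical: an observed proportion of 0.56 from $n = 100$ binary trials is tested against $H_0: \Pi = 0.5$, and the variance is taken as $\Pi_0(1 - \Pi_0)/n$ under the null hypothesis, which is an assumption of this example.

```python
import math
from statistics import NormalDist


def two_sided_p_value(estimate, null_value, variance):
    """Return (d, p) where d = (estimate - null value) / SE and p = 2 * P(D > |d|)."""
    d = (estimate - null_value) / math.sqrt(variance)
    p = 2 * (1 - NormalDist().cdf(abs(d)))
    return d, p


# Hypothetical data: 56 successes in 100 trials, tested against H0: proportion = 0.5.
d, p = two_sided_p_value(estimate=0.56, null_value=0.5, variance=0.5 * 0.5 / 100)
print(f"d = {d:.2f}, p-value = {p:.4f}")
```

Here the estimate is only about 1.2 standard errors from the null value, so the p-value is large and the data give little reason to reject $H_0$.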

Note that one model for which $D \sim N(0, 1)$ is $Y_i = \epsilon_i$ where $\epsilon_i \sim Bin(1, \Pi)$ since

$$Var(\tilde{\theta}) = \frac{\hat{\Pi}_0 (1 - \hat{\Pi}_0)}{n}$$

by our null hypothesis and central limit theorem.

9 Comparative Models

The goal of a comparative model is to compare the mean of two groups and determine if there is a causation relationship between one and the other.

Definition 9.1. If x causes y and there is some variate z that is common between the two, then we say z is a confounding variable because it gives the illusion that z causes y. It is also sometimes called a lurking variable.

There are two main models that help determine if one variate causes another and they are the following.

Experimental Study

1. For every unit in the T.P., set the F.E.V. (focal explanatory variate) to level 1
2. We measure the attribute of interest
3. Repeat 1 and 2 but with the F.E.V. set to level 2
4. Only the F.E.V. changes and every other explanatory variate is fixed
5. If the attribute changes between steps 1 and 4, then causation occurs

Problems?

• We cannot sample the whole T.P.
• It is not possible to keep all explanatory variates fixed
• The attributes change (on average)

Observational Study

1. First, observe an association between x and y in many places, settings, types of studies, etc.
2. There must be a reason for why x causes y (either scientifically or logically)
3. There must be a consistent dose relationship
4. The association has to hold when other possible variates are held fixed

10 Experimental Design

There are three main tools that are used by statisticians to improve experimental design.

1. Replication
   (a) Simply put, we increase the sample size
       i. This is to decrease the variance of confidence intervals, which improves accuracy
2. Randomization
   (a) We select units in a random manner (i.e. if there are 2+ groups, we try to randomly assign units into the groups)
       i. This is to reduce bias and create a more representative sample
       ii. It allows us to assume independence between the $Y_i$'s
       iii. It reduces the chance of confounding by unknown explanatory variates
3. Pairing
   (a) In an experimental study, we call it blocking and in an observational study, we call it matching
   (b) The actual process is just matching units by their explanatory variates and matched units are called twins
       i. For example, in a group of 500 twins, grouped by gender and ages, used to test a vaccine, one of the twins in each group will take the vaccine and another will take a placebo
       ii. We do this in order to reduce the chance of confounding due to known explanatory variates
   (c) Pairing also allows us to perform subtraction between twins to compare certain attributes of the population
       i. Note that taking differences does not change the variability of the difference distribution

11 Model Assessment

We usually want the following four assumptions to be true when constructing a model $Y_i = f(\theta) + \epsilon_i$ to fit a sample. We also use certain statistical tools to measure how well our model fits these conditions. Note that these tools/tests require subjective observation.

• $\epsilon_i \sim N(0, \sigma^2)$
  – Why? Because $Y_i$ is not normal if $\epsilon_i$ is not normal. However, $\tilde{\theta}$ is still likely to be normal by CLT.
  – Tests:
    ∗ Histogram of residuals $\hat{\epsilon}_i = \hat{y}_i - y_i$ (should be bell-shaped)
    ∗ QQ Plot, which is the plot of theoretical quantiles versus sample quantiles (should be linear with intercept ~0)
      · Usually a little variability at the tails of the line is okay

• $E(\epsilon_i) = 0$, $Var(\epsilon_i) = \sigma^2$, and the $\epsilon_i$'s are independent
  – Why? All of our models, tests and estimates depend on this.
  – Tests:
    ∗ Scatter plot of residuals (y-axis) versus fitted values (x-axis)
      · We hope that it is centered on 0, has no visible pattern and that the data is bounded by two parallel lines (constant variance)
      · If there is not a constant variance, such as a funnel (funnel effect), we usually transform the fitted values (e.g. $y \to \ln y$)
      · If the plot seems periodic, we will need a new model (STAT 371/372)
    ∗ Scatter plot of fitted values (y-axis) versus explanatory variates (x-axis)
      · This is used mainly in regression models
      · We hope to see the same conditions as in the previous scatter plot
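The coordinates of a QQ plot can be computed without any plotting library. The sketch below pairs each standardized, sorted residual with a standard normal quantile; the plotting position $(i - 0.5)/n$ is one common convention and is an assumption here, as are the residual values themselves.

```python
from statistics import NormalDist


def qq_points(residuals):
    """Return (theoretical quantile, standardized sample quantile) pairs."""
    n = len(residuals)
    ordered = sorted(residuals)
    mean = sum(ordered) / n
    sd = (sum((r - mean) ** 2 for r in ordered) / (n - 1)) ** 0.5
    points = []
    for i, r in enumerate(ordered, start=1):
        theoretical = NormalDist().inv_cdf((i - 0.5) / n)  # assumed plotting position
        points.append((theoretical, (r - mean) / sd))
    return points


# Hypothetical residuals from some fitted model.
residuals = [-1.2, 0.4, 0.1, -0.3, 0.9, -0.6, 0.7]
points = qq_points(residuals)
for theoretical, sample in points:
    print(f"{theoretical:7.3f}  {sample:7.3f}")
```

If the normality assumption is reasonable, these pairs should fall close to a straight line through the origin, with some looseness at the two ends.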

12 Chi-Squared Test

The purpose of a Chi-squared test is to determine if there is an association between two random variables X, Y, given that they both contain only counting data. The following are the steps:

1. State the null hypothesis as $H_0$: X and Y are not associated.

2. If there are $m$ possible observations for X and $n$ possible observations for Y, then define

$$d = \sum_{j=1}^{n} \sum_{i=1}^{m} \frac{(\text{expected} - \text{observed})^2}{\text{expected}} = \sum_{j=1}^{n} \sum_{i=1}^{m} \frac{(e_{ij} - o_{ij})^2}{e_{ij}}$$

where $e_{ij} = P(X = x_i) \cdot P(Y = y_j)$, $o_{ij} = P(X = x_i, Y = y_j)$, for $i = 1, \ldots, m$ and $j = 1, \ldots, n$.

3. Assume that $D \sim \chi^2_{(m-1)(n-1)}$.

4. Calculate the p-value, which in this case is $Pr(D > d)$ since $d \ge 0$, which means we are conducting a one-tailed hypothesis.

5. Interpret it as always (see the table in Section 8)
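The steps above can be sketched for a table of observed counts. Note the assumptions: this sketch works with counts rather than probabilities, taking $e_{ij}$ as (row total × column total) / grand total, and it evaluates the chi-squared c.d.f. through a power series for the regularized lower incomplete gamma function. The function names and the 2×2 table are hypothetical.

```python
import math


def regularized_lower_gamma(a, x):
    """P(a, x) = gamma(a, x) / Gamma(a), computed from its power series."""
    if x <= 0:
        return 0.0
    term = 1.0 / a
    total = term
    k = 1
    while abs(term) > 1e-15 * abs(total):
        term *= x / (a + k)
        total += term
        k += 1
    return total * math.exp(a * math.log(x) - x - math.lgamma(a))


def chi_squared_test(observed):
    """Return (d, degrees of freedom, p-value) for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    d = 0.0
    for i, row in enumerate(observed):
        for j, o_ij in enumerate(row):
            e_ij = row_totals[i] * col_totals[j] / grand_total  # expected count under H0
            d += (e_ij - o_ij) ** 2 / e_ij
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    p_value = 1 - regularized_lower_gamma(df / 2, d / 2)  # Pr(D > d)
    return d, df, p_value


d, df, p_value = chi_squared_test([[30, 10], [20, 40]])
print(f"d = {d:.3f}, df = {df}, p-value = {p_value:.6f}")
```

For this made-up table the p-value is far below 1%, so the counts would be read as strong evidence against the hypothesis of no association.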


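Appendix A

Chi-squared p.d.f. and c.d.f. on $k$ degrees of freedom ($\chi^2_k$):

$$f(x, k) = \frac{1}{2^{k/2}\,\Gamma\left(\frac{k}{2}\right)}\, x^{\frac{k}{2} - 1} e^{-\frac{x}{2}}$$

$$F(x, k) = \frac{1}{\Gamma\left(\frac{k}{2}\right)}\, \gamma\left(\frac{k}{2}, \frac{x}{2}\right)$$

Here $\Gamma(x)$ denotes the regular gamma function and $\gamma(s, x)$ is the lower incomplete gamma function.

Student's t p.d.f. and c.d.f. on $v$ degrees of freedom ($t_v$):

$$f(x, v) = \frac{\Gamma\left(\frac{v+1}{2}\right)}{\sqrt{v\pi}\,\Gamma\left(\frac{v}{2}\right)} \left(1 + \frac{x^2}{v}\right)^{-\frac{v+1}{2}}$$

$$F(x, v) = \frac{1}{2} + x\,\Gamma\left(\frac{v+1}{2}\right) \cdot \frac{{}_2F_1\left(\frac{1}{2}, \frac{v+1}{2}; \frac{3}{2}; -\frac{x^2}{v}\right)}{\sqrt{\pi v}\,\Gamma\left(\frac{v}{2}\right)}$$

Here ${}_2F_1$ denotes the hypergeometric function.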
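The two density formulas are easy to evaluate numerically with `math.gamma`. The sketch below is only a sanity check of the formulas as written above; the function names are hypothetical, and `math.gamma` overflows for very large arguments, so it is meant for moderate degrees of freedom.

```python
import math


def chi_squared_pdf(x, k):
    """f(x, k) for the chi-squared distribution on k degrees of freedom (x > 0)."""
    return x ** (k / 2 - 1) * math.exp(-x / 2) / (2 ** (k / 2) * math.gamma(k / 2))


def student_t_pdf(x, v):
    """f(x, v) for Student's t distribution on v degrees of freedom."""
    coefficient = math.gamma((v + 1) / 2) / (math.sqrt(v * math.pi) * math.gamma(v / 2))
    return coefficient * (1 + x * x / v) ** (-(v + 1) / 2)


def standard_normal_pdf(x):
    """Density of N(0, 1), used for comparison with the t density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)


for v in (2, 30, 100):
    gap = abs(student_t_pdf(1.0, v) - standard_normal_pdf(1.0))
    print(f"v = {v:3d}: |t pdf - normal pdf| at x = 1 is {gap:.4f}")
```

The printed gaps shrink as $v$ grows, which is the numerical counterpart of the t-distribution approaching the standard normal for large degrees of freedom.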


View more...
Table of Contents
1 PPDAC
1.1 Problem
1.2 Plan
1.3 Data
1.4 Analysis
1.5 Conclusion
2 Measurement Analysis
2.1 Measurements of Spread
2.2 Measurements of Association
3 Probability Theory
4 Statistical Models
4.1 Generalities
4.2 Types of Models
5 Estimates and Estimators
5.1 Motivation
5.2 Maximum Likelihood Estimation (MLE) Algorithm
5.3 Estimators
5.4 Biases in Statistics
6 Distribution Theory
6.1 Student t-Distribution
6.2 Least Squares Method
7 Intervals
7.1 Confidence Intervals
7.2 Predicting Intervals
7.3 Likelihood Intervals
8 Hypothesis Testing
9 Comparative Models
10 Experimental Design
11 Model Assessment
12 Chi-Squared Test
References
Appendix A

These notes are currently a work in progress, and as such may be incomplete or contain errors. [Source: lambertw.com]

List of Figures
1.1 Population Relationships
1.2 Box Plots (from Wikipedia)

Acknowledgments: Special thanks to Michael Baker and his LaTeX-formatted notes. They were the inspiration for the structure of these notes.

Abstract

The purpose of these notes is to provide a guide to the second year honours statistics course. The contents of this course are designed to satisfy the following objectives:

• To provide students with a basic understanding of probability, the role of variation in empirical problem solving and statistical concepts (to be able to critically evaluate, understand and interpret statistical studies reported in newspapers, internet and scientific articles).
• To provide students with the statistical concepts and techniques necessary to carry out an empirical study to answer relevant questions in any given area of interest.

The recommended prerequisites are Calculus I, II (one of Math 137,147 and one of Math 138, 148), and Probability (Stat 230). Readers should have a basic understanding of single-variable diﬀerential and integral calculus as well as a good understanding of probability theory.

1 PPDAC

PPDAC is a process or recipe used to solve statistical problems. It stands for: Problem / Plan / Data / Analysis / Conclusion

1.1 Problem

The problem step's job is to clearly define the
1. Goal or Aspect of the study
2. Target Population and Units
3. Unit's Variates
4. Attributes and Parameters

Definition 1.1. The target population is the set of animals, people or things about which you wish to draw conclusions. A unit is a singleton of the target population.

Definition 1.2. The sample population is a specified subset of the target population. A sample is a singleton of the sample population and a unit of the study population.

Example 1.1. If I were interested in the average age of all students taking STAT 231, then:
Unit = a student of STAT 231
Target Population (T.P.) = students of STAT 231

Definition 1.3. A variate is a characteristic of a single unit in a target population and is usually one of the following:
1. Response variates - the variates of interest in the study
2. Explanatory variates - the variates that explain why responses vary from unit to unit
(a) Known - variates that are known to cause the responses
i. Focal - known variates that divide the target population into subsets
(b) Unknown - variates that cause the responses but cannot be explained

Using Ex. 1.1 as a guide, we can think of the response variate as the age of a student and the explanatory variates as factors such as when and where they were born, their familial situations, educational background and intelligence. For an example of a focal variate, we could think of something along the lines of domestic vs. international students or male vs. female.

Definition 1.4. An attribute is a characteristic of a population which is usually denoted by a function of the response variate. It can have two other names, depending on the population studied:
• Parameter is used when studying populations
• Statistic is used when studying samples
• Attribute can be used interchangeably with the above

Definition 1.5. The aspect is the goal of the study and is generally one of the following:
1. Descriptive - describing or determining the value of an attribute
2. Comparative - comparing the attribute of two (or more) groups
3. Causative - trying to determine whether a particular explanatory variate causes a response to change
4. Predictive - predicting the value of a response variate using your explanatory variate

1.2 Plan

The job of the plan step is to accomplish the following:
1. Define the Study Protocol - this is the population that is actually studied and is NOT always a subset of the T.P.
2. Define the Sampling Protocol - the sampling protocol is used to draw a sample from the study population
(a) Some types of sampling protocol include
i. Random sampling (as the name suggests)
ii. Judgment sampling (e.g. gender ratios)
iii. Volunteer sampling (as the name suggests)
iv. Representative sampling
A. a sample that matches the sample population (S.P.) in all important characteristics (i.e. the proportion of a certain important characteristic in the S.P. is the same in the sample)
(b) Generally, statisticians prefer random sampling
3. Define the Sample (as the name suggests)
4. Define the measurement system - this defines the tools/methods used

A visualization of the relationship between the populations is below. In the diagram, T.P. is the target population and S.P. is the sample population.

Example 1.2. Suppose that we wanted to compare the common sense between maths and arts students at the University of Waterloo. An experiment that we could do is take 50 arts students and 33 maths students from the University of Waterloo taking an introductory statistics course this term and have them write a statistics test (non-mathematical) and average the results for each group. In this situation we have:

[Figure 1.1: Population Relationships. The diagram relates the T.P., the S.P. and the sample, and labels the study error and the sample error.]

• T.P. = All arts and maths students
• S.P. = Arts and maths students from the University of Waterloo taking an introductory statistics course this term
• Sample = 50 arts students and 33 maths students from the University of Waterloo taking an introductory statistics course this term
• Aspect = Comparative study
• Attribute(s): Average grade of arts and maths students (for T.P., S.P., and sample)
– Parameter for T.P. and S.P.
– Statistic for Sample

There are, however, some issues:
1. Sample units differ from S.P. units
2. S.P. units differ from T.P. units
3. We want to measure common sense, but the test is measuring statistical knowledge
4. Is it fair to put arts students in a maths course?

Example 1.3. Suppose that we want to investigate the effect of cigarettes on the incidence of lung cancer in humans. We can do this by purchasing mice online, randomly selecting 43 mice and letting them smoke 20 cigarettes a day. We then conduct an autopsy at the time of death to check if they have lung cancer. In this situation, we have:

• T.P. = All people who smoke
• S.P. = Mice bought online
• Sample = 43 selected online mice
• Aspect = Not Comparative
• Attribute:
– T.P. - proportion of smoking people with lung cancer
– S.P. - proportion of mice bought online with lung cancer
– Sample - proportion of sample with lung cancer

Like the previous example, there are some issues:
1. Mice and humans may react differently to cigarettes
2. We do not have a baseline (i.e. what does X% mean?)
3. Is having the mice smoke 20 cigarettes a day realistic?

Definition 1.6. Let $a(x)$ be defined as an attribute as a function of some population or sample $x$. We define the study error as $a(\text{T.P.}) - a(\text{S.P.})$.

Unfortunately, there is no way to directly calculate this error and so its value must be argued. In our previous examples, one could say that the study error in Ex. 1.3 is higher than that of Ex. 1.2 due to the S.P. in Ex. 1.3 being drastically different from the T.P.

Definition 1.7. Similar to above, we define the sample error as $a(\text{S.P.}) - a(\text{sample})$.

Although we may be able to calculate $a(\text{sample})$, $a(\text{S.P.})$ is not computable and, like the above, this value must be argued.

Remark 1.1. Note that if we use a random sample, we hope it is representative and that the above errors are minimized.

1.3 Data

This step involves the collecting and organizing of data and statistics.

Data Types
• Discrete Data: Simply put, there are "holes" between the numbers
• Continuous (CTS) Data: We assume that there are no "holes"
• Nominal Data: No order in the data
• Ordinal Data: There is some order in the data
• Binary Data: e.g. Success/failure, true/false, yes/no
• Counting Data: Used for counting the number of events

1.4 Analysis

This step involves analyzing our data set and making well-informed observations and analyses.

Data Quality
There are 3 factors that we look at:
1. Outliers: Data that is more extreme than their counterparts.
(a) Reasons for outliers?
i. Typos or data recording errors
ii. Measurement errors
iii. Valid outliers
(b) Without having been involved in the study from the start, it is difficult to tell which is which
2. Missing Data Points
3. Measurement Issues

Characteristics of a Data Set
Outliers may or may not appear in a data set, but the following 3 characteristics are always present:
1. Shape
(a) Numerical methods: Skewness and Kurtosis (not covered in this course)
(b) Empirical methods:
i. Bell-shaped: Symmetrical about a mean
ii. Skewed left (negative): densest area on the right
iii. Skewed right (positive): densest area on the left
iv. Uniform: even all around; straight line
2. Center (location)
(a) The "middle" of our data
i. Mode: statistic that asks which value occurs the most frequently
ii. Median (Q2): The middle data value
A. The definition in this course is an algorithm
B. We denote $n$ data values by $x_1, x_2, \ldots, x_n$ and the sorted data by $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ where $x_{(1)} \le x_{(2)} \le \ldots \le x_{(n)}$. If $n$ is odd then $Q2 = x_{\left(\frac{n+1}{2}\right)}$ and if $n$ is even, then $Q2 = \frac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}}{2}$

iii. Mean: The sample mean is $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
A. The sample mean moves in the direction of an outlier if one is added (the median does as well, but to a lesser extent)
(b) Robustness: The median is less affected by outliers and is thus robust.
3. Spread (variability)
(a) Range: By definition, this is $x_{(n)} - x_{(1)}$

(b) IQR (Interquartile range): The middle half of your data
i. We first define $posn(a)$ as the function which returns the index of the statistic $a$ in a data set. The value of Q1 is defined as the median of the data set $x_{(1)}, x_{(2)}, \ldots, x_{(posn(Q2)-1)}$ and Q3 is the median of the data set $x_{(posn(Q2)+1)}, x_{(posn(Q2)+2)}, \ldots, x_{(n)}$
ii. The IQR of a data set is defined to be the difference between Q3 and Q1. That is, $IQR = Q3 - Q1$
iii. A box plot is a visual representation of this. The whiskers of a box plot extend from the box toward the extremes of the data and are bounded by the lower and upper fences. The lower fence, $LL$, has a value of $LL = Q1 - 1.5(IQR)$ and the upper fence, $UL$, a value of $UL = Q3 + 1.5(IQR)$. We say that any value that is less than $LL$ or greater than $UL$ is an outlier.

Here, we have a visualization of a box plot, courtesy of Wikipedia:

[Figure 1.2: Box Plots (from Wikipedia)]

(c) Variance: For a sample, the variance is defined to be
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
which is called the sample variance.
(d) Standard Deviation: For a sample, the standard deviation is just the square root of the variance:
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
which is called the sample standard deviation.
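The center and spread measures above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the data set are made up for the example, and Q1 and Q3 follow the course algorithm of taking the median of the values strictly below and strictly above the position of Q2.

```python
import math


def median(sorted_values):
    """Median of an already sorted list (the Q2 algorithm)."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def summarize(data):
    """Return center and spread measures for a list of numbers."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    q2 = median(xs)
    # Q1 and Q3 are the medians of the values below and above the
    # position of Q2; for odd n the middle value itself is excluded.
    q1 = median(xs[:half])
    q3 = median(xs[half + 1:] if n % 2 == 1 else xs[half:])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    mean = sum(xs) / n
    variance = sum((x - mean) ** 2 for x in xs) / (n - 1)
    return {
        "mean": mean, "median": q2, "q1": q1, "q3": q3, "iqr": iqr,
        "range": xs[-1] - xs[0],
        "fences": (lower_fence, upper_fence),
        "variance": variance, "std": math.sqrt(variance),
        "outliers": [x for x in xs if x < lower_fence or x > upper_fence],
    }


# Hypothetical data set with one suspicious value.
summary = summarize([2, 4, 4, 5, 6, 7, 8, 9, 30])
print(summary["median"], summary["iqr"], summary["outliers"])
```

In this made-up data set the value 30 lies above the upper fence, so it is flagged as an outlier; it also pulls the mean well above the median, which is the robustness point made above.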

1.5 Conclusion

In the conclusion, there are only two aspects of the study that you need to be concerned about:
1. Did you answer your problem?
2. Talk about limitations (i.e. study errors, sample errors)

2 Measurement Analysis

The goal of measures is to explain how far our data is spread out and the relationship between data points.

2.1 Measurements of Spread

The goal of the standard deviation is to approximate the average distance a point is from the mean. Here are some other methods that we could use for standard deviation and why they fail:
1. $\frac{\sum_{i=1}^{n} (x_i - \bar{x})}{n}$ does not work because it is always equal to 0
2. $\frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$ works but we cannot do anything mathematically significant to it
3. $\sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$ is a good guess, but note that $\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \le \sum_{i=1}^{n} |x_i - \bar{x}|$
(a) To fix this, we use $n - 1$ instead of $n$ (proof comes later on)

Proposition 2.1. An interesting identity is the following:
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1}}$

Proof. Exercise for the reader.

Definition 2.1. Coefficient of Variation (CV)
This measure provides a unit-less measurement of spread:
$CV = \frac{s}{\bar{x}} \times 100\%$
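A quick numerical check of Proposition 2.1 and of the CV definition can make both more concrete. The sketch below is illustrative only; the function names and the four data values are assumptions chosen so the arithmetic is easy to follow by hand.

```python
import math


def sample_std(xs):
    """Sample standard deviation from the definition (divides by n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))


def sample_std_shortcut(xs):
    """Same quantity via the identity sum(x_i^2) - n * mean^2."""
    n = len(xs)
    mean = sum(xs) / n
    return math.sqrt((sum(x * x for x in xs) - n * mean ** 2) / (n - 1))


def coefficient_of_variation(xs):
    """CV = s / mean * 100, expressed as a percentage."""
    return sample_std(xs) / (sum(xs) / len(xs)) * 100


data = [4.0, 7.0, 13.0, 16.0]  # illustrative values only
print(sample_std(data), sample_std_shortcut(data))
print(coefficient_of_variation(data))
```

Here the mean is 10 and the squared deviations sum to 90, so both routes give $s^2 = 30$. A numerical check is not a proof, so the exercise above still stands.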

2.2 Measurements of Association

Here, we examine a few interesting measures which test the relationship between two random variables $X$ and $Y$.

1. Covariance: In theory (a population), the covariance is defined as
$Cov(X, Y) = E((X - \mu_X)(Y - \mu_Y))$
but in practice (in samples) it is defined as
$s_{XY} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$.
Note that $Cov(X, Y), s_{XY} \in \mathbb{R}$ and both give us an idea of the direction of the relationship but not the magnitude.

2. Correlation: In theory (a population), the correlation is defined as
$\rho_{XY} = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}$
but in practice (in samples) it is defined as
$r_{XY} = \frac{s_{XY}}{s_X s_Y}$.
Note that $-1 \le \rho_{XY}, r_{XY} \le 1$ and both give us an idea of the direction of the relationship AND the magnitude.
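The two sample formulas can be sketched as follows. This is a minimal illustration, not course code: the function names and the `hours` and `marks` values are hypothetical. Rescaling one variable changes the covariance but leaves the correlation alone, which is why only the correlation speaks to magnitude.

```python
import math


def sample_covariance(xs, ys):
    """s_XY: sum of products of deviations, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def sample_correlation(xs, ys):
    """r_XY = s_XY / (s_X * s_Y)."""
    s_x = math.sqrt(sample_covariance(xs, xs))
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)


# Hypothetical paired observations.
hours = [1, 2, 3, 4, 5]
marks = [52, 55, 61, 70, 72]

print(sample_covariance(hours, marks), sample_correlation(hours, marks))

# Measuring hours in different units rescales the covariance only.
minutes = [60 * h for h in hours]
print(sample_covariance(minutes, marks), sample_correlation(minutes, marks))
```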

(a) An interpretati interpretation on of the values is as follows: follows: i. r XY XY

| ii. |r iii. |r iv. |r v. |r

XY XY XY XY XY XY XY XY

|≈1 |=1 |>1 | 30, the student’s t is almost identical to the normal distribution v v

with mean 0 and variance 1

• For v 30, T is very close to a uniform distribution with thick tails and very even, unpronounced center

(From here on out, the content will be supplemental to the course notes and will only serve as a review)

6.2 Least Squares Method

There are two ways to use this method. First, for a given model $Y$ and parameter $\theta$, suppose that we get a best fit $\hat{y}$ and define the residuals $\hat{\epsilon}_i = \hat{y}_i - y_i$. The least squares approach then proceeds in either of the following two ways:

1. (Algebraic) Define $W = \sum_{i=1}^{n} \hat{\epsilon}_i^2$. Calculate $\frac{\partial W}{\partial \theta}$ and set it to zero to minimize $W$ and determine $\theta$.

2. (Geometric) Define $W = \sum_{i=1}^{n} \hat{\epsilon}_i^2 = \hat{\epsilon}^t \hat{\epsilon}$. Note that $\hat{\epsilon} \perp span\{\vec{1}, \vec{x}\}$ and so $\hat{\epsilon}^t \vec{1} = 0$ and $\hat{\epsilon}^t \vec{x} = 0$. Use these equations to determine $\theta$.
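To see the geometric conditions at work, the sketch below fits a straight line $y = \alpha + \beta x$ (an assumed model for illustration; the notes leave the model general) and then checks that the residual vector is orthogonal to both $\vec{1}$ and $\vec{x}$. The closed forms for $\alpha$ and $\beta$ are the solution of those two orthogonality equations. The function name and data values are hypothetical.

```python
def fit_line(xs, ys):
    """Least squares estimates (alpha, beta) for y = alpha + beta * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta = s_xy / s_xx
    alpha = y_bar - beta * x_bar
    return alpha, beta


# Hypothetical data.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 4.0, 8.0]

alpha, beta = fit_line(xs, ys)
# The sign convention of the residuals does not affect orthogonality.
residuals = [y - (alpha + beta * x) for x, y in zip(xs, ys)]

# Both inner products should vanish up to rounding error.
dot_with_ones = sum(residuals)
dot_with_x = sum(r * x for r, x in zip(residuals, xs))
print(alpha, beta, dot_with_ones, dot_with_x)
```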

7 Intervals

Here, we deviate from the order of lectures and focus on the various types of constructed intervals.

7.1 Confidence Intervals

Suppose that $\tilde{\theta} \sim N(\theta, Var(\tilde{\theta}))$. Find $L, U$, equidistant from $\theta$, such that $P(L < \tilde{\theta} < U) = 1 - \alpha$, where $\alpha$ is known as our margin of error. Usually this value is 0.05 by convention, unless otherwise specified. Normalizing $\tilde{\theta}$, we get that
$(L, U) = \left(\theta - c\sqrt{Var(\tilde{\theta})},\ \theta + c\sqrt{Var(\tilde{\theta})}\right)$
where $c$ is the value such that $1 - \alpha = P(-c < Z < c)$ and $Z \sim N(0, 1)$. We call this interval, $\theta \pm c\sqrt{Var(\tilde{\theta})}$, a probability interval.

Usually, though, we don't know $\theta$, so we replace it with our best estimate $\hat{\theta}$ to get $\hat{\theta} \pm c\sqrt{Var(\tilde{\theta})}$, which is called a $100(1 - \alpha)\%$ confidence interval. If $Var(\tilde{\theta})$ is known, then $C \sim N(0, 1)$ and if it is unknown, we replace $Var(\tilde{\theta})$ with $\widehat{Var}(\tilde{\theta})$ and $C \sim t_{n-q}$. Another compact notation for confidence intervals is $EST \pm c\,SE$. Note that the confidence interval says that we are 95% confident in our result, but this is a statement about the procedure rather than about any single interval. When we say that we are 95% confident, it means that given 100 randomly selected samples, we expect that in 95 of the 100 confidence intervals constructed from the samples, the population parameter will be in those intervals.
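The $EST \pm c\,SE$ recipe for the known-variance case can be sketched with Python's standard library. The function name and the numbers (a sample mean of 10.4 from $n = 25$ observations with known $\sigma = 2$) are assumptions made up for illustration; the unknown-variance case would need a $t_{n-q}$ quantile instead, which the standard library does not provide.

```python
import math
from statistics import NormalDist


def confidence_interval(estimate, standard_error, alpha=0.05):
    """Return EST +/- c * SE, where c satisfies P(-c < Z < c) = 1 - alpha."""
    c = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate - c * standard_error, estimate + c * standard_error


# Hypothetical setting: the estimator is the sample mean of n observations
# with known standard deviation sigma, so SE = sigma / sqrt(n).
sigma = 2.0
n = 25
sample_mean = 10.4

low, high = confidence_interval(sample_mean, sigma / math.sqrt(n))
print(low, high)
```

With $\alpha = 0.05$ the critical value $c$ is about 1.96, so the interval is roughly the estimate plus or minus two standard errors.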

7.2 Predicting Intervals

A predicting interval is an extension of a model $Y$, by adding the subscript $p$. The model is usually of the form $Y_p = f(\tilde{\theta}) + \epsilon_p$ and the interval of the form $EST \pm c\,SE = f(\hat{\theta}) \pm c\sqrt{Var(Y_p)}$. Note that $Y_p$ is different from a standard model $Y_i = f(\theta) + \epsilon_i$ in that the first component is random.

7.3 Likelihood Intervals

Recall the likelihood function $L(\theta, y_i) = \prod_{i=1}^{n} f(y_i, \theta)$ and note that $L(\hat{\theta})$ is the largest value of the likelihood function by MLE. We define the relative likelihood function as
$R(\theta) = \frac{L(\theta)}{L(\hat{\theta})}, \quad 0 \le R(\theta) \le 1$.

From a theorem in a future statistics course (STAT 330), solving $R(\theta) \approx 0.1$ for $\theta$ (usually using the quadratic formula) gives the endpoints of an approximate 95% confidence interval. This interval is called the likelihood interval, and one of its main advantages over the confidence interval is that it does not require that the model be normal.
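As a rough illustration of a likelihood interval, the sketch below uses a binomial model with 12 successes in 40 trials (both numbers hypothetical) and finds the values of the success probability whose relative likelihood is at least 0.1. It searches a grid instead of solving $R(\theta) = 0.1$ algebraically, which is a substitution made here for simplicity; the function names are assumptions.

```python
def relative_likelihood(p, successes, n):
    """R(p) = L(p) / L(p_hat) for a binomial model, where p_hat = successes / n."""
    p_hat = successes / n

    def likelihood(q):
        # The binomial coefficient cancels in the ratio, so it is omitted.
        return q ** successes * (1 - q) ** (n - successes)

    return likelihood(p) / likelihood(p_hat)


def likelihood_interval(successes, n, level=0.1, grid_size=10_000):
    """Smallest and largest grid values of p with R(p) >= level."""
    grid = [i / grid_size for i in range(1, grid_size)]
    inside = [p for p in grid if relative_likelihood(p, successes, n) >= level]
    return inside[0], inside[-1]


# Hypothetical data: 12 successes in 40 trials, so the MLE is 0.3.
low, high = likelihood_interval(successes=12, n=40)
print(low, high)
```

The resulting interval need not be symmetric about the MLE, which reflects the fact that no normality assumption is being imposed.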

8 Hypothesis Testing

While our confidence interval does not tell us in a yes or no way whether or not a statistical estimate is true, a hypothesis test does. Here are the steps:

1. State the hypothesis, $H_0 : \theta = \theta_0$ (this is only an example), called the null hypothesis ($H_1$ is called the alternative hypothesis and states a statement contrary to the null hypothesis).

2. Calculate the discrepancy (also called the test statistic), denoted by
$d = \frac{\hat{\theta} - \theta_0}{\sqrt{Var(\tilde{\theta})}} = \frac{\text{estimate} - H_0\ \text{value}}{SE}$
assuming that $\tilde{\theta}$ is unbiased. Here $d$ is the realization of a random variable $D$, which is $N(0, 1)$ if $Var(\tilde{\theta})$ is known and $t_{n-q}$ otherwise. Note that $d$ is the number of standard deviations $\theta_0$ is from $\hat{\theta}$.

3. Calculate a p-value given by $p = 2P(D > |d|)$. It is the probability that one sees a value at least as extreme as $\hat{\theta}$, given that the null hypothesis is true. The smaller the p-value, the more evidence there is against the null hypothesis.

4. Reject or do not reject (note that we do not "accept" the model)

The following table subjectively describes interpretations for p-values:

P-value                 Interpretation
p-value < …             A ton of evidence against $H_0$
… ≤ p-value < …         A lot of evidence against $H_0$
… ≤ p-value ≤ 10%       Some evidence against $H_0$
p-value > 10%           There is virtually no evidence against $H_0$

Note that one model for which $D \sim N(0, 1)$ is $Y_i = \epsilon_i$ where $\epsilon_i \sim Bin(1, \Pi)$, since $Var(\tilde{\theta}) = \frac{\hat{\Pi}_0(1 - \hat{\Pi}_0)}{n}$ by our null hypothesis and the central limit theorem.
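The four steps can be sketched for the binary model just mentioned. This is an illustration under assumptions: the function name and the data (60 successes in 100 trials, tested against a null value of 0.5) are made up, and the standard error is computed from the null value of the proportion, relying on the central limit theorem for the normal approximation.

```python
import math
from statistics import NormalDist


def proportion_test(successes, n, pi_0):
    """Discrepancy d and two-sided p-value for H0: Pi = pi_0."""
    pi_hat = successes / n
    # Under H0 the variance of the estimator is pi_0 * (1 - pi_0) / n.
    se = math.sqrt(pi_0 * (1 - pi_0) / n)
    d = (pi_hat - pi_0) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(d)))
    return d, p_value


# Hypothetical data: 60 successes in 100 trials, null value 0.5.
d, p_value = proportion_test(successes=60, n=100, pi_0=0.5)
print(d, p_value)
```

Here the estimate sits two standard errors from the null value, and the p-value is then read against the interpretation table above.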

9 Comparative Models

The goal of a comparative model is to compare the mean of two groups and determine if there is a causal relationship between one and the other.

Definition 9.1. If $x$ causes $y$ and there is some variate $z$ that is common between the two, then we say $z$ is a confounding variable because it gives the illusion that $z$ causes $y$. It is also sometimes called a lurking variable.

There are two main models that help determine if one variate causes another and they are the following.

Experimental Study
1. For every unit in the T.P., set the F.E.V. (focal explanatory variate) to level 1
2. We measure the attribute of interest
3. Repeat 1 and 2 but with the F.E.V. set to level 2
4. Only the F.E.V. changes and every other explanatory variate is fixed
5. If the attribute changes between the level 1 and level 2 measurements, then causation occurs

Problems?
• We cannot sample the whole T.P.
• It is not possible to keep all explanatory variates fixed
• The attributes change (on average)

Observational Study
1. First, observe an association between $x$ and $y$ in many places, settings, types of studies, etc.
2. There must be a reason for why $x$ causes $y$ (either scientifically or logically)
3. There must be a consistent dose relationship
4. The association has to hold when other possible variates are held fixed

10 Experimental Design

There are three main tools that are used by statisticians to improve experimental design.

1. Replication
(a) Simply put, we increase the sample size
i. This is to decrease the variance of confidence intervals, which improves accuracy
2. Randomization
(a) We select units in a random manner (i.e. if there are 2+ groups, we try to randomly assign units into the groups)
i. This is to reduce bias and create a more representative sample
ii. It allows us to assume independence between the $Y_i$'s
iii. It reduces the chance of confounding by unknown explanatory variates
3. Pairing
(a) In an experimental study, we call it blocking and in an observational study, we call it matching
(b) The actual process is just matching units by their explanatory variates; matched units are called twins
i. For example, in a group of 500 twins, grouped by gender and age, used to test a vaccine, one of the twins in each pair will take the vaccine and the other will take a placebo
ii. We do this in order to reduce the chance of confounding due to known explanatory variates
(c) Pairing also allows us to perform subtraction between twins to compare certain attributes of the population
i. Note that taking differences does not change the variability of the difference distribution

11 Model Assessment

We usually want the following four assumptions to be true when constructing a model $Y_i = f(\theta) + \epsilon_i$ to fit a sample. We also use certain statistical tools to measure how well our model fits these conditions. Note that these tools/tests require subjective observation.

• $\epsilon_i \sim N(0, \sigma^2)$
– Why? Because $Y_i$ is not normal if $\epsilon_i$ is not normal. However, $\tilde{\theta}$ is still likely to be normal by CLT.
– Tests:
∗ Histogram of residuals $\hat{\epsilon}_i = \hat{y}_i - y_i$ (should be bell-shaped)
∗ QQ Plot, which is the plot of theoretical quantiles versus sample quantiles (should be linear with intercept ~0)
· Usually a little variability at the tails of the line is okay

• $E(\epsilon_i) = 0$, $Var(\epsilon_i) = \sigma^2$, $\epsilon_i$'s are independent
– Why? All of our models, tests and estimates depend on this.
– Tests:
∗ Scatter plot of residuals (y-axis) versus fitted values (x-axis)
· We hope that it is centered on 0, has no visible pattern and that the data is bounded by two parallel lines (constant variance)
· If there is not a constant variance, such as a funnel (funnel effect), we usually transform the fitted values (e.g. $y \to \ln y$)
· If the plot seems periodic, we will need a new model (STAT 371/372)
∗ Scatter plot of fitted values (y-axis) versus explanatory variates (x-axis)
· This is used mainly in regression models
· We hope to see the same conditions as in the previous scatter plot

12 Chi-Squared Test

The purpose of a Chi-squared test is to determine if there is an association between two random variables $X, Y$, given that they both contain only counting data. The following are the steps:

1. State the null hypothesis as $H_0$: $X$ and $Y$ are not associated.

2. If there are $m$ possible observations for $X$ and $n$ possible observations for $Y$, then define
$d = \sum_{j=1}^{n} \sum_{i=1}^{m} \frac{(\text{expected} - \text{observed})^2}{\text{expected}} = \sum_{j=1}^{n} \sum_{i=1}^{m} \frac{(e_{ij} - o_{ij})^2}{e_{ij}}$
where $e_{ij} = P(X = x_i) \cdot P(Y = y_j)$, $o_{ij} = P(X = x_i, Y = y_j)$, for $i = 1, \ldots, m$ and $j = 1, \ldots, n$.

3. Assume that $D \sim \chi^2_{(m-1)(n-1)}$.

4. Calculate the p-value, which in this case is $Pr(D > d)$ since $d \ge 0$, which means we are conducting a one-tailed hypothesis test.

5. Interpret it as always (see the table in Section 8)
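The steps above can be sketched on a small table of counts. Two assumptions are made here that go beyond the text: the table holds observed counts (not probabilities), with expected counts taken as row total times column total divided by the grand total, and the p-value helper only covers one degree of freedom, where $P(D > d) = P(|Z| > \sqrt{d})$ can be written with the complementary error function. The function names and the 2 by 2 table are hypothetical.

```python
import math


def chi_squared_statistic(table):
    """Return (d, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    d = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if X and Y are not associated.
            expected = row_totals[i] * col_totals[j] / total
            d += (expected - observed) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return d, df


def p_value_one_df(d):
    """P(D > d) for D ~ chi-squared with 1 degree of freedom."""
    return math.erfc(math.sqrt(d / 2))


# Hypothetical 2 x 2 table of counts.
table = [[30, 20], [20, 30]]
d, df = chi_squared_statistic(table)
p_value = p_value_one_df(d)
print(d, df, p_value)
```

For larger tables the statistic and degrees of freedom are computed the same way, but the p-value would need a general chi-squared c.d.f. such as the one in Appendix A.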

References
[1] William Kong, Stochastic Seeker: Resources, Internet: http://stochasticseeker.wordpress.com, 2011.
[2] Paul Marriott, STAT 231/221 Winter 2012 Course Notes, University of Waterloo, 2009.

Appendix A

Chi-squared p.d.f. and c.d.f. on $k$ degrees of freedom ($\chi^2_k$):
$f(x, k) = \frac{1}{2^{k/2}\,\Gamma\left(\frac{k}{2}\right)} x^{\frac{k}{2} - 1} e^{-\frac{x}{2}}$
$F(x, k) = \frac{1}{\Gamma\left(\frac{k}{2}\right)} \gamma\left(\frac{k}{2}, \frac{x}{2}\right)$

Student's t p.d.f. and c.d.f. on $v$ degrees of freedom ($t_v$):
$f(x, v) = \frac{\Gamma\left(\frac{v+1}{2}\right)}{\sqrt{v\pi}\,\Gamma\left(\frac{v}{2}\right)} \left(1 + \frac{x^2}{v}\right)^{-\frac{v+1}{2}}$
$F(x, v) = \frac{1}{2} + x\,\Gamma\left(\frac{v+1}{2}\right) \cdot \frac{{}_2F_1\left(\frac{1}{2}, \frac{v+1}{2}; \frac{3}{2}; -\frac{x^2}{v}\right)}{\sqrt{\pi v}\,\Gamma\left(\frac{v}{2}\right)}$

$\Gamma(x)$ denotes the regular gamma function and $\gamma(s, x)$ is the lower incomplete gamma function. ${}_2F_1$ denotes the hypergeometric function.

Index
aspect
attribute
biased
box plot
center
Central Limit Theorem
Chi-squared distribution
Chi-squared test
coefficient of variation
comparative model
confidence interval
confounding variable
correlation
covariance
data
  binary data
  continuous data
  counting data
  discrete data
  nominal data
  ordinal data
degrees of freedom
error
  sample error
  study error
estimates
estimator
experimental design
experimental study
funnel effect
histogram of residuals
hypothesis test
i.i.d.
IQR
likelihood function
likelihood interval
lurking variable
margin of error
maximum likelihood estimation
maximum likelihood test
mean
measurement system
median
mode
model assessment
observational study
outlier
pairing
parameter
population
  sample population
  sample
  target population
  unit
PPDAC
predicting interval
probability interval
QQ plot
randomization
range
relative likelihood function
relative-risk
replication
robustness
S.P.
sampling protocol
scatter plot of fitted values versus explanatory variates
scatter plot of residuals versus fitted values
shape
skewness
slope
spread
standard deviation
  sample standard deviation
statistic
statistical models
  discrete
  regression
  response
Student's t-distribution
study protocol
T.P.
unbiased
variance
  sample variance
variate
  explanatory variate
  response variates
