Stat 231 Coursenotes
STAT 231 (Winter 2012 - 1121) Honours Statistics
Prof. R. Metzger, University of Waterloo
LaTeXer: W. Kong
Last Revision: April 3, 2012
Table of Contents
1 PPDAC
1.1 Problem
1.2 Plan
1.3 Data
1.4 Analysis
1.5 Conclusion
2 Measurement Analysis
2.1 Measurements of Spread
2.2 Measurements of Association
3 Probability Theory
4 Statistical Models
4.1 Generalities
4.2 Types of Models
5 Estimates and Estimators
5.1 Motivation
5.2 Maximum Likelihood Estimation (MLE) Algorithm
5.3 Estimators
5.4 Biases in Statistics
6 Distribution Theory
6.1 Student t-Distribution
6.2 Least Squares Method
7 Intervals
7.1 Confidence Intervals
7.2 Predicting Intervals
7.3 Likelihood Intervals
8 Hypothesis Testing
9 Comparative Models
10 Experimental Design
11 Model Assessment
12 Chi-Squared Test
References
Appendix A
These notes are currently a work in progress, and as such may be incomplete or contain errors. [Source: lambertw.com]
List of Figures
1.1 Population Relationships
1.2 Box Plots (from Wikipedia)
Acknowledgments: Special thanks to Michael Baker and his LaTeX-formatted notes. They were the inspiration for the structure of these notes.
Abstract
The purpose of these notes is to provide a guide to the second year honours statistics course. The contents of this course are designed to satisfy the following objectives:
• To provide students with a basic understanding of probability, the role of variation in empirical problem solving and statistical concepts (to be able to critically evaluate, understand and interpret statistical studies reported in newspapers, internet and scientific articles).
• To provide students with the statistical concepts and techniques necessary to carry out an empirical study to answer relevant questions in any given area of interest.
The recommended prerequisites are Calculus I, II (one of Math 137,147 and one of Math 138, 148), and Probability (Stat 230). Readers should have a basic understanding of single-variable differential and integral calculus as well as a good understanding of probability theory.
1 PPDAC
PPDAC is a process or recipe used to solve statistical problems. It stands for: Problem / Plan / Data / Analysis / Conclusion
1.1 Problem
The problem step's job is to clearly define the
1. Goal or Aspect of the study
2. Target Population and Units
3. Unit's Variates
4. Attributes and Parameters
Definition 1.1. The target population is the set of animals, people or things about which you wish to draw conclusions. A unit is a singleton of the target population.
Definition 1.2. The sample population is a specified subset of the target population. A sample is a singleton of the sample population and a unit of the study population.
Example 1.1. If I were interested in the average age of all students taking STAT 231, then:
Unit = student of STAT 231
Target Population (T.P.) = students of STAT 231
Definition 1.3. A variate is a characteristic of a single unit in a target population and is usually one of the following:
1. Response variates - of interest in the study
2. Explanatory variates - why responses vary from unit to unit
(a) Known - variates that are known to cause the responses
i. Focal - known variates that divide the target population into subsets
(b) Unknown - variates that cause the responses but cannot be explained
Using Ex. 1.1 as a guide, we can think of the response variate as the age of a student and the explanatory variates as factors such as the age of the student, when and where they were born, their familial situations, educational background and intelligence. For an example of a focal variate, we could think of something along the lines of domestic vs. international students or male vs. female.
Definition 1.4. An attribute is a characteristic of a population which is usually denoted by a function of the response variate. It can have two other names, depending on the population studied:
• Parameter is used when studying populations
• Statistic is used when studying samples
• Attribute can be used interchangeably with the above
Definition 1.5. The aspect is the goal of the study and is generally one of the following:
1. Descriptive - describing or determining the value of an attribute
2. Comparative - comparing the attribute of two (or more) groups
3. Causative - trying to determine whether a particular explanatory variate causes a response to change
4. Predictive - to predict the value of a response variate using your explanatory variate
1.2 Plan
The job of the plan step is to accomplish the following:
1. Define the Study Protocol - this is the population that is actually studied and is NOT always a subset of the T.P.
2. Define the Sampling Protocol - the sampling protocol is used to draw a sample from the study population
(a) Some types of the sampling protocol include
i. Random sampling (ad verbatim)
ii. Judgment sampling (e.g. gender ratios)
iii. Volunteer sampling (ad verbatim)
iv. Representative sampling
A. a sample that matches the sample population (S.P.) in all important characteristics (i.e. the proportion of a certain important characteristic in the S.P. is the same in the sample)
(b) Generally, statisticians prefer random sampling
3. Define the Sample - ad verbatim
4. Define the measurement system - this defines the tools/methods used
A visualization of the relationship between the populations is below. In the diagram, T.P. is the target population and S.P. is the sample population.
Figure 1.1: Population Relationships (diagram relating the T.P., the S.P. and the sample, with the study error and the sample error labelled)
Example 1.2. Suppose that we wanted to compare the common sense between maths and arts students at the University of Waterloo. An experiment that we could do is take 50 arts students and 33 maths students from the University of Waterloo taking an introductory statistics course this term and have them write a statistics test (non-mathematical) and average the results for each group. In this situation we have:
• T.P. = All arts and maths students
• S.P. = Arts and maths students from the University of Waterloo taking an introductory statistics course this term
• Sample = 50 arts students and 33 maths students from the University of Waterloo taking an introductory statistics course this term
• Aspect = Comparative study
• Attribute(s): Average grade of arts and maths students (for T.P., S.P., and sample)
– Parameter for T.P. and S.P.
– Statistic for Sample
There are, however, some issues:
1. Sample units differ from S.P. units
2. S.P. units differ from T.P. units
3. We want to measure common sense, but the test is measuring statistical knowledge
4. Is it fair to put arts students in a maths course?
Example 1.3. Suppose that we want to investigate the effect of cigarettes on the incidence of lung cancer in humans. We can do this by purchasing mice online, randomly selecting 43 mice and letting them smoke 20 cigarettes a day. We then conduct an autopsy at the time of death to check if they have lung cancer. In this situation, we have:
• T.P. = All people who smoke
• S.P. = Mice bought online
• Sample = 43 selected online mice
• Aspect = Not Comparative
• Attribute:
– T.P. - proportion of smoking people with lung cancer
– S.P. - proportion of mice bought online with lung cancer
– Sample - proportion of sample with lung cancer
Like the previous example, there are some issues:
1. Mice and humans may react differently to cigarettes
2. We do not have a baseline (i.e. what does X% mean?)
3. Is having the mice smoke 20 cigarettes a day realistic?
Definition 1.6. Let $a(x)$ be defined as an attribute as a function of some population or sample $x$. We define the study error as
$$a(\text{T.P.}) - a(\text{S.P.}).$$
Unfortunately, there is no way to directly calculate this error and so its value must be argued. In our previous examples, one could say that the study error in Ex. 1.3 is higher than that of Ex. 1.2 due to the S.P. in Ex. 1.3 being drastically different from the T.P.
Definition 1.7. Similar to above, we define the sample error as
$$a(\text{S.P.}) - a(\text{sample}).$$
Although we may be able to calculate $a(\text{sample})$, $a(\text{S.P.})$ is not computable and, like the above, this value must be argued.
Remark 1.1. Note that if we use a random sample, we hope it is representative and that the above errors are minimized.
1.3 Data
This step involves the collecting and organizing of data and statistics.
Data Types
• Discrete Data: Simply put, there are "holes" between the numbers
• Continuous (CTS) Data: We assume that there are no "holes"
• Nominal Data: No order in the data
• Ordinal Data: There is some order in the data
• Binary Data: e.g. Success/failure, true/false, yes/no
• Counting Data: Used for counting the number of events
1.4 Analysis
This step involves analyzing our data set and making well-informed observations and analyses.
Data Quality
There are 3 factors that we look at:
1. Outliers: Data that is more extreme than their counterparts.
(a) Reasons for outliers?
i. Typos or data recording errors
ii. Measurement errors
iii. Valid outliers
(b) Without having been involved in the study from the start, it is difficult to tell which is which
2. Missing Data Points
3. Measurement Issues
Characteristics of a Data Set
Outliers could be found in any data set but these 3 always are:
1. Shape
(a) Numerical methods: Skewness and Kurtosis (not covered in this course)
(b) Empirical methods:
i. Bell-shaped: Symmetrical about a mean
ii. Skewed left (negative): densest area on the right
iii. Skewed right (positive): densest area on the left
iv. Uniform: even all around; straight line
2. Center (location)
(a) The "middle" of our data
i. Mode: statistic that asks which value occurs the most frequently
ii. Median (Q2): The middle data value
A. The definition in this course is an algorithm
B. We denote $n$ data values by $x_1, x_2, \ldots, x_n$ and the sorted data by $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ where $x_{(1)} \le x_{(2)} \le \ldots \le x_{(n)}$. If $n$ is odd then $Q2 = x_{\left(\frac{n+1}{2}\right)}$ and if $n$ is even, then $Q2 = \frac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}}{2}$
iii. Mean: The sample mean is $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
A. The sample mean moves in the direction of the outlier if an outlier is added (the median does as well, but with less of an effect)
(b) Robustness: The median is less affected by outliers and is thus robust.
3. Spread (variability)
(a) Range: By definition, this is $x_{(n)} - x_{(1)}$
(b) IQR (Interquartile range): The middle half of your data
i. We first define $posn(a)$ as the function which returns the index of the statistic $a$ in a data set. The value of Q1 is defined as the median of the data set $x_{(1)}, x_{(2)}, \ldots, x_{(posn(Q2)-1)}$ and Q3 is the median of the data set $x_{(posn(Q2)+1)}, x_{(posn(Q2)+2)}, \ldots, x_{(n)}$
ii. The IQR of a data set is defined to be the difference between Q3 and Q1. That is, $IQR = Q3 - Q1$
iii. A box plot is a visual representation of this. The whiskers of a box plot represent values that lie below or above the box, bounded by the lower and upper fences respectively. The lower fence, $LL$, is given by
$$LL = Q1 - 1.5(IQR)$$
and the upper fence, $UL$, by
$$UL = Q3 + 1.5(IQR)$$
We say that any value that is less than $LL$ or greater than $UL$ is an outlier.
Here, we have a visualization of a box plot, courtesy of Wikipedia:
Figure 1.2: Box Plots (from Wikipedia)
(c) Variance: For a sample population, the variance is defined to be
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$
which is called the sample variance.
(d) Standard Deviation: For a sample population, the standard deviation is just the square root of the variance:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$$
which is called the sample standard deviation.
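The centre and spread measures above can be sketched in a few lines of Python. This is only an illustration, not part of the course material: the function names and the sample data are made up, and for even n (where Q2 is not itself a data value) the sketch assumes that Q1 and Q3 are the medians of the lower and upper halves of the sorted data.

```python
import math

def median(sorted_xs):
    """Median of an already sorted list, using the odd/even rule."""
    n = len(sorted_xs)
    mid = n // 2
    if n % 2 == 1:
        return sorted_xs[mid]                      # x_((n+1)/2), 1-based
    return (sorted_xs[mid - 1] + sorted_xs[mid]) / 2

def summarize(xs):
    """Centre and spread measures for a list of at least two numbers."""
    s = sorted(xs)
    n = len(s)
    half = n // 2
    q2 = median(s)
    lower = s[:half]                               # values below the median position
    # Assumption: for even n, split the sorted data into two equal halves.
    upper = s[half + 1:] if n % 2 == 1 else s[half:]
    q1, q3 = median(lower), median(upper)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr                   # LL
    upper_fence = q3 + 1.5 * iqr                   # UL
    mean = sum(s) / n
    variance = sum((x - mean) ** 2 for x in s) / (n - 1)
    return {
        "median": q2, "q1": q1, "q3": q3, "iqr": iqr,
        "range": s[-1] - s[0],
        "outliers": [x for x in s if x < lower_fence or x > upper_fence],
        "mean": mean, "variance": variance, "sd": math.sqrt(variance),
    }

# Hypothetical data with one extreme value.
summary = summarize([2, 4, 4, 5, 7, 9, 30])
print(summary)
```

In this made-up data set the value 30 lies above the upper fence, so it is flagged as an outlier; it also pulls the mean well above the median, which illustrates the robustness remark.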
1.5 Conclusion
In the conclusion, there are only two aspects of the study that you need to be concerned about:
1. Did you answer your problem?
2. Talk about limitations (i.e. study errors, sample errors)
2 Measurement Analysis
The goal of measures is to explain how far our data is spread out and the relationship of data points.
2.1 Measurements of Spread
The goal of the standard deviation is to approximate the average distance a point is from the mean. Here are some other methods that we could use for standard deviation and why they fail:
1. $\frac{\sum_{i=1}^{n} (x_i - \bar{x})}{n}$ does not work because it is always equal to 0
2. $\frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$ works but we cannot do anything mathematically significant to it
3. $\sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$ is a good guess, but note that $\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \le \sum_{i=1}^{n} |x_i - \bar{x}|$
(a) To fix this, we use $n - 1$ instead of $n$ (proof comes later on)
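A quick numerical check of the three candidate measures, as a sketch on made-up data (the variable names are illustrative only):

```python
import math

xs = [3.0, 5.0, 8.0, 12.0]                  # hypothetical data
n = len(xs)
mean = sum(xs) / n
deviations = [x - mean for x in xs]

avg_deviation = sum(deviations) / n          # candidate 1: always 0
avg_abs_deviation = sum(abs(d) for d in deviations) / n   # candidate 2
sum_sq = sum(d * d for d in deviations)
sd_divide_n = math.sqrt(sum_sq / n)          # candidate 3
sd_divide_n_minus_1 = math.sqrt(sum_sq / (n - 1))          # the fix in (a)

# Shortcut form of the same sum of squares: sum(x_i^2) - n * mean^2
sum_sq_shortcut = sum(x * x for x in xs) - n * mean ** 2

print(avg_deviation, avg_abs_deviation, sd_divide_n, sd_divide_n_minus_1)
```

Dividing by n - 1 always gives a slightly larger value than dividing by n, and the gap shrinks as n grows.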
Proposition 2.1. An interesting identity is the following:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1}}$$
Proof. Exercise for the reader.
Definition 2.1. Coefficient of Variation (CV). This measure provides a unit-less measurement of spread:
$$CV = \frac{s}{\bar{x}} \times 100\%$$
2.2 Measurements of Association
Here, we examine a few interesting measures which test the relationship between two random variables $X$ and $Y$.
1. Covariance: In theory (a population), the covariance is defined as
$$Cov(X, Y) = E((X - \mu_X)(Y - \mu_Y))$$
but in practice (in samples) it is defined as
$$s_{XY} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1}.$$
Note that $Cov(X, Y), s_{XY} \in \mathbb{R}$ and both give us an idea of the direction of the relationship but not the magnitude.
2. Correlation: In theory (a population), the correlation is defined as
$$\rho_{XY} = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}$$
but in practice (in samples) it is defined as
$$r_{XY} = \frac{s_{XY}}{s_X s_Y}.$$
Note that $-1 \le \rho_{XY}, r_{XY} \le 1$ and both give us an idea of the direction of the relationship AND the magnitude.
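The sample versions of these two measures can be sketched as follows. The helper names and data are hypothetical; the example rescales one variable to show why the covariance alone does not indicate magnitude while the correlation does.

```python
import math

def sample_cov(xs, ys):
    """Sample covariance s_XY with the n - 1 divisor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def sample_corr(xs, ys):
    """Sample correlation r_XY = s_XY / (s_X * s_Y)."""
    s_x = math.sqrt(sample_cov(xs, xs))      # s_XX is the sample variance of x
    s_y = math.sqrt(sample_cov(ys, ys))
    return sample_cov(xs, ys) / (s_x * s_y)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]                        # perfectly linear in xs
ys_rescaled = [100 * y for y in ys]          # same relationship, different units

cov_original = sample_cov(xs, ys)
cov_rescaled = sample_cov(xs, ys_rescaled)
corr_original = sample_corr(xs, ys)
corr_rescaled = sample_corr(xs, ys_rescaled)
print(cov_original, cov_rescaled, corr_original, corr_rescaled)
```

Changing the units of y multiplies the covariance by 100 but leaves the correlation unchanged.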
(a) An interpretation of the values is as follows:
[…]
• For $v > 30$, the Student's t is almost identical to the normal distribution with mean 0 and variance 1
• For $v < 30$, T is very close to a uniform distribution with thick tails and a very even, unpronounced center
(From here on out, the content will be supplemental to the course notes and will only serve as a review)
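6.2 Least Squares Method
There are two ways to use this method. First, for a given model $Y$ and parameter $\theta$, suppose that we get a best fit $\hat{y}$ and define $\hat{\epsilon}_i = |\hat{y}_i - y_i|$. The least squares approach is through either of the two:
1. (Algebraic) Define $W = \sum_{i=1}^{n} \hat{\epsilon}_i^2$. Calculate $\frac{\partial W}{\partial \theta}$ and set it to zero to minimize $W$ and determine $\theta$.
2. (Geometric) Define $W = \sum_{i=1}^{n} \hat{\epsilon}_i^2 = \hat{\epsilon}^t \hat{\epsilon}$. Note that $W$ is minimized when the residual vector $\hat{\epsilon}$ (taken with signs, $\hat{\epsilon}_i = y_i - \hat{y}_i$) satisfies $\hat{\epsilon} \perp span\{\vec{1}, \vec{x}\}$, and so $\hat{\epsilon}^t \vec{1} = 0$ and $\hat{\epsilon}^t \vec{x} = 0$. Use these equations to determine $\theta$.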
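As an illustration of both routes, the sketch below assumes a simple linear model $Y_i = \alpha + \beta x_i + \epsilon_i$ (an assumption suggested by the span of $\vec{1}$ and $\vec{x}$; the notes do not fix a model here). It computes the estimates from the closed form obtained by setting the partial derivatives of W to zero, then checks the two orthogonality conditions numerically. The data and function name are made up.

```python
def fit_line(xs, ys):
    """Least squares estimates (alpha, beta) for y = alpha + beta * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    beta = s_xy / s_xx                 # from dW/d(beta) = 0
    alpha = mean_y - beta * mean_x     # from dW/d(alpha) = 0
    return alpha, beta

xs = [1.0, 2.0, 3.0, 4.0]              # hypothetical explanatory variate
ys = [2.1, 3.9, 6.2, 7.8]              # hypothetical responses
alpha, beta = fit_line(xs, ys)

# Signed residuals; at the minimum they are orthogonal to 1 and to x.
residuals = [y - (alpha + beta * x) for x, y in zip(xs, ys)]
dot_with_ones = sum(residuals)
dot_with_x = sum(r * x for r, x in zip(residuals, xs))
print(alpha, beta, dot_with_ones, dot_with_x)
```

Both dot products come out as zero up to floating-point rounding, which is the geometric condition restated numerically.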
7 Intervals
Here, we deviate from the order of lectures and focus on the various types of constructed intervals.
7.1 Confidence Intervals
Suppose that $\tilde{\theta} \sim N(\theta, Var(\tilde{\theta}))$. Find $L, U$, equidistant from $\theta$, such that $P(L < \tilde{\theta} < U) = 1 - \alpha$ where $\alpha$ is known as our margin of error. Usually this value is 0.05 by convention, unless otherwise specified. Normalizing $\tilde{\theta}$, we get that
$$(L, U) = \left(\theta - c\sqrt{Var(\tilde{\theta})},\ \theta + c\sqrt{Var(\tilde{\theta})}\right)$$
where $c$ is the value such that $1 - \alpha = P(-c < Z < c)$ and $Z \sim N(0, 1)$. We call this interval, $\theta \pm c\sqrt{Var(\tilde{\theta})}$, a probability interval.
Usually, though, we don't know $\theta$, so we replace it with our best estimate $\hat{\theta}$ to get $\hat{\theta} \pm c\sqrt{Var(\tilde{\theta})}$, which is called a $100(1 - \alpha)\%$ confidence interval. If $Var(\tilde{\theta})$ is known, then $C \sim N(0, 1)$ and if it is unknown, we replace $Var(\tilde{\theta})$ with $\widehat{Var}(\tilde{\theta})$ and $C \sim t_{n-q}$. Another compact notation for confidence intervals is $EST \pm c\,SE$.
Note that the confidence interval says that we are 95% confident in our result, but it is not a probability statement about any single computed interval. When we say that we are 95% confident, it means that given 100 randomly selected trials, we expect that in 95 of the 100 confidence intervals constructed from the samples, the population parameter will be in those intervals.
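A minimal sketch of the known-variance case, using only the Python standard library. The data, the assumed known $\sigma$ and the function name are all hypothetical; the unknown-variance case would need a t quantile instead, which this sketch does not cover.

```python
import math
from statistics import NormalDist

def z_interval(estimate, std_error, alpha=0.05):
    """EST +/- c * SE, with c taken from the standard normal distribution."""
    c = NormalDist().inv_cdf(1 - alpha / 2)   # P(-c < Z < c) = 1 - alpha
    return estimate - c * std_error, estimate + c * std_error

ys = [4.8, 5.1, 5.3, 4.9, 5.4, 5.0, 5.2, 4.7, 5.1]   # hypothetical sample
sigma = 0.3                                           # assumed known
estimate = sum(ys) / len(ys)                          # sample mean as the estimate
std_error = sigma / math.sqrt(len(ys))
lower, upper = z_interval(estimate, std_error)
print(lower, upper)
```

With $\alpha = 0.05$ the value of c is roughly 1.96, so the interval is the estimate plus or minus about 1.96 standard errors.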
7.2 Predicting Intervals
A predicting interval is an extension of a model $Y$, by adding the subscript $p$. The model is usually of the form $Y_p = f(\tilde{\theta}) + \epsilon_p$ and the interval of the form $EST \pm c\,SE = f(\hat{\theta}) \pm c\sqrt{Var(Y_p)}$. Note that $Y_p$ is different from a standard model $Y_i = f(\theta) + \epsilon_i$ in that the first component is random.
7.3 Likelihood Intervals
Recall the likelihood function $L(\theta, y_i) = \prod_{i=1}^{n} f(y_i, \theta)$ and note that $L(\hat{\theta})$ is the largest value of the likelihood function by MLE. We define the relative likelihood function as
$$R(\theta) = \frac{L(\theta)}{L(\hat{\theta})}, \quad 0 \le R(\theta) \le 1.$$
From a theorem in a future statistics course (STAT 330), if $R(\theta) \approx 0.1$, then solving for $\theta$ (usually using the quadratic formula) will form an approximate 95% confidence interval. This interval is called the likelihood interval and one of its main advantages over the confidence interval is that it does not require that the model be normal.
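The idea can be sketched numerically. The example below assumes a binary model with success probability $\theta$ (so that the likelihood is $\theta^{s}(1-\theta)^{n-s}$ for s successes in n trials) and finds the interval by a simple grid search rather than by solving algebraically; the counts, the grid size and the function names are illustrative assumptions.

```python
import math

def relative_likelihood(theta, successes, n):
    """R(theta) = L(theta) / L(theta_hat) for a binary model.

    Assumes 0 < successes < n and 0 < theta < 1.
    """
    theta_hat = successes / n                      # MLE of theta
    def log_l(t):
        return successes * math.log(t) + (n - successes) * math.log(1 - t)
    return math.exp(log_l(theta) - log_l(theta_hat))

def likelihood_interval(successes, n, level=0.1, steps=10000):
    """Smallest and largest grid values of theta with R(theta) >= level."""
    grid = [i / steps for i in range(1, steps)]
    inside = [t for t in grid if relative_likelihood(t, successes, n) >= level]
    return inside[0], inside[-1]

# Hypothetical data: 30 successes in 100 trials.
lower, upper = likelihood_interval(30, 100)
print(lower, upper)
```

The resulting interval is not symmetric about the estimate 0.3, which reflects the fact that no normality assumption is being imposed.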
8 Hypothesis Testing
While our confidence interval does not tell us in a yes or no way whether or not a statistical estimate is true, a hypothesis test does. Here are the steps:
1. State the hypothesis, $H_0 : \theta = \theta_0$ (this is only an example), called the null hypothesis ($H_1$ is called the alternative hypothesis and makes a statement contrary to the null hypothesis).
2. Calculate the discrepancy (also called the test statistic), denoted by
$$d = \frac{\hat{\theta} - \theta_0}{\sqrt{Var(\tilde{\theta})}} = \frac{\text{estimate} - H_0\ \text{value}}{SE}$$
assuming that $\tilde{\theta}$ is unbiased. The random variable $D$, of which $d$ is a realization, is $N(0, 1)$ if $Var(\tilde{\theta})$ is known and $t_{n-q}$ otherwise. Note that $d$ is the number of standard deviations $\theta_0$ is from $\hat{\theta}$.
3. Calculate a p-value given by $p = 2P(D > |d|)$. It is also the probability that one sees a value worse than $\hat{\theta}$, given that the null hypothesis is true. The smaller the p-value, the more evidence there is against the model in order to reject.
4. Reject or not reject (note that we do not "accept" the model)
The following table subjectively describes interpretations for p-values:

p-value                 Interpretation
p-value < …             A ton of evidence against $H_0$
… ≤ p-value < …         A lot of evidence against $H_0$
… ≤ p-value ≤ 10%       Some evidence against $H_0$
p-value > 10%           There is virtually no evidence against $H_0$
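The steps above can be sketched for one concrete case. The example assumes binary data with $H_0 : \Pi = \Pi_0$ and a normal approximation for D, so that the standard error under the null hypothesis is $\sqrt{\Pi_0(1-\Pi_0)/n}$; the counts and the function name are made up for illustration.

```python
import math
from statistics import NormalDist

def proportion_test(successes, n, pi_0):
    """Return (d, p) for H0: proportion = pi_0, using a normal approximation."""
    estimate = successes / n
    std_error = math.sqrt(pi_0 * (1 - pi_0) / n)   # SE under the null hypothesis
    d = (estimate - pi_0) / std_error              # discrepancy
    p = 2 * (1 - NormalDist().cdf(abs(d)))         # p = 2 P(D > |d|)
    return d, p

# Hypothetical data: 60 successes in 100 trials, H0: proportion = 0.5.
d, p = proportion_test(60, 100, 0.5)
print(d, p)
```

Here the estimate sits two standard errors from the hypothesized value, and the small p-value counts as evidence against the null hypothesis.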
Note that one model for which $D \sim N(0, 1)$ is $Y_i = \epsilon_i$ where $\epsilon_i \sim Bin(1, \Pi)$ since
$$\sqrt{Var(\tilde{\theta})} = \sqrt{\frac{\Pi_0 (1 - \Pi_0)}{n}}$$
by our null hypothesis and central limit theorem.
9 Comparative Models
The goal of a comparative model is to compare the mean of two groups and determine if there is a causal relationship between one and the other.
Definition 9.1. If $x$ causes $y$ and there is some variate $z$ that is common between the two, then we say $z$ is a confounding variable because it gives the illusion that $z$ causes $y$. It is also sometimes called a lurking variable.
There are two main models that help determine if one variate causes another and they are the following.
Experimental Study
1. For every unit in the T.P., set the F.E.V. (focal explanatory variate) to level 1
2. We measure the attribute of interest
3. Repeat 1 and 2 but with the F.E.V. set to level 2
4. Only the F.E.V. changes and every other explanatory variate is fixed
5. If the attribute changes between steps 1 and 4, then causation occurs
Problems?
• We cannot sample the whole T.P.
• It is not possible to keep all explanatory variates fixed
• The attributes change (on average)
Observational Study
1. First, observe an association between $x$ and $y$ in many places, settings, types of studies, etc.
2. There must be a reason for why $x$ causes $y$ (either scientifically or logically)
3. There must be a consistent dose relationship
4. The association has to hold when other possible variates are held fixed
10 Experimental Design
There are three main tools that are used by statisticians to improve experimental design.
1. Replication
(a) Simply put, we increase the sample size
i. This is to decrease the variance of confidence intervals, which improves accuracy
2. Randomization
(a) We select units in a random manner (i.e. if there are 2+ groups, we try to randomly assign units into the groups)
i. This is to reduce bias and create a more representative sample
ii. It allows us to assume independence between the $Y_i$'s
iii. It reduces the chance of confounding by unknown explanatory variates
3. Pairing
(a) In an experimental study, we call it blocking and in an observational study, we call it matching
(b) The actual process is just matching units by their explanatory variates, and matched units are called twins
i. For example, in a group of 500 twins, grouped by gender and age, used to test a vaccine, one of the twins in each group will take the vaccine and the other will take a placebo
ii. We do this in order to reduce the chance of confounding due to known explanatory variates
(c) Pairing also allows us to perform subtraction between twins to compare certain attributes of the population
i. Note that taking differences does not change the variability of the difference distribution
11 Model Assessment
We usually want the following four assumptions to be true when constructing a model $Y_i = f(\theta) + \epsilon_i$ to fit a sample. We also use certain statistical tools to measure how well our model fits these conditions. Note that these tools/tests require subjective observation.
• $\epsilon_i \sim N(0, \sigma^2)$
– Why? Because $Y_i$ is not normal if $\epsilon_i$ is not normal. However, $\tilde{\theta}$ is still likely to be normal by CLT.
– Tests:
∗ Histogram of residuals $\hat{\epsilon}_i = |\hat{y}_i - y_i|$ (should be bell-shaped)
∗ QQ Plot, which is the plot of theoretical quantiles versus sample quantiles (should be linear with intercept ~0)
· Usually a little variability at the tails of the line is okay
• $E(\epsilon_i) = 0$, $Var(\epsilon_i) = \sigma^2$, $\epsilon_i$'s are independent
– Why? All of our models, tests and estimates depend on this.
– Tests:
∗ Scatter plot of residuals (y-axis) versus fitted values (x-axis)
· We hope that it is centered on 0, has no visible pattern and that the data is bounded by two parallel lines (constant variance)
· If there is not a constant variance, such as a funnel (funnel effect), we usually transform the fitted values (e.g. $y \to \ln y$)
· If the plot seems periodic, we will need a new model (STAT 371/372)
∗ Scatter plot of fitted values (y-axis) versus explanatory variates (x-axis)
· This is used mainly in regression models
· We hope to see the same conditions as in the previous scatter plot
12 Chi-Squared Test
The purpose of a Chi-squared test is to determine if there is an association between two random variables $X, Y$, given that they both contain only counting data. The following are the steps:
1. State the null hypothesis as $H_0$ : $X$ and $Y$ are not associated.
2. If there are $m$ possible observations for $X$ and $n$ possible observations for $Y$, then define
$$d = \sum_{j=1}^{n} \sum_{i=1}^{m} \frac{(\text{expected} - \text{observed})^2}{\text{expected}} = \sum_{j=1}^{n} \sum_{i=1}^{m} \frac{(e_{ij} - o_{ij})^2}{e_{ij}}$$
where $e_{ij} = P(X = x_i) \cdot P(Y = y_j)$, $o_{ij} = P(X = x_i, Y = y_j)$, for $i = 1, \ldots, m$ and $j = 1, \ldots, n$.
3. Assume that $D \sim \chi^2_{(m-1)(n-1)}$.
4. Calculate the p-value, which in this case is $Pr(D > d)$ since $d \ge 0$, which means we are conducting a one-tailed hypothesis.
5. Interpret it as always (see the table in Section 8)
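A sketch of steps 2 to 4 on a small table of counts. This assumes that $e_{ij}$ and $o_{ij}$ are taken as expected and observed counts (the probabilities scaled by the total number of units), with the marginal probabilities estimated from the row and column totals. The p-value line relies on the fact that a chi-squared variable on 1 degree of freedom is the square of a standard normal, so it only applies to a 2 by 2 table; the data and function name are hypothetical.

```python
import math
from statistics import NormalDist

def chi_squared_statistic(observed):
    """Return (d, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    d = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / total   # expected count under H0
            d += (e - o) ** 2 / e
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return d, df

observed = [[30, 20],
            [20, 30]]                                   # hypothetical 2 x 2 counts
d, df = chi_squared_statistic(observed)

# Valid only when df == 1: P(D > d) = 2 * (1 - Phi(sqrt(d))).
p = 2 * (1 - NormalDist().cdf(math.sqrt(d)))
print(d, df, p)
```

For larger tables the statistic and degrees of freedom are computed the same way, but the p-value needs the chi-squared c.d.f. on the appropriate degrees of freedom.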
References
[1] William Kong, Stochastic Seeker: Resources, Internet: http://stochasticseeker.wordpress.com, 2011.
[2] Paul Marriott, STAT 231/221 Winter 2012 Course Notes, University of Waterloo, 2009.
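Appendix A
Chi-squared p.d.f. and c.d.f. on $k$ degrees of freedom ($\chi^2_k$):
$$f(x, k) = \frac{1}{2^{k/2}\,\Gamma\left(\frac{k}{2}\right)}\, x^{\frac{k}{2} - 1} e^{-\frac{x}{2}}$$
$$F(x, k) = \frac{1}{\Gamma\left(\frac{k}{2}\right)}\, \gamma\left(\frac{k}{2}, \frac{x}{2}\right)$$
Student's t p.d.f. and c.d.f. on $v$ degrees of freedom ($t_v$):
$$f(x, v) = \frac{\Gamma\left(\frac{v+1}{2}\right)}{\sqrt{v\pi}\,\Gamma\left(\frac{v}{2}\right)} \left(1 + \frac{x^2}{v}\right)^{-\frac{v+1}{2}}$$
$$F(x, v) = \frac{1}{2} + x\,\Gamma\left(\frac{v+1}{2}\right) \cdot \frac{{}_2F_1\left(\frac{1}{2}, \frac{v+1}{2}; \frac{3}{2}; -\frac{x^2}{v}\right)}{\sqrt{\pi v}\,\Gamma\left(\frac{v}{2}\right)}$$
Here $\Gamma(x)$ denotes the regular gamma function, $\gamma(s, x)$ is the lower incomplete gamma function, and ${}_2F_1$ denotes the hypergeometric function.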
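The two density functions can be evaluated directly with the gamma function from the Python standard library. The sketch below covers only the p.d.f.s (the c.d.f.s need the incomplete gamma and hypergeometric functions, which the standard library does not provide); the function names are illustrative.

```python
import math

def chi2_pdf(x, k):
    """Chi-squared density on k degrees of freedom, for x > 0."""
    return x ** (k / 2 - 1) * math.exp(-x / 2) / (2 ** (k / 2) * math.gamma(k / 2))

def t_pdf(x, v):
    """Student's t density on v degrees of freedom."""
    coef = math.gamma((v + 1) / 2) / (math.sqrt(v * math.pi) * math.gamma(v / 2))
    return coef * (1 + x * x / v) ** (-(v + 1) / 2)

def normal_pdf(x):
    """Standard normal density, for comparison with the t density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# Compare the t density with the standard normal density at a few points.
for v in (1, 5, 30, 100):
    gap = max(abs(t_pdf(x, v) - normal_pdf(x)) for x in (0.0, 1.0, 2.0, 3.0))
    print(v, gap)
```

The gap between the two densities shrinks as v grows, in line with the remark that the t distribution is close to the standard normal for large degrees of freedom.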