CAS MA 115 • Statistics I • Fall 2010 • Lecture 7.1-7.3

Lecture 7.1

Statistics may be defined as "a body of methods for making wise decisions in the face of uncertainty." ~W.A. Wallis

1) Sampling Distribution for Population Proportion.
2) Confidence Intervals for Population Proportion.
3) Sample Size Determination.
The Sampling Distribution of the Sample Proportion

Test: Population Proportion
Population Parameter:
Sample Statistics:
Data: 1 Sample
Response: Categorical (YES/NO, Success/Failure, True/False)
Explanatory Variable: ________
Scenario: Is the true population proportion of adults who believe in life after death more than 70%?
Recall:

1. Normal Curve Approximation to Binomial distribution. Let X be a Binomial random variable with the number of trials n and the probability of a success p. When the sample size (or number of trials) n is large and both $np \ge 5$ and $n(1-p) \ge 5$, the distribution of X can be approximated with a normal distribution with mean $np$ and standard deviation $\sqrt{np(1-p)}$.

2. Sampling Distribution of the Sample Mean: If the parent population IS a normal distribution with a mean μ and a standard deviation σ, then for any sample size (small or large), the sample mean will have a normal distribution with mean μ and standard deviation $\sigma/\sqrt{n}$.

3. Central Limit Theorem: If the parent population is NOT a normal distribution but has a mean μ and a standard deviation σ, then for a large sample size (n≥30), the sample mean will have an approximately normal distribution with mean μ and standard deviation $\sigma/\sqrt{n}$.

New Result:

4. Sampling Distribution of the Sample Proportion: If both $np \ge 5$ and $n(1-p) \ge 5$, then $\hat{p}$ is approximately normally distributed with mean $p$ and standard deviation $\sqrt{p(1-p)/n}$.

If many samples of the same size n are taken and the sample size is large, then the distribution of possible values of $\hat{p}$ is approximately a normal curve distribution with mean equal to the true population proportion p and standard deviation $\sqrt{\frac{p(1-p)}{n}}$.
Facts: According to a spring 2009 poll of more than 2,200 college students across 40 colleges and universities:
• 85 percent of students reported feeling stressed on a daily basis.
• Academic concerns like school work and grades, with 77 percent and 74 percent respectively, maintain their positions as the top drivers of student stress, even over financial woes in today’s economy.
• Six out of 10 students report having felt so stressed they couldn’t get their work done on one or more occasions.
• Since starting college, over 70 percent of students have not considered talking to a counselor to help them deal with stress or other emotional issues.

Sources: Center for the Study of College Student Retention, 2008; American College Health Association’s Spring 2008 National College Health Assessment; and Associated Press “College Stress and Mental Health Poll,” Spring 2009.
Exercise: Assume that the true proportion of students who feel stressed on a daily basis is 85%.

(a) What is the probability that in a random poll of 100 students more than 50% are feeling stressed on a daily basis?
Option 1: Use the Normal Approximation to Binomial.
Option 2: Use the Sampling Distribution of the Sample Proportion.

(b) According to the MA 115 B1 Intro Survey, 52 out of 108 students reported feeling stressed at the beginning of the semester. What is the approximate probability of obtaining a sample proportion of 48% or lower, if the true proportion of all students who feel stressed during the semester is 85%?
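One possible way to check the arithmetic in this exercise is sketched below in Python, using only the standard library. The variable names are illustrative and not part of the notes, and Option 1 is computed without a continuity correction, which is an assumption on our part.

```python
from math import sqrt
from statistics import NormalDist

p = 0.85   # true proportion assumed in the exercise
n = 100    # poll size in part (a)

# Option 1: normal approximation to the Binomial count X (no continuity correction).
count_dist = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))
prob_option1 = 1 - count_dist.cdf(0.50 * n)      # P(X > 50)

# Option 2: sampling distribution of the sample proportion p_hat.
phat_dist = NormalDist(mu=p, sigma=sqrt(p * (1 - p) / n))
prob_option2 = 1 - phat_dist.cdf(0.50)           # P(p_hat > 0.50)

# Part (b): 52 of 108 students, still assuming p = 0.85.
n_b = 108
phat_b = 52 / n_b
se_b = sqrt(p * (1 - p) / n_b)
z_b = (phat_b - p) / se_b                        # standardized sample proportion
prob_b = NormalDist().cdf(z_b)                   # P(p_hat <= 0.48)

print(f"(a) Option 1: {prob_option1:.4f}, Option 2: {prob_option2:.4f}")
print(f"(b) z = {z_b:.2f}, probability = {prob_b:.2e}")
```

Both options standardize to the same z-score, so they should agree. In part (b) the z-score is far into the left tail, which suggests the survey result would be very unlikely if the true proportion were 85%.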
Exercise: A telephone poll was taken two days after the November 8, 1994, election. Of the 800 adults polled, 56% reported that they had voted. However, it was later reported in the press that, in fact, only 39% of American adults had voted. Suppose the 39% rate reported by the press is the correct population proportion. Also assume the responses of the 800 adults polled can be viewed as a random sample.
1. Describe and sketch the sampling distribution of $\hat{p}$ for a random sample of size n = 800 adults.
2. What is the approximate probability that a sample proportion who voted would be 56% or larger for a random sample of 800 adults?
3. Does it seem that the poll result of 56% simply reflects a sample that, by chance, voted with greater frequency than the general population?
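As a rough check on questions 1 and 2, the sketch below computes the mean and standard deviation of the sampling distribution of the sample proportion and the tail probability for 56%. It assumes, as the exercise does, that p = 0.39 is the true proportion; the variable names are ours.

```python
from math import sqrt
from statistics import NormalDist

p = 0.39       # population proportion reported by the press
n = 800        # adults polled
p_hat = 0.56   # proportion in the poll who said they voted

# The normal approximation applies because n*p and n*(1-p) are both at least 5.
assert n * p >= 5 and n * (1 - p) >= 5

se = sqrt(p * (1 - p) / n)          # standard deviation of p_hat
z = (p_hat - p) / se                # how many standard deviations 56% lies above 39%
prob = 1 - NormalDist().cdf(z)      # P(p_hat >= 0.56)

print(f"p_hat is approximately Normal(mean={p}, sd={se:.4f})")
print(f"z = {z:.2f}, P(p_hat >= 0.56) is about {prob:.2e}")
```

A sample proportion almost ten standard deviations above the mean is essentially impossible by chance alone, which bears on question 3.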
Confidence Intervals for Population Proportion

Recall: A confidence interval estimates a range of values for the population parameter with a predefined level of confidence (e.g., 95% confidence interval).

A confidence level of 100(1-α)% means that if we repeat the process of taking samples of the same size from the target population and producing 100(1-α)% confidence intervals for the population parameter, then about 100(1-α)% of these intervals will contain the parameter. (Note: α is used to denote the significance level.)

The basic structure for any confidence interval is:
point estimate ± margin of error = point estimate ± multiplier × standard error.
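The repeated-sampling interpretation can be illustrated with a small simulation. The sketch below is only an illustration under assumed values (true proportion 0.4, samples of size 100, 2000 repetitions, a fixed random seed). It builds the usual large-sample interval $\hat{p} \pm z\sqrt{\hat{p}(1-\hat{p})/n}$ for each simulated sample and counts how often the interval contains the true proportion.

```python
import random
from math import sqrt
from statistics import NormalDist

def coverage(p_true, n, conf_level, n_repeats, seed=0):
    """Fraction of simulated confidence intervals that contain p_true."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)   # multiplier, about 1.96 for 95%
    hits = 0
    for _ in range(n_repeats):
        # Draw one sample of n yes/no responses and count the successes.
        successes = sum(rng.random() < p_true for _ in range(n))
        p_hat = successes / n
        margin = z * sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - margin <= p_true <= p_hat + margin:
            hits += 1
    return hits / n_repeats

# Hypothetical settings chosen only for illustration.
rate = coverage(p_true=0.4, n=100, conf_level=0.95, n_repeats=2000)
print(f"Share of intervals containing the true proportion: {rate:.3f}")
```

The printed share should land near 0.95, though not exactly on it, because the interval is itself based on a normal approximation.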
The assumptions required for a CI for a population proportion to be valid:
• the data are a random sample from that population.
• the sample size n is large enough (check: $n\hat{p} \ge 5$ and $n(1-\hat{p}) \ge 5$).

Type of the Formula and Formula:

General Formula for Confidence Interval for the Population Proportion p:
$\hat{p} \pm Z_{1-\alpha/2} \cdot SE(\hat{p})$

Formula for Approximate Confidence Interval for the Population Proportion p:
$\hat{p} \pm Z_{1-\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$

Formula for Conservative Confidence Interval for the Population Proportion p (Note: here $\hat{p} = 0.5$ is used to compute the standard error):
$\hat{p} \pm Z_{1-\alpha/2} \sqrt{\frac{0.5(1-0.5)}{n}} = \hat{p} \pm Z_{1-\alpha/2} \frac{1}{2\sqrt{n}}$

Formula for the Approximate Sample Size required to produce an estimate for the population proportion p with margin of error E (if the population parameter p is unknown, the sample estimate $\hat{p}$ can be used instead):
$n = p(1-p)\left(\frac{Z_{1-\alpha/2}}{E}\right)^2 \approx \hat{p}(1-\hat{p})\left(\frac{Z_{1-\alpha/2}}{E}\right)^2$

Formula for the Conservative Sample Size (p = 0.5):
$n = 0.25\left(\frac{Z_{1-\alpha/2}}{E}\right)^2 = \left(\frac{Z_{1-\alpha/2}}{2E}\right)^2$
Exercise: The proportion of adults that believe in love at first sight. Assume that in a sample of 100 people, 40 say they do believe.

1) Compute an approximate 95% confidence interval.
Since the true population proportion is unknown, use the formula for the approximate confidence interval:
$\hat{p} \pm Z_{1-\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$

2) Compute a conservative 95% confidence interval.
$\hat{p} \pm Z_{1-\alpha/2} \frac{1}{2\sqrt{n}}$

3) Which confidence interval is wider?
__________________________________

4) How many people do we need to survey in order to estimate the true population proportion with 5% accuracy? (E=0.05)
NOTE: The probability that the true parameter lies in a particular, already computed, confidence interval is either 0 or 1. The interval is now fixed and the parameter is not random, so the parameter is either in that particular interval or it is not.
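For the love-at-first-sight exercise, the numbers can be checked with a short Python sketch. The variable names are ours, and rounding the sample size up to the next whole person is a common convention that we assume here.

```python
from math import ceil, sqrt
from statistics import NormalDist

n, successes = 100, 40
p_hat = successes / n
z = NormalDist().inv_cdf(0.975)        # 95% multiplier, about 1.96

# 1) Approximate interval: standard error uses p_hat.
approx_margin = z * sqrt(p_hat * (1 - p_hat) / n)
approx_ci = (p_hat - approx_margin, p_hat + approx_margin)

# 2) Conservative interval: standard error uses 0.5 in place of p_hat.
conservative_margin = z / (2 * sqrt(n))
conservative_ci = (p_hat - conservative_margin, p_hat + conservative_margin)

# 4) Sample size for accuracy E = 0.05 (rounded up, by assumption).
E = 0.05
n_approx = ceil(p_hat * (1 - p_hat) * (z / E) ** 2)
n_conservative = ceil(0.25 * (z / E) ** 2)

print(f"approximate 95% CI:  ({approx_ci[0]:.3f}, {approx_ci[1]:.3f})")
print(f"conservative 95% CI: ({conservative_ci[0]:.3f}, {conservative_ci[1]:.3f})")
print(f"sample size: {n_approx} (using p_hat), {n_conservative} (conservative)")
```

The conservative interval is never narrower than the approximate one, because p(1-p) is largest at p = 0.5.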
Exercise: Do you work more than 40 hours per week?
A poll was conducted by The Heldrich Center for Workforce Development (at Rutgers University). A probability sample of 1000 workers resulted in 560 (or 56%) stating they work more than 40 hours per week.

5) Compute an approximate 95% confidence interval for the true population proportion of people who work more than 40 hours.
$\hat{p} \pm Z_{1-\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$

6) How many people do we need to survey in order to estimate the true population proportion with 10% accuracy? (E=0.1)
From the statement of the problem: $\hat{p} = 0.56$.
Since the true population proportion p is unknown, use the following formula:
$n = \hat{p}(1-\hat{p})\left(\frac{Z_{1-\alpha/2}}{E}\right)^2$

7) What would a conservative sample size be (assuming unknown p and $\hat{p}$)?
$n = 0.25\left(\frac{Z_{1-\alpha/2}}{E}\right)^2$
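The interval and sample-size formulas used in this exercise can be wrapped in small helper functions. The sketch below is one way to do it; the function names `z_multiplier`, `approximate_ci` and `sample_size` are hypothetical, and results are rounded up to a whole number of respondents by assumption.

```python
from math import ceil, sqrt
from statistics import NormalDist

def z_multiplier(conf_level):
    """Standard normal multiplier Z_(1 - alpha/2) for the given confidence level."""
    return NormalDist().inv_cdf(1 - (1 - conf_level) / 2)

def approximate_ci(p_hat, n, conf_level=0.95):
    """Approximate confidence interval for a population proportion."""
    margin = z_multiplier(conf_level) * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

def sample_size(E, conf_level=0.95, p_guess=0.5):
    """Sample size for margin of error E; p_guess=0.5 gives the conservative size."""
    return ceil(p_guess * (1 - p_guess) * (z_multiplier(conf_level) / E) ** 2)

low, high = approximate_ci(p_hat=0.56, n=1000)     # question 5
n_approx = sample_size(E=0.10, p_guess=0.56)       # question 6
n_conservative = sample_size(E=0.10)               # question 7

print(f"95% CI: ({low:.3f}, {high:.3f})")
print(f"n = {n_approx} using p_hat = 0.56, n = {n_conservative} conservative")
```

Because 0.56 is close to 0.5, the two sample sizes differ by only a couple of respondents.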
Exercise: Work through Examples 7.2 (p. 296) and 7.15 (p. 329).
Lecture 7.2-7.3

1. Introduction to Hypothesis Testing. Types of Errors.
2. One Sample Test for Population Mean.
3. P-value. Power and Sample Size Determination.
Introduction to Hypothesis Testing

Hypothesis testing uses the point estimate to attempt to reject/accept a hypothesis about the population. Usually researchers want to reject the notion that chance alone can explain the sample results. Hypothesis testing is applied to population parameters by specifying a null hypothesis (H0) that contains a null value for the population parameter, that is, a value that would indicate a baseline, or that nothing of interest is happening: "old news", "no difference", etc. In most cases, the researchers are trying to show that the null value is not correct. Hypothesis testing proceeds by obtaining a sample, computing a point estimate (sample statistic), and assessing how unlikely it would be to obtain this sample statistic if the null parameter value were correct.

Use the following diagram to choose an appropriate statistical test and state the null hypothesis:
Achieving statistical significance is equivalent to rejecting the idea that the observed results are plausible if the null value is correct, i.e., rejecting the null hypothesis (H0) in favor of the alternative hypothesis (Ha). The alternative hypothesis does not specify any specific value for the true population parameter. It only gives an open interval that may contain possible values of the true parameter, but never contains the null value.
5 Basic Steps in Any Hypothesis Test

Step 1: Determine the null (H0) and alternative (Ha) hypotheses. Note: Hypotheses are statements ABOUT population parameters, NOT ABOUT sample statistics.

Step 2: Verify necessary data conditions (assumptions), and if met, summarize the data into an appropriate test statistic (using appropriate data summary, or sample statistic).

Step 3: Assuming the null (H0) hypothesis is true, find either the Rejection Rule (rejection region) or the p-value.

Step 4: Decide whether or not the result is statistically significant based on the rejection region or the p-value. The p-value is the probability of getting a test statistic as extreme or more extreme (in the direction of Ha) than the observed value of the test statistic, assuming the null hypothesis (H0) is true.
• If the p-value ≤ α, then the result IS statistically significant; the decision is to reject H0.
• If the p-value > α, then the result IS NOT statistically significant; the decision is to fail to reject H0.

Step 5: Report the conclusion in the context of the problem (question of interest).
Level of Significance, Type I Error, Type II Error, and the Power of a Test
Note: A "significant" result in the statistical sense does not necessarily imply an "important" result. It means simply that such a difference from the null hypothesis is "not very likely to happen just by chance".

The test statistic:
• IS a summary of the data that is used to help make the decision,
• IS a random variable related to the hypotheses of interest (Ha) having a known probability distribution (under the null hypothesis, H0),
• IS examined for evidence for or against H0.
One Sample Hypothesis Test for Population Mean

Test: Population Mean
Population Parameter:
Sample Statistics:
Data: 1 Sample
Response: Numerical (Age, Time, Price)
Explanatory Variable: __________
Scenario: Can we claim that the average GPA of all BU graduates is higher than 3.0?
Example (D’Agostino, Example 5.8): Testing Mean Systolic Blood Pressure.

A large, national study conducted in 2003 reported that the mean systolic blood pressure for males aged 50 was 130 with a standard deviation of 15. In 2004, an investigator hypothesized that due to increased stress in the workplace, faster-paced lifestyles, and poorer nutritional habits, systolic blood pressure has increased. In order to test the hypothesis that the population mean blood pressure μ increased in 2004, we set up two competing hypotheses.

Step 1:
H0: μ = 130 ("no change")
Ha: μ > 130 (mean blood pressure increased in 2004)

Step 2:
• Select a random sample from the population of interest (n = 108 males aged 50 in 2004).
• Record the systolic blood pressure of each male.
• Generate summary statistics ($\bar{x}$), a point estimate for the population parameter of primary interest μ.
• Compute an appropriate test statistic and, based on the value of this statistic, make a decision: how likely is it that H0 is true ("no change")?
Consider the following scenarios:
Interpretation: The mean systolic blood pressure in the 2004 sample is identical to the population mean from 2003, which would lead us to believe that the null hypothesis, H0, is most likely true and the population mean in 2004 HAS NOT CHANGED and IS equal to 130. Notice we cannot say with certainty that the null hypothesis is true because we have only a sample of males; however, because the sample was selected at random, we would expect the mean systolic blood pressure among all males aged 50 in 2004 to be close to the observed 130.
Interpretation: The mean systolic blood pressure in the 2004 sample IS SUBSTANTIALLY HIGHER than the population mean from 2003, which would lead us to believe that the alternative hypothesis, Ha, is most likely true (μ>130).
Interpretation: The mean systolic blood pressure in the 2004 sample IS NUMERICALLY HIGHER (135 > 130) than the population mean from 2003, BUT it may be solely due to chance fluctuation. How likely are we to observe a sample mean of 135 or greater from a population with μ = 130? Compute this probability using the CLT: ________________________________________________________________________
The last scenario illustrates the crux of the hypothesis-testing problem. We conduct the test of hypothesis to decide whether H0 or H1 (the alternative, also written Ha) is more likely true. We must determine a critical value, or rejection region, such that:
• if the sample mean is less than the critical value, we will conclude that H0 is true (e.g., μ = 130),
• if the sample mean is greater than the critical value, we will conclude that Ha is true (e.g., μ > 130).

Instead of determining critical values for $\bar{X}$ that would be specific to each application, we use the CLT to standardize and produce a z-score:

$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$

where $\mu_0$ is the mean value specified in H0. Assuming H0 is true, Z has a standard normal distribution (N(0,1)).
• If Z is close to zero (when $\bar{X}$ is close to $\mu_0$), then H0 is most likely true.
• If Z is large (when $\bar{X}$ is larger than $\mu_0$), then H1 is most likely true.

We need to determine the critical value, or the point at which Z is "too large."

Step 3: Rejection Rule

How likely is it that $\bar{X} > 132.88$, assuming H0 is true? Using the CLT, $P(\bar{X} > 132.88) = P(Z > 2) = 0.0228$.

How likely is it that $\bar{X} > 131.44$, assuming H0 is true? Using the CLT, $P(\bar{X} > 131.44) = P(Z > 1) = 0.1587$.
Therefore, if we observe a sample mean that exceeds 132.88 (or Z > 2) and we reject H0 in favor of H1, the probability that we are making a mistake by rejecting H0 (the probability of a TYPE I Error, α) is only 2.28%. However, if we reject H0 for values that exceed 131.44 (or Z > 1), the probability that we are making a mistake by rejecting H0 is 0.1587. We must pre-define the level of TYPE I Error (α) that can be tolerated in the analysis. Once a level of significance is selected, a decision rule is formed. The decision rule is a formal statement of the criteria used to reach a conclusion in the hypothesis test.
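The two cutoffs and their Type I error probabilities can be reproduced from the blood pressure example's values (μ0 = 130, σ = 15, n = 108). The following Python sketch is one way to do so; the dictionary names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n = 130, 15, 108
se = sigma / sqrt(n)                 # standard error of the sample mean, about 1.44

cutoffs = {}   # z cutoff -> corresponding sample-mean cutoff
alphas = {}    # z cutoff -> probability of a Type I error

for z_cut in (1, 2):
    cutoffs[z_cut] = mu0 + z_cut * se                 # sample mean that gives Z = z_cut
    alphas[z_cut] = 1 - NormalDist().cdf(z_cut)       # P(Z > z_cut) when H0 is true
    print(f"Reject when sample mean > {cutoffs[z_cut]:.2f} (Z > {z_cut}): "
          f"alpha = {alphas[z_cut]:.4f}")
```

A stricter cutoff lowers the chance of rejecting a true H0, which is why the tolerated α has to be chosen before the decision rule is set.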
Say α = 0.05; then we have to find a point z (equivalently, a sample mean x) such that $P(Z > z) = 0.05$, therefore z = 1.645. Thus, the decision rule is given by: Reject H0 if Z ≥ 1.645, and fail to reject H0 if Z < 1.645.

Note: the rejection (decision) rule is directly related to the type of the test (one-sided or two-sided), the type of test statistic, and the direction of the hypothesis.

Once the decision rule is in place, we compute the value of the test statistic (recall our z-score). With the sample mean of 135 from the last scenario, $z = \frac{135 - 130}{15/\sqrt{108}} = 3.46$.

Step 4: The final step in the test of hypothesis is to compare the test statistic to the decision rule to draw a conclusion. The test statistic falls in the rejection region and therefore we reject H0 because 3.46 > 1.645.

Step 5: Conclusion: Based on the sample of n = 108 males, there is significant evidence, at level α = 0.05, to conclude that the mean systolic blood pressure for males aged 50 in 2004 has increased from 130.
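Steps 3 and 4 of this example can be sketched in Python as follows. The sketch assumes the observed sample mean is 135, the value in the third scenario and the one that reproduces the test statistic of 3.46 quoted in Step 4. The variable names are ours.

```python
from math import sqrt
from statistics import NormalDist

mu0 = 130       # mean specified in H0
sigma = 15      # population standard deviation from the 2003 study
n = 108         # sample size
x_bar = 135     # assumed observed sample mean (third scenario)
alpha = 0.05

# Step 3: critical value for a one-sided (upper-tail) test.
z_crit = NormalDist().inv_cdf(1 - alpha)       # about 1.645

# Test statistic: standardized distance of the sample mean from mu0.
z = (x_bar - mu0) / (sigma / sqrt(n))          # about 3.46

# Step 4: compare the test statistic with the decision rule.
reject_h0 = z >= z_crit
print(f"z = {z:.2f}, critical value = {z_crit:.3f}, reject H0: {reject_h0}")
```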
Example (D’Agostino, Example 5.9): Testing Mean Cholesterol Level against Referent

Suppose that the mean cholesterol level for males aged 50 is 241. Question of interest: whether cholesterol levels are significantly reduced by modifying diet only slightly.

A random sample of 12 patients (n=12) agrees to participate in the study and follow the modified diet for 3 months. After 3 months, their cholesterol levels are measured and summary statistics are produced on the n=12 subjects. The mean cholesterol level in the sample is 235 with a standard deviation of s=12.5. Based on the data, is there significant evidence that the modified diet reduces cholesterol? Run the appropriate test using a 5% level of significance.

Step 1: Determine the null (H0) and alternative (Ha) hypotheses. We start with the definition of the parameter of interest.
Parameter: ______________________________________________
H0: _______________
Ha: _________________
Significance level α = _________
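Step 2: Verify necessary data conditions (assumptions), and if met, summarize the data into an appropriate test statistic (using appropriate data summary, or sample statistic). In this example, we have a sample of size n = 12 (n < 30), so we use the t statistic: $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{235 - 241}{12.5/\sqrt{12}} = -1.66$.

Step 3: Assuming the null (H0) hypothesis is true, find the rejection rule. With α = 0.05 and df = n - 1 = 11, the decision rule is: Reject H0 if t ≤ -1.796.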
Step 4: Decide whether or not the result is statistically significant based on the rejection region: We do not reject H0 because -1.66 > -1.796.

Step 5: Report the conclusion in the context of the problem (question of interest). Based on the sample of n = 12 males, there is no significant evidence, at level α = 0.05, to conclude that the diet reduces cholesterol level, on average.

Also go through Example 5.10 on page 198.
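The test statistic in this example can be verified with a few lines of Python. The standard library has no t distribution, so the sketch takes the critical value -1.796 from the notes as given (it corresponds to a lower-tail test with 11 degrees of freedom at α = 0.05). The variable names are ours.

```python
from math import sqrt

mu0 = 241        # referent mean cholesterol level (H0 value)
x_bar = 235      # sample mean after 3 months on the diet
s = 12.5         # sample standard deviation
n = 12           # sample size

t_crit = -1.796  # critical value quoted in the notes (df = 11, alpha = 0.05, lower tail)

# One-sample t statistic: small sample, population sigma unknown.
t = (x_bar - mu0) / (s / sqrt(n))

# Lower-tail test: reject H0 only if t falls at or below the critical value.
reject_h0 = t <= t_crit
print(f"t = {t:.2f}, critical value = {t_crit}, reject H0: {reject_h0}")
```

The statistic does not reach the rejection region, which matches the conclusion in Step 5.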
P-value

The decision on whether or not the result is statistically significant can also be made based on the p-value. The p-value is the probability of getting a test statistic as extreme or more extreme (in the direction of Ha) than the observed value of the test statistic, assuming the null hypothesis (H0) is true.
• If the p-value ≤ α, then the result IS statistically significant; the decision is to reject H0.
• If the p-value > α, then the result IS NOT statistically significant; the decision is to fail to reject H0.

Recall Example 5.8: Mean Systolic Blood Pressure against Referent
H0: μ = 130 ("no change")
Ha: μ > 130 (i.e., mean blood pressure increased)

Thus, the decision rule is given by: Reject H0 if Z ≥ 1.645, and do not reject H0 if Z < 1.645. The test statistic is z = 3.46.

Decision: the test statistic falls in the rejection region and therefore we reject H0 because 3.46 > 1.645.

Instead of creating a rejection region, we can compute the p-value:

p-value = P(test statistic as extreme or more extreme | H0 is true) = P(Z > 3.46)
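The p-value P(Z > 3.46) and the resulting decision can be computed as in the sketch below. The helper name `decide` is hypothetical, and α = 0.05 is the level used earlier in the example.

```python
from statistics import NormalDist

def decide(p_value, alpha):
    """Apply the p-value rule: reject H0 when p-value <= alpha."""
    return "reject H0" if p_value <= alpha else "fail to reject H0"

z_observed = 3.46
alpha = 0.05

# Upper-tail p-value, because Ha is mu > 130.
p_value = 1 - NormalDist().cdf(z_observed)

print(f"p-value = {p_value:.5f}: {decide(p_value, alpha)}")
```

The p-value is far below 0.05, so the p-value approach leads to the same decision as the rejection region.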