Statistics and Probability


Commission on Higher Education in collaboration with the Philippine Normal University

TEACHING GUIDE FOR SENIOR HIGH SCHOOL

Statistics and Probability CORE SUBJECT

This Teaching Guide was collaboratively developed and reviewed by educators from public and private schools, colleges, and universities. We encourage teachers and other education stakeholders to email their feedback, comments, and recommendations to the Commission on Higher Education, K to 12 Transition Program Management Unit - Senior High School Support Team at [email protected]. We value your feedback and recommendations.

Published by the Commission on Higher Education, 2016
 Chairperson: Patricia B. Licuanan, Ph.D. Commission on Higher Education
 K to 12 Transition Program Management Unit
 Office Address: 4th Floor, Commission on Higher Education, 
 C.P. Garcia Ave., Diliman, Quezon City
 Telefax: (02) 441-1143 / E-mail Address: [email protected]

DEVELOPMENT TEAM

Team Leader: Jose Ramon G. Albert, Ph.D.
Writers: Zita VJ Albacea, Ph.D., Mark John V. Ayaay, Isidoro P. David, Ph.D., Imelda E. de Mesa
Technical Editors: Nancy A. Tandang, Ph.D., Roselle V. Collado
Copy Reader: Rea Uy-Epistola
Illustrator: Michael Rey O. Santos
Cover Artists: Paolo Kurtis N. Tan, Renan U. Ortiz

CONSULTANTS

THIS PROJECT WAS DEVELOPED WITH THE PHILIPPINE NORMAL UNIVERSITY.
 University President: Ester B. Ogena, Ph.D.
 VP for Academics: Ma. Antoinette C. Montealegre, Ph.D.
 VP for University Relations & Advancement: Rosemarievic V. Diaz, Ph.D.

 Ma. Cynthia Rose B. Bautista, Ph.D., CHED
 Bienvenido F. Nebres, S.J., Ph.D., Ateneo de Manila University
 Carmela C. Oracion, Ph.D., Ateneo de Manila University
 Minella C. Alarcon, Ph.D., CHED
 Gareth Price, Sheffield Hallam University
 Stuart Bevins, Ph.D., Sheffield Hallam University

SENIOR HIGH SCHOOL SUPPORT TEAM

CHED K TO 12 TRANSITION PROGRAM MANAGEMENT UNIT

 Program Director: Karol Mark R. Yee
 Lead for Senior High School Support: Gerson M. Abesamis
 Lead for Policy Advocacy and Communications: Averill M. Pizarro
 Course Development Officers: John Carlo P. Fernando, Danie Son D. Gonzalvo
 Teacher Training Officers: Ma. Theresa C. Carlos, Mylene E. Dones
 Monitoring and Evaluation Officer: Robert Adrian N. Daulat
 Administrative Officers: Ma. Leana Paula B. Bato, Kevin Ross D. Nera, Allison A. Danao, Ayhen Loisse B. Dalena

This Teaching Guide by the Commission on Higher Education is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. This means you are free to:
 Share: copy and redistribute the material in any medium or format.
 Adapt: remix, transform, and build upon the material.
The licensor, CHED, cannot revoke these freedoms as long as you follow the license terms, which are as follows:
 Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
 NonCommercial: You may not use the material for commercial purposes.
 ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

Printed in the Philippines by EC-TEC Commercial, No. 32 St. Louis Compound 7, Baesa, Quezon City, [email protected]

Preface

Prior to the implementation of K-12, Statistics was typically taught in public high schools in the Philippines in the last quarter of third year. In private schools, Statistics was taught either as an elective or as a required but separate subject outside of regular Math classes. In college, Statistics was taught to practically everyone as either a three-unit or a six-unit course. All college students had to take at least three to six units of a Math course, and would typically “endure” a Statistics course to graduate. Teachers who taught these Statistics classes, whether in high school or in college, would typically be Math teachers, who may not necessarily have had formal training in Statistics. They were selected out of the understanding (or misunderstanding) that Statistics is Math. Statistics does depend on and use a lot of Math, but so do many disciplines, e.g. engineering, physics, accounting, chemistry, and computer science. But Statistics is not Math, not even a branch of Math. Hardly anyone would think that accounting is a branch of mathematics simply because it involves a lot of calculations. An accountant would also not describe himself or herself as a mathematician.

Math largely involves a deterministic way of thinking, and the way Math is taught in schools leads learners into a deterministic way of examining the world around them. Statistics, on the other hand, by and large deals with uncertainty. Statistics uses inductive thinking (from specifics to generalities), while Math uses deduction (from the general to the specific).

“Statistics has its own tools and ways of thinking, and statisticians are quite insistent that those of us who teach mathematics realize that statistics is not mathematics, nor is it even a branch of mathematics. In fact, statistics is a separate discipline with its own unique ways of thinking and its own tools for approaching problems.” - J. Michael Shaughnessy, “Research on Students’ Understanding of Some Big Concepts in Statistics” (2006)

Statistics deals with data; its importance has been recognized by governments, by the private sector, and across disciplines because of the need for evidence-based decision making. It has become even more important in the past few years, now that more and more data are being collected, stored, analyzed and re-analyzed. From the time when humanity first walked the face of the earth until 2003, we created as much as 5 exabytes of data (1 exabyte being a billion gigabytes). Information and communications technology (ICT) tools have given us the means to transmit and exchange data much faster, whether these data are in the form of sound, text, visual images, signals, or any combination of those forms, using desktops, laptops, tablets, mobile phones, and other gadgets together with the internet and social media (Facebook, Twitter). With the data deluge arising from the use of ICT tools, as of 2012 as much as 5 exabytes (the amount of data created from the beginning of history up to 2003) were being created every two days; a year later, this same amount of data was being created every ten minutes.

In order to make sense of data, which typically involve variation and uncertainty, we need the Science of Statistics to enable us to summarize data for describing or explaining phenomena, or to make predictions (assuming trends in the data continue). Statistics is the science that studies data, and what we can do with data. Teachers of Statistics and Probability can easily spend much time on the formal methods and computations, losing sight of the real applications and taking the excitement out of things. The eminent statistician Bradley Efron mentioned how diverse statistical applications are:

“During the 20th Century statistical thinking and methodology has become the scientific framework for literally dozens of fields including education, agriculture, economics, biology, and medicine, and with increasing influence recently on the hard sciences such as astronomy, geology, and physics. In other words, we have grown from a small obscure field into a big obscure field.”

In consequence, the work of a statistician has even become fashionable. Google’s chief economist Hal Varian wrote in 2009 that “the sexy job in the next ten years will be statisticians.” He went on to mention that “The ability to take data - to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it - that's going to be a hugely important skill in the next decades, not only at the professional level but even at the educational level for elementary school kids, for high school kids, for college kids.”

This teaching guide, prepared by a team of professional statisticians and educators, aims to assist Senior High School teachers of the Grade 11 second semester course in Statistics and Probability so that they can help Senior High School students discover the fun in describing data, and in exploring the stories behind the data. The K-12 curriculum provides for concepts in Statistics and Probability to be taught from Grade 1 up to Grade 8, and in Grade 10, but the depth at which learners absorb these concepts may need reinforcement. Thus, the first chapter of this guide discusses basic tools (such as summary measures and graphs) for describing data. While Probability may have been discussed prior to Grade 11, it is also discussed in Chapter 2, as a prelude to defining Random Variables and their Distributions. The next chapter discusses Sampling and Sampling Distributions, which bridges Descriptive Statistics and Inferential Statistics. The latter is started in Chapter 4, in Estimation, and further discussed in Chapter 5 (which deals with Tests of Hypothesis). The final chapter discusses Regression and Correlation.

Although Statistics and Probability may be tangential to the primary training of many if not all Senior High School teachers of Statistics and Probability, it will be of benefit for them to see why this course is important to teach. After all, if the teachers themselves do not find meaning in the course, neither will the students. Work on developing this set of teaching materials has been supported by the Commission on Higher Education under a Materials Development Sub-project of the K-12 Transition Project. These materials will also be shared with the Department of Education.

Writers of this teaching guide recognize that few Senior High School teachers would have formal training or applied experience with statistical concepts. Thus, the guide gives concrete suggestions on classroom activities that can illustrate the wide range of processes behind data collection and data analysis.

It would be ideal to use technology (i.e. computers) as a means to help teachers and students with computations; hence, the guide also provides suggestions in case the class may have access to a computer room (particularly the use of spreadsheet applications like Microsoft Excel). It would be unproductive for teachers and students to spend too much time working on formulas, and checking computation errors at the expense of gaining knowledge and insights about the concepts behind the formulas.

The guide gives a mixture of lectures and activities (the latter include actual collection and analysis of data). It tries to follow the suggestions of the Guidelines for Assessment and Instruction in Statistics Education (GAISE) Project of the American Statistical Association to go beyond lecture methods, and instead emphasize conceptual learning, use active learning strategies, and focus on real data. The guide indicates which material is optional, as there is a lot of material that could be taught but too little time. Teachers will have to find a way of recognizing the diverse needs of students with variable abilities and interests.

This teaching guide for Statistics and Probability, to be made available both digitally and in print to senior high school teachers, shall provide Senior High School teachers of Statistics and Probability with much-needed support as the country’s basic education system transitions into the K-12 curriculum. It is earnestly hoped that Senior High School teachers of Grade 11 Statistics and Probability can direct students into examining the context of data and identifying the consequences and implications of the stories behind Statistics and Probability, so that students become critical consumers of information. It is further hoped that the competencies gained by students in this course will help them become more statistically literate, and more prepared for whatever employment choices (and higher education specializations) they make, given that employers are recognizing the importance of having employees with skills in data management and analysis in this very data-centric world.

K to 12 BASIC EDUCATION CURRICULUM
SENIOR HIGH SCHOOL – CORE SUBJECT

Grade: 11/12
Core Subject Title: Statistics and Probability
No. of Hours/Semester: 80 hours/semester
Prerequisite (if needed):
Core Subject Description: At the end of the course, the students must know how to find the mean and variance of a random variable, to apply sampling techniques and distributions, to estimate population mean and proportion, to perform hypothesis testing on population mean and proportion, and to perform correlation and regression analyses on real-life problems.

CONTENT: Random Variables and Probability Distributions
CONTENT STANDARDS: The learner demonstrates understanding of key concepts of random variables and probability distributions.
PERFORMANCE STANDARDS: The learner is able to apply an appropriate random variable for a given real-life problem (such as in decision making and games of chance).
LEARNING COMPETENCIES (CODE): The learner …
 1. illustrates a random variable (discrete and continuous). (M11/12SP-IIIa-1)
 2. distinguishes between a discrete and a continuous random variable. (M11/12SP-IIIa-2)
 3. finds the possible values of a random variable. (M11/12SP-IIIa-3)
 4. illustrates a probability distribution for a discrete random variable and its properties. (M11/12SP-IIIa-4)
 5. constructs the probability mass function of a discrete random variable and its corresponding histogram. (M11/12SP-IIIa-5)
 6. computes probabilities corresponding to a given random variable. (M11/12SP-IIIa-6)
 7. illustrates the mean and variance of a discrete random variable. (M11/12SP-IIIb-1)
 8. calculates the mean and the variance of a discrete random variable. (M11/12SP-IIIb-2)
 9. interprets the mean and the variance of a discrete random variable. (M11/12SP-IIIb-3)
 10. solves problems involving mean and variance of probability distributions. (M11/12SP-IIIb-4)

CONTENT: Normal Distribution
CONTENT STANDARDS: The learner demonstrates understanding of key concepts of normal probability distribution.
PERFORMANCE STANDARDS: The learner is able to accurately formulate and solve real-life problems in different disciplines involving normal distribution.
LEARNING COMPETENCIES (CODE): The learner …
 11. illustrates a normal random variable and its characteristics. (M11/12SP-IIIc-1)
 12. constructs a normal curve. (M11/12SP-IIIc-2)
 13. identifies regions under the normal curve corresponding to different standard normal values. (M11/12SP-IIIc-3)
 14. converts a normal random variable to a standard normal variable and vice versa. (M11/12SP-IIIc-4)
 15. computes probabilities and percentiles using the standard normal table. (M11/12SP-IIIc-d1)

CONTENT: Sampling and Sampling Distributions
CONTENT STANDARDS: The learner demonstrates understanding of key concepts of sampling and sampling distributions of the sample mean.
PERFORMANCE STANDARDS: The learner is able to apply suitable sampling and sampling distributions of the sample mean to solve real-life problems in different disciplines.
LEARNING COMPETENCIES (CODE): The learner …
 1. illustrates random sampling. (M11/12SP-IIId-2)
 2. distinguishes between parameter and statistic. (M11/12SP-IIId-3)
 3. identifies sampling distributions of statistics (sample mean). (M11/12SP-IIId-4)
 4. finds the mean and variance of the sampling distribution of the sample mean. (M11/12SP-IIId-5)
 5. defines the sampling distribution of the sample mean for normal population when the variance is: (a) known (b) unknown. (M11/12SP-IIIe-1)
 6. illustrates the Central Limit Theorem. (M11/12SP-IIIe-2)
 7. defines the sampling distribution of the sample mean using the Central Limit Theorem. (M11/12SP-III-3)
 8. solves problems involving sampling distributions of the sample mean. (M11SP-IIIe-f-1)

CONTENT: Estimation of Parameters
CONTENT STANDARDS: The learner demonstrates understanding of key concepts of estimation of population mean and population proportion.
PERFORMANCE STANDARDS: The learner is able to estimate the population mean and population proportion to make sound inferences in real-life problems in different disciplines.
LEARNING COMPETENCIES (CODE): The learner …
 1. illustrates point and interval estimations. (M11/12SP-IIIf-2)
 2. distinguishes between point and interval estimation. (M11/12SP-IIIf-3)
 3. identifies point estimator for the population mean. (M11/12SP-IIIf-4)
 4. computes for the point estimate of the population mean. (M11/12SP-IIIf-5)
 5. identifies the appropriate form of the confidence interval estimator for the population mean when: (a) the population variance is known, (b) the population variance is unknown, and (c) the Central Limit Theorem is to be used. (M11/12SP-IIIg-1)
 9. illustrates the t-distribution. (M11/12SP-IIIg-2)
 10. constructs a t-distribution. (M11/12SP-IIIg-3)
 11. identifies regions under the t-distribution corresponding to different t-values. (M11/12SP-IIIg-4)
 11. identifies percentiles using the t-table. (M11/12SP-IIIg-5)
 12. computes for the confidence interval estimate based on the appropriate form of the estimator for the population mean. (M11/12SP-IIIh-1)
 13. solves problems involving confidence interval estimation of the population mean. (M11/12SP-IIIh-2)
 14. draws conclusion about the population mean based on its confidence interval estimate. (M11/12SP-IIIh-3)
 15. identifies point estimator for the population proportion. (M11/12SP-IIIi-1)
 16. computes for the point estimate of the population proportion. (M11/12SP-IIIi-2)
 17. identifies the appropriate form of the confidence interval estimator for the population proportion based on the Central Limit Theorem. (M11/12SP-IIIi-3)
 18. computes for the confidence interval estimate of the population proportion. (M11/12SP-IIIi-4)
 19. solves problems involving confidence interval estimation of the population proportion. (M11/12SP-IIIi-5)
 20. draws conclusion about the population proportion based on its confidence interval estimate. (M11/12SP-IIIi-6)
 21. identifies the length of a confidence interval. (M11/12SP-IIIj-1)
 22. computes for the length of the confidence interval. (M11/12SP-IIIj-2)
 23. computes for an appropriate sample size using the length of the interval. (M11/12SP-IIIj-3)
 24. solves problems involving sample size determination. (M11/12SP-IIIj-4)

CONTENT: Tests of Hypothesis
CONTENT STANDARDS: The learner demonstrates understanding of key concepts of tests of hypotheses on the population mean and population proportion.
PERFORMANCE STANDARDS: The learner is able to perform appropriate tests of hypotheses involving the population mean and population proportion to make inferences in real-life problems in different disciplines.
LEARNING COMPETENCIES (CODE): The learner …
 1. illustrates: (a) null hypothesis (b) alternative hypothesis (c) level of significance (d) rejection region; and (e) types of errors in hypothesis testing. (M11/12SP-IVa-1)
 2. calculates the probabilities of committing a Type I and Type II error. (M11/12SP-IVa-2)
 3. identifies the parameter to be tested given a real-life problem. (M11/12SP-IVa-3)
 4. formulates the appropriate null and alternative hypotheses on a population mean. (M11/12SP-IVb-1)
 5. identifies the appropriate form of the test-statistic when: (a) the population variance is assumed to be known (b) the population variance is assumed to be unknown; and (c) the Central Limit Theorem is to be used. (M11/12SP-IVb-2)
 6. identifies the appropriate rejection region for a given level of significance when: (a) the population variance is assumed to be known (b) the population variance is assumed to be unknown; and (c) the Central Limit Theorem is to be used. (M11/12SP-IVc-1)
 7. computes for the test-statistic value (population mean). (M11/12SP-IVd-1)
 8. draws conclusion about the population mean based on the test-statistic value and the rejection region. (M11/12SP-IVd-2)
 9. solves problems involving test of hypothesis on the population mean. (M11/12SP-IVe-1)
 10. formulates the appropriate null and alternative hypotheses on a population proportion. (M11/12SP-IVe-2)
 11. identifies the appropriate form of the test-statistic when the Central Limit Theorem is to be used. (M11/12SP-IVe-3)
 12. identifies the appropriate rejection region for a given level of significance when the Central Limit Theorem is to be used. (M11/12SP-IVe-4)
 13. computes for the test-statistic value (population proportion). (M11/12SP-IVf-1)
 14. draws conclusion about the population proportion based on the test-statistic value and the rejection region. (M11/12SP-IVf-2)
 15. solves problems involving test of hypothesis on the population proportion. (M11/12SP-IVf-g1)

CONTENT: Correlation and Regression Analyses (ENRICHMENT)
CONTENT STANDARDS: The learner demonstrates understanding of key concepts of correlation and regression analyses.
PERFORMANCE STANDARDS: The learner is able to perform correlation and regression analyses on real-life problems in different disciplines.
LEARNING COMPETENCIES (CODE): The learner …
 1. illustrates the nature of bivariate data. (M11/12SP-IVg-2)
 2. constructs a scatter plot. (M11/12SP-IVg-3)
 3. describes shape (form), trend (direction), and variation (strength) based on a scatter plot. (M11/12SP-IVg-4)
 4. estimates strength of association between the variables based on a scatter plot. (M11/12SP-IVh-1)
 5. calculates the Pearson’s sample correlation coefficient. (M11/12SP-IVh-2)
 6. solves problems involving correlation analysis. (M11/12SP-IVh-3)
 7. identifies the independent and dependent variables. (M11/12SP-IVi-1)
 8. draws the best-fit line on a scatter plot. (M11/12SP-IVi-2)
 9. calculates the slope and y-intercept of the regression line. (M11/12SP-IVi-3)
 10. interprets the calculated slope and y-intercept of the regression line. (M11/12SP-IVi-4)
 11. predicts the value of the dependent variable given the value of the independent variable. (M11/12SP-IVj-1)
 12. solves problems involving regression analysis. (M11/12SP-IVj-2)

K to 12 Senior High School Core Curriculum – Statistics and Probability December 2013

Code Book Legend

Sample: M11/12SP-IIIa-1

LEGEND and SAMPLE
 First Entry: Learning Area and Strand/Subject or Specialization (Mathematics) and Grade Level (Grade 11/12): M11/12
 Uppercase Letter/s: Domain/Content/Component/Topic (Statistics and Probability): SP
 Separator: -
 Roman Numeral (*Zero if no specific quarter): Quarter (Third Quarter): III
 Lowercase Letter/s (*Put a hyphen (-) in between letters to indicate more than a specific week): Week (Week one): a
 Separator: -
 Arabic Number: Competency (illustrates a random variable (discrete and continuous)): 1
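The code book legend can be read as a small parsing rule. The following is a minimal Python sketch, not part of the curriculum guide, that splits a competency code such as the sample M11/12SP-IIIa-1 into the parts named in the legend. The function name `parse_code`, the field names, and the mapping of week letters to week numbers (a = week one, b = week two, and so on) are assumptions made for illustration; codes that do not follow the legend's pattern are simply returned as `None`.

```python
import re

# Pattern follows the legend: learning area + grade level, domain,
# Roman-numeral quarter (or 0), lowercase week letter(s), competency number.
CODE_PATTERN = re.compile(
    r"(?P<area>[A-Z]+)(?P<grade>\d+(?:/\d+)?)(?P<domain>[A-Z]+)"
    r"-(?P<quarter>I{1,3}|IV|0)(?P<weeks>[a-z](?:-[a-z])?)"
    r"-(?P<competency>\d+)"
)

# "0" means no specific quarter, per the legend's footnote.
QUARTERS = {"0": None, "I": 1, "II": 2, "III": 3, "IV": 4}


def parse_code(code):
    """Split a competency code into labeled parts, or return None if it does not fit."""
    match = CODE_PATTERN.fullmatch(code)
    if match is None:
        return None
    # Assumed mapping: 'a' is week one, 'b' is week two, and so on.
    weeks = [ord(letter) - ord("a") + 1 for letter in match["weeks"].split("-")]
    return {
        "learning_area": match["area"],
        "grade_level": match["grade"],
        "domain": match["domain"],
        "quarter": QUARTERS[match["quarter"]],
        "weeks": weeks,
        "competency": int(match["competency"]),
    }


print(parse_code("M11/12SP-IIIa-1"))
```

For the sample code, the sketch reports learning area M, grade level 11/12, domain SP, quarter 3, week 1, and competency 1, which matches the legend.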

Table of Contents

Chapter 1: Exploring Data
• Introducing Statistics
• Data Collection Activity
• Basic Terms in Statistics
• Levels of Measurement
• Data Presentation
• Measures of Central Tendency
• Other Measures of Location
• Measures of Variation
• More on Describing Data: Summary Measures and Graphs

Chapter 2: Random Variables and Probability Distributions
• Probability
• Geometric Probability
• Random Variables
• Probability Distributions of Discrete Random Variables
• Probability Density Functions
• Mean and Variance of Discrete Random Variables
• More about Means and Variances
• The Normal Distribution and Its Properties
• Areas Under a Standard Normal Distribution
• Areas under a Normal Distribution

Chapter 3: Sampling
• Coin Tossing revisited from a Statistical Perspective
• The Need for Sampling
• Sampling Distribution of the Sample Mean
• Sampling without Replacement
• Sampling from a Box of Marbles, Nips, or Colored Paper Clips and One-Peso Coins
• Sampling from the Periodic Table

Chapter 4: On Estimation of Parameters
• Concepts of Point and Interval Estimation
• Point Estimation of the Population Mean
• Confidence Interval Estimation of the Population Mean
• Point and Confidence Interval Estimation of the Population Proportion
• More on Point Estimates and Confidence Intervals

Chapter 5: Tests of Hypothesis
• Basic Concepts in Hypothesis Testing
• Steps in Hypothesis Testing
• Test on Population Mean
• Test on Population Proportion
• More on Hypothesis Tests Regarding the Population Proportion

Chapter 6: Correlation and Regression Analysis
• Examining Relationships with Correlation

Biographical Notes

CHAPTER 1: EXPLORING DATA

Lesson 1: Introducing Statistics

TIME FRAME: 60 minutes

OVERVIEW OF LESSON
In decision making, we use statistics although some of us may not be aware of it. In this lesson, we make the students realize that to decide logically, they need to use statistics. An inquiry could be answered or a problem could be solved through the use of statistics. In fact, without knowing it we use statistics in our daily activities.

LEARNING COMPETENCIES
At the end of the lesson, the learner should be able to identify questions that could be answered using a statistical process and describe the activities involved in a statistical process.

LESSON OUTLINE
1. Motivation
2. Statistics as a Tool in Decision-Making
3. Statistical Process in Solving a Problem

REFERENCES
Albert, J. R. G. (2008). Basic Statistics for the Tertiary Level (ed. Roberto Padua, Welfredo Patungan, Nelia Marquez). Rex Bookstore.
Handbook of Statistics 1 (1st and 2nd Edition). Authored by the Faculty of the Institute of Statistics, UP Los Baños, College, Laguna 4031.
Workbooks in Statistics 1 (From 1st to 13th Edition). Authored by the Faculty of the Institute of Statistics, UP Los Baños, College, Laguna 4031.


DEVELOPMENT OF THE LESSON

A. Motivation
You may ask the students to share a question that is on their minds at that moment, and write these questions on the board. (Note: As you write them on the board, you may try to sort the questions into two groups: one group for questions that are answerable by a single fact, and the other for questions that require more than one piece of information and need further thinking.) The following are examples of what you could have written on the board:

Group 1:
• How old is our teacher?
• Is the vehicle of the Mayor of our city/town/municipality bigger than the vehicle used by the President of the Philippines?
• How many days are there in December?
• Does the Principal of the school have a postgraduate degree?
• How much does the Barangay Captain receive as allowance?
• What is the weight of my smallest classmate?

Group 2:
• How old are the people residing in our town?
• Do dogs eat more than cats?
• Does it rain more in our country than in Thailand?
• Do math teachers earn more than science teachers?
• How many books do my classmates usually bring to school?
• What is the proportion of Filipino children aged 0 to 5 years who are underweight or overweight for their age?

The first group of questions could be answered by a single piece of information that is considered always true. There is a correct answer based on a fact, and you do not need a process of inquiry to answer this kind of question. For example, there is one and only one correct answer to the first question in Group 1, and that is your age as of your last birthday, or the number of years since your birth year.

On the other hand, to respond to the questions in the second group, one needs observations or data. For some questions, you need to get the observations or responses of all those concerned. For the first question in the second group, you need to ask all the people in the locality about their age, and from the values you obtain you get a representative value. To answer the second question in the second group, you would need to get the amount of food that all dogs and all cats eat. However, we know that it is not feasible to do so. What you can do instead is get a representative group of dogs and another representative group of cats, and then measure the amount of food each group of animals eats. From these two sets of values, we could then infer whether dogs do eat more than cats.

So, as you can see, for the second group of questions you need more information or data to be able to answer the question. Either you get observations from all those concerned, or you get representative groups from which you gather your data. In both cases, you need data to be able to respond to the question. Using data to find an answer or a solution to a problem or an inquiry is actually using the statistical process, or doing it with statistics.

Now, let us formalize what we discussed and learn more about statistics and how we use it in decision-making.

B. Main Lesson

1. Statistics as a Tool in Decision-Making
Statistics is defined as a science that studies data to be able to make a decision. Hence, it is a tool in the decision-making process. Mention that Statistics as a science involves the methods of collecting, processing, summarizing and analyzing data in order to provide answers or solutions to an inquiry. One also needs to interpret and communicate the results of these methods to support a decision that one makes when faced with a problem or an inquiry.

Trivia: The word “statistics” actually comes from the word “state”, because governments have been involved in statistical activities, especially the conduct of censuses for either military or taxation purposes. The need for and conduct of censuses are recorded in the pages of holy texts. In the Christian Bible, particularly the Book of Numbers, God is reported to have instructed Moses to carry out a census. Another census mentioned in the Bible is the census ordered by Caesar Augustus throughout the entire Roman Empire before the birth of Christ.

Inform students that uncovering patterns in data involves not just science but also art, and this is why some people may think “Stat is eeeks!” and may view statistical procedures and results with much skepticism.

Make known to students that Statistics enables us to
• characterize persons, objects, situations, and phenomena;
• explain relationships among variables;
• formulate objective assessments and comparisons; and, more importantly,
• make evidence-based decisions and predictions.


To use Statistics in decision-making, there is a statistical process to follow, which is discussed in the next section.

2. Statistical Process in Solving a Problem
You may go back to one of the questions identified in the second group and use it to discuss the components of a statistical process. As an illustration, let us discuss how we could answer the question “Do dogs eat more than cats?” As discussed earlier, this question requires you to gather data to generate statistics, which will serve as the basis for answering the query. There should be a plan or design on how to collect the data so that the information we get from it is sufficient and any bias in responding to the query is minimized. In relation to the query, we said earlier that we cannot gather data from all dogs and cats. Hence, the plan is to get a representative group of dogs and another representative group of cats. These representative groups are observed for characteristics such as the animal’s weight, the amount of food in grams eaten per day, and the breed of the animal. Included in the plan are factors such as how many dogs and cats to include in each group, how to select the animals to be included, and when to observe these animals for their characteristics. After the data are gathered, we must verify the quality of the data in order to make a good decision. The data quality check could be done as we process the data to summarize the information extracted from them. Using this information, one can then make a decision or provide answers to the problem or question at hand.

To summarize, a statistical process for making a decision or providing solutions to a problem includes the following:
• Planning or designing the collection of data to answer statistical questions in a way that maximizes information content and minimizes bias;
• Collecting the data as required in the plan;
• Verifying the quality of the data after they are collected;
• Summarizing the information extracted from the data; and
• Examining the summary statistics so that insight and meaningful information can be produced to support decision-making or solutions to the question or problem at hand.
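The steps listed above can be sketched in a few lines of Python. This is only an illustrative sketch for the “Do dogs eat more than cats?” question: the intake values are made up for demonstration (they are not real measurements), and the function names `verify` and `summarize` are hypothetical helpers, not part of this guide.

```python
from statistics import mean

# Collected data (hypothetical): daily food intake in grams for two
# representative groups. None marks a missing record and -5 an impossible one.
dog_intake = [310, 295, None, 420, 380]
cat_intake = [85, 90, 70, -5, 95]

def verify(values):
    """Quality check: keep only recorded, positive measurements."""
    return [v for v in values if v is not None and v > 0]

def summarize(values):
    """Summarize a group by its mean daily intake."""
    return mean(values)

# Verify, summarize, then examine the summaries to answer the question.
dogs = verify(dog_intake)
cats = verify(cat_intake)
dog_mean = summarize(dogs)
cat_mean = summarize(cats)
dogs_eat_more = dog_mean > cat_mean

print(f"dogs: {dog_mean:.2f} g/day, cats: {cat_mean:.2f} g/day")
print("Do dogs eat more than cats?", dogs_eat_more)
```

The planning step is not shown in code because it happens before any data exist; it decides how many animals to observe, how to select them, and when to observe them.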

Hence, several activities make up a statistical process. For some questions the process is simple, but for others it might be a bit complicated to implement. Also, not all questions or problems can be answered by a simple statistical process; there are indeed problems that need a complex statistical process. However, one can be assured that logical decisions or solutions can be formulated using a statistical process.

KEY POINTS
• There is a difference between questions that could be answered using Statistics and those that could not.
• Statistics is a science that studies data.
• There are many uses of Statistics but its main use is in decision-making.
• Logical decisions or solutions to a problem could be attained through a statistical process.

ASSESSMENT
Note: Answers are provided inside the parentheses and italicized.

1. Identify which of the following questions are answerable using a statistical process.
a. What is a typical size of a Filipino family? (answerable through a statistical process)
b. How many hours are there in a day? (not answerable through a statistical process)
c. How old is the oldest man residing in the Philippines? (answerable through a statistical process)
d. Is planet Mars bigger than planet Earth? (not answerable through a statistical process)
e. What is the average wage rate in the country? (answerable through a statistical process)
f. Would Filipinos prefer eating bananas rather than apples? (answerable through a statistical process)
g. How long did you sleep last night? (not answerable through a statistical process)
h. How much does a newly hired public school teacher in NCR earn in a month? (not answerable through a statistical process)
i. How tall is a typical Filipino? (answerable through a statistical process)
j. Did you eat your breakfast today? (not answerable through a statistical process)

2. For each of the questions in Number 1 identified as answerable using a statistical process, describe the activities involved in the process.


For a. What is a typical size of a Filipino family? (The process includes getting a representative group of Filipino families and asking each family head how many members the family has. From the gathered data, after a quality check, a typical value of the number of family members could be obtained. Such typical value represents a possible answer to the question.)

For c. How old is the oldest man residing in the Philippines? (The process includes getting the ages of all residents of the country. From the gathered data, after a quality check, the highest value of age could be obtained. Such value is the answer to the question.)

For e. What is the average wage rate in the country? (The process includes getting all prevailing wage rates in the country. From the gathered data, after a quality check, a typical value of the wage rate could be obtained. Such value is the answer to the question.)

For f. Would Filipinos prefer eating bananas rather than apples? (The process includes getting a representative group of Filipinos and asking each one which fruit he/she prefers, banana or apple. From the gathered data, after a quality check, the proportion of those who prefer bananas and the proportion of those who prefer apples will be computed and compared. The results of this comparison could provide a possible answer to the question.)

For i. How tall is a typical Filipino? (The process includes getting a representative group of Filipinos and measuring the height of each member of the group. From the gathered data, after a quality check, a typical value of the height of a Filipino could be obtained. Such typical value represents a possible answer to the question.)

Note: Tell the students that getting a representative group and obtaining a typical value are to be learned in subsequent lessons in this subject.


CHAPTER 1: EXPLORING DATA Lesson 2: Data Collection Activity TIME FRAME: 60 minutes

OVERVIEW OF LESSON
As we have learned in the previous lesson, Statistics is a science that studies data. Hence, to teach Statistics, the use of a real data set is recommended. In this lesson, we present an activity where the students will be asked to provide some data that will be submitted for consolidation by the teacher for future lessons. Data on heights and weights, for instance, will be used for calculating Body Mass Index in the integrative lesson. Students will also be given the perspective that the data they provide are part of a bigger group of data, as the same data will be asked from much larger groups (the entire class, all Grade 11 students in school, all Grade 11 students in the district). The contextualization of data will also be discussed.

LEARNING COMPETENCIES: At the end of the lesson, the learner should be able to:
• Recognize the importance of providing correct information in a data collection activity;
• Understand the issue of confidentiality of information in a data collection activity;
• Participate in a data collection activity; and
• Contextualize data.

LESSON OUTLINE: 1. Preliminaries in a Data Collection Activity 2. Performing a Data Collection Activity 3. Contextualization of Data

REFERENCES
Albert, J. R. G. (2008). Basic Statistics for the Tertiary Level (ed. Roberto Padua, Welfredo Patungan, Nelia Marquez), published by Rex Bookstore.
Handbook of Statistics 1 (1st and 2nd Edition), Authored by the Faculty of the Institute of Statistics, UP Los Baños, College Laguna 4031
Workbooks in Statistics 1 (From 1st to 13th Edition), Authored by the Faculty of the Institute of Statistics, UP Los Baños, College Laguna 4031
https://www.khanacademy.org/math/probability/statistical-studies/statistical-questions/v/statisticalquestions
https://www.illustrativemathematics.org/content-standards/tasks/703


DEVELOPMENT OF THE LESSON
A. Preliminaries in a Data Collection Activity
Before the lesson, prepare a sheet of paper listing the name of everyone in class with a “Class Student Number” (see Attachment A for the suggested format). The class student number is a random number chosen in the following fashion:
(a) Make a box with “tickets” (small pieces of paper of equal size) listing the numbers 1 up to the number of students in the class.
(b) Shake the box, get a ticket, and assign the number on the ticket to the first person in the list.
(c) Shake the box again, get another ticket, and assign the number on this ticket to the next person in the list.
(d) Repeat (c) until you run out of tickets in the box.

At this point, all the students have their corresponding class student numbers written across their names in the prepared class list. Note that the preparation of the class list is done before the class starts.

At the start of the class, inform each student confidentially of his/her class student number. Perhaps, when attendance is called, each student can be given a separate piece of paper that lists his/her name and class student number. Tell students to remember their class student number and to always use it throughout the semester whenever data are requested of them. Explain to students that in a data collection activity, specific identities such as their names are not required, especially because people have a right to confidentiality. However, there should be a way to develop and maintain a database so that the quality of the data can be checked and, if necessary, the data can be verified with the respondents who provided them. These preliminary steps of generating a class student number and informing students confidentially of their number are essential for the data collection activities in this lesson and other lessons, so that students can be uniquely identified without having to obtain their names.
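For teachers who prefer to prepare the class list on a computer, the ticket-drawing steps (a) to (d) can be imitated with a random shuffle. This is a minimal sketch, not a required procedure: the function name `assign_class_numbers`, the placeholder student names, and the `seed` argument (which only makes the draw repeatable) are assumptions made for illustration.

```python
import random

def assign_class_numbers(names, seed=None):
    """Give each name a distinct random number from 1 to len(names)."""
    rng = random.Random(seed)
    tickets = list(range(1, len(names) + 1))  # (a) one ticket per student
    rng.shuffle(tickets)                      # (b)-(d) shake the box and draw
    return dict(zip(names, tickets))

# Placeholder names standing in for the real class list.
class_list = ["Student A", "Student B", "Student C", "Student D"]
numbers = assign_class_numbers(class_list, seed=2016)

for name, number in numbers.items():
    print(name, number)
```

Shuffling all tickets at once and handing them out in list order is equivalent to drawing tickets one at a time without returning them to the box.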
Inform the students also that the class student numbers they were given are meant to identify them without revealing their specific identities in the class recording sheet (which will contain the consolidated records that everyone has provided). This helps protect the confidentiality of information. In statistical activities, facts are collected from respondents for purposes of getting aggregate information, but confidentiality should be protected. Mention that the agencies mandated to collect data are bound by law to protect the confidentiality of information provided by respondents. Market research organizations in the private sector and individual researchers also guard confidentiality, as they merely want to obtain aggregate data. This way, respondents can be truthful in giving

information, and the researcher can give a commitment to respondents that the data they provide will never be released to anyone in a form that will identify them without their consent.

B. Performing a Data Collection Activity
Explain to the students that the purpose of this data collection activity is to gather data that they could use in their future lessons in Statistics. It is important that they provide the needed information to the best of their knowledge. Also, before they respond to the questionnaire provided in Attachment B as the Student Information Sheet (SIS), it is recommended that each item in the SIS be clarified. The following are suggested clarifications for each item:

1. CLASS STUDENT NUMBER: This is the number that you provided confidentially to the student at the start of the class.
2. SEX: This is the student’s biological sex and not their preferred gender. Hence, they have to choose only one of the two choices by placing a check mark (√) in the space provided before the choices.
3. NUMBER OF SIBLINGS: This is the number of brothers and sisters that the student has in his/her nuclear or immediate family. This number excludes the student from the count. Thus, if the student is the only child in the family, he/she will report zero as his/her number of siblings.
4. WEIGHT (in kilograms): This refers to the student’s weight based on the student’s knowledge. Note that the weight has to be reported in kilograms. In case the student knows his/her weight in pounds, the value should be converted to kilograms by dividing the weight in pounds by a conversion factor of 2.2 pounds per kilogram.
5. HEIGHT (in centimeters): This refers to the student’s height based on the student’s knowledge. Note that the height has to be reported in centimeters. In case the student knows his/her height in inches, the value should be converted to centimeters by multiplying the height in inches by a conversion factor of 2.54 centimeters per inch.
6. AGE OF MOTHER (as of her last birthday in years): This refers to the age of the student’s mother in years as of her last birthday; thus this number should be reported as a whole number. In case the student’s mother is deceased or cannot be located, ask the student to provide the age she would have had if she were alive or present. You could help the student determine his/her mother’s age based on other information that the student could provide, such as the birth year of the mother or the student’s age. Note also that a zero value is not an acceptable value.
7. USUAL DAILY ALLOWANCE IN SCHOOL (in pesos): This refers to the usual amount in pesos that the student is given when he/she goes to school on a weekday. Note that the student can give zero as a response for this item in case he/she has no monetary allowance per day.
8. USUAL DAILY FOOD EXPENDITURE IN SCHOOL (in pesos): This refers to the usual amount in pesos that the student spends on food, including drinks, in school per day. Note that the student can give zero as a response for this item in case he/she does not spend on food in school.
9. USUAL NUMBER OF TEXT MESSAGES SENT IN A DAY: This refers to the usual number of text messages that a student sends in a day. Note that the student can give zero as a response for this item in case he/she does not have a gadget for sending text messages or simply does not send text messages.
10. MOST PREFERRED COLOR: The student is to choose the color that he/she considers most preferred among the given choices. Note that the student can choose only one. Hence, he/she has to place a check mark (√) in the space provided before that color.
11. USUAL SLEEPING TIME: This refers to the usual sleeping time at night during a typical weekday or school day. Note that the time is to be reported using the 24-hour clock, also known as military time (0:00 to 23:59 are the possible values).
12. HAPPINESS INDEX FOR THE DAY: The student has to respond on how he/she feels at that time using codes from 1 to 10. Code 1 means that the student feels very unhappy, while Code 10 means that the student feels very happy on the day when the data are being collected.

After the clarification, give the students at most 10 minutes to respond to the questionnaire. Ask the students to submit the completed SIS so that you can consolidate the data gathered using the formatted worksheet file provided to you as Attachment C. Having the data in an electronic file makes it easier for you to use them in future lessons. Be sure that the students have provided information for all items in the SIS.
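The unit conversions in items 4 and 5 and some of the value rules stated in items 3, 6 to 9, and 12 can be checked mechanically when encoding the sheets. The sketch below is only an illustration: the function names, the dictionary keys, and the sample record are hypothetical, and the checks cover only the rules stated above.

```python
LB_PER_KG = 2.2      # conversion factor given in item 4
CM_PER_INCH = 2.54   # conversion factor given in item 5

def pounds_to_kg(pounds):
    """Convert a weight in pounds to kilograms (divide by 2.2)."""
    return pounds / LB_PER_KG

def inches_to_cm(inches):
    """Convert a height in inches to centimeters (multiply by 2.54)."""
    return inches * CM_PER_INCH

def check_record(record):
    """Return a list of problems found in one encoded SIS record."""
    problems = []
    if record["age_of_mother"] <= 0:          # item 6: zero is not acceptable
        problems.append("age_of_mother must be a positive whole number")
    if not 1 <= record["happiness_index"] <= 10:  # item 12: codes 1 to 10
        problems.append("happiness_index must be from 1 to 10")
    # Items 3, 7, 8, 9: zero is allowed, negative values are not.
    for item in ("siblings", "daily_allowance", "food_expenditure", "text_messages"):
        if record[item] < 0:
            problems.append(f"{item} cannot be negative")
    return problems

# Hypothetical record, made up for illustration.
record = {
    "siblings": 2, "age_of_mother": 45, "daily_allowance": 200,
    "food_expenditure": 150, "text_messages": 20, "happiness_index": 8,
}
print(round(pounds_to_kg(132), 1), "kg")
print(round(inches_to_cm(60), 1), "cm")
print(check_record(record))
```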


Inform the students that you will compile all their responses, and that compiling these records from everyone in the class is an example of a census, since data have been gathered from every student in class. Mention that the government, through the Philippine Statistics Authority (PSA), conducts censuses to obtain information about the socio-demographic characteristics of the residents of the country. Census data are used by the government to make plans, such as how many schools and hospitals to build. Censuses of population and housing are conducted every 10 years, in years ending in zero (e.g., 1990, 2000, 2010), to obtain population counts and demographic information about all Filipinos. Mid-decade population censuses have also been conducted since 1995. Censuses of Agriculture, and of Philippine Business and Industry, are also conducted by the PSA to obtain information on production and other relevant economic information.

PSA is the government agency mandated to conduct censuses and surveys. Through Republic Act 10625 (also referred to as The Philippine Statistical Act of 2013), PSA was created from four former government statistical agencies, namely: the National Statistics Office (NSO), the National Statistical Coordination Board (NSCB), the Bureau of Labor and Employment Statistics (BLES), and the Bureau of Agricultural Statistics (BAS). The other agency created through RA 10625 is the Philippine Statistical Research and Training Institute (PSRTI), which is mandated as the research and training arm of the Philippine Statistical System. PSRTI was created from its forerunner, the former Statistical Research and Training Center (SRTC).

C. Contextualization of Data
Ask students what comes to their minds when they hear the term “data” (which may be viewed as a collection of facts from experiments, observations, sample surveys and censuses, and administrative reporting systems). Present to the students the following collection of numbers, figures, symbols, and words, and ask them if they could consider the collection as data.

3, red, F, 156, 4, 65, 50, 25, 1, M, 9, 40, 68, blue, 78, 168, 69, 3, F, 6, 9, 45, 50, 20, 200, white, 2, pink, 160, 5, 60, 100, 15, 9, 8, 41, 65, black, 68, 165, 59, 7, 6, 35, 45

Although the collection is composed of numbers and symbols that could be classified as numeric or non-numeric, the collection has no meaning, that is, it is not contextualized. Hence, it cannot be referred to as data.


Tell the students that data are facts and figures that are presented, collected and analyzed. Data are either numeric or non-numeric and must be contextualized. To contextualize data, that is, to put meaning on the data, we must know the following six W’s of the data:
1. Who? Who provided the data?
2. What? What information was obtained from the respondents, and what is the unit of measurement used for each piece of information (if there is any)?
3. When? When was the data collected?
4. Where? Where was the data collected?
5. Why? Why was the data collected?
6. HoW? HoW was the data collected?

Let us take as an illustration the data that you have just collected from the students, and let us contextualize it by responding to the W-questions. It is recommended that the students answer the W-questions so that they will learn how to do it.

1. Who? Who provided the data?
• The students in this class provided the data.

2. What? What information was obtained from the respondents, and what is the unit of measurement used for each piece of information (if there is any)?
• The information gathered includes Class Student Number, Sex, Number of Siblings, Weight, Height, Age of Mother, Usual Daily Allowance in School, Usual Daily Food Expenditure in School, Usual Number of Text Messages Sent in a Day, Most Preferred Color, Usual Sleeping Time and Happiness Index for the Day.
• The units of measurement for the information on Number of Siblings, Weight, Height, Age of Mother, Usual Daily Allowance in School, Usual Daily Food Expenditure in School, and Usual Number of Text Messages Sent in a Day are person, kilogram, centimeter, year, pesos, pesos and message, respectively.

3. When? When was the data collected?
• The data was collected on the first few days of classes for Statistics and Probability.

4. Where? Where was the data collected?
• The data was collected inside our classroom.

5. Why? Why was the data collected?
• As explained earlier, the data will be used in our future lessons in Statistics and Probability.

6. HoW? HoW was the data collected?
• The students provided the data by responding to the Student Information Sheet prepared and distributed by the teacher for the data collection activity.

Once the data are contextualized, there is now meaning to the collection of numbers and symbols, which may now look like the following table. The table shows just a small part of the data collected in the earlier activity.

| Class Student Number | Sex | Number of siblings (in person) | Weight (in kg) | Height (in cm) | Age of mother (in years) | Usual daily allowance in school (in pesos) | Usual daily food expenditure in school (in pesos) | Usual number of text messages sent in a day | Most Preferred Color | Usual Sleeping Time | Happiness Index for the Day |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | M | 2 | 60 | 156 | 60 | 200 | 150 | 20 | RED | 23:00 | 8 |
| 2 | F | 5 | 63 | 160 | 66 | 300 | 200 | 25 | PINK | 22:00 | 9 |
| 3 | F | 3 | 65 | 165 | 59 | 250 | 50 | 15 | BLUE | 20:00 | 7 |
| 4 | M | 1 | 55 | 160 | 55 | 200 | 100 | 30 | BLACK | 19:00 | 6 |
| 5 | M | 0 | 65 | 167 | 45 | 350 | 300 | 35 | BLUE | 20:00 | 8 |
| : | : | : | : | : | : | : | : | : | : | : | : |
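If the consolidated data are kept electronically, the answers to the W-questions can be stored together with the values so that the numbers never lose their context. The sketch below is one possible arrangement, using the five illustrative rows shown in this lesson; the key names (such as `weight_kg`) and the `context` dictionary are naming assumptions made for this example.

```python
# Answers to the six W-questions, kept alongside the data.
context = {
    "who": "Students of this Statistics and Probability class",
    "what": "Twelve items from the Student Information Sheet",
    "when": "First few days of classes",
    "where": "Inside the classroom",
    "why": "For use in future lessons",
    "how": "Students answered the Student Information Sheet",
}

columns = [
    "class_student_number", "sex", "siblings", "weight_kg", "height_cm",
    "age_of_mother", "daily_allowance", "food_expenditure",
    "text_messages", "most_preferred_color", "sleeping_time", "happiness_index",
]

# The five illustrative rows shown in this lesson.
rows = [
    (1, "M", 2, 60, 156, 60, 200, 150, 20, "RED", "23:00", 8),
    (2, "F", 5, 63, 160, 66, 300, 200, 25, "PINK", "22:00", 9),
    (3, "F", 3, 65, 165, 59, 250, 50, 15, "BLUE", "20:00", 7),
    (4, "M", 1, 55, 160, 55, 200, 100, 30, "BLACK", "19:00", 6),
    (5, "M", 0, 65, 167, 45, 350, 300, 35, "BLUE", "20:00", 8),
]

# Pair every value with its column name so each number carries its meaning.
records = [dict(zip(columns, row)) for row in rows]

print(context["who"])
print(records[0])
```

Without the column names and the context, the value 60 in the first row could be a weight, an age, or an amount in pesos; with them, each value can be read unambiguously.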

KEY POINTS
• Providing correct information in a government data collection activity is a responsibility of every citizen in the country.
• Data confidentiality is important in a data collection activity.
• Census is collecting data from all possible respondents.
• Data to be collected must be clarified before the actual data collection.
• Data must be contextualized by answering six W-questions.

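ATTACHMENT A: CLASS LIST

| STUDENT NAME | CLASS STUDENT NUMBER | STUDENT NAME | CLASS STUDENT NUMBER |
|---|---|---|---|
| 1. | | 36. | |
| 2. | | 37. | |
| 3. | | 38. | |
| 4. | | 39. | |
| 5. | | 40. | |
| 6. | | 41. | |
| 7. | | 42. | |
| 8. | | 43. | |
| 9. | | 44. | |
| 10. | | 45. | |
| 11. | | 46. | |
| 12. | | 47. | |
| 13. | | 48. | |
| 14. | | 49. | |
| 15. | | 50. | |
| 16. | | 51. | |
| 17. | | 52. | |
| 18. | | 53. | |
| 19. | | 54. | |
| 20. | | 55. | |
| 21. | | 56. | |
| 22. | | 57. | |
| 23. | | 58. | |
| 24. | | 59. | |
| 25. | | 60. | |
| 26. | | 61. | |
| 27. | | 62. | |
| 28. | | 63. | |
| 29. | | 64. | |
| 30. | | 65. | |
| 31. | | 66. | |
| 32. | | 67. | |
| 33. | | 68. | |
| 34. | | 69. | |
| 35. | | 70. | |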

ATTACHMENT B: STUDENT INFORMATION SHEET

Instruction to the Students: Please provide completely the following information. Your teacher is available to respond to your queries regarding the items in this information sheet, if you have any. Rest assured that the information that you will be providing will only be used in our lessons in Statistics and Probability.

1. CLASS STUDENT NUMBER: ______________
2. SEX (Put a check mark, √): ____Male ____Female
3. NUMBER OF SIBLINGS: _____
4. WEIGHT (in kilograms): ______________
5. HEIGHT (in centimeters): ______
6. AGE OF MOTHER (as of her last birthday in years): ________ (If your mother is deceased, provide the age she would have had if she were alive)
7. USUAL DAILY ALLOWANCE IN SCHOOL (in pesos): _________________
8. USUAL DAILY FOOD EXPENDITURE IN SCHOOL (in pesos): ___________
9. USUAL NUMBER OF TEXT MESSAGES SENT IN A DAY: ______________
10. MOST PREFERRED COLOR (Put a check mark, √. Choose only one):
____WHITE ____YELLOW ____BROWN
____RED ____PINK ____GREEN
____BLUE ____GRAY ____BLACK
____ORANGE ____PEACH ____PURPLE
11. USUAL SLEEPING TIME (on weekdays): ______________
12. HAPPINESS INDEX FOR THE DAY: On a scale from 1 (very unhappy) to 10 (very happy), how do you feel today? ______

ATTACHMENT C: CLASS RECORDING SHEET (for the Teacher’s Use)

| Class Student Number | Sex | Number of siblings (in person) | Weight (in kg) | Height (in cm) | Age of mother (in years) | Usual daily allowance in school (in pesos) | Usual daily food expenditure in school (in pesos) | Usual number of text messages sent in a day | Most Preferred Color | Usual Sleeping Time | Happiness Index for the Day |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | | | | | | | | | | | |

CHAPTER 1: EXPLORING DATA Lesson 3: Basic Terms in Statistics TIME FRAME: 60 minutes

OVERVIEW OF LESSON
As a continuation of Lesson 2 (where we contextualized data), in this lesson we define basic terms in statistics as we continue to explore data. These basic terms include the universe, variable, population and sample. We will also discuss in detail other concepts related to a variable.

LEARNING OUTCOME(S): At the end of the lesson, the learner is able to
• Define universe and differentiate it from population; and
• Define and differentiate between qualitative and quantitative variables, and between discrete and continuous variables (that are quantitative).

LESSON OUTLINE:
1. Recall previous lesson on ‘Contextualizing Data’
2. Definition of Basic Terms in Statistics (universe, variable, population and sample)
3. Broad Classification of Variables (qualitative and quantitative, discrete and continuous)

REFERENCES
Albert, J. R. G. (2008). Basic Statistics for the Tertiary Level (ed. Roberto Padua, Welfredo Patungan, Nelia Marquez), published by Rex Bookstore.
Handbook of Statistics 1 (1st and 2nd Edition), Authored by the Faculty of the Institute of Statistics, UP Los Baños, College Laguna 4031
Takahashi, S. (2009). The Manga Guide to Statistics. Trend-Pro Co. Ltd.
Workbooks in Statistics 1 (From 1st to 13th Edition), Authored by the Faculty of the Institute of Statistics, UP Los Baños, College Laguna 4031


DEVELOPMENT OF THE LESSON
A. Recall previous lesson on ‘Contextualizing Data’
Begin by recalling with the students the data they provided in the previous lesson and how they contextualized such data. You could show them the compiled data set in a table like this:

| Class Student Number | Sex | Number of siblings (in person) | Weight (in kg) | Height (in cm) | Age of mother (in years) | Usual daily allowance in school (in pesos) | Usual daily food expenditure in school (in pesos) | Usual number of text messages sent in a day | Most Preferred Color | Usual Sleeping Time | Happiness Index for the Day |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | M | 2 | 60 | 156 | 60 | 200 | 150 | 20 | RED | 23:00 | 8 |
| 2 | F | 5 | 63 | 160 | 66 | 300 | 200 | 25 | PINK | 22:00 | 9 |
| 3 | F | 3 | 65 | 165 | 59 | 250 | 50 | 15 | BLUE | 20:00 | 7 |
| 4 | M | 1 | 55 | 160 | 55 | 200 | 100 | 30 | BLACK | 19:00 | 6 |
| 5 | M | 0 | 65 | 167 | 45 | 350 | 300 | 35 | BLUE | 20:00 | 8 |
| : | : | : | : | : | : | : | : | : | : | : | : |

Recall also their response to the first W of the data, that is, to the question “Who provided the data?” We said last time that the students of the class provided the data, or that the data were taken from the students. Another W of the data is What: What information was obtained from the respondents, and what is the unit of measurement used for each piece of information (if there is any)? Our responses are the following:
• The information gathered includes Class Student Number, Sex, Number of Siblings, Weight, Height, Age of Mother, Usual Daily Allowance in School, Usual Daily Food Expenditure in School, Usual Number of Text Messages Sent in a Day, Most Preferred Color, Usual Sleeping Time and Happiness Index.
• The units of measurement for the information on Number of Siblings, Weight, Height, Age of Mother, Usual Daily Allowance in School, Usual Daily Food Expenditure in School, and Usual Number of Text Messages Sent in a Day are person, kilogram, centimeter, year, pesos, pesos and message, respectively.

B. Main Lesson
1. Definition of Basic Terms
The collection of respondents from whom one obtains the data is called the universe of the study. In our illustration, the set of students of this Statistics and Probability class is our universe. But we must caution the students that a universe is not necessarily composed of people, since there are studies where the observations are taken from plants or animals, or even from non-living things such as buildings, vehicles, farms, etc. So formally, we define the universe as the collection or set of units or entities from which we got the data. Thus, this set of units answers the first W of data contextualization.

On the other hand, the pieces of information we asked from the students are referred to as the variables of the study. In the data collection activity, we have 12 variables, including Class Student Number. A variable is a characteristic that is observable or measurable in every unit of the universe. From each student of the class, we got his/her sex, number of siblings, weight, height, age of mother, usual daily allowance in school, usual daily food expenditure in school, usual number of text messages sent in a day, most preferred color, usual sleeping time and happiness index for the day. Since these characteristics are observable in each and every student of the class, they are referred to as variables.

The set of all possible values of a variable is referred to as a population. Thus, for each variable we observed, we have a population of values. The number of populations in a study will be equal to the number of variables observed. In the data collection activity we had, there are 12 populations corresponding to 12 variables.

A subgroup of a universe or of a population is a sample. There are several ways to take a sample from a universe or a population, and the way we draw the sample dictates the kind of analysis we do with our data. We can further visualize these terms in the following figure:

[Figure: The UNIVERSE consists of Unit 1, Unit 2, Unit 3, ..., Unit N. Each variable (VARIABLE 1, VARIABLE 2, ..., VARIABLE 12) yields a population of values (Value 1, Value 2, Value 3, ..., Value N). A SAMPLE is either a sample of units (Unit 1, ..., Unit n) or a sample of population values (Value 1, ..., Value n).]

Figure 3.1 Visualization of the relationship among universe, variable, population and sample.
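The four terms can also be traced in a short Python sketch. It uses the weights of the five illustrative students from the recalled table as a stand-in for the whole class; the variable names and the use of `random.Random(0).sample` to draw three units are assumptions made for this illustration, since how to draw a sample properly is the subject of later lessons.

```python
import random

# Universe: the units from which data were taken (class student numbers).
universe = [1, 2, 3, 4, 5]

# Variable: a characteristic observed in every unit, here weight in kg.
weight_kg = {1: 60, 2: 63, 3: 65, 4: 55, 5: 65}

# Population: the values of that variable over the whole universe.
population_of_weight = [weight_kg[unit] for unit in universe]

# Sample: a subgroup of the universe, and the corresponding
# subgroup of the population of values.
rng = random.Random(0)
sample_of_units = rng.sample(universe, 3)
sample_of_values = [weight_kg[unit] for unit in sample_of_units]

print("universe:", universe)
print("population of weight:", population_of_weight)
print("sample of units:", sample_of_units)
print("sample of values:", sample_of_values)
```

Adding a second variable, such as height, would add a second population over the same single universe.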

2. Broad Classification of Variables
Following up on the concept of a variable, inform the students that a variable usually takes on several values. Occasionally, however, a variable can assume only one value, in which case it is called a constant. For instance, in a class of fifteen-year-olds, the age in years of students is constant. Variables can be broadly classified as either qualitative or quantitative, with the latter further classified into discrete and continuous types (see Figure 3.3 below).

Figure 3.3 Broad Classification of Variables

(i) Qualitative variables express a categorical attribute, such as sex (male or female), religion, marital status, region of residence, and highest educational attainment. Qualitative variables do not strictly take on numeric values (although we can have numeric codes for them, e.g., for the sex variable, 1 and 2 may refer to male and female, respectively). Qualitative data answer questions of “what kind.” Sometimes there is a sense of ordering in qualitative data, e.g., income data grouped into high, middle and low-income status. Data on sex or religion do not have a sense of ordering, as there is no such thing as a weaker or stronger sex, or a better or worse religion. Qualitative variables are sometimes referred to as categorical variables.

(ii) Quantitative (otherwise called numerical) data, whose sizes are meaningful, answer questions such as “how much” or “how many”. Quantitative variables have actual units of measure. Examples of quantitative variables include the height, weight, number of registered cars, household size, and total household expenditures/income of survey respondents. Quantitative data may be further classified into:

a. Discrete data are those data that can be counted, e.g., the number of days for cellphones to fail, the ages of survey respondents measured to the nearest year, and the number of patients in a hospital. These data assume only a finite or countably infinite number of values.

b. Continuous data are those that can be measured, e.g., the exact height of a survey respondent and the exact volume of some liquid substance. The possible values are uncountably infinite.

With this classification, let us then test the understanding of our students by asking them to classify the variables we had in our last data gathering activity. They should be able to classify these variables as qualitative or quantitative, and furthermore as discrete or continuous. If they did it right, you have the following:

| VARIABLE | TYPE OF VARIABLE | TYPE OF QUANTITATIVE VARIABLE |
|---|---|---|
| Class Student Number | Qualitative | |
| Sex | Qualitative | |
| Number of Siblings | Quantitative | Discrete |
| Weight (in kilograms) | Quantitative | Continuous |
| Height (in centimeters) | Quantitative | Continuous |
| Age of Mother | Quantitative | Discrete |
| Usual Daily Allowance in School (in pesos) | Quantitative | Discrete |
| Usual Daily Food Expenditure in School (in pesos) | Quantitative | Discrete |
| Usual Number of Text Messages Sent in a Day | Quantitative | Discrete |
| Usual Sleeping Time | Qualitative | |
| Most Preferred Color | Qualitative | |
| Happiness Index for the Day | Qualitative | |

Special Note: For quantitative data, arithmetical operations have some physical interpretation. One can add 301 and 302 if these have quantitative meanings, but if these numbers refer to room numbers, then adding them does not make any sense. Taking numerical values does not by itself make a variable quantitative! The issue is whether performing arithmetical operations on these data would make any sense. It would certainly not make sense to sum two zip codes or multiply two room numbers.
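The point of the special note can be made concrete by recording the type of each variable and allowing arithmetic only on the quantitative ones. The sketch below follows the classification of the twelve class variables given in this lesson; the dictionary `VARIABLE_TYPES`, its key names, and the helper `average` are hypothetical names chosen for illustration.

```python
from statistics import mean

# Classification of the class variables as given in this lesson.
VARIABLE_TYPES = {
    "class_student_number": "qualitative",
    "sex": "qualitative",
    "siblings": "discrete",
    "weight_kg": "continuous",
    "height_cm": "continuous",
    "age_of_mother": "discrete",
    "daily_allowance": "discrete",
    "food_expenditure": "discrete",
    "text_messages": "discrete",
    "sleeping_time": "qualitative",
    "most_preferred_color": "qualitative",
    "happiness_index": "qualitative",
}

def average(name, values):
    """Average the values only if the variable is quantitative."""
    if VARIABLE_TYPES[name] == "qualitative":
        raise ValueError(f"arithmetic on {name} has no meaningful interpretation")
    return mean(values)

# Weights of the five illustrative students: averaging makes sense.
print(average("weight_kg", [60, 63, 65, 55, 65]))

# Class student numbers are numerals used as labels: averaging is refused.
try:
    average("class_student_number", [1, 2, 3, 4, 5])
except ValueError as error:
    print(error)
```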

KEY POINTS
• A universe is a collection of units from which the data were gathered.
• A variable is a characteristic we observed or measured from every element of the universe.
• A population is a set of all possible values of a variable.
• A sample is a subgroup of a universe or a population.
• In a study there is only one universe but there could be several populations.
• Variables could be classified as qualitative or quantitative, and the latter could be further classified as discrete or continuous.

ASSESSMENT
Note: Answers are provided inside the parentheses and italicized.

1. A market research company requested all teachers of a particular school to fill out a questionnaire in relation to their product market study. The following is some of the information supplied by the teachers:
• highest educational attainment
• predominant hair color
• body temperature
• civil status
• brand of laundry soap being used
• total household expenditures last month in pesos
• number of children in the household
• number of hours standing in queue while waiting to be served by a bank teller
• amount spent on rice last week by the household
• distance travelled by the teacher in going to school
• time (in hours) consumed on Facebook on a particular day

a. If we are to consider the collection of information gathered through the completed questionnaire, what is the universe for this data set? (The universe is the set of all teachers in that school.)

b. Which of the variables are qualitative? Which are quantitative? Among the quantitative variables, classify them further as discrete or continuous.
• highest educational attainment (qualitative)
• predominant hair color (qualitative)
• body temperature (quantitative: continuous)
• civil status (qualitative)
• brand of laundry soap being used (qualitative)
• total household expenditures last month in pesos (quantitative: discrete)
• number of children in a household (quantitative: discrete)
• number of hours standing in queue while waiting to be served by a bank teller (quantitative: discrete)
• amount spent on rice last week by a household (quantitative: discrete)
• distance travelled by the teacher in going to school (quantitative: continuous)
• time (in hours) consumed on Facebook on a particular day (quantitative: continuous)

c. Give at least two populations that could be observed from the variables identified in (b). (Possible answer: One population is the set of all values of the highest educational attainment, and another population is {single, married, divorced, separated, widow/widower}.)

2. The Engineering Department of a big city did a listing of all buildings in their locality. If you are planning to gather the characteristics of these buildings,
a. What is the universe of this data collection activity? (Set of all buildings in the big city)
b. What are the crucial variables to observe? It would also be better if you could classify each variable as qualitative or quantitative, and further classify each quantitative variable as discrete or continuous. (A possible answer is the number of floors in the building: quantitative, discrete.)

3. A survey of students in a certain school is conducted. The survey questionnaire details the information on the following variables. For each of these variables, identify whether the variable is qualitative or quantitative, and if the latter, state whether it is discrete or continuous.
a. number of family members who are working (quantitative: discrete)
b. ownership of a cell phone among family members (qualitative)
c. length (in minutes) of longest call made on each cell phone owned per month (quantitative: continuous)
d. ownership/rental of dwelling (qualitative)
e. amount spent in pesos on food in one week (quantitative: discrete)
f. occupation of household head (qualitative)
g. total family income (quantitative: discrete)
h. number of years of schooling of each family member (quantitative: discrete)
i. access of family members to social media (qualitative)
j. amount of time last week spent by each family member using the internet (quantitative: continuous)

Explanatory Note:
• Teachers have the option to just ask this assessment orally to the entire class, or to group students and ask them to identify answers, or to give this as homework, or to use some questions/items here for a chapter examination.

CHAPTER 1: EXPLORING DATA
Lesson 4: Levels of Measurement

TIME FRAME: 60 minutes

OVERVIEW OF LESSON
In this lesson we discuss the different levels of measurement as we continue to explore data. Knowing these levels will enable us to plan the data collection process we need to employ in order to gather the appropriate data for analysis.

LEARNING OUTCOME(S): At the end of the lesson, the learner is able to identify and differentiate the different levels of measurement and methods of data collection.

LESSON OUTLINE:
1. Motivational Activity
2. Levels of Measurement
3. Data Collection Methods

REFERENCES
Albert, J. R. G. (2008). Basic Statistics for the Tertiary Level (ed. Roberto Padua, Welfredo Patungan, Nelia Marquez), published by Rex Bookstore.
Handbook of Statistics 1 (1st and 2nd Edition), Authored by the Faculty of the Institute of Statistics, UP Los Baños, College Laguna 4031
Takahashi, S. (2009). The Manga Guide to Statistics. Trend-Pro Co. Ltd.
Workbooks in Statistics 1 (From 1st to 13th Edition), Authored by the Faculty of the Institute of Statistics, UP Los Baños, College Laguna 4031

DEVELOPMENT OF THE LESSON

A. Motivational Activity

Ask the students first if they believe the following statement:

“Students who eat a healthy breakfast will do best on a quiz, students who eat an unhealthy breakfast will get an average performance, and students who do not eat anything for breakfast will do the worst on a quiz.”

You could further ask one or more students who have different answers to defend their answers. Then challenge the students to apply a statistical process to investigate the validity of this statement. You could enumerate on the board the steps in the process, such as the following:
1. Plan or design the collection of data to verify the validity of the statement in a way that maximizes information content and minimizes bias;
2. Collect the data as required in the plan;
3. Verify the quality of the data after it was collected;
4. Summarize the information extracted from the data; and
5. Examine the summary statistics so that insight and meaningful information can be produced to support your decision whether to believe or not the given statement.

Let us discuss in detail the first step. In planning or designing the data collection activity, we could consider the set of all the students in the class as our universe. Then let us identify the variables we need to observe or measure to verify the validity of the statement. You may ask the students to participate in the discussion by asking them to identify a question to get the needed data. The following are some suggested queries:
1. Do you usually have breakfast before going to school? (Note: This is answerable by Yes or No.)
2. What do you usually have for breakfast? (Note: Possible responses for this question are rice, bread, banana, oatmeal, cereal, etc.)

The responses to Question Numbers 1 and 2 could lead us to identify whether a student in the class had a healthy breakfast, an unhealthy breakfast or no breakfast at all.

Furthermore, there is a need to determine the performance of the student in a quiz on that day. The score in the quiz could be used to identify the student’s performance as best, average or worst. As we describe the data collection process to verify the validity of the statement, there is also a need to include the levels of measurement for the variables of interest.

B. Main Lesson

1. Levels of Measurement

Inform students that there are four levels of measurement of variables: nominal, ordinal, interval and ratio. These are hierarchical in nature and are described as follows:

Nominal level of measurement arises when we have variables that are categorical and non-numeric or where the numbers have no sense of ordering. As an example, consider the numbers on the uniforms of basketball players. Is the player wearing number 7 a worse player than the player wearing number 10? Maybe, or maybe not, but the number on the uniform does not have anything to do with their performance. The numbers on the uniform merely help identify the basketball player. Other examples of variables measured at the nominal level include sex, marital status and religious affiliation. For the study on the validity of the statement regarding the effect of breakfast on school performance, students who responded Yes to Question Number 1 can be coded 1, while those who responded No can be coded 0. The numbers used are simply numerical codes, and cannot be used for ordering or for any mathematical computation.

Ordinal level also deals with categorical variables like the nominal level, but in this level ordering is important, that is, the values of the variable could be ranked. For the study on the validity of the statement regarding the effect of breakfast on school performance, students who had a healthy breakfast can be coded 1, those who had an unhealthy breakfast 2, and those who had no breakfast at all 3. Using the codes, the responses could be ranked. Thus, the students who had a healthy breakfast are ranked first while those who had no breakfast at all are ranked last in terms of having a healthy breakfast. The numerical codes here have a meaningful sense of ordering: unlike the numbers on basketball uniforms, they suggest that one student is having a healthier breakfast than another student. Other examples of the ordinal scale include socio-economic status (A to E, where A is wealthy and E is poor), difficulty of questions in an exam (easy, medium, difficult), rank in a contest (first place, second place, etc.), and perceptions in Likert scales.

Note to Teacher: Let us also emphasize to the students that while there is a sense of ordering, there is no zero point in an ordinal scale. In addition, there is no way to find out how much “distance” there is between one category and another. In a scale from 1 to 10, the difference between 7 and 8 may not be the same as the difference between 1 and 2.

Interval level tells us that one unit differs by a certain amount of degree from another unit. Knowing how much one unit differs from another is an additional property of the interval level on top of having the properties possessed by the ordinal level. When measuring temperature in Celsius, a 10 degree difference has the same meaning anywhere along the scale: the difference between 10 and 20 degrees Celsius is the same as that between 80 and 90 degrees Celsius. But we cannot say that 80 degrees Celsius is twice as hot as 40 degrees Celsius, since there is no true zero, only an arbitrary zero point. A measurement of 0 degrees Celsius does not reflect a true "lack of temperature." Thus, the Celsius scale is at the interval level. Another example of a variable measured at the interval level is the Intelligence Quotient (IQ) of a person. We can tell not only which person ranks higher in IQ but also how much higher he or she ranks than another, but zero IQ does not mean no intelligence. The students could also be classified or categorized according to their IQ level. Hence, IQ as measured at the interval level also has the properties of variables measured at the ordinal and nominal levels.

Special Note: Inform also the students that the interval level allows addition and subtraction operations, but it does not possess an absolute zero. Zero is arbitrary as it does not mean the value does not exist. Zero only represents an additional measurement point.

Ratio level also tells us that one unit has so many times as much of the property as does another unit. The ratio level possesses a meaningful (unique and non-arbitrary) absolute, fixed zero point and allows all arithmetic operations. The existence of the zero point is the only difference between the ratio and interval levels of measurement. Examples of the ratio scale include mass, heights, weights, energy and electric charge. With mass as an example, the difference between 120 grams and 135 grams is 15 grams, and this is the same as the difference between 380 grams and 395 grams. Equal differences mean the same thing anywhere along the scale, and a measurement of 0 reflects a complete lack of mass. Amount of money is also at the ratio level. We can say that 2,000 pesos is twice as much as 1,000 pesos. In addition, money has a true zero point: if you have zero money, this implies the absence of money. For the study on the validity of the statement regarding the effect of breakfast on school performance, the student’s score in the quiz is measured at the ratio level. A score of zero implies that the student did not get a correct answer at all.

In summary, we have the following levels of measurement:

Level      Property                                             Basic Empirical Operation
Nominal    No order, distance, or origin                        Determination of equivalence
Ordinal    Has order but no distance or unique origin           Determination of greater or lesser values
Interval   Both with order and distance but no unique origin    Determination of equality of intervals or difference
Ratio      Has order, distance and unique origin                Determination of equality of ratios or means
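The difference between the interval and ratio levels can be checked with a little arithmetic. The following is a minimal Python sketch, not part of the original guide: the conversion to Fahrenheit and the function names are assumptions introduced only for illustration. It shows that equal differences in Celsius stay equal after a change of scale, while the "twice as hot" ratio does not survive, whereas a ratio of masses is the same in any unit.

```python
def celsius_to_fahrenheit(celsius):
    """Convert a Celsius reading to Fahrenheit (illustrative helper)."""
    return celsius * 9 / 5 + 32


def kilograms_to_grams(kilograms):
    """Convert a mass in kilograms to grams (illustrative helper)."""
    return kilograms * 1000


# Interval level: equal differences remain equal on another temperature scale.
diff_low = celsius_to_fahrenheit(20) - celsius_to_fahrenheit(10)
diff_high = celsius_to_fahrenheit(90) - celsius_to_fahrenheit(80)

# Ratios are not preserved, because the zero point of each scale is arbitrary.
ratio_celsius = 80 / 40
ratio_fahrenheit = celsius_to_fahrenheit(80) / celsius_to_fahrenheit(40)

# Ratio level: mass has a true zero, so the ratio is the same in any unit.
ratio_kg = 2 / 1
ratio_g = kilograms_to_grams(2) / kilograms_to_grams(1)

print("Equal intervals preserved:", diff_low == diff_high)
print("Celsius ratio:", ratio_celsius, "Fahrenheit ratio:", round(ratio_fahrenheit, 2))
print("Mass ratio preserved:", ratio_kg == ratio_g)
```

The same readings give a ratio of 2 in Celsius but about 1.69 in Fahrenheit, which is why statements such as "twice as hot" are not meaningful at the interval level.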

The levels of measurement depend mainly on the method of measurement, not on the property measured. The weight of primary school students measured in kilograms is at the ratio level, but the students can also be categorized as overweight, normal or underweight, in which case weight is measured at the ordinal level. Also, many scales are only at the interval level because their zero point is arbitrarily chosen.

To assess the students’ understanding of the lesson, you may go back to the set of variables in the data gathering activity done in Lesson 2. You could ask the students to identify the level of measurement for each of the variables. If they did it right, you have the following:

VARIABLE                                               LEVEL OF MEASUREMENT
Class Student Number                                   Nominal
Sex                                                    Nominal
Number of Siblings                                     Ratio
Weight (in kilograms)                                  Ratio
Height (in centimeters)                                Ratio
Age of Mother                                          Ratio
Usual Daily Allowance in School (in pesos)             Ratio
Usual Daily Food Expenditure in School (in pesos)      Ratio
Usual Number of Text Messages Sent in a Day            Ratio
Usual Sleeping Time                                    Nominal
Most Preferred Color                                   Nominal
Happiness Index for the Day                            Ordinal

2. Methods of Data Collection

Variables are observed or measured using any of the three methods of data collection, namely: objective, subjective and use of existing records. The objective and subjective methods obtain the data directly from the source. The former uses any or a combination of the five senses (sight, touch, hearing, taste and smell) to measure the variable, while the latter obtains data by getting responses through a questionnaire. The resulting data from these two methods of data collection are referred to as primary data. The data gathered in Lesson 2 are primary data and were obtained using the subjective method.

On the other hand, secondary data are obtained through the use of existing records or data collected by other entities for certain purposes. For example, when we use data gathered by the Philippine Statistics Authority, we are using secondary data, and the method we employ to get the data is the use of existing records. Other data sources include administrative records, news articles, the internet, and the like. However, we must emphasize to the students that when we use existing data we must be confident of the quality of the data we are using by knowing how the data were gathered. Also, we must remember to request permission and acknowledge the source of the data when using data gathered by other agencies or people.

KEY POINTS
• Four levels of measurement: Nominal, Ordinal, Interval and Ratio
• Knowing at what level the variable was measured or observed will guide us in choosing the type of analysis to apply.
• Three methods of data collection include objective, subjective and use of existing records.
• Using the data collection method as basis, data can be classified as either primary or secondary data.

ASSESSMENT
Note: Answers are provided inside the parentheses and in bold face.

1. Using the data of the teachers in a particular school gathered by a market research company, identify the level of measurement for each of the following variables.
• highest educational attainment (ordinal)
• predominant hair color (nominal)
• body temperature (interval)
• civil status (nominal)
• brand of laundry soap being used (nominal)
• total household expenditures last month in pesos (ratio)
• number of children in a household (ratio)
• number of hours standing in queue while waiting to be served by a bank teller (ratio)
• amount spent on rice last week by a household (ratio)
• distance travelled by the teacher in going to school (ratio)
• time (in hours) consumed on Facebook on a particular day (ratio)

2. The following variables are included in a survey conducted among students in a certain school. Identify the level of measurement for each of the variables.
a. number of family members who are working (ratio)
b. ownership of a cell phone among family members (nominal)
c. length (in minutes) of longest call made on each cell phone owned per month (ratio)
d. ownership/rental of dwelling (nominal)
e. amount spent in pesos on food in one week (ratio)
f. occupation of household head (nominal)
g. total family income (ratio)
h. number of years of schooling of each family member (ratio)
i. access of family members to social media (nominal)
j. amount of time last week spent by each family member using the internet (ratio)

3. In the following, identify the data collection method used and the type of resulting data.
a. The website of Philippine Airlines provides a questionnaire instrument that can be answered electronically. (subjective method, primary data)
b. The latest series of the Consumer Price Index (CPI) generated by the Philippine Statistics Authority was downloaded from the PSA website. (use of existing record, secondary data)
c. A reporter recorded the number of minutes to travel from one end to another of the Metro Manila Rail Transit (MRT) during peak and off-peak hours. (objective method, primary data)
d. Students getting the height of the plants using a meter stick. (objective method, primary data)
e. A PSA enumerator conducting the Labor Force Survey goes around the country to interview household heads on employment-related variables. (subjective method, primary data)

CHAPTER 1: EXPLORING DATA
Lesson 5: Data Presentation

TIME FRAME: 60 minutes

OVERVIEW OF LESSON
In this lesson we enrich what the students have already learned from Grade 1 to 10 about presenting data. Additional concepts could help the students to appropriately describe further the data set.

LEARNING OUTCOME(S): At the end of the lesson, the learner is able to identify and use the appropriate method of presenting information from a data set effectively.

LESSON OUTLINE:
1. Review of Lessons in Data Presentation taken up from Grade 1 to 10
2. Methods of Data Presentation
3. The Frequency Distribution Table and Histogram

REFERENCES
Albert, J. R. G. (2008). Basic Statistics for the Tertiary Level (ed. Roberto Padua, Welfredo Patungan, Nelia Marquez), published by Rex Bookstore.
Handbook of Statistics 1 (1st and 2nd Edition), Authored by the Faculty of the Institute of Statistics, UP Los Baños, College Laguna 4031
Takahashi, S. (2009). The Manga Guide to Statistics. Trend-Pro Co. Ltd.
Workbooks in Statistics 1 (From 1st to 13th Edition), Authored by the Faculty of the Institute of Statistics, UP Los Baños, College Laguna 4031

DEVELOPMENT OF THE LESSON

A. Review of Lessons in Data Presentation taken up from Grade 1 to 10

You could assist the students to recall what they have learned in Grade 1 to 10 regarding data presentation by asking them to participate in an activity. The activity is called ‘Toss the Ball’. This is actually a review and wake-up exercise. Toss a ball to a student and he/she will give the most important concept he/she learned about data presentation. You may list their responses on the board. You could summarize their responses to establish what they already know about data presentation techniques, and from this you could build other concepts on the topic. A suggestion is to classify their answers according to the three methods of data presentation, i.e. textual, tabular and graphical. A possible listing will be something like this:

Textual or Narrative Presentation:
• Detailed information is given in textual presentation.
• A narrative report is a way to present data.

Tabular Presentation:
• Numerical values are presented using tables.
• Information is lost in tabular presentation of data.
• A frequency distribution table is also applicable for qualitative variables.

Graphical Presentation:
• Trends are easily seen in graphs compared to tables.
• It is good to present data using pictures or figures like the pictograph.
• Pie charts are used to present data as part of one whole.
• Line graphs are for time-series data.
• It is better to present data using graphs than tables as they are much better to look at.

B. Main Lesson

1. Methods of Data Presentation

You could inform the students that in general there are three methods to present data. Two or all of these three methods could be used at the same time to present appropriately the information from the data set. These methods include the (1) textual or narrative; (2) tabular; and (3) graphical method of presentation.

In presenting the data in textual or paragraph or narrative form, one describes the data by enumerating some of the highlights of the data set, like giving the highest, lowest or the average values. In case there are only a few observations, say fewer than ten, the values could be enumerated if there is a need to do so. An example is shown below:

The country’s poverty incidence among families as reported by the Philippine Statistics Authority (PSA), the agency mandated to release official poverty statistics, decreased from 21% in 2006 to 19.7% in 2012. For 2012, the regional estimates released by PSA indicate that the Autonomous Region of Muslim Mindanao (ARMM) is the poorest region, with poverty incidence among families estimated at 48.7%. The region with the smallest estimated poverty incidence among families, at 2.6%, is the National Capital Region (NCR).

Data could also be summarized or presented using tables. The tabular method of presentation is applicable for large data sets. Trends could easily be seen in this kind of presentation. However, there is a loss of information when using such kind of presentation. The frequency distribution table is the usual tabular form of presenting the distribution of the data. The following are the common parts of a statistical table:
a. Table title includes the number and a short description of what is found inside the table.
b. Column header provides the label of what is being presented in a column.
c. Row header provides the label of what is being presented in a row.
d. Body is the information in the cells where the rows and columns intersect.

In general, a table should have at least three rows and/or three columns. However, conveying too much information in a table is also not advisable. Tables are usually used in written technical reports and in oral presentations. Table 5.1 is an example of presenting data in tabular form. This example was taken from 2015 Philippine Statistics in Brief, a regular publication of the PSA, which is also the basis for the example of the textual presentation given above.

Table 5.1 Regional estimates of poverty incidence among families (in percent) based on the Family Income and Expenditures Survey conducted on the same year of reporting.

Region     2006    2009    2012
NCR         2.9     2.4     2.6
CAR        21.1    19.2    17.5
I          19.9    16.8    14.0
II         21.7    20.2    17.0
III        10.3    10.7    10.1
IV A        7.8     8.8     8.3
IV B       32.4    27.2    23.6
V          35.4    35.3    32.3
VI         22.7    23.6    22.8
VII        30.7    26.0    25.7
VIII       33.7    34.5    37.4
IX         40.0    39.5    33.7
X          32.1    33.3    32.8
XI         25.4    25.5    25.0
XII        31.2    30.8    37.1
Caraga     41.7    46.0    31.9
ARMM       40.5    39.9    48.7
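The highlights quoted in the textual presentation (the regions with the highest and lowest 2012 estimates) can be read off the table programmatically. The following is a minimal Python sketch, not part of the original guide; the dictionary name and the helper function are assumptions introduced only for illustration, while the figures are the 2012 column of Table 5.1.

```python
# 2012 column of Table 5.1: poverty incidence among families, in percent.
poverty_2012 = {
    "NCR": 2.6, "CAR": 17.5, "I": 14.0, "II": 17.0, "III": 10.1,
    "IV A": 8.3, "IV B": 23.6, "V": 32.3, "VI": 22.8, "VII": 25.7,
    "VIII": 37.4, "IX": 33.7, "X": 32.8, "XI": 25.0, "XII": 37.1,
    "Caraga": 31.9, "ARMM": 48.7,
}


def highlights(estimates):
    """Return the (region, value) pairs with the highest and lowest estimates."""
    highest = max(estimates.items(), key=lambda item: item[1])
    lowest = min(estimates.items(), key=lambda item: item[1])
    return highest, lowest


highest, lowest = highlights(poverty_2012)
print(f"Highest: {highest[0]} at {highest[1]}%")
print(f"Lowest: {lowest[0]} at {lowest[1]}%")
```

The sketch deliberately does not average the regional figures: a simple average of regional estimates is not the national estimate, since regions differ in the number of families.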

Graphical presentation on the other hand, is a visual presentation of the data. Graphs are commonly used in oral presentation. There are several forms of graphs to use like the pie chart, pictograph, bar graph, line graph, histogram and box-plot. Which form to use depends on what information is to be relayed. For example, trends across time are easily seen using a line graph. However, values of variables in nominal or ordinal levels of measurement should not be presented using line graph. Rather a bar graph is more appropriate to use. A graphical presentation in the form of vertical bar graph of the 2012 regional estimates of poverty incidence among families is shown below:

[Vertical bar graph: poverty incidence among families in percent (vertical axis, 0 to 60) for each region from NCR to ARMM]

Figure 5.1 2012 Regional poverty incidence among families (2012 FIES).

Other examples of graphical presentations shown below are lifted from the Handbook of Statistics 1 (listed in the reference section at the end of this Teaching Guide).

Figure 5.2. Percentage distribution of dogs according to groupings identified in a dog show.

Figure 5.3. Distribution of fruits sales of a store for two days.


Figure 5.4 Weapons arrest rate from 1965 to 1992 by age of offender.

[Scatter plot: weight in kg (vertical axis, 30 to 80) against height in cm (horizontal axis, 110 to 190)]

Figure 5.5. Height and weight of STAT 1 students registered during the previous term.

2. The Frequency Distribution Table and Histogram

A special type of tabular and graphical presentation is the frequency distribution table (FDT) and its corresponding histogram. Specifically, these are used to depict the distribution of the data. Most of the time, these are used in technical reports. An FDT is a presentation containing non-overlapping categories or classes of a variable and the frequencies or counts of the observations falling into the categories or classes. There are two types of FDT according to the type of data being organized: a qualitative FDT or a quantitative FDT. For a qualitative FDT, the non-overlapping categories of the variable are identified, and frequencies, as well as the percentages of observations falling into the categories, are computed. A quantitative FDT, on the other hand, is of two types: ungrouped and grouped. An ungrouped FDT is constructed when there are only a few observations or if the data set contains only a few possible values. A grouped FDT is constructed when there is a large number of observations and when the data set involves many possible values. The distinct values are grouped into class intervals. The creation of columns for a grouped FDT follows a set of guidelines. One such procedure is described in the following steps, which is lifted from the Workbook in Statistics 1 (listed in the reference section at the end of this Teaching Guide).

Steps in the construction of a grouped FDT
1. Identify the largest data value or the maximum (MAX) and smallest data value or the minimum (MIN) from the data set and compute the range, R. The range is the difference between the largest and smallest value, i.e. R = MAX – MIN.
2. Determine the number of classes, k, using k = √N, where N is the total number of observations in the data set. Round off k to the nearest whole number. It should be noted that the computed k might not be equal to the actual number of classes constructed in an FDT.
3. Calculate the class size, c, using c = R/k. Round off c to the nearest value with the same precision as the raw data.
4. Construct the classes or the class intervals. A class interval is defined by a lower limit (LL) and an upper limit (UL). The LL of the lowest class is usually the MIN of the data set. The LLs of the succeeding classes are then obtained by adding c to the LL of the preceding classes. The UL of the lowest class is obtained by subtracting one unit of measure (1/10^x, where x is the maximum number of decimal places observed from the raw data) from the LL of the next class. The ULs of the succeeding classes are then obtained by adding c to the UL of the preceding classes. The lowest class should contain the MIN, while the highest class should contain the MAX.
5. Tally the data into the classes constructed in Step 4 to obtain the frequency of each class. Each observation must fall in one and only one class.
6. Add (if needed) the following distributional characteristics:
a. True Class Boundaries (TCB). The TCBs reflect the continuous property of continuous data. A class is defined by a lower TCB (LTCB) and an upper TCB (UTCB). These are obtained by taking the midpoints of the gaps between classes or by using the following formulas: LTCB = LL – 0.5(one unit of measure) and UTCB = UL + 0.5(one unit of measure).
b. Class Mark (CM). The CM is the midpoint of a class and is obtained by taking the average of the lower and upper TCBs, i.e. CM = (LTCB + UTCB)/2.


c. Relative Frequency (RF). The RF refers to the frequency of the class as a fraction of the total frequency, i.e. RF = frequency/N. RF can be computed for both qualitative and quantitative data. RF can also be expressed in percent.
d. Cumulative Frequency (CF). The CF refers to the total number of observations greater than or equal to the LL of the class (>CF) or the total number of observations less than or equal to the UL of the class (<CF).
e. Relative Cumulative Frequency (RCF). The RCF refers to the fraction of the total number of observations greater than or equal to the LL of the class (>RCF) or the fraction of the total number of observations less than or equal to the UL of the class (<RCF).
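Steps 1 to 4, together with the true class boundaries and class marks of Step 6, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Workbook's own procedure: the function name `grouped_classes` and its parameters are hypothetical, decimal arithmetic is used to avoid floating-point drift in the class limits, and ties in rounding c follow Python's default for `Decimal.quantize`, which the guide does not specify. Tallying (Step 5) is left out because it needs the raw data.

```python
import math
from decimal import Decimal


def grouped_classes(minimum, maximum, n, decimals):
    """Build class limits, true class boundaries and class marks.

    minimum, maximum: MIN and MAX of the data, given as strings such as "10.1".
    n: number of observations; decimals: decimal places in the raw data.
    """
    unit = Decimal(10) ** -decimals          # one unit of measure, 1/10^x
    low, high = Decimal(minimum), Decimal(maximum)
    data_range = high - low                  # Step 1: R = MAX - MIN
    k = round(math.sqrt(n))                  # Step 2: k = sqrt(N), rounded
    c = (data_range / k).quantize(unit)      # Step 3: c = R/k at data precision
    if c <= 0:
        raise ValueError("class size must be positive")

    classes = []
    ll = low                                 # Step 4: the lowest LL is the MIN
    while True:
        ul = ll + c - unit                   # UL = next LL minus one unit
        ltcb = ll - unit / 2                 # Step 6a: LTCB = LL - 0.5 unit
        utcb = ul + unit / 2                 # Step 6a: UTCB = UL + 0.5 unit
        cm = (ltcb + utcb) / 2               # Step 6b: class mark
        classes.append({"LL": ll, "UL": ul, "LTCB": ltcb, "UTCB": utcb, "CM": cm})
        if ul >= high:                       # stop once a class contains the MAX
            break
        ll += c
    return classes


# Values from the guide's sample answer: MIN = 10.1, MAX = 73.1, N = 30.
for row in grouped_classes("10.1", "73.1", 30, 1):
    print(row["LL"], row["UL"], row["LTCB"], row["UTCB"], row["CM"])
```

With these inputs the computed k is 5 but six classes are produced, which illustrates the remark in Step 2 that the computed k might not equal the actual number of classes.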

[Blank frequency distribution table with columns for CF (<CF, >CF), RCF (<RCF, >RCF), CM and TCB (LTCB, UTCB)]

Histogram:

Textual presentation: ________________________________________________________________________ ________________________________________________________________________ ________________________________________________________________________ ________________________________________________________________________


Which of the three methods of data presentation do you think is most appropriate to use for the variable chosen in Number 1? Justify your answer.
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________

2. Choose a QUALITATIVE variable from Table 5.2. Construct an appropriate graph. Use labels and a title for the graph.

Give a brief report describing the variable:
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________

Possible Answers:

1. For the quantitative variable gross monthly family income:

R = 73.1 – 10.1 = 63
k = √30 = 5.477 ≈ 5
c = 63/5 = 12.6

Table 1. Distribution of the gross monthly family income (in thousand pesos) of the 30 Batong Malake Senior Citizens Association members who joined the Lakbay-Aral.

LL      UL      F    RF (%)    <CF    >CF    <RCF (%)    >RCF (%)    CM       LTCB     UTCB
10.1    22.6    9    30.00      9     30      30.00      100.00      16.35    10.05    22.65
22.7    35.2    8    26.67     17     21      56.67       70.00      28.95    22.65    35.25
35.3    47.8    7    23.33     24     13      80.00       43.33      41.55    35.25    47.85
47.9    60.4    3    10.00     27      6      90.00       20.00      54.15    47.85    60.45
60.5    73.0    2     6.67     29      3      96.67       10.00      66.75    60.45    73.05
73.1    85.6    1     3.33     30      1     100.00        3.33      79.35    73.05    85.65
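The relative and cumulative frequency columns of Table 1 follow directly from the class frequencies. The following is a minimal Python sketch, not part of the original guide; the function name `frequency_columns` is an assumption introduced for illustration, and percentages are rounded to two decimal places to match the table.

```python
from itertools import accumulate


def frequency_columns(frequencies):
    """Compute RF, <CF, >CF, <RCF and >RCF (percentages) from class frequencies."""
    n = sum(frequencies)
    less_cf = list(accumulate(frequencies))                     # running total from the lowest class
    greater_cf = list(accumulate(reversed(frequencies)))[::-1]  # running total from the highest class

    def percent(count):
        return round(100 * count / n, 2)

    return {
        "RF": [percent(f) for f in frequencies],
        "<CF": less_cf,
        ">CF": greater_cf,
        "<RCF": [percent(cf) for cf in less_cf],
        ">RCF": [percent(cf) for cf in greater_cf],
    }


# Class frequencies of Table 1, from the lowest class to the highest.
columns = frequency_columns([9, 8, 7, 3, 2, 1])
for name, values in columns.items():
    print(name, values)
```

Reading the <RCF column, 56.67% of the members fall in the first two classes, which is the basis for a statement such as "more than half have income of at most 35,250 pesos".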

Histogram:
[Histogram: frequency (vertical axis, 0 to 10) against the true class boundaries (TCB) 10.05, 22.65, 35.25, 47.85, 60.45, 73.05 and 85.65]

Figure 1. Monthly gross family income (in thousand pesos) of the 30 BMSCA members.

Textual presentation: (Sample) The monthly gross family income of the 30 BMSCA members ranges from 10.1 to 73.1 thousand pesos. More than half of them have an income of at most 35,250 pesos. Only three of them, or 10%, have a monthly family income of at least 60,450 pesos.

Which of the three methods of data presentation do you think is most appropriate to use for the variable chosen in Number 1? Justify your answer.

(Sample)
Textual presentation: It is most appropriate to use a textual presentation since the highlights of the family income of the BMSCA members can be presented.
Tabular presentation: It is most appropriate to use a tabular presentation since a lot of the numerical information can be presented and trends in the monthly income of the members can be seen.
Graphical presentation: A graphical presentation is most appropriate so that trends in the monthly income of the BMSCA members are easily visible.

2. For the qualitative variable: gender

Figure 2. Distribution of the 30 BMSCA members by gender.

Brief Description: The majority of the 30 BMSCA members who joined the Lakbay-Aral are males. Only 43% are females.

For the qualitative variable: whether member is receiving monthly pension or not

Figure 2. Distribution of the 30 BMSCA members as to whether they are receiving monthly pension or not.

Brief Description: More than half of the 30 BMSCA members receive monthly pension. Forty percent are not receiving monthly pension.


CHAPTER 1: EXPLORING DATA
Lesson 6: Measures of Central Tendency
TIME FRAME: 60 minutes

OVERVIEW OF LESSON
The lesson begins with students engaging in a review of some measures of central tendency by considering a numerical example. Students are also asked to examine both strengths and limitations of these measures. Assessments will be given to students on their ability to calculate these measures, and also to get an overall sense of whether they recognize how these measures respond to changes in data values.

LEARNING OUTCOME(S): At the end of the lesson, the learner is able to
• Calculate commonly used measures of central tendency,
• Provide a sound interpretation of these summary measures, and
• Discuss the properties of these measures.

LESSON OUTLINE:
1. Motivation
2. Common Measures of Central Tendency: Mean, Median and Mode
3. Properties of the Mean, Median and Mode

REFERENCES
Albert, J. R. G. (2008). Basic Statistics for the Tertiary Level (ed. Roberto Padua, Welfredo Patungan, Nelia Marquez), published by Rex Bookstore.
“Deciding Which Measure of Center to Use” http://www.sharemylesson.com/teaching-resource/deciding-which-measure-ofcenter-to-use-50013703/
Handbook of Statistics 1 (1st and 2nd Edition), Authored by the Faculty of the Institute of Statistics, UP Los Baños, College Laguna 4031
Workbooks in Statistics 1 (From 1st to 13th Edition), Authored by the Faculty of the Institute of Statistics, UP Los Baños, College Laguna 4031


DEVELOPMENT OF THE LESSON

A. Motivation
Present to the students the following frequency distribution table of the monthly income of 35 families residing in a nearby barangay/village.

Monthly Family Income in Pesos    Number of Families
12,000                            2
20,000                            3
24,000                            4
25,000                            8
32,250                            9
36,000                            5
40,000                            2
60,000                            2

You may ask the students the following questions to pique their interest and, at the same time, introduce some summary statistics.

1. What is the highest monthly family income? Lowest?
Answer: The highest monthly family income is 60,000 pesos while the lowest is 12,000 pesos. You may emphasize that the highest and lowest values, commonly known as the maximum and minimum, respectively, are summary measures of a data set. They represent important location values in the distribution of the data. However, these measures do not give a measure of location in the center of the distribution.

2. What monthly family income is most frequent in the village?
Answer: The most frequent monthly family income is 32,250 pesos. The value 32,250 occurs most often, that is, it is the value with the highest frequency. This is called the modal value or simply the mode. In this data set, the value 32,250 is found in the center of the distribution.

3. If you list the individual values of the monthly family income from lowest to highest, what is the monthly family income such that half of the families have a monthly family income less than or equal to that value while the other half have a monthly family income greater than that value?
Answer: When the data are arranged in increasing order, that is, in an array, we have:
12,000; 12,000; 20,000; 20,000; 20,000; 24,000; 24,000; 24,000; 24,000; 25,000; 25,000; 25,000; 25,000; 25,000; 25,000; 25,000; 25,000; 32,250; 32,250; 32,250; 32,250; 32,250; 32,250; 32,250; 32,250; 32,250; 36,000; 36,000; 36,000; 36,000; 36,000; 40,000; 40,000; 60,000; 60,000.
There are 17 values that are less than the middle value and another 17 values that are greater than or equal to the middle value. That middle value is the 18th observation and it is equal to 32,250 pesos. The middle value is called the median and is found in the center of the distribution.

4. What is the average monthly family income?
Answer: When computed using the data values, the average is 30,007.14 pesos. The average monthly family income is commonly referred to as the arithmetic mean or simply the mean. It is computed by adding all the values and then dividing the sum by the number of values included in the sum. The average value is also found somewhere in the center of the distribution.

Let us now summarize what we have learned from our illustration and introduce the three common measures of central tendency.

B. Common Measures of Central Tendency: Mean, Median and Mode
Inform students that the most widely used measure of the center is the (arithmetic) mean. It is computed as the sum of all observations in the data set divided by the number of observations included in the sum. If we use the summation symbol $\sum_{i=1}^{N} x_i$, read as ‘the sum of the observations represented by $x_i$, where $i$ takes the values from 1 to $N$, and $N$ refers to the total number of observations being added’, we could compute the mean (usually denoted by the Greek letter $\mu$) as

$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$.

Using the example earlier with 35 observations of family income, the mean is computed as

$\mu = \frac{12,000 + 12,000 + \cdots + 60,000}{35} = \frac{1,050,250}{35} = 30,007.14$
Alternatively, we could do the computation as follows:

Monthly Family Income in Pesos (x_i)    Number of Families (f_i)    x_i × f_i
12,000                                  2                           12,000 × 2 = 24,000
20,000                                  3                           20,000 × 3 = 60,000
24,000                                  4                           24,000 × 4 = 96,000
25,000                                  8                           25,000 × 8 = 200,000
32,250                                  9                           32,250 × 9 = 290,250
36,000                                  5                           36,000 × 5 = 180,000
40,000                                  2                           40,000 × 2 = 80,000
60,000                                  2                           60,000 × 2 = 120,000
                                        Sum = 35                    Sum = 1,050,250
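The frequency-table computation of the mean can also be sketched in a few lines of Python. This is only an illustrative sketch: the names `income_table` and `mean_from_frequencies` are made up for this example and are not part of the lesson materials.

```python
# Frequency table from the example:
# (monthly family income in pesos, number of families)
income_table = [
    (12_000, 2), (20_000, 3), (24_000, 4), (25_000, 8),
    (32_250, 9), (36_000, 5), (40_000, 2), (60_000, 2),
]


def mean_from_frequencies(table):
    """Return the mean of a data set given as (value, frequency) pairs."""
    total_count = sum(f for _, f in table)        # N = sum of the f_i
    weighted_sum = sum(x * f for x, f in table)   # sum of x_i * f_i
    return weighted_sum / total_count


mean_income = mean_from_frequencies(income_table)
print(round(mean_income, 2))  # prints 30007.14
```

Multiplying each value by its frequency gives the same total as adding the 35 individual observations one by one, which is why both approaches lead to the same mean.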

For a large number of observations, it is advisable to use a computing tool such as a calculator or computer software, e.g. a spreadsheet application like Microsoft Excel®.

The median, on the other hand, is the middle value in an array of observations. To determine the median of a data set, the observations must first be arranged in increasing or decreasing order. Then locate the middle value so that half of the observations are less than or equal to that value while the other half are greater than or equal to it. If N (the total number of observations in a data set) is odd, the median or the middle value is the $\left(\frac{N+1}{2}\right)$th observation in the array. On the other hand, if N is even, then the median or the middle value is the average of the two middle values, that is, the average of the $\left(\frac{N}{2}\right)$th and $\left(\frac{N}{2}+1\right)$th observations. In the example given earlier, there are 35 observations so N is 35, an odd number. The median is then the $\frac{N+1}{2} = \frac{35+1}{2} = 18$th observation in the array. Locating the 18th observation in the array leads us to the value equal to 32,250 pesos.

The mode or the modal value is the value that occurs most often, that is, the value that has the highest frequency. In other words, the mode is the most fashionable value in the data set. In the example above, the value of 32,250 pesos occurs most often: it is the value with the highest frequency, which is equal to nine.
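As a rough sketch of the rules just described, the median and the mode of the family income data can be located in Python as follows. The helper names `median_of_sorted` and `modes_from_frequencies` are hypothetical and chosen only for this illustration.

```python
# Frequency table from the example:
# (monthly family income in pesos, number of families)
income_table = [
    (12_000, 2), (20_000, 3), (24_000, 4), (25_000, 8),
    (32_250, 9), (36_000, 5), (40_000, 2), (60_000, 2),
]

# Arrange the individual observations in increasing order (the "array").
array = sorted(x for x, f in income_table for _ in range(f))


def median_of_sorted(values):
    """Median of a list that is already sorted in increasing order."""
    n = len(values)
    if n % 2 == 1:
        # N odd: the ((N + 1) / 2)th observation (lists are 0-based).
        return values[(n + 1) // 2 - 1]
    # N even: average of the (N / 2)th and (N / 2 + 1)th observations.
    return (values[n // 2 - 1] + values[n // 2]) / 2


def modes_from_frequencies(table):
    """All values that share the highest frequency."""
    highest = max(f for _, f in table)
    return [x for x, f in table if f == highest]


median_income = median_of_sorted(array)
modal_incomes = modes_from_frequencies(income_table)
print(median_income, modal_incomes)  # prints 32250 [32250]
```

The mode is returned as a list because a data set may have more than one value with the highest frequency.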


C. Properties of the Mean, Median and Mode
Each of these three measures has its own properties. Most of the time we use these properties as the basis for determining which measure to use to represent the center of the distribution.

As mentioned before, the mean is the most commonly used measure of central tendency. It can be likened to a “center of gravity”: if the values in an array were put on a beam balance, the mean would act as the balancing point where the smaller observations “balance” the larger ones, as seen in the following illustration.

[Illustration: a beam balance carrying the incomes 12,000; 20,000; 24,000; 25,000; 32,250; 36,000; 40,000; and 60,000, with the mean as the balancing point.]

Note that the frequency, represented by the size of the rectangle, serves as the ‘weight’ in this beam balance. To illustrate this property further, we could ask the students to subtract the mean from each observation (the difference is denoted as d_i) and then sum all the differences. The computation can also be done alternatively as shown in the following table, where the products d_i × f_i are computed from the unrounded differences and then rounded off.

Monthly Family Income    Number of Families    d_i = x_i - µ (rounded off)        d_i × f_i
in Pesos (x_i)           (f_i)
12,000                   2                     12,000 - 30,007.14 = -18,007       -18,007 × 2 = -36,014
20,000                   3                     20,000 - 30,007.14 = -10,007       -10,007 × 3 = -30,021
24,000                   4                     24,000 - 30,007.14 = -6,007        -6,007 × 4 = -24,029
25,000                   8                     25,000 - 30,007.14 = -5,007        -5,007 × 8 = -40,057
32,250                   9                     32,250 - 30,007.14 = 2,243         2,243 × 9 = 20,186
36,000                   5                     36,000 - 30,007.14 = 5,993         5,993 × 5 = 29,964
40,000                   2                     40,000 - 30,007.14 = 9,993         9,993 × 2 = 19,986
60,000                   2                     60,000 - 30,007.14 = 29,993        29,993 × 2 = 59,986
                         Sum = 35                                                 Sum = 0

The sum of the differences across all observations will be equal to zero. This indicates that the mean is indeed the center of the distribution, since the negative and positive deviations cancel out and the sum is equal to zero.
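The zero-sum property of the deviations can be checked exactly with a short Python sketch. Exact fractions are used here (an implementation choice for this illustration, not something the lesson prescribes) so that rounding the mean to 30,007.14 does not disturb the result.

```python
from fractions import Fraction

# Frequency table from the example:
# (monthly family income in pesos, number of families)
income_table = [
    (12_000, 2), (20_000, 3), (24_000, 4), (25_000, 8),
    (32_250, 9), (36_000, 5), (40_000, 2), (60_000, 2),
]

n = sum(f for _, f in income_table)
# Exact mean, 1,050,250 / 35, kept as a fraction to avoid rounding.
mean = Fraction(sum(x * f for x, f in income_table), n)

# d_i * f_i for each distinct income value.
weighted_deviations = [(x - mean) * f for x, f in income_table]
total_deviation = sum(weighted_deviations)
print(total_deviation)  # prints 0
```

When the rounded deviations in a hand-computed table are added instead, the total may differ from zero by a unit or so; that small gap comes from rounding, not from the property itself.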


In the expression for the mean given above, we could see that each observation contributes to the value of the mean. All the data contribute equally in its calculation. That is, the “weight” of each of the data items in the array is the reciprocal of the total number of observations in the data set, i.e. $\frac{1}{N}$.

Means are also amenable to further computation, that is, you can combine subgroup means to come up with the mean for all observations. For example, if there are 3 groups with means equal to 10, 5 and 7 computed from 5, 15, and 10 observations respectively, one can compute the mean for all 30 observations as follows:

$\mu = \frac{n_1 \mu_1 + n_2 \mu_2 + n_3 \mu_3}{N} = \frac{(10 \times 5) + (5 \times 15) + (7 \times 10)}{30} = \frac{195}{30} = 6.5$
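The combination of subgroup means can be sketched as a small Python function. The name `combined_mean` is hypothetical; the inputs are the three group means and group sizes from the example.

```python
def combined_mean(group_means, group_sizes):
    """Overall mean from subgroup means, weighting each by its group size."""
    total_size = sum(group_sizes)
    weighted_sum = sum(m * n for m, n in zip(group_means, group_sizes))
    return weighted_sum / total_size


# Three groups with means 10, 5 and 7 from 5, 15 and 10 observations.
overall_mean = combined_mean([10, 5, 7], [5, 15, 10])
print(overall_mean)  # prints 6.5
```

Note that simply averaging the three means, (10 + 5 + 7) / 3, would give about 7.33, which ignores the different group sizes.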

If there are extremely large values, the mean will tend to be ‘pulled upward’, while if there are extremely small values, the mean will tend to be ‘pulled downward’. The extreme low or high values are referred to as ‘outliers’. Thus, outliers do affect the value of the mean. To illustrate this property, we could tell the students that if the two families with the highest income each had a very high monthly income of 600,000 pesos instead of 60,000 pesos, the computed value of the mean would be pulled upward, that is,

$\mu = \frac{12,000 + 12,000 + \cdots + 600,000}{35} = \frac{2,130,250}{35} = 60,864.29$

Thus, in the presence of extreme values or outliers, the mean is not a good measure of the center. An alternative measure is the median. The mean is also computed only for quantitative variables that are measured at least in the interval scale.
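The pull of extreme values on the mean can be seen in a short Python sketch, with the median computed alongside for comparison. It assumes that both observations recorded as 60,000 pesos are replaced by 600,000 pesos, which is the case that corresponds to the total of 2,130,250 used in the computation.

```python
import statistics

# Frequency table from the example:
# (monthly family income in pesos, number of families)
income_table = [
    (12_000, 2), (20_000, 3), (24_000, 4), (25_000, 8),
    (32_250, 9), (36_000, 5), (40_000, 2), (60_000, 2),
]

original = [x for x, f in income_table for _ in range(f)]
# Assumption: every 60,000 entry becomes 600,000 (two observations).
with_outliers = [600_000 if x == 60_000 else x for x in original]

mean_before = sum(original) / len(original)
mean_after = sum(with_outliers) / len(with_outliers)
median_before = statistics.median(original)
median_after = statistics.median(with_outliers)

print(round(mean_before, 2), round(mean_after, 2))  # prints 30007.14 60864.29
print(median_before, median_after)                  # prints 32250 32250
```

The mean roughly doubles while the middle observation of the ordered data does not move.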

Like the mean, the median is computed for quantitative variables. Unlike the mean, however, the median can also be computed for variables measured in at least the ordinal scale. Another property of the median is that it is not easily affected by extreme values or outliers. In the example above, with a monthly family income of 600,000 pesos as an extreme value, the median remains the same, equal to 32,250 pesos. For variables in the ordinal scale, the median should be used in determining the center of the distribution.

On the other hand, the mode is usually computed for data sets that are mainly measured in the nominal scale of measurement. It is also sometimes referred to as the nominal average. In a given data set, the mode can easily be picked out by ocular inspection, especially if the data are not too many. In some data sets, the mode may not be unique. The data set is said to be unimodal if there is a unique mode, bimodal if there are two modes, and multimodal if there are more than two modes. For continuous data, the mode is not very useful since measurements (to the most precise significant digit) would theoretically occur only once. The mode is a more helpful measure for discrete data and for qualitative data with numeric codes than for other types of data. In fact, in the case of qualitative data with numeric codes, the mean and median are not meaningful.

The following diagram provides a guide in choosing the most appropriate measure of central tendency to use in order to pinpoint or locate the center or the middle of the distribution of the data set. Such a measure, being the center of the distribution, ‘typically’ represents the data set as a whole. Thus, it is crucial to use the appropriate measure of central tendency.

KEY POINTS
• A measure of central tendency is a location measure that pinpoints the center or middle value.
• The three common measures of central tendency are the mean, median and mode.
• Each measure has its own properties that serve as basis in determining when to use it appropriately.

ASSESSMENT
Note: Answers are provided inside the parentheses and italicized.

1. Thirty people were asked the question, “How many people do you consider your best friend?” The graph below shows their responses.

[Bar graph: Frequency (vertical axis, 0 to 12) against Number of Best Friends (horizontal axis, 1 to 8).]

What measure of central tendency would you use to find the center for the number of best friends people have? Explain your answer. (Since an outlier is present, one can use the median, which is numerically equal to 3.)

2. The mean age of 10 full time guidance counselors is 35 years old. Two new full time guidance counselors, aged 28 and 30, are hired. Five years from now, what would be the average age of these twelve guidance counselors? (The sum of ages is 350 for the 10 counselors. With the two newly hired counselors, the sum is now 408, yielding a current mean of 34 years. Five years from now, the mean will go up to 39 years for the 12 guidance counselors.)

3. Houses in a certain area in a big city have a mean price of PhP4,000,000 but the median price is only PhP2,500,000. How might you best explain this? (There is an outlier (an extremely expensive house) in the prices of the houses.)

4. Five persons were asked about the usual number of hours they spent watching television in a week. Their responses are: 5, 7, 3, 38, and 7 hours.
a. Obtain the mean, median and mode. (The mean is 12; median is 7, mode is 7.)
b. If another person were to be asked the same question and he/she responded 200 hours, how would this affect the mean, median and mode? (Median and mode unchanged; mean increases to 43.3)

5. For the senior high school dance, there is a debate going on among students regarding the color that will be featured prominently. Votes were sent by students via SMS, and the results are as follows:

Color     No. of Votes Received
Red       300
Green     550
Orange    70
White     130
Yellow    220
Blue      710
Brown     35
Purple    5

a. Is there a clear winner on the choice of color? (Yes)
b. Compute for the mean, median and modal color (if possible). (We cannot compute for the mean and median. But the modal color is said to be blue.)
c. Why is it that we could or could not find each measure of the central tendency? (We cannot compute for the mean and median since color is a qualitative variable and is measured at the nominal level)
d. Which measure of central tendency will determine the color to be prominently used during the senior high school dance? (mode)

6. Everyone studied very hard for the quiz in the Statistics and Probability Course. There were 10 questions in the quiz, and the scores are distributed as follows:

Score    Number of Students
10       8
9        12
8        6
7        5
6        3
5        2
4        0
3        1
2        1
1        0
0        2

a. Compute for the mean, median, and mode for this set of data. (The computation could be done as follows:

Score (x_i)    Number of Students (f_i)    x_i × f_i    Less Than Cumulative Frequency (< CF)
10             8                           80           40
9              12                          108          32
8              6                           48           20
7              5                           35           14
6              3                           18           9
5              2                           10           6
4              0                           0            4
3              1                           3            4
2              1                           2            3
1              0                           0            2
0              2                           0            2
               Sum = 40                    Sum = 304

Mean = $\mu = \frac{304}{40} = 7.6$;
Median is the average of the 20th and 21st observations = $\frac{8 + 9}{2} = 8.5$. Note that the 20th observation is 8 while the 21st observation is 9 based on the less than cumulative frequency.
Mode = 9 since that is the score with the highest frequency, equal to 12.)

c. Suppose the teacher said “Everyone in the class will be getting either the mean, median, or mode for their official score.”
i. What would students want to receive (mean, median, or mode)? (Mode)
ii. Which would students want to receive the least (mean, median or mode)? (Mean)
iii. What would be the fairest score to receive? Ask students to explain their answers. (Note: There is no right or wrong answer for this question. It all depends on the reasoning of the students)


CHAPTER 1: EXPLORING DATA
Lesson 7: Other Measures of Location
TIME FRAME: 60 minutes

OVERVIEW OF LESSON
In the previous lesson we discussed a measure of location known as the measure of central tendency. There are other measures of location which are useful in describing the distribution of the data set. These measures of location include the maximum, minimum, percentiles, deciles and quartiles. How to compute and interpret these measures are also discussed in this lesson.

LEARNING OUTCOME(S): At the end of the lesson, the learner is able to
• Calculate measures of location other than the measure of central tendency, and
• Provide a sound interpretation of these summary measures.

LESSON OUTLINE:
1. Motivation
2. Measures of Location: Maximum, Minimum, Percentiles, Deciles and Quartiles

REFERENCES
Albert, J. R. G. (2008). Basic Statistics for the Tertiary Level (ed. Roberto Padua, Welfredo Patungan, Nelia Marquez), published by Rex Bookstore.
“Deciding Which Measure of Center to Use” http://www.sharemylesson.com/teaching-resource/deciding-which-measure-ofcenter-to-use-50013703/
Handbook of Statistics 1 (1st and 2nd Edition), Authored by the Faculty of the Institute of Statistics, UP Los Baños, College Laguna 4031
Moore, D.S. (2007). The Basic Practice of Statistics, Fourth Edition. W.H. Freeman and Company.
Workbooks in Statistics 1 (From 1st to 13th Edition), Authored by the Faculty of the Institute of Statistics, UP Los Baños, College Laguna 4031


DEVELOPMENT OF THE LESSON

A. Motivation
In the previous lesson, we asked the students to identify the highest and lowest family income, and emphasized that the highest and the lowest values, commonly known as the maximum and minimum, respectively, are important summary measures of a data set. They represent important location values in the distribution of the data. However, these measures do not give a measure of location in the center of the distribution. Instead, these two location measures give extreme locations or points in a distribution.

For example, after a long test or examination, we are interested in the highest and lowest scores and, of course, in who got these scores. These are in addition to knowing the average, median and modal scores. These measures tell us how the students performed in the long test. Knowing these measures, we could take further action, such as rewarding the student(s) who got the highest score and assisting the student(s) who got the lowest score. In addition, these measures indicate whether the long test was difficult or easy, and they may also indicate the students' level of understanding of the concepts covered in the test.

To motivate the students, present the following distribution of scores in a 50-item long test of 150 Grade 11 students of a nearby Senior High School and ask them to respond to some questions.

Score in a Long Test    Number of Students
10                      4
16                      5
18                      5
20                      15
25                      19
30                      22
33                      18
38                      28
40                      10
42                      7
45                      8
50                      9


1. What is the highest score? Lowest score?
Answer: The highest score is 50 while the lowest is 10.
2. What is the most frequent score?
Answer: The most frequent score is 38, which is the score of 28 students.
3. What is the median score?
Answer: The median score is 33, which implies that 50% of the students, or around 75 students, have a score of at most 33.
4. What is the average or mean score?
Answer: On the average, the students got 32.04667 or 32 (rounded off) out of 50 items correctly.

You could ask more questions like:
1. What is the score such that 75% of the 150 students scored less than or equal to it?
2. Do you think the long test is easy since 75 students have scores of at most 33 out of 50?
3. Do you need to be alarmed when 10% of the class got a score of at most 20 out of 50?
These questions could be answered by knowing other measures of location.

B. Measures of Location: Maximum, Minimum, Percentiles, Deciles and Quartiles
We formally define the maximum as a measure of location that pinpoints the highest value in the data distribution, while the minimum locates the lowest value. There are other measures of location that are becoming common because of their constant use in reporting rank in a distribution of scores, such as the percentile rank in college entrance examinations. These measures are referred to as percentiles, deciles, and quartiles.

Percentiles are measures that pinpoint locations dividing the distribution into 100 equal parts. The jth percentile is usually represented by Pj, the value which separates the bottom j% of the distribution from the top (100-j)%. For example, P30 is the value that separates the bottom 30% of the distribution from the top 70%. Thus we say that 30% of the total number of observations in the data set are less than or equal to P30 while the remaining 70% have values greater than P30.

The following steps for finding the jth percentile (Pj) are lifted from the workbook cited as a reference at the end of this Teaching Guide:
Step 1: Arrange the data values in ascending order of magnitude.
Step 2: Find the location of Pj in the arranged list by computing $L = \left(\frac{j}{100}\right) \times N$, where N is the total number of observations in the data set.
Step 3:
a. If L is a whole number, then Pj is the mean or average of the values in the Lth and (L+1)th positions.
b. If L is not a whole number, then Pj is the value in the next higher position.

To illustrate, we use the data on long test scores of 150 Grade 11 students of a nearby Senior High School. An additional column on the less than cumulative frequency was included to facilitate the computation.

Score in a Long Test    Number of Students    < CF
10                      4                     4
16                      5                     9
18                      5                     14
20                      15                    29
25                      19                    48
30                      22                    70
33                      18                    88
38                      28                    116
40                      10                    126
42                      7                     133
45                      8                     141
50                      9                     150

To find P30 we note that j = 30. Since the observations are tabulated in increasing order, we could proceed to Step 2, which asks us to compute L as $L = \frac{j}{100} \times N = \frac{30}{100} \times 150 = 45$. The computed L, which is equal to 45, is a whole number and thus we follow the first rule in Step 3, which states that Pj is the average or mean of the values found in the Lth and (L+1)th positions. Thus, we take the average of the 45th and 46th observations, which are both equal to 25. We then say that the bottom 30% of the scores are less than or equal to 25 while the top 70% of the observations (which is around 105) are greater than 25.

Deciles and quartiles are then defined in relation to percentiles. If percentiles divide the distribution into 100 equal parts, deciles divide the distribution into 10 equal parts while quartiles divide the distribution into 4 equal parts. Thus, we say that the 10th Percentile is the same as the 1st Decile, the 20th Percentile the same as the 2nd Decile, the 25th Percentile the same as the 1st Quartile, the 50th Percentile the same as the 5th Decile or 2nd Quartile, and so forth. Note also that by the definition of the median in the previous lesson, we could say that the median value is equal to the 50th Percentile or 5th Decile or 2nd Quartile.

Because of this relationship, the computation of the quartiles and deciles could be coursed through the computation of the percentiles. To illustrate, if we want to compute the 3rd Decile or D3, then we compute the 30th Percentile or P30. In other words, D3 = P30 = 25 based on our earlier computation.

The 3rd Quartile or Q3 is equal to P75. We compute L as $L = \frac{j}{100} \times N = \frac{75}{100} \times 150 = 112.5$. The computed L, which is equal to 112.5, is not a whole number and thus we follow the second rule in Step 3, which states that Pj is the value found in the next higher position, specifically the 113th position, the next higher position after 112.5. Thus, we take the 113th observation, which is equal to 38, as the value of P75. We then say that 75% of the class of 150 students, or around 113 students, correctly answered at most 38 out of the 50 items.

The median, which is equal to P50, is computed as the mean or average of the 75th and 76th observations, which are both equal to 33. Hence, we get the same value as the one we obtained using the definition we had in the previous lesson.

KEY POINTS
• There are other measures of location that could further describe the distribution of the data set.
• The maximum and minimum values are measures of location that pinpoint the extreme values, which are the highest and lowest values, respectively.
• Percentiles, quartiles and deciles are measures of location that divide the distribution into 100, 4 and 10 equal parts, respectively.
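The three-step rule for locating the jth percentile, together with the values of P30, P75 and P50 worked out in this lesson, can be checked with a short Python sketch. The function name `percentile` and its restriction to whole-number j between 1 and 99 are assumptions made for this illustration only; statistical software may use other percentile rules and give slightly different answers.

```python
# Long test scores of the 150 Grade 11 students: (score, number of students)
score_table = [
    (10, 4), (16, 5), (18, 5), (20, 15), (25, 19), (30, 22),
    (33, 18), (38, 28), (40, 10), (42, 7), (45, 8), (50, 9),
]


def percentile(table, j):
    """jth percentile of (value, frequency) data using the three-step rule."""
    if not (isinstance(j, int) and 0 < j < 100):
        raise ValueError("j must be a whole number from 1 to 99")
    # Step 1: arrange the individual observations in ascending order.
    values = sorted(x for x, f in table for _ in range(f))
    n = len(values)
    # Step 2: L = (j / 100) * N, done with integers to avoid rounding error.
    if (j * n) % 100 == 0:
        # Step 3a: L is whole, so average the Lth and (L + 1)th values.
        position = j * n // 100
        return (values[position - 1] + values[position]) / 2
    # Step 3b: L is not whole, so take the value in the next higher position.
    position = j * n // 100 + 1
    return values[position - 1]


p30 = percentile(score_table, 30)   # L = 45, average of 45th and 46th
p75 = percentile(score_table, 75)   # L = 112.5, take the 113th
p50 = percentile(score_table, 50)   # L = 75, average of 75th and 76th
print(p30, p75, p50)  # prints 25.0 38 33.0
```

Deciles and quartiles follow from the same function, for example the 3rd decile as `percentile(score_table, 30)` and the 1st quartile as `percentile(score_table, 25)`.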


ASSESSMENT
Note: Answers are provided inside the parentheses and in bold face.

1. A businesswoman is planning to have a restaurant in the university belt. She wants to study the weekly food allowance of the students in order to plan her pricing strategy for the different menus she is going to offer. She asked 213 students and gathered the following data:

Weekly Food Allowance    Frequency    Weekly Food Allowance    Frequency
50                       5            550                      3
100                      3            600                      18
150                      6            700                      22
170                      1            750                      8
200                      8            800                      16
250                      5            900                      11
300                      5            1000                     27
350                      5            1200                     2
400                      6            1500                     3
450                      11           1700                     1
500                      46           2000                     1

a. Determine the weekly food allowance value such that 60% of the students have at most that amount. (The statistic we want is P60. We compute L as $L = \frac{j}{100} \times N = \frac{60}{100} \times 213 = 127.8$, which is not a whole number, so we move to the next higher position, 128. We then take the 128th observation, which is equal to 700. Thus we say that 60% of the students have at most 700 pesos as their weekly food allowance.)

b. What percentage of the students have a weekly food allowance that is at most 170 pesos? (Here we are looking for the value of j. It is given that Pj = 170 is the 15th observation in the array of 213 values. Thus, 15 is the value of L and using this we compute the value of j as $j = \frac{L}{N} \times 100 = \frac{15}{213} \times 100 \approx 7$. Therefore we say that 7% of the students have a weekly food allowance of at most 170 pesos.)

c. If the businesswoman wants at least 50% of the students to be able to afford to eat in her restaurant, what should be the minimum total cost of the meals that a student could have in a week? (The statistic we want is the median or P50. We compute L as $L = \frac{j}{100} \times N = \frac{50}{100} \times 213 = 106.5$, which is not a whole number, so we move to the next higher position, 107. We then take the 107th observation, which is equal to 600. Thus we say that at least 50% of the students could afford to eat in the restaurant if the minimum total cost of the meals that a student could have in a week is 600 pesos.)


CHAPTER 1: EXPLORING DATA
Lesson 8: Measures of Variation
TIME FRAME: 60 minutes

OVERVIEW OF LESSON
In this lesson, students will be shown that it is not enough to get measures of central tendency in a data set by scrutinizing two different data sets with the same measures of central tendency. We illustrate this using data on the returns on stocks where not only the mean, median and mode are the same, but also other measures of location such as the minimum and maximum. However, the spread of the observations is different, which means that to further describe the data sets we need additional measures of the dispersion of the data, i.e. the range, interquartile range, variance, standard deviation, and coefficient of variation. Also, the standard deviation, as a measure of dispersion, can be viewed as a measure of risk, specifically in the case of making investments in the stock market. The smaller the value of the standard deviation, the smaller the risk.

LEARNING OUTCOMES: At the end of the lesson, the learner is able to
• Calculate some measures of dispersion;
• Think of the strengths and limitations of these measures; and
• Provide a sound interpretation of these measures.

LESSON OUTLINE:
1. Introduction: The Case of the Returns on Stocks
2. Absolute Measures of Dispersion: Range, Interquartile Range, Variance, and Standard Deviation
3. Relative Measure of Dispersion: Coefficient of Variation

REFERENCES
Albert, J. R. G. (2008). Basic Statistics for the Tertiary Level (ed. Roberto Padua, Welfredo Patungan, Nelia Marquez), published by Rex Bookstore.
Bryant-Smith (2009): Practical Data Analysis, Second Edition. McGraw-Hill/Irvine, USA.
Handbook of Statistics 1 (1st and 2nd Edition), Authored by the Faculty of the Institute of Statistics, UP Los Baños, College Laguna 4031
Moore, D.S. (2007). The Basic Practice of Statistics, Fourth Edition. W.H. Freeman and Company.
“Range as a Measure of Variation” http://www.sharemylesson.com/teachingresource/range-as-a-measure-of-variation-50009362
Workbooks in Statistics 1 (From 1st to 13th Edition), Authored by the Faculty of the Institute of Statistics, UP Los Baños, College Laguna 4031


DEVELOPMENT OF THE LESSON

A. Introduction: The Case of the Returns on Stocks
To introduce this lesson, tell the students the importance of thinking about their future, of saving, and of wealth generation. Explain that a number of people invest money in the stock market as an alternative financial instrument to generate wealth from savings.

Explanatory Note: Stocks are shares of ownership in a company. When people buy stocks they become part owners of the company and share in both its profits and its losses.

Mention to students that the history of performance of a particular stock may be a useful guide to what may be expected of its performance in the foreseeable future. This is, of course, a very big assumption, but we have to assume it anyway.

Provide the following data to students representing the rates of return for two stocks, which we will call Stock A and Stock B.

Year    Stock A    Stock B
2005    0.081      0.214
2006    0.231      0.193
2007    0.214      0.132
2008    0.214      0.073
2009    0.181      0.066
2010    0.241      0.081
2011    0.193      0.181
2012    0.133      0.230
2013    0.071      0.214
2014    0.066      0.241

Inform students that the rate of return is defined as the increase in value of the portfolio (including any dividends or other distributions) during the year divided by its value at the beginning of the year. For instance, if the parents of Juana dela Cruz invest 50,000 pesos in a stock at the beginning of the year, and the value of the stock goes up to 60,000 pesos, an increase in value of 10,000 pesos, then the rate of return is 10,000/50,000 = 0.20.

Explain to students that the rate of return may be positive or negative. It represents the fraction by which your wealth would have changed had it been invested in that particular combination of securities.

Now, let us compute some measures of location that we learned in previous lessons to describe the data given above. You could ask the students to do this as a sort of assessment of what they have already learned. It could be done by recitation or through a quiz. Below is a summary of the computed values as well as a graphical presentation of the rates of return of Stocks A and B.

           Maximum    Minimum    Mean      Median    Mode
Stock A    0.241      0.066      0.1625    0.187     0.214
Stock B    0.241      0.066      0.1625    0.187     0.214

[Line graph: rates of return (vertical axis, 0 to 0.3) of Stock A and Stock B by year (horizontal axis, 2005 to 2014).]
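The summary values reported for the two stocks can be reproduced with a short Python sketch using the standard `statistics` module. The function name `summarize` is hypothetical, and the mean and median are rounded to four decimal places only to avoid floating-point noise.

```python
import statistics

# Rates of return for 2005 to 2014, as given in the data table.
stock_a = [0.081, 0.231, 0.214, 0.214, 0.181, 0.241, 0.193, 0.133, 0.071, 0.066]
stock_b = [0.214, 0.193, 0.132, 0.073, 0.066, 0.081, 0.181, 0.230, 0.214, 0.241]


def summarize(returns):
    """Measures of location for a list of yearly rates of return."""
    return {
        "maximum": max(returns),
        "minimum": min(returns),
        "mean": round(statistics.mean(returns), 4),
        "median": round(statistics.median(returns), 4),
        "mode": statistics.mode(returns),
    }


summary_a = summarize(stock_a)
summary_b = summarize(stock_b)
print(summary_a)
print(summary_b)
print(summary_a == summary_b)  # prints True
```

Both stocks produce the same five summary values even though the yearly figures arrive in a different order and a few of them differ slightly, so these measures alone cannot tell the two series apart.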

Notice that there are no differences in the computed summary statistics, but the trend and actual values of the rates of return for the two stocks are different, as depicted in the line graph. This observation tells us that it is not enough to simply use measures of location to describe a data set. We need additional measures, such as measures of variation or dispersion, to describe the data sets further.

In particular, summary measures of variability (such as the range and the standard deviation) of the rates of return are used to measure the risk associated with an investment. We could use measures of variation to decide whether it would make any difference if we invest wholly in Stock A, wholly in Stock B, or half of our investment in Stock A and the other half in Stock B. In general, there is higher risk in investing if the rate of return fluctuates much, that is, if there is high variability in its historical values. Thus, we choose the investment whose rate of return has a small measure of dispersion.

There are two types of measures of variability or dispersion. One type is the absolute measure, which includes the range, interquartile range, variance, and standard deviation. An absolute measure of dispersion provides a measure of the variability of observations or values within a data set. The other type, the relative measure of dispersion, is used to compare the variability of data sets of different variables or of variables measured in different units of measurement. The coefficient of variation is a relative measure of variability.

B. Absolute Measures of Dispersion: Range, Interquartile Range, Variance, and Standard Deviation

The range is a simple measure of variation defined as the difference between the maximum and minimum values. The range depends on the extremes; it ignores information about what goes on between the smallest (minimum) and largest (maximum) values in a data set. The larger the range, the larger is the dispersion of the data set. We already encountered the range in a previous lesson where we discussed the construction of an FDT. Using the data on the scores of 150 Grade 11 students of a nearby Senior High School on a 50-item long test, we could demonstrate the computation of these measures.

Score in a Long Test    Number of Students    < CF
10                      4                     4
16                      5                     9
18                      5                     14
20                      15                    29
25                      19                    48
30                      22                    70
33                      18                    88
38                      28                    116
40                      10                    126
42                      7                     133
45                      8                     141
50                      9                     150

In the above data, the maximum is 50 and the minimum is 10, hence the range is 40. But note that the range is easily affected by extreme values since, as mentioned earlier, it depends only on the two extremes. Because of this property, another measure, the interquartile range or IQR, is used instead.

The interquartile range or IQR is the difference between the 3rd and the 1st quartiles. Hence, it gives you the spread of the middle 50% of the data set. Like the range, the higher the value of the IQR, the larger is the dispersion of the data set. Based on the computations we did in the previous lesson, the 3rd quartile or Q3 is the 113th observation and is equal to 38, while Q1 or P25 is the 38th observation and is equal to 25. Hence, IQR = 38 - 25 = 13.

Recall with the students the property of the mean that when the deviation or difference of each observation from the mean was obtained and summed over all the observations, we got a sum equal to zero. We said that this property shows that the deviations of the observations from the mean cancel out, indicating that the mean is indeed the center of the distribution. What if we square the difference before we get the sum and use it to measure the spread of observations? Doing it in our example, we have the following table:

Score in a Long Test (x_i)    d_i = x_i - µ (rounded off)    d_i^2    Number of Students (f_i)    d_i^2 × f_i
10                            10 - 32 = -22                  484      4                           1936
16                            16 - 32 = -16                  256      5                           1280
18                            18 - 32 = -14                  196      5                           980
20                            20 - 32 = -12                  144      15                          2160
25                            25 - 32 = -7                   49       19                          931
30                            30 - 32 = -2                   4        22                          88
33                            33 - 32 = 1                    1        18                          18
38                            38 - 32 = 6                    36       28                          1008
40                            40 - 32 = 8                    64       10                          640
42                            42 - 32 = 10                   100      7                           700
45                            45 - 32 = 13                   169      8                           1352
50                            50 - 32 = 18                   324      9                           2916
                                                                                                  Sum = 14009

So what we did is, for each unique observation, subtract the mean (we refer to the difference as d_i), square the difference, and sum it over all observations. Note that in the table we have to multiply the square of the difference by the number of students to account for all observations. We then divide the sum by the total number of observations, denoted by N. Summarizing these steps in a formula, we have

$$s^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}.$$

We usually denote this expression as $s^2$ and call it the variance. Thus in this example, $s^2 = 14009/150 = 93.39$. For ease in computation, instead of $\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$, we use an equivalent expression

$$s^2 = \frac{\sum_{i=1}^{N} x_i^2}{N} - \mu^2.$$

When applied to our example, we have

$$s^2 = \frac{\sum_{i=1}^{N} x_i^2}{N} - \mu^2 = \frac{168{,}057}{150} - 32.04667^2 \cong 93.39 \text{ (rounded off)}.$$
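The two variance expressions above can be checked numerically. The following is a minimal sketch in Python that recomputes the variance of the long-test scores from the frequency table, once by the definition and once by the shortcut expression. The function and variable names are illustrative only and are not part of the lesson.

```python
# Frequency table from the lesson: score -> number of students.
freq = {10: 4, 16: 5, 18: 5, 20: 15, 25: 19, 30: 22,
        33: 18, 38: 28, 40: 10, 42: 7, 45: 8, 50: 9}


def mean_from_table(freq):
    """Mean of all observations, weighting each score by its frequency."""
    n = sum(freq.values())
    return sum(x * f for x, f in freq.items()) / n


def variance_by_definition(freq):
    """Average squared deviation from the mean (divisor N)."""
    n = sum(freq.values())
    mu = mean_from_table(freq)
    return sum(f * (x - mu) ** 2 for x, f in freq.items()) / n


def variance_by_shortcut(freq):
    """Mean of the squared scores minus the square of the mean."""
    n = sum(freq.values())
    mu = mean_from_table(freq)
    return sum(f * x ** 2 for x, f in freq.items()) / n - mu ** 2


n_students = sum(freq.values())
var_def = variance_by_definition(freq)
var_short = variance_by_shortcut(freq)
print(n_students, round(mean_from_table(freq), 5))
print(round(var_def, 2), round(var_short, 2))
```

Both functions use the unrounded mean, so the result differs slightly from the table (which rounds the mean to 32) beyond the second decimal place, but it rounds to the same 93.39.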

Variance is a measure of dispersion that accounts for the average squared deviation of each observation from the mean. Since we square the difference of each observation from the mean, the unit of measurement of the variance is the square of the unit used in measuring each observation. This property makes interpretation somewhat problematic. For example, point² or kilogram² is more difficult to interpret than inches². Hence, instead of the variance, the standard deviation is computed, which is the positive square root of the variance, that is, $s = \sqrt{s^2}$. In the example, $s = \sqrt{93.3933} = 9.6640$. To interpret, we say that on the average, the scores of the students deviate from the mean score of 32 points by as much as 9.6640 or approximately 10 points.

If all the observations are equal to a constant, then the mean is that constant, and the measure of variation is zero. Furthermore, if for a given data set the variance and standard deviation turn out to be zero, then all the deviations from the average must be zero, which means that all observations are equal. Note that if a data set were rescaled, that is, if the observations were multiplied by some constant, then the standard deviation of the new data set is merely the scaling factor (in absolute value) multiplied by the standard deviation of the original data set.

The variance and standard deviation are based on all the observations in the data set, and each item is given a proper weight. They are extremely useful measures of variability as they measure the average scattering of the data around the mean, that is, how much the data fluctuate above and below the mean. The variance and standard deviation increase with an increase in the deviations about the mean, and decrease with decreases in these deviations. A small standard deviation (and variance) means a high degree of uniformity in the observations and of homogeneity in a series.

The variance is the most suitable for algebraic manipulations but, as was pointed out earlier, its value is in squared units of measurement. On the other hand, the standard deviation has the same unit of measure as the observations. Thus, the standard deviation serves as the primary measure of variation, just as the mean is the primary measure of central location.

Let us go back to the motivation example on the stocks, wherein we have two stocks, A and B. Both stocks have the same expected return as measured by the mean. However, the standard deviation of the rates of return for Stock A is 0.0688 while that for Stock B is 0.0685, indicating that Stock A has higher risk compared to Stock B, although the difference is not that large.
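The properties of the standard deviation described in this section (zero for constant data, scaling under multiplication, no change under a shift) can be illustrated with a short sketch. The data set below is hypothetical and chosen only for illustration; `statistics.pstdev` from the Python standard library uses the divisor N, as in the variance formula of this lesson.

```python
import math
import statistics

# Hypothetical scores, used only to illustrate the properties.
scores = [10, 16, 18, 20, 25, 30, 33]

sd_original = statistics.pstdev(scores)

# Multiplying every observation by a constant scales the standard deviation.
scaled = [0.3 * x for x in scores]
sd_scaled = statistics.pstdev(scaled)

# Adding a constant to every observation leaves the standard deviation unchanged.
shifted = [x + 5 for x in scores]
sd_shifted = statistics.pstdev(shifted)

# Changing the sign of every observation does not change the standard deviation.
negated = [-x for x in scores]
sd_negated = statistics.pstdev(negated)

# If all observations are equal, there is no variation at all.
sd_constant = statistics.pstdev([140, 140, 140, 140])

print(sd_original, sd_scaled, sd_shifted, sd_negated, sd_constant)
```

These are the same facts that the assessment items at the end of this lesson ask students to reason about.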

C. Relative Measure of Dispersion: Coefficient of Variation

To compare variability between or among different data sets, that is, data sets for different variables or for the same variable measured in different units of measurement, the coefficient of variation (CV) is used as a measure of relative dispersion. It is usually expressed as a percentage and is computed as $CV = \frac{s}{\mu} \times 100\%$. The CV is a measure of dispersion relative to the mean of the data set. With $s$ and $\mu$ having the same unit of measurement, the CV is unitless, that is, it does not depend on the unit of measurement. Hence, it is used to compare the variability across different data sets. As an example, the CV of the scores of the students in the long test is computed as

$$CV = \frac{s}{\mu} \times 100\% = \frac{9.6640}{32.04667} \times 100\% = 30.16\%$$

while the CV of the rates of return of Stock A is

$$CV = \frac{0.0688}{0.1625} \times 100\% = 42.34\%.$$

Thus, we say the rates of return of Stock A are more variable than the scores of the students in the test. Here, we used the CV to compare the variability of two different data sets.
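As a quick check of the two CV values above, here is a minimal Python sketch. The function name is illustrative; the inputs are the standard deviations and means quoted in the lesson.

```python
def coefficient_of_variation(sd, mean):
    """Coefficient of variation as a percentage: (sd / mean) * 100."""
    return sd / mean * 100


# Long-test scores: standard deviation 9.6640, mean 32.04667.
cv_scores = coefficient_of_variation(9.6640, 32.04667)

# Rates of return of Stock A: standard deviation 0.0688, mean 0.1625.
cv_stock_a = coefficient_of_variation(0.0688, 0.1625)

print(round(cv_scores, 2), round(cv_stock_a, 2))
```

Because the CV is a ratio of two quantities in the same unit, the comparison is meaningful even though one data set is in points and the other is a rate.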

KEY POINTS
• Measures of dispersion are used to further describe the distribution of a data set.
• Absolute measures of variation include the range, interquartile range, variance and standard deviation.
• A relative measure of dispersion is provided by the coefficient of variation.

ASSESSMENT

Note: Answers are provided inside the parentheses and italicized.

1. Three friends, Gerald, Carmina, and Rodolfo, are planning their business of selling homemade peanut butter. They start the planning by doing a market study where they obtained the prices (in pesos) of a 250-gram jar of several known brands of peanut butter. Below is the data set they have collected:

100.80 197.60 158.00 131.60 184.40 149.20 136.00 109.60 360.40 122.80 131.60

After studying the data, Gerald said, “The prices of peanut butter are pretty similar. The range is only PhP 30.80.” Carmina said, “You are mistaken! The prices are very different. The range is PhP 259.60.” Rodolfo said, “I think you are both mistaken. The range isn’t a useful measure to describe the variation of the data set.”

a. Explain what you think is the basis used by each person in support of their claims. (Gerald did not arrange the data set from smallest to largest, and erroneously subtracted the first value (100.80) from the last value (131.60) in the data set. Carmina found the range correctly by subtracting the smallest value (100.80) from the largest value (360.40). Rodolfo noticed that the maximum 360.40 is an outlier. As a result, the computed range of PhP 259.60 only roughly describes the variation of the observations, as it was unduly increased by the extreme value.)

b. Who should we agree with? Why? (We can agree with both Carmina and Rodolfo. Carmina correctly calculated the range; Rodolfo intelligently observed that while Carmina was correct in her calculation, the range is not very useful in describing the variability of the observations, as the range would only be PhP 96.80 if the outlier were removed from the data set.)

2. Three hundred students taking a basic course in Statistics are given similar final examinations. After checking the papers and while the professor is studying the distribution of the final examination scores, he thought of several scenarios which are described below:

a. Suppose the professor will give 30% weight to the final examination. What effect would multiplying all the final scores by 30% have on the mean of the final exam scores? On the standard deviation of the final exam scores? (The mean will also get rescaled by 30%, and so will the standard deviation.)

b. Suppose the professor wants to bloat the final examination scores. What will be the effect on the mean of the final exam scores if 5 points are added to each final score? On the standard deviation of the final exam scores? (The mean will also go up by 5 points, while the standard deviation stays the same.)

3. In a fitness center, the weights of a certain group of students were taken, resulting in a common weight of 140 pounds. What would be the standard deviation of the distribution of weights? (Zero, since the observations do not vary.)

4. Determine which of the following statements is (are) TRUE or FALSE. Explain your answer briefly.

a. If each observation in a data set is doubled, then the standard deviation would also be doubled. (True, since the variance would be quadrupled, and taking the square root of the resulting variance results in twice the standard deviation.)

b. If in a set of data, positive numbers are changed to negative, while negative numbers are changed to positive, then the standard deviation changes its sign as well. (False, since the standard deviation is always nonnegative.)

Explanatory Note: Teachers have the option to ask this assessment orally to the entire class to either introduce or recall the notions of computing the range and of computing the standard deviation, or to group students and ask them to identify answers, or to give this as homework, or to use some questions/items here for a chapter examination.
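For teachers who wish to verify the answers to assessment item 1, the three ranges quoted there can be recomputed with a few lines of Python. This is only a sketch; the variable names are illustrative, and the values are rounded to two decimals to avoid floating-point noise.

```python
# Prices (in pesos) of a 250-gram jar, in the order given in item 1.
prices = [100.80, 197.60, 158.00, 131.60, 184.40, 149.20,
          136.00, 109.60, 360.40, 122.80, 131.60]

# Gerald's figure: last value minus first value of the unsorted list.
gerald_range = round(prices[-1] - prices[0], 2)

# Carmina's figure: the actual range, maximum minus minimum.
carmina_range = round(max(prices) - min(prices), 2)

# Rodolfo's point: the range after dropping the largest value (the outlier).
without_outlier = sorted(prices)[:-1]
range_without_outlier = round(max(without_outlier) - min(without_outlier), 2)

print(gerald_range, carmina_range, range_without_outlier)
```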

CHAPTER 1: EXPLORING DATA
Lesson 9: More on Describing Data: Summary Measures and Graphs

TIME FRAME: 60 minutes

OVERVIEW OF LESSON: In this lesson, students will do an activity that will use the data on heights and weights which were collected in Lesson 2. They will construct box plots and calculate the summary measures they have learned in previous lessons. These computed summary measures and constructed box plots will be used to describe the data set fully so as to provide a simple analysis of the data at hand.

LEARNING OUTCOME(S): At the end of the lesson, the learner is able to
• Construct and interpret box plots; and
• Provide simple analysis of a data set based on its descriptive measures.

LESSON OUTLINE:
1. Preliminaries: Teacher’s Preparation for the Lesson
2. Motivation: The Student’s Height and Weight and Corresponding BMI
3. Construction and Interpretation of a Box-plot

REFERENCES
Albert, J. R. G. (2008). Basic Statistics for the Tertiary Level (ed. Roberto Padua, Welfredo Patungan, Nelia Marquez), published by Rex Bookstore.
“Armspans” in STatistics Education Web (STEW) http://www.amstat.org/education/stew/pdfs/Armspans.docx
“Deciding Which Measure of Center to Use” http://www.sharemylesson.com/teachingresource/deciding-which-measure-of-center-to-use-50013703/
Handbook of Statistics 1 (1st and 2nd Edition), Authored by the Faculty of the Institute of Statistics, UP Los Baños, College Laguna 4031
Workbooks in Statistics 1 (From 1st to 13th Edition), Authored by the Faculty of the Institute of Statistics, UP Los Baños, College Laguna 4031

DEVELOPMENT OF THE LESSON

A. Preliminaries: Teacher’s Preparation for the Lesson

Note: This is an activity that the teacher has to do in preparation for the lesson.

A day before the actual schedule for this lesson, you should review some information about the body mass index (BMI) so that you could compute the BMI of each student in the class based on the students’ weights and heights collected in Lesson 2. This will also make you more confident to discuss BMI in the class as well as use it to integrate the lessons learned in this chapter. The following discussion provides useful information about BMI.

The BMI, devised by Adolphe Quetelet, is defined as the body mass divided by the square of the body height, and is universally expressed in units of kg/m², using weight in kilograms and height in meters. When the term BMI is used informally, the units are usually omitted. A high BMI can be an indicator of high body fatness. The BMI can be used to screen for weight categories that may lead to health problems.

The BMI provides a simple numeric measure of a person's thickness or thinness, allowing medical and health professionals to discuss weight problems more objectively with adult patients. The standard weight status categories associated with BMI ranges for adults are listed below:

BMI Range         Weight Status               Health Risk
Below 18.5        Underweight                 Risk of developing problems such as nutritional deficiency and osteoporosis
18.5-22.9         Normal or Healthy Weight    Low risk (healthy range)
23.0-27.4         Overweight                  Moderate risk of developing heart disease, high blood pressure, stroke, diabetes
27.5 and above    Obese                       High risk of developing heart disease, high blood pressure, stroke, diabetes

For adults, a BMI from 18.5 up to 23 indicates optimal weight, while a BMI lower than 18.5 suggests that the person is underweight, a number from 23 up to 27.5 indicates that the person is overweight, and a number from 27.5 upwards suggests the person is obese. Note that the thresholds 23 and 27.5 are used for South East Asians, as per the suggestion of the World Health Organization (WHO), though generally 25 and 30 are used.

Special Notes about interpreting BMI:

1. Many but not all athletes have a high muscle to fat ratio and may have a BMI that is misleadingly high relative to their body fat percentage. Exceptions also can be made for the elderly and the infirm.

2. For children and teens, the interpretation of BMI depends upon age and sex, even though it is computed using the same formula. This difference in interpretation is due to the variability in the amount of body fat with age and between girls and boys, among children and teens. Instead of comparison against fixed thresholds for underweight and overweight, the BMI is compared against the percentile for children of the same gender and age. A BMI that is less than the 5th percentile is considered underweight and above the 95th percentile is considered obese. Children with a BMI between the 85th and 95th percentile are considered to be overweight.

3. The following are other limitations in the interpretation of BMI.

a. Since the BMI depends upon weight and the square of height, it ignores the basic scaling law which states that mass increases to the 3rd power of linear dimensions. Thus, taller individuals, even if they had exactly the same body shape and relative composition, always have a larger BMI.

b. BMI also does not account for body frame size; a person may have a small frame and be carrying more fat than optimal, but the BMI may suggest that the person is normal. A large-framed individual may be quite healthy with a fairly low body fat percentage, but the BMI may yield an overweight classification.
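The BMI definition and the adult weight status categories in the table above can be sketched in Python as follows. The function names are illustrative, and the cutoffs are the adult thresholds from the table (18.5, 23.0, and 27.5); as the special notes explain, they are not meant for classifying children and teens. The two sample students are taken from the sample class data used later in this lesson.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by height in meters squared."""
    return weight_kg / height_m ** 2


def weight_status(bmi_value):
    """Adult weight status using the cutoffs 18.5, 23.0, and 27.5 from the table."""
    if bmi_value < 18.5:
        return "Underweight"
    if bmi_value < 23.0:
        return "Normal or Healthy Weight"
    if bmi_value < 27.5:
        return "Overweight"
    return "Obese"


# Heights collected in centimeters must first be divided by 100.
height_cm = 164
student_bmi = round(bmi(40, height_cm / 100))   # student 1 in the sample data
heavier_bmi = round(bmi(90, 1.70))              # student 20 in the sample data

print(student_bmi, weight_status(student_bmi))
print(heavier_bmi, weight_status(heavier_bmi))
```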
In the Philippines, the government’s Food and Nutrition Research Institute (FNRI) of the Department of Science and Technology collects the anthropometric data through the National Nutrition Survey (NNS) to be able to generate estimates on the extent of child malnutrition using three indicators of undernutrition: underweight, wasted and stunted. The NNS is conducted every five years and based on the gathered weights and heights, the nutritional status of the Filipinos was assessed.

For a Filipino child whose weight is more than three standard deviations below the median weight-for-age, the child is said to be severely underweight, while if the weight is more than two but not more than three standard deviations below the growth standard, then the child is moderately underweight. Similarly, (moderate and severe) wasting and stunting are defined in terms of the child growth standards on weight-for-height and height-for-age, respectively. Using these standards, FNRI estimates based on the 2013 NNS show that about one in five children aged 0 to 5 years were underweight and about three in ten had stunted growth. Wasting, or low weight-for-height, was estimated at 7.9 percent.
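The rule for classifying underweight children can be sketched as a comparison of the child's weight against the reference median in units of standard deviations. In the Python sketch below, the reference median (12.0 kg) and standard deviation (1.5 kg) are hypothetical numbers chosen only for illustration; actual growth standards tabulate these values by age and sex. The function name is likewise illustrative.

```python
def underweight_status(weight_kg, ref_median_kg, ref_sd_kg):
    """Classify weight-for-age by distance below the reference median, in SD units."""
    z = (weight_kg - ref_median_kg) / ref_sd_kg
    if z < -3:
        return "severely underweight"
    if z < -2:
        return "moderately underweight"
    return "not underweight"


# Hypothetical reference values for one age and sex group (illustration only).
REF_MEDIAN = 12.0
REF_SD = 1.5

for weight in (7.0, 8.5, 11.0):
    print(weight, underweight_status(weight, REF_MEDIAN, REF_SD))
```

The same pattern applies to wasting and stunting, with weight-for-height and height-for-age references in place of weight-for-age.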

It was also reported that the incidence of malnutrition was high among the poorest 20 percent of families: underweight (29.8 percent), stunting (44.8 percent), and wasting (9.5 percent). Malnutrition is thus related to poverty. The percentage of overweight children was highest among the "wealthiest" (10.7 percent). The figure below shows the trends in the prevalence of stunting, underweight and wasting from 1989 to 2013 based on the data gathered by FNRI through its NNS.

[Figure: line graph with Year on the horizontal axis (1990 to 2015) showing three series: Underweight, Stunting, and Wasting.]

Figure 1. Prevalence of stunting, underweight, and wasting among 0-5 years old preschoolers in the Philippines, 1989-2013.

When children under five are experiencing malnutrition, they are likely to carry this over to early childhood, which has repercussions on learning achievements in school. In consequence, the government, through the Department of Social Welfare and Development as well as the Department of Education (DepED), has developed feeding programs to reduce hunger, to aid in the development of children, to improve nutritional status and promote good health, as well as to reduce inequities by encouraging families to send their children to school given the incentive of school feeding benefits. DepED thus regularly collects school records of heights and weights at the beginning and end of the school year to monitor the nutrition of school-aged children.

With this information and the class data gathered in Lesson 2, you are now to compute the BMI of each student so that a table with the following format will be ready for the group activity described in the next section.

Class Student Number    Sex    Height (in meters)    Weight (in kilograms)    BMI (rounded off to whole numbers)

Note that the heights of the students collected in Lesson 2 are in centimeters, thus you have to divide the values by 100 to get the values in meters. Also, BMI is rounded off to whole numbers for ease of computation in the group activity.

B. Motivation: The Student’s Height and Weight and Corresponding BMI

The activities for this lesson are to be done by groups and will be conducted during the entire class period. Hence, it is recommended that the grouping be done at the start of the class and that the group members sit together in a circle, as the activity requires group discussion. As mentioned, the students should be advised to stay in their group for the entire class period. A suggested way to group the students into three groups is to have them count 1-2-3 sequentially; students with the same number will belong to the same group.

Once they are seated together as a group, you could begin the lesson by asking the students if they think that males and females have the same heights, weights and BMI. Have them guess what the distribution of heights, weights and BMI might look like for the whole class and whether the distribution of heights, weights and BMI for males and females would be the same.

The following are some possible questions to ask:
• Are the heights, weights, and BMI of males and females the same or different?
• What are some other factors besides sex that might affect heights, weights and BMI? (Possible factors that could be studied are age, location where the person resides, and the year the data was collected.)

You could write these questions on the board so that the students will be reminded of these questions while they perform the group activity. Assign the first group (those students who were numbered ‘1’) to the variable ‘height’; the second group (those students who were numbered ‘2’) to the variable ‘weight’; and the third group (those students who were numbered ‘3’) to the variable ‘BMI’. You will be using the class data you prepared in the preliminary activity for this lesson. The following table provides sample data showing what your class data should look like.

Class Student Number    Sex    Height (in meters)    Weight (in kilograms)    BMI (rounded off to whole numbers)
1                       F      1.64                  40                       15
2                       F      1.52                  50                       22
3                       F      1.52                  49                       21
4                       F      1.65                  45                       17
5                       F      1.02                  60                       58
6                       F      1.63                  45                       17
7                       F      1.50                  38                       17
8                       F      1.60                  51                       20
9                       F      1.42                  42                       21
10                      F      1.52                  54                       23
11                      F      1.48                  46                       21
12                      F      1.62                  54                       21
13                      F      1.50                  36                       16
14                      F      1.54                  50                       21
15                      F      1.67                  63                       23
16                      M      1.72                  55                       19
17                      M      1.65                  61                       22
18                      M      1.56                  60                       25
19                      M      1.50                  52                       23
20                      M      1.70                  90                       31
21                      M      1.53                  50                       21
22                      M      1.62                  90                       34
23                      M      1.79                  80                       25
24                      M      1.57                  58                       24
25                      M      1.70                  68                       24
26                      M      1.77                  27                       9
27                      M      1.48                  50                       23
28                      M      1.73                  94                       31
29                      M      1.56                  66                       27
30                      M      1.75                  50                       16

With the class data, ask each group to do the following for the assigned variable in their group:

1. Compute the descriptive measures for the whole class and also for each subgroup in the data set with sex as the grouping variable. The descriptive measures to compute include the measures of location such as minimum, maximum, mean, median, first and third quartiles; and measures of dispersion such as the range, interquartile range (IQR) and standard deviation. Each group could use the following format of the table to present the computed measures:

Table 9.1 Summary statistics of the variable __________.

                          Computed Value
Descriptive Measure       Whole class (N = ___)    Males (N = ___)    Females (N = ___)
Measures of Location
  Minimum
  Maximum
  Mean
  First Quartile
  Median
  Third Quartile
Measures of Dispersion
  Range
  IQR
  Standard Deviation

2. With the computed descriptive measures, write a textual presentation of the data for the variable assigned to the group.

The following tables provide the descriptive measures of the sample class data as a whole and by subgroup. Note that there might be discrepancies in the computed values due to rounding off.

Table 9.2 Summary statistics of the variable height (in meters) using the sample data.

                          Computed Value
Descriptive Measure       Whole class (N = 30)    Males (N = 15)    Females (N = 15)
Measures of Location
  Minimum                 1.020                   1.480             1.020
  Maximum                 1.790                   1.790             1.670
  Mean                    1.582                   1.642             1.522
  First Quartile          1.520                   1.560             1.500
  Median                  1.585                   1.650             1.520
  Third Quartile          1.670                   1.730             1.630
Measures of Dispersion
  Range                   0.770                   0.310             0.650
  IQR                     0.150                   0.170             0.130
  Standard Deviation      0.144                   0.103             0.157

Possible textual presentation of the data on heights: Based on Table 9.2, on the average, a student of this class is 1.582 meters tall. The shortest student is just a little bit over one meter while the tallest is 1.79 meters tall, resulting in a range of 0.77 meter. The median, which is 1.585, is almost the same as the mean height. Comparing the male and female students, on the average male students are taller than female students, but the dispersion of the heights of the female students is wider compared to that of the male students. Thus, the heights of the male students of this class tend to be more similar to one another than those of the female students.

Table 9.3 Summary statistics of the variable weight (in kilograms) using the sample data.

                          Computed Value
Descriptive Measure       Whole class (N = 30)    Males (N = 15)    Females (N = 15)
Measures of Location
  Minimum                 27.0                    27.0              36.0
  Maximum                 94.0                    94.0              63.0
  Mean                    55.8                    63.4              48.2
  First Quartile          46.0                    50.0              42.0
  Median                  51.5                    60.0              49.0
  Third Quartile          61.0                    80.0              54.0
Measures of Dispersion
  Range                   67.0                    67.0              27.0
  IQR                     15.0                    30.0              12.0
  Standard Deviation      15.9                    18.4              7.7

Possible textual presentation of the data on weights: Using the statistics in Table 9.3, on the average, a student of this class weighs 55.8 kilograms. The minimum weight of the students in this class is only 27 kilograms while the heaviest student of this class weighs 94 kilograms. There is a wide variation among the values of the weights of the students in this class as measured by the range, which is equal to 67 kilograms. The median weight for this class is 51.5 kilograms, which is quite different from the mean, as the value of the latter was pulled by the presence of extreme values. Comparing the male and female students, on the average male students are heavier than female students. The extreme values observed for the class both come from male students. The wide variation observed in the students’ weights for this class was also observed among the weights of the male students. In fact, the standard deviation of the weights of the male students is more than double the standard deviation of the weights of the female students.
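The entries of Table 9.3 can be reproduced with a short Python sketch such as the one below. Two choices in it are assumptions inferred from the tabulated values rather than stated in the lesson: quartiles are located at position N × p, rounded up when fractional and averaged with the next observation otherwise, and the standard deviation uses the divisor N − 1 (`statistics.stdev`), which appears to match the table. The function names are illustrative.

```python
import math
import statistics

# Weights (kg) from the sample class data, by sex.
female_weights = [40, 50, 49, 45, 60, 45, 38, 51, 42, 54, 46, 54, 36, 50, 63]
male_weights = [55, 61, 60, 52, 90, 50, 90, 80, 58, 68, 27, 50, 94, 66, 50]


def quantile(values, p):
    """Quantile by position N*p: round up if fractional, else average with the next value."""
    data = sorted(values)
    k = len(data) * p
    if k != int(k):
        return data[math.ceil(k) - 1]
    k = int(k)
    return (data[k - 1] + data[k]) / 2


def summarize(values):
    """Measures of location and dispersion used in Tables 9.1 to 9.4."""
    q1, q3 = quantile(values, 0.25), quantile(values, 0.75)
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "q1": q1,
        "median": quantile(values, 0.5),
        "q3": q3,
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "sd": statistics.stdev(values),  # divisor N - 1 (assumed from the table)
    }


groups = {
    "Whole class": female_weights + male_weights,
    "Males": male_weights,
    "Females": female_weights,
}
for label, data in groups.items():
    summary = summarize(data)
    print(label, {name: round(value, 1) for name, value in summary.items()})
```

The same `summarize` function could be applied to the height and BMI columns to check Tables 9.2 and 9.4, subject to the rounding discrepancies mentioned above.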

Table 9.4 Summary statistics of the variable BMI (in kg/m²) using the sample data.

                          Computed Value
Descriptive Measure       Whole class (N = 30)    Males (N = 15)    Females (N = 15)
Measures of Location
  Minimum                 9.0                     9.0               15.0
  Maximum                 58.0                    34.0              58.0
  Mean                    22.9                    23.6              22.2
  First Quartile          19.0                    21.0              17.0
  Median                  21.5                    24.0              21.0
  Third Quartile          24.0                    27.0              22.0
Measures of Dispersion
  Range                   49.0                    25.0              43.0
  IQR                     5.0                     6.0               5.0
  Standard Deviation      8.3                     6.2               10.2

Possible textual presentation of the data on BMIs: Table 9.4 shows that the minimum BMI of the students in the class is 9 while the maximum is 58 kg/m². On the average, a student of this class has a BMI of 22.9. Also, the median BMI for this class is 21.5, which is near the value of the mean BMI. The variability of the values is also not that large, as a small standard deviation of 8.3 was obtained. Comparing the male and female students, on the average, the BMIs of the male and female students are near each other, with numerical values equal to 23.6 and 22.2, respectively. But there is a wider variation among the BMI values of the female students compared to that of the male students. The standard deviation of the BMIs of the male students is less than that of the female students.

Visual comparison of the data distributions between two or among several groups could be achieved through box-plots. You may ask the students if they already know how to construct a box-plot. If so, you may just review the steps with them. Otherwise, you may briefly discuss the steps in constructing a box-plot as given in the next section before you ask them to construct box-plots for their respective data sets.

C. Construction of a Box-Plot

Using five summary statistics, namely the minimum, maximum, median, and first and third quartiles, a box-plot can be constructed as follows:

1. Draw a rectangular box (horizontally or vertically) with the first and third quartiles as the endpoints. Thus the width of the box is given by the IQR, which is the difference between the third and first quartiles.

2. Locate the median inside the box and identify it with a line segment.

3. Compute 1.5 IQR. Use this value to identify markers. These markers are used to identify outliers. The lowest marker is given by Q1 - 1.5 IQR while the highest marker is Q3 + 1.5 IQR. Values outside these markers are said to be outliers and could be represented by a solid circle.

4. One of the two whiskers of the box-plot is a line segment joining the side of the box representing Q1 and the minimum, while the other whisker is a line segment joining Q3 and the maximum. This is for the case when the minimum and maximum are not outliers. In the case that there are outliers, the whiskers will only be line segments from the side of the box to its corresponding marker.

Inform the students also that a box-plot is also called a box-and-whiskers plot and that it could easily be generated using statistical software. Comparison of data distributions could easily be done visually using this kind of plot. Likewise, in technical papers or reports, a box-plot is an accepted graphical presentation of a data distribution.

To complete the activity for this lesson, ask each group to construct box-plots of the male and female data distributions of their assigned variable. They could further improve their textual presentation by interpreting the resulting box-plots of their data sets.
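The steps for constructing a box-plot can be sketched numerically as follows, using the heights of the 15 female students in the sample class data. The quartile rule (position N × p, rounded up when fractional) is an assumption inferred from the tabulated quartiles, and the function names are illustrative. The whisker ends follow step 4 as written in this lesson; note that many statistical packages instead end a whisker at the most extreme observation that is not an outlier.

```python
import math

# Heights (m) of the 15 female students in the sample class data.
female_heights = [1.64, 1.52, 1.52, 1.65, 1.02, 1.63, 1.50, 1.60,
                  1.42, 1.52, 1.48, 1.62, 1.50, 1.54, 1.67]


def quantile(values, p):
    """Quantile by position N*p: round up if fractional, else average with the next value."""
    data = sorted(values)
    k = len(data) * p
    if k != int(k):
        return data[math.ceil(k) - 1]
    k = int(k)
    return (data[k - 1] + data[k]) / 2


def box_plot_summary(values):
    """Quartiles, outlier markers, outliers, and whisker ends for a box-plot."""
    q1, median, q3 = (quantile(values, p) for p in (0.25, 0.5, 0.75))
    iqr = q3 - q1
    lower_marker = q1 - 1.5 * iqr
    upper_marker = q3 + 1.5 * iqr
    outliers = sorted(x for x in values if x < lower_marker or x > upper_marker)
    # Step 4: a whisker reaches the extreme value unless that value is an outlier,
    # in which case it stops at the corresponding marker.
    whisker_low = min(values) if min(values) >= lower_marker else lower_marker
    whisker_high = max(values) if max(values) <= upper_marker else upper_marker
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "lower_marker": lower_marker, "upper_marker": upper_marker,
        "outliers": outliers,
        "whisker_low": whisker_low, "whisker_high": whisker_high,
    }


result = box_plot_summary(female_heights)
for name, value in result.items():
    print(name, value)
```

For this data set the only outlier is the height of 1.02 meters, which lies below the lower marker.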
Using the sample class data, the following figures provide the box-plots for the variables heights, weights and BMI by sex of the student. The said figures confirm what was stated in the textual presentation.

Figure 9.1 Box-plots of the variable heights of the 30 students by sex.

We could also note that in Figure 9.1, the distribution of heights for the girls has a larger range because of an outlier, as represented by a solid circle on the plot. The distribution of the girls’ heights has a smaller median compared to the male distribution.

Figure 9.2 Box-plots of the variable weights of the 30 students by sex.

For the variable weights, females have a lower median weight than males, as well as less variability. The middle 50% of the female weight distribution is also observed to be contained within the range of the male weight data.

Figure 9.3 Box-plots of the variable BMI of the 30 students by sex.

As for the variable BMI, females have a lower median BMI and a narrower middle 50% (smaller IQR) compared to males. There is at least one extremely obese female and one severely underweight male.

With the computed descriptive statistics and corresponding box-plot(s), the analysis or textual presentation could be further improved by describing the data not only in terms of the measures but also in terms of the interpretation of the box plots. Furthermore, these measures allow us to answer the guide questions provided at the start of the class.

KEY POINTS
• Descriptive measures are important statistics required in simple data analysis.
• Groups of data could be compared in terms of their descriptive measures.
• A box-plot is an approach to compare data distributions visually.

ASSESSMENT

Note: Answers are provided inside the parentheses and italicized.

In a university, the grading scale that is used for a subject is as follows: 1.0; 1.25; 1.5; 1.75; 2.0; 2.25; 2.5; 2.75; 3.0; 4.0; and 5.0. Grades from 1.0 to 3.0 are passing grades, with 1.0 as the highest possible grade. The grade of 5.0 is failing while 4.0 is a conditional grade. At the end of the semester, the general weighted averages (GWAs) of the students are computed and students with high GWAs are usually recognized. Below is a table showing the GWA and sex of thirty students who are to be recognized in a program for having high GWAs.

Name         GWA     Sex
Imelda       1.54    F
Frederick    1.45    M
Gerald       1.42    M
Jose         1.52    M
Ana          1.56    F
Isidoro      1.34    M
Roberto      1.36    M
Katherine    1.43    F
Barbara      1.49    F
Josie        1.58    F
Maria        1.64    F
Kenneth      1.56    M
Ofelia       1.56    F
Amparo       1.49    F
James        1.42    M
Ditas        1.24    F
Frenz        1.78    F
Ronald       1.06    M
Ruben        1.33    M
Belle        1.45    F
Elmo         1.38    M
Connie       1.27    F
Gina         1.22    F
Marcia       1.59    F
Jikko        1.60    M
Susan        1.59    F
Emman        1.63    M
Pinky        1.70    F
Rose         1.75    M
Brad         1.58    M

Use the approaches below to compare the academic performance of male and female students in the previous term.

1. Compute the descriptive measures by sex. These include the measures of location, such as the minimum, maximum, mean, median, and first and third quartiles; and the measures of dispersion, such as the range, interquartile range (IQR), and standard deviation.

Descriptive Measure        Computed Value
                           Males (N = 14)    Females (N = 16)
Measures of Location
  Minimum                  1.06              1.22
  Maximum                  1.75              1.78
  Mean                     1.46              1.51
  First Quartile           1.36              1.44
  Median                   1.44              1.55
  Third Quartile           1.58              1.59
Measures of Dispersion
  Range                    0.69              0.56
  IQR                      0.22              0.15
  Standard Deviation       0.17              0.16
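As a cross-check of the computed values, the measures can be recomputed in a few lines of Python. This is only a sketch under stated assumptions: the lists `males` and `females` hold the GWAs copied from the data table, the helper names `quartiles` and `describe` are hypothetical, the quartiles are taken as the medians of the lower and upper halves of the sorted data (an assumed convention that agrees with the quartiles reported here), and the standard deviation uses the sample (n - 1) formula.

```python
from statistics import mean, median, stdev

# GWAs copied from the table of thirty students, split by sex.
males = [1.45, 1.42, 1.52, 1.34, 1.36, 1.56, 1.42,
         1.06, 1.33, 1.38, 1.60, 1.63, 1.75, 1.58]
females = [1.54, 1.56, 1.43, 1.49, 1.58, 1.64, 1.56, 1.49,
           1.24, 1.78, 1.45, 1.27, 1.22, 1.59, 1.59, 1.70]


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumed convention: for an odd number of values, the middle value
    is left out of both halves.
    """
    data = sorted(values)
    half = len(data) // 2
    return median(data[:half]), median(data[-half:])


def describe(values):
    """Collect the measures of location and dispersion for one group."""
    q1, q3 = quartiles(values)
    return {
        "minimum": min(values),
        "maximum": max(values),
        "mean": mean(values),
        "first quartile": q1,
        "median": median(values),
        "third quartile": q3,
        "range": max(values) - min(values),
        "IQR": q3 - q1,
        "standard deviation": stdev(values),  # sample formula (n - 1)
    }


for label, group in (("Males", males), ("Females", females)):
    print(f"{label} (N = {len(group)})")
    for name, value in describe(group).items():
        print(f"  {name}: {value:.3f}")
```

Values are printed to three decimal places so that the rounding to two places used in the table stays visible; the male median, for instance, is the average of 1.42 and 1.45, that is 1.435, which the table reports as 1.44.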

2. Using the computed descriptive statistics, compare the two distributions in terms of their measures of location and measures of dispersion. On the average, which group of students performed better academically in the previous term? Which group varies more?

(On the average, the numerical GWA of the female students is 1.51, while the male students have an average GWA of 1.46. Since a lower GWA corresponds to a better grade, this implies that the male students in this group performed better academically than the female students. The computed medians also differ numerically, but they lead to the same observation that the males performed better than the females. However, the variability of the observations for the male students is higher than that of the female students. Hence, we say that the GWAs of the male students vary more than those of the female students.)


3. Sort the data within each group, then determine what proportion in each group is within one standard deviation of that group's mean. Are the proportions similar?

(Sorted data for the male students:
1.06  1.33  1.34  1.36  1.38  1.42  1.42  1.45  1.52  1.56  1.58  1.60  1.63  1.75

$\bar{x} \pm \sigma = 1.46 \pm 0.17$, which gives the interval (1.29, 1.63).

Note that 12 out of the 14 observations, or 86% of the observations, are within one standard deviation of the mean.

Sorted data for the female students:
1.22  1.24  1.27  1.43  1.45  1.49  1.49  1.54  1.56  1.56  1.58  1.59  1.59  1.64  1.70  1.78

$\bar{x} \pm \sigma = 1.51 \pm 0.16$, which gives the interval (1.35, 1.67).

Note that 11 out of the 16 observations, or 69% of the observations, are within one standard deviation of the mean.

The proportions of observations that are within one standard deviation of the mean are not the same for the two groups. The proportion for the male group is larger than that for the female group. This supports the observation earlier that the GWAs of the male students are more varied compared to those of the female students.)
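The counting step can be sketched in Python as follows. As an assumption of this sketch, all values are stored as whole hundredths (1.46 becomes 146) so that the comparisons at the interval boundaries are exact, and the rounded means and standard deviations quoted in the answer are used as given; the function name `count_within_one_sd` is hypothetical.

```python
def count_within_one_sd(values, mean, sd):
    """Count the values v that satisfy mean - sd <= v <= mean + sd."""
    lower, upper = mean - sd, mean + sd
    inside = sum(1 for v in values if lower <= v <= upper)
    return inside, len(values)


# GWAs in hundredths (e.g. 1.06 is stored as 106), already sorted.
males = [106, 133, 134, 136, 138, 142, 142, 145, 152, 156, 158, 160, 163, 175]
females = [122, 124, 127, 143, 145, 149, 149, 154,
           156, 156, 158, 159, 159, 164, 170, 178]

# Rounded mean and standard deviation of each group, also in hundredths.
groups = (("Males", males, 146, 17), ("Females", females, 151, 16))

for label, values, mean, sd in groups:
    inside, total = count_within_one_sd(values, mean, sd)
    print(f"{label}: {inside} of {total} = {inside / total:.0%}")
```

Note that the male GWA of 1.63 sits exactly on the upper end of the rounded interval, so the count for the males is sensitive to how the mean and standard deviation are rounded.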


4. Construct box-plots of the GWAs for the males and females. Compare the two data distributions of GWAs.

Visually, the two distributions of GWAs are different. The GWAs of the female students are less dispersed than those of the male students. Numerically, the median GWA of the male students is lower than that of the female students. Hence, the male students in this group performed better academically than their female counterparts. However, the GWAs of the female students are closer to each other.


CHAPTER 2: RANDOM VARIABLES AND PROBABILITY DISTRIBUTIONS
Lesson 1: Probability
TIME FRAME: 60 minutes

OVERVIEW OF LESSON
In this activity, learners initially review some basic concepts in probability that they may have learned prior to Grade 11. Then, they are taught additional concepts on conditional probability. There is also a discussion of the classical birthday problem, which shows them how to compute the chance or probability that at least two people in the classroom share the same birthday.

LEARNING COMPETENCIES
At the end of the lesson, learners should be able to:
• define probability in terms of empirical frequencies
• show how to apply the General Addition Rule and the Multiplication Rule
• make use of a tree diagram for conditional probabilities

LESSON OUTLINE
A. Introduction / Motivation: What is Probability?
B. Main Lesson: Computing the Probability of an Event

REFERENCES
Many of the materials in this lesson were adapted from:
De Veaux, R. D., Velleman, P. F., and Bock, D. E. (2006). Intro Stats. Pearson Ed. Inc.
Workbooks in Statistics 1, 11th Edition. Institute of Statistics, UP Los Baños, College Laguna 4031
Probability and Statistics: Module 18. (2013). Australian Mathematical Sciences Institute and Education Services Australia. Retrieved from http://www.amsi.org.au/ESA_Senior_Years/PDF/Probability4a.pdf


DEVELOPMENT OF THE LESSON
A. Introduction / Motivation: What is Probability and How to Assign It?

Begin the session with a discussion of the uncertainty attached to summaries generated from data, especially when the data are “random” samples from a larger population of units (e.g. people, farms, firms). Examples: (i) approval ratings or the proportion of people voting for a candidate (in an opinion poll); (ii) average family income (in the Philippine Statistics Authority’s triennial Family Income and Expenditure Survey); (iii) average prices of commodities (from sample outlets).

Explain that people can quantify uncertainty through the notion of PROBABILITY (or chance). Suggest to learners that if they were asked for the probability that they would pass the next quiz, they may give a number between 0 percent and 100 percent. Typically, the chances of a future outcome may be based on past experience or on data collected. Very studious learners, for instance, may have passed their quizzes 100 percent of the time, while average students may have passed their quizzes 85 percent of the time.

When considering probabilities of events, learners should be guided to consider a particular context wherein the possible outcomes are well defined and can be specified, at least in principle, beforehand. This context is called a random process: we do not know which of the possible outcomes will occur, but we do know what is on the list of possible outcomes.

Learners can be informed that it can also be helpful to view the probability of an event as its “long-run” empirical frequency, or the fraction of times the event occurs under repeated “trials” of the random process. In the next lesson, we shall call this the “empirical probability,” and mention that in practice, we expect these empirical probabilities to stabilize toward some “theoretical probability.” (This is called the law of large numbers.)

Ask learners to think of random processes and an event where:
a. the outcome is certain. Examples may be getting a head (event) in the next toss of a two-headed coin (random process), or getting a number of at most 6 (event) when a die is thrown once (random process).
b. the outcome is impossible. Examples may be getting a tail (event) in the next toss of a two-headed coin (random process), or getting a number greater than 6 (event) when a die is thrown once (random process).


c. the outcome has an even chance of occurring. Examples may be a couple having a boy (event) as their next child (random process), or getting a red card (event) when randomly selecting a card from a deck of cards (random process).
d. the outcome has a strong but not certain chance of occurring. An example might be getting a sum of at most 11 (event) when a pair of dice is thrown (random process).

Then, ask them the probability associated with these events. (Answers: 100 percent for certain events, 0 percent for impossible events, and 50 percent for outcomes with an even chance of occurring. For the example in (d), there are 36 possible outcomes for tossing a pair of fair dice, and 35 of them have a sum of at most 11, so the chance of getting at most 11 is 35/36.) The closer the value of the probability is to 1, the more likely the event will occur; the closer it is to 0, the less likely it will occur.

Important: Point out to learners the following properties of the probability of an event:
• The probability of an event is a non-negative value. In fact, it ranges from zero (0) (when the event is impossible) to one (1) (when the event is sure). The closer the value is to one, the more likely the event will occur.
• The probability of the sure event is one. (In other words, the chance of a sure event is 100 percent.)
• If A and B are mutually exclusive events, meaning it is impossible for these two events to occur at the same time, then P(A or B) = P(A) + P(B). This is called the Addition Rule.

A more general result (also called the General Addition Rule) states that:

P(A or B) = P(A) + P(B) - P(A and B)

Geometrically, from a Venn Diagram, the area of the union of A and B is the sum of the areas of A and B, but this sum counts the intersection A ∩ B twice, so we need to subtract the area of the intersection once.

Illustrate to learners that these properties can help us more readily compute the probabilities of events.

P(at most a sum of 12 when tossing a pair of fair dice)
= P(at most a sum of 11 OR a sum of 12)
= P(at most a sum of 11) + P(sum of 12)


But P(at most a sum of 12) = 1 and P(sum of 12) = 1/36. Thus, solving for P(at most a sum of 11):

P(at most a sum of 11) = 1 – 1/36 = 35/36

In general, if we are interested in A^c, the complement of an event A (i.e. the event that happens when A does not), then since A and A^c are mutually exclusive,

P(A or A^c) = P(A) + P(A^c)

and

P(A or A^c) = P(Sure event) = 100%

Thus,

P(A) + P(A^c) = 1

or equivalently

P(A^c) = 1 – P(A)

In consequence, the chance that an event does not occur is one (1) minus the chance that it does occur. In terms of a Venn Diagram, given a Sure Event S (represented by a square with area 100%) and an event A (represented by a triangle whose area represents the probability of A), the chance of event A not happening is one minus the chance of event A happening (i.e. the area of the square minus the area of the triangle).
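These rules can be checked by listing all 36 outcomes for a pair of fair dice. The Python sketch below is illustrative only: the helper `prob` and the two overlapping events used for the General Addition Rule (a six on the first die, a six on the second die) are hypothetical choices, not part of the lesson.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of tossing a pair of fair dice.
outcomes = list(product(range(1, 7), repeat=2))


def prob(event):
    """Probability of an event, given as a true/false test on one outcome."""
    favorable = sum(1 for outcome in outcomes if event(outcome))
    return Fraction(favorable, len(outcomes))


# Complement rule: P(at most a sum of 11) = 1 - P(sum of 12).
p_sum_12 = prob(lambda o: sum(o) == 12)
p_at_most_11 = prob(lambda o: sum(o) <= 11)
print(p_sum_12, p_at_most_11, p_at_most_11 == 1 - p_sum_12)


# General Addition Rule with two events that can happen together.
def first_is_six(o):
    return o[0] == 6


def second_is_six(o):
    return o[1] == 6


p_either = prob(lambda o: first_is_six(o) or second_is_six(o))
p_both = prob(lambda o: first_is_six(o) and second_is_six(o))
print(p_either == prob(first_is_six) + prob(second_is_six) - p_both)
```

Exact fractions are used instead of decimals so that the equalities can be tested without rounding error.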


Extra Notes: Mention also to learners that:
(1) Historically, probability was studied by gamblers who wanted to increase their winnings (or at least decrease their losses).
(2) Probability describes random behavior, but does anything really happen at random? Even Albert Einstein, when confronted by theories of quantum mechanics, was said to have pointed out that “God does not play dice.” Yet many events, especially in nature, “seem” to display random behavior. In many real-life situations, we can model these events as random processes and thus apply probability to understand their behavior.

B. Main Lesson: Computing Probabilities of Events

Mention to learners that the probability of an event may sometimes be calculated directly from the nature of the phenomenon or random process, with some assumptions of symmetry. Some underlying outcomes may be “equally likely” by assumption, as with fair coins and fair dice. These assumptions are simplifications that help us calculate probabilities; in practice, they need to be tested, and this will be the subject of inquiry in future lessons.

Example 1: Tell learners that a box contains green and blue chips. A chip is then drawn from the box. If it is green, you win Php100. If it is blue, you win nothing.
• Learners have a choice between two boxes:
  – Box A with 3 blue chips and 2 green chips
  – Box B with 30 blue chips and 20 green chips
• Which box would learners prefer?

Some learners may say Box B, but tell learners that it actually should not matter: the chance of winning Php100 is 2/5 = 40% with Box A, while with Box B the chance of winning is 20/50 = 40%, the same probability.

Conditional Probability

Mention to learners that sometimes we may have extra information that can change the probability of an event. Give the following definition of conditional probability: the conditional probability of event A given that B has occurred is denoted as P(A|B) and defined as

P(A|B) = P(A and B) / P(B), provided that P(B) > 0.


Example 2: Suppose that we want to randomly select a student from among Grades 9 to 12 in a certain school.

Grade     Male    Female    Total
9         84      145       229
10        40      82        122
11        36      52        88
12        25      36        61
Total     185     315       500

The chance of selecting a Grade 11 student, given that the student is male, can be computed as follows. Define events A and B as:
A = event that the student selected is a Grade 11 student
B = event that the student selected is male

Then

P(A|B) = P(A and B) / P(B) = (36/500) / (185/500) = 36/185 ≈ 0.195
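The same calculation can be sketched in Python directly from the counts in the table of Example 2. The dictionary layout and variable names are hypothetical; only the counts come from the example.

```python
from fractions import Fraction

# Counts of students by grade and sex, copied from the table in Example 2.
counts = {
    9: {"Male": 84, "Female": 145},
    10: {"Male": 40, "Female": 82},
    11: {"Male": 36, "Female": 52},
    12: {"Male": 25, "Female": 36},
}

total = sum(sum(row.values()) for row in counts.values())
males = sum(row["Male"] for row in counts.values())

p_b = Fraction(males, total)                    # P(B): student is male
p_a_and_b = Fraction(counts[11]["Male"], total) # P(A and B): male and Grade 11
p_a_given_b = p_a_and_b / p_b                   # P(A|B) by the definition

p_a = Fraction(sum(counts[11].values()), total) # P(A): Grade 11, either sex

print(f"P(A|B) = {p_a_given_b} = {float(p_a_given_b):.3f}")
print(f"P(A)   = {p_a} = {float(p_a):.3f}")
```

Comparing P(A|B) with the unconditional P(A) shows how the extra information that the student is male changes the probability.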

Example 3: A king comes from a family of two children. What is the chance that the king has a sister? Remind learners here that as the king comes from a family of two children, we are given extra information that this family of two children has a boy, the king.


What we want to compute here is the probability that the sibling of the king is a girl. Let B be the event of having at least one boy, so B = {(b,b), (b,g), (g,b)}, where (x,y) gives the sex of each of the two children, with b for boy and g for girl. Let A be the event that the king's sibling is a girl, so A = {(b,g), (g,b)}. The original sample space of all possible outcomes is S = {(b,b), (g,b), (b,g), (g,g)}, where each outcome has a ¼ chance of occurring. Since every outcome in A is also in B, P(A and B) = P(A) = 2/4, and so

P(A|B) = P(A and B) / P(B) = (2/4) / (3/4) = 2/3
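A brute-force check of Example 3 is sketched below in Python. It assumes, as the example does, that the four boy/girl combinations are equally likely; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Sample space S: sexes of the two children, 'b' for boy and 'g' for girl.
sample_space = list(product("bg", repeat=2))


def prob(event):
    """Probability of an event over the four equally likely outcomes."""
    favorable = sum(1 for outcome in sample_space if event(outcome))
    return Fraction(favorable, len(sample_space))


def at_least_one_boy(outcome):
    """Event B: the family has at least one boy (the king)."""
    return "b" in outcome


def one_boy_one_girl(outcome):
    """Event A and B: the family has a boy whose sibling is a girl."""
    return "b" in outcome and "g" in outcome


p_sister_given_boy = prob(one_boy_one_girl) / prob(at_least_one_boy)
print(p_sister_given_boy)
```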

Independent Events

Sometimes, the extra information provided may not really change the probability of an event. In this case, the events are said to be independent: the conditional probability of A given B is still equal to the (unconditional) probability of event A. Two events A and B are said to be independent if

P(A and B) = P(A) P(B)

This is also called the Multiplication Rule. Intuitively, we regard the outcomes of tossing a coin (or a die) several times as independent, since future tosses are not affected by previous outcomes. If, however, the events are not independent, then we can still obtain the probability that both events A and B will occur using the definition of conditional probability:

P(A and B) = P(A) P(B|A)

Example 4: Tell learners to suppose that there is a box that contains three tickets marked 1, 2, and 3. We shake the box, draw out one ticket at random; shake the box and draw out a second ticket. What would be the probability of getting a sum of “three” if tickets were drawn with replacement? Without replacement?
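One way to work through Example 4 is to list every ordered pair of draws. The Python sketch below does this for both sampling schemes; the helper name `prob_sum` is hypothetical, and each ordered pair is assumed to be equally likely.

```python
from fractions import Fraction
from itertools import permutations, product

tickets = [1, 2, 3]

# Ordered pairs (first draw, second draw) under each sampling scheme.
with_replacement = list(product(tickets, repeat=2))   # the first ticket is put back
without_replacement = list(permutations(tickets, 2))  # the first ticket is kept out


def prob_sum(draws, target):
    """Probability that the two tickets add up to target."""
    hits = sum(1 for pair in draws if sum(pair) == target)
    return Fraction(hits, len(draws))


print(len(with_replacement), prob_sum(with_replacement, 3))
print(len(without_replacement), prob_sum(without_replacement, 3))
```

The first number on each printed line is the number of equally likely ordered pairs, which is what changes between the two schemes.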


The possible sums for the two tickets drawn with replacement can be shown in a contingency table and a tree diagram. Of the 9 equally likely ordered draws, two, namely (1, 2) and (2, 1), give a sum of three. In consequence, the probability of getting a sum of three is:

P(sum of three) = 2/9

If the tickets were drawn without replacement, there are only 6 equally likely ordered draws, and we have

P(sum of three) = 2/6 = 1/3

Exercise 5: (The Birthday Problem, originally posed by Richard von Mises in 1939, reprinted in English in 1964)

Mention to learners that in a room with 23 or more people, there is more than half a chance that at least two of them will have the same birthday, and with more people, the chance increases further toward 100% (about 99.9% with 70 people).

Try it out with learners in your class. Tell learners to identify how many of them have a birthday in January, and see if you can get a match. Go to February if you do not find any match, then March, and so forth.


The chance of 2 people having different birthdays is:

$1 - \frac{1}{365} = \frac{364}{365} = 0.997260$

The chance of N people all having different birthdays is:

$\left(1 - \frac{1}{365}\right)\left(1 - \frac{2}{365}\right) \cdots \left(1 - \frac{N-1}{365}\right)$

So the chance that at least two of them will have the same birthday is:

$p(N) = 1 - \left(1 - \frac{1}{365}\right)\left(1 - \frac{2}{365}\right) \cdots \left(1 - \frac{N-1}{365}\right)$

We have the probabilities computed below for several values of N.

N      p(N)
10     11.7%
20     41.1%
23     50.7%
30     70.6%
50     97.0%
57     99.0%
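The values of p(N) can be reproduced with a short loop. This Python sketch assumes, as the formula does, 365 equally likely birthdays and ignores leap years; the function name `p_shared_birthday` is hypothetical.

```python
def p_shared_birthday(n, days=365):
    """Chance that at least two of n people share a birthday."""
    p_all_different = 1.0
    for k in range(n):
        # The (k + 1)-th person must avoid the k birthdays already taken.
        p_all_different *= (days - k) / days
    return 1 - p_all_different


for n in (10, 20, 23, 30, 50, 57, 70):
    print(f"N = {n:2d}: p(N) = {p_shared_birthday(n):.1%}")
```

The loop multiplies the factors (1 - k/365) for k = 0, 1, ..., N - 1, which is the product in the formula with a leading factor of 1 for the first person.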

KEY POINTS
• Probability is a numerical representation of the likelihood of occurrence of an event. Its value is between zero (0) and one (1). When the value approaches 1, the event is very likely to occur, while a value close to zero (0) means it is not likely to occur.
• When A and B are mutually exclusive events, then the probability of A or B is P(A or B) = P(A) + P(B) (this is called the Addition Rule).
• If A and B are independent events, then the probability of A and B is P(A and B) = P(A) P(B) (this is called the Multiplication Rule).

ASSESSMENT

1. What would be the probability of
a. picking a black card at random from a standard deck of 52 cards?
b. picking a face card (i.e. a king, queen, or jack)?
c. not picking a face card?
Answer: a. P(Black) = 26/52 = 1/2; b. P(Face) = 12/52 = 3/13; c. P(not Face) = 1 – (3/13) = 10/13

2. What is the probability of rolling, on a fair die:
a. a 3?
b. an even number?
c. zero?
d. a number greater than 4?
e. a number lying between 0 and 7?
f. a multiple of 3, given that an even number was rolled?
Answer: a. P(‘3’) = 1/6; b. P(Even) = 3/6 = 1/2; c. P(‘0’) = 0; d. P(greater than 4) = P(5 or 6) = 2/6 = 1/3; e. P(between 0 and 7) = P(1 or 2 or 3 … or 6) = 6/6 = 1; f. P(multiple of 3 given even number) = P(multiple of 3 and even) / P(even) = P(‘6’) / P(2 or 4 or 6) = (1/6) / (3/6) = 1/3

3. A standard deck of playing cards is well shuffled and from it, you are given two cards. You can have 0, 1, or 2 aces: three possibilities altogether. So the probability that you have two aces is equal to 1/3. What is flawed about this argument?
Answer: The outcomes are not equally likely. There are (52)(51)/2 = 1326 ways of selecting two cards. These are the equally likely outcomes. Of these ways, there are (4)(3)/2 = 6 ways of selecting two aces, (48)(47)/2 = 1128 ways of selecting no aces, and 1326 - 6 - 1128 = 192 ways of selecting one ace. So, the chance of getting two aces is 6/1326 and not 1/3.
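The counts in the answer to item 3 can be verified with binomial coefficients. The Python sketch below uses `math.comb`; the variable names are hypothetical.

```python
from fractions import Fraction
from math import comb

# Number of equally likely two-card hands from a 52-card deck.
total_hands = comb(52, 2)

# Ways to get 0, 1, or 2 aces: choose the aces from 4, the rest from 48.
ways = {aces: comb(4, aces) * comb(48, 2 - aces) for aces in range(3)}
probs = {aces: Fraction(count, total_hands) for aces, count in ways.items()}

for aces in range(3):
    print(f"{aces} aces: {ways[aces]} ways, probability {float(probs[aces]):.4f}")
```

The three probabilities add up to one but are far from equal, which is the flaw in the 1/3 argument.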


4. You shuffle a deck of playing cards, and then start turning the cards over one at a time. The first one is black. The second one is also a black card. So is the third, and this continues up to the 10th card. You start thinking, “the next one will likely be red!” Are you correct in this reasoning?
Answer: Yes, there are 42 cards left, 26 red and only 16 black. However, likely does not mean certain. There is still a 16/42 chance that the next card is going to be black.

5. The family of Tony delivers newspapers, one to each house in their village.

Newspaper                      Number of houses
Philippine Star                250
Philippine Daily Inquirer      300
Manila Bulletin                150
Manila Times                   140
Manila Standard Today          100
Daily Tribune                  60

What is the probability that a house picked at random has:
a. the Manila Times?
b. the Manila Standard Today or the Philippine Daily Inquirer?
c. a newspaper other than Daily Tribune?
Answer: a. P(Manila Times) = 140/1000 = 7/50; b. P(Manila Standard Today or PDI) = (100 + 300)/1000 = 2/5; c. P(other than Daily Tribune) = 1 – P(Daily Tribune) = 1 – (60/1000) = 940/1000 = 47/50

6. A class is going to play three games. In each game, some cards are put into a bag. Each card has a square or a circle on it. One card will be taken out, then put back. If it is a circle, the boys will get a point. If it is a square, the girls will get a point.
a. Which game are the girls least likely to win? Why?
b. Which game are the boys most likely to win? Why?


c. Which game are the girls certain to win?
d. Which game is impossible for the boys to win?
e. In which game is it equally likely that the boys or the girls win?
f. Are any of the games unfair? Why?
Answer: a. Game 3. For the girls, the chances of winning games 1, 2, and 3 are, respectively, 4/8 = 50%, 8/8 = 100%, and 4/12 = 33.3%. b. Game 3. For the boys, the chances of winning games 1, 2, and 3 are, respectively, 4/8 = 50%, 0/8 = 0%, and 8/12 = 66.7%. c. Game 2; the chance is 100%. d. Game 2; the chance is 0%. e. Game 1. f. Games 2 and 3.

7. In a computer ‘minefield’ game, ‘mines’ are hidden on grids. When you land randomly on a square with a mine, you are out of the game.
a. The circles indicate where the mines are hidden on three different grids. On which of the three grids is it hardest to survive?
b. Grid 1 above is a 3 by 6 grid with 6 mines. On which of the following grids is it hardest to survive? Explain your reasoning.
X. 99 mines on a 30 by 16 grid
Y. 40 mines on a 16 by 16 grid
Z. 10 mines on an 8 by 8 grid

Answer: a. P(hit a mine) in grids 1, 2, and 3, respectively, is 6/18 = 33.3%, 8/25 = 32%, and 7/20 = 35%. Thus, it is hardest to survive in grid 3. b. P(hit a mine) in grids X, Y, and Z, respectively, is 99/(30x16) = 0.20625, 40/(16x16) = 0.15625, and 10/(8x8) = 0.15625. So, it is hardest to survive in grid X.


CHAPTER 2: RANDOM VARIABLES AND PROBABILITY DISTRIBUTIONS
Lesson 2: Geometric Probability
TIME FRAME: 60 minutes

OVERVIEW OF LESSON
In this activity, learners initially review concepts in probability, and discuss examples of theoretical probability, geometric probability, and empirical probability. Then, they are given a coin-tossing exercise to calculate the empirical probability of having a coin fall on a particular square in a grid (to solve Buffon’s coin problem). Learners are led to discover that empirical probabilities, with more tosses, tend toward the geometric/theoretical probability.

LEARNING COMPETENCIES
At the end of the lesson, learners should be able to:
• define and distinguish “geometric probability,” “empirical probability,” and “theoretical probability”
• use simulation to identify an empirical solution to Buffon’s coin problem
• employ area formulas to identify a theoretical solution to Buffon’s coin problem
• observe that as the number of trials increases, the empirical probability tends to approach the theoretical probability.

LESSON OUTLINE
A. Introduction: Recall How to Calculate Probability for Certain Random Processes
B. Main Lesson: Empirical and Theoretical Probability
C. Investigation on Empirical Probability: Buffon’s Coin Problem
D. Investigation on Theoretical Probability: Buffon’s Coin Problem

REFERENCES
Schneiter, K. Exploring Geometric Probabilities with Buffon’s Coin Problem. Utah State University in Statistics Education Web (STEW) Online Journal of K-12 Statistics Lesson Plans. Retrieved from http://www.amstat.org/education/stew/pdfs/EGPBCP.pdf
Albert, J. R. G. (2008). Basic Statistics for the Tertiary Level (ed. Roberto Padua, Welfredo Patungan, Nelia Marquez). Philippines: Rex Bookstore.
Workbooks in Statistics 1: 11th Edition. Institute of Statistics, UP Los Baños, College Laguna 4031
Probability and Statistics: Module 18. (2013). Australian Mathematical Sciences Institute and Education Services Australia. Retrieved from http://www.amsi.org.au/ESA_Senior_Years/PDF/Probability4a.pdf
Geometric Probability Examples. North Carolina School of Science and Mathematics. Retrieved from http://www.dlt.ncssm.edu/stem/sites/default/files/GeometricProbabilityexamples.pdf
Geometric Probability Solutions. North Carolina School of Science and Mathematics. Retrieved from http://www.dlt.ncssm.edu/stem/sites/default/files/GeometricProbabilitysolutions.pdf


MATERIALS REQUIRED
• Coins and square grid (the diameter of the coin should be less than the length of a square on the grid; possibilities are plastic lids on floor tiles, or coins on graph paper). Note that a blank grid for coins is provided on the last page of this lesson.
• Pencil and paper for record keeping and note taking
• Calculator

DEVELOPMENT OF THE LESSON
A. Introduction / Motivation: Recall How to Assign Probabilities to Events

Begin the session with a recall of the notion of the PROBABILITY (or chance) of events, in the context of random processes where the possible outcomes can be determined beforehand, but not which outcome will occur. Mention to students that probabilities of events may be assigned:
(a) theoretically, by assuming an understanding of the situation, such as symmetry or equally likely outcomes (e.g. a fair coin or fair dice being tossed, so that outcomes are equally likely), or by relating the events to areas of geometric objects (this is called “geometric probability”)
(b) subjectively, with a personal assessment of the situation (e.g. when a student tells his friend that he has a 50 percent chance of passing the quiz, or that the probability that a student can swim around the world in 24 hours is zero (0))
(c) empirically, by collecting data from repeated trials or experiences and getting the proportion of times an event occurs (e.g. observing 10 patients, noticing that 6 of them responded to a medicine within one hour of the treatment, and thus stating that the probability of response within an hour of receiving the treatment is 60 percent)

Inform students that a few hundred years ago, people enjoyed betting on coins tossed onto the floor: would they cross the lines of the grid or not? Georges-Louis Leclerc, Comte de Buffon (1707 – 1788), a French mathematician, started thinking about this problem more systematically, expressing it as follows: “What is the probability that a coin, tossed randomly at a grid, will land entirely within a tile rather than beyond the tile boundaries?” (For the purposes of this lesson, we will assume that the diameter of the coin is less than the length of a side of the tile.)

The ‘Buffon coin problem’ is an exercise in geometric probability, where probabilities are viewed as the proportions of areas (lengths or volumes) of geometric objects under specified conditions. Examples of questions that deal with geometric probabilities are:
• What is the probability of hitting the bull’s eye when a dart is thrown randomly at a target, given that the target has a diameter of 24 cm and the bull’s eye has a diameter of 10 cm?
• What is the probability that a four-colored spinner, with a diameter of 20 cm, will land on red?

Geometric probabilities can be estimated using empirical approaches, or identified exactly using analytical methods (theoretical probability). An empirical probability is the proportion of times that an event of interest occurs in a set number of repetitions of an experiment.

Example: Throw 100 darts at the target. 15 darts hit the bull’s eye. The empirical probability of hitting the bull’s eye is 15/100 = 3/20.

Spin the spinner 50 times. Spinner lands on red 12 times. Empirical probability = 12/50 = 6/25.


A theoretical probability is the proportion of times an event of interest is expected to occur in an infinite number of repetitions of an experiment. For a geometric probability, this is the ratio of the area of interest (e.g. bull’s eye) to the total area (e.g. target).

Area of bull’s eye = π(5)² = 25π cm²
Area of target = π(12)² = 144π cm²
Theoretical probability of hitting bull’s eye = 25π / 144π = 25/144 ≈ 0.174

Area of red section = (1/4)(100π) = 25π cm² (assuming the four colored sections are equal in size)
Area of spinner = π(10)² = 100π cm²
Theoretical probability of landing on red = 25π / 100π = 1/4

Ask learners how they can identify an empirical solution to Buffon’s coin problem. (This corresponds to prompt 1 on the task sheet.)

C. Investigation on Empirical Probability: Buffon’s Coin Problem

I. Problem Formulation: Buffon’s coin
What is the probability that a coin, tossed randomly at a grid, will land entirely within a tile rather than beyond the tile boundaries? (Recall that in this activity, we assume that the diameter of the coin is less than the length of a side of the tile.)

II. Design and Implement a Plan to Collect the Data
• Discuss this as a class: How would you identify an empirical solution to Buffon’s coin problem? (See Item 1 on Activity Sheet.)


After learners propose tossing coins at a grid, discuss the details of the experiment. Divide the class into groups of five students. How many times will each group throw the coin? How will the coin be tossed? Will they count the times the coin lands on a boundary or the times it lands entirely within a tile? Will each group do this the same way? What difference will it make if they do not? (For purposes of later discussion, it will be helpful if everyone considers the event that the coin lands entirely within a tile.) Who will record the outcome of each toss? How will this count translate into an empirical probability?
• Experiment: Instruct each group to conduct the experiment, as designed by the class. (See Item 2 of Activity Sheet.)

III. Analyze the Data
Instruct each group to use the data they gathered to compute the empirical probability of the event they considered.

IV. Interpret the Results
• Discuss as a class:
o Summarize the empirical probabilities generated by the groups on the blackboard. Ask learners what they observe about the empirical probabilities computed by the groups. (They are not all the same; many may be similar; a few may differ by a lot; if the experiment were repeated, different answers would be obtained.)
o Is it possible to get a more stable answer? (Yes: repeat the experiment more times, or combine data from different groups.)
o Ask students what they would expect to see if the coin could be tossed an infinite number of times. Why would they expect to see this? (See Item 3 of Activity Sheet.)
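Where physical tossing is impractical, the class experiment can be imitated on a computer. The Python sketch below is a simulation under stated assumptions, not the procedure of the lesson: the tile is taken to have sides of length 1, the center of the coin is assumed equally likely to land anywhere on a tile, and the coin radius of 0.25 is a hypothetical value. The function names are likewise hypothetical.

```python
import random


def coin_lands_inside(radius, rng):
    """Toss once; True if the coin lies entirely within a unit tile.

    The coin stays inside the tile when its center is at least one
    radius away from each of the four edges.
    """
    x = rng.random()
    y = rng.random()
    return radius <= x <= 1 - radius and radius <= y <= 1 - radius


def empirical_probability(tosses, radius, seed=0):
    """Proportion of tosses in which the coin lands entirely within a tile."""
    rng = random.Random(seed)
    hits = sum(coin_lands_inside(radius, rng) for _ in range(tosses))
    return hits / tosses


radius = 0.25  # hypothetical coin whose diameter is half the side of a tile
for tosses in (20, 200, 2000, 20000):
    estimate = empirical_probability(tosses, radius, seed=tosses)
    print(f"{tosses:6d} tosses: empirical probability = {estimate:.3f}")
```

Each row uses a different seed to mimic different groups; running with more tosses should give estimates that vary less from run to run, which is the point of the class discussion.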

D. Investigation on Geometric Probability

I. Problem Formulation: Buffon’s coin
Recall Buffon’s coin problem: What is the probability that the coin, tossed randomly at a grid, will land entirely within a tile rather than beyond the tile boundaries?

II. Solution to the Problem




• Discuss this as a class: How will they identify a theoretical solution to Buffon’s coin problem? (See Item 4 of Activity Sheet.) Outline the process here: Identify the shape of the region within the tile where the coin must land to be entirely within the tile. Look at the ratio of the area of that shape to the area of a tile. Learners must work out the details with their group mates in the next segment.

Answer: The Probability of a Crack Crossing
Our main interest is in the event C that the coin crosses a boundary between tiles. However, it turns out to be easier to describe the complementary event $C^c$ that the coin does not cross a boundary. If the tile has sides of unit length and the coin has radius r < ½, then

$P(C^c) = (1 - 2r)^2$

Thus,

$P(C) = 1 - (1 - 2r)^2$

• Explore: Again working in groups, ask students:
o To formulate a conjecture about the relationship between theoretical and empirical probabilities. (See Item 5 of Activity Sheet.)
o To identify the shape of the region within the tile in which the coin must land to be entirely within a tile. (This will be challenging for some learners. The key is to consider where the center of the coin lands and how close the center can be to the edge of a tile while the coin is not on the boundary.)
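For the teacher's reference, the area argument behind the formula can be written out as below. This is a sketch that assumes the center (x, y) of the coin is uniformly distributed over a tile with sides of length 1; the coordinates x and y are notation introduced here for illustration.

```latex
% The coin of radius r < 1/2 stays inside the tile exactly when its center
% is at least r away from every edge, i.e. inside a smaller square of side 1 - 2r.
\[
C^c = \{\, (x, y) : r \le x \le 1 - r,\; r \le y \le 1 - r \,\}
\]
\[
P(C^c) = \frac{\text{area of inner square}}{\text{area of tile}}
       = \frac{(1 - 2r)^2}{1^2} = (1 - 2r)^2,
\qquad
P(C) = 1 - (1 - 2r)^2 = 4r(1 - r).
\]
% Example: r = 1/4 gives P(C^c) = 1/4 and P(C) = 3/4.
```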

III. Analyze the Data
Instruct each group to use observations from the experiment and the group discussion to compute the theoretical probability that a coin tossed randomly at a grid lands entirely within a single tile. (See Item 6 of Activity Sheet.)


IV. Interpret the Results
• Discuss as a class: Summarize learners’ answers on the board. Discuss observations about these solutions. Bring the class to a consensus on the solution. Observe that there is only one solution and it will not vary with further investigation.
• Synthesis: What seems to be the relationship between empirical and theoretical probabilities? (See Item 7 of Activity Sheet.)

KEY POINTS
• Empirical probabilities (obtained from observing the proportion of times an event occurs in repeated trials) may differ from one set of trials to another, but in the long run they stabilize toward the theoretical probability. (As the number of trials increases, the empirical probability tends to converge to the theoretical one.)
• For some situations, we can calculate the theoretical probabilities as geometric probabilities, when events pertain to areas of geometric objects.
• Sometimes, we assign probabilities subjectively, according to a personal assessment of the likelihood that an event will occur.

ACTIVITY SHEET 2-02

Definitions:
1. Empirical probability: the proportion of times an event of interest occurs in a set number of repetitions of an experiment.
2. Theoretical probability: the proportion of times an event of interest would be expected to occur in an infinite number of repetitions of an experiment.
3. Geometric probability: a probability concerned with proportions of areas (lengths or volumes) of geometric objects under specified conditions.
4. Subjective probability: a probability derived from an individual's personal assessment of whether a specific outcome is likely to occur.


Investigation: Consider the question: What is the probability that a coin, tossed randomly at a grid, will land entirely within a tile rather than beyond the tile boundaries?
1. How would you be able to determine an empirical probability that “a coin, thrown randomly at a grid, will land entirely within a tile of a grid rather than beyond the tile boundaries?”
2. Work with your group to compute the empirical probability that the coin will land within the tile of a grid. Record your observations below:
3. What would you expect to observe if the coin were to be tossed an infinite number of times at the grid? Why would you expect to see this?
4. How would you compute the theoretical probability that “a coin, thrown randomly at a grid, will land entirely within a tile rather than beyond the tile boundaries?” How is this question different from question 1?
5. What is the relationship between empirical and theoretical probabilities?
6. Work with your group to compute the theoretical probability that the coin will land within the tile. Record your work below.
7. Compare the empirical and theoretical probabilities you found. How do your results relate to the conjecture you proposed in item 5?
EXAMPLE OF A GRID
[Figure: example of a grid]

ASSESSMENT 2-02
1. If a circle with a diameter of 20 cm is placed inside a square with a side length of 20 cm, what is the chance that a dart thrown will land inside the circle?
ANSWER: Area of Circle / Area of Square = π(10)² / (20)² = 314 / 400 = 0.785

2. Suppose two numbers, x and y, are generated at random, where 0 < x < 5 and 0 1.43); (c) P( –1.43 ≤ Z ≤ 1.43).

Answer: To solve (a), we merely read off directly the entry from the Table of Cumulative Distribution Function values of a Standard Normal Curve. Reproducing the needed part of this Table:

  Z     0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
 1.4   0.9192  0.9207  0.9222  0.9236  0.9251  0.9265  0.9279  0.9292  0.9306  0.9319

We find that the area under the curve is F(1.43) = 0.9236.


To solve (b), note that the sum of the area to the right and the area to the left is the total area under the curve (100%), so that the area to the right is 100% minus the area to the left. Thus, since the area to the left of z = 1.43 is 0.9236, the area to the right of z = 1.43 is 1 - 0.9236 = 0.0764.

[Figure: standard normal curves (µZ = 0, σZ = 1) showing that the areas to the left and to the right of z = 1.43 add up to 100%.]

To obtain (c), note that by symmetry, since the area to the right of z = 1.43 is 0.0764, the area to the left of z = –1.43 is also 0.0764. Hence, the area between –1.43 and +1.43 is 1 – 2(0.0764) = 1 – 0.1528 = 0.8472. Alternatively, we obtain this probability as F(1.43) – F(–1.43) = 0.9236 – 0.0764 = 0.8472.
8. The Interquartile Range (IQR) is the difference between the Third Quartile (i.e., the 75th percentile) and the First Quartile (the 25th percentile). Calculate the Interquartile Range of a standard normal distribution.
Answer: We can obtain the IQR of a standard normal curve from the upper and lower quartiles of the distribution. From the table of selected percentiles, the upper and lower quartiles are 0.675 and –0.675, respectively. Thus the IQR, the difference between the upper and lower quartiles, is 0.675 – (–0.675) = 1.35.
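The table look-ups in items (a) to (c) and the IQR in item 8 can be cross-checked in software. The sketch below uses `NormalDist` from Python's standard `statistics` module, which is not part of this guide; the variable names are illustrative only. Because the software does not round intermediate values to four decimals, results may differ from the table-based answers in the last digit.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

area_left = Z.cdf(1.43)                   # (a) P(Z <= 1.43)
area_right = 1 - area_left                # (b) P(Z > 1.43)
area_between = Z.cdf(1.43) - Z.cdf(-1.43) # (c) P(-1.43 <= Z <= 1.43)

# Item 8: quartiles and interquartile range of the standard normal curve
q1 = Z.inv_cdf(0.25)
q3 = Z.inv_cdf(0.75)
iqr = q3 - q1

print(round(area_left, 4), round(area_right, 4), round(area_between, 4))
print(round(q1, 3), round(q3, 3), round(iqr, 2))
```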


HANDOUT 2-07-1

Cumulative Distribution Function (CDF) of the Standard Normal Curve

  z     0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
-3.8   0.0001  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001
-3.7   0.0001  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001
-3.6   0.0002  0.0002  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001
-3.5   0.0002  0.0002  0.0002  0.0002  0.0002  0.0002  0.0002  0.0002  0.0002  0.0002
-3.4   0.0003  0.0003  0.0003  0.0003  0.0003  0.0003  0.0003  0.0003  0.0003  0.0002
-3.3   0.0005  0.0005  0.0005  0.0004  0.0004  0.0004  0.0004  0.0004  0.0004  0.0003
-3.2   0.0007  0.0007  0.0006  0.0006  0.0006  0.0006  0.0006  0.0005  0.0005  0.0005
-3.1   0.0010  0.0009  0.0009  0.0009  0.0008  0.0008  0.0008  0.0008  0.0007  0.0007
-3.0   0.0013  0.0013  0.0013  0.0012  0.0012  0.0011  0.0011  0.0011  0.0010  0.0010
-2.9   0.0019  0.0018  0.0018  0.0017  0.0016  0.0016  0.0015  0.0015  0.0014  0.0014
-2.8   0.0026  0.0025  0.0024  0.0023  0.0023  0.0022  0.0021  0.0021  0.0020  0.0019
-2.7   0.0035  0.0034  0.0033  0.0032  0.0031  0.0030  0.0029  0.0028  0.0027  0.0026
-2.6   0.0047  0.0045  0.0044  0.0043  0.0041  0.0040  0.0039  0.0038  0.0037  0.0036
-2.5   0.0062  0.0060  0.0059  0.0057  0.0055  0.0054  0.0052  0.0051  0.0049  0.0048
-2.4   0.0082  0.0080  0.0078  0.0075  0.0073  0.0071  0.0069  0.0068  0.0066  0.0064
-2.3   0.0107  0.0104  0.0102  0.0099  0.0096  0.0094  0.0091  0.0089  0.0087  0.0084
-2.2   0.0139  0.0136  0.0132  0.0129  0.0125  0.0122  0.0119  0.0116  0.0113  0.0110
-2.1   0.0179  0.0174  0.0170  0.0166  0.0162  0.0158  0.0154  0.0150  0.0146  0.0143
-2.0   0.0228  0.0222  0.0217  0.0212  0.0207  0.0202  0.0197  0.0192  0.0188  0.0183
-1.9   0.0287  0.0281  0.0274  0.0268  0.0262  0.0256  0.0250  0.0244  0.0239  0.0233
-1.8   0.0359  0.0351  0.0344  0.0336  0.0329  0.0322  0.0314  0.0307  0.0301  0.0294
-1.7   0.0446  0.0436  0.0427  0.0418  0.0409  0.0401  0.0392  0.0384  0.0375  0.0367
-1.6   0.0548  0.0537  0.0526  0.0516  0.0505  0.0495  0.0485  0.0475  0.0465  0.0455
-1.5   0.0668  0.0655  0.0643  0.0630  0.0618  0.0606  0.0594  0.0582  0.0571  0.0559
-1.4   0.0808  0.0793  0.0778  0.0764  0.0749  0.0735  0.0721  0.0708  0.0694  0.0681
-1.3   0.0968  0.0951  0.0934  0.0918  0.0901  0.0885  0.0869  0.0853  0.0838  0.0823
-1.2   0.1151  0.1131  0.1112  0.1093  0.1075  0.1056  0.1038  0.1020  0.1003  0.0985
-1.1   0.1357  0.1335  0.1314  0.1292  0.1271  0.1251  0.1230  0.1210  0.1190  0.1170
-1.0   0.1587  0.1562  0.1539  0.1515  0.1492  0.1469  0.1446  0.1423  0.1401  0.1379
-0.9   0.1841  0.1814  0.1788  0.1762  0.1736  0.1711  0.1685  0.1660  0.1635  0.1611
-0.8   0.2119  0.2090  0.2061  0.2033  0.2005  0.1977  0.1949  0.1922  0.1894  0.1867
-0.7   0.2420  0.2389  0.2358  0.2327  0.2296  0.2266  0.2236  0.2206  0.2177  0.2148
-0.6   0.2743  0.2709  0.2676  0.2643  0.2611  0.2578  0.2546  0.2514  0.2483  0.2451
-0.5   0.3085  0.3050  0.3015  0.2981  0.2946  0.2912  0.2877  0.2843  0.2810  0.2776
-0.4   0.3446  0.3409  0.3372  0.3336  0.3300  0.3264  0.3228  0.3192  0.3156  0.3121
-0.3   0.3821  0.3783  0.3745  0.3707  0.3669  0.3632  0.3594  0.3557  0.3520  0.3483
-0.2   0.4207  0.4168  0.4129  0.4090  0.4052  0.4013  0.3974  0.3936  0.3897  0.3859
-0.1   0.4602  0.4562  0.4522  0.4483  0.4443  0.4404  0.4364  0.4325  0.4286  0.4247
-0.0   0.5000  0.4960  0.4920  0.4880  0.4840  0.4801  0.4761  0.4721  0.4681  0.4641

HANDOUT 2-07-1

CDF of the Standard Normal Curve (cont’d)

  z     0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
 0.0   0.5000  0.5040  0.5080  0.5120  0.5160  0.5199  0.5239  0.5279  0.5319  0.5359
 0.1   0.5398  0.5438  0.5478  0.5517  0.5557  0.5596  0.5636  0.5675  0.5714  0.5753
 0.2   0.5793  0.5832  0.5871  0.5910  0.5948  0.5987  0.6026  0.6064  0.6103  0.6141
 0.3   0.6179  0.6217  0.6255  0.6293  0.6331  0.6368  0.6406  0.6443  0.6480  0.6517
 0.4   0.6554  0.6591  0.6628  0.6664  0.6700  0.6736  0.6772  0.6808  0.6844  0.6879
 0.5   0.6915  0.6950  0.6985  0.7019  0.7054  0.7088  0.7123  0.7157  0.7190  0.7224
 0.6   0.7257  0.7291  0.7324  0.7357  0.7389  0.7422  0.7454  0.7486  0.7517  0.7549
 0.7   0.7580  0.7611  0.7642  0.7673  0.7704  0.7734  0.7764  0.7794  0.7823  0.7852
 0.8   0.7881  0.7910  0.7939  0.7967  0.7995  0.8023  0.8051  0.8078  0.8106  0.8133
 0.9   0.8159  0.8186  0.8212  0.8238  0.8264  0.8289  0.8315  0.8340  0.8365  0.8389
 1.0   0.8413  0.8438  0.8461  0.8485  0.8508  0.8531  0.8554  0.8577  0.8599  0.8621
 1.1   0.8643  0.8665  0.8686  0.8708  0.8729  0.8749  0.8770  0.8790  0.8810  0.8830
 1.2   0.8849  0.8869  0.8888  0.8907  0.8925  0.8944  0.8962  0.8980  0.8997  0.9015
 1.3   0.9032  0.9049  0.9066  0.9082  0.9099  0.9115  0.9131  0.9147  0.9162  0.9177
 1.4   0.9192  0.9207  0.9222  0.9236  0.9251  0.9265  0.9279  0.9292  0.9306  0.9319
 1.5   0.9332  0.9345  0.9357  0.9370  0.9382  0.9394  0.9406  0.9418  0.9429  0.9441
 1.6   0.9452  0.9463  0.9474  0.9484  0.9495  0.9505  0.9515  0.9525  0.9535  0.9545
 1.7   0.9554  0.9564  0.9573  0.9582  0.9591  0.9599  0.9608  0.9616  0.9625  0.9633
 1.8   0.9641  0.9649  0.9656  0.9664  0.9671  0.9678  0.9686  0.9693  0.9699  0.9706
 1.9   0.9713  0.9719  0.9726  0.9732  0.9738  0.9744  0.9750  0.9756  0.9761  0.9767
 2.0   0.9772  0.9778  0.9783  0.9788  0.9793  0.9798  0.9803  0.9808  0.9812  0.9817
 2.1   0.9821  0.9826  0.9830  0.9834  0.9838  0.9842  0.9846  0.9850  0.9854  0.9857
 2.2   0.9861  0.9864  0.9868  0.9871  0.9875  0.9878  0.9881  0.9884  0.9887  0.9890
 2.3   0.9893  0.9896  0.9898  0.9901  0.9904  0.9906  0.9909  0.9911  0.9913  0.9916
 2.4   0.9918  0.9920  0.9922  0.9925  0.9927  0.9929  0.9931  0.9932  0.9934  0.9936
 2.5   0.9938  0.9940  0.9941  0.9943  0.9945  0.9946  0.9948  0.9949  0.9951  0.9952
 2.6   0.9953  0.9955  0.9956  0.9957  0.9959  0.9960  0.9961  0.9962  0.9963  0.9964
 2.7   0.9965  0.9966  0.9967  0.9968  0.9969  0.9970  0.9971  0.9972  0.9973  0.9974
 2.8   0.9974  0.9975  0.9976  0.9977  0.9977  0.9978  0.9979  0.9979  0.9980  0.9981
 2.9   0.9981  0.9982  0.9982  0.9983  0.9984  0.9984  0.9985  0.9985  0.9986  0.9986
 3.0   0.9987  0.9987  0.9987  0.9988  0.9988  0.9989  0.9989  0.9989  0.9990  0.9990
 3.1   0.9990  0.9991  0.9991  0.9991  0.9992  0.9992  0.9992  0.9992  0.9993  0.9993
 3.2   0.9993  0.9993  0.9994  0.9994  0.9994  0.9994  0.9994  0.9995  0.9995  0.9995
 3.3   0.9995  0.9995  0.9995  0.9996  0.9996  0.9996  0.9996  0.9996  0.9996  0.9997
 3.4   0.9997  0.9997  0.9997  0.9997  0.9997  0.9997  0.9997  0.9997  0.9997  0.9998
 3.5   0.9998  0.9998  0.9998  0.9998  0.9998  0.9998  0.9998  0.9998  0.9998  0.9998
 3.6   0.9998  0.9998  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
 3.7   0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
 3.8   0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999

HANDOUT 2-07-2

Selected Percentiles of the Standard Normal Distribution

    z      F(z)
 -2.326   0.01
 -1.96    0.025
 -1.645   0.05
 -1.282   0.10
 -0.842   0.20
 -0.675   0.25
  0.00    0.50
  0.675   0.75
  0.842   0.80
  1.282   0.90
  1.645   0.95
  1.96    0.975
  2.326   0.99

CHAPTER 2: RANDOM VARIABLES AND PROBABILITY DISTRIBUTIONS Lesson 10: Areas under a Normal Distribution TIME FRAME: 120 minutes

LEARNING COMPETENCIES
At the end of the lesson, learners should be able to:
• convert a normal random variable to a standard normal variable and vice versa
• compute probabilities using a table of cumulative areas under a standard normal curve
• compute percentiles of a normal curve

PRE-REQUISITE LESSONS: Random Variables, Probability Distribution of Continuous Random Variables, Properties of Normal Distributions
LESSON OUTLINE
A. Introduction: Review of Normal Distribution
B. Main Lesson: Areas Under a Normal Curve
C. Enrichment: Computing with Excel
REFERENCES
Albert, J. R. G. (2008). Basic Statistics for the Tertiary Level (ed. Roberto Padua, Welfredo Patungan, Nelia Marquez). Philippines: Rex Bookstore.
De Veaux, R. D., Velleman, P. F., and Bock, D. E. (2006). Intro Stats. Pearson Ed. Inc.
Workbooks in Statistics 1: 11th Edition. Institute of Statistics, UP Los Baños, College, Laguna 4031.
Probability and statistics: Module 22: Exponential and normal distributions (2013). Australian Mathematical Sciences Institute and Education Services Australia. Retrieved from http://www.amsi.org.au/ESA_Senior_Years/PDF/ExpoNormDist4f.pdf


DEVELOPMENT OF THE LESSON
A. Introduction: Review of Normal Distribution
Ask learners to recall some of the lessons learned about normal distributions. They should be able to state that
• A normal distribution has a symmetric bell-shaped curve (for its probability density function) with one peak. This curve is characterized by its mean μ (the center of symmetry, and also the peak) and standard deviation σ (the distance from the center to the change-of-curvature points on either side).
• If a random variable X has a normal distribution with mean μ and variance σ², we denote this as X~N(μ,σ²).
• A normal curve is symmetric about its mean (thus the mean is the median). It is more concentrated in the middle and its peak is at the mean (so that the mean is also the mode).
• Like any continuous distribution, the total area under the normal curve is equal to 1, and the probability that a normal random variable X equals any particular value a, P(X = a), is zero (0).
• The normal curve follows the empirical rule (also called the 68-95-99.7 rule):
o About 68% of the area under the curve falls within 1 standard deviation of the mean.
o About 95% of the area under the curve falls within 2 standard deviations of the mean.
o Nearly the entire distribution, about 99.7% of the area under the curve, falls within 3 standard deviations of the mean.
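The empirical rule in the last bullet can be checked numerically. The sketch below relies on `NormalDist` from Python's standard library; the helper name `area_within` is hypothetical and chosen only for this illustration. It also shows that the areas do not depend on the particular mean and standard deviation.

```python
from statistics import NormalDist


def area_within(k: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Area under the N(mu, sigma^2) curve within k standard deviations of the mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


for k in (1, 2, 3):
    # The same areas result for the standard normal and for an arbitrary normal curve.
    print(k, round(area_within(k), 4), round(area_within(k, mu=1620, sigma=50), 4))
```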

B. Main Lesson: Probabilities/Areas Under a Normal Curve Tell learners that given a normally distributed random variable: X~N(μ,σ2), we often wish to find various probabilities pertaining to where an arbitrary measurement may lie. For instance, we may want to find P(a ≤ X ≤ b), which is the probability that a random measurement X lies between a and b.


We may also wish to find the proportion of measurements less than a value k (or at most k), denoted by P(X < k) (or P(X ≤ k)). Remind learners that it does not matter whether we consider P(X < k) or P(X ≤ k), since P(X = k) = 0.

Finally, we may want the proportion greater than k (or at least k), denoted by P(X > k) (or P(X ≥ k) ).

In the last session, learners were given a lesson on the standard normal distribution. We again make use of areas under a standard normal curve, but first we need to convert a normal distribution into standardized form.
Standard Scores (or Z-scores)
Whatever the value of the mean and standard deviation of a normal curve, we can transform the whole normal curve into a standard normal curve (as illustrated in the following figure).
[Figure: a normal curve transformed into a standard normal curve]

This entails transforming all the data in a normal curve into standard units. An observation is in standard units (or is a z-score) if it is expressed as the number of standard deviations it lies above or below the mean. That is, if x, μ, and σ respectively represent the observation, its mean, and its standard deviation, then the standardized form (or z-score) of x is

z = (x - μ) / σ

Reiterate to learners that a z-score indicates how many standard deviations a certain data element is from the mean. For instance, if examination scores in Statistics and Probability have an average of 75 and a standard deviation of 5, then an exam score of 90 has a z-score of (90-75)/5 = 3, while a score of 70 has a z-score of (70-75)/5 = -1. To interpret these z-scores, we note that 90 is 3 standard deviations above the mean (75), while 70 is one standard deviation “below” the mean.
Z-scores are a very good way of making variables comparable. Suppose a student got an examination score of 90 in Statistics and Probability (where the mean was 75 and the standard deviation was 5) and a 92 in an English examination (where the mean was 95 and the standard deviation was 3). While it might seem that the Statistics and Probability score is “lower” (in absolute numbers) than the English score, the z-score in English is (92-95)/3 = -1, compared with 3 in Statistics and Probability, so the “relative” performance in English (in relation to the average) is actually lower than the relative performance in Statistics and Probability.
Z-scores may also be used to transform normal random variables into standard normal random variables, and this, in turn, can help us relate probabilities for any normal distribution to areas under a standard normal curve, as the following example on the heights of students illustrates.
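The z-score computations in the examination example can be reproduced with a few lines of code. This is a minimal sketch; the function name `z_score` is an assumption introduced only for this illustration.

```python
def z_score(x: float, mean: float, sd: float) -> float:
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    if sd <= 0:
        raise ValueError("the standard deviation must be positive")
    return (x - mean) / sd


stat_90 = z_score(90, mean=75, sd=5)     # Statistics and Probability score of 90
stat_70 = z_score(70, mean=75, sd=5)     # Statistics and Probability score of 70
english_92 = z_score(92, mean=95, sd=3)  # English score of 92

print(stat_90, stat_70, english_92)
# A larger z-score means a better performance relative to the class average.
print("Relatively better in Statistics:", stat_90 > english_92)
```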


Illustration for Finding Areas Under a Normal Curve
Ask learners to assume that the distribution of heights of all female Grade 11 students can be modeled well by a normal curve with a mean of 1620 mm and a standard deviation of 50 mm. Further, we wish to determine
(a) the proportion of female Grade 11 students shorter than 1550 mm;
(b) the proportion of female Grade 11 students taller than 1650 mm;
(c) the proportion of female Grade 11 students between 1600 and 1675 mm;
(d) the height of a female Grade 11 student for which 10 percent of female Grade 11 students are shorter than it;
(e) the height of a female Grade 11 student for which 75% of female Grade 11 students are taller than it.
For computing the answer to (a), tell learners to first transform 1550 to its z-score, yielding (1550-1620)/50 = -1.4, so that we can associate the area to the left of 1550 (under a normal curve with mean 1620 and standard deviation 50) with the area to the left of z = -1.4 under a standard normal curve. Reading from the table of the Cumulative Distribution Function of a Standard Normal Curve, we find Φ(-1.4) = 0.0808.
For (b), ask learners what they should do. They should say they need to first transform the height value 1650 to standard units, (1650-1620)/50 = 0.6, and then note that the area to the right of z = 0.6 under the standard normal curve is the difference between the total area under a standard normal curve (100%) and the area to the left of z = 0.6, Φ(0.6) = 0.7257. In consequence, the desired probability (and area) is 1 - 0.7257 = 0.2743.
Likewise, for (c), learners should mention they need to first transform 1600 and 1675 into their respective standardized forms, namely (1600-1620)/50 = -0.4 and (1675-1620)/50 = 1.1, and then obtain the area between these two z-scores as the difference between Φ(1.1) and Φ(-0.4), i.e., 0.8643 - 0.3446 = 0.5197.
For (d), draw the figures on the board to illustrate what needs to be done:


Show that the 10th percentile of the height distribution may be obtained by first getting the 10th percentile of the standard normal curve, which can be read off from the table as –1.282. This means that the 10th percentile of the height distribution is 1.282 standard deviations below the mean. The required height is –1.282(50) + 1620 = 1555.9 mm.
Finally, for (e), suggest to learners that we want the 25th percentile, as this is the value for which 75 percent of the height distribution lies above it. As in (d), tell students they can first find the 25th percentile of a standard normal curve (–0.675), then obtain the required height as –0.675(50) + 1620 = 1586.25 mm.
C. Enrichment: Computing with Excel
In the last lesson, the NORMSDIST function was illustrated in the Enrichment part of the lesson. There are other important Excel functions for the normal distribution, especially the NORMDIST and NORMINV functions. The NORMDIST(x, mu, sigma, cumulative) function gives cumulative probabilities for general normal curves. The parameters x, mu and sigma are numeric values, while the parameter cumulative is a logical TRUE or FALSE value. Note that sigma must be greater than 0 (as it is a non-trivial standard deviation), but there are no similar requirements for x or mu. To illustrate, recall the female students’ height example, where we were interested first in obtaining P(X ≤ 1550 given μ = 1620 and σ = 50), where X is the height of a randomly selected female Grade 11 student. Students can simply use the NORMDIST function, which asks for the score (1550), the mean (μ = 1620) and the standard deviation (σ = 50) of the normal distribution:

= NORMDIST(1550,1620,50,TRUE)

Note that the final argument TRUE tells Excel that we wish to obtain the area to the left (rather than the height of the normal curve). Also, to compute P(X ≥ 1650 given μ = 1620 and σ = 50), learners can specify in Microsoft Excel the command

= 1-NORMDIST(1650,1620,50,TRUE)

For P(1600 ≤ X ≤ 1675 given μ = 1620 and σ = 50), learners can enter

= NORMDIST(1675,1620,50,1) - NORMDIST(1600,1620,50,1)

The NORMINV(p, mu, sigma) function of Excel returns the value x such that, with probability p, a normal random variable with mean mu and standard deviation sigma takes on a value less than or equal to x. That is, the value returned is the (100 times p)th percentile of the normal curve with mean mu and standard deviation sigma. For instance, to obtain the 10th percentile of the distribution for the heights of female Grade 11 students, simply enter in Excel

= NORMINV(0.1,1620,50)

The 25th percentile (the value for which 75 percent are above it) can be obtained with:

= NORMINV(0.25,1620,50)
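For classes without access to Excel, the same five answers for the heights example can be sketched in Python. This is offered only as an alternative illustration, not as part of the original lesson: `NormalDist.cdf` plays the role of NORMDIST with the cumulative argument set to TRUE, and `NormalDist.inv_cdf` plays the role of NORMINV. Small differences from the table-based answers are expected because the software does not round z-scores or areas.

```python
from statistics import NormalDist

# Heights of female Grade 11 students, as assumed in the illustration (in mm)
heights = NormalDist(mu=1620, sigma=50)

p_shorter_1550 = heights.cdf(1550)                 # (a) P(X < 1550)
p_taller_1650 = 1 - heights.cdf(1650)              # (b) P(X > 1650)
p_between = heights.cdf(1675) - heights.cdf(1600)  # (c) P(1600 < X < 1675)
pct_10 = heights.inv_cdf(0.10)                     # (d) 10th percentile
pct_25 = heights.inv_cdf(0.25)                     # (e) 25th percentile

print(round(p_shorter_1550, 4), round(p_taller_1650, 4), round(p_between, 4))
print(round(pct_10, 1), round(pct_25, 1))
```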

KEY POINTS
• To obtain probabilities or percentiles under a normal curve, perform two steps:
o Transform the normal curve into a standard normal curve by way of “z-scores” (which involves subtracting the mean and dividing the result by the standard deviation): z = (X - μ) / σ.
o Then, use the tables of the Cumulative Distribution Function of a Standard Normal Distribution to obtain the required areas of a standard normal curve to find the probabilities associated with the z-scores.


ASSESSMENT
1. If a particular batch of data is approximately normally distributed, we would find that approximately
a) 2 of every 3 observations would fall between ± 1 standard deviation around the mean.
b) 4 of every 5 observations would fall between ± 1.28 standard deviations around the mean.
c) 19 of every 20 observations would fall between ± 2 standard deviations around the mean.
d) All the above.
Answer: d
For problems 2 to 4, consider the following case. The length of time it takes a Grade 11 student to play the Candy Crush computer app follows a normal distribution with a mean of 3.5 minutes and a standard deviation of 1 minute.
2. The probability that a randomly selected Grade 11 student will play one game of Candy Crush in less than 3 minutes is
a) 0.3551 b) 0.3085 c) 0.2674 d) 0.1915
Answer: b
3. The probability that a randomly selected Grade 11 student will take between 2 and 4.5 minutes to play Candy Crush is:
a) 0.0919 b) 0.2255 c) 0.4938 d) 0.7745
Answer: d
4. The point in the distribution of times to play Candy Crush that 75.8% of the Grade 11 students exceed is:
a) 2.8 minutes b) 3.2 minutes c) 3.4 minutes d) 4.2 minutes
Answer: a

5. Rodrigo earned a score of 940 on a national achievement test. The mean test score was 850 with a standard deviation of 100. What proportion of students had a higher score than Rodrigo? (Assume that test scores are normally distributed.) If there were 100,000 students who took the test, how many would be expected to have a higher score than Rodrigo?
Answer:
Assuming that test scores are normally distributed, the solution involves three steps.
First, we transform Rodrigo's test score into a z-score, using the z-score transformation equation: z = (X - μ) / σ = (940 - 850) / 100 = 0.90
Then, using a standard normal distribution table, we find the cumulative probability associated with the z-score. In this case, we find P(Z < 0.90) = 0.8159.
Finally, P(Z > 0.90) = 1 - P(Z < 0.90) = 1 - 0.8159 = 0.1841. Thus, we estimate that 18.41 percent of the students tested had a higher score than Rodrigo. If there were 100,000 students who took the exam, then we expect 100,000 x (0.1841) = 18,410 students to have scores higher than Rodrigo’s.
6. Every night when you get home from school, you take your dog Bantay for a walk. The length of the walk is normally distributed with a mean of μ = 15 minutes and a standard deviation of σ = 3 minutes. (a) What proportion of walks last less than 15 minutes? (b) What proportion of walks last longer than 20 minutes? (c) What proportion of walks last between 10 and 16 minutes?
Answer:

(a) P(X < 15 given μ = 15 and σ = 3) = P(Z < (15-15)/3) = P(Z < 0) = 0.5
(b) P(X > 20 given μ = 15 and σ = 3) = P(Z > (20-15)/3) = P(Z > 1.67) = 1 - P(Z ≤ 1.67) = 1 - 0.9525 = 0.0475
(c) P(10 ≤ X ≤ 16 given μ = 15 and σ = 3) = P((10-15)/3 ≤ Z ≤ (16-15)/3) = P(-1.67 ≤ Z ≤ 0.33) = Φ(0.33) - Φ(-1.67) = 0.6293 - 0.0475 = 0.5818
7. Suppose scores on an IQ test are normally distributed. If the test has a mean of 100 and a standard deviation of 10, what is the probability that a person who takes the test will score between 90 and 110?
Answer: Here, we want to know the probability that the test score falls between 90 and 110. The "trick" to solving this problem is to realize the following: P(90 < X < 110) = P(X < 110) - P(X < 90) = P(Z < (110-100)/10) - P(Z < (90-100)/10) = 0.84 - 0.16 = 0.68. Thus, about 68% of the test scores will fall between 90 and 110. Alternatively, notice that we are getting the scores within one standard deviation of the mean, so the empirical rule also suggests a chance of about 68%.
8. The following letter appeared in the popular “Dear Abby” newspaper advice column in the 1970s:
Dear Abby: You wrote in your column that a woman is pregnant for 266 days. Who said so? I carried my baby for ten months and five days, and there is no doubt about it because I know the exact date my baby was conceived. My husband is in the Navy and it couldn’t have possibly been conceived any other time because I saw him only once for an hour, and I didn’t see him again until the day before the baby was born. I don’t drink or run around, and there is no way this baby isn’t his, so please print a retraction about the 266-day carrying time because otherwise I am in a lot of trouble.
San Diego Reader

The advice column was founded in 1956 by Pauline Phillips under the pen name "Abigail Van Buren" and is carried on today by her daughter, Jeanne Phillips, who now owns the legal rights to the pseudonym. Suppose that, according to pediatricians, pregnancy durations (call them X) tend to be normally distributed with μ = 266 days and σ = 16 days. Perform a probability calculation that addresses San Diego Reader’s credibility, presuming she was pregnant for 308 days. What would you conclude and why?
Answer:

The chance of being pregnant for 308 or more days is P(X > 308 days, given μ = 266 days and σ = 16 days) = P(Z > (308-266)/16) = P(Z > 2.625) = 1 - 0.9957 = 0.0043, which would happen in about 43 out of 10,000 pregnancies. A pregnancy this long is therefore very unusual, though not impossible.
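Teachers who wish to double-check the answers to items 5 to 8 may find a small helper useful. The sketch below is only one possible way to do this; the function `normal_probability` and its arguments are hypothetical names introduced for illustration. Since the software uses unrounded z-scores (for example 1.6667 rather than 1.67), the results may differ from the table-based answers in the third or fourth decimal place.

```python
from statistics import NormalDist


def normal_probability(mu, sigma, low=None, high=None):
    """P(low < X < high) for X ~ N(mu, sigma^2); None means no bound on that side."""
    dist = NormalDist(mu, sigma)
    upper = 1.0 if high is None else dist.cdf(high)
    lower = 0.0 if low is None else dist.cdf(low)
    return upper - lower


rodrigo_higher = normal_probability(850, 100, low=940)     # item 5
walk_over_20 = normal_probability(15, 3, low=20)           # item 6(b)
iq_90_to_110 = normal_probability(100, 10, low=90, high=110)  # item 7
pregnancy_308 = normal_probability(266, 16, low=308)       # item 8

print(round(rodrigo_higher, 4), round(walk_over_20, 4))
print(round(iq_90_to_110, 4), round(pregnancy_308, 4))
# Expected number of students scoring higher than Rodrigo out of 100,000
print(round(100_000 * rodrigo_higher))
```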


CHAPTER 3: SAMPLING
Lesson 1: Coin Tossing Revisited from a Statistical Perspective
TIME FRAME: 60 minutes
OVERVIEW OF LESSON
In this activity, learners revisit the coin tossing activity, but this time they look into how the probability of getting a head is an unknown constant that needs to be estimated. Other illustrations of sampling and estimation are also discussed.
LEARNING COMPETENCIES:
At the end of the lesson, the learner should be able to:
• describe random sampling
• distinguish between (population) parameter and (sample) statistic
• describe sampling distributions of statistics (sample mean)
• discuss the Central Limit Theorem

LESSON OUTLINE
A. Introduction / Motivation: A Coin Need Not Be Fair
B. Main Lesson: Estimation of Probability of Getting a Head in a Single Toss of a Coin
C. Data Collection
D. Data Analysis and Interpretation
E. Enrichment
REFERENCES
Richardson, M. Using Dice to Introduce Sampling Distributions. STatistics Education Web (STEW). Retrieved from http://www.amstat.org/education/stew/pdfs/UsingDicetoIntroduceSamplingDistributions.doc
De Veaux, R. D., Velleman, P. F., and Bock, D. E. (2006). Intro Stats. Pearson Ed. Inc.
Workbooks in Statistics 1: 11th Edition. Institute of Statistics, UP Los Baños, College, Laguna 4031.
Probability and statistics: Module 24. (2013). Australian Mathematical Sciences Institute and Education Services Australia. Retrieved from http://www.amsi.org.au/ESA_Senior_Years/PDF/InferenceProp4g.pdf
KEY CONCEPTS: Sampling, Estimation, Sampling Variation, Standard Error, Central Limit Theorem


MATERIALS NEEDED: 1-peso coin per student
DEVELOPMENT OF THE LESSON
A. Introduction / Motivation: A Coin Need Not Be Fair
Learners may have heard of a “sample” of data being used: opinion polls which estimate the fraction of voters who are likely to vote for a particular candidate in the next presidential election; taking measurements on the heights and weights of senior high school learners (done in the first chapter); or conducting an experiment on a sample of patients who are randomly allocated to either (a) a treatment group, which is given some medical treatment, or (b) a control group, which is given a placebo (a harmless salt solution) to control for the psychological effects of being given a treatment.
The context here for sampling is to recall the coin-tossing experiment in Lesson 2-03 that involves tossing a one-peso coin, with the class getting either a head (H), the face of Rizal on top, or a tail (T), the other side up. This time, however, it is crucial to point out that the class does not assume beforehand that the coin is fair. That is, while they may be able to assume that the probability of getting a head on a single toss of a coin, P(H) = p, is a constant, it is not known; and thus, they would like to estimate p. This makes the situation a statistical one that involves uncertainty.
Tell learners to suppose that they were to flip the coin n times. (Later on they will begin to refer to this as taking a sample of size n.) The random variable X, the number of heads as defined before in Lesson 2-03, can take on values {0, 1, 2, … , n}, and the number of outcomes favorable to each value can be read off Pascal’s Triangle. However, this time the outcomes are no longer equally likely. Moreover, the probabilities cannot be computed exactly because they are functions of p, which is not known to them.
Note to teacher: In the previous chapter, learners have learned that the probability that independent events all occur is the product of their probabilities. So, the probability of x heads and n-x tails in a specific sequence is given by

p^x (1-p)^(n-x)


Moreover, there are nCx = n! / (x! (n-x)!) ways that x heads can turn up out of n flips.

Hence, the probability of observing x heads in n tosses, given that the chance of getting a head on a single toss is p, is

P(X = x) = nCx p^x (1-p)^(n-x), for x = 0, 1, 2, …, n

which is called the binomial probability mass function, or binomial pmf. As the name implies, a pmf defines the probability mass corresponding to individual values of the discrete random variable X. The binomial pmf depends on the value of p, which may be assumed to be constant but is unknown. Intuition should lead most, if not all, learners to consider the number of heads X observed divided by the number of tosses (or sample size) as a reasonable estimate for p; that is,
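The binomial pmf above can be evaluated directly for any assumed value of p. The following is a minimal sketch; the function name `binomial_pmf` and the trial values p = 0.5 and p = 0.3 are assumptions chosen for illustration (in the lesson itself p is unknown). `math.comb(n, x)` computes nCx.

```python
from math import comb


def binomial_pmf(x: int, n: int, p: float) -> float:
    """P(X = x): probability of exactly x heads in n independent tosses with P(H) = p."""
    if not 0 <= x <= n:
        return 0.0
    return comb(n, x) * p**x * (1 - p) ** (n - x)


# With a fair coin (p = 0.5), the pmf is symmetric ...
print([round(binomial_pmf(x, 4, 0.5), 4) for x in range(5)])
# ... while with an unfair coin (p = 0.3, hypothetical) it is not.
print([round(binomial_pmf(x, 4, 0.3), 4) for x in range(5)])
# Probabilities over x = 0, 1, ..., n always add up to 1.
print(round(sum(binomial_pmf(x, 10, 0.3) for x in range(11)), 10))
```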

X/n is a natural estimate of the probability p of getting a head. They know that X/n can take on values 0/n, 1/n, 2/n, … , n/n, but they will not know which one until after the flipping is completed. Furthermore, they know that if the experiment is repeated (flip the coin n times again), the observed X will not necessarily be the same as in the previous one. Thus, X is no longer just a variable in the mathematical sense; it is called a random variable (as was discussed in Lesson 2-03) because its outcome can change, and the change cannot be predicted with certainty. Variability and the attendant uncertainty in the result of the sampling experiment are thus introduced.
Inform learners that, as will be shown later in the course, the one experiment (of n flips of a coin yielding a single outcome x) allows the class to
(a) estimate the unknown probability p of getting a head (with x/n);
(b) estimate the uncertainty in this estimate, i.e., the value of the “probable error” in estimation from the sample, aware that the estimate is subject to “sampling variation” or “sampling error.” For instance, the approval ratings of the President obtained from an opinion poll of about 1,200 randomly selected respondents are theoretically within a margin of error of about 3 percentage points (as will be illustrated later) from the actual approval ratings;
(c) compute an interval estimate from the sample, along with the chance that the interval “captures” the unknown p.
The uncertainty is still there, but it can be measured using probability, and there lies the connection between statistics and probability (or mathematics).
B. Main Lesson: Estimation of Probability of Getting a Head in a Single Toss of a Coin
In discussing probability in the last chapter, we considered symmetry and appropriate random mixing (such as shaking a die) to justify the assignment of probabilities; for example, that the chance of rolling a two using a fair die is 1 out of 6. But knowing the probabilities of events, or even having a basis for assuming particular values for probabilities, is actually not a common scenario. On the contrary, we are often confronted with a situation such as a coin-tossing experiment, where we know the size n of the random sample of units (or number of trials), but we do not know the probability p of getting a head, and we would like to estimate this constant but unknown quantity. We could extend this to the scenario of estimating the
• percentage of voters who would be voting for a certain candidate in the next election, or
• the fraction of the population who is poor.

One of the main reasons for studying probability distributions is that they provide the foundation for making conclusions or inferences about unknown population characteristics, such as p, on the basis of sample data. Inform learners that generalizing results beyond the data collected, provided that the data collected is a part (sample) of a larger set of items (population), is known as statistical inference. In the context of the two practical examples, we could get a random sample of
• voters, who can be asked about their current preference for voting in the next election. We may be interested in using the sample to draw an inference about the proportion of the population of voters who currently prefer to vote for some candidate (and even profile these people in relation to socioeconomic status, sex, age, or geographic location).
• respondents, who can be asked about their income and/or expenditure. If some poverty line (which can be viewed as the minimum level of income or expenditure required for a particular welfare level) is defined, we can draw conclusions about the proportion of the population who are poor (and consequently, describe the poor in relation to the non-poor).


Even without using any concepts from probability (discussed in the previous chapter), learners should find it reasonable to think that a sample proportion should tell us something about the unknown population proportion. If we have a “random sample” from the population, the sample is expected to be representative of the population, so we should be able to use the sample proportion as an estimate of the population proportion. Provide some scenarios and ask what the estimate of p would be for each:
• Flipping a coin 100 times and getting 52 heads. Ask learners what the estimate of the probability of getting a head on a single toss would be, and what the estimate of the probability of getting a tail would be. They should say 52/100 = 0.52 and 48/100 = 0.48, respectively.



• Conducting an opinion poll of 1,200 randomly selected voters who reported these voting preferences:

                Metro Manila   Balance Luzon   Visayas   Mindanao   Total
Candidate X          195            197          115        261       768
Candidate Y          105            103          185         39       432
Total                300            300          300        300     1,200

Ask learners what the estimated fraction of voting preference for candidate X would be. Learners should see that nationally, the estimate is 768/1200 = 0.64, but the estimated proportions vary by geographic location.

• Conducting a sample survey of, say, 5 families selected randomly from a list of families. They are asked to provide information on their monthly family income and family size. Suppose that a family is poor if its monthly per capita income, i.e., monthly total family income divided by the family size, is less than Php 1,800.

Family   Monthly Total Family Income (Php)   Family Size   Per Capita Income (Php)
1                     40,000                      5               8,000.00
2                     10,000                      6               1,666.67
3                    100,000                      3              33,333.33
4                      8,000                      8               1,000.00
5                     75,000                      4              18,750.00

Ask learners what the estimated proportion of families that are poor would be. Learners should see that only the second and fourth families are poor, so 2/5 = 0.40 = 40% of families are estimated to be poor.

C. Data Collection Activity

Give the learners Activity Worksheet 3-01. Ask learners: If you were to toss a coin an extremely large number of times, what proportion of the tosses would be heads? Of course, they will answer 1/2. Explain to learners that they are assuming that the coin is fair. The goal of this activity is to estimate the proportion of tosses that would result in a “head” and then to examine the distribution of estimates in repeated sampling.

Have learners work individually using the data collection procedure described on the Activity Worksheet. Learners must determine the sample proportion of tosses that yield a “head” for each of the sample sizes n = 5, 10, 25, and 50. Individual results are recorded on the Worksheet, and each student writes his or her results on the blackboard (or in a worksheet of a spreadsheet application on a computer) in an appropriately labeled column. For a 45-student class, there should be 45 sample proportion values for each of the sample sizes. Collecting this data gives learners the opportunity to participate in an example of obtaining repeated samples. Calculating the proportion of “heads” for each of the sample sizes helps to reinforce the idea of a sample proportion being a random variable whose value changes from sample to sample.

After the individual sample proportions have been computed and the results copied onto the blackboard (or a computer worksheet), ask learners to enter the class data into the Class Data Table on the Activity Worksheet. Based on the generated data, have them construct a Stem and Leaf Display for each sample size (bar graphs may also be used). From these graphs, ask learners to describe the distribution of the values that they generated.
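Where coins or class time are short, the data collection activity can be mimicked on a computer. The following is only a minimal sketch, assuming a fair coin (p = 0.5) and a class of 35 learners; the function name simulate_class and the seed values are illustrative choices, not part of the worksheet.

```python
import random
import statistics


def simulate_class(num_learners, n, p=0.5, seed=None):
    """Return one sample proportion of heads per learner, each from n tosses."""
    rng = random.Random(seed)
    proportions = []
    for _ in range(num_learners):
        # Each toss is a head with probability p (assumed 0.5 for a fair coin).
        heads = sum(1 for _ in range(n) if rng.random() < p)
        proportions.append(heads / n)
    return proportions


# One simulated class data table: 35 learners, four sample sizes.
for n in (5, 10, 25, 50):
    props = simulate_class(35, n, seed=n)
    print(n, round(statistics.mean(props), 3), round(statistics.stdev(props), 3))
```

Each printed row gives the sample size, the mean of the 35 simulated proportions, and their standard deviation. Changing the seed plays the role of a different class repeating the activity.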

D. Data Analysis and Interpretation

The figure below provides an example of results that can serve as a model. Here, 35 learners participated in producing the example data set. Stem and Leaf Displays have been constructed for the proportions of tosses that yielded a “head” for each of the four sample sizes (5, 10, 25, and 50). By examining the class data, learners will begin to discover how statistics differs from mathematics, since statistics involves uncertainty.


Stem and Leaf Display for sample sizes of n=5
data rounded to nearest multiple of .1
plot in units of .1
0* | 0
0t | 22222
0f | 44444444444
0s | 66666666666
0. | 8888
1* | 000

Stem and Leaf Display for sample sizes of n=10
data rounded to nearest multiple of .1
plot in units of .1
0t | 222
0f | 4444444555555555
0s | 666666677777777
0. | 8

Stem and Leaf Display for sample sizes of n=25
data rounded to nearest multiple of .01
plot in units of .01
2. | 888
3* | 2
3. | 6
4* | 0000044444
4. | 88888
5* | 22222
5. | 6
6* | 00004
6. | 88
7* | 22

Stem and Leaf Display for sample sizes of n=50
data rounded to nearest multiple of .01
plot in units of .01
3* | 4
3. | 88
4* | 002222444
4. | 6688
5* | 00022244
5. | 666666
6* | 044
6. | 68

Figure 3-01.1. Stem and Leaf Display of Example Class Data

The example class data in Figure 3-01.1 is used for purposes of illustration, and may serve as a prototype for the actual class data. For each sample size, learners should construct a stem and leaf display (or any graphical representation of the distribution, such as a histogram) of the sample proportion values and describe the shape, center, and spread of the distribution of values.

For samples of size 5 and size 10, there are only a few different values of the sample proportion, so the shape is difficult to determine. It appears that the centers of the distributions for samples of size 5 and size 10 are both at around 0.50. The sample proportion values range from 0 to 1 for samples of size 5, and from 0.2 to 0.8 for samples of size 10.

For samples of size 25, more distinct values are obtained for the sample proportion of “heads.” The distribution again has a center at around 0.50, and the sample proportions range from 0.28 to 0.72.

For samples of size 50, the distribution of the sample proportions appears roughly like a normal curve, i.e., mound-shaped (with a slight rightward skew). The center is at around 0.50. The sample proportion values range from 0.34 to 0.68.

For each sample size, ask learners to calculate the mean and standard deviation of the sample proportions. The calculated values of the mean of the sample proportion distributions are 0.52, 0.529, 0.489, and 0.50, respectively, for samples of size 5, 10, 25, and 50. The calculated values of the standard deviation of the sample proportion distributions are 0.24, 0.15, 0.12, and 0.09, respectively, for samples of size 5, 10, 25, and 50.

Ask learners to think about the relationship between the center of the distribution of the sample proportions and the value of the population proportion. Learners


should note that the distribution of sample proportion values is centered on the value of the population proportion (here, p = 1/2 = 0.50). Tell learners to think about the relationship between the sample size and the shape of the distribution of the sample proportion. Learners should note that as the sample size increases, the distribution of the sample proportion tends more towards a normal distribution. (This is known as the “Central Limit Theorem.”)

Ask learners: For which sample size is the standard deviation of the sample proportion values the largest, and for which sample size is the standard deviation the smallest? Ask them why they think this happens. Learners should observe that the variability of the sample proportion values, whether measured by the range or the standard deviation, is related to the sample size. A larger sample size results in smaller variability (smaller range and smaller standard deviation) in the sample proportion values.

The results from the analyses of the repeated sampling can lead to a discussion on the theoretical properties of the sampling distribution of a sample proportion. The mean value of the distribution of a sample proportion for repeated random samples of size n, drawn from the same population, is equal to the corresponding value of the proportion of the population, p. (Here, the value of p seems to be 0.5.) The standard deviation of the distribution of a sample proportion for repeated random samples of size n, drawn from the same population, decreases with an increase in the sample size. The theoretical standard deviation formula $\sqrt{p(1-p)/n}$ of a sample proportion, also called the “standard error,” can now be introduced. The standard error is inversely proportional to the square root of the sample size. In consequence, the bigger the sample size (i.e., the number of tosses of the coin), the less variability there will be in the estimates.

Technical Note: Since the function $p(1-p)$ is maximized when p = 1/2, the standard error of the sample proportion is at most $\sqrt{(1/2)(1/2)/n} = 1/(2\sqrt{n})$. As will be shown in later lessons, organizations that conduct opinion polls have 95% confidence that the opinion polls have margins of error, i.e., twice the standard error of a sample proportion, of at most 3 percentage points. They design the polls so that $2 \cdot 1/(2\sqrt{n}) = 1/\sqrt{n} \le 0.03$, or equivalently, so that the sample size n is at least $1/(0.03)^2 \approx 1{,}111$. This is why these organizations use sample sizes of 1,200.

Learners will also observe that the standard error is dependent on p, and thus, we may instead estimate it with $\sqrt{\hat{p}(1-\hat{p})/n}$, provided that the sample size is large enough.
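The standard error and the sample-size reasoning in the Technical Note can be checked numerically. The sketch below assumes the standard error of a sample proportion is sqrt(p(1 − p)/n) and that the margin of error is taken as twice the standard error; the function names are hypothetical helpers introduced only for illustration.

```python
import math


def standard_error(p, n):
    """Standard error of a sample proportion: sqrt(p(1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)


def max_standard_error(n):
    """Largest possible standard error for sample size n, attained at p = 1/2."""
    return 1 / (2 * math.sqrt(n))


def min_sample_size(margin):
    """Smallest whole n for which 2 * max_standard_error(n) = 1/sqrt(n) <= margin."""
    return math.ceil(1 / margin ** 2)


print(standard_error(0.5, 50))      # spread expected for 50 tosses of a fair coin
print(max_standard_error(1200))     # worst-case standard error for a 1,200-person poll
print(min_sample_size(0.03))        # sample size needed for a 3-point margin of error
```

A poll of 1,200 respondents comfortably exceeds the minimum returned by min_sample_size(0.03), which is consistent with the sample sizes mentioned in the note.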

Inform learners that the distribution of a sample proportion for repeated random samples of size n, drawn from the same population, will approximately follow a normal (bell-shaped) distribution.

Note: The activity above can be done with other “simulations” of situations, e.g., consider tossing a die and observing the proportion of times that the upward face of the die yields a “four” or a “five” (which should be expected to be 2/6 = 1/3, give or take some random variation).

E. Enrichment

After learners have been introduced to the formula for a confidence interval for the population proportion, data collected from the coin-tossing activity for n = 50 can be used. Learners can construct a 95% confidence interval for the proportion of tosses resulting in a “head”: $\hat{p} \pm 2\sqrt{p(1-p)/n}$, which can be approximated as $\hat{p} \pm 2\sqrt{\hat{p}(1-\hat{p})/n}$ since p is unknown. Note that since each student’s sample will be unique, after constructing their 95% confidence intervals, learners can be asked to put the results onto the blackboard. This way, the instructor can have a discussion on what confidence level means (examining the percentage of the different confidence intervals that include 1/2 = 0.5). About 95 percent of learners will have confidence intervals that will hit the target of 0.5, but about 5 percent of the intervals will not.

KEY POINTS
• The probability p of getting a “head” in a single toss of a coin need not be 50%; it is an unknown number, which you can estimate by flipping a coin n times and noting the number x of times you get a “head,” thus yielding the estimate x/n of p.



• If several learners were to yield estimates from n tosses of the coin, the estimates will not be the same, but they will have “sampling” variability. The standard deviation of the estimates, called the standard error, is $\sqrt{p(1-p)/n}$. This standard error is dependent on p, and thus, we may instead estimate it with $\sqrt{\hat{p}(1-\hat{p})/n}$. The standard error is inversely proportional to the square root of the sample size. In consequence, the bigger the sample size (i.e., the number of tosses of the coin), the less variability we will have in the estimates.
• As the number of tosses increases, the distribution of estimates looks more and more like a normal curve. (This is known as the “Central Limit Theorem,” wherein the sample proportion has a sampling distribution whose shape can be approximated by a normal curve, whose center is the value of the population proportion and whose standard deviation is $\sqrt{p(1-p)/n}$. The larger the sample, the better the approximation will be.)
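The meaning of a 95% confidence level discussed in the Enrichment section can be illustrated by simulation. The sketch below assumes a fair coin, intervals of the form estimate plus or minus twice the estimated standard error, and hypothetical helper names (confidence_interval, coverage); it is an illustration under those assumptions, not a prescribed procedure.

```python
import math
import random


def confidence_interval(heads, n, z=2.0):
    """Approximate 95% interval: p_hat +/- z * sqrt(p_hat(1 - p_hat) / n)."""
    p_hat = heads / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se


def coverage(num_learners, n, p=0.5, seed=None):
    """Fraction of simulated learners whose interval contains the true p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_learners):
        heads = sum(1 for _ in range(n) if rng.random() < p)
        low, high = confidence_interval(heads, n)
        if low <= p <= high:
            hits += 1
    return hits / num_learners


print(confidence_interval(27, 50))   # one learner who got 27 heads in 50 tosses
print(coverage(35, 50, seed=1))      # share of a 35-learner class that "hits" 0.5
```

With only 35 learners the observed share can differ noticeably from 95 percent; raising num_learners shows the long-run behavior, which for n = 50 tends to sit a little below 95 percent because the interval is only an approximation.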

ACTIVITY SHEET NUMBER 3-01

Using Coin Tosses to Introduce Sampling Distributions Activity Sheet

Suppose that you were to toss a regular one-peso coin a large number of times. What proportion of the tosses will yield a “head”? p = ______________.

Individual Data Table (use two decimal places)

                                             5 Trials   10 Trials   25 Trials   50 Trials
Number of Tosses Resulting in “Heads”        ________   ________    ________    ________
Proportion of Tosses Resulting in “Heads”    ________   ________    ________    ________

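Copy your sample proportions (use two decimals) onto the blackboard in the appropriately labeled column. Input the class proportion of tosses resulting in a “head” into the Class Data Table.

Class Data Table: Class Proportion of Tosses Resulting in a “Head”

Sample   n=5    n=10   n=25   n=50      Sample   n=5    n=10   n=25   n=50
1        ____   ____   ____   ____      21       ____   ____   ____   ____
2        ____   ____   ____   ____      22       ____   ____   ____   ____
3        ____   ____   ____   ____      23       ____   ____   ____   ____
4        ____   ____   ____   ____      24       ____   ____   ____   ____
5        ____   ____   ____   ____      25       ____   ____   ____   ____
6        ____   ____   ____   ____      26       ____   ____   ____   ____
7        ____   ____   ____   ____      27       ____   ____   ____   ____
8        ____   ____   ____   ____      28       ____   ____   ____   ____
9        ____   ____   ____   ____      29       ____   ____   ____   ____
10       ____   ____   ____   ____      30       ____   ____   ____   ____
11       ____   ____   ____   ____      31       ____   ____   ____   ____
12       ____   ____   ____   ____      32       ____   ____   ____   ____
13       ____   ____   ____   ____      33       ____   ____   ____   ____
14       ____   ____   ____   ____      34       ____   ____   ____   ____
15       ____   ____   ____   ____      35       ____   ____   ____   ____
16       ____   ____   ____   ____      36       ____   ____   ____   ____
17       ____   ____   ____   ____      37       ____   ____   ____   ____
18       ____   ____   ____   ____      38       ____   ____   ____   ____
19       ____   ____   ____   ____      39       ____   ____   ____   ____
20       ____   ____   ____   ____      40       ____   ____   ____   ____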

Use the Class Data to answer the following questions. Remember that p = 1/2.

1. For each sample size, construct a stem-and-leaf display/histogram of the sample proportion values and describe the shape, center, and spread of the distribution of values.
2. Based on your stem-and-leaf displays/histograms, what do you think is the relationship between the center of the distribution of the sample proportions and the value of the population proportion?
3. Based on your stem-and-leaf displays/histograms, what do you think is the relationship between the sample size and the shape of the distribution of the sample proportion?
4. For each sample size, calculate the mean and standard deviation of the sample proportions and write a sentence to interpret the standard deviation.

Sample Size          n=5      n=10     n=25     n=50
Mean                 ____     ____     ____     ____
Standard Deviation   ____     ____     ____     ____

5. For which sample size is the standard deviation the largest and for which sample size is the standard deviation the smallest? Why do you suppose this happens?


ASSESSMENT

1. A polling organization randomly selects 1,000 registered voters to estimate the proportion of a large population that intends to vote for a certain movie celebrity in an upcoming election. Although it is not known by the polling organization, the actual proportion of the population that prefers the celebrity is 0.46.
(a) Give the numerical value of the mean of the sampling distribution of $\hat{p}$. Answer: .46
(b) Calculate the standard deviation of the sampling distribution of $\hat{p}$. Answer: $\sqrt{(.46)(.54)/1000} \approx .0158$
(c) If the proportion of the population that prefers the celebrity is .46, would a sample proportion value of .60 be considered unusual? Answer: Yes, a sample proportion of .60 would be considered unusual. It is roughly 8.9 standard deviations above the mean: .46 + 8.9(.0158) ≈ .60.

2. Every senior high school student taking a Probability and Statistics class at a large senior high school (about 1,100 learners) participated in a class project by rolling a 6-sided die 50 times. Each student determined the proportion of his or her 50 rolls for which the result was a “1”. The instructor plans to draw a histogram of the 1,100 sample proportions.
(a) What will be the approximate shape of this histogram? (i) Skewed left. (ii) Uniform. (iii) Normal (bell-shaped). (iv) Skewed right. ANSWER: iii
(b) What will be the approximate mean for the 1,100 sample proportions? (i) 1/50 (ii) 1/6 (iii) 6/50 (iv) 6 ANSWER: ii. Note that the sample size here is 50.
(c) What will be the approximate standard deviation for the 1,100 sample proportions?
(i) $\sqrt{(1/6)(5/6)/1{,}100}$
(ii) $\sqrt{(1/6)(5/6)/50}$
(iii) $\sqrt{(1/100)(99/100)/1{,}100}$
(iv) $\sqrt{(1/100)(99/100)/50}$
ANSWER: ii

3. A candy company assures the public that 20% (p = 0.20) of all its assorted candies are chocolate flavored. Suppose the candies are packaged at random in small bags containing about n = 100 candies. A class of senior high school learners learning about percentages opens several bags, counts the various types of candies, and calculates $\hat{p}$ = proportion in the bag that are chocolate flavored. What is the z-score for $\hat{p}$ = .25?
ANSWER: $z = (\hat{p} - p)/\sqrt{p(1-p)/n}$, and here p = .20, $\hat{p}$ = .25, and n = 100, so $z = (.25 - .20)/\sqrt{(.20)(.80)/100} = .05/.04 = 1.25$.

4. It is believed that 60% of cars travelling on a major highway outside of Metro Manila exceed the speed limit. A radar trap checks the speeds of 90 cars.
(a) Using the 68-95-99.7 Rule, draw and label the distribution of the proportion of cars that the police will observe speeding.
ANSWER: The center of the distribution is 0.60; the standard error is $\sqrt{(0.60)(0.40)/90} \approx 0.05163978$.
The normal curve follows the empirical rule (also called the 68-95-99.7 rule):
• About 68% of the area under the curve falls within 1 standard deviation of the mean.
• About 95% of the area under the curve falls within 2 standard deviations of the mean.
• Nearly the entire distribution (about 99.7% of the area under the curve) falls within 3 standard deviations of the mean.
Thus, we can approximate the sampling distribution of the proportion of cars that will exceed the speed limit by a normal curve with center 0.60 and standard deviation 0.052.
(b) Do you think the appropriate conditions necessary for your analysis are met?
ANSWER: Both np = 90 × 0.6 = 54 and np(1−p) = 21.6 are at least 10. Drivers may be viewed as independent of each other, but if the flow of traffic is very fast, they may not behave independently. Likewise, if weather conditions are not good, these conditions may affect all drivers. In these cases, there may be fewer speeders than expected.
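The labels for the curve in item 4(a) can be worked out with a few lines of code. This is only a sketch, assuming the normal approximation with center p and standard error sqrt(p(1 − p)/n); the helper name empirical_rule_bands is hypothetical.

```python
import math


def empirical_rule_bands(p, n):
    """Return the intervals p +/- k standard errors for k = 1, 2, 3."""
    se = math.sqrt(p * (1 - p) / n)
    return {k: (p - k * se, p + k * se) for k in (1, 2, 3)}


bands = empirical_rule_bands(0.60, 90)
for k, share in zip((1, 2, 3), ("68%", "95%", "99.7%")):
    low, high = bands[k]
    print(f"about {share} of sample proportions between {low:.3f} and {high:.3f}")
```

The three intervals give the tick marks to place one, two, and three standard errors on either side of 0.60 when sketching the curve.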

5. Using the following table of standard errors for estimating the proportion p of voters supporting a candidate:
(a) Obtain the maximum estimated standard error for a sample of size 1,200. ANSWER: .0145
(b) What would the gain be if you go from a sample of size 1,000 to a sample of size 2,000, or even 10,000? ANSWER: Doubling the sample size from 1,000 to 2,000 will only reduce the standard error from 1.6 percentage points to 1.1 percentage points. Increasing the size from 1,000 to 10,000 will reduce the standard error from 1.6 percentage points to about a third of that (0.5 percentage point).


CHAPTER 3: SAMPLING
Lesson 2: The Need for Sampling

TIME FRAME: 120 minutes

OVERVIEW OF LESSON
In this lesson, learners are given lectures (and assessments) regarding sampling: basic concepts, discussions on why it is important to sample, descriptions of different types of samples, as well as kinds of survey errors.

LEARNING COMPETENCIES
At the end of the lesson, the learner should be able to:
• Define random sampling
• Give reasons for sampling
• Distinguish between parameter and statistic
• Recognize the value of randomization as a defense against bias
• Identify that the size of the sample (not the fraction of the population) determines the precision of estimates from a probability sample

LESSON OUTLINE
A. Motivation: What is a Survey and Why do we use Sampling?
B. Lesson Proper
1. Probability Sampling
2. Non-probability Sampling
3. Survey Errors
4. Sampling Distribution, Accuracy, and Precision
C. Data Collection
D. Data Analysis and Interpretation
E. Enrichment

REFERENCES
Albert, J. R. G. (2008). Basic Statistics for the Tertiary Level (ed. Roberto Padua, Welfredo Patungan, Nelia Marquez). Philippines: Rex Bookstore.
De Veaux, R. D., Velleman, P. F., and Bock, D. E. (2006). Intro Stats. Pearson Ed. Inc.
Workbooks in Statistics 1: 11th Edition, Institute of Statistics, UP Los Baños, College, Laguna 4031


KEY CONCEPTS: Sampling, Estimation, Bias, Sampling Variation, Randomization

DEVELOPMENT OF THE LESSON

A. Motivation: What is a Survey and Why do we use Sampling (rather than full enumeration)?

In Chapter 1, discussions on describing data assumed that data come from a population of interest. When information is recorded for an entire population, this is called a census. Examples are collecting the grades of all the Grade 11 learners, or the decennial population census done by the Philippine Statistics Authority (PSA). However, in most cases, censuses involve great challenges. Also, one does not need to do a full count to get information, especially on flow data, such as agricultural production, household expenditure, and establishment income. This brings us to sampling, which is the process of selecting a section of the population. Learners may have heard of sample surveys, especially opinion polls conducted before an election.

Ask a few learners to state the number of minutes they spend to get to school in the morning. Then, after asking these few individuals, describe to the class the typical time it takes learners to get to school (such as the average time). Ask learners if the descriptive statements you made are valid or not. Next, define:
• a sample survey as a method of systematically gathering information on a segment of the population, such as individuals, families, wildlife, farms, business firms, and unions of workers, for the purpose of inferring quantitative descriptors of the attributes of the population.

The fraction of the population being studied is called a sample. Learners may wonder why people don’t just survey everyone instead, and why they “trust” opinion polls when these only interview 1,600 respondents and not the actual millions of Filipinos who will be voting on election day. Learners should be made aware that there are many reasons why we resort to sampling.
• Cost. A sample often provides useful and reliable information at a much lower cost than a census. For extremely large populations, the conduct of a census can even be impractical. In fact, the difficulty of analyzing complete census data led to summarizing a census by taking a “sample” of returns.
• Timeliness. A sample usually provides more timely information because fewer data are to be collected and processed. This attribute is particularly important when information is needed quickly.
• Accuracy. A sample often provides information as accurate as, or more accurate than, a census, because data errors typically can be controlled better in smaller tasks.
• Detailed information. More time is spent in getting detailed information with sample surveys than with censuses. In a census, we can often only obtain stock, not flow data. For instance, agricultural production cannot be generated from censuses.
• Destructive testing. When a test involves the destruction of an item, sampling must be used. Battery life tests must use sampling because something must be left to sell!

Inform learners that conducting a full census of voters can be quite costly and, besides, this is already done on Election Day itself. Only in rare cases is a full enumeration census of the population taken. For instance, the PSA conducts the Census of Population and Housing every ten years, typically when the year ends in 0, although in 1995, 2007, and 2015, the PSA also conducted mid-decade censuses. The financial costs of conducting censuses and processing their results are far higher than those of sample surveys.

Explain that in a sample survey, we can generate flow information that describes characteristics of the subject covering a period of time. For instance, agricultural production is collected not in an agriculture census but in a sample survey of agricultural households and establishments. A sample survey covers more detailed information on the unit of inquiry than a census does, and is also less expensive to conduct than a census.

Sampling theory, developed a century ago, has shown that one does not need to conduct a census to obtain information, i.e., conducting a sample survey will do just as well. Look at it this way: One does not need to finish drinking a pot full of coffee to know if the coffee tastes good. A cup or even a sip will do, provided the “sample” is taken in a “fair manner.” Even hospitals only extract blood samples from patients for medical tests rather than extracting all the blood of the patient to determine whether or not the patient gets a clean bill of health. What is crucial is to design a sample survey that will be representative of the population it intends to characterize. Typically, representativeness in a sample survey can be expected if chance methods are used for selecting respondents.


Even sample surveys conducted by the PSA (household surveys, establishment surveys, and agricultural surveys, which may involve households and establishments) use chance methods to select their survey respondents.

B. Lesson Proper

1. Probability Sampling

If data is to be used to make decisions about a population, then how the data is collected is critical. For sample data to provide reliable information about a population of interest, the sample must be representative of that population. Selecting samples from the population using chance allows the samples to be representative. If a sample survey involves allowing every member of the population to have a known, nonzero chance of being selected into the sample, then the sample is called a probability sample. Probability samples are meant to ensure that the segment taken is representative of the entire population. Examples of these include the Family Income and Expenditure Survey (FIES), the Labor Force Survey, and the Quarterly Survey of Establishments, all conducted by the PSA. Opinion polls conducted by some non-government organizations with track records, such as the Social Weather Stations and Pulse Asia, likewise use chance methods to select their survey respondents. Data collected from these probability sampling-based surveys yield estimates of characteristics of the population that these surveys attempt to describe.

Basic Types of Probability Sampling

a. Simple random sampling (SRS) involves allowing each possible sample to have an equal chance of being picked, so that every member of the population has an equal chance of being included in the sample. Selection may be with replacement (the selected individual or unit is returned to the frame for possible reselection) or without replacement (the selected individual or unit is not returned to the frame). This sampling method requires a listing of the elements of the population, called the sampling frame. In the case of agricultural surveys or surveys of establishments, the sampling frame may be based on a list frame, an area frame, or a mixture of the two. Samples may be obtained using a table of random numbers or computer random number generators.
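As a small illustration of drawing a simple random sample with a computer random number generator, the sketch below uses Python's standard random module. The sampling frame of 200 student labels and the function names are hypothetical and serve only as an example.

```python
import random


def srs_without_replacement(frame, n, seed=None):
    """Each unit can be selected at most once (not returned to the frame)."""
    rng = random.Random(seed)
    return rng.sample(frame, n)


def srs_with_replacement(frame, n, seed=None):
    """A selected unit is returned to the frame and may be selected again."""
    rng = random.Random(seed)
    return rng.choices(frame, k=n)


# Hypothetical sampling frame: a list of 200 student identifiers.
frame = [f"student_{i:03d}" for i in range(1, 201)]

print(srs_without_replacement(frame, 10, seed=7))
print(srs_with_replacement(frame, 10, seed=7))
```

Fixing the seed makes the selection reproducible, which can help when a class needs to verify how a sample was drawn.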


b. Stratified sampling is an extension of simple random sampling which allows for different homogeneous groups, called strata, in the population to be represented in the sample. To obtain a stratified sample, the population is divided into two or more strata based on common characteristics. An SRS is then selected from each stratum, with sample sizes proportional to strata sizes. Samples from the strata are then combined into one. This is a common technique when sampling from a population of voters, stratifying across racial or socio-economic classes. When thinking of using stratification, the following questions must be asked:
• Are there different groups within the population?
• Are these differences important to the investigation?

If the answer to both questions is yes, then stratified sampling is called for.

Figure 3-02.1 Illustration of Stratified Sampling

Explanatory Note: Usually, stratified sampling is done when the population is divided into several subgroups with common characteristics. The population may be divided into urban and rural locations (as dwellings in rural areas may tend to be more homogeneous than dwellings in urban areas); the student population may be divided by the year level of learners; or the workers in a hospital may be categorized by their different occupations (nurse, doctor, janitor, secretary).

c. In systematic sampling, elements are selected from the population at a uniform interval that is measured in time, order, or space.


Figure 3-02.2 Illustration of Systematic Sampling

Typically, a desired sample size n is first decided on. The frame of N units is then divided into n groups of k units each, where k = N/n. Then, one unit is randomly selected from the first group, and every kth unit thereafter is also selected. For instance, in Figure 3-02.2, consider the population of 20 trees: if the sample size is 4, then the frame is divided into 4 groups of k = 20/4 = 5 trees each. Suppose that the fourth item is chosen in the first group; every fifth unit thereafter (the 9th, 14th, and 19th) is then also chosen.

d. Cluster sampling divides the population into groups called clusters, selects a random sample of clusters, and then subjects the sampled clusters to complete enumeration; that is, everyone in the sampled clusters is made part of the sample.

Figure 3-02.3 Illustration of Cluster Sampling
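The selection steps for systematic and cluster sampling described above can be sketched in code. The example assumes that the frame size N is a multiple of the sample size n, and all names and data (the 20 numbered trees, the barangay clusters) are hypothetical illustrations.

```python
import random


def systematic_sample(frame, n, seed=None):
    """Random start within the first group of k = N // n units, then every k-th unit."""
    rng = random.Random(seed)
    k = len(frame) // n               # sampling interval (assumes N is a multiple of n)
    start = rng.randrange(k)          # 0-based position inside the first group
    return [frame[start + i * k] for i in range(n)]


def cluster_sample(clusters, num_clusters, seed=None):
    """Randomly choose whole clusters, then take every unit in the chosen clusters."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), num_clusters)
    return [unit for name in chosen for unit in clusters[name]]


trees = list(range(1, 21))            # 20 trees, numbered 1 to 20
print(systematic_sample(trees, 4, seed=5))

barangays = {                         # hypothetical clusters of households
    "Barangay A": ["A1", "A2", "A3"],
    "Barangay B": ["B1", "B2"],
    "Barangay C": ["C1", "C2", "C3", "C4"],
}
print(cluster_sample(barangays, 2, seed=5))
```

In the systematic sample, consecutive selected trees are always k = 5 positions apart; in the cluster sample, a barangay is either fully included or fully excluded.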


Explanatory Note: Clusters in the population may be based on convenience in the collection of data. For example, in a village, clusters can be blocks of houses. In a school, the clusters can be the sections. In a dormitory, clusters can be the rooms. In a city or municipality, the clusters can be the different barangays. Cluster sampling is conducted so that data collected need not come from a huge geographic range, thus saving resources. For instance, instead of getting a simple random sample of households from all over a town, clusters of dwellings can be selected from different barangays so that the cost of data collection can be minimized.

Example: Suppose you want to estimate the mean grade point average (GPA) of learners at a certain higher educational institution. You decide that an appropriate sample size is n = 100. To estimate the mean GPA, you can use simple random sampling to select 100 learners and average their GPAs. Since freshmen GPAs tend to be lower than senior GPAs, you may want to make sure that all year levels are represented, so you decide to use a stratified sample. According to the university’s registrar, the student population consists of 35% freshmen, 30% sophomores, 20% juniors, and 15% seniors. Get samples from each stratum, proportional to its size. Specifically, take simple random samples of 35 freshmen, 30 sophomores, 20 juniors, and 15 seniors. Then, average the GPAs of the sampled learners to estimate the mean GPA of the entire university. Instead of year level, the subgroups of the student population can also be based on academic major, assuming that each student is assigned one major. When stratifying into subgroups, the subgroups must be mutually exclusive. If they are not, then some subjects will have a higher chance of being chosen since they belong to more than one subgroup.
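The proportional allocation in the GPA example can be sketched as follows. The enrollment counts are hypothetical (chosen only to match the registrar's percentages), as are the function names, and the rounding step is a simplification that is noted in the code.

```python
import random


def proportional_allocation(strata_sizes, n):
    """Split a total sample size n across strata in proportion to stratum size.

    Simplification: with uneven proportions, rounding may make the parts
    sum to slightly more or less than n; that case is not handled here.
    """
    total = sum(strata_sizes.values())
    return {name: round(n * size / total) for name, size in strata_sizes.items()}


def stratified_sample(strata, n, seed=None):
    """Draw a simple random sample (without replacement) from each stratum."""
    rng = random.Random(seed)
    sizes = {name: len(units) for name, units in strata.items()}
    allocation = proportional_allocation(sizes, n)
    return {name: rng.sample(units, allocation[name]) for name, units in strata.items()}


# Hypothetical enrollment of 2,000 students: 35%, 30%, 20%, and 15% by year level.
enrollment = {"freshmen": 700, "sophomores": 600, "juniors": 400, "seniors": 300}
strata = {name: [f"{name}_{i}" for i in range(size)] for name, size in enrollment.items()}

sample = stratified_sample(strata, 100, seed=2016)
print({name: len(units) for name, units in sample.items()})
```

The printed counts per stratum are 35, 30, 20, and 15, matching the allocation in the example; the GPAs of the selected students would then be averaged to estimate the university-wide mean.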

Inform learners that Statistics is different from Mathematics. The essential paradigm in Statistics is induction (from the particular to the general), while Mathematics uses deduction (from the general to the particular). The role of modern Statistics is to develop tools that allow scientifically valid inference from samples to the populations from which they came. Specific parameters (numerical summaries of the population, such as a population proportion or a population mean) are estimated by statistics, which are summaries of the sample data, such as a sample proportion or a sample mean. In probability sampling, each member of the population has a positive and measurable chance of inclusion in the sample. These inclusion probabilities serve as the bridge from sample to population. However, this bridge is weak or nonexistent when the inclusion probabilities cannot be computed, as in the case of non-probability sample surveys.

Figure 3-02.4 Population, sample, and inference

2. Non-probability Sampling

Ask learners whether polls on voting preferences through SMS messages and Facebook posts can adequately represent actual voting preferences. Learners should know or be made aware that the results of such polls are filled with too much noise, as there is currently no way to determine the representativeness of respondents to such surveys (especially if the targeted population is much bigger than the group of sampled respondents). SMS and Facebook polls do not have complete coverage of voters: not all voters have cellphones (especially among the poor) despite the increase in mobile phone usage over the years; not everyone has Internet access; and, certainly, not every Filipino voter has a Facebook account. As of 2014, only a third of Filipinos were reported to have access to the Internet. In addition, a mere “random selection” of mobile phone numbers or of Facebook users would in no way assure representativeness of the voting population, even if everyone had a cellphone or a Facebook account, since there would be “nonresponses” that have to be accounted for.

Non-probability or judgment sampling is the generic name of several sampling methods where some units in the population do not have a chance of being selected in the sample, or where the inclusion probabilities cannot be computed. Generally, the procedure involves arbitrary selection of “typical” or “representative” units concerning which information is to be obtained. A few types of non-probability samples are listed below:

a. Haphazard or accidental sampling involves an unsystematic selection of sample units. Some disciplines like archaeology, history, and even medicine draw conclusions from whatever items are made available. Some disciplines like astronomy, experimental physics, and chemistry often do not care about the “representativeness” of their specimens.
b. In convenience sampling, sample units expedient to the sampler are taken.
c. For volunteer sampling, sample units are volunteers in studies wherein the measuring process is painful or troublesome to a respondent.
d. Purposive sampling pertains to having an expert select a representative sample based on his or her own subjective judgment. For instance, in Accounting, a sample audit of ledgers may be taken of certain weeks (which are viewed as typical). Many agricultural surveys also adopt this procedure for lack of a specific sampling frame.
e. In quota sampling, sample units are picked for convenience but certain quotas (such as the number of persons to interview) are given to interviewers. This design is especially used in market research.
f. In snowball sampling, additional sample units are identified by asking previously picked sample units for people they know who can be added to the sample. Usually, this is used when the topic is not common, or the population is hard to access.

Discuss with learners other ways of classifying surveys:

• size of the sample – e.g. large-scale or small-scale
• periodicity – longitudinal or panel, where respondents are monitored periodically; cross-section; quarterly
• main objective – descriptive, analytic
• method of data collection – mail, face-to-face interview, e-survey, phone survey, SMS survey
• respondents – individual, household, establishment (or enterprise), farmer, OFW


3. Survey Errors

When collecting data, whether through sample surveys or censuses, a variety of survey errors may arise. This is why it is crucial to design the data collection process very carefully. Censuses may also overcount or undercount certain portions of the population of interest. Household censuses in the Philippines, for instance, have often been contentious because of undercounts and overcounts and their implications on politics, since congressional seats and the Internal Revenue Allotment (IRA) depend on population counts. Conclusions based on non-probability samples, such as telephone polls used in early morning television shows, SMS polls, or surveys on Facebook, do not hold the same weight as those based on probability samples. A probability sample uses chance to make the sample much more likely to be representative of the population, something that is not true of non-probability samples.

Survey errors involve sampling errors and non-sampling errors:

In the conduct of sample surveys, sampling error is roughly the difference between the value obtained in a sample statistic and the value of the population parameter that would have arisen had a census been conducted. This difference comes from the operation of the chance process that determines which particular units in the population are included in the sample. This error can be positive or negative, small or large, but increasing the sample size generally reduces this type of error. This error can be estimated and reported along with the sample statistic. Since estimates of a parameter from a probability sample would vary from sample to sample, the variation in estimates serves as a measure of sampling error. Statisticians can say, for instance, that in 2000, the FIES indicated that 39.5 percent of the entire Filipino population is poor and that there are 95 chances in 100 that a full census would reveal a value within 0.4% of the stated figure. The approval ratings of the President, obtained from an opinion poll of about 1,200 respondents who were selected judiciously through chance methods, are theoretically within a margin of error of about 3 percentage points from the actual approval ratings.

Another type of error that statisticians consider in the collection of data is called non-sampling error. There are many specific types of non-sampling error. There may be selection bias, or the systematic tendency to exclude a particular group of units from a survey. As a result, you get coverage errors, which arise if, for example, we assume that the respondents in a telephone poll in an early morning television show reflect the entire population of voters. Yet in fact, telephone polls in the Philippines at best represent only the population of telephone subscribers, which is, in truth, only a small minority of the targeted population of all Filipinos. Current television and radio polls being conducted by a number of media stations reflect only the population of those who are watching or listening to the show and who are persistent in phoning in their views. Thus, there is a serious issue of coverage. The same is true in the case of Internet-based and SMS surveys. Even a seriously done Internet survey will only reflect those who have Internet access, which is currently not the majority of Filipino households. To illustrate coverage and other non-sampling errors, consider the following case in point.

Example of Survey Estimate Fiasco: In 1936, the Literary Digest, a famed magazine in the United States, conducted a survey of its subscribers as well as telephone subscribers to predict the outcome of the presidential race. The Digest erroneously predicted that then incumbent President Franklin D. Roosevelt would receive 43% of the vote and thus lose to the challenger, Kansas Governor Alfred Landon, when in actuality Landon only received 38% of the total vote. (The Digest went bankrupt thereafter.) At the same time, George Gallup set up his polling organization and correctly forecast Roosevelt’s victory from a mere sample of 50,000 people. A post-mortem analysis revealed coverage errors arising from biases in sample selection. The Literary Digest list of targeted respondents was taken from telephone books, magazine subscriptions, club membership lists, and automobile registrations. Inadvertently, the Digest targeted well-to-do voters, who were predominantly Republican and who had a tendency to vote for their candidate. The sample had a built-in bias to favor one group over another. This is called selection bias.

In addition, there was also a non-response bias since, of the 10 million they targeted for the survey, only 2.4 million actually responded. A response rate of 24% is far too low to yield reliable estimates of population parameters. The views of nonrespondents may differ considerably from the views of respondents. Here, we see that obtaining a large number of respondents does not cure procedural defects but only repeats them over and over again! When choosing a sample, biases such as selection bias or nonresponse bias should be avoided. However, in practice, it can be challenging to avoid nonresponse bias in surveys since there are people who will fill out surveys and those who will not, even if incentives are provided.

Provide more examples to emphasize the lesson on non-sampling errors, such as asking your learners a question but only selecting specific people to answer it. For example, ask what their favorite toy or game was when they were growing up, but only ask either the boys or the girls. Then, based on the responses, conclude that their answers are true for the entire class. Have learners react to your statements. Say that it was an example of selection or coverage bias. You can also ask what the average height of the class is, but only ask tall people. Then, conclude that all members of the class are tall. Emphasize that nonprobability sampling makes the conclusions hard to generalize to the population.

To remedy biases (or failures of a sample to represent the population) resulting from “convenience” errors, polling organizations have since resorted to probability-based methodologies for selecting samples, where the subjects are chosen on the basis of certain probabilities, which, in turn, allow us to compute the number of persons that each sampled respondent effectively represents. Randomization, or using chance-based procedures for selecting respondents, is the best guarantee against bias. However, it is important to first have an idea of the sampling frame, i.e. a list of the units in the targeted population, and to carefully design the survey in order to make it representative of the targeted population.

Other possible sources of bias in sample surveys that one should be cautious about:

• wording of questions, which can influence the response enormously
• the sensitivity of a survey topic (e.g., income, sex, and illegal behavior)
• interviewer biases in selecting respondents or in the responses generated because of the appearance and demeanor of the interviewer
• non-response biases, which happen when targeted respondents opt not to provide information in the survey
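The Literary Digest lesson, that a very large sample does not cure a biased selection procedure, can be illustrated with a small simulation. The sketch below is not a reconstruction of the 1936 poll: the electorate of 100,000 voters, the 25% of voters who appear on the "list", and the support rates are all made-up assumptions chosen only to show the effect.

```python
import random

rng = random.Random(1936)

# Hypothetical electorate: each voter is (listed, supports_a).
# "Listed" voters (e.g. those found in phone books) are a well-off minority
# who support candidate A far less often than everyone else.
N = 100_000
voters = []
for _ in range(N):
    listed = rng.random() < 0.25
    supports_a = rng.random() < (0.35 if listed else 0.70)
    voters.append((listed, supports_a))

true_share = sum(s for _, s in voters) / N

# Very large sample, but drawn only from the listed voters (biased frame).
frame = [s for listed, s in voters if listed]
biased_sample = rng.sample(frame, 20_000)
biased_estimate = sum(biased_sample) / len(biased_sample)

# Much smaller simple random sample from the whole electorate.
srs = rng.sample([s for _, s in voters], 1_000)
srs_estimate = sum(srs) / len(srs)

print(f"true share of A:        {true_share:.3f}")
print(f"biased frame, n=20,000: {biased_estimate:.3f}")
print(f"random sample, n=1,000: {srs_estimate:.3f}")
```

Under these assumptions the small random sample lands close to the true share, while the large sample from the incomplete frame stays far from it no matter how many respondents are added.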


Note to Teacher: You may mention the following examples to drive the point further about survey errors. One rather famous example is the time when surveys conducted by all the “reputable” pollsters (Gallup, Crossley, and Roper) in the United States in 1948 embarrassingly resulted in the wrong prediction that New York Governor Thomas Dewey would beat the reelectionist US President Harry Truman in the presidential race. No less than the famed Chicago Daily Tribune printed an early edition with a headline based on the (wrong!) poll predictions.

[Photograph: Re-electionist Harry Truman showing the Chicago Daily Tribune early edition]

There, problems resulted from the sampling design, with the interviewers being given excessive latitude in deciding whom to interview. These polls used quota sampling. Interviewers may have selected the least threatening people they encountered in the field, e.g. the best-dressed people, so that the samples chosen systematically over-represented one part of the population (and under-represented other groups). Special surveys that measure the difference between respondents and nonrespondents show that lower-income and upper-income people tend not to respond to questionnaires, so modern polling organizations prefer to use personal interviews rather than mailed questionnaires. In developed countries like the United States, the typical response rate for personal interviews is about 65%, compared to merely 25% for mailed questionnaires. A number of methods are now being tested to improve response rates in polls and other surveys.


4. Sampling Distribution, Accuracy and Precision

As was pointed out earlier, statistics generated from a sample survey are subject to both non-sampling and sampling errors. The latter arise because only a part of the population is observed. There is likely to be some difference between the sample statistic and the true value of the population parameter (that you would have obtained had a census been conducted). To know more about this difference or sampling error, and consequently establish the reliability of the sample statistic, you have to understand the chance process involved in the sample selection. For this purpose, you have to analyze the sampling distribution, or the set of all possible values that the point estimate could take under repeated sampling, and possibly approximate this sampling distribution.

When estimating, you should know something about the population to which the results will be generalized. One of the characteristics of the population that is often estimated is the mean. The population mean $\mu$ is often the parameter to be estimated. There can be several estimators of the population mean, including the sample mean, sample median, sample mode, and sample midrange. In a similar manner, there can be several estimators of the population variance $\sigma^2$. Given sample data $x_1, x_2, \ldots, x_n$, where $\bar{x}$ represents the sample mean (i.e. the sum of the data divided by the sample size $n$), the sample variance defined with denominator $n-1$,

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2,$$

and that with denominator $n$,

$$s_n^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2,$$

are two estimators of the population variance. As was earlier pointed out, a good estimator must possess two desirable properties: accuracy and precision.




Accuracy is a measure of how close the estimates are to the parameter. It can be measured by bias, i.e., the difference between the expected value of the estimator and the true value of the parameter. An estimator is said to be unbiased if its bias is zero. Otherwise, the estimator is biased. When the bias is positive, the estimator overestimates the parameter. When it is negative, the estimator underestimates the parameter.



Precision is a measure of how close the estimates are to each other. The variance of the estimator, or its standard error, gives a measure of how precise the estimator is. The smaller the standard error of an estimator, the more precise the estimator is.
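Bias and standard error can be computed exactly for a tiny population by listing every possible sample. The sketch below does this for the two variance estimators (denominators n - 1 and n) under sampling with replacement. The population values and the helper names `bias` and `standard_error` are hypothetical, chosen only for illustration.

```python
from itertools import product
from statistics import mean, pstdev, pvariance, variance

population = [2, 4, 6, 8]        # hypothetical population values
n = 2                            # sample size, drawn with replacement

sigma2 = pvariance(population)   # population variance (the parameter)

# Every possible ordered sample of size n, each equally likely.
samples = list(product(population, repeat=n))

est_n_minus_1 = [variance(s) for s in samples]   # denominator n - 1
est_n = [pvariance(s) for s in samples]          # denominator n

def bias(estimates, parameter):
    """Expected value of the estimator minus the true parameter."""
    return mean(estimates) - parameter

def standard_error(estimates):
    """Standard deviation of the estimator over all possible samples."""
    return pstdev(estimates)

for name, est in [("n - 1", est_n_minus_1), ("n", est_n)]:
    print(name, bias(est, sigma2), standard_error(est))
```

For this toy population the estimator with denominator n - 1 has zero bias, while the one with denominator n underestimates the variance on average but has the smaller standard error. The two properties do not always favor the same estimator.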

In general, we want the estimator to be both accurate and precise. We can illustrate precision and accuracy by way of an analogy. Let us represent the parameter as a target bull’s eye while the estimates of the parameter are the arrows shot by an archer. The first target (1) in the figure below illustrates a precise but not accurate estimator. The second target (2) shows that the archer or estimator is accurate but not precise. The third target (3) shows that the archer is both precise and accurate, while the last target (4) shows an estimator that is neither accurate nor precise.

[Figure: four archery targets labeled (1), (2), (3), and (4)]

Figure 3-02.5 Analogy between estimation and hitting the bull’s eye

Example: The sample mean (of a simple random sample) is an estimator of the population mean that is both accurate and precise. Its expected value is equal to the population mean itself, which is why it is unbiased and, consequently, an accurate estimator. It is precise because, according to statistical theory, its standard error is small relative to that of other commonly used estimators of the mean (for normally distributed populations, for example, it is smaller than that of the sample median). These good properties make the sample mean a good estimator of the population mean.

E. Enrichment

Encourage learners to come up with a survey to discover something that is relevant to their experience in high school. For example, they can ask about the biggest concern of learners in their grade level (Is it their academics, family, friends/peers, etc.?), what the learners in their grade level want to do after high school, or whom they want to support in the next national elections. They can explore different sampling methods or try different ways of collecting data (interviews, questionnaires, text polls). Then, have them report their findings and ask them how they can come up with an appropriate interpretation of the data they generated.

KEY POINTS

Sampling is undertaken over full enumeration (census) since selecting a sample is less time-consuming and less costly than selecting every item in the population. An analysis of a sample is also less cumbersome and more practical than an analysis of the entire population.



Probability sampling involves units obtained using a chance mechanism and requires the use of a sampling frame (a list or map of all the sampling units in the population), while in a non-probability sample, units are chosen without regard to their probability of selection. The latter type of sample should not be used for statistical inference. The basic types of probability samples include:
o Simple random sample: a sample of size n selected in such a way that each set of n elements in the population has an equal chance of being selected.
o Systematic sample: a sample drawn by first selecting a starting point in the larger population and then obtaining subsequent observations by using a constant interval between the units taken.
o Stratified random sample: a sample chosen in such a way that the population is divided into several subgroups, called strata, with random samples drawn from each stratum.
o Cluster sample: a sample where entire groups (or clusters) are chosen at random.






Types of survey errors:
o Sampling error results from chance variation from sample to sample in a probability sample. It is roughly the difference between the value obtained in a sample statistic and the value of the population parameter that would have arisen had a census been conducted. Since estimates of a parameter from a probability sample would vary from sample to sample, the variation in estimates serves as a measure of sampling error.
o Non-sampling error includes:
   - Coverage error or selection bias, which results if some groups are excluded from the frame and have no chance of being selected
   - Non-response error or bias, which occurs when people who do not respond may be different from those who do respond
   - Measurement error, which arises from weaknesses in question design, respondent error, and the interviewer’s impact on the respondent

A representative sample, selected using chance-based methods, can provide insights about a population. The size of the sample, not its relative size to the larger population, largely determines the precision of the statistics it generates.

Randomization, i.e. using chance-based procedures for selecting respondents, is the best guarantee against bias.
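The key point that precision depends on the sample size rather than on the fraction of the population sampled can be checked numerically. The sketch below uses the usual large-sample approximation for the margin of error of a sample proportion at 95% confidence. The function name `margin_of_error`, the worst-case value p = 0.5, and the two population sizes are assumptions made for illustration.

```python
from math import sqrt

def margin_of_error(n, p=0.5, z=1.96, N=None):
    """Approximate 95% margin of error for a sample proportion.

    If a population size N is given, the finite population correction
    for simple random sampling without replacement is applied.
    """
    se = sqrt(p * (1 - p) / n)
    if N is not None:
        se *= sqrt((N - n) / (N - 1))
    return z * se

# Effect of the sample size (population treated as very large).
for n in (300, 1_200, 4_800):
    print(n, round(100 * margin_of_error(n), 1), "percentage points")

# Effect of the population size for a fixed sample of 1,200.
for N in (100_000, 100_000_000):
    print(N, round(100 * margin_of_error(1_200, N=N), 2), "percentage points")
```

Under this approximation, quadrupling the sample size halves the margin of error, while a sample of 1,200 gives a margin of roughly 2.8 percentage points whether the population has a hundred thousand or a hundred million members. This is consistent with the figure of about 3 percentage points quoted for opinion polls of about 1,200 respondents.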

ASSESSMENT

I. Select the best choice.

1. The process of using sample statistics to draw conclusions about true population parameters is called
a) statistical inference
b) the scientific method
c) sampling
d) descriptive statistics
ANSWER: a

2. The universe or "totality of items or things" under consideration is called
a) a sample
b) a population
c) a parameter
d) a statistic
ANSWER: b

3. The portion of the universe that has been selected for analysis is called
a) a sample
b) a frame
c) a parameter
d) a statistic
ANSWER: a

4. A summary measure that is computed to describe a characteristic from only a sample of the population is called
a) a parameter
b) a census
c) a statistic
d) the scientific method
ANSWER: c

5. A summary measure that is computed to describe a characteristic of an entire population is called
a) a parameter
b) a census
c) a statistic
d) the scientific method
ANSWER: a

6. Which of the following is most likely a population as opposed to a sample?
a) respondents to a newspaper survey
b) the first 5 learners completing an assignment
c) every third person to arrive at the bank
d) registered voters in a county
ANSWER: d

7. Which of the following is most likely a parameter as opposed to a statistic?
a) The average score of the first five learners completing an assignment
b) The proportion of females registered to vote in a county
c) The average height of people randomly selected from a database
d) The proportion of trucks stopped yesterday that were cited for bad brakes
ANSWER: b

8. Which of the following is NOT a reason for the need for sampling?
a) It is usually too costly to study the whole population.
b) It is usually too time-consuming to look at the whole population.
c) It is sometimes destructive to observe the entire population.
d) It is always more informative to investigate a sample than the entire population.
ANSWER: d

9. Which of the following is NOT a reason for drawing a sample?
a) A sample is less time-consuming than a census.
b) A sample is less costly to administer than a census.
c) A sample is always a good representation of the target population.
d) A sample is less cumbersome and more practical to administer.
ANSWER: c

10. The Philippine Airlines Internet site provides a questionnaire instrument that can be answered electronically. Which of the 4 methods of data collection is involved when people complete the questionnaire?
a) Published sources
b) Experimentation
c) Surveying
d) Observation
ANSWER: c

II. Identify the population, the parameter of interest, the sampling frame, the sample, the sampling method, and any potential sources of bias in the following studies.

1. The producers of a television show asked Facebook users on the TV show’s Facebook page about their sentiments (favorable, unfavorable, neutral) on a segment of the TV show.
Answer: population: TV show viewers; parameter: proportion of TV viewers who have favorable sentiments toward a segment of the TV show; sampling frame: Facebook users who have access to the TV show’s Facebook page; sampling method: convenience sample; biases: coverage, non-response, and measurement errors

2. A question posted on the website of a daily newspaper in the Philippines asked visitors of the site to indicate their voter preference for the next presidential election.
Answer: population: Filipino voters; parameter: proportion of voters who would vote for some presidentiables; sampling frame: visitors of the website; sampling method: voluntary response (no randomization used); biases: coverage (the sampling frame is not the target population), and voluntary response has a “self-selection bias” (those who visit the site and respond may be predisposed to a particular answer).

3. In March 2015, Pulse Asia reported that the leading urgent concerns of Filipinos are inflation control (46%), the increase of workers' pay (44%), and the fight against government corruption (40%). On the other hand, Filipinos are least concerned with national territorial integrity (5%), terrorism (5%), and charter change (4%). The nationwide survey was conducted from March 1 to 7, 2015 with 1,200 respondents.
Answer: population: Filipinos; parameter: proportion of people who are concerned about various socio-economic issues; sampling frame: all Filipino adults; sampling method: 1,200 randomly selected respondents; biases: probably not biased. Conclusions could be generalized.


4. A sample survey of persons with disability (PWDs) was designed to be representative of PWDs by making use of PWD registers from local government units, but an assessment suggested that the registers were severely undercovering PWDs. The design was adjusted to make use of snowball sampling, where existing sampled PWDs would identify other future subjects from among their acquaintances. The study attempted to examine the proportion of PWDs who were poor.
Answer: population: PWDs; parameter: proportion of PWDs who are poor; sampling frame: PWD registers with extra PWDs from snowball sampling; sampling method: random selection from PWD registers plus snowball sampling; biases: coverage errors, non-response biases (from uncooperative PWDs)

III. Identify which sampling method is applied in the following situations.

1. The teacher randomly selects 20 boys and 15 girls from a batch of learners to be members of a group that will go on a field trip.
Answer: Stratified Sampling

2. A sample of 10 mice is selected at random from a set of 40 mice to test the effect of a certain medicine.
Answer: Simple Random Sampling

3. The people attending a certain seminar are divided into five groups. All the members of two of the five groups are asked what they think about the president.
Answer: Cluster Sampling

4. A barangay health worker asks every fourth house in the village for the ages of the children living in those households.
Answer: Systematic Sampling

5. A sales clerk for a brand of clothing asks people who come up to her whether they own an article of clothing from her brand.
Answer: Volunteer Sampling

6. A psychologist asks his patient, who suffers from depression, whether he knows other people with the same condition, so he can include them in his study.
Answer: Snowball Sampling

7. A brand manager of a toothpaste asks the ten dentists whose clinics are closest to his office whether they use a particular brand of toothpaste.
Answer: Convenience Sampling
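Three of the probability sampling methods in the situations above (simple random, systematic, and cluster sampling) can be sketched with Python's standard `random` module. The village of 100 houses and the five seminar groups of six members each are made-up illustrations, not data from the guide.

```python
import random

rng = random.Random(42)

# Simple random sampling: 10 of 40 mice, every subset of 10 equally likely.
mice = list(range(1, 41))
srs = rng.sample(mice, 10)

# Systematic sampling: random start among the first k houses,
# then every k-th house after that.
houses = list(range(1, 101))      # hypothetical village of 100 houses
k = 4
start = rng.randrange(k)          # index 0, 1, 2, or 3
systematic = houses[start::k]

# Cluster sampling: pick 2 of 5 groups at random and take everyone in them.
groups = {g: [f"{g}{i}" for i in range(1, 7)] for g in "ABCDE"}
chosen = rng.sample(sorted(groups), 2)
cluster = [person for g in chosen for person in groups[g]]

print(sorted(srs))
print(systematic[:5], "...")
print(chosen, cluster)
```

In the systematic sample only the starting house is random; in the cluster sample the randomness lies in which groups are chosen, not in which members are taken from them.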


IV. Examine each of the following questions that could be used in a survey for possible bias. Indicate how the question might be improved.

1. Do you go swimming?
a. Never
b. Rarely
c. Frequently
d. Sometimes
Possible Answer: The problem with this question is in the categories supplied for the answer. Everybody has a different idea as to what words such as ‘sometimes’ and ‘frequently’ mean. Instead, give specific time frames such as ‘twice a year’ or ‘once a month’. Also, the order of answers should follow a logical sequence; in the example above, it does not.

2. How many books have you read in the last year?
a. None
b. 1 to 5
c. 5 to 10
d. 10 to 20
e. Over 20
Possible Answer: This question may contain prestige bias: would people be more likely to say they have read plenty of books when they might actually not have read any? Also, the categories for the answers need modification, since they overlap: which box would you tick for someone who answered ‘5’ or ‘10’?

3. Do you think senior high school learners should be required to wear school uniforms?
a. Agree
b. Disagree
c. Neutral or No opinion
Possible Answer: Seems unbiased

4. What do you think about the CPP-NPA attempt to blackmail the Government?
Possible Answer: This is a very leading question which uses an emotive word, “blackmail”. It assumes that the CPP-NPA is blackmailing the Government and assumes that the respondent knows about the issues and would be able to answer. A filter question would have to be used in this case and the word “blackmail” changed.

5. What is wrong with the young people of today and what can we do about it?
Possible Answer: This question is double-barreled, leading, and ambiguous. It asks two questions in one and so needs to be split up. The word ‘wrong’ is emotive and suggests there is something not normal about the young people of today. It asks respondents to distance themselves and comment from a moral high ground.


CHAPTER 3: SAMPLING

Lesson 3: Sampling Distribution of the Sample Mean

TIME FRAME: 60 minutes

OVERVIEW OF LESSON

In this lesson, learners are given lectures (and assessments) regarding the sampling distribution of the sample mean and sample proportion (under the condition that random sampling is done with replacement).

LEARNING COMPETENCIES

At the end of the lesson, the learner should be able to:

• identify the sampling distributions of some statistics (the sample mean and the sample proportion)
• calculate the mean and variance of the sampling distribution of the sample mean and sample proportion
• describe the approximate sampling distribution of the sample mean (and sample proportion) when the sample size is large

LESSON OUTLINE

A. Introduction
B. Motivational Activity on Sampling Distribution
C. Lesson Proper:
1. Expected Value and Standard Error
2. Approximating the Sampling Distribution
3. Some Points of Confusion

KEY CONCEPTS: Sampling Distribution, Expected Value, Standard Error, Normal Curve, Central Limit Theorem


DEVELOPMENT OF THE LESSON A. Introduction In previous lessons, learners were provided concepts about sampling, including the reasons for sampling as opposed to the conduct of a full-enumeration census. It was also pointed out that probability sampling, where samples are selected using chance methods, enable the samples to be representative of the population being studied. Samples can be drawn with and without replacement. In addition, if the sampling protocol were to be replicated, then a new set of samples (and data) would be obtained, thus yielding different estimates from one sample to another. Thus, an estimate based on sample could be different if the sampling process were to be repeated many times. The set of all possible estimates generated is called the sampling distribution. In this lesson, learners will be provided more descriptions and discussions on sampling distributions. B. Motivation Activity on Sampling Distribution 1. Instruct learners to form groups of five members each. Assign a number, from 1 to 5, to each member of the group. For each group, list the weight of the members on a sheet of paper. Compute for the mean weight of the group, , and the standard deviation of the weights, . For reference, the formula for the standard deviation is

2. Tell them to list all the possible samples with two members, drawn with replacement. For example: 1-1, 1-2, 1-3, 1-4, 1-5, 2-1, 2-2, etc. There should be a total of 25 possible samples. Learners may list the weights of the learners in each of the samples.

3. For each of the samples, compute the average weight of the learners. Tell them that these are the possible values of the sample mean, $\bar{x}$. Ask them how many of the sample means are equal to the mean weight of the group. They might notice that most, if not all, of the sample means are not equal to the true mean weight. Ask them whether, based on the activity, a sample will always be representative of the population. They should notice that there is "chance error," i.e., each sample mean may differ from the population mean, although most of the differences are negligible.

4. After computing all the possible values of $\bar{x}$, ask them to compute the average of all the values of $\bar{x}$. This is the mean of all the sample means, $\mu_{\bar{x}}$. Ask them to compare this value to the mean of the group. The learners should notice that $\mu_{\bar{x}}$, which we shall refer to as the expected value of the sampling distribution of the sample mean, is equal to the population (or group's) mean $\mu$. What does this say about the mean of all the possible samples? They should mention that this suggests that while there is chance error, the sample mean appears to be a fairly good estimate of the mean $\mu$.

5. Next, ask them to compute the standard deviation of all the sample means, $\sigma_{\bar{x}}$. Ask them to also compare this with the population standard deviation $\sigma$. They should notice that the standard deviation of all the sample means (which we shall later refer to as the "standard error") is less than the population standard deviation. This suggests that the "sampling distribution" of the sample mean is more clustered around $\mu$ than the original "population." Inform them that, in fact, when sampling with replacement, the standard deviation of the sample means is $\sigma_{\bar{x}} = \sigma/\sqrt{n}$, where the sample size $n$ here is 2.

6. Also, for each sample, ask learners to compute the proportion of males (the number of males in the sample divided by the size of the sample). This should be one of three values, 0, ½, or 1, since each sample may have zero, one, or two males. This is called a sample proportion, $p$.

7. Compute the mean of all sample proportions, that is, $\mu_{p}$. How does this compare to the proportion of males in the group? This is similar to the earlier result that the mean of the sampling distribution is the targeted mean, since a proportion is essentially an average, an average of 1's and 0's.

8. Next, ask them to compute the standard deviation of all the possible values of the sample proportion, $\sigma_{p}$. Tell them to recall that for a binomial probability model, the standard deviation is $\sqrt{P(1-P)}$, where $P$ is the proportion of males in the group. Let them notice that $\sigma_{p}$ is less than $\sqrt{P(1-P)}$. In fact, $\sigma_{p}$ should equal $\sqrt{P(1-P)/n}$, where $n$ is the sample size 2.

Here is a sample table that the learners can prepare:

| Number in sample | Weights of sample | Sample mean, $\bar{x}$ | Sample proportion, $p$ |
|---|---|---|---|
| 1-1 | 40 kg, 40 kg | 40 kg | 1 |
| 1-2 | 40 kg, 38 kg | 39 kg | 0.5 |
| … | … | … | … |
| 5-5 | 39 kg, 39 kg | 39 kg | 0 |
| Total | | 1237.5 | 15 |
| Average | | 49.5 kg | 0.6 |
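Teachers who have access to a computer may wish to check steps 6 to 8 quickly. The short Python sketch below enumerates all 25 samples of size 2 and compares the mean and standard deviation of the sample proportions with the group values. The 0/1 coding of the five learners is partly an assumption: learners 1, 2 and 5 follow the sample table above, while learners 3 and 4 are assumed to be male so that the group proportion is 0.6. The variable names are illustrative only.

```python
from itertools import product
from math import sqrt

# 1 = male, 0 = female. Learners 1, 2 and 5 follow the sample table;
# learners 3 and 4 are ASSUMED male so that the group proportion is 0.6.
is_male = [1, 0, 1, 1, 0]
n = 2  # sample size

# All 5 x 5 = 25 ordered samples of size 2, drawn with replacement.
samples = list(product(is_male, repeat=n))
props = [sum(s) / n for s in samples]

P = sum(is_male) / len(is_male)      # proportion of males in the group
mean_p = sum(props) / len(props)     # mean of all sample proportions
sd_p = sqrt(sum((p - mean_p) ** 2 for p in props) / len(props))

print(len(samples), P, mean_p)       # number of samples and the two proportions
print(sd_p, sqrt(P * (1 - P) / n))   # SD of sample proportions vs. sqrt(P(1-P)/n)
print(sqrt(P * (1 - P)))             # SD of the 0/1 population, which is larger
```

The same enumeration can be repeated for the weights once the five learners' actual weights are entered.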

C. Lesson Proper

1. Expected Value and Standard Error

Given a sample data set that was drawn from a certain population, the resulting sample mean serves as an estimate of the mean of the population from which the sample was derived. Ask learners to imagine having a replicated run of the random sampling protocol. Would the new estimate of the mean be the same as the previous estimate? Point out to learners that the new estimate is likely to vary from the first estimate because of randomization. If the protocol were to be further replicated many times, then we would have a distribution of estimates. The set of all possible estimates is called the sampling distribution. This sampling distribution has a mean and a standard deviation. Henceforth, we define:

• the expected value (EV) as the mean of the sampling distribution
• the standard error (SE) as the standard deviation of the sampling distribution

It turns out that for the sampling distribution of the mean, the EV is the population mean $\mu$. That is,

$$EV = E(\bar{x}) = \mu$$

Thus, we say that the sample mean is an unbiased estimate of the population mean $\mu$. In addition, the SE can be viewed as measuring the amount of chance variability in the estimates that could be generated over all possible samples. Ask learners whether they prefer the estimates to have less variability. Their intuition should lead them to say that they desire estimates with a small SE, since in this case, the chances are good that the estimate will be close to the true value of the parameter.

Consider the data in Lesson 1-08 pertaining to the heights, weights, and BMIs of learners, but suppose you limit interest to information about the female learners, whose heights, weights, and BMIs we replicate below:

| Student | Height (in m) | Weight (in kg) | BMI (in kg/m²) |
|---|---|---|---|
| 1 | 1.64 | 40 | 14.8721 |
| 2 | 1.52 | 50 | 21.64127 |
| 3 | 1.52 | 49 | 21.20845 |
| 4 | 1.65 | 45 | 16.52893 |
| 5 | 1.02 | 60 | 57.67013 |
| 6 | 1.626 | 45 | 17.02046 |
| 7 | 1.5 | 38 | 16.88889 |
| 8 | 1.6 | 51 | 19.92188 |
| 9 | 1.42 | 42.2 | 20.92839 |
| 10 | 1.52 | 54 | 23.37258 |
| 11 | 1.48 | 46 | 21.00073 |
| 12 | 1.62 | 54 | 20.57613 |
| 13 | 1.5 | 36 | 16 |
| 14 | 1.54 | 50 | 21.08281 |
| 15 | 1.67 | 63 | 22.58955 |
| (Population) Average | 1.52 | 48.21 | 22.09 |
| (Population) Standard Deviation | 0.15 | 7.40 | 9.84 |

A visualization of the distribution of the heights, weights, and BMIs of the 15 learners is given in Figure 3-03.1 a, b, and c, respectively.

Figure 3-03.1. Distribution of (a) Heights, (b) Weights, and (c) BMIs of 15 Female Learners (histograms; horizontal axes: Height, Weight, BMI; vertical axes: Fraction)

Suppose you are interested in estimating the (population) average height, (population) average weight, and (population) average BMI of these 15 female learners by getting estimates based on the (sample) average height, (sample) average weight, and (sample) average BMI of two female learners selected at random. Draw a box of 15 tickets and mention that these tickets represent the 15 female learners. Ask learners how many possible samples of size 2 can be obtained. They should say that there are 15 x 15 = 225 possible samples of size 2 (i.e., 225 possible samples of 2 female learners that can be drawn, with replacement, from the population of 15 female learners).

 1   2   3   4   5
 6   7   8   9  10
11  12  13  14  15

Table 3-03.1 lists all the 225 possible random samples with replacement of sample size 2 that could be taken from the box above, together with the (sample) average height, (sample) average weight, and (sample) average BMI for the specific sample drawn. For example, if we had obtained sample number 80, with the tickets 6 and 5 drawn, meaning student number 6 and student number 5 were selected, then for this sample, the averages of the heights, weights, and BMIs of the learners are 1.323 m, 52.5 kg, and 37.3453 kg/m², respectively.

| Student | Height (in m) | Weight (in kg) | BMI (in kg/m²) |
|---|---|---|---|
| 6 | 1.626 | 45 | 17.02046 |
| 5 | 1.02 | 60 | 57.67013 |
| Average | 1.323 | 52.5 | 37.3453 |

Each of the 225 possible samples has an equal chance, i.e., 1/225, of being selected. The sample averages for the heights, weights, and BMIs have sampling distributions illustrated in Figure 3-03.2 (a), (b) and (c). While only one sample of size 2 would actually be chosen, there is a host of possible samples and thus many possible sample averages for the heights, weights, and BMIs that could serve as estimates of the population averages. Estimates of the population average height (1.52 meters), for instance, can vary from as low as 1.02 meters to as high as 1.67 meters. Estimates of the population average weight (48.21 kg) can range from 36 kg to 63 kg, while for the population average BMI (22.09 kg/m²), the sample averages can go from as low as 14.87 kg/m² to as high as 57.67 kg/m².
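For classes with access to a computer, the listing of all 225 samples need not be typed by hand. The following Python sketch, offered only as one possible way of doing it, pairs every learner with every learner (with replacement) and averages their heights, weights, and BMIs. The function name `all_samples` and the row layout are illustrative choices, not part of the lesson.

```python
from itertools import product

# (height in m, weight in kg, BMI in kg/m^2) for the 15 female learners.
learners = [
    (1.64, 40, 14.8721), (1.52, 50, 21.64127), (1.52, 49, 21.20845),
    (1.65, 45, 16.52893), (1.02, 60, 57.67013), (1.626, 45, 17.02046),
    (1.5, 38, 16.88889), (1.6, 51, 19.92188), (1.42, 42.2, 20.92839),
    (1.52, 54, 23.37258), (1.48, 46, 21.00073), (1.62, 54, 20.57613),
    (1.5, 36, 16.0), (1.54, 50, 21.08281), (1.67, 63, 22.58955),
]

def all_samples(data):
    """Rows of (sample no., first, second, avg height, avg weight, avg BMI)."""
    rows = []
    ids = range(1, len(data) + 1)
    # product(...) lists the ordered pairs (1,1), (1,2), ..., (15,15).
    for number, (i, j) in enumerate(product(ids, repeat=2), start=1):
        first, second = data[i - 1], data[j - 1]
        averages = tuple((x + y) / 2 for x, y in zip(first, second))
        rows.append((number, i, j) + averages)
    return rows

rows = all_samples(learners)
print(len(rows))   # number of equally likely samples
print(rows[79])    # sample number 80: learners 6 and 5

heights = [r[3] for r in rows]
print(min(heights), max(heights))  # smallest and largest sample average height
```

Printing `rows` in full reproduces the layout of Table 3-03.1, which learners can then compare against their own hand computations for a few selected samples.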

Table 3-03.1 Distinct samples of size two (with replacement) and average heights, weights, and BMIs of sample

| Sample | First Student | Second Student | Average Height | Average Weight | Average BMI |
|---|---|---|---|---|---|
| 1 | 1 | 1 | 1.64 | 40 | 14.8721 |
| 2 | 1 | 2 | 1.58 | 45 | 18.25669 |
| 3 | 1 | 3 | 1.58 | 44.5 | 18.04028 |
| 4 | 1 | 4 | 1.645 | 42.5 | 15.70052 |
| 5 | 1 | 5 | 1.33 | 50 | 36.27112 |
| 6 | 1 | 6 | 1.633 | 42.5 | 15.94628 |
| 7 | 1 | 7 | 1.57 | 39 | 15.8805 |
| 8 | 1 | 8 | 1.62 | 45.5 | 17.39699 |
| 9 | 1 | 9 | 1.53 | 41.1 | 17.90025 |
| 10 | 1 | 10 | 1.58 | 47 | 19.12234 |
| 11 | 1 | 11 | 1.56 | 43 | 17.93642 |
| 12 | 1 | 12 | 1.63 | 47 | 17.72412 |
| 13 | 1 | 13 | 1.57 | 38 | 15.43605 |
| 14 | 1 | 14 | 1.59 | 45 | 17.97746 |
| 15 | 1 | 15 | 1.655 | 51.5 | 18.73083 |
| 16 | 2 | 1 | 1.58 | 45 | 18.25669 |
| 17 | 2 | 2 | 1.52 | 50 | 21.64127 |
| 18 | 2 | 3 | 1.52 | 49.5 | 21.42486 |
| 19 | 2 | 4 | 1.585 | 47.5 | 19.0851 |
| 20 | 2 | 5 | 1.27 | 55 | 39.6557 |
| 21 | 2 | 6 | 1.573 | 47.5 | 19.33087 |
| 22 | 2 | 7 | 1.51 | 44 | 19.26508 |
| 23 | 2 | 8 | 1.56 | 50.5 | 20.78158 |
| 24 | 2 | 9 | 1.47 | 46.1 | 21.28483 |
| 25 | 2 | 10 | 1.52 | 52 | 22.50693 |
| 26 | 2 | 11 | 1.5 | 48 | 21.321 |
| 27 | 2 | 12 | 1.57 | 52 | 21.1087 |
| 28 | 2 | 13 | 1.51 | 43 | 18.82064 |
| 29 | 2 | 14 | 1.53 | 50 | 21.36204 |
| 30 | 2 | 15 | 1.595 | 56.5 | 22.11541 |
| 31 | 3 | 1 | 1.58 | 44.5 | 18.04028 |
| 32 | 3 | 2 | 1.52 | 49.5 | 21.42486 |
| 33 | 3 | 3 | 1.52 | 49 | 21.20845 |
| 34 | 3 | 4 | 1.585 | 47 | 18.86869 |
| 35 | 3 | 5 | 1.27 | 54.5 | 39.43929 |
| 36 | 3 | 6 | 1.573 | 47 | 19.11446 |
| 37 | 3 | 7 | 1.51 | 43.5 | 19.04867 |
| 38 | 3 | 8 | 1.56 | 50 | 20.56517 |
| 39 | 3 | 9 | 1.47 | 45.6 | 21.06842 |
| 40 | 3 | 10 | 1.52 | 51.5 | 22.29052 |
| 41 | 3 | 11 | 1.5 | 47.5 | 21.10459 |
| 42 | 3 | 12 | 1.57 | 51.5 | 20.89229 |
| 43 | 3 | 13 | 1.51 | 42.5 | 18.60423 |
| 44 | 3 | 14 | 1.53 | 49.5 | 21.14563 |
| 45 | 3 | 15 | 1.595 | 56 | 21.899 |
| 46 | 4 | 1 | 1.645 | 42.5 | 15.70052 |
| 47 | 4 | 2 | 1.585 | 47.5 | 19.0851 |
| 48 | 4 | 3 | 1.585 | 47 | 18.86869 |
| 49 | 4 | 4 | 1.65 | 45 | 16.52893 |
| 50 | 4 | 5 | 1.335 | 52.5 | 37.09953 |
| 51 | 4 | 6 | 1.638 | 45 | 16.7747 |
| 52 | 4 | 7 | 1.575 | 41.5 | 16.70891 |
| 53 | 4 | 8 | 1.625 | 48 | 18.22541 |
| 54 | 4 | 9 | 1.535 | 43.6 | 18.72866 |
| 55 | 4 | 10 | 1.585 | 49.5 | 19.95076 |
| 56 | 4 | 11 | 1.565 | 45.5 | 18.76483 |
| 57 | 4 | 12 | 1.635 | 49.5 | 18.55253 |
| 58 | 4 | 13 | 1.575 | 40.5 | 16.26447 |
| 59 | 4 | 14 | 1.595 | 47.5 | 18.80587 |
| 60 | 4 | 15 | 1.66 | 54 | 19.55924 |
| 61 | 5 | 1 | 1.33 | 50 | 36.27112 |
| 62 | 5 | 2 | 1.27 | 55 | 39.6557 |
| 63 | 5 | 3 | 1.27 | 54.5 | 39.43929 |
| 64 | 5 | 4 | 1.335 | 52.5 | 37.09953 |
| 65 | 5 | 5 | 1.02 | 60 | 57.67013 |
| 66 | 5 | 6 | 1.323 | 52.5 | 37.3453 |
| 67 | 5 | 7 | 1.26 | 49 | 37.27951 |
| 68 | 5 | 8 | 1.31 | 55.5 | 38.79601 |
| 69 | 5 | 9 | 1.22 | 51.1 | 39.29926 |
| 70 | 5 | 10 | 1.27 | 57 | 40.52136 |
| 71 | 5 | 11 | 1.25 | 53 | 39.33543 |
| 72 | 5 | 12 | 1.32 | 57 | 39.12313 |
| 73 | 5 | 13 | 1.26 | 48 | 36.83507 |
| 74 | 5 | 14 | 1.28 | 55 | 39.37647 |
| 75 | 5 | 15 | 1.345 | 61.5 | 40.12984 |
| 76 | 6 | 1 | 1.633 | 42.5 | 15.94628 |
| 77 | 6 | 2 | 1.573 | 47.5 | 19.33087 |
| 78 | 6 | 3 | 1.573 | 47 | 19.11446 |
| 79 | 6 | 4 | 1.638 | 45 | 16.7747 |
| 80 | 6 | 5 | 1.323 | 52.5 | 37.3453 |
| 81 | 6 | 6 | 1.626 | 45 | 17.02046 |
| 82 | 6 | 7 | 1.563 | 41.5 | 16.95468 |
| 83 | 6 | 8 | 1.613 | 48 | 18.47117 |
| 84 | 6 | 9 | 1.523 | 43.6 | 18.97443 |
| 85 | 6 | 10 | 1.573 | 49.5 | 20.19652 |
| 86 | 6 | 11 | 1.553 | 45.5 | 19.0106 |
| 87 | 6 | 12 | 1.623 | 49.5 | 18.7983 |
| 88 | 6 | 13 | 1.563 | 40.5 | 16.51023 |
| 89 | 6 | 14 | 1.583 | 47.5 | 19.05164 |
| 90 | 6 | 15 | 1.648 | 54 | 19.80501 |
| 91 | 7 | 1 | 1.57 | 39 | 15.8805 |
| 92 | 7 | 2 | 1.51 | 44 | 19.26508 |
| 93 | 7 | 3 | 1.51 | 43.5 | 19.04867 |
| 94 | 7 | 4 | 1.575 | 41.5 | 16.70891 |
| 95 | 7 | 5 | 1.26 | 49 | 37.27951 |
| 96 | 7 | 6 | 1.563 | 41.5 | 16.95468 |
| 97 | 7 | 7 | 1.5 | 38 | 16.88889 |
| 98 | 7 | 8 | 1.55 | 44.5 | 18.40539 |
| 99 | 7 | 9 | 1.46 | 40.1 | 18.90864 |
| 100 | 7 | 10 | 1.51 | 46 | 20.13074 |
| 101 | 7 | 11 | 1.49 | 42 | 18.94481 |
| 102 | 7 | 12 | 1.56 | 46 | 18.73251 |
| 103 | 7 | 13 | 1.5 | 37 | 16.44445 |
| 104 | 7 | 14 | 1.52 | 44 | 18.98585 |
| 105 | 7 | 15 | 1.585 | 50.5 | 19.73922 |
| 106 | 8 | 1 | 1.62 | 45.5 | 17.39699 |
| 107 | 8 | 2 | 1.56 | 50.5 | 20.78158 |
| 108 | 8 | 3 | 1.56 | 50 | 20.56517 |
| 109 | 8 | 4 | 1.625 | 48 | 18.22541 |
| 110 | 8 | 5 | 1.31 | 55.5 | 38.79601 |
| 111 | 8 | 6 | 1.613 | 48 | 18.47117 |
| 112 | 8 | 7 | 1.55 | 44.5 | 18.40539 |
| 113 | 8 | 8 | 1.6 | 51 | 19.92188 |
| 114 | 8 | 9 | 1.51 | 46.6 | 20.42514 |
| 115 | 8 | 10 | 1.56 | 52.5 | 21.64723 |
| 116 | 8 | 11 | 1.54 | 48.5 | 20.46131 |
| 117 | 8 | 12 | 1.61 | 52.5 | 20.24901 |
| 118 | 8 | 13 | 1.55 | 43.5 | 17.96094 |
| 119 | 8 | 14 | 1.57 | 50.5 | 20.50235 |
| 120 | 8 | 15 | 1.635 | 57 | 21.25572 |
| 121 | 9 | 1 | 1.53 | 41.1 | 17.90025 |
| 122 | 9 | 2 | 1.47 | 46.1 | 21.28483 |
| 123 | 9 | 3 | 1.47 | 45.6 | 21.06842 |
| 124 | 9 | 4 | 1.535 | 43.6 | 18.72866 |
| 125 | 9 | 5 | 1.22 | 51.1 | 39.29926 |
| 126 | 9 | 6 | 1.523 | 43.6 | 18.97443 |
| 127 | 9 | 7 | 1.46 | 40.1 | 18.90864 |
| 128 | 9 | 8 | 1.51 | 46.6 | 20.42514 |
| 129 | 9 | 9 | 1.42 | 42.2 | 20.92839 |
| 130 | 9 | 10 | 1.47 | 48.1 | 22.15049 |
| 131 | 9 | 11 | 1.45 | 44.1 | 20.96456 |
| 132 | 9 | 12 | 1.52 | 48.1 | 20.75226 |
| 133 | 9 | 13 | 1.46 | 39.1 | 18.4642 |
| 134 | 9 | 14 | 1.48 | 46.1 | 21.0056 |
| 135 | 9 | 15 | 1.545 | 52.6 | 21.75897 |
| 136 | 10 | 1 | 1.58 | 47 | 19.12234 |
| 137 | 10 | 2 | 1.52 | 52 | 22.50693 |
| 138 | 10 | 3 | 1.52 | 51.5 | 22.29052 |
| 139 | 10 | 4 | 1.585 | 49.5 | 19.95076 |
| 140 | 10 | 5 | 1.27 | 57 | 40.52136 |
| 141 | 10 | 6 | 1.573 | 49.5 | 20.19652 |
| 142 | 10 | 7 | 1.51 | 46 | 20.13074 |
| 143 | 10 | 8 | 1.56 | 52.5 | 21.64723 |
| 144 | 10 | 9 | 1.47 | 48.1 | 22.15049 |
| 145 | 10 | 10 | 1.52 | 54 | 23.37258 |
| 146 | 10 | 11 | 1.5 | 50 | 22.18666 |
| 147 | 10 | 12 | 1.57 | 54 | 21.97436 |
| 148 | 10 | 13 | 1.51 | 45 | 19.68629 |
| 149 | 10 | 14 | 1.53 | 52 | 22.2277 |
| 150 | 10 | 15 | 1.595 | 58.5 | 22.98107 |
| 151 | 11 | 1 | 1.56 | 43 | 17.93642 |
| 152 | 11 | 2 | 1.5 | 48 | 21.321 |
| 153 | 11 | 3 | 1.5 | 47.5 | 21.10459 |
| 154 | 11 | 4 | 1.565 | 45.5 | 18.76483 |
| 155 | 11 | 5 | 1.25 | 53 | 39.33543 |
| 156 | 11 | 6 | 1.553 | 45.5 | 19.0106 |
| 157 | 11 | 7 | 1.49 | 42 | 18.94481 |
| 158 | 11 | 8 | 1.54 | 48.5 | 20.46131 |
| 159 | 11 | 9 | 1.45 | 44.1 | 20.96456 |
| 160 | 11 | 10 | 1.5 | 50 | 22.18666 |
| 161 | 11 | 11 | 1.48 | 46 | 21.00073 |
| 162 | 11 | 12 | 1.55 | 50 | 20.78843 |
| 163 | 11 | 13 | 1.49 | 41 | 18.50037 |
| 164 | 11 | 14 | 1.51 | 48 | 21.04177 |
| 165 | 11 | 15 | 1.575 | 54.5 | 21.79514 |
| 166 | 12 | 1 | 1.63 | 47 | 17.72412 |
| 167 | 12 | 2 | 1.57 | 52 | 21.1087 |
| 168 | 12 | 3 | 1.57 | 51.5 | 20.89229 |
| 169 | 12 | 4 | 1.635 | 49.5 | 18.55253 |
| 170 | 12 | 5 | 1.32 | 57 | 39.12313 |
| 171 | 12 | 6 | 1.623 | 49.5 | 18.7983 |
| 172 | 12 | 7 | 1.56 | 46 | 18.73251 |
| 173 | 12 | 8 | 1.61 | 52.5 | 20.24901 |
| 174 | 12 | 9 | 1.52 | 48.1 | 20.75226 |
| 175 | 12 | 10 | 1.57 | 54 | 21.97436 |
| 176 | 12 | 11 | 1.55 | 50 | 20.78843 |
| 177 | 12 | 12 | 1.62 | 54 | 20.57613 |
| 178 | 12 | 13 | 1.56 | 45 | 18.28807 |
| 179 | 12 | 14 | 1.58 | 52 | 20.82947 |
| 180 | 12 | 15 | 1.645 | 58.5 | 21.58284 |
| 181 | 13 | 1 | 1.57 | 38 | 15.43605 |
| 182 | 13 | 2 | 1.51 | 43 | 18.82064 |
| 183 | 13 | 3 | 1.51 | 42.5 | 18.60423 |
| 184 | 13 | 4 | 1.575 | 40.5 | 16.26447 |
| 185 | 13 | 5 | 1.26 | 48 | 36.83507 |
| 186 | 13 | 6 | 1.563 | 40.5 | 16.51023 |
| 187 | 13 | 7 | 1.5 | 37 | 16.44445 |
| 188 | 13 | 8 | 1.55 | 43.5 | 17.96094 |
| 189 | 13 | 9 | 1.46 | 39.1 | 18.4642 |
| 190 | 13 | 10 | 1.51 | 45 | 19.68629 |
| 191 | 13 | 11 | 1.49 | 41 | 18.50037 |
| 192 | 13 | 12 | 1.56 | 45 | 18.28807 |
| 193 | 13 | 13 | 1.5 | 36 | 16 |
| 194 | 13 | 14 | 1.52 | 43 | 18.54141 |
| 195 | 13 | 15 | 1.585 | 49.5 | 19.29478 |
| 196 | 14 | 1 | 1.59 | 45 | 17.97746 |
| 197 | 14 | 2 | 1.53 | 50 | 21.36204 |
| 198 | 14 | 3 | 1.53 | 49.5 | 21.14563 |
| 199 | 14 | 4 | 1.595 | 47.5 | 18.80587 |
| 200 | 14 | 5 | 1.28 | 55 | 39.37647 |
| 201 | 14 | 6 | 1.583 | 47.5 | 19.05164 |
| 202 | 14 | 7 | 1.52 | 44 | 18.98585 |
| 203 | 14 | 8 | 1.57 | 50.5 | 20.50235 |
| 204 | 14 | 9 | 1.48 | 46.1 | 21.0056 |
| 205 | 14 | 10 | 1.53 | 52 | 22.2277 |
| 206 | 14 | 11 | 1.51 | 48 | 21.04177 |
| 207 | 14 | 12 | 1.58 | 52 | 20.82947 |
| 208 | 14 | 13 | 1.52 | 43 | 18.54141 |
| 209 | 14 | 14 | 1.54 | 50 | 21.08281 |
| 210 | 14 | 15 | 1.605 | 56.5 | 21.83618 |
| 211 | 15 | 1 | 1.655 | 51.5 | 18.73083 |
| 212 | 15 | 2 | 1.595 | 56.5 | 22.11541 |
| 213 | 15 | 3 | 1.595 | 56 | 21.899 |
| 214 | 15 | 4 | 1.66 | 54 | 19.55924 |
| 215 | 15 | 5 | 1.345 | 61.5 | 40.12984 |
| 216 | 15 | 6 | 1.648 | 54 | 19.80501 |
| 217 | 15 | 7 | 1.585 | 50.5 | 19.73922 |
| 218 | 15 | 8 | 1.635 | 57 | 21.25572 |
| 219 | 15 | 9 | 1.545 | 52.6 | 21.75897 |
| 220 | 15 | 10 | 1.595 | 58.5 | 22.98107 |
| 221 | 15 | 11 | 1.575 | 54.5 | 21.79514 |
| 222 | 15 | 12 | 1.645 | 58.5 | 21.58284 |
| 223 | 15 | 13 | 1.585 | 49.5 | 19.29478 |
| 224 | 15 | 14 | 1.605 | 56.5 | 21.83618 |
| 225 | 15 | 15 | 1.67 | 63 | 22.58955 |

Figure 3-03.2. Sampling distributions of the sample means of size n=2 learners for their (a) heights, (b) weights and (c) BMI levels (histograms; horizontal axes: Average Height, Average Weight, Average BMI; vertical axes: Fraction)

With regard to the sampling distribution of the sample average height, the EV may be readily calculated as the mean of all the 225 sample average heights:

$$EV = \frac{1.64 + 1.58 + \cdots + 1.67}{225} \approx 1.52$$

while the SE, the standard deviation of the sample average heights, is

$$SE = \sqrt{\frac{(1.64 - 1.52)^2 + (1.58 - 1.52)^2 + \cdots + (1.67 - 1.52)^2}{225}} \approx 0.11$$

It may be noted that the true values of the population average and population standard deviation are 1.52 and 0.15, respectively. In practice, each sample average (height) will be off from the population average (height) by some "chance error." How big the chance error is likely to be is roughly measured by the SE. The sample averages are rarely more than 2 or 3 SEs away from the population average. Learners will note that the EV, i.e., the center of the sampling distribution, is equal to the targeted population mean (height), $\mu \approx 1.52$, while the SE, the standard deviation of the sample average heights, is less than the standard deviation of the population of heights data, $\sigma \approx 0.15$. It turns out that theoretically, the SE is the ratio of the population standard deviation to the square root of the sample size, i.e.,

$$SE = \frac{\sigma}{\sqrt{n}} \approx \frac{0.15}{\sqrt{2}} \approx 0.11$$

Similarly, learners can verify that for weights, the sampling distribution has EV $\approx 48.21$, which is equal to the mean of the weights distribution, $\mu \approx 48.21$, and that the SE, the standard deviation of the sample average weights, is the ratio of the (population) standard deviation of the weights to the square root of the sample size, i.e.,

$$SE = \frac{\sigma}{\sqrt{n}} \approx \frac{7.40}{\sqrt{2}} \approx 5.23$$

Learners may be able to see that the pattern holds true for BMI. That is, the sampling distribution has an expected value EV $\approx 22.09$ that is equal to the mean of the BMI distribution, $\mu \approx 22.09$, and that the SE, the standard deviation of the sample average BMIs, is also the ratio of the (population) standard deviation of the BMIs to the square root of the sample size, i.e.,

$$SE = \frac{\sigma}{\sqrt{n}} \approx \frac{9.84}{\sqrt{2}} \approx 6.96$$

Learners should now recognize that

• a sample statistic (a summary measure from sample data such as the sample mean, sample percentage, sample median, and even the sample standard deviation) has an associated sampling distribution, arising from the fact that if we were to take all possible samples that could be generated, the sample statistic would differ from sample to sample

• in practice, a sample statistic (such as a sample mean) will generally not be equal to the targeted parameter (i.e., the population mean)

• the difference between the sample statistic and the targeted parameter is a chance error, which you hope to be small. How big the chance error is likely to be is roughly measured by the SE, which for the sample mean is the ratio of the (population) standard deviation to the square root of the sample size

It would be desirable to have the EV of the sampling distribution equal to the targeted parameter, in which case the estimate is an unbiased estimate of the parameter.

• Sample means turn out to be unbiased estimates of the population mean, i.e., the expected value of the sampling distribution of the sample means is the population mean.
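The claims about the EV and the SE can be checked by brute force. The sketch below is only one way of doing so: it lists every sample of size 2 (with replacement) from the 15 learners, computes the mean and standard deviation of the resulting sample means, and prints them next to the population mean and the population standard deviation divided by the square root of 2. The helper name `ev_and_se` and the dictionary `results` are illustrative.

```python
from itertools import product
from math import sqrt
from statistics import fmean, pstdev

heights = [1.64, 1.52, 1.52, 1.65, 1.02, 1.626, 1.5, 1.6,
           1.42, 1.52, 1.48, 1.62, 1.5, 1.54, 1.67]
weights = [40, 50, 49, 45, 60, 45, 38, 51, 42.2, 54, 46, 54, 36, 50, 63]
bmis = [14.8721, 21.64127, 21.20845, 16.52893, 57.67013, 17.02046,
        16.88889, 19.92188, 20.92839, 23.37258, 21.00073, 20.57613,
        16.0, 21.08281, 22.58955]

def ev_and_se(population, n=2):
    """EV and SE of the sample mean over ALL samples of size n (with replacement)."""
    sample_means = [fmean(s) for s in product(population, repeat=n)]
    return fmean(sample_means), pstdev(sample_means)

results = {}
for name, data in (("height", heights), ("weight", weights), ("BMI", bmis)):
    ev, se = ev_and_se(data)
    results[name] = (ev, se, fmean(data), pstdev(data) / sqrt(2))
    # Columns: EV, SE, population mean, population SD / sqrt(2)
    print(name, [round(v, 4) for v in results[name]])
```

In each printed row, the first and third numbers agree (EV equals the population mean), and so do the second and fourth (SE equals the population standard deviation over the square root of the sample size).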

Also, you would want the standard error, i.e., the standard deviation of the sampling distribution, to take a small value, as this would imply that the estimate is reliable. Typically, to have a small standard error, you would have to increase the sample size.

2. Approximating the Sampling Distribution

Inform learners that an approximation of the histogram of the sampling distribution can be obtained when the sample size is large.

In the last chapter, learners learned about the normal curve and how to determine areas under the curve, as well as how to obtain specific percentiles of the normal curve. One of the reasons why the normal curve is important in statistical applications is that under some fairly regular conditions, the normal curve can be used to approximate sampling distributions of sample averages and sample proportions regardless of the original shape of the parent distributions provided that the sample sizes are rather large. This is called the Central Limit Theorem. Example Revisited: Recall the earlier example that provided an illustration of the sampling distribution of the sample mean for a sample of size 2 taken with replacement from a box model with 15 tickets (representing 15 female learners).

Show learners the results of a simulation experiment conducted with the statistical software Stata. This simulation experiment consisted of generating 10,000 random samples with replacement of sample size n=3, n=5, n=9 and n=14 from the box model representing the distribution of heights, weights, and BMIs of the N=15 learners. The histograms of the resulting (simulated) sampling distributions for heights are shown in Figure 3-03.3 (i)-(iv).

Figure 3-03.3. Sampling Distribution of the Sample Mean Height (taken from a random sample with replacement) of size (i) n=3; (ii) n=5; (iii) n=9; (iv) n=14. (Histograms; horizontal axes: average height; vertical axes: Percent.)

What can learners notice as n gets larger? They should notice that the sampling distribution looks more and more like a normal curve as the sample size gets larger.
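The simulation described above used Stata, but a similar experiment can be run in any language. The following Python sketch is an illustration under stated assumptions, not a reproduction of the original Stata run: the seed, the helper name `simulate_means`, and the use of `random.choices` for drawing with replacement are all choices made here. It draws 10,000 samples of each size from the 15 heights and compares the standard deviation of the simulated sample means with the population standard deviation divided by the square root of n.

```python
import random
from math import sqrt
from statistics import fmean, pstdev

heights = [1.64, 1.52, 1.52, 1.65, 1.02, 1.626, 1.5, 1.6,
           1.42, 1.52, 1.48, 1.62, 1.5, 1.54, 1.67]

def simulate_means(population, n, reps=10_000, seed=1):
    """Sample means from `reps` random samples of size n, drawn with replacement."""
    rng = random.Random(seed)  # the seed is arbitrary; it only makes runs repeatable
    return [fmean(rng.choices(population, k=n)) for _ in range(reps)]

sigma = pstdev(heights)  # population standard deviation of the 15 heights
results = {}
for n in (3, 5, 9, 14):
    means = simulate_means(heights, n)
    results[n] = (fmean(means), pstdev(means), sigma / sqrt(n))
    # Columns: n, simulated EV, simulated SE, sigma / sqrt(n)
    print(n, round(results[n][0], 3), round(results[n][1], 4), round(results[n][2], 4))
```

Learners can pass the list `means` to any plotting tool to draw histograms like those in the figure and watch the spread shrink as n grows.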

Figure 3-03.4 shows the sampling distribution for average weights of samples of size n=3, n=5, n=9 and n=14. Learners should notice that the normal approximation for the sampling distribution is already quite good even for n=3, but gets even better for n=14.

Figure 3-03.4. Sampling Distribution of the Sample Mean Weight (taken from a random sample with replacement) of size (i) n=3; (ii) n=5; (iii) n=9; (iv) n=14. (Histograms; horizontal axes: average weight; vertical axes: Percent.)

For the BMI values, an illustration of the sampling distribution for sample averages of sample size n=3, n=5, n=9 and n=14 is given in Figure 3-03.5, where we observe that for "small samples" (of size n=3, 5), the sampling distribution appears to be a "mixture" of at least two distributions, but as the sample size increases, once again the sampling distribution appears to stabilize toward a normal curve.

Figure 3-03.5. Sampling Distribution of the Sample Mean BMI (taken from a random sample with replacement) of size (i) n=3; (ii) n=5; (iii) n=9; (iv) n=14. (Histograms; horizontal axes: average BMI; vertical axes: Percent.)
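One rough way to put a number on how far a sampling distribution is from a symmetric, normal-looking shape is its skewness, which is 0 for a normal curve. The sketch below is an illustration only: the skewness measure, the seed, and the helper names are choices made here and are not part of the original simulation. It simulates sample mean BMIs for each sample size and prints the skewness, which should shrink as n grows. The lumpy shape for small n plausibly reflects the single very large BMI value (57.67), since a small sample either includes it or does not.

```python
import random
from statistics import fmean, pstdev

bmis = [14.8721, 21.64127, 21.20845, 16.52893, 57.67013, 17.02046,
        16.88889, 19.92188, 20.92839, 23.37258, 21.00073, 20.57613,
        16.0, 21.08281, 22.58955]

def skewness(values):
    """Average cubed standardized deviation; 0 for a perfectly symmetric shape."""
    m, s = fmean(values), pstdev(values)
    return fmean([((v - m) / s) ** 3 for v in values])

def simulate_means(population, n, reps=10_000, seed=1):
    """Sample means from `reps` random samples of size n, drawn with replacement."""
    rng = random.Random(seed)  # arbitrary seed, used only for repeatability
    return [fmean(rng.choices(population, k=n)) for _ in range(reps)]

skew = {n: skewness(simulate_means(bmis, n)) for n in (3, 5, 9, 14)}
for n, value in skew.items():
    print(n, round(value, 2))  # skewness of the simulated sample mean BMIs
```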

Summary

Three Major Points about the Sampling Distribution of the Sample Mean
(i) The EV is the population mean $\mu$.
(ii) The SE of the mean is $\sigma/\sqrt{n}$ (for samples with replacement), where $\sigma$ is the population standard deviation.
(iii) The shape is approximately normal, provided the sample size is large enough, and regardless of the shape of the parent distribution.

The first major point, that the EV is the population mean $\mu$, illustrates why the sample mean is a reasonable estimate of the population mean. While there may be some chance errors involved, on average, the chance errors cancel out. That is, the average of the sampling distribution is the population mean, which we wish to estimate.

The second major point, that the SE of the mean is $\sigma/\sqrt{n}$ (for samples with replacement), where $\sigma$ is the population standard deviation, provides us a fast way to calculate the SE of the mean (when we have generated a random sample with replacement). It should be stressed to learners that the SE is an important guide for establishing the reliability of the estimate. Other things being equal, you would prefer an estimate that has a small SE. In the case of estimating the population mean, you could either minimize the individual variability or increase the sample size to achieve a desired standard error level.

The third major point is called the Central Limit Theorem, and it states that whether the shape of the population histogram is skewed or symmetric, the sampling distribution of the sample mean will still be reasonably approximated by a normal curve, provided that you have a rather large sample size. Actually, there are some mathematical conditions that the population distribution ought to obey, but they need not be a concern. Also, how "large" a large enough sample is will actually depend on the parent distribution. For most distributions, 30 appears to be good enough; for practically all, 100 appears to be good enough. If, however, the parent distribution is symmetric, even 12 to 15 would be good enough. If the parent distribution is a normal curve, then all we need is one observation!

Similar results can be obtained for the case of sample proportions (since a proportion is also an average of 1's and 0's):

• The expected value of the sampling distribution is the true value of the population percentage regardless of the sample size. Thus, the sample percentage (like the sample mean) is an unbiased estimate of the population percentage (respectively, the population mean).

• The SE of the sampling distribution (of sample proportions) is inversely proportional to the square root of the sample size; in fact, SE = $\sqrt{P(1-P)/n}$, so that as the sample size increases, the SE becomes smaller, i.e., the sampling distribution of the proportion tends to bunch up more toward the true value of the population proportion. Note that since the SE depends on the unknown population percentage P, we can estimate the SE by bootstrapping the population proportion P by the sample proportion p, i.e., estimating the SE with $\sqrt{p(1-p)/n}$. Alternatively, one might notice that the SE of the sample proportion is at most $1/(2\sqrt{n})$, since the quadratic function P(1-P) is maximized at ¼ when P = ½.

• As the sample size is increased, with np and np(1-p) both at least 10, the sampling distribution of sample proportions gets to be more and more like a normal curve.

It is interesting to note that when estimating percentages or proportions, it is the absolute size of the sample, i.e., n itself, that determines the SE. In summary, we can state the following major facts about the sampling distribution of the sample proportion:

Three Major Points about the Sampling Distribution of the Sample Proportion p
(i) The EV for the sample proportion is the population percentage P.
(ii) The SE of the proportion is $\sqrt{P(1-P)/n}$. Since this SE depends on the unknown population percentage P, we need to estimate it. The SE may be readily estimated by $\sqrt{p(1-p)/n}$, i.e., bootstrapping the population proportion P by the sample proportion p. Alternatively, one might notice that the SE of the sample proportion p is at most $1/(2\sqrt{n})$ since the quadratic function P(1-P) is maximized at ¼ when P = ½.
(iii) The shape is approximately normal, provided the sample size is large enough, with np and np(1-p) both at least 10.
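The three points above translate into a few lines of arithmetic. The Python sketch below is a minimal illustration; the function names and the values n = 100 and p = 0.4 are hypothetical and chosen only for the example. It computes the SE of a sample proportion, the upper bound on the SE, and the rule of thumb for the normal approximation as stated in this lesson.

```python
from math import sqrt

def se_proportion(p, n):
    """SE of a sample proportion: sqrt(p(1-p)/n). Pass the sample proportion
    when the population proportion is unknown (the plug-in estimate)."""
    return sqrt(p * (1 - p) / n)

def se_upper_bound(n):
    """Largest possible SE for sample size n, attained when the proportion is 1/2."""
    return 1 / (2 * sqrt(n))

def normal_ok(p, n, threshold=10):
    """Rule of thumb used in this lesson: np and np(1-p) both at least 10."""
    return n * p >= threshold and n * p * (1 - p) >= threshold

n, p = 100, 0.4  # hypothetical values, for illustration only
print(round(se_proportion(p, n), 4))  # estimated SE
print(se_upper_bound(n))              # SE can never exceed this value
print(normal_ok(p, n))                # whether the normal curve may be used
```

Because the bound depends only on n, it is handy when planning a survey before any data are collected.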

3. Some Points of Confusion

Often learners (and even teachers) get confused about the following:

a) Neither the original parent distribution of heights, weights, and BMI levels (this is the "population") nor the distribution of the sample data is the same as the sampling distribution. When you take a sample of the data, you can look at the sample statistics and even the shape of the distribution of the sample. But this is not the sampling distribution. The sampling distribution is an imaginary construct: the collection of all possible values that a statistic (such as a sample mean or a sample proportion) can take over all possible samples. We look into the sampling distribution to be able to make a statement about the "behavior" of a statistic (as a statistic, like a sample mean or sample proportion, serves as an estimate of a population parameter such as a population mean or population proportion).

b) The notion of independence in a random sample. The Central Limit Theorem, which discusses the behavior of the sampling distribution as the sample size gets large, depends on the notion of independence of sample data. Good and well-designed sampling (and randomized experiments) ensures independence.

c) The behavior of small samples. The Central Limit Theorem describes what happens for large samples. For the weight data, which was already rather symmetric (see Figure 3-03.1 b), even small samples would already yield a sampling distribution that is very nearly normal. However, for the BMI data, which was very skewed, the sample size would need to be fairly large for the normal approximation to work.

d) Formulas for EV and SE. There are only two sample statistics examined here, the sample mean and the sample proportion. The EV for the sample mean turns out to be the population mean (and this is why the sample mean is said to be a rather good estimator of the population mean). The EV for the sample proportion likewise turns out to be the population proportion (and this is why the sample proportion is a good estimator of the population proportion). Sample means and sample proportions might have some chance error from the true values, but the average of all possible estimates turns out to be the target. The SE of the sample mean is the ratio of the population standard deviation to the square root of the sample size. For the case of the sample proportion, which is also a mean (a mean of 1's and 0's), the SE is $\sqrt{P(1-P)/n}$, since the underlying probability model here for the population is a binomial, which has a standard deviation of $\sqrt{P(1-P)}$.
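For teachers who want to see where the formulas in (d) come from, a brief derivation is sketched below. It assumes, as noted in (b), that the n draws are independent, each with mean $\mu$ and standard deviation $\sigma$; the symbols $X_1, \ldots, X_n$ for the individual draws are introduced here only for illustration.

```latex
\begin{align*}
E(\bar{x}) &= \frac{1}{n}\sum_{i=1}^{n} E(X_i) = \mu, \\
\operatorname{Var}(\bar{x}) &= \frac{1}{n^{2}}\sum_{i=1}^{n} \operatorname{Var}(X_i)
  = \frac{\sigma^{2}}{n}
  \quad\Longrightarrow\quad SE(\bar{x}) = \frac{\sigma}{\sqrt{n}}, \\
\sigma^{2} &= P(1-P) \ \text{for 0/1 data}
  \quad\Longrightarrow\quad SE(p) = \sqrt{\frac{P(1-P)}{n}}.
\end{align*}
```

The second line is where independence is used: without it, the variance of a sum is not simply the sum of the variances.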

KEY POINTS •

For sampling distribution of sample means o The expected value is the population mean o The standard error of the mean is

σ for samples with replacement n

o The shape is approximately normal, provided the sample size is large enough, and regardless of the shape of parent distribution •

For sampling distribution of sample proportions: o The expected value is the population percentage o The standard error of the proportion is be readily estimated by

. This standard error may

, i.e.. bootstrapping the population

proportion P by the sample proportion p. Alternatively, one might notice that the SE of the sample proportion is at most

since the quadratic

function P (1-P) is maximized at ¼ when P = 1/2. o The shape is approximately normal, provided the sample size is large enough, with both np and np(1-p) both at least 10. ASSESSMENT 1. Records indicate that the value of dwellings in a certain city is skewed to the right, with a mean of P1.4 million and a standard deviation of P600,000. To check the accuracy of the records, the city officials plan to conduct a detailed appraisal of 100 dwellings, selected at random. Obtain an approximate sampling distribution model for the sample mean value of the dwellings selected. ANSWER: The sampling distribution should have an EV of P1.4 million with a SE =

. Use a normal distribution with mean P1.4 million and

standard deviation P60,000 to approximate the sampling distribution. 2. Suppose that a city gets an average of 36.4 inches of rain each year, with a standard deviation of 4.2 inches. Assume that a normal curve applies. a) During what percentage of years does the city get more than 41 inches of rain? b) Less than how much rain falls in the driest 20% of all years?

!

261$

c) A certain university is found in this city. A student has been studying in this university for 4 years. Let y be the average amount of rain in those 4 years. Describe the sampling distribution of the sample mean y d) What is the chance that those 4 years average less than 31 inches of rain? ANSWER: a) b) Identify the value of x where the normal distribution, x,

. From the table of values from . From here, we can get the value of

c) Sampling distribution has EV = 36.4 and SE =

; population of rainfall

data follows a normal curve, so the sampling distribution is, in fact, having an exact normal curve with mean 36.4 and standard deviation 2.1 d) 3. The weight of potato chips in a medium-sized bag is said to be 10 ounces. The amount that the packaging machine puts in a bag of potato chips can be modeled by a normal curve with mean 10.2 ounces and standard deviation 0.12 ounces. a) What fraction of all bags sold are underweight? b) Some of the chips are sold in bargain packs of 3 bags. What is the chance that none of the 3 is underweight? c) What is the probability that the average weight of the 3 bags is below the stated amount? d) What is the chance the average weight of a 24-bag case of potato chips is below 10 ounces? ANSWERS: a) b) c) Since the question is about the average of three, we use the sampling distribution of the sample mean. The average is the same as that of the population, 10.2. However, the standard deviation is now Therefore the answer will be,

!

262$

.

d) This is very similar to the previous question. However, the standard deviation is now

.Thus, the answer to the question is

4. Consider a school district that has 10,000 11th graders. In this district, the average weight of an 11th grader is 45 kg, with a standard deviation of 10 kg. Suppose you draw a random sample of 50 learners. What is the probability that the average weight of a sampled student will be less than 42.5 kg?

ANSWER: To solve this problem, we need to define the sampling distribution of the mean. Because our sample size (of 50) is fairly large, we might assume that the Central Limit Theorem holds, i.e., that the sampling distribution of the sample mean will approximate a normal curve. The EV of the sampling distribution is equal to the mean of the population (45 kg), while the SE of the sampling distribution (for sampling with replacement) is given by $SE = 10/\sqrt{50} \approx 1.41$ kg. Thus, using areas under a normal curve, the probability that a sample mean (of 50 learners) will have an average weight less than or equal to 42.5 kg is approximately equal to 0.038.

5. Based on past experience, a bank believes that 20% of credit card customers are considered bad credits. The bank has recently given 500 credit cards.
a) What are the mean and standard deviation of the sample proportion of bad credits among the 500 credit cards?
b) What assumptions underlie your answer in a)?
c) What is the chance that over 25% of these credit card applicants become bad credits?

ANSWER:
a) For the sampling distribution of the sample proportion, the EV, i.e., the mean of the sampling distribution, is EV = $\mu$ = 20% (i.e., the population proportion), while the SE, the standard deviation of the sampling distribution, is $SE = \sqrt{(0.20)(0.80)/500} \approx 0.018$, or 1.8%.
b) Assume that credit card clients pay their dues independently of each other so that you have a random sample of all possible clients, and that these represent less than 20% of all possible clients; n p = 500 (0.20) = 100 and n p (1 - p) = 500 (0.2) (0.8) = 80 are both at least 10.
c) Using a normal approximation to the sampling distribution, the chance that over 25% of these credit card applicants become bad credits is approximately equal to the area under a normal curve (with mean 20% and standard deviation 1.8%) to the right of 25%, i.e., about 0.0027.

6. Assume that 40% of senior high school learners have Twitter accounts.
a) We randomly choose 100 learners. Let p represent the proportion of learners in this sample that have Twitter accounts. What is the approximate model for the sampling distribution of p? Specify the name of the distribution, the mean, and standard deviation of the sampling distribution. Be sure to verify that the conditions are met.
b) What is the approximate probability that less than half of this sample have Twitter accounts?

ANSWER:
a) Normal with EV = $\mu$ = 40% and $SE = \sqrt{(0.40)(0.60)/100} \approx 0.049$, or 4.9%. The conditions are met: n p = 100 (0.40) = 40 and n p (1 - p) = 100 (0.4) (0.6) = 24 are both at least 10.
b) Using a normal approximation to the sampling distribution, the chance that less than half of the sample have Twitter accounts is approximately equal to the area under a normal curve (with mean 40% and standard deviation 4.9%) to the left of 50%, i.e., about 0.979.

7. When a truckload of bananas arrives at the pier, a random sample of 150 is selected and examined for bruises, discoloration, and other defects. The whole truckload is rejected if more than 5% of the sample is unsatisfactory. Suppose that in fact, 10% of the bananas on the truck do not meet the desired standard. What is the chance that the entire truckload of bananas will be accepted anyway?

ANSWER: 0.0206, using a normal curve with a mean of 0.10 and a standard deviation of 0.0245.
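The answers to problems 4 to 7 can be reproduced with a few lines of Python. This is a hedged sketch using only the standard library; the helper names `proportion_se` and `conditions_met` are hypothetical, and the condition check simply mirrors the one stated in the answers (n p and n p (1 - p) both at least 10).

```python
from math import sqrt
from statistics import NormalDist

def proportion_se(p, n):
    """SE of a sample proportion when sampling with replacement."""
    return sqrt(p * (1 - p) / n)

def conditions_met(p, n, threshold=10):
    """Check used in the answers: n*p and n*p*(1-p) are both at least 10."""
    return n * p >= threshold and n * p * (1 - p) >= threshold

# Problem 4: mean weight of 50 learners, population mean 45 kg, SD 10 kg
p4 = NormalDist(45, 10 / sqrt(50)).cdf(42.5)

# Problem 5: chance that more than 25% of 500 cardholders are bad credits
se5 = proportion_se(0.20, 500)
p5 = 1 - NormalDist(0.20, se5).cdf(0.25)            # unrounded SE
p5_rounded = 1 - NormalDist(0.20, 0.018).cdf(0.25)  # SE rounded to 1.8%, as in the answer

# Problem 6: chance that fewer than half of 100 learners have accounts
p6 = NormalDist(0.40, proportion_se(0.40, 100)).cdf(0.50)

# Problem 7: truckload accepted when at most 5% of 150 bananas are bad
p7 = NormalDist(0.10, proportion_se(0.10, 150)).cdf(0.05)

print(conditions_met(0.20, 500), conditions_met(0.40, 100))
print(f"{p4:.4f} {p5:.4f} {p5_rounded:.4f} {p6:.4f} {p7:.4f}")
```

Note that the value quoted for problem 5 depends on rounding the SE to 1.8% before looking up the area; with the unrounded SE the tail area is slightly smaller.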


CHAPTER 3: SAMPLING
Lesson 4: Sampling Without Replacement

TIME FRAME: 60 minutes

OVERVIEW OF LESSON
This lesson continues the discussion on random sampling and sampling distributions. The lesson first discusses preliminary concepts on the hypergeometric probability model, which serves as the model for estimating proportions when sampling without replacement. Then it reexamines the example from the previous lesson, where samples are drawn with replacement. This time, the sampling is done without replacement to determine the effect on the sampling distribution.

LEARNING COMPETENCIES
At the end of the lesson, the learner should be able to:
• identify sampling distributions of statistics (sample mean and sample proportion) when sampling is conducted without replacement
• calculate the mean (or expected value) and standard deviation (or standard error) of the sampling distribution of the sample mean (and sample proportion) when sampling is conducted without replacement
• describe the approximate sampling distribution of the sample mean (and sample proportion) when the sample size is large and when sampling is conducted without replacement
• solve problems involving sampling distributions of the sample mean

LESSON OUTLINE
A. Motivation: Survey Estimates of Voting Preferences
B. Introduction: The Binomial and Hypergeometric Probability Models
C. Main Lesson: Sampling Marbles from a Box without Replacement
D. Enrichment

KEY CONCEPTS: Sampling, Estimation, Sampling Variation, Standard Error, Central Limit Theorem


DEVELOPMENT OF THE LESSON

A. Motivation: Survey Estimates of Voting Preferences

On May 10, 2004, the country elected Gloria Macapagal Arroyo (GMA) as President with 36.4% of the votes cast, while her closest rival, Fernando Poe, Jr. (FPJ, dubbed the King of Philippine movies) garnered 33.2% of the votes. A few weeks before the elections, Pulse Asia conducted its last pre-election survey, suggesting that GMA would lead Poe 37% to 31%, while another organization, Social Weather Stations (SWS), had Arroyo leading 37% to 30%. Both surveys were within one percentage point of the actual percentage of votes received by GMA as per the official count of the Commission on Elections. SWS and Pulse Asia, however, underestimated the percentage of votes for FPJ by 3.2 and 2.2 percentage points, respectively. For the other three presidential candidates, both surveys were less than two percentage points away from the actual proportions. Months before this, however, FPJ was leading GMA. The SWS also conducted a day-of-the-election survey (exit poll) that suggested GMA getting 36.6% of the votes, and FPJ 27.6% of the votes. Although the exit poll involved more sample voters, its underestimate of the FPJ vote was quite off at 5.6 percentage points (FPJ voters may have come home late and been missed by the exit poll, given that it was raining hard that day and the poll estimates were released as early as possible).

Why do survey estimates differ from the “true value”? Why do sample proportions vary at all, from survey to survey, assuming that survey protocols (of reputable organizations) are fairly similar? How can surveys that were conducted at essentially the same time and asked similar questions get different results? Learners will probably suggest that people had not yet made up their minds on who to vote for, but by a month or so before elections, voting preferences would have already stabilized.

If estimates differ from the true value, how much variability among surveys should you expect to see? In the previous lesson, it was assumed that sampling is conducted with replacement. Clearly, in a survey, it does not make sense to ask the same person twice about his or her voting preference. So, we will examine sampling without replacement in this lesson.

B. Introduction: The Binomial and Hypergeometric Probability Models

Mention to learners that in Lesson 3-01, we re-examined coin tossing from a statistical perspective. What happens in every flip of the coin is unaffected by earlier and later flips. That is, the probability p of getting a head remains the same through all n flips of the coin.


In other words, the flips, or more precisely, their outcomes, are independent. As a sampling exercise or procedure, we call this sampling with equal probability with replacement, more commonly referred to as simple random sampling [1] (with replacement). Learners have also learned in the previous chapter, and were reminded in Lesson 3-01, that the probability of observing x heads in n independent tosses of a coin is given by

$$P(X = x) = {}_{n}C_{x}\, p^{x} (1 - p)^{n-x} \quad \text{for } x = 0, 1, 2, \ldots, n$$

where the chance p of getting heads is a fixed constant. Learners have also learned or been told before that it is called the binomial probability mass function (pmf). From Lesson 3-01, learners now know more: that the binomial pmf is the result of sampling for a proportion when the sampling is equal probability with replacement.

[1] Strictly speaking, random does not mean equal probability. However, very early on in the history of statistics, simple random was used to describe sampling with equal probability, and the usage persisted.

Consider now a box that contains N marbles, M of which are white and the rest of other colors, so that p = M/N is the proportion of white marbles.

[Figure: a box containing M white marbles and N−M non-white marbles]

Mix the marbles well. Then, draw n < N successively, without putting back those previously drawn. In this case, the outcomes of the draws are no longer independent. Hence the variable X = number of white marbles in the sample will not behave according to the binomial pmf [2].

[2] X will follow the binomial pmf if the marbles are drawn with replacement: draw, replace, mix the marbles, and repeat n times. Then, the outcomes of the draws are independent and you are drawing from the same population (of marbles) each time.

From basic counting techniques (through a branch of mathematics called combinatorics), the number of ways n can be drawn from N marbles is ${}_{N}C_{n} = \dfrac{N!}{n!\,(N-n)!}$; the number of ways x white marbles can be drawn from M and n−x from N−M non-white marbles is ${}_{M}C_{x}$ and ${}_{N-M}C_{n-x}$, respectively. Therefore, the probability of having x white marbles in the sample is

$$P(X = x) = \frac{{}_{M}C_{x} \cdot {}_{N-M}C_{n-x}}{{}_{N}C_{n}} \quad \text{for } x = 0, 1, 2, \ldots, n$$

This is the hypergeometric pmf. When N and M, and hence P = M/N, are known, the hypergeometric pmf can be computed exactly, without uncertainty. This is a mere exercise in probability.

C. Main Lesson: Sampling Marbles from a Box without Replacement

In practice, the fraction P of marbles that are white is unknown. There are many important real-world sampling situations where P is not known and the sampling is done without replacement, hence the hypergeometric pmf is useful. The aim is to infer about P. Think of the proportion of voters for candidate A. This is now a problem in statistics, no longer an exercise in computing exact probabilities.

Since in each draw equal probability is assigned to all remaining marbles, it is intuitively clear that the sampling procedure assigns equal probability for every marble to be in the sample. Hence the procedure is called sampling with equal probability without replacement, or simple random sampling without replacement. Furthermore, the equal probability property of the sampling procedure suggests that assigning equal importance or weight to the outcome of every draw should lead to a reasonable estimate for P. Picture the outcome of a sample as a sequence of n 1’s and 0’s, where 1 stands for white and 0 for not white; the sum of the sequence is X = number of whites. Giving equal weight, with the sum of the weights = 1, means 1/n for each. This leads to the estimate p = X/n, which is the same formula as in sampling with equal probability with replacement. Notice that X is the sum of individual outcomes, each of which is 1 (Head in the coin flipping experiment, White in the marble drawing experiment) or 0 otherwise (Tail, not White). Hence p is a simple average, and serves to estimate the true proportion (of white marbles in the marble drawing experiment, or of heads in the coin flipping experiment).
We will see in the next chapter that the simple mean is still used to estimate the true mean of more complex variables like height, weight, BMI, test scores, etc. This is because, when sampling with equal probability, the simple mean possesses some desirable or optimal properties. For instance, if it were feasible to draw all possible samples (of size n) and compute their simple means, then the mean of the latter (the so-called expected value of the sampling distribution) is equal to the true mean. This was illustrated in the last lesson on sampling distribution, under conditions that the sampling is conducted with replacement. Learners may then wonder if the estimates of sample means would be the same with or without replacement. What is the difference between sampling with compared to sampling without replacement?
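To make the comparison concrete, the binomial and hypergeometric pmfs described earlier can be tabulated side by side. The sketch below uses a hypothetical box (N = 20 marbles, M = 8 white, sample of n = 5); the box sizes and function names are assumptions chosen for illustration and do not come from the guide.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) when n draws are made WITH replacement."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def hypergeometric_pmf(x, n, M, N):
    """P(X = x) when n draws are made WITHOUT replacement
    from N marbles, M of which are white."""
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)

# Hypothetical box: N = 20 marbles, M = 8 white, sample of n = 5.
N, M, n = 20, 8, 5
P = M / N

def mean_and_variance_of_proportion(pmf):
    """Mean and variance of the sample proportion X/n for a pmf on x = 0..n."""
    mean = sum((x / n) * pmf[x] for x in range(n + 1))
    var = sum((x / n - mean) ** 2 * pmf[x] for x in range(n + 1))
    return mean, var

with_repl = [binomial_pmf(x, n, P) for x in range(n + 1)]
without_repl = [hypergeometric_pmf(x, n, M, N) for x in range(n + 1)]

mean_w, var_w = mean_and_variance_of_proportion(with_repl)
mean_wo, var_wo = mean_and_variance_of_proportion(without_repl)

print(f"with replacement:    mean={mean_w:.3f}  variance={var_w:.4f}")
print(f"without replacement: mean={mean_wo:.3f}  variance={var_wo:.4f}")
```

In this hypothetical box both procedures give a sample proportion whose mean equals P, but the variance is smaller without replacement, by the factor (N − n)/(N − 1).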


As will be shown in the examples below, the sample means obtained when sampling without replacement are more closely clustered together than those obtained when sampling with replacement. This means the sampling variation in the former is smaller, or that the estimate is more precise (with lower SE) compared to sampling with replacement. Tell learners that this should not come as a surprise since we sample to gain information about the population. Additional information is gained whenever a new unit is drawn. However, no new information is gained from a unit that had already been drawn previously.

When selecting a relatively small sample from a large population, the sampled units are practically independent whether you sample with replacement or without replacement. If you sample from a bathtub full of marbles, you do not need to sample with replacement because drawing one white marble hardly influences the color of the next marble to be drawn. When the population is small, obtaining a sample without replacement such that the units selected are independent is difficult. For example, if we sample without replacement from a small bag of marbles, removing one white marble can influence the next color to be drawn.

Standard Error of Sample Mean when Sampling without Replacement

How much lower is the SE of the mean without replacement? As was indicated in the last lesson, under sampling with replacement, the standard error (SE) of the sampling distribution of the mean is proportional to the population standard deviation $\sigma$ and inversely proportional to the square root of the sample size n:

$$SE = \frac{\sigma}{\sqrt{n}}$$

On the other hand, when sampling is conducted without replacement, the SE is

$$SE = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}}$$

The term $\sqrt{(N-n)/(N-1)}$ is called the finite population correction (fpc) and the ratio $n/N$ is the sampling rate, where n and N are the sample size and population size, respectively. When the sampling rate is small enough, the two SEs for the mean (where sampling is conducted with and without replacement) can be assumed to be virtually the same. But how small is “small enough”? It depends on the situation, although $n/N < 0.05$ is a workable rule of thumb in many real situations. In the majority of actual sampling applications, N is very large so the fpc is replaced by 1.

Example (Heights, Weights and BMIs of Female Learners Revisited):

In the last lesson, the heights, weights, and BMIs of 15 female learners formed the population to be studied, with the (population) average height, (population) average weight, and (population) average BMI the parameters of interest. The entire sampling distributions for the average height, average weight, and average BMI for a sample of size n=2 learners were extensively tabulated and illustrated. In addition, the behavior of the sampling distribution was also simulated with 10,000 experiments (using software) for sample sizes n=3, n=5, n=9, and n=12. It was observed that the expected value of the sampling distributions always hit the mark (i.e., the relevant population averages), but the SE got smaller with increasing sample size. In addition, the sampling distribution could be approximated rather well with a normal curve when the sample size was increased. Here, you revisit the results, but this time under the assumption of sampling without replacement. If a random sample of size n=2 were to be obtained, then there would be (15)(14) = 210 possible equally likely samples to be selected. The full list of samples is given in Table 3-04.1.

Table 3-04.1. Distinct samples of size two (without replacement) and average heights, weights, and BMIs

[Table: 210 rows, one per sample, with columns Sample, First Student, Second Student, Average Height, Average Weight, Average BMI]

Figure 3-04.2 illustrates the sampling distributions for the average height, average weight, and average BMI for samples of size n=2.

[Figure: three histograms (vertical axis: Fraction) of (a) Average Height, (b) Average Weight, and (c) Average BMI]

Figure 3-04.2. Sampling distributions of the sample means of size n=2 female learners (selected at random without replacement) for their (a) heights, (b) weights, and (c) BMI levels

Computations can also be readily made for the EVs and the SEs of the sampling distributions for the average height, average weight, and average BMI when a sample of size n=2 is taken (where sampling is done without replacement). They yield:

        Average Height    Average Weight    Average BMI
EV      1.52              48.21             22.09
SE      0.10              5.04              6.70

Recall the EVs and the SEs of the sampling distributions for the average height, average weight, and average BMI for samples of size n=2 (when sampling is conducted with replacement). They were:

        Average Height    Average Weight    Average BMI
EV      1.52              48.21             22.09
SE      0.11              5.23              6.96

As was pointed out earlier, the EV of the sampling distribution of the mean, whether sampling is done with or without replacement, is the target population parameter, i.e., the population mean. The SE for sampling without replacement, however, is less than the SE for sampling with replacement. In fact, the SE for means when sampling is done with replacement for a sample of size n is given by $SE = \sigma/\sqrt{n}$, where $\sigma$ is the population standard deviation, while the SE for means of sample size n from a population of size N, when sampling is conducted without replacement, is $SE = \dfrac{\sigma}{\sqrt{n}} \sqrt{\dfrac{N-n}{N-1}}$.
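The relationship between the two sets of SEs reported above for n=2 and N=15 can be checked numerically: multiplying each with-replacement SE by the finite population correction should give the without-replacement SE, up to rounding of the reported figures. A minimal sketch follows; the function name `fpc` and the dictionary layout are illustrative assumptions.

```python
from math import sqrt

def fpc(n, N):
    """Finite population correction for a sample of n from a population of N."""
    return sqrt((N - n) / (N - 1))

N, n = 15, 2

# SEs as reported in the lesson (rounded to two decimals)
se_with = {"height": 0.11, "weight": 5.23, "bmi": 6.96}
se_without_reported = {"height": 0.10, "weight": 5.04, "bmi": 6.70}

correction = fpc(n, N)
print(f"fpc = {correction:.4f}")
for name, se in se_with.items():
    # With-replacement SE scaled by the fpc, next to the reported value
    print(f"{name}: {se * correction:.3f} vs reported {se_without_reported[name]}")
```

Because the inputs are already rounded, small discrepancies in the second decimal place are to be expected.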

Example (continued): Learners may remember that, as the sample size increases, the sampling distribution is better approximated by a normal curve with a center given by the EV and a standard deviation given by the SE. Show learners the results of a simulation experiment conducted with the statistical software Stata. The results of 10,000 simulated random samples without replacement for sample sizes n=3, n=5, n=9 and n=12 from the box model representing the heights, weights, and BMIs of the N=15 learners are shown in Figure 3-04.3, Figure 3-04.4, and Figure 3-04.5, respectively.

[Figure: four histograms (vertical axis: Percent) of Average Height (Sampling without Replacement) for n=3, n=5, n=9, and n=12]

Figure 3-04.3. Sampling Distribution of the Sample Mean Height (taken from a random sample without replacement) of size (i) n=3; (ii) n=5; (iii) n=9; (iv) n=12

[Figure: four histograms (vertical axis: Percent) of Average Weight (Sampling without Replacement) for n=3, n=5, n=9, and n=12]

Figure 3-04.4. Sampling Distribution of the Sample Mean Weight (taken from a random sample without replacement) of size (i) n=3; (ii) n=5; (iii) n=9; (iv) n=12

[Figure: four histograms (vertical axis: Percent) of Average BMI (Sampling without Replacement) for n=3, n=5, n=9, and n=12]

Figure 3-04.5. Sampling Distribution of the Sample Mean BMI (taken from a random sample without replacement) of size (i) n=3; (ii) n=5; (iii) n=9; (iv) n=12
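A simulation of this kind can also be sketched in Python instead of Stata. The example below is an assumption-laden illustration, not the guide's actual experiment: the 15 weights are hypothetical stand-ins for the learners' data, and the names `population` and `simulate` are made up for this sketch. It draws 10,000 samples without replacement for each sample size and compares the simulated SE with the formula that includes the finite population correction.

```python
import random
from math import sqrt
from statistics import mean, pstdev

random.seed(2016)  # fixed seed so the run is repeatable

# Hypothetical weights (kg) for a population of N = 15; NOT the guide's data.
population = [38, 41, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 54, 56, 60]
N = len(population)
mu, sigma = mean(population), pstdev(population)

def simulate(n, reps=10_000):
    """Draw `reps` samples of size n without replacement; return the
    mean (EV) and standard deviation (SE) of the sample means."""
    sample_means = [mean(random.sample(population, n)) for _ in range(reps)]
    return mean(sample_means), pstdev(sample_means)

results = {}
for n in (3, 5, 9, 12):
    ev, se = simulate(n)
    theory = sigma / sqrt(n) * sqrt((N - n) / (N - 1))
    results[n] = (ev, se, theory)
    print(f"n={n:2d}  simulated EV={ev:.2f}  simulated SE={se:.3f}  formula SE={theory:.3f}")
```

`random.sample` never returns the same unit twice within a sample, which is what makes this sampling without replacement.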


What justifies the choice of sampling “without replacement” over “with replacement”? As was pointed out, more information is gained by having sampling done without replacement. Provide the following example.

Example 2: A janitor has 20 keys, and one of them is the key to a locked office door. Should he sample the keys with or without replacement? If he randomly tries the keys one by one, but does not eliminate the ones he tries, then he is sampling with replacement. In this case, the long-run average number of tries to unlock the door is 20. If he randomly tries the keys one by one, eliminating the ones that do not work, then he is sampling without replacement. In this case, the right key is equally likely to turn up at any of the 20 tries, so the long-run average number of tries to unlock the door is (1 + 20)/2 = 10.5, or about 11. Here, sampling without replacement clearly makes more sense than sampling with replacement.

D. Enrichment

Inform learners that it is often a puzzle to many why a sample of merely 1,200 respondents in a poll would be enough to represent a population of voters. Most polling organizations try to get an estimate of the proportion of voters who will vote for some candidate (or whatever behavior is of interest). As was pointed out, the SE of the sampling distribution of a sample mean where sampling is done without replacement is given by

$$SE = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}}$$

A percentage or fraction may be viewed as an average of tickets drawn from a box containing 1’s (representing those who will vote for the candidate) and 0’s (representing those who will not), where n tickets are drawn successively without replacement. If P represents the fraction of 1’s among the tickets, then the first ticket will follow a binomial probability model (with a single trial) with mean P, variance P(1−P), and thus standard deviation $\sqrt{P(1-P)}$. In consequence, the sample percentage will have a sampling distribution with EV = P and

$$SE = \sqrt{\frac{P(1-P)}{n}} \sqrt{\frac{N-n}{N-1}}$$

that can be approximated by a normal curve. With a large population size, the finite population correction (fpc) is nearly one so that the SE will be practically

$$SE = \sqrt{\frac{P(1-P)}{n}} \le \frac{1}{2\sqrt{n}}$$

The inequality above follows from the observation that P(1−P) is maximized at P = 1/2. Since the sampling distribution follows a nearly normal curve, 95% of the time we would expect the sample percentage to be within 2 SE of the true value. There would be a “margin of error” between the sample and true percentage of 2 SE, so if we allow the margin of error to be 3 percentage points, then solving for n in the equation

$$2 \cdot \frac{1}{2\sqrt{n}} = 0.03$$

yields n = 1111. This is why reputable organizations in the country tend to use about 1,200 respondents (regardless of the population size).
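The sample-size arithmetic above can be written as a pair of small functions. This is a minimal sketch under the same assumptions as the text (worst case P = 1/2, a margin of error of 2 SE, and the fpc taken as 1); the function names are hypothetical.

```python
from math import sqrt

def margin_of_error(n, p=0.5, z=2.0):
    """Margin of error (z standard errors) for a sample proportion, fpc ignored."""
    return z * sqrt(p * (1 - p) / n)

def required_sample_size(margin, p=0.5, z=2.0):
    """Sample size n at which z standard errors equal the desired margin."""
    return z**2 * p * (1 - p) / margin**2

n_needed = required_sample_size(0.03)  # 3 percentage points
moe_1200 = margin_of_error(1200)       # what 1,200 respondents deliver

print(f"n needed for a 3-point margin: {n_needed:.1f}")
print(f"margin of error with n = 1200: {moe_1200:.4f}")
```

Neither function takes the population size N as an input, which reflects the point made in the text: once N is large, precision depends on n alone.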

KEY POINTS

• Sampling with replacement results in independent draws that are unaffected by previous outcomes, but in practice, sampling is more often done without replacement since we want to gain as much information as possible. Additional information is gained whenever a new unit is drawn, but no new information is gained from a unit that had already been drawn previously (which happens when sampling is done with replacement).

• When selecting a relatively small sample from a large population, the sampled units are practically independent whether we sample with replacement or without replacement.

• While the standard error (SE) of the sampling distribution of the mean is $SE = \sigma/\sqrt{n}$ when sampling with replacement, the SE for the mean for sampling without replacement is smaller, and is given by

$$SE = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}}$$

where $\sigma$ is the population standard deviation, while n and N are the sample size and population size, respectively.
  o The term $\sqrt{(N-n)/(N-1)}$ is called the finite population correction (fpc) and the ratio $n/N$ is the sampling rate.

• When the sampling rate is small enough, the two SEs (for with and without replacement) can be assumed to be virtually the same. In the majority of actual sampling applications, N is very large so that the fpc is replaced by 1.

• For the special case when the sample mean is actually a proportion, the EV of the sampling distribution of the sample proportion p is the population proportion P; the standard error (SE) is
  o $SE = \sqrt{\dfrac{P(1-P)}{n}} \sqrt{\dfrac{N-n}{N-1}}$, where n and N are the sample size and population size, respectively.

REFERENCES

Albert, J. R. G. (2008). Basic Statistics for the Tertiary Level (ed. Roberto Padua, Welfredo Patungan, Nelia Marquez). Philippines: Rex Bookstore.
De Veaux, R. D., Velleman, P. F., and Bock, D. E. (2006). Intro Stats. Pearson Ed. Inc.
Workbooks in Statistics 1: 11th Edition, Institute of Statistics, UP Los Baños, College, Laguna 4031.

ASSESSMENT

1. A city has 300,000 registered voters, with 120,000 of them poor. A survey organization is about to take a random sample of 1,000 registered voters. Describe the sampling distribution of the fraction of poor among the 1,000 sampled voters.

ANSWER: Approximately normal with mean given by EV = 120,000/300,000 = 0.40 and standard deviation given by

$$SE = \sqrt{\frac{(0.40)(0.60)}{1000}} \sqrt{\frac{300000 - 1000}{300000 - 1}} = 0.0155$$

2. Consider a school district that has 10,000 11th graders. In this district, the average weight of an 11th grader is 45 kg, with a standard deviation of 10 kg. Suppose you draw a random sample of 50 learners. What is the probability that the average weight of a sampled student will be less than 42.5 kg?

ANSWER: To solve this problem, you need to define the sampling distribution of the mean. Because the sample size (of 50) is fairly large, you might assume that the Central Limit Theorem holds, i.e., that the sampling distribution of the sample mean will approximate a normal curve. The EV of the sampling distribution is equal to the mean of the population (45 kg), while the SE of the sampling distribution (for sampling with replacement) is given by $SE = 10/\sqrt{50} \approx 1.41$ kg. Thus, using areas under a normal curve, the probability that a sample mean (of 50 learners) will have an average weight less than or equal to 42.5 kg is approximately equal to 0.0382.

3. According to Sherlock Holmes, (The Sign of Four) “While the individual man is an insoluble puzzle, in the aggregate he becomes a mathematical certainty. You can, for example, never foretell what any one man will be up to, but you can say with precision what an average number will be up to. Individuals vary, but percentages remain constant. So says the statistician.” The statistician does not actually say that. What is Sherlock Holmes forgetting? ANSWER: Sherlock Holmes forgets about chance error, about uncertainty. The “mathematical certainty” he may be referring to is that there is some kind of predictability, the sampling distribution tends toward a normal curve (the Central Limit Theorem), but this is a statistical model that involves uncertainty.

4. One public opinion poll uses a simple random sample (without replacement) of size 1,200 drawn from a region with a population of 10 million. Another poll uses another simple random sample of size 1,200 from a region with a population of 1 million. The polls are trying to estimate the proportion of voters who are in favor of constitutional change. Other things being equal:
a) The first poll is likely to be a bit more accurate than the second;
b) The second poll is likely to be a bit more accurate than the first;
c) There is not likely to be much difference in the accuracy of the polls.

ANSWER: (c). While there is a finite population correction that should be applied to the SE for both polls, this fpc is nearly 1. The reliability (i.e. precision) of the sample proportion depends not on big N = the population size (when N is large), but rather on small n (the sample size).
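To see how close to 1 the fpc is for both polls, the sketch below evaluates it for the two population sizes. The value p = 0.5 is a hypothetical population proportion chosen only for illustration; the exercise does not give one.

```python
import math

def fpc(N, n):
    """Finite population correction for a simple random sample without replacement."""
    return math.sqrt((N - n) / (N - 1))

n = 1_200   # sample size used by both polls
p = 0.5     # hypothetical population proportion, for illustration only

se_by_population = {}
for N in (10_000_000, 1_000_000):
    se_by_population[N] = math.sqrt(p * (1 - p) / n) * fpc(N, n)
    print(f"N = {N:>10,}: fpc = {fpc(N, n):.5f}, SE = {se_by_population[N]:.5f}")
```

Both corrections are within a tenth of a percent of 1, so the two standard errors are practically identical.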

5. A survey organization wants to take a simple random sample without replacement to estimate the proportion of voters who are in favor of voting for candidate A in the next election. To keep costs down, they want to take a sample as small as possible, but their client would like to only tolerate chance errors of 1 percentage point or so in the estimate. Should they use a sample of size 100, 2,500, or 10,000? Note that past experience suggests that the population percentage is in the range 20% to 40%.

ANSWER: The sample size should be around 2,500. Since the client wants $SE = \sqrt{p(1-p)/n} \approx 0.01$, and $p(1-p)$ is largest over the range 20% to 40% at $p = 0.4$, merely solve $(0.4)(0.6)/n = (0.01)^2$, which gives $n = 2{,}400$.
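A quick comparison of the three candidate sample sizes in item 5 can be sketched as follows. It assumes the worst case p = 0.4 within the stated range and ignores the finite population correction; the variable names are illustrative.

```python
import math

p = 0.4            # worst case within the 20% to 40% range
target_se = 0.01   # chance error of about 1 percentage point

n_needed = p * (1 - p) / target_se ** 2
print(f"n needed for SE = {target_se}: about {n_needed:.0f}")

se_by_n = {n: math.sqrt(p * (1 - p) / n) for n in (100, 2_500, 10_000)}
for n, se in se_by_n.items():
    print(f"n = {n:>6,}: SE = {se:.4f}")
```

A sample of 100 gives an SE of nearly 5 percentage points, while 10,000 is more than the client needs, which is why 2,500 is the sensible choice.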

6. A simple random sample of 400 persons 15 years old and above is taken in Naga City. The total years of schooling of all the sampled persons is 3,230, so that the average educational attainment is 8.1 years. The standard deviation of the sample data is 4.1 years. Describe the sampling distribution of the sample mean.

ANSWER: Approximately a normal curve with mean given by the EV, i.e., the population average, which is estimated by the sample average of $3{,}230/400 \approx 8.1$ years, and standard deviation given by $SE = 4.1/\sqrt{400} = 0.205$ years.

CHAPTER 3: SAMPLING
Lesson 5: Sampling from a Box of Marbles, Nips, or Colored Paper Clips, and One-Peso Coins

TIME FRAME: 60 minutes

OVERVIEW OF LESSON
This lesson provides an activity to further help learners understand sampling. It focuses on sampling objects from a box, where the objects can be categorized (e.g., marbles classified as white or non-white; Nips colored chocolate candies categorized as green or non-green; colored paper clips categorized as red or non-red; one-peso coins categorized by the year they were minted). Learners then select random samples without replacement of marbles (or Nips or colored paper clips) or one-peso coins. For marbles (or Nips or colored paper clips), learners examine the distribution of colors of several samples. For one-peso coins, learners examine the age distributions of the coins (categorizing the years into two: the year 2014, and other years). Based on the data distributions, learners think about what the color distribution of all marbles (or Nips, or colored paper clips), or the age distribution of all one-peso coins, will look like.

LEARNING COMPETENCIES
At the end of the lesson, the learner should be able to:
• represent data with graphs (dot plots)
• describe statistical methods as a process for making inferences about population parameters based on a random sample from that population
• use data from a sample survey to estimate a population mean or proportion and develop a margin of error through the use of simulation for random sampling
• describe the difference in the sampling distribution when more samples are included

LESSON OUTLINE
A. Introduction
B. Main Lesson: Estimation of Probability of Getting a Head in a Single Toss of a Coin
C. Data Collection
D. Data Analysis and Interpretation
E. Enrichment

KEY CONCEPTS: Sampling, Estimation, Sampling Variation, Standard Error, Central Limit Theorem


MATERIALS NEEDED
For the activity, learners will need pencil, paper, and notes. The teacher should bring:
• a large container with either (a) marbles, (b) Nips (although the drawback is that these might melt), (c) colored paper clips, or (d) one-peso coins with different minting dates
• copies of the Activity Worksheet

DEVELOPMENT OF THE LESSON

A. Introduction
Before starting the activity, the teacher may wish to review the definitions of population, sample, population parameter, and sample statistic, reinforcing learners’ understanding of these basic concepts for the lesson/activity on sampling marbles from a box in which about 15% of the marbles are white. The box should contain at least 500 marbles. To make the lesson more tractable for learners, the teacher should prepare the box ahead of time; that is, a large box of marbles containing 15% white marbles should be ready for learners to sample before the lesson begins.

Learners should be asked to consider how random sampling could be used to explore the population parameter of interest (the proportion of white marbles). They will be presented with a large box of marbles and asked how they could possibly use it to help them with their investigation.

After learners have discussed the possibilities, they will be guided through an investigation with a series of questions on the Activity Worksheet. They will first calculate the proportion of white marbles in a single sample of marbles. They will record the proportion on a sticky note, and the class will construct a dot plot on the board showing the approximate sampling distribution of the proportion of white marbles in a sample of a fixed size. They will then take another sample of marbles and calculate the proportion of white marbles in the total number of marbles from both sample selections. A second dot plot will be constructed on the board showing the sampling distribution of the proportion of white marbles from the two samples combined.

B. Data Analysis
At this point, learners will perform an informal analysis of the data displayed on the two class dot plots. They will first be asked to use the two plots to estimate the proportion of white marbles. Then, they will be asked to compare the dot plots to analyze the effect of a larger sample size. The increased sample size should decrease the spread of the distribution. Sample outcomes are shown in the table below. The data in the sample outcomes were collected by sampling from a large box of marbles.

Sample Data Analysis
Note: Your actual samples will most likely be different from these, so use them merely as a reference for how the class can interpret the results. Twenty samples were collected for each of the sample sizes n = 30 and n = 60. The results are shown in Table 3-05.1, and the corresponding dot plots in Figure 3-05.1.

Table 3-05.1 Samples Collected for Sample Sizes n = 30 and n = 60

Sample #    Proportion of white marbles (n = 30)    Proportion of white marbles (n = 60)
1           5/30 = 0.17                             12/60 = 0.20
2           6/30 = 0.20                             15/60 = 0.25
3           3/30 = 0.10                             14/60 = 0.23
4           8/30 = 0.27                             7/60 = 0.12
5           3/30 = 0.10                             10/60 = 0.17
6           3/30 = 0.10                             10/60 = 0.17
7           6/30 = 0.20                             11/60 = 0.18
8           5/30 = 0.17                             7/60 = 0.12
9           7/30 = 0.23                             8/60 = 0.13
10          3/30 = 0.10                             10/60 = 0.17
11          3/30 = 0.10                             8/60 = 0.13
12          4/30 = 0.13                             11/60 = 0.18
13          7/30 = 0.23                             6/60 = 0.10
14          5/30 = 0.17                             11/60 = 0.18
15          5/30 = 0.17                             12/60 = 0.20
16          1/30 = 0.03                             10/60 = 0.17
17          6/30 = 0.20                             12/60 = 0.20
18          8/30 = 0.27                             11/60 = 0.18
19          1/30 = 0.03                             13/60 = 0.22
20          5/30 = 0.17                             8/60 = 0.13

Figure 3-05.1 Dot Plots for Sample Proportions of White Marbles for Sample Sizes n = 30 and n = 60
[Two dot plots. Panel 1: Proportion of White Marbles, Sample Size n = 30. Panel 2: Proportion of White Marbles, Sample Size n = 60.]
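For teachers who want to reproduce the dot plots or summarize them numerically, the sketch below uses the white-marble counts from Table 3-05.1. The function names and the text-based plot layout are illustrative choices, not part of the original activity.

```python
from collections import Counter
from statistics import mean, stdev

# Number of white marbles in each of the 20 samples (from Table 3-05.1)
white_30 = [5, 6, 3, 8, 3, 3, 6, 5, 7, 3, 3, 4, 7, 5, 5, 1, 6, 8, 1, 5]
white_60 = [12, 15, 14, 7, 10, 10, 11, 7, 8, 10, 8, 11, 6, 11, 12, 10, 12, 11, 13, 8]

def summarize(counts, n):
    """Center and spread of the sample proportions for samples of size n."""
    props = [c / n for c in counts]
    return {"min": min(props), "max": max(props), "mean": mean(props), "sd": stdev(props)}

def text_dot_plot(counts, n):
    """One row per possible proportion, one asterisk per sample with that proportion."""
    tally = Counter(counts)
    rows = [f"{c / n:.2f} | " + "*" * tally[c] for c in range(min(counts), max(counts) + 1)]
    return "\n".join(rows)

for n, counts in ((30, white_30), (60, white_60)):
    s = summarize(counts, n)
    print(f"n = {n}: min = {s['min']:.2f}, max = {s['max']:.2f}, "
          f"mean = {s['mean']:.3f}, SD = {s['sd']:.3f}")
    print(text_dot_plot(counts, n))
```

The standard deviation of the sample proportions is smaller for n = 60 than for n = 30, which is the decrease in spread that learners are expected to notice.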

Interpret the Results
During the activity, learners can be asked whether the information on the dot plots seems to support the claim that 15% of all marbles are white. Even though the proportion of white marbles will vary considerably from sample to sample, the data displayed on the dot plot should support the claim. Learners will be asked to answer the following questions on the Activity Sheet as they work through the sampling procedure and analysis. Answers to the questions are discussed below.

Teacher’s Step-By-Step Guide for This Activity

STEP 1: Preliminary Questions.

1. What is the specific question that needs to be addressed in your investigation?
Answer: After reading the questions of interest on the activity sheet, learners should recognize that the question to be addressed is, “What proportion of all marbles are white?” To set up the question, learners should recognize that the proportion of all marbles that are white is a population parameter, with the population being all the marbles produced.

2. How does this question relate to the background information given at the beginning of the activity? Be sure to use some of the bold-faced terms in your answer.
Answer: Learners should see that this question relates to the background information because it involves a population parameter (the proportion of white marbles) and a population (all marbles produced).

3. What is the population of interest in the investigation?
Answer: The population of interest is all marbles produced. It is important that learners recognize that this means not just the marbles in the large box in class, the marbles throughout the country, or even the marbles produced in a given time period, but ALL marbles produced.

4. What is the population parameter of interest in the investigation?
Answer: The population parameter of interest is the proportion of white marbles. Learners should be able to distinguish between the proportion and other possible quantities or parameters.

5. Why can we not realistically calculate the population parameter of interest directly?
Answer: It is important that learners recognize how large a quantity of marbles would be involved in calculating the proportion of white marbles in ALL marbles produced. They should also realize that production is an ongoing process. Both of these factors would make it impossible to count all marbles produced and determine the proportion of white marbles.

6. How could the concept of random sampling be used to investigate the population parameter of interest?
Answer: In answering this question, learners should mention that random sampling is a practical, reasonable way to gather information that can be used to help draw conclusions about a population of interest. They should point out that in this activity, a sample of marbles can easily be obtained at a local store and, since it is part of the population of all marbles, it can help us draw conclusions about the proportion of white marbles in all marbles produced.

STEP 2: Answer the following questions before using the samples of marbles in your investigation.

1. What are we interested in finding out about each of the samples of marbles?
Answer: The statistical question set up by learners in Step 1, #1, involved the proportion of all marbles that are white. After referring back to this question, learners should conclude that they are interested in finding out the proportion of white marbles in each sample.

2. What information do we need in order to answer our question or investigate the claim that 15% of marbles produced are white? How can we use the samples of marbles to obtain this information?
Answer: Learners are again asked to recognize that they need the proportion of white marbles in the population of all marbles produced. They should then use the background information at the beginning of the activity sheet to help them conclude that they can use the samples of marbles as samples from which they can collect information about the population of all marbles.

STEP 3: Each pair or group of learners will perform the following investigation using ONE sample of marbles. The total number of marbles in one sample is 30.

1. Calculate the proportion of white marbles in the sample. Proportion of white marbles __________________________
Answer: Answers will vary. It is recommended that the teacher know the proportion of white marbles in the box from which learners are drawing samples, so that he/she can judge whether learners’ results are reasonable and can have an idea of the range needed on the dot plot created in (3).

2. Give an estimate of the proportion of white marbles based on your sample.
Answer: Answers will vary, but should correspond with the answer obtained from the previous question.

3. Write the proportion you found in (1) on the sticky note you were given and place it in the appropriate position on the dot plot your teacher has drawn on the board. Did every sample have the same proportion of white marbles?
Answer: Learners’ answers should be checked in order to determine the range needed for the class dot plot. The horizontal axis on the dot plot should be scaled so learners can easily determine where to place their sticky note. Learners should quickly see that not every sample had the same proportion of white marbles.

4. What value is at the center of the dot plot constructed by using one sample per group?
Answer: The value at the center of the dot plot should be close to the proportion of white marbles in the box from which the samples were drawn, say 15%. The learners should conclude that it is not a coincidence that the value is close to the true value (15%).

5. Define each of the following in the context of the investigation you are performing with one sample of marbles.
Population of interest _____________________________________________________
Answer: All marbles produced. See explanation in the answer to Step 1, #3.
Population parameter of interest_____________________________________________
Answer: Proportion of white marbles in the population of all marbles. See explanation in the answer to Step 1, #4.
Sample (drawn from population of interest) ___________________________________
Answer: 30 marbles from the box. Learners should be very specific with their answer, indicating the number of marbles in the sample and where the sample was obtained.
Statistic (used to estimate population parameter of interest) _______________________
Answer: Proportion of white marbles in one sample of marbles from the box. It is important that learners indicate that it is the proportion in ONE sample, and where the sample was obtained.

STEP 4: Each pair or group of learners will perform the following investigations using two samples of marbles. The total number of marbles in two samples is 60.

1. Calculate the proportion of white marbles in the overall sample (two samples with 30 marbles in each sample). Proportion of white marbles in two samples __________________________
Answer: Answers will vary, but should be reasonable considering the proportion of white marbles in the box from which the sample was drawn.

2. Give an estimate for the proportion of white marbles based on your overall sample.
Answer: Answers will vary, but should correspond with the answer obtained for the previous question.

3. Write the proportion you found in (1) on the sticky note you were given and place it in the appropriate position on the dot plot your teacher has drawn on the board. Did every group have the same proportion of white marbles in two samples?
Answer: Learners’ answers should be checked in order to determine the range needed for the class dot plot. The horizontal axis on the dot plot should be scaled so learners can easily determine where to place their sticky note. Learners should quickly see that not every group had the same proportion of white marbles in two samples.

4. What value is at the center of the dot plot constructed by using the overall samples of each group? Is it a coincidence that the value is close to/far from the value of the proportion of white marbles in the box (say 15%)?
Answer: The value at the center of the dot plot should be close to the proportion of white marbles in the box from which the samples were drawn. The learners should conclude that it is not a coincidence that the value is close to 15%.

5. Define each of the following in the context of the investigation with two samples of marbles.
Population of interest _____________________________________________________
Answer: All marbles produced. See explanation in the answer to Step 1, #3.
Population parameter of interest_____________________________________________
Answer: Proportion of white marbles in the population of all marbles produced. See explanation in the answer to Step 1, #4.
Sample (drawn from population of interest) ___________________________________
Answer: Two samples of 30 marbles each (60 marbles in total) from the box. Learners should be very specific in their answers, indicating the number of marbles in the sample and where the sample was obtained.
Statistic (used to estimate population parameter of interest) _______________________
Answer: Proportion of white marbles in two samples of marbles from the box. It is important that learners indicate the proportion in two samples (total of 60 marbles) and where the sample was obtained.

6. What type of changes occurred in the dot plot when you used two samples of marbles instead of one?
Answer: Learners should observe that the spread of the distribution decreased when the sample size increased.

7. If you only had one sample on which to base your estimate, how far off would your estimate be? What would the worst case scenario be with one sample shown on the dot plot? What would the worst case scenario be if you had two samples shown on the dot plot?
Answer: For the sample data shown in Figure 3-05.1, learners may find a sample proportion as small as .03 or as high as .27 for n = 30. The worst case scenario for n = 60 will be .10 on the low end and .25 on the high end. Learners should recognize that the spread decreases as n increases. Learners will give similar answers based on the class dot plots for sample sizes n = 30 and n = 60.
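The classroom procedure can also be imitated on a computer, which lets learners see the change in spread with many more groups than a class can provide. The sketch below assumes a box of 500 marbles with exactly 15% white, handfuls of 30 drawn without replacement, and marbles returned to the box between handfuls; the function names, the seed, and the 2,000 simulated groups are illustrative choices.

```python
import random
from statistics import mean, stdev

def make_box(size=500, share_white=0.15):
    """Return a list with 1 for each white marble and 0 for each non-white marble."""
    n_white = round(size * share_white)
    return [1] * n_white + [0] * (size - n_white)

def group_proportion(rng, box, grabs, grab_size=30):
    """Proportion of white marbles over `grabs` handfuls, returning marbles between handfuls."""
    white = sum(sum(rng.sample(box, grab_size)) for _ in range(grabs))
    return white / (grabs * grab_size)

rng = random.Random(2016)   # fixed seed so the run can be repeated
box = make_box()
results = {}
for grabs in (1, 2):        # one handful (n = 30) versus two handfuls (n = 60)
    props = [group_proportion(rng, box, grabs) for _ in range(2_000)]
    results[30 * grabs] = (mean(props), stdev(props))
    center, spread = results[30 * grabs]
    print(f"n = {30 * grabs}: center = {center:.3f}, spread (SD) = {spread:.3f}")
```

In such a run the center stays near 0.15 for both sample sizes, while the spread for n = 60 is noticeably smaller than for n = 30.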

Note to Teacher: Should you use other materials, just change the questions as needed. For example, if the materials available are Nips, ask about the proportion of green Nips candies; if you have colored paper clips, ask about the proportion of red clips. The activity is applicable as long as the objects can be classified into two categories. The teacher can also modify the activity to ease the traffic of learners getting marbles: for example, have each person take 10 marbles per sample, so that drawing the samples goes more quickly.

C. Possible Extensions
1. After learners understand the basic concept of a sampling distribution, design activities for them. For instance, the teacher may ask for a copy of DepEd’s Basic Education Information System for the year, and ask learners to sample schools and then focus on a particular parameter, e.g., the average pupil-to-teacher ratio in high school, the average enrolment, or the proportion of high school learners who are indigenous people. Ask them to get a sample of schools, obtain sample statistics from their samples, combine their statistics with those of other learners, develop a dot plot of the sampling distribution, and describe its shape, center, spread, and outliers.
2. Learners are introduced to the concept of the variability of a statistic and its relationship to the spread of its sampling distribution. This is extended further as learners are introduced to the concept that larger samples give smaller spreads. This can lead to a discussion of the fact that the size of the population does not affect the spread of the sampling distribution when the population is fairly large.

REFERENCES
Many of the materials in this lesson were adapted from:
Population Parameters with NIPS® by Anna Bargagliotti and Jeanie Gibson. STatistics Education Web (STEW). https://www.amstat.org/education/stew/pdfs/PopulationParameterswithMMs.docx

See also:
De Veaux, R. D., Velleman, P. F., and Bock, D. E. (2006). Intro Stats. Pearson Ed. Inc.
Workbooks in Statistics 1: 11th Edition. Institute of Statistics, UP Los Baños, College, Laguna 4031.
Probability and statistics: Module 24. (2013). Australian Mathematical Sciences Institute and Education Services Australia. Retrieved from http://www.amsi.org.au/ESA_Senior_Years/PDF/InferenceProp4g.pdf

ACTIVITY SHEET 3-05

Background
A population parameter is a summary measure that describes some characteristic of a given population. In statistics, the population is the entire collection of units (individuals, households, establishments, farms, etc.) about which we want information. The population parameter is a constant value that does not change. Many times, it is impractical or even impossible to calculate the population parameter of interest, the most common reason being that populations are often composed of very large numbers of units. When we cannot calculate the population parameter directly, we use a sample, which is a part of the population from which we actually collect information. From the sample, we calculate a statistic, a summary measure that describes some characteristic of the sample. A statistic will vary depending on the sample from which it was calculated. Preferably, the sample should be designed so that it is representative of the population.

How are the sample and the population related?
We use information from a sample (a statistic) to draw conclusions about a population parameter. We will use this relationship while performing the following investigation.

Question of Interest: What is the proportion of marbles that are white?

Instructions: Your class will be divided into pairs or small groups to perform the following investigation in order to answer the question above.

STEP 1: Preliminary questions.
1. What is the specific question that needs to be addressed in your investigation?
2. How does this question relate to the background information given at the beginning of the activity? Be sure to use some of the bold-faced terms in your answer.
3. What is the population of interest in the investigation?
4. What is the population parameter of interest in the investigation?
5. Why can we not realistically calculate the population parameter of interest directly?
6. How could the concept of random sampling be used to investigate the population parameter of interest?

Your group will now be presented with a large box full of marbles. Groups will then come up to the box one by one and, without looking, grab 30 marbles. The group will note the number of marbles of each color, place the sampled marbles back in the box, and repeat the process. Once two samples are recorded, the group will return to their seats to answer the questions in Steps 2 through 4.

STEP 2: Answer the following questions before using the samples of marbles in your investigation.
1. What are we interested in finding out about each of the samples of marbles?
2. What information do we need in order to answer our question about the proportion of marbles that are white? How can we use the samples of marbles to obtain this information?

STEP 3: Each pair or group of learners will perform the following investigation using one sample of marbles. The total number of marbles in one sample is 30.
1. Calculate the proportion of white marbles in the sample. Proportion of white marbles __________________________
2. Give an estimate of the proportion of white marbles based on your sample.
3. Write the proportion you found in (1) on the sticky note you were given and place it in the appropriate position on the dot plot your teacher has drawn on the board. Did every sample have the same proportion of white marbles?
4. What value is at the center of the dot plot constructed by using one sample per group?
5. Define each of the following in the context of the investigation you are performing with one sample of marbles.
Population of interest _____________________________________________________
Population parameter of interest_____________________________________________
Sample (drawn from population of interest) ___________________________________
Statistic (used to estimate population parameter of interest) _______________________

STEP 4: Each pair or group of learners will perform the following investigation using two samples of marbles. The total number of marbles in two samples is 60.
1. Calculate the proportion of white marbles in the overall sample (two samples with 30 marbles in each sample). Proportion of white marbles in two samples __________________________
2. Give an estimate for the proportion of white marbles based on your overall sample.
3. Write the proportion you found in (1) on the sticky note you were given and place it in the appropriate position on the dot plot your teacher has drawn on the board. Did every group have the same proportion of white marbles in two samples?
4. What value is at the center of the dot plot constructed by using the overall samples of each group?
5. Define each of the following in the context of the investigation with two samples of marbles.
Population of interest _____________________________________________________
Population parameter of interest_____________________________________________
Sample (drawn from population of interest) ___________________________________
Statistic (used to estimate population parameter of interest) _______________________
6. What type of changes occurred in the dot plot when you used two samples of marbles instead of one?
7. If you only had one sample on which to base your estimate, how far off would your estimate be? What would the worst case scenario be with one sample shown on the dot plot? What would the worst case scenario be if you had two samples shown on the dot plot?

Answers to Activity Sheet

STEP 1:
1. What proportion of all marbles produced is white?
2. We want to know a population parameter, which is the proportion of white marbles. The population of interest is the total population of all marbles produced.
3. All marbles produced.
4. The proportion of white marbles.
5. It would be impossible to count all marbles produced because there would be too many and production is an ongoing process.
6. We could use a sample of marbles that could reasonably be obtained at a local store and easily be counted. We would use information from the sample to draw conclusions about the proportion of white marbles in all marbles produced, since the sample is a part of this population.

STEP 2:
1. The proportion of white marbles.
2. We need the proportion of white marbles in the population of all marbles produced. We can use the samples of marbles as samples from which we can collect information about the population.

STEP 3:
1. Answers will vary.
2. Answers will vary.
3. No, not every sample had the same proportion of white marbles.
4. Answers will vary, but should be approximately equal to the true proportion of white marbles in the box. This is not a coincidence.
5. Population of interest – all marbles produced. Population parameter of interest – proportion of white marbles in the population of all marbles produced. Sample – 30 marbles from the box. Statistic – proportion of white marbles in one sample of marbles from the box.

STEP 4:
1. Answers will vary.
2. Answers will vary.
3. No, not every sample had the same proportion of white marbles.
4. Answers will vary, but should be approximately equal to the true proportion of white marbles in the box. This is not a coincidence.
5. Learners should observe that the spread of the distribution of sample proportions for sample size n = 60 is less than the spread for n = 30. Therefore, there is less variability in the sample proportions for size n = 60. For this reason, learners should conclude that the information on the dot plot for two samples seems to support the claim more than the information obtained from one sample.
6. Population of interest – all marbles produced. Population parameter of interest – proportion of white marbles in the population of all marbles produced. Sample – two samples of 30 marbles each. Statistic – proportion of white marbles in 60 marbles from the box.
7. The spread of the distribution decreased.
8. For the sample data shown in Figure 3-05.1, learners may find a sample proportion as small as .03 or as high as .27 for n = 30. The worst case scenario for n = 60 will be .10 on the low end and .25 on the high end. Learners should recognize that the spread decreases as n increases. Learners will give similar answers based on the class dot plots for sample sizes n = 30 and n = 60.

ASSESSMENT

1. Explain the difference between a population parameter and a sample statistic.
Answer: A statistic is a numerical summary computed from a sample. A parameter is a numerical summary computed from a population. A population parameter is a constant value that does not change, whereas a statistic will vary depending on the sample from which it was calculated.

2. There are several different colors of marbles. Suppose you obtained a bag of marbles and found that 10% of the marbles in the bag were a certain specified color. Describe an activity that would allow you to estimate how far away this estimate might be from the population proportion of that color of marble.
Answer: The answers will vary. Essential elements in the description would be:
a. A random sampling method using marbles as originally packaged.
b. Sample sizes that are large enough to be representative of the total population of marbles.
c. A sampling method that involves obtaining repeated samples of the same size.
d. A method of representing the sampling distribution of the proportion of the specified color of marbles, such as a dot plot or histogram.
e. Directions for performing an informal analysis of the sampling distribution displayed in the dot plot or histogram.

3. If the samples are larger, say each person gets 75 marbles, what can be the effect on the estimation of the parameter?
Answer: The bigger the sample size, the smaller the spread of the sampling distribution, and hence the more precise the estimates that can be generated.
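As a rough illustration of the answer to item 3, the standard error of a sample proportion can be compared across sample sizes. This is a sketch that assumes the box proportion is p = 0.15, as in this lesson, writes the sample proportion as \hat{p} (a symbol introduced here for illustration), and ignores the finite population correction.

```latex
\[
SE(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}, \qquad
\sqrt{\frac{(0.15)(0.85)}{30}} \approx 0.065, \quad
\sqrt{\frac{(0.15)(0.85)}{60}} \approx 0.046, \quad
\sqrt{\frac{(0.15)(0.85)}{75}} \approx 0.041
\]
```

Going from 30 to 75 marbles per sample cuts the typical chance error from about 6.5 to about 4 percentage points.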


CHAPTER 3: SAMPLING
Lesson 6: Sampling from the Periodic Table

TIME FRAME: 120 minutes

OVERVIEW OF LESSON:
This lesson introduces an activity that further helps learners understand various sampling methods. It is largely taken from a STatistics Education Web (STEW) lesson plan called It’s Elemental. Using the Periodic Table of Elements discussed in Chemistry, learners collect data using simple random and systematic sampling. When both samples are collected, learners then calculate appropriate descriptive statistics and use sampling distributions to compare the performance of the methods. They also determine how to set up a stratified random sample and a cluster sample, but only perform the cluster sample in this activity.

LEARNING COMPETENCIES
At the end of the lesson, the learner should be able to:
• use data from a sample to estimate a population mean
• describe the sampling distributions of some statistics (sample mean)
• compare sampling distributions for sample means from different sampling methods to determine the optimal sampling strategy

LESSON OUTLINE
A. Introduction
B. Data Collection and Preliminary Analysis
C. Further Analysis and Interpretation
D. Enrichment: Possible Extensions

KEY CONCEPTS: Simple Random Sampling (SRS), Systematic Sampling, Stratified Sampling, Cluster Sampling

PREREQUISITES: Before learners begin the activity/lesson, they should have basic knowledge of random sampling methods such as simple random, systematic, stratified, and cluster sampling (discussed in Lesson 3-02). They should also be familiar with univariate descriptive statistics, graphs (discussed in Chapter 1), and sampling distributions (discussed in Lessons 3-01, 3-03, and 3-04).

MATERIALS NEEDED: For the activity, learners will need pencils. The instructor should provide the activity sheet, complete with a periodic table of elements.


DEVELOPMENT OF THE LESSON

A. Introduction

Start the activity by giving learners Activity Worksheet 3-06 and explaining that the overall goal of the activity is to determine which sampling method is the most appropriate for estimating the mean atomic weight of elements in the periodic table. The main question of interest for this activity is: After performing a simple random sample and a systematic sample on the periodic table of elements, which of the two is the more appropriate method?

The various sampling methods the learners will implement in this activity are based on the periodic table of elements, so give a basic description of the periodic table, as this will be important in the design. The following description is available in the Worksheet, but go ahead and summarize it for the learners.

“The periodic table is a tabular arrangement of chemical elements, ordered by their atomic number (i.e., the number of protons in the nucleus of an element), electron configurations, and recurring chemical properties. Dmitri Mendeleev created the periodic table in 1869. The table reflects the ‘periodic’ trends in the elements. The rows of the table are called periods, and elements within a period have the same valence electron shell based on quantum mechanical theory. The groups contain elements with similar physical properties due to the number of electrons in the respective valence shell. The most up-to-date periodic table has 117 confirmed elements, 92 of them occurring naturally on Earth, with scientists producing the rest artificially in a laboratory. The value that learners will be most interested in is the atomic weight of elements, which can be determined using a weighted average of the weights for each element’s various isotopes.”

The periodic table the learners will use comes from the US National Institute of Standards and Technology (www.nist.gov). This table only has 114 elements, so for the purposes of this activity, the population size is N = 114 and the population mean atomic weight is µ = 141.09 grams per mole. The last two pages of the Activity Worksheet contain a text version of the periodic table that may be easier for some learners to use.

B. Data Collection and Preliminary Analysis

After the periodic table has been described to learners, have them all start the activity.

Strategy 1: Finding a simple random sample (without replacement) of 25 elements

Using the RANDBETWEEN() function in MS Excel, enter =RANDBETWEEN(1,114) in at least 27 cells. Note that this function yields independent draws with replacement, so duplicates can occur. Consider the sample data set in Table 3-06.1. Since the sample produced duplicate elements in at least one instance, we delete the duplicate entries (68 and 25). Learners can continue to sample elements until the sample comprises 25 unique elements.

Table 3-06.1. Example of a Simple Random Sample with Replacement

87   92   72   68   50   52   25    7   37
69   10   68   43   49   25   89   63   17
28    5   88   94   39   96  112   98   51
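For classes that have access to Python instead of Excel, the same idea can be sketched as follows. This is only an illustration under stated assumptions: the function name `draw_until_unique` and the seed are hypothetical, `randint` plays the role of RANDBETWEEN(1,114), and `random.sample` is shown as a shortcut that samples without replacement directly.

```python
import random

def draw_until_unique(population_size, sample_size, rng):
    """Draw labels with replacement, dropping repeats, until enough are unique."""
    chosen = []
    while len(chosen) < sample_size:
        label = rng.randint(1, population_size)  # plays the role of RANDBETWEEN(1, 114)
        if label not in chosen:                  # a repeated label is simply discarded
            chosen.append(label)
    return chosen

rng = random.Random(306)  # arbitrary seed so the run can be repeated

# Approach used in the lesson: draw with replacement, discard duplicates.
sample_a = draw_until_unique(114, 25, rng)

# Shortcut: sample 25 of the labels 1..114 without replacement in one step.
sample_b = rng.sample(range(1, 115), 25)

print(sorted(sample_a))
print(sorted(sample_b))
```

Both approaches give a simple random sample of 25 distinct labels; the first mirrors what learners do by hand when they delete duplicate entries and keep drawing.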

With the 25 elements, learners need to find the mean atomic weight of the sample. For Table 3-06.1, the resulting sample average is 142.2808 grams per mole.

As an alternative to Excel, learners may use a table of random digits to select the elements in the periodic table. A table of random digits is a list of the 10 digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 constructed in such a way that the digit in any position in the list has the same chance of being any one of the ten digits, and each value in the list has no influence on the other values in the list. An example of a part of this table appears below.

11164 36318 75061 37674 26320 75100 10431 20418 19228 91792
21215 91791 76831 58678 87054 31687 93205 43685 19732 08468
10438 44482 66558 37649 08882 90870 12462 41810 01806 02977
36792 26236 33266 66583 60881 97395 20461 36742 02852 50564
73944 04773 12032 51414 82384 38370 00249 80709 72605 67497
49563 12872 14063 93104 78483 72717 68714 18048 25005 04151
64208 48237 41701 73117 33242 42314 83049 21933 92813 04763
51486 72875 38605 29341 80749 80151 33835 52602 79147 08868
99756 26360 64516 17971 48478 09610 04638 17141 09227 10606
71325 55217 13015 72907 00431 45117 33827 92873 02953 85474
65285 97198 12138 53010 94601 15838 16805 61004 43516 17020
17264 57327 38224 29301 31381 38109 34976 65692 98566 29550
95639 99754 31199 92558 68368 04985 51092 37780 40261 14479
61555 76404 86210 11808 12841 45147 97438 60022 12645 62000
78137 98768 04689 87130 79225 08153 84967 64539 79493 74917
62490 99215 84987 28759 19177 14733 24550 28067 68894 38490
24216 63444 21283 07044 92729 37284 13211 37485 10415 36457
16975 95428 33226 55903 31605 43817 22250 03918 46999 98501
59138 39542 71168 57609 91510 77904 74244 50940 31553 62562
29478 59652 50414 31966 87912 87154 12944 49862 96566 48825
96155 95009 27429 72918 08457 78134 48407 26061 58754 05326
29621 66583 62966 12468 20245 14015 04014 35713 03980 03024
12639 75291 71020 17265 41598 64074 64629 63293 53307 48766
14544 37134 54714 02401 63228 26831 19386 15457 17999 18306
83403 88827 09834 11333 68431 31706 26652 04711 34593 22561
67642 05204 30697 44806 96989 68403 85621 45556 35434 09532
64041 99011 14610 40273 09482 62864 01573 82274 81446 32477
17048 94523 97444 59904 16936 39384 97551 09620 63932 03091
93039 89416 52795 10631 09728 68202 20963 02477 55494 39563
82244 34392 96607 17220 51984 10753 76272 50985 97593 34320
96990 55244 70693 25255 40029 23289 48819 07159 60172 81697
09119 74803 97303 88701 51380 73143 98251 78635 27556 20712
57666 41204 47589 78364 38266 94393 70713 53388 79865 92069
46492 61594 26729 58272 81754 14648 77210 12923 53712 87771
08433 19172 08320 20839 13715 10597 17234 39355 74816 03363
10011 75004 86054 41190 10061 19660 03500 68412 57812 57929
92420 65431 16530 05547 10683 88102 30176 84750 10115 69220
35542 55865 07304 47010 43233 57022 52161 82976 47981 46588
86595 26247 18552 29491 33712 32285 64844 69395 41387 87195
72115 34985 58036 99137 47482 06204 24138 24272 16196 04393
07428 58863 96023 88936 51343 70958 96768 74317 27176 29600
35379 27922 28906 55013 26937 48174 04197 36074 65315 12537
10982 22807 10920 26299 23593 64629 57801 10437 43965 15344
90127 33341 77806 12446 15444 49244 47277 11346 15884 28131
63002 12990 23510 68774 48983 20481 59815 67248 17076 78910
40779 86382 48454 65269 91239 45989 45389 54847 77919 41105
43216 12608 18167 84631 94058 82458 15139 76856 86019 47928
96167 64375 74108 93643 09204 98855 59051 56492 11933 64958
70975 62693 35684 72607 23026 37004 32989 24843 01128 74658
85812 61875 23570 75754 29090 40264 80399 47254 40135 69916

We can use this table by first arbitrarily choosing a line. We then read the digits in this line in successive groups of three. In each group, if the first digit is even, write down a zero; if it is odd, write down a 1. Then attach the second and third digits as they are to yield a number, and throw away the resulting number if it is larger than 114 or a repeat.

Suppose the chosen line is the third line. Reading along this line and continuing into the fourth and fifth lines, the groups of three digits are:

104-384-448-266-558-376-490-888-290-870-124-624-181-001-806-029-
773-679-226-236-332-666-658-360-881-973-952-046-136-742-028-525-056-
473-944-047-731-203-251-414-823-843-837-000-249-807-097-260-567-497

Then, according to the protocol, we would obtain the following numbers, where X marks a number that is thrown away because it is larger than 114. We stop once 25 distinct elements have been chosen:

104 X 48 66 X X 90 88 90 70 X 24 X 1 6 29
X 79 26 36 X 66 58 X 81 X X 46 X X 28 X 56
73 X 47 X 3 51 14 23 43
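The protocol for the table of random digits can be written out as a short Python sketch. The three rows of digits are copied from the third, fourth, and fifth lines of the table in this lesson; the function name `select_elements` is hypothetical, and the sketch additionally assumes that a result of 0 is thrown away, since no element is numbered 0.

```python
# Third, fourth, and fifth lines of the table of random digits in this lesson.
ROWS = [
    "10438 44482 66558 37649 08882 90870 12462 41810 01806 02977",
    "36792 26236 33266 66583 60881 97395 20461 36742 02852 50564",
    "73944 04773 12032 51414 82384 38370 00249 80709 72605 67497",
]

def select_elements(rows, population_size=114, sample_size=25):
    """Apply the three-digit protocol until enough distinct labels are chosen."""
    digits = "".join(rows).replace(" ", "")
    chosen = []
    for i in range(0, len(digits) - 2, 3):
        first = int(digits[i])
        last_two = int(digits[i + 1:i + 3])
        # An odd first digit becomes 1 (adds 100); an even first digit becomes 0.
        number = (100 if first % 2 else 0) + last_two
        # Throw away 0, numbers above the population size, and repeats.
        if 1 <= number <= population_size and number not in chosen:
            chosen.append(number)
        if len(chosen) == sample_size:
            break
    return chosen

sample = select_elements(ROWS)
print(sample)
```

Run on these three lines of digits, the sketch reproduces the 25 distinct numbers of the worked example, in the order in which they are first encountered.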

Note that in this worked example with the table of random digits, 66 and 90 each came up twice, and the repeats were discarded. Now, ask learners to complete the list.

Strategy 2: Finding a 1-in-5 systematic sample of elements

Learners can choose the random starting point using the same method as in Strategy 1. Instead of sampling 25 of the numbers from 1 to 114 as in the simple random sample, the learners will sample one integer k with 1 ≤ k ≤ 5 and then select every 5th element from the list of elements ordered by atomic weight, starting at the kth element. The sample size is determined by the value of k picked, since 114 is not divisible by 5: the sample has 23 elements when k is 1, 2, 3, or 4, and 22 elements when k is 5. Learners may think that this systematic sample also needs to have 25 elements, but that is impossible.

Suppose the integer chosen from 1 to 5 is 4. Then the sample will contain the 23 elements in Table 3-06.2.

Table 3-06.2. Elements in the 1-in-5 systematic sample where 4 was first drawn (from 1 to 5), and every fifth element is then taken

 4    9   14   19   24   29   34   39   44
49   54   59   64   69   74   79   84   89
94   99  104  109  114
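The selection of a 1-in-5 systematic sample can be sketched in Python as follows. The function name `systematic_sample` is hypothetical; the population size of 114, the step of 5, and the starting point of 4 are taken from the lesson.

```python
def systematic_sample(population_size, step, start):
    """Return the labels start, start + step, start + 2*step, ... up to population_size."""
    return list(range(start, population_size + 1, step))

# The example in the lesson: start at 4, then take every 5th element.
example = systematic_sample(114, 5, 4)
print(example)

# Sample size for each of the five possible starting points.
sizes = {k: len(systematic_sample(114, 5, k)) for k in range(1, 6)}
print(sizes)
```

The second printout shows why a systematic sample of exactly 25 elements is impossible here: each starting point gives either 23 or 22 elements, and there are only five possible samples in all.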

From the 23 elements in Table 3-06.2, the 1-in-5 systematic sample produces a sample mean of 144.5087 grams per mole.

Strategy 3: Stratified Random Sampling

Learners should explain how they would take a stratified random sample of 25 elements from the 114 using the following 4 strata: solid, liquid, gas, and artificial. Learners are also given the information that there are 77 solids, 2 liquids, 11 gases, and 24 artificial elements. In order to take a stratified sample, learners should divide each population stratum size by the population size and then multiply this proportion by 25. For example, there are 77 solids, so (77/114) × 100% = 67.5% of the population of elements are solids. Thus, there should be 0.675 × 25 = 16.875 ≈ 17 solids in the sample.

Similarly, learners should discover that the sample will have 17 solids, 1 liquid, 2 gases, and 5 artificial elements. A learner may ask how the 1 liquid made it into the sample, because the same calculation gives (2/114) × 25 ≈ 0.44 liquids, and the rounding used for the other three strata would lead to 0 liquids. However, it could be argued that the sample should have at least 1 representative element of the liquids, and by rounding up to 1, the resulting sample has 25 elements.
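The allocation of the 25 sampled elements to the four strata can be checked with a short Python sketch. The stratum sizes come from the lesson; the function name `proportional_allocation` and the `minimum` parameter (which encodes the argument that every stratum should have at least 1 representative) are assumptions made for illustration.

```python
# Stratum sizes given in the lesson (they add up to 114).
strata = {"Solid": 77, "Liquid": 2, "Gas": 11, "Artificial": 24}

def proportional_allocation(strata, sample_size, minimum=1):
    """Allocate the sample to strata in proportion to stratum size."""
    total = sum(strata.values())
    allocation = {}
    for name, size in strata.items():
        exact = sample_size * size / total        # e.g. 25 * 77 / 114 for solids
        allocation[name] = max(minimum, round(exact))
    return allocation

allocation = proportional_allocation(strata, 25)
print(allocation)
print(sum(allocation.values()))
```

With these stratum sizes the rounded allocations happen to add up to 25. For other populations, rounding may give a total that is slightly off, and the allocation would then need to be adjusted by hand.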


Strategy 4: Cluster Sample

Ask learners to take a cluster sample using the columns of elements as clusters. There are therefore 18 clusters in total, and the goal is to take a random sample of 4 clusters. The sample itself is not difficult to obtain, but learners need to understand why the columns are the clusters and not the rows. The main reason is that the variability in atomic weight within a column is more representative of all the elements, whereas the elements in a row have very similar weights.

Tell learners to take a random sample of 4 of the 18 clusters. Finding this sample may now be very easy for learners. Suppose that the 4 clusters randomly chosen from 1 to 18 are cluster numbers 11, 5, 8, and 9. Table 3-06.3 then includes all the elements that are in these 4 clusters.

Table 3-06.3. Elements in the cluster sample of 4 groups

23   26   27   29   41   44   45   47
58   61   62   64   73   76   77   79
90   93   94   96  105  108  109  111
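The mean atomic weight of the cluster sample in Table 3-06.3 can be checked with a short Python sketch. The atomic weights below are copied from the text version of the periodic table in the Activity Sheet of this lesson; the variable names are hypothetical.

```python
import statistics

# Atomic number -> atomic weight (grams per mole) for the 24 elements in
# Table 3-06.3, copied from the text version of the periodic table.
cluster_weights = {
    23: 50.94, 26: 55.84, 27: 58.93, 29: 63.55,
    41: 92.91, 44: 101.07, 45: 102.91, 47: 107.87,
    58: 140.11, 61: 145.00, 62: 150.36, 64: 157.25,
    73: 180.95, 76: 190.23, 77: 192.22, 79: 196.97,
    90: 232.04, 93: 237.00, 94: 244.00, 96: 247.00,
    105: 262.00, 108: 277.00, 109: 268.00, 111: 272.00,
}

# All members of the selected clusters are included in the sample.
cluster_mean = statistics.mean(cluster_weights.values())
print(len(cluster_weights), round(cluster_mean, 2))
```

The result, 167.76 grams per mole for the 24 elements, is well above the population mean of 141.09 grams per mole, which illustrates how much a single cluster sample can vary.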

With the elements in Table 3-06.3 as the sample of 24 elements, the sample mean for the cluster sample is 167.76 grams per mole.

C. Further Analysis and Interpretation

Now that learners have a thorough understanding of how to take a simple random sample (SRS) and a systematic sample, have them continue further with the activity. Ask learners which of the two sampling methods, SRS and systematic sampling, they think will produce the less variable mean atomic weight estimate after repeated sampling. It seems reasonable that most learners will suspect the SRS to have the less variable estimate, but since there are only 5 possible systematic samples, repeated systematic sampling should actually produce the more precise estimate. With the elements roughly ordered by atomic weight, the systematic sample should be more representative of the population of elements. The learners do not have to be correct in their response to this question because they will learn the answer at the end of this activity.

Tell learners to redo the process of generating a simple random sample and a systematic sample, but let them do this individually. Then, ask them to determine the mean atomic weight of each of their samples and to record their means on the board under the appropriate heading. Divide the board into two sections so that learners can compile the lists of means needed for the sampling distribution portion of the activity. Table 3-06.4 below contains example data from a set of, say, 30 learners.


Table 3-06.4. Mean atomic weights for simple random and systematic samples for 30 learners

Learner   Simple Random Sample   Systematic Sample
   1           133.70                 142.86
   2           150.10                 137.13
   3           175.44                 137.13
   4           156.46                 140.73
   5           135.53                 140.73
   6           139.34                 140.73
   7           139.43                 137.13
   8           138.58                 140.10
   9           136.03                 140.10
  10           168.06                 144.64
  11           104.24                 144.64
  12           163.85                 144.64
  13           137.62                 142.86
  14           150.40                 142.86
  15           128.72                 140.73
  16           139.48                 137.13
  17           151.91                 142.86
  18           130.21                 144.64
  19           146.36                 144.64
  20           156.55                 137.13
  21           133.23                 144.64
  22           124.48                 140.73
  23           147.88                 142.86
  24           146.95                 144.64
  25           143.44                 144.64
  26           142.97                 140.10
  27           129.85                 137.13
  28           120.39                 140.73
  29           115.46                 140.10
  30           158.40                 140.10

Once all learners have completed their individual samples and copied them to the board, have them create stem plots for both types of samples. The stem plots should look similar to those in Figure 3-06.1.

Figure 3-06.1. Stem and leaf plots for sampling distributions of the sample mean atomic weights

Descriptive Statistics

Additionally, descriptive statistics should be calculated for each type of sample. Table 3-06.5 shows descriptive statistics for the example data in Table 3-06.4.

Table 3-06.5. Descriptive statistics for the sampling distributions

Statistic             Simple Random Samples   Systematic Samples
Mean                        141.50                 141.30
Standard Deviation           15.44                   2.72
1st Quartile                132.48                 140.10
Median                      139.46                 140.73
3rd Quartile                150.78                 144.64
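The descriptive statistics for the example class data can also be computed with Python's standard `statistics` module, as in the sketch below. The two lists are copied from Table 3-06.4. Note that software packages use slightly different conventions for quartiles, so the quartiles printed here (Python's default "exclusive" method) need not match a hand calculation or another package exactly.

```python
import statistics

# Class sample means copied from Table 3-06.4 (learners 1 to 30).
srs_means = [
    133.70, 150.10, 175.44, 156.46, 135.53, 139.34, 139.43, 138.58,
    136.03, 168.06, 104.24, 163.85, 137.62, 150.40, 128.72, 139.48,
    151.91, 130.21, 146.36, 156.55, 133.23, 124.48, 147.88, 146.95,
    143.44, 142.97, 129.85, 120.39, 115.46, 158.40,
]
systematic_means = [
    142.86, 137.13, 137.13, 140.73, 140.73, 140.73, 137.13, 140.10,
    140.10, 144.64, 144.64, 144.64, 142.86, 142.86, 140.73, 137.13,
    142.86, 144.64, 144.64, 137.13, 144.64, 140.73, 142.86, 144.64,
    144.64, 140.10, 137.13, 140.73, 140.10, 140.10,
]

for label, means in (("Simple random", srs_means), ("Systematic", systematic_means)):
    q1, median, q3 = statistics.quantiles(means, n=4)  # three quartile cut points
    print(
        label,
        round(statistics.mean(means), 2),
        round(statistics.stdev(means), 2),  # sample standard deviation
        round(q1, 2), round(median, 2), round(q3, 2),
    )
```

The two means are nearly equal, while the standard deviation of the systematic sample means is much smaller than that of the simple random sample means.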

Results and Interpretation

The learners should be able to see right away that the standard deviation for the simple random samples is quite large compared to that for the systematic samples. Also, the averages of the mean atomic weights for both types of samples are nearly equal, at 141.50 and 141.30. Therefore, learners should conclude that the systematic sample produces a more precise (less variable) estimate of the mean atomic weight. In the periodic table, the elements are ordered according to the number of protons, and this is directly related to the atomic weight of the elements. So for the most part, the elements are ordered according to atomic weight, which was the measure of interest.

D. Enrichment: Possible Extensions

1. Carry out in class a cluster sample and a stratified random sample to compare the variability and precision of the mean atomic weight after repeated sampling.
2. Demonstrate the Central Limit Theorem by taking simple random samples of various sizes. Repeatedly sample 5, 10, and 50 elements and compare the sampling distributions of the mean atomic weights.
3. Begin with the 114 elements and calculate the sample size needed to reach a specified margin of error for the four sampling methods.
4. You can use other data sets as the source of data for sampling: data from the batch of learners (weight, height, etc.), BEIS (record of all the schools of DepEd), or medical records from the school physician.

KEY POINTS

• Various random sampling methods may be employed in practice, such as simple random, stratified, cluster, and systematic sampling.

• In systematic random sampling, the researcher first randomly picks the first item from the population. Then, the researcher selects every kth item from the list. The main advantage of using systematic sampling over simple random sampling is its simplicity and the assurance that the population will be evenly sampled. In simple random sampling, there is a chance of a clustered selection of subjects. This is systematically eliminated in systematic sampling.

• With stratified sampling, the population can be divided into groups (the strata) that are in some meaningful way different from each other. The sample is chosen by taking a simple random sample from each stratum.

• With cluster sampling, the population is divided into groups (the clusters) that are all essentially the same as each other, but within the groups, their members are as diverse as the population. Thus, the cluster sample is obtained by taking a random sample of the clusters (with all members in each selected cluster taken).

REFERENCES

Many of the materials in this lesson were adapted from:

Malloure, M. and Richardson, M. It’s Elemental! Sampling from the Periodic Table. Grand Valley State University. STatistics Education Web (STEW). Retrieved from https://www.amstat.org/education/stew/pdfs/ItsElemental!.docx

De Veaux, R. D., Velleman, P. F., and Bock, D. E. (2006). Intro Stats. Pearson Ed. Inc.

Workbooks in Statistics 1: 11th Edition. Institute of Statistics, UP Los Baños, College Laguna 4031

ACTIVITY SHEET NUMBER 3-06

The periodic table of chemical elements is a tabular method of displaying the chemical elements. Although there were precursors to this table, its creation is generally credited to Russian chemist Dmitri Mendeleev in 1869. Mendeleev intended the table to illustrate recurring (“periodic”) trends in the properties of the elements. The layout of the table has been refined and extended over time, especially with new elements being discovered and new theoretical models being developed to explain chemical behavior.

The periodic table provides an extremely useful framework to classify, systematize, and compare all the many different forms of chemical behavior. The table has also found wide application in physics, biology, engineering, and industry. The current standard table contains 117 confirmed elements as of January 27, 2008 (while element 118 has been synthesized, element 117 has not been). Ninety-two are found naturally on Earth, and the rest are synthetic elements that have been produced artificially in particle accelerators.

The main value of the periodic table is the ability to predict the chemical properties of an element based on its location on the table. It should be noted that the properties vary differently when moving vertically along the columns of the table than when moving horizontally along the rows.
The layout of the periodic table demonstrates recurring (“periodic”) chemical properties. Elements are listed in order of increasing atomic number (i.e., the number of protons in the atomic nucleus). Rows are arranged so that elements with similar properties fall into the same vertical columns (groups). According to quantum mechanical theories of electron configuration within atoms, each horizontal row (period) in the table corresponds to the filling of a quantum shell of electrons. There are progressively longer periods further down the table.

In printed tables, each element is usually listed with its element symbol and atomic number. Many versions of the table also list the element’s atomic weight and other information. The atomic weight is the average mass of the atoms of an element. It is a weighted average of the naturally occurring isotopes. For example, the atomic weight of Hydrogen is 1.00794 grams per mole.


Atomic Number   Atomic Weight   Element         Abbr.   Type         Period
  1                 1.01        Hydrogen        H       Gas          1
  2                 4.00        Helium          He      Gas          1
  3                 6.94        Lithium         Li      Solid        2
  4                 9.01        Beryllium       Be      Solid        2
  5                10.81        Boron           B       Solid        2
  6                12.01        Carbon          C       Solid        2
  7                14.01        Nitrogen        N       Gas          2
  8                16.00        Oxygen          O       Gas          2
  9                19.00        Fluorine        F       Gas          2
 10                20.18        Neon            Ne      Gas          2
 11                22.99        Sodium          Na      Solid        3
 12                24.30        Magnesium       Mg      Solid        3
 13                26.98        Aluminum        Al      Solid        3
 14                28.09        Silicon         Si      Solid        3
 15                30.97        Phosphorus      P       Solid        3
 16                32.06        Sulfur          S       Solid        3
 17                35.45        Chlorine        Cl      Gas          3
 18                39.95        Argon           Ar      Gas          3
 19                39.10        Potassium       K       Solid        4
 20                40.08        Calcium         Ca      Solid        4
 21                44.96        Scandium        Sc      Solid        4
 22                47.87        Titanium        Ti      Solid        4
 23                50.94        Vanadium        V       Solid        4
 24                52.00        Chromium        Cr      Solid        4
 25                54.94        Manganese       Mn      Solid        4
 26                55.84        Iron            Fe      Solid        4
 27                58.93        Cobalt          Co      Solid        4
 28                58.69        Nickel          Ni      Solid        4
 29                63.55        Copper          Cu      Solid        4
 30                65.41        Zinc            Zn      Solid        4
 31                69.72        Gallium         Ga      Solid        4
 32                72.64        Germanium       Ge      Solid        4
 33                74.92        Arsenic         As      Solid        4
 34                78.96        Selenium        Se      Solid        4
 35                79.90        Bromine         Br      Liquid       4
 36                83.80        Krypton         Kr      Gas          4
 37                85.47        Rubidium        Rb      Solid        5
 38                87.62        Strontium       Sr      Solid        5
 39                88.91        Yttrium         Y       Solid        5
 40                91.22        Zirconium       Zr      Solid        5
 41                92.91        Niobium         Nb      Solid        5
 42                95.94        Molybdenum      Mo      Solid        5
 43                98.00        Technetium      Tc      Artificial   5
 44               101.07        Ruthenium       Ru      Solid        5
 45               102.91        Rhodium         Rh      Solid        5
 46               106.42        Palladium       Pd      Solid        5
 47               107.87        Silver          Ag      Solid        5
 48               112.41        Cadmium         Cd      Solid        5
 49               114.82        Indium          In      Solid        5
 50               118.71        Tin             Sn      Solid        5
 51               121.76        Antimony        Sb      Solid        5
 52               127.60        Tellurium       Te      Solid        5
 53               126.90        Iodine          I       Solid        5
 54               131.29        Xenon           Xe      Gas          5
 55               132.91        Cesium          Cs      Solid        6
 56               137.33        Barium          Ba      Solid        6
 57               138.91        Lanthanum       La      Solid        6
 58               140.11        Cerium          Ce      Solid        6
 59               140.91        Praseodymium    Pr      Solid        6
 60               144.24        Neodymium       Nd      Solid        6
 61               145.00        Promethium      Pm      Artificial   6
 62               150.36        Samarium        Sm      Solid        6
 63               151.96        Europium        Eu      Solid        6
 64               157.25        Gadolinium      Gd      Solid        6
 65               158.93        Terbium         Tb      Solid        6
 66               162.50        Dysprosium      Dy      Solid        6
 67               164.93        Holmium         Ho      Solid        6
 68               167.26        Erbium          Er      Solid        6
 69               168.93        Thulium         Tm      Solid        6
 70               173.04        Ytterbium       Yb      Solid        6
 71               174.97        Lutetium        Lu      Solid        6
 72               178.49        Hafnium         Hf      Solid        6
 73               180.95        Tantalum        Ta      Solid        6
 74               183.84        Tungsten        W       Solid        6
 75               186.21        Rhenium         Re      Solid        6
 76               190.23        Osmium          Os      Solid        6
 77               192.22        Iridium         Ir      Solid        6
 78               195.08        Platinum        Pt      Solid        6
 79               196.97        Gold            Au      Solid        6
 80               200.59        Mercury         Hg      Liquid       6
 81               204.38        Thallium        Tl      Solid        6
 82               207.20        Lead            Pb      Solid        6
 83               208.98        Bismuth         Bi      Solid        6
 84               209.00        Polonium        Po      Solid        6
 85               210.00        Astatine        At      Solid        6
 86               222.00        Radon           Rn      Gas          6
 87               223.00        Francium        Fr      Solid        7
 88               226.00        Radium          Ra      Solid        7
 89               227.00        Actinium        Ac      Solid        7
 90               232.04        Thorium         Th      Solid        7
 91               231.04        Protactinium    Pa      Solid        7
 92               238.03        Uranium         U       Solid        7
 93               237.00        Neptunium       Np      Artificial   7
 94               244.00        Plutonium       Pu      Artificial   7
 95               243.00        Americium       Am      Artificial   7
 96               247.00        Curium          Cm      Artificial   7
 97               247.00        Berkelium       Bk      Artificial   7
 98               251.00        Californium     Cf      Artificial   7
 99               252.00        Einsteinium     Es      Artificial   7
100               257.00        Fermium         Fm      Artificial   7
101               258.00        Mendelevium     Md      Artificial   7
102               259.00        Nobelium        No      Artificial   7
103               262.00        Lawrencium      Lr      Artificial   7
104               261.00        Rutherfordium   Rf      Artificial   7
105               262.00        Dubnium         Db      Artificial   7
106               266.00        Seaborgium      Sg      Artificial   7
107               264.00        Bohrium         Bh      Artificial   7
108               277.00        Hassium         Hs      Artificial   7
109               268.00        Meitnerium      Mt      Artificial   7
110               281.00        Ununnilium      Uun     Artificial   7
111               272.00        Unununium       Uuu     Artificial   7
112               285.00        Ununbium        Uub     Artificial   7
114               289.00        Ununquadium     Uuq     Artificial   7
116               292.00        Ununhexium      Uuh     Artificial   7

Part 1. Taking Samples

Instructions: Refer to the Periodic Table produced by the National Institute of Standards and Technology (NIST) in 2003. This Periodic Table displays 114 elements, along with their corresponding atomic numbers and atomic weights. Notice that the atomic weight of the elements generally increases with the atomic number of the elements. Thus, the first element listed, Hydrogen, with an atomic number of 1, has the lowest atomic weight of 1.01 grams per mole, and the last element listed, Ununhexium, with an atomic number of 116, has the highest atomic weight of 292 grams per mole.

In order to practice selecting different types of samples and to compare the performance of different types of samples, we are going to consider our population of interest to be all of the elements shown on the NIST 2003 Periodic Table (thus, N = 114), and the variable of interest is atomic weight. Let’s assume that we are interested in selecting samples from this population in order to estimate the population mean atomic weight. The true mean atomic weight of the 114 elements on the NIST Periodic Table is µ = 141.09 grams per mole.

Strategy #1
Select a simple random sample of 25 elements. Sample without replacement. What is the mean atomic weight for the 25 sampled elements?

x̄ = _______________

Strategy #2
Select a 1-in-5 systematic sample of elements. What is the mean atomic weight for the sampled elements?

x̄ = _______________

Strategy #3
To select a stratified random sample of 25 elements, without replacement, divide the table into 4 strata: Solid, Liquid, Gas, and Artificial. Note that there are 77 Solids, 2 Liquids, 11 Gases, and 24 Artificial elements. We would then sample 17 Solids, 1 Liquid, 2 Gases, and 5 Artificial elements. Briefly explain why it makes sense to sample 17 of the Solids.

Strategy #4
Select a cluster sample of elements. Use the columns of elements as the clusters. Thus, there are 18 clusters. Randomly select 4 clusters. Briefly explain why it makes sense to use the columns as clusters and not the rows. What is the mean atomic weight for the sampled elements?

x̄ = _______________

Part 2. Comparison of Sampling Strategies

We want to use class data to determine if repeated simple random sampling of 25 elements will result in sample mean atomic weights that are less variable than the sample mean weights resulting from repeated 1-in-5 systematic sampling of elements.

1. Do you think that repeated simple random sampling of elements is likely to produce less variable sample mean atomic weights than repeated 1-in-5 systematic sampling of elements? Why or why not?
2. Write your sample mean atomic weight on the white board in the column labeled “Sample Means from Simple Random Samples.”
3. Write your sample mean atomic weight on the white board in the column labeled “Sample Means from Systematic Samples.”
4. Record the class sample means for each of the sampling techniques below.

Simple Random Sampling:

Systematic Sampling:

Create stem plots and calculate descriptive statistics for the class sample means.

Stem plot for Simple Random Samples:

Stem plot for Systematic Samples:

Simple Random Sampling
mean = __________
standard deviation = __________
first quartile = __________
median = __________
third quartile = __________

Systematic Sampling
mean = __________
standard deviation = __________
first quartile = __________
median = __________
third quartile = __________

Based upon the above calculations, do you think that repeated simple random sampling of elements from the Periodic Table would most likely produce a more accurate estimate of the population mean atomic weight than would repeated 1-in-5 systematic sampling of elements? Why or why not?

ASSESSMENT

1. Identify the type of sampling method used in the following 4 scenarios. The possible response options are: A. Systematic Sample B. Cluster Sample C. Simple Random Sample D. Stratified Random Sample.

Scenario 1: In a factory that produces television sets, every 100th set produced is inspected.
Answer: A

Scenario 2: A class of 200 learners is numbered from 1 to 200, and a table of random digits is used to choose 60 learners from the class.
Answer: C

Scenario 3: A class of 200 learners is seated in 10 rows of 20 learners per row. Three learners are randomly selected from every row.
Answer: D

Scenario 4: An airline company randomly chooses one flight from a list of all international flights taking place that day. All passengers on that selected flight are asked to fill out a survey on meal satisfaction.
Answer: B

2. Suppose a state has 10 universities, 25 four-year colleges, and 50 community colleges, each of which offers multiple sections of an Introductory Statistics course each year. Researchers want to conduct a survey of learners taking Introductory Statistics in the state. Explain a method for collecting each of the following types of samples: A. Stratified Random Sample B. Cluster Sample C. Simple Random Sample

Answer: First, compile a list of all the Introductory Statistics courses taught in the state at each type of learning institution.
(A) Randomly sample a representative proportion of Introductory Statistics courses from each of the 3 strata: universities, four-year colleges, and community colleges.
(B) Randomly sample one of the 3 types of learning institutions and then take a census of all introductory courses within that type of institution.
(C) Simply take a random sample of n introductory courses from the list of offerings without considering the type of learning institution.

CHAPTER 4: ON ESTIMATION OF PARAMETERS
Lesson 1: Concepts of Point and Interval Estimation

TIME FRAME: 60 minutes

OVERVIEW OF LESSON: This chapter further builds on the discussions in the previous chapter on sampling and sampling distribution to illustrate one of the basic purposes of statistics: inference. Learners are given descriptions of basic concepts of estimation, both point estimation and interval estimation.

LEARNING COMPETENCIES: At the end of the lesson, the learner should be able to:
a. Classify a decision process as one that makes use of inferential statistics, or not.
b. Illustrate point and interval estimation.
c. Differentiate point from interval estimation.

PRE-REQUISITE KNOWLEDGE AND SKILLS: Knowledge in sampling and sampling distribution (discussed in Chapter 3)

LESSON OUTLINE:
a. Motivation: Concept of Inferential Statistics
b. Preliminary Lesson: Development of the Concept of Estimation
c. Main Lesson: Differentiate point from interval estimation

DEVELOPMENT OF THE LESSON A. Motivation: Concept of Inferential Statistics At this point, learners should have already imbibed the idea that in reality they do not have the whole population to work on. Hence, they should make use of a representative subset of the population which they referred to in previous lessons as a random sample. From this random sample, they will generate statistics they will use to make inferences about the population and/or its parameters. This process is referred to as inferential statistics and is illustrated below.


Note that the sample taken from the population must be a random sample obtained using one of the sampling techniques discussed in the previous chapter. Likewise, the inferences that they will make are subject to uncertainty, which means that they are not 100% sure of the inferences or conclusions they make about the population or its parameters based on the statistics generated from the random sample. In other words, there is a chance or likelihood that they will make a wrong inference, and they will try to measure this likelihood so that they can minimize it. This is the reason why probability was discussed in an earlier chapter.

B. Preliminary Lesson: Development of the Concept of Estimation

In making inferences about the population, learners can either provide a value or values for the parameter or evaluate a statement about the parameter. This chapter will focus on the former, which is generally referred to as estimation. For this lesson, we will discuss two ways of estimating, namely point and interval estimation, and differentiate one from the other.

As a motivational activity, ask each learner to write the following on a 1/4 sheet of paper:
1. His/her “best” guess of your age, given as a single number.
2. The same as in Number 1, but this time given as a range of values within which your age would most likely fall.
3. A rating of his/her confidence, from 0% (not confident) to 100% (very confident), in his/her educated guess of the range of values in Number 2.

Then, collect all the papers.

C. Main Lesson: Differentiate point from interval estimation

Discuss the results of this activity with learners, emphasizing the following points:

• There is no right or wrong numerical value for the given answers. However, there might be misconceptions or misunderstandings of the concepts when they provided their answers.

• On the first item, learners should give one logical number between 21 and 65 (inclusive). The learner should have given only one number, since you asked for a point estimate. A point estimate is a numerical value, and it identifies a location or a position in the distribution of possible values. The learner’s guess of your age should be between 21 and 65 for it to be logical, since one usually starts working or teaching at the age of 21 and retires at the age of 65 (the compulsory retirement age).

• On the second item, tell learners they should have given a logical range of values, or a set of values with lower and upper limits. The logical lower limit should be at least 21, while the upper limit should be at most 65. The reasons behind these numbers are the same as those stated earlier. What they have given is their interval estimate of your age. An interval estimate is a range of values within which the true value will most likely fall.

• As stated in item 3, the percentage should be within 0% to 100%. This measure of confidence in the interval estimate is referred to as the confidence coefficient. When combined with their estimate in item 2, the result is referred to as a confidence interval estimate. Hence, a confidence interval estimate is a range of values for which one has a certain percentage of confidence that the true value will fall within it.

• In point estimation, since you are giving only a single value, there are only two possibilities: the estimate is either wrong or correct. Also, there is no measure of how confident one is in his/her estimate. On the other hand, the interval estimate gives more than one possible value as an estimate. In addition, the confidence coefficient provides a measure of confidence in the estimate.
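To make the contrast concrete, the sketch below computes a point estimate and an interval estimate of a population mean from a small sample. Everything in it is an assumption for illustration: the ten heights are hypothetical data, and the interval uses the common large-sample normal approximation (sample mean plus or minus a multiple of the standard error), which is only a rough guide for a sample this small.

```python
import math
import statistics

# Hypothetical random sample: heights (in cm) of 10 learners.
heights = [150, 155, 160, 158, 162, 149, 157, 161, 153, 155]

n = len(heights)
point_estimate = statistics.mean(heights)             # a single number
standard_error = statistics.stdev(heights) / math.sqrt(n)

confidence = 0.95                                      # the confidence coefficient
# Multiplier from the standard normal distribution (about 1.96 for 95%).
z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)

lower = point_estimate - z * standard_error
upper = point_estimate + z * standard_error

print("Point estimate:", point_estimate)
print("Interval estimate:", round(lower, 1), "to", round(upper, 1))
```

The point estimate is one value with no stated confidence attached, while the interval estimate is a range of values paired with a confidence coefficient. Raising the confidence coefficient makes the interval wider.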

Before the session ends, tell your learners your true age. Take note, too, of how many learners gave correct point estimates and how many gave confidence interval estimates that included your age.

SEATWORK

The seatwork can be something similar to the motivational activity, but be sure you know the true value so that you have a basis for logical answers. Some interesting parameters to estimate are as follows:
1. Age of a celebrity such as Sharon Cuneta
2. Income or salary of your father or mother
3. Length (in kilometres) of EDSA (in Metro Manila or some local national highway)
4. Height of the tallest building in your city or municipality


5. Number of text messages per day sent by the principal of the school

TEACHER TIPS
• Keep the submitted papers as materials for future lessons.
• Ask learners to reflect on the difference between a “fortune teller” and one who provides a confidence interval estimate.

KEY POINTS
• A point estimate is a numerical value and it identifies a location or a position in the distribution of possible values.
• A confidence interval estimate is a range of values within which one has a certain percentage of confidence that the true value will fall.

REFERENCES
De Veau, R. D., Velleman, P. F., and Bock, D. E. (2006). Intro Stats. Pearson Ed. Inc.
Workbooks in Statistics 1: 11th Edition. Institute of Statistics, UP Los Baños, College Laguna 4031

ASSESSMENT

For each of the following situations, ask the student whether inferential statistics is applicable or not:

1. The government would like to know the per capita rice consumption per day of Filipinos.
Answer: Inferential statistics is applicable. The daily rice consumption of every Filipino cannot be measured, but it can be estimated by obtaining a probability sample of households.

2. The effectiveness of a newly developed cure for cancer
Answer: Inferential statistics is NOT applicable as medical research makes use of volunteers and not a random sample of cancer patients to test the effectiveness of a newly developed cure for the disease.

3. A presidential candidate decides to take a survey through text messaging to determine the proportion of voters who are likely to vote for him/her.
Answer: Inferential statistics cannot be used here since it is likely that not all voters have cell phones. Even if everyone had one, the sample must represent the target population, which it does not here, partly due to non-response.

4. A farmer wants to estimate the number of pigs he has in the pig pen. He decides to capture 20 pigs, puts a red mark on the captured pigs, and then,


lets them loose. After a day or so, the farmer recaptures another set of pigs, say, 10 of them, and notices that only one of them has a red mark, and so he estimates that he has 20/(0.1) = 200 pigs.
Answer: Inferential statistics is applicable here, assuming that the 20 originally captured pigs have mixed thoroughly with the rest of the population of pigs and that the 10 recaptured pigs form a random sample, so that 1/10 = 0.1 = 20/N and thus N = 200.

5. An auditor of a government office wants to assess what proportion of experiment records were done correctly. Instead of going over 10,000 records, she decides to sample the first 100 records on her desk, and notices that 97 of them were done well. She concludes that 97% of all the records given to her were done well.
Answer: If the records on the desk were put in random order, then this is a valid application of inferential statistics. However, if the records are not in random order, then the sample is not a random sample, and there will likely be some bias in the estimate.
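The farmer's reasoning in item 4 can be written out as a short computation. The following is a minimal sketch in Python, assuming (as the answer does) that marked pigs mix evenly with the rest and that the recaptured pigs form a random sample; the function name `capture_recapture_estimate` is made up for this illustration.

```python
def capture_recapture_estimate(marked: int, recaptured: int, marked_in_recapture: int) -> float:
    """Estimate the population size N by solving
    marked / N = marked_in_recapture / recaptured for N."""
    if marked_in_recapture == 0:
        # With no marked animals in the second sample the ratio cannot be formed.
        raise ValueError("no marked animals were recaptured; the estimate is undefined")
    return marked * recaptured / marked_in_recapture


# The farmer marked 20 pigs, recaptured 10, and found 1 marked pig among them.
estimated_pigs = capture_recapture_estimate(marked=20, recaptured=10, marked_in_recapture=1)
print(estimated_pigs)
```

Because the estimate divides by the number of marked pigs found in the second sample, a small recapture sample makes it very unstable, which is worth pointing out to learners.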


CHAPTER 4: ESTIMATION OF PARAMETERS
Lesson 2: Point Estimation of the Population Mean

TIME FRAME: 60 minutes

LEARNING COMPETENCIES
At the end of the lesson, the learner should be able to:
• Identify possible point estimators of the population mean
• Discuss characteristics of a “good” estimator
• Appraise why the sample mean is the “best” estimator of the population mean
• Compute a point estimate of the population mean

PRE-REQUISITE KNOWLEDGE AND SKILLS
Knowledge of basic concepts in estimation (Lesson 4-01) as well as the sampling distribution of the sample mean (Chapter 3)

LESSON OUTLINE
A. Possible point estimators of the population mean
B. Properties of a “good” estimator
C. The sample mean as the best linear unbiased estimator (BLUE) of the population mean
D. Illustration on the computation of a point estimate of the population mean

DEVELOPMENT OF THE LESSON

At the start, review the difference between a parameter and a statistic. A parameter is a characteristic of the population which is usually unknown and needs to be estimated. On the other hand, a statistic is computed from a random sample; hence, it is known and is used to estimate the unknown parameter.

Recall that there are two types of estimation: point and interval estimation. In estimating a parameter, the mathematical expression or formula used to come up with the estimate is referred to as the estimator, while the estimate is the numerical value obtained when the estimator is applied to the sample data.

As a motivational activity, ask learners to form groups of five. Give each group a sample data set on the weights of 10 learners that you generated beforehand. Each group must have a different sample data set, so the number of sample data sets equals the number of groups. Ask each group to discuss how they are going to use the sample data to estimate the average weight of all learners in the class, and then to carry out the process to get an estimate. This activity should be done in 10 minutes.


After the group activity, ask the leader of each group to report the mathematical expression or formula they used as well as the resulting estimate. It is recommended that you list the reported estimators and estimates on the board. For the discussion of the results of the activity, consider the following: •

Use what was reported and listed on the board to illustrate that there could be several estimates for a parameter. It is possible that the learners will give only one estimator, most commonly the sample mean or the average. If this is the case, you could supply other possible estimators. There could be several estimators for a parameter. For a population mean, usually represented by the Greek letter µ, the following are possible estimators that make use of sample data obtained through simple random sampling.

Sample Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Sample Median: $\tilde{x} = x_{\left(\frac{n+1}{2}\right)}$ if n is odd, or $\tilde{x} = \frac{1}{2}\left[x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}\right]$ if n is even, where $x_{(i)}$ is the ith observation in an array, that is, when the observations are arranged in increasing or decreasing order.

Sample Mode: the value(s) with the highest frequency

•
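The three candidate estimators above can be compared on one sample. The following is a minimal sketch using Python's standard `statistics` module; the seven weights are hypothetical values made up for this illustration and are not data from the guide.

```python
from statistics import mean, median, multimode

# Hypothetical sample of weights (kg) obtained by simple random sampling.
sample_weights = [48, 50, 50, 52, 55, 57, 60]

sample_mean = mean(sample_weights)        # sum of the observations divided by n
sample_median = median(sample_weights)    # middle value of the ordered array
sample_modes = multimode(sample_weights)  # list of the most frequent value(s)

print("mean:", round(sample_mean, 2))
print("median:", sample_median)
print("mode(s):", sample_modes)
```

Each estimator gives a different estimate from the same data, which is the point of the group activity: the choice of estimator matters, so a criterion for choosing among them is needed.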

With several estimators, we must choose and use the “best” estimator but how do we choose the “best”? An estimator could be evaluated based on the two statistical properties: accuracy and precision, which are both measures of closeness. Accuracy is a measure of closeness of the estimates to the true value while precision is a measure of closeness of the estimates to each other. To illustrate, take the bull’s eye in a dart board as the parameter and the ‘hits’ made on the board as the estimates. There could be a “hit” that is near the bull’s eye or an estimate that is near the parameter. On the other hand, there could be a “hit” that is far from the bull’s eye or an estimate that is far from the parameter. As shown in the following figure, Estimate No. 1 is far from the parameter value while Estimate No. 3 is near the parameter value.

[Figure: dart board with the parameter µ at the bull's eye and three hits labeled Estimate No. 1, Estimate No. 2, and Estimate No. 3]

An accurate estimator will have estimates that, on the average, are near the parameter value. When the estimates on the average are equal to the true value, the estimator is said to be unbiased. Thus, an unbiased estimator is one whose average value is equal to the parameter itself. If the average value of the estimates deviates from the parameter value, then the estimator is said to be biased. Bias can then be measured as the difference between the average value of the estimator (i.e., the expected value of its sampling distribution) and the parameter value. Mathematically, writing $\hat{\theta}$ for the estimator and $\theta$ for the parameter, we compute bias as:

$\text{Bias}(\hat{\theta}) = E(\hat{\theta}) - \theta$

If, on the average, the estimates are greater than the parameter value, that is, the bias is positive, then we say the estimator overestimates the parameter. On the other hand, if on the average the estimates are less than the parameter value, that is, the bias is negative, then we say that the estimator underestimates the parameter. When bias is equal to zero, the estimator is unbiased.

In terms of precision, an estimator is precise if the estimates are close to each other. Otherwise, the estimator is not precise. A measure of precision of the estimator is its standard error, which is the square root of the estimator’s variance. The smaller the standard error, the more precise the estimator is.

Ideally, we choose an estimator that is both accurate and precise. An example of this is the sample mean (computed from a simple random sample of size n), which is an accurate and precise estimator of the population mean. However, we cannot always find such an ideal estimator. For practical reasons, one can opt to use a biased and less precise estimator. For example, a quality control engineer might use the minimum life span of an electric bulb rather than the sample mean in assessing the quality of a batch of manufactured electric bulbs. For efficiency and practical reasons, the engineer does not need to wait for a sample of n bulbs to burn out in order to compute the sample mean. Instead, he could use the life span of the first bulb to burn out as an estimate of the parameter, which is the true life span of the batch of manufactured electric bulbs.

The following figures illustrate the different kinds of estimators based on their accuracy and precision:

An estimator that is both accurate and precise:

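[Figure: dart board with the parameter µ at the bull's eye]

An estimator that is accurate but not precise:

[Figure: dart board with the parameter µ at the bull's eye]

An estimator that is precise but not accurate:

[Figure: dart board with the parameter µ at the bull's eye]

An estimator that is neither accurate nor precise:

[Figure: dart board with the parameter µ at the bull's eye]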

As mentioned before, the sample mean, computed as the arithmetic mean of observations obtained using a simple random sample, is an accurate and precise estimator of the population mean. It is a “good” estimator of the population mean. That is why it is usually referred to as the Best Linear Unbiased Estimator (BLUE) of the population mean. It is “best” since it is the most precise, having the smallest variance among all linear unbiased estimators of the population mean. It is linear because it is a linear function of the observations, and it is accurate because it is an unbiased estimator.
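Bias and standard error can also be approximated by simulation. The following is a minimal sketch in Python, assuming a normal population with mean 50 and standard deviation 10 (values chosen only for illustration); the helper `simulate_estimator` is a made-up name. It draws many samples, applies an estimator to each, and summarizes the resulting estimates.

```python
import random
from statistics import mean, median, pstdev


def simulate_estimator(estimator, mu=50.0, sigma=10.0, n=10, replications=5000, seed=1):
    """Approximate the bias and standard error of an estimator of mu
    by applying it to many simulated random samples of size n."""
    rng = random.Random(seed)
    estimates = [
        estimator([rng.gauss(mu, sigma) for _ in range(n)])
        for _ in range(replications)
    ]
    bias = mean(estimates) - mu          # average estimate minus the true value
    standard_error = pstdev(estimates)   # spread of the estimates around their mean
    return bias, standard_error


mean_bias, mean_se = simulate_estimator(mean)
median_bias, median_se = simulate_estimator(median)
min_bias, min_se = simulate_estimator(min)

print("sample mean:    bias", round(mean_bias, 2), "SE", round(mean_se, 2))
print("sample median:  bias", round(median_bias, 2), "SE", round(median_se, 2))
print("sample minimum: bias", round(min_bias, 2), "SE", round(min_se, 2))
```

Under these assumptions the sample mean and sample median both show bias near zero, but the mean has the smaller standard error, while the sample minimum clearly underestimates µ. The exact figures depend on the seed and the assumed population.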




Illustration of the Computation: Consider the following observed weights (in kilograms) of a random sample of 20 learners and use them to estimate the true value of the average weight of learners enrolled in the class.

40  45  46  48  48  50  55  55  56  58
58  59  60  60  62  62  64  64  65  66

The sample mean is computed as:

$\bar{x} = \frac{1}{20}\sum_{i=1}^{20} x_i = \frac{40 + 45 + \cdots + 66}{20} = \frac{1121}{20} = 56.05$ kg

Thus, we say the average weight of all learners in the class is estimated to be around 56.05 kg based on a simple random sample of 20 observations.

ENRICHMENT
For the problems described in Numbers 1, 3, and 4, identify or formulate in words the hypothesis of interest that addresses the objective of the problem.

TEACHER TIPS
• Use the same numerical example and assessment problems for future lessons.

KEY POINTS
• When estimating a parameter (such as a population mean), there are various possible estimators to use (including the sample mean, sample median, and sample mode).
• What makes an estimator a good estimator? An estimator should have both accuracy and precision.
  o Accuracy is a measure of closeness of the estimates to the true value
  o Precision is a measure of closeness of the estimates to each other
• We also prefer the estimator to be unbiased.
  o Bias is the difference between the average value of the estimator (i.e., the expected value of the sampling distribution) and the value of the population parameter.

REFERENCES De Veau, R. D., Velleman, P. F., and Bock, D. E. (2006). Intro Stats. Pearson Ed. Inc. Workbooks in Statistics 1: 11th Edition. Institute of Statistics, UP Los Baños, College Laguna 4031


ASSESSMENT The following are some problems that can serve as computational exercises on point estimation of the population mean and related topics. 1. A company that manufactures electronic calculators uses a certain type of plastic. An alternative plastic material is introduced in the market and the manager of the company is thinking of shifting to this material. He will decide to shift if the mean breaking strength of the new material is greater than 155. It is known that the breaking strengths of the new plastic material follow the normal distribution and have a standard deviation of 10 psi (pounds per square inch). Six samples of the new plastic materials were randomly selected and their breaking strengths were determined. The data obtained were 156, 154, 168, 157, 160 and 158. Estimate the true mean breaking strength of the new plastic material and measure the precision of this estimate. Answer:

$\bar{x} = \frac{156 + 154 + 168 + 157 + 160 + 158}{6} = \frac{953}{6} \approx 158.83$ psi.

The estimated true mean breaking strength of the new plastic is 158.83 psi with precision (measured by the standard error of the estimate) equal to $\frac{\sigma}{\sqrt{n}} = \frac{10}{\sqrt{6}} \approx 4.08$ psi. If the manager of the company uses this estimate in deciding whether to shift to the new material, he is going to decide to do so. However, the risk of committing an error in this decision cannot be measured with a point estimate only. There is a need for interval estimation or a hypothesis test.

2. The nickel-metal hydride (NiMH) battery is one of the highly advertised rechargeable batteries today. It is lighter and can last 2 to 4 times longer than alkaline or standard nickel-cadmium (NiCd) batteries. To evaluate its performance, a random sample of 10 NiMH batteries was taken. The number of photos taken using each battery in a digital camera is given as follows: 405, 564, 342, 456, 435, 543, 473, 452, 462, and 475. Estimate the true mean number of photos taken using the NiMH battery. Also, provide a measure of precision of this estimate. Answer:

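$\bar{x} = \frac{4607}{10} = 460.7$ photos.

The estimated true mean number of photos taken using the NiMH battery is 460.7 or around 461 photos with precision (measured by the standard error of the estimate) equal to $\frac{s}{\sqrt{n}} = \frac{63.03}{\sqrt{10}} \approx 19.93$ photos, where the sample standard deviation s is used because the population standard deviation is not given.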

3. The head of the Philippine University (PU) observes a decline on the alcoholic expenditures of learners from a monthly expenditure of Php 350 in the previous year. To check on this, he randomly selected 10 PU learners who drink alcoholic beverages and asked for the amount, in pesos, that they usually spent on alcoholic beverages in a month. It is known that the usual amount spent on alcoholic beverages by learners who drink alcoholic beverages follows the normal distribution with standard deviation of Php 10. The data collected are: 400, 235, 200, 250, 200, 300, 500, 430, 420, and 220. Estimate the true average amount spent on alcoholic beverages by learners who drink alcoholic beverages.


Answer: $\bar{x} = \frac{3155}{10} = 315.50$ pesos.

The estimated true mean expenditure of PU learners for alcoholic beverages is Php 315.50 with precision (measured by the standard error of the estimate) equal to $\frac{\sigma}{\sqrt{n}} = \frac{10}{\sqrt{10}} \approx 3.16$ pesos. If the head of the university uses this estimate to check on his observation, he could say there is a decline in the alcoholic expenditures of the learners. However, the risk of committing an error in this decision cannot be measured with a point estimate only. There is a need for interval estimation or a hypothesis test.

4. A local government official observes an increase in the number of individuals with cardiovascular and obesity problems in his barangay. In order to improve the health conditions of his constituents, he aims to promote an easy and cheap way to reduce weight. It is known that obesity results in a greater risk of having illnesses like diabetes and heart problems. He encouraged his constituents to participate in his Dance for Life project every weekend for 3 months. To know if the program is effective in reducing weight, he randomly selected 12 participants from the group who completed the program. The weight loss data of the 12 randomly selected participants, in kilograms, after completing the program are: 0.5, 0.7, 0.9, 1.1, 1.2, 1.3, 1.4, 2.0, 2.3, 2.4, 2.7, and 3.0. It is known that the weight loss of those who have completed the dance program follows a normal distribution with variance of 3.24 kg². Provide a point estimate of the average weight loss of the participants who have completed the dance program. Give also a measure of the estimate’s precision. Answer:

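$\bar{x} = \frac{19.5}{12} = 1.625$ kg.

The estimated true mean weight loss of the program participants is 1.625 kg with precision (measured by the standard error of the estimate) equal to $\frac{\sigma}{\sqrt{n}} = \frac{\sqrt{3.24}}{\sqrt{12}} = \frac{1.8}{\sqrt{12}} \approx 0.52$ kg.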

If the local government official uses this estimate to know whether his program is effective or not, he could say that the average weight loss of the participants after the program is greater than zero. However, the risk of committing an error on this decision could not be measured with a point estimate only. There is a need to do interval estimation or hypothesis test procedure.
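The point estimates and standard errors asked for in the four problems above can be checked with a few lines of code. The following is a minimal sketch in Python; the helper `point_estimate_with_se` is a made-up name, and falling back to the sample standard deviation when no population value is supplied is an assumption made here for Problem 2, which does not state σ.

```python
from math import sqrt
from statistics import mean, stdev


def point_estimate_with_se(data, sigma=None):
    """Return (sample mean, standard error of the mean).

    Uses the known population standard deviation sigma when it is given;
    otherwise falls back to the sample standard deviation (an assumption)."""
    spread = sigma if sigma is not None else stdev(data)
    return mean(data), spread / sqrt(len(data))


plastic = [156, 154, 168, 157, 160, 158]
batteries = [405, 564, 342, 456, 435, 543, 473, 452, 462, 475]
spending = [400, 235, 200, 250, 200, 300, 500, 430, 420, 220]
weight_loss = [0.5, 0.7, 0.9, 1.1, 1.2, 1.3, 1.4, 2.0, 2.3, 2.4, 2.7, 3.0]

results = {
    "plastic (psi)": point_estimate_with_se(plastic, sigma=10),
    "batteries (photos)": point_estimate_with_se(batteries),
    "spending (pesos)": point_estimate_with_se(spending, sigma=10),
    "weight loss (kg)": point_estimate_with_se(weight_loss, sigma=sqrt(3.24)),
}

for label, (estimate, se) in results.items():
    print(label, round(estimate, 3), round(se, 3))
```

Note that Problem 4 gives the variance (3.24 kg²), so its square root is taken before dividing by the square root of the sample size.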


CHAPTER 4: ESTIMATION OF PARAMETERS
Lesson 3: Confidence Interval Estimation of the Population Mean (Part 1)

TIME FRAME: 60 minutes

OVERVIEW OF LESSON: In this lesson, learners learn how to construct interval estimates for the population mean when the variance σ² of the parent distribution is known. They are also provided examples on how to determine minimum sample size requirements for estimating the population mean given information on the variability of the parent distribution.

LEARNING COMPETENCIES
At the end of the lesson, the learner should be able to:
• Assess the accuracy of confidence interval estimates through their width
• Interpret confidence interval estimates
• Construct a (1-α)100% confidence interval estimator of the population mean when the population variance is known
• Determine the required sample size in estimating the population mean under the simple random sampling scheme

PRE-REQUISITE KNOWLEDGE AND SKILLS: Knowledge of point estimation as well as the sampling distribution of the sample mean

LESSON OUTLINE
A. Accuracy of confidence interval estimates through their width
B. Construction and interpretation of a (1-α)100% confidence interval estimator of the population mean when the population variance is known
C. Computation of the interval estimate of the population mean when the population variance is known and its interpretation
D. Sample size determination under simple random sampling scheme in estimating the population mean
E. Computation of the required sample size under simple random sampling scheme in estimating the population mean and its interpretation


DEVELOPMENT OF THE LESSON

At the start, review the difference between a point and an interval estimate. A point estimate gives a single value for the parameter while an interval estimate gives a range of possible values for the parameter. Also, with an attached confidence coefficient, the interval estimate is referred to as a confidence interval estimate.

Using the data collected in the motivational activity of Lesson 1 regarding learners’ estimates of your age, prepare a graph similar to the one shown below. Each line segment represents an interval estimate of the true value of your age. Then, assume that such estimates are all 95% confidence interval estimates.

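[Figure: line segments representing the learners' interval estimates, drawn against a horizontal axis running from 0 to 80, with the TRUE VALUE marked]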

For the discussion of the graph, consider the following: •

The 95% confidence interval estimates, represented by the line segments, are of different widths. Some are short and some are long. The width of the interval estimate represents the accuracy of the estimate. The narrower the interval, or the shorter the segment, the more accurate the interval estimate is.



There are line segments that include the true value but there are others that exclude the true value. If all the line segments represent all possible 95% confidence interval estimates, then 95% of them will contain the true value and only 5% of them will not contain the true value. Thus, we could say that if we have 1,000 possible 95% confidence interval estimates, 950 of these estimates will contain the true value and only 50 of the estimates will not have the true value within the interval estimate. This is another way of interpreting a confidence interval estimate. In general, an interval estimator is constructed as follows:

Point Estimator ± (Tabular Value) × (Standard Error of the Point Estimator)

where the Tabular Value depends on the sampling distribution of the point estimator.




In particular, for the population mean, the point estimator is the sample mean, and the standard error of the sample mean is used in the computation. With a known population variance (σ²) and sample size (n), the standard error of the sample mean is computed as the ratio of the standard deviation (square root of the variance) to the square root of the sample size, or mathematically, $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$.



Also, since the population variance is known, the standardized sample mean will follow the standard normal distribution or the Z distribution. This means that the tabular value comes from the Z-distribution table. Usually, we use the notation $Z_{\alpha/2}$ for the tabular value in the Z-distribution table whose area to its right is equal to α/2.



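Thus, a (1-α)100% confidence interval (CI) of the population mean (µ) when the population variance (σ²) is known is constructed as

$\bar{x} \pm Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$ or $\left(\bar{x} - Z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x} + Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$

where $\bar{x}$ is the sample mean computed from a simple random sample of size n. The lower limit of the interval is $\bar{x} - Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$ while the upper limit is $\bar{x} + Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$.

•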

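The width of the interval estimate is the difference between the upper limit and the lower limit of the interval estimate. Expressing it mathematically, we have:

$\text{Width} = \left(\bar{x} + Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) - \left(\bar{x} - Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$

This would lead to $\text{Width} = 2\,Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$, where $Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$ is usually referred to as the maximum allowable deviation, denoted by D.

•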

The maximum allowable deviation is a function of three factors: (1) the population standard deviation, σ; (2) the sample size, n; and (3) the confidence coefficient (1-α)100%, through the tabular value $Z_{\alpha/2}$. Take note of the following relationships between each of these three factors and the confidence interval estimator, holding other factors constant:

1. The larger the variability of the population from which the simple random sample was drawn, that is, the larger the value of σ, the larger the maximum allowable deviation and, consequently, the wider the confidence interval estimate.
2. A bigger sample size leads to a smaller maximum allowable deviation and a narrower confidence interval estimate.
3. A higher confidence coefficient (1-α)100% means a lower value of α and thus a higher tabular value $Z_{\alpha/2}$, which leads to a larger maximum allowable deviation and, consequently, a wider confidence interval estimate.
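The three relationships above can be verified numerically. The following is a minimal sketch in Python that uses `statistics.NormalDist` from the standard library in place of a printed Z table; the function name `max_allowable_deviation` and the trial values of σ, n, and the confidence coefficient are chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def max_allowable_deviation(sigma, n, confidence):
    """Compute D = Z_{alpha/2} * sigma / sqrt(n) for a given confidence coefficient."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # tabular value with area alpha/2 to its right
    return z * sigma / sqrt(n)


baseline = max_allowable_deviation(sigma=9, n=20, confidence=0.95)
larger_sigma = max_allowable_deviation(sigma=12, n=20, confidence=0.95)
larger_n = max_allowable_deviation(sigma=9, n=80, confidence=0.95)
higher_confidence = max_allowable_deviation(sigma=9, n=20, confidence=0.99)

print("baseline D:        ", round(baseline, 3))
print("larger sigma:      ", round(larger_sigma, 3))       # wider interval
print("larger sample size:", round(larger_n, 3))           # narrower interval
print("higher confidence: ", round(higher_confidence, 3))  # wider interval
```

Because D depends on the square root of n, quadrupling the sample size (20 to 80 here) only halves the maximum allowable deviation.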




The (1-α)100% confidence interval (CI) of the population mean (µ) can be interpreted as a probability statement or a confidence statement. It is a probability statement when the upper and lower limits are still considered random variables, that is, they are not yet fixed. Otherwise, it is considered a confidence statement. For example, one could say that the probability that a 95% CI of the population mean will include the population mean is equal to 0.95. Mathematically, this is expressed as:

$P\left(\bar{X} - 1.96\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right) = 0.95$

Once we have computed or fixed the lower and upper limits, say the lower limit is 40 and the upper limit is 60, the 95% CI of the population mean becomes a confidence statement. Thus, we say that we are 95% confident that the true mean value is between 40 and 60, and the probability that the true mean value is between 40 and 60 is either one or zero. The probability is one if the true mean value is indeed between 40 and 60; otherwise, the probability is zero. Note also that we could interpret the 95% CI of the population mean in terms of the number of interval estimates, out of all possible confidence interval estimates, that will contain or include the population mean. As was said earlier, out of all possible confidence intervals, 95% of them will contain or include the population mean. •
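The "95% of all possible intervals" interpretation can be illustrated by simulation. The following is a minimal sketch in Python, assuming a normal population with µ = 50 and σ = 9 (values chosen only for illustration); the function name `coverage_rate` is made up. It builds many 95% confidence intervals from repeated random samples and counts how often they contain µ.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def coverage_rate(mu=50.0, sigma=9.0, n=20, confidence=0.95, replications=10000, seed=2016):
    """Fraction of simulated confidence intervals that contain the true mean mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / sqrt(n)  # the same for every sample since sigma is known
    hits = 0
    for _ in range(replications):
        x_bar = mean([rng.gauss(mu, sigma) for _ in range(n)])
        if x_bar - margin <= mu <= x_bar + margin:
            hits += 1
    return hits / replications


rate = coverage_rate()
print("proportion of intervals containing mu:", rate)
```

The proportion printed should be close to 0.95, though not exactly 0.95, because only a finite number of samples is simulated. Any single interval either contains µ or it does not, which mirrors the confidence-statement interpretation above.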

Illustration of the Computation: Consider the numerical example used in point estimation of the population mean, where the following observed weights (in kilograms) of a random sample of 20 learners were used.

40  45  46  48  48  50  55  55  56  58
58  59  60  60  62  62  64  64  65  66

The sample mean is computed as:

$\bar{x} = \frac{1}{20}\sum_{i=1}^{20} x_i = \frac{1121}{20} = 56.05$ kg

Assuming that the population standard deviation of the weights of all learners in the class is 9 kg, the 95% confidence interval estimate of the true average weight of the learners is

$56.05 \pm 1.96\left(\frac{9}{\sqrt{20}}\right) = 56.05 \pm 3.94$, or (52.11, 59.99).

Thus, we are 95% confident that the true average weight of all learners in the class is between 52 kg and 60 kg (rounded off to the nearest integer). •
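The interval in this illustration can be reproduced in code. The following is a minimal sketch in Python using the 20 weights from the illustration and the assumed population standard deviation of 9 kg; the function name `z_confidence_interval` is made up for this example.

```python
from math import sqrt
from statistics import NormalDist, mean


def z_confidence_interval(data, sigma, confidence=0.95):
    """Confidence interval for the population mean when sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    x_bar = mean(data)
    margin = z * sigma / sqrt(len(data))  # maximum allowable deviation D
    return x_bar - margin, x_bar + margin


weights = [40, 45, 46, 48, 48, 50, 55, 55, 56, 58,
           58, 59, 60, 60, 62, 62, 64, 64, 65, 66]

lower, upper = z_confidence_interval(weights, sigma=9)
print(round(lower, 2), round(upper, 2))
```

The function takes the confidence coefficient as an argument, so the same call with `confidence=0.90` or `confidence=0.99` shows how the limits move as the confidence coefficient changes.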

Using the expression for the maximum allowable deviation, $D = Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$, the required sample size in estimating the population mean under simple random sampling scheme is computed as

$n = \left(\frac{Z_{\alpha/2}\,\sigma}{D}\right)^2$

(rounded up to the next higher integer). Hence, greater variability of the population, a larger confidence coefficient, and a smaller maximum allowable deviation all require a larger sample size.

•

Illustration of the Computation: Suppose we want to estimate the true average weight of learners enrolled in a school using a sample to be drawn using simple random sampling. How large should the sample be if we want the estimate to be within 2 kg of the true value and we want to be 99% confident of our estimate? We could assume that the population standard deviation of the weight is 9 kg.

$n = \left(\frac{Z_{0.005}\,\sigma}{D}\right)^2 = \left(\frac{2.576 \times 9}{2}\right)^2 \approx 134.37$
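The sample-size computation in this illustration can be sketched in Python as follows. The function name `required_sample_size` is made up for this example; it rounds up because a fraction of a learner cannot be sampled and rounding down would miss the target precision.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(sigma, max_deviation, confidence):
    """Smallest n such that Z_{alpha/2} * sigma / sqrt(n) <= max_deviation."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z * sigma / max_deviation) ** 2)


# sigma = 9 kg, maximum allowable deviation D = 2 kg, 99% confidence
n_required = required_sample_size(sigma=9, max_deviation=2, confidence=0.99)
print(n_required)
```

Halving the maximum allowable deviation to 1 kg roughly quadruples the required sample size, since n grows with the square of 1/D.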

Thus, we need 135 learners to estimate the true average weight of learners enrolled in the school under simple random sampling scheme, with 99% confidence and a maximum allowable deviation of 2 kg.

TEACHER TIPS
• Use the same numerical example for future lessons.

ENRICHMENT
For an enrichment activity, ask learners to collect data to determine how far they can walk blindfolded down the length of a hallway before they deviate from a straight walking path and go “out of bounds.” This data will be used to test the theory that humans are not able to walk in a straight line without being able to see landmarks. Learners can examine whether the data set they collected appears to be normally distributed and build a 95% confidence interval to estimate the mean distance learners can walk before going out of bounds.

REFERENCES Albert, J. R. G. (2008). Basic Statistics for the Tertiary Level (ed. Roberto Padua, Welfredo Patungan, Nelia Marquez). Philippines: Rex Bookstore. De Veau, R. D., Velleman, P. F., and Bock, D. E. (2006). Intro Stats. Pearson Ed. Inc. Freedman, D., Pisani, R, and Purves. (2007). Statistics, Fourth Edition. New York: W. W. Norton & Company. Huey, M. Walk the Line. STatistics Education Web (STEW). Retrieved from https://www.amstat.org/education/stew/pdfs/WalktheLine.docx Workbooks in Statistics 1: 11th Edition. Institute of Statistics, UP Los Baños, College Laguna 4031


ASSESSMENT I. Using some of the problems in Lesson 2 of this Chapter, ask learners to do the computational exercises on the construction of the confidence interval of the population mean and determination of the required sample size. 1. A company that manufactures electronic calculators uses a certain type of plastic. An alternative plastic material is introduced in the market and the manager of the company is thinking of shifting to this material. He will decide to shift if the mean breaking strength of the new material is greater than 155. It is known that the breaking strengths of the new plastic material follow the normal distribution and have a standard deviation of 10 psi (pounds per square inch). Six samples of the new plastic materials were randomly selected and their breaking strengths were determined. The data obtained were 156, 154, 168, 157, 160 and 158. Construct and interpret a 98% confidence interval for the true mean breaking strength of the new plastic material. Answer: With

$\bar{x} \approx 158.83$ psi and its standard error equal to $\frac{10}{\sqrt{6}} \approx 4.08$ psi, the 98% confidence interval for the true mean breaking strength of the new plastic material is $158.83 \pm 2.33(4.08)$, or (149.32, 168.34). We say that we are 98% confident that the true mean breaking strength of the new plastic material is between 149.3 and 168.3 psi.

2. The head of the Philippine University (PU) observes a decline in the alcoholic expenditures of learners from a monthly expenditure of Php 350 in the previous year. To check on this, he randomly selected 10 PU learners who drink alcoholic beverages and asked the amount, in pesos, that they usually spend on alcoholic beverages in a month. It is known that the usual amount spent on alcoholic beverages by learners who drink alcoholic beverages follows the normal distribution with standard deviation of Php 10. The data collected are: 400, 235, 200, 250, 200, 300, 500, 430, 420, and 220.

a.

Construct and interpret a 95% confidence interval for the true mean amount spent by the learners on alcoholic beverages.
Answer: With $\bar{x} = 315.50$ pesos and its standard error equal to $\frac{10}{\sqrt{10}} \approx 3.16$ pesos, the 95% confidence interval for the true mean amount spent by the learners on alcoholic beverages is expressed as $315.50 \pm 1.96(3.16)$, or (309.30, 321.70). We say that we are 95% confident that the true mean amount spent by learners on alcoholic beverages is between Php 309.30 and Php 321.70.

b. Find a 90% confidence interval for the true mean amount spent by the learners on alcoholic beverages.
Answer: The 90% confidence interval for the true mean amount spent by the learners on alcoholic beverages is expressed as $315.50 \pm 1.645(3.16)$, or (310.30, 320.70).


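c. Calculate the widths of the confidence intervals in (a) and (b). How was the width of the confidence interval affected by the decrease in the confidence coefficient, holding the other factors constant?
Answer: The width of the 95% confidence interval is $2(1.96)\frac{10}{\sqrt{10}} \approx 12.40$ pesos, while the width of the 90% confidence interval is $2(1.645)\frac{10}{\sqrt{10}} \approx 10.40$ pesos. Holding other factors constant, as the confidence coefficient decreases, the interval becomes narrower.

d. How many learners must be sampled in order to be 99% confident that the estimated mean amount spent on alcoholic beverages will be within Php 2.00 of the true mean?
Answer: $n = \left(\frac{2.576 \times 10}{2}\right)^2 \approx 165.89$, so at least 166 learners must be sampled.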

3. A local government official observes an increase in the number of individuals with cardiovascular and obesity problems in his barangay. In order to improve the health conditions of his constituents, he aims to promote an easy and cheap way to reduce weight. It is known that obesity results in greater risk of having illnesses like diabetes and heart problems. He encouraged his constituents to participate in his Dance for Life project every weekend for 3 months. To know if the program is effective in reducing weight, he randomly selected 12 participants from the group who completed the program. The weight loss data, in kilograms, of the 12 randomly selected participants after completing the program are: 0.5, 0.7, 0.9, 1.1, 1.2, 1.3, 1.4, 2.0, 2.3, 2.4, 2.7, and 3.0. It is known that the weight loss of those who have completed the dance program follows a normal distribution with variance of 3.24 kg².

a. Construct and interpret a 90% confidence interval for the true mean weight loss of the participants who have completed the dance program. Answer: With

$\bar{x} = 1.625$ kg and its standard error equal to $\frac{\sqrt{3.24}}{\sqrt{12}} = \frac{1.8}{\sqrt{12}} \approx 0.5196$ kg, the 90% confidence interval for the true mean weight loss of the participants who have completed the dance program is expressed as $1.625 \pm 1.645(0.5196)$, or (0.7702, 2.4798). We say that we are 90% confident that the true mean weight loss of the participants who have completed the dance program is between 0.7702 and 2.4798 kg.

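b. How many participants must be sampled in order to be 95% confident that the estimated mean weight loss of the participants will be within 0.5 kg of the true mean?
Answer: $n = \left(\frac{1.96 \times 1.8}{0.5}\right)^2 \approx 49.79$, so at least 50 participants must be sampled.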


II. Provide the best choice in each item:

1. The width of a confidence interval estimate for a proportion will be
a) narrower for 99% confidence than for 95% confidence.
b) wider for a sample size of 100 than for a sample size of 50.
c) narrower for 90% confidence than for 95% confidence.
d) narrower when the sample proportion is 0.50 than when the sample proportion is 0.20.
ANSWER: C

2. A 99% confidence interval estimate can be interpreted to mean that
a) if all possible samples are taken and confidence interval estimates are developed, 99% of them would include the true population mean somewhere within their interval.
b) we have 99% confidence that we have selected a sample whose interval does include the population mean.
c) Both of the above.
d) None of the above.
ANSWER: C

3. When determining the sample size necessary for estimating the true population mean, which factor is not considered when sampling with replacement?
a) The population size.
b) The population standard deviation.
c) The level of confidence desired in the estimate.
d) The allowable or tolerable sampling error.
ANSWER: A

4. Suppose a 95% confidence interval for µ turns out to be (1,000, 2,100). To make more useful inferences from the data, it is desired to reduce the width of the confidence interval. Which of the following will result in a reduced interval width?
a) Increase in the sample size.
b) Decrease in the confidence level.
c) Increase in the sample size and decrease in the confidence level.
d) Increase in the confidence level and decrease in the sample size.
ANSWER: C

5. In the construction of confidence intervals, if all other quantities are unchanged, an increase in the sample size will lead to a ______ interval.
a) narrower
b) wider
c) less significant
d) biased
ANSWER: A


Chapter 4: Estimation of Parameters
Lesson 4: Confidence Interval Estimation of the Population Mean (Part 2)

TIME FRAME: 60 minutes

LESSON OVERVIEW: In this lesson, learners continue to learn about interval estimation of the population mean discussed in the previous lesson, but this time under the assumption that the parent distribution follows a normal curve and that the population variance $\sigma^2$ is unknown. The interval estimate makes use of percentiles of a Student’s t distribution with n-1 degrees of freedom.

LEARNING COMPETENCIES
At the end of the lesson, the learner should be able to:
• Construct a (1-α)100% confidence interval estimator of the population mean when the population variance is unknown
• Use the Student’s t distribution table in getting a tabular value
• Construct a (1-α)100% confidence interval estimator of the population mean when the population variance is unknown and sample size is large enough to invoke the Central Limit Theorem
• Interpret confidence interval estimates

PRE-REQUISITE KNOWLEDGE AND SKILLS: Knowledge in confidence interval estimation of the population mean when the population variance is known

LESSON OUTLINE
A. Construction and interpretation of a (1-α)100% confidence interval estimator of the population mean when the population variance is unknown
B. Use of the Student’s t distribution table in getting a tabular value
C. Construction and interpretation of a (1-α)100% confidence interval estimator of the population mean when the population variance is unknown and sample size is large enough to invoke the Central Limit Theorem
D. Illustration on the computation of an interval estimate of the population mean and its interpretation

DEVELOPMENT OF THE LESSON
First, recall how to construct an interval estimator:

$$\text{point estimate} \pm (\text{tabular value}) \times (\text{standard error of the estimator})$$

In this expression, the tabular value depends on the sampling distribution of the sample mean. You learned in the previous lesson that, when the population variance is known, the tabular value is taken from the standard normal distribution. When the population variance is unknown, the construction of the confidence interval changes slightly: both the tabular value and the standard error of the sample mean are affected.

A. Construction and interpretation of a (1-α)100% confidence interval estimator of the population mean when the population variance is unknown

When the population variance ($\sigma^2$) is unknown, it has to be estimated using a simple random sample of size n. A point estimator of the population variance is the sample variance, denoted as $s^2$ and computed as

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$

The square root of the sample variance is the sample standard deviation, denoted as s. This point estimate of the population standard deviation is used in computing the standard error of the sample mean, which is the ratio of the sample standard deviation to the square root of the sample size, or mathematically, $s/\sqrt{n}$.

B. Use of the Student’s t distribution table in getting a tabular value

The tabular value to use comes from the Student’s t-distribution table. Usually, we use the notation $t_{(\alpha/2,\,n-1)}$ for the tabular value in the Student’s t-distribution with degrees of freedom equal to n-1. This tabular value is the point in the distribution whose area to its right is equal to α/2.



A Student’s t-distribution table (Please see attached table generated using MS Excel®) provides the area or probability to the right of a given value (t0). The illustration below shows a part of the table. The first row of the table provides selected probabilities or areas while the first column provides the degrees of freedom. The intersection of the area and the degrees of freedom is the needed tabular value.

Degrees of                Selected probabilities
freedom (df)    0.10     0.05     0.025    0.01     0.005
1               3.08     6.31     12.71    31.82    63.66
2               1.89     2.92     4.30     6.96     9.92
3               1.64     2.35     3.18     4.54     5.84
4               1.53     2.13     2.78     3.75     4.60
5               1.48     2.02     2.57     3.36     4.03
6               1.44     1.94     2.45     3.14     3.71

For example, the tabular value with an area of 0.025 to its right in a Student’s t distribution with 3 df is $t_{(0.025,\,3)} = 3.18$.

[Figure: Student’s t curve with 3 df; the area of 0.025 lies to the right of $t_{(0.025,\,3)} = 3.18$.]
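The table lookup described above can also be sketched in code. The snippet below is only an illustration: it stores the six-row excerpt of the table in a dictionary, and the names `T_TABLE` and `t_tabular` are hypothetical helpers that do not come from the teaching guide.

```python
# Excerpt of the Student's t table (df 1 to 6), keyed by degrees of
# freedom and then by the area to the right of the tabular value.
T_TABLE = {
    1: {0.10: 3.08, 0.05: 6.31, 0.025: 12.71, 0.01: 31.82, 0.005: 63.66},
    2: {0.10: 1.89, 0.05: 2.92, 0.025: 4.30, 0.01: 6.96, 0.005: 9.92},
    3: {0.10: 1.64, 0.05: 2.35, 0.025: 3.18, 0.01: 4.54, 0.005: 5.84},
    4: {0.10: 1.53, 0.05: 2.13, 0.025: 2.78, 0.01: 3.75, 0.005: 4.60},
    5: {0.10: 1.48, 0.05: 2.02, 0.025: 2.57, 0.01: 3.36, 0.005: 4.03},
    6: {0.10: 1.44, 0.05: 1.94, 0.025: 2.45, 0.01: 3.14, 0.005: 3.71},
}


def t_tabular(confidence, n):
    """Return t(alpha/2, n-1) from the excerpt for a sample of size n.

    Hypothetical helper: only the confidence levels and degrees of
    freedom present in the excerpt are supported.
    """
    alpha = round(1 - confidence, 4)   # e.g. 0.95 -> 0.05
    area = round(alpha / 2, 4)         # area to the right: alpha / 2
    return T_TABLE[n - 1][area]        # row is df = n - 1


# A 95% interval from a sample of 4 observations uses 3 df.
print(t_tabular(0.95, 4))
```

With a sample of size 4 and 95% confidence, the helper reads the row for 3 df and the column for 0.025, the same cell highlighted in the excerpt.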



Thus, a (1-α)100% confidence interval (CI) of the population mean (µ) when the population variance ($\sigma^2$) is unknown is constructed as

$$\bar{x} \pm t_{(\alpha/2,\,n-1)} \frac{s}{\sqrt{n}} \quad \text{or} \quad \left( \bar{x} - t_{(\alpha/2,\,n-1)} \frac{s}{\sqrt{n}},\; \bar{x} + t_{(\alpha/2,\,n-1)} \frac{s}{\sqrt{n}} \right)$$

where $\bar{x}$ and s are the sample mean and sample standard deviation, respectively. Both are computed using a simple random sample of size n. The lower limit of the interval is $\bar{x} - t_{(\alpha/2,\,n-1)} \frac{s}{\sqrt{n}}$ while the upper limit is $\bar{x} + t_{(\alpha/2,\,n-1)} \frac{s}{\sqrt{n}}$.

• For this case, the width of the interval estimate is computed as $2\, t_{(\alpha/2,\,n-1)} \frac{s}{\sqrt{n}}$ and the maximum allowable deviation is $t_{(\alpha/2,\,n-1)} \frac{s}{\sqrt{n}}$.

C. Construction and interpretation of a (1-α)100% confidence interval estimator of the population mean when the population variance is unknown and sample size is large enough to invoke the Central Limit Theorem

A property of the Student’s t distribution is that it approaches the standard normal distribution as its degrees of freedom increase. Since the degrees of freedom that concern us here depend on the sample size n, we can say that as n increases, the Student’s t distribution approaches the standard normal distribution. This is consistent with the Central Limit Theorem, discussed in the previous chapter. With these concepts, the tabular value to be used in the construction of the confidence interval for the population mean when the sample size is at least 30 is taken from the Z-distribution table. Thus, the following expression is used in constructing a (1-α)100% confidence interval (CI) of the population mean (µ) when the population variance ($\sigma^2$) is unknown and the sample size is at least 30:

$$\bar{x} \pm z_{\alpha/2} \frac{s}{\sqrt{n}} \quad \text{or} \quad \left( \bar{x} - z_{\alpha/2} \frac{s}{\sqrt{n}},\; \bar{x} + z_{\alpha/2} \frac{s}{\sqrt{n}} \right)$$

• For this case, the width of the interval estimate is computed as $2\, z_{\alpha/2} \frac{s}{\sqrt{n}}$ and the maximum allowable deviation is $z_{\alpha/2} \frac{s}{\sqrt{n}}$.

D. Illustration of the Computation

Again, consider the numerical example used in point and interval estimation of the population mean where the following observed weights (in kilograms) of a random sample of 20 learners were used.


40   45   46   48   48   50   55   55   56   58
58   59   60   60   62   62   64   64   65   66

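The sample mean is computed as $\bar{x} = \frac{\sum x_i}{n} = \frac{1121}{20} = 56.05$ kg. This time, you don’t have an assumed value of the population standard deviation of the weights of all learners in the class. Because of this, there is a need to use a point estimate of the population standard deviation. Using the same sample observations given above, a point estimate of the population standard deviation is

$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}} = \sqrt{\frac{1072.95}{19}} \approx 7.51 \text{ kg}$$

With the sample mean and standard deviation, and the tabular value $t_{(0.025,\,19)} = 2.09$, the 95% confidence interval estimate of the true average weight of the learners is

$$56.05 \pm 2.09 \cdot \frac{7.51}{\sqrt{20}} \approx 56.05 \pm 3.51 = (52.54,\; 59.56)$$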
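The computation for the 20 observed weights can be checked with a short script. This is a minimal sketch using only Python's standard library. The tabular value 2.09 is read from the Student's t table for 19 degrees of freedom (area 0.025) rather than computed, and the variable names are illustrative.

```python
import math
import statistics

# Observed weights (kg) of the random sample of 20 learners.
weights = [40, 45, 46, 48, 48, 50, 55, 55, 56, 58,
           58, 59, 60, 60, 62, 62, 64, 64, 65, 66]

n = len(weights)
x_bar = statistics.mean(weights)      # sample mean
s = statistics.stdev(weights)         # sample standard deviation (divisor n - 1)
std_error = s / math.sqrt(n)          # standard error of the sample mean

t_value = 2.09                        # t(0.025, 19), taken from the table
margin = t_value * std_error          # maximum allowable deviation

lower = x_bar - margin
upper = x_bar + margin

print(f"mean = {x_bar:.2f} kg, s = {s:.2f} kg")
print(f"95% CI: ({lower:.2f}, {upper:.2f}) kg")
```

Reading the tabular value from the table keeps the sketch aligned with the manual procedure taught in this lesson. Statistical software would normally compute a more precise value.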

Thus, we say that we are 95% confident that the true average weight of all learners in the class is between about 52.5 kg and 59.6 kg.

TEACHER TIPS
• Use the same numerical example for future lessons.

ENRICHMENT
Plan an enrichment activity that involves learners measuring their foot sizes. The teacher records the foot sizes of all learners in class in order to obtain the population mean foot size of the entire class. The class is then divided into groups of 3 to 5 learners. Using a simple random sample of 10 learners, the groups will estimate the average foot size of the entire class. Numeric summaries (mean and five-number summary) and box plots can be used to obtain point and interval estimates, respectively, for the mean foot size of the entire class. The confidence level, or reliability, for the interval estimates computed by the learners is estimated by obtaining the proportion of interval estimates that “trap” the population average foot size of the entire class.

REFERENCES
Albert, J. R. G. (2008). Basic Statistics for the Tertiary Level (ed. Roberto Padua, Welfredo Patungan, Nelia Marquez). Philippines: Rex Bookstore.


Parks, S., Steinwachs, M., Diaz, R., and Molinaro, M. Did I Trap the Median? STatistics Education Web (STEW). Retrieved from https://www.amstat.org/education/stew/pdfs/DidITrapTheMedian.docx
De Veaux, R. D., Velleman, P. F., and Bock, D. E. (2006). Intro Stats. Pearson Ed. Inc.
Freedman, D., Pisani, R., and Purves, R. (2007). Statistics, Fourth Edition. New York: W. W. Norton & Company.
Workbooks in Statistics 1: 11th Edition. Institute of Statistics, UP Los Baños, College Laguna 4031.

ASSESSMENT
I. Using a problem in Lesson 2 of this Chapter, ask learners to do the computational exercises on the construction of the confidence interval of the population mean when the population variance is unknown.

1. The nickel-metal hydride (NiMH) battery is one of the most highly advertised rechargeable batteries today. It is lighter and can last 2 to 4 times longer than alkaline or standard nickel-cadmium (NiCd) batteries. To evaluate its performance, a random sample of 10 NiMH batteries was taken. The number of photos taken using each battery in a digital camera is given as follows: 405, 564, 342, 456, 435, 543, 473, 452, 462, and 475. Construct and interpret a 95% confidence interval for the true mean number of photos taken using the NiMH battery.

Answer: With $\bar{x}$ equal to 460.7 photos and its standard error equal to $s/\sqrt{n} = 63.03/\sqrt{10} \approx 19.93$ photos, the 95% confidence interval for the true mean number of photos taken using the NiMH battery is $460.7 \pm 2.26(19.93) = (415.7,\; 505.7)$. We say that we are 95% confident that the true mean number of photos taken using the NiMH battery is between 416 and 506 photos.

Further, we could have the following additional problems:

1. The Municipal Planning Officer of Los Baños wants to determine if the average wage of labourers per hour in the municipality is below Php 320. A random sample of 40 labourers in the municipality yielded a mean of Php 300 per hour with a standard deviation of Php 50 per hour. Using this information, construct a 99% confidence interval estimate of the true average wage rate per hour of labourers in the Municipality of Los Baños.

Answer: With $\bar{x} = 300$ pesos and its standard error equal to $s/\sqrt{n} = 50/\sqrt{40} \approx 7.91$ pesos, the 99% confidence interval for the true average wage rate per hour of labourers in the Municipality of Los Baños is $300 \pm 2.58(7.91) = (279.6,\; 320.4)$. We say that we are 99% confident that the true average wage rate per hour of labourers in the Municipality of Los Baños is between Php 280 and Php 320.

2. A machine produces metal pieces which are cylindrical in shape with a mean diameter of 14.20 cm if the machine is in good condition. A quality engineer evaluates the condition of the machine by using a random sample of 36 runs which resulted in a mean diameter of 14.25 cm with a standard deviation of 0.30 cm. Using this information, construct a 95% confidence interval estimate of the true average diameter of the cylindrical metal pieces produced by the machine.

Answer: With $\bar{x} = 14.25$ cm and its standard error equal to $s/\sqrt{n} = 0.30/\sqrt{36} = 0.05$ cm, the 95% confidence interval for the true average diameter of the cylindrical metal pieces produced by the machine is $14.25 \pm 1.96(0.05) = (14.152,\; 14.348)$. We say that we are 95% confident that the average diameter of the cylindrical metal pieces produced by the machine is between 14.152 and 14.348 cm.
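The two large-sample problems above can be checked with a short sketch. It assumes Python's standard library only and uses `statistics.NormalDist` to obtain the z tabular value. The function name `z_interval` is a hypothetical helper, not something defined in the teaching guide.

```python
import math
from statistics import NormalDist


def z_interval(x_bar, s, n, confidence):
    """Large-sample CI for the mean: x_bar +/- z(alpha/2) * s / sqrt(n).

    Hypothetical helper; appropriate when n is at least 30, as stated
    in the lesson.
    """
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z tabular value
    margin = z * s / math.sqrt(n)             # maximum allowable deviation
    return x_bar - margin, x_bar + margin


# Wage problem: n = 40, mean Php 300, s = Php 50, 99% confidence.
wage_ci = z_interval(300, 50, 40, 0.99)

# Diameter problem: n = 36, mean 14.25 cm, s = 0.30 cm, 95% confidence.
diameter_ci = z_interval(14.25, 0.30, 36, 0.95)

print(f"Wage CI: ({wage_ci[0]:.1f}, {wage_ci[1]:.1f}) pesos")
print(f"Diameter CI: ({diameter_ci[0]:.3f}, {diameter_ci[1]:.3f}) cm")
```

The computed z values carry more decimal places than the table, so the limits can differ slightly from a hand computation in the last decimal place.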

II. Provide the Best Choice

1. Which of the following is not true about the Student’s t distribution?
a) It has more area in the tails and less in the center than does the normal distribution.
b) It is used to construct confidence intervals for the population mean when the population standard deviation is known.
c) It is bell-shaped and symmetrical.
d) As the number of degrees of freedom increases, the t distribution approaches the normal distribution.
ANSWER: B

2. The t distribution
a) assumes the population is normally distributed.
b) approaches the normal distribution as the sample size increases.
c) has more area in the tails than does the normal distribution.
d) All of the above.
ANSWER: D

3. A major department store chain is interested in estimating the average amount its credit card customers spent on their first visit to the chain’s new store in the mall. Fifteen credit card accounts were randomly sampled and analyzed with the following results: $\bar{x} = 2525$ pesos and $s = 1000$ pesos. Assuming the distribution of the amount spent on their first visit is approximately normal, what is the shape of the sampling distribution of the sample mean that will be used to create the desired confidence interval for µ?
a) Approximately normal with a mean of Php 2525
b) A standard normal distribution
c) A t distribution with 15 degrees of freedom
d) A t distribution with 14 degrees of freedom
ANSWER: D

4. A major department store chain is interested in estimating the average amount its credit card customers spent on their first visit to the chain’s new store in the mall. Fifteen credit card accounts were randomly sampled and analyzed with the following results: $\bar{x} = 2525$ pesos and $s = 1000$ pesos. Construct a 95% confidence interval for the average amount its credit card customers spent on their first visit to the chain’s new store in the mall.
a) 2525 pesos ± 454.5 pesos
b) 2525 pesos ± 506 pesos
c) 2525 pesos ± 550 pesos
d) 2525 pesos ± 554 pesos
ANSWER: D

5. As an aid to the establishment of personnel requirements, the director of a hospital wishes to estimate the mean number of people who are admitted to the emergency room during a 24-hour period. The director randomly selects 64 different 24-hour periods and determines the number of admissions for each. For this sample, $\bar{x} = 19.8$ and $s^2 = 25$. Which of the following assumptions is necessary in order for a confidence interval to be valid?
a) The population sampled from has an approximate normal distribution.
b) The population sampled from has an approximate t distribution.
c) The mean of the sample equals the mean of the population.
d) None of these assumptions are necessary.
ANSWER: D


Student’s t Distribution Table

[Figure: Student’s t curve; the area to the right of the tabular value t0 is the tabulated probability.]

        Selected probability or area to the right of a tabular value (α)
df      0.10     0.05     0.025    0.01     0.005
1       3.08     6.31     12.71    31.82    63.66
2       1.89     2.92     4.30     6.96     9.92
3       1.64     2.35     3.18     4.54     5.84
4       1.53     2.13     2.78     3.75     4.60
5       1.48     2.02     2.57     3.36     4.03
6       1.44     1.94     2.45     3.14     3.71
7       1.41     1.89     2.36     3.00     3.50
8       1.40     1.86     2.31     2.90     3.36
9       1.38     1.83     2.26     2.82     3.25
10      1.37     1.81     2.23     2.76     3.17
11      1.36     1.80     2.20     2.72     3.11
12      1.36     1.78     2.18     2.68     3.05
13      1.35     1.77     2.16     2.65     3.01
14      1.35     1.76     2.14     2.62     2.98
15      1.34     1.75     2.13     2.60     2.95
16      1.34     1.75     2.12     2.58     2.92
17      1.33     1.74     2.11     2.57     2.90
18      1.33     1.73     2.10     2.55     2.88
19      1.33     1.73     2.09     2.54     2.86
20      1.33     1.72     2.09     2.53     2.85
21      1.32     1.72     2.08     2.52     2.83
22      1.32     1.72     2.07     2.51     2.82
23      1.32     1.71     2.07     2.50     2.81
24      1.32     1.71     2.06     2.49     2.80
25      1.32     1.71     2.06     2.49     2.79
26      1.31     1.71     2.06     2.48     2.78
27      1.31     1.70     2.05     2.47     2.77
28      1.31     1.70     2.05     2.47     2.76
29      1.31     1.70     2.05     2.46     2.76
30      1.31     1.70     2.04     2.46     2.75
∞       1.28     1.65     1.96     2.33     2.58
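The last row of the table (df = ∞) corresponds to the standard normal distribution, which is why the Z-distribution table can be used for large samples. The sketch below, which assumes only Python's standard library, compares that row with values computed by `statistics.NormalDist` and shows how the 0.025 column shrinks toward 1.96 as the degrees of freedom grow. The dictionary names are illustrative.

```python
from statistics import NormalDist

# Last row of the table (df = infinity), keyed by right-tail area.
INF_ROW = {0.10: 1.28, 0.05: 1.65, 0.025: 1.96, 0.01: 2.33, 0.005: 2.58}

# Selected entries of the 0.025 column, keyed by degrees of freedom.
T_0025 = {5: 2.57, 10: 2.23, 20: 2.09, 30: 2.04}

# z value with the given area to its right, for each tabulated area.
z_values = {area: NormalDist().inv_cdf(1 - area) for area in INF_ROW}

# Largest disagreement between the table row and the computed z values.
max_gap = max(abs(INF_ROW[area] - z_values[area]) for area in INF_ROW)
print(f"largest difference from the z values: {max_gap:.4f}")

# Distance of each t value from z(0.025) as df increases.
gaps = [T_0025[df] - z_values[0.025] for df in sorted(T_0025)]
for df, gap in zip(sorted(T_0025), gaps):
    print(f"df = {df:2d}: t - z = {gap:.2f}")
```

The differences in the last row come only from rounding to two decimal places in the table.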

CHAPTER 4: ESTIMATION OF PARAMETERS
Lesson 5: Point and Confidence Interval Estimation of the Population Proportion

TIME FRAME: 60 minutes

OVERVIEW OF LESSON: In this lesson, learners learn how to construct interval estimates of the population proportion. They are also taught how to determine minimum sample size requirements for estimating the population proportion.

LEARNING COMPETENCIES
At the end of the lesson, the learner should be able to:
• Identify a point estimator of the population proportion
• Discuss the properties of the sample proportion as point estimator
• Compute a point estimate of the population proportion
• Identify an appropriate confidence interval estimator of the population proportion using a large sample based on the Central Limit Theorem
• Construct a (1-α)100% confidence interval estimator of the population proportion using a large sample
• Interpret point and confidence interval estimates of the population proportion

PRE-REQUISITE KNOWLEDGE AND SKILLS: Knowledge in point estimation as well as the sampling distribution of the sample proportion

LESSON OUTLINE
A. Point estimator of the population proportion
B. Properties of the sample proportion as point estimator of the population proportion
C. Construction and interpretation of a (1-α)100% confidence interval estimator of the population proportion using a large sample
D. Illustration on the computation of point and interval estimates of the population proportion and their interpretation

DEVELOPMENT OF THE LESSON
First, review the lesson on proportion as a parameter. The ratio of the number of units possessing a characteristic to the total number of units in the population is a population proportion. Examples are the proportion of learners who passed the last examination, the proportion of Filipinos who live in poverty, the proportion of housing units in the Philippines with roofs made of strong materials, and the proportion of Piatos chips that are not broken.

As a motivational activity, present the partial list of variables below in a data set gathered from learners enrolled in Grade 11 Statistics and Probability this school year.

VARIABLE        DEFINITION/DESCRIPTION
HRS_STUD        usual number of hours spent studying outside school hours during weekdays
SEX             biological sex
HEIGHT          height measured in cm
WEIGHT          weight measured in kg
WAIST           waist girth measured in cm
HIP             hip girth measured in cm
MGINCOME        monthly family gross income
MONTH_ALLOW     monthly allowance
WEEK_FOOD       weekly expenditures on food outside home
AGE_FATHER      father's age
AGE_MOTHER      mother's age
NUM_SIBLINGS    number of siblings
MODE_TRANS      mode of transportation in going to school (private, service, public, not applicable (i.e. walking))
GENRE           preferred genre of music (e.g. rock, acoustic, mellow, etc.)

Ask learners to identify proportions that could be defined from these variables. Note that some variables are straightforward while others need to be redefined further. The following are some examples identified:
1. Proportion of learners who are enrolled in Grade 11 Statistics and Probability this school year and who spend at least 2 hours studying outside school hours during weekdays
2. Proportion of male learners who are enrolled in Grade 11 Statistics and Probability this school year
3. Proportion of learners who are enrolled in Grade 11 Statistics and Probability this school year and at least 160 cm tall
4. Proportion of learners who are enrolled in Grade 11 Statistics and Probability this school year and at most 100 kg
5. Proportion of learners who are enrolled in Grade 11 Statistics and Probability this school year and with waist girth of at most 50 cm
6. Proportion of learners who are enrolled in Grade 11 Statistics and Probability this school year and with hip girth of at least 60 cm
7. Proportion of learners who are enrolled in Grade 11 Statistics and Probability this school year and who belong to a family whose gross monthly income is at most Php 15,000
8. Proportion of learners who are enrolled in Grade 11 Statistics and Probability this school year and with a monthly allowance equal to Php 4,000
9. Proportion of learners who are enrolled in Grade 11 Statistics and Probability this school year and with a weekly food expenditure outside home equal to Php 500
10. Proportion of learners who are enrolled in Grade 11 Statistics and Probability this school year and with a father whose age is at least 60 years
11. Proportion of learners who are enrolled in Grade 11 Statistics and Probability this school year and with a mother whose age is at least 60 years
12. Proportion of learners who are enrolled in Grade 11 Statistics and Probability this school year and with at least 3 siblings
13. Proportion of learners who are enrolled in Grade 11 Statistics and Probability this school year and who go to school using private vehicles
14. Proportion of learners who are enrolled in Grade 11 Statistics and Probability this school year and who have rock as preferred music genre

Choose one of these proportions and ask learners how they would go about estimating it. In the discussion, take note of the following:

• If, in general, you were to estimate the proportion of learners enrolled in Grade 11 Statistics and Probability this school year with rock as preferred music genre from a simple random sample of size n, an estimator for this population proportion is the ratio of the number of sampled Grade 11 Statistics and Probability learners who preferred rock to the sample size n. This is referred to as the sample proportion, which is defined as the ratio of the number of sample units possessing the characteristic of interest to n. Mathematically, the point estimator of the population proportion, based on a simple random sample of size n, is expressed as $\hat{p} = \frac{a}{n}$, where a is the number of sample units having the characteristic of interest.

• The sample proportion $\hat{p}$ as estimator of the population proportion is unbiased with standard error equal to $\sqrt{\frac{P(1-P)}{n}}$, which is estimated by $\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$. This was discussed in the previous chapter on sampling. Also, with a sufficiently large sample size n (say at least 100), the sampling distribution of the sample proportion could be approximated by a normal distribution based on the Central Limit Theorem (CLT).

• Using the above-mentioned concepts, a (1-α)100% confidence interval (CI) of the population proportion (P) is constructed as

$$\hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \quad \text{or} \quad \left( \hat{p} - z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\; \hat{p} + z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \right)$$

where $\hat{p}$ is the sample proportion computed from a simple random sample of size n. The lower limit of the interval is $\hat{p} - z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$ while the upper limit is $\hat{p} + z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$.

Computing the width of the confidence interval estimate, we have $2\, z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$, where the maximum allowable deviation is equal to $z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$.
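The interval estimator just described can be sketched as a small function. This is only an illustration using Python's standard library. The name `proportion_interval` is a hypothetical helper, and the usage line applies it to a sample in which 30 of 50 learners prefer rock.

```python
import math
from statistics import NormalDist


def proportion_interval(a, n, confidence):
    """Large-sample CI for a population proportion.

    a: number of sample units with the characteristic of interest
    n: sample size
    Returns (p_hat, standard error, lower limit, upper limit).
    """
    p_hat = a / n                                   # sample proportion
    std_error = math.sqrt(p_hat * (1 - p_hat) / n)  # estimated standard error
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * std_error                          # maximum allowable deviation
    return p_hat, std_error, p_hat - margin, p_hat + margin


# 30 of 50 sampled learners prefer rock; 95% confidence.
p_hat, std_error, lower, upper = proportion_interval(30, 50, 0.95)
print(f"p_hat = {p_hat:.2f}, SE = {std_error:.4f}")
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```

Note that the standard error is the square root of $\hat{p}(1-\hat{p})/n$. Leaving out the square root would make the interval look far narrower than it really is.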

Illustration of the Computation: Suppose in a simple random sample of 50 Grade 11 Statistics and Probability learners, 30 of them said they preferred the music genre rock. The sample proportion is computed as $\hat{p} = \frac{30}{50} = 0.60$ with standard error equal to $\sqrt{\frac{0.60(0.40)}{50}} \approx 0.0693$. Using the same sample, the 95% confidence interval (CI) of the population proportion of Grade 11 Statistics and Probability learners with rock as preferred genre of music is constructed as

$$0.60 \pm 1.96(0.0693) \approx 0.60 \pm 0.136 = (0.464,\; 0.736)$$

Hence, we estimate that 6 out of every 10 Grade 11 Statistics and Probability learners would say that rock is their preferred music genre. Further, we could say that we are 95% confident that the true proportion of Grade 11 Statistics and Probability learners who would say that rock is their preferred music genre is between 0.46 and 0.74. In other words, out of every 100 Grade 11 Statistics and Probability learners, we are 95% confident that between 46 and 74 of them would say that rock is their preferred music genre.

ENRICHMENT
Most national opinion polls sample at least 1,200 respondents (although there were about 100 million Filipinos as of mid-2014) and typically ask about the approval ratings of government officials, especially the President. How is this possible? In this lesson, we saw that the likely size of the chance error in sample percentages depends on the size of the sample and hardly at all on the population size. The huge number of Filipinos that could be sampled does not affect the standard error (of the proportion) but only makes it operationally difficult to draw the random sample.

Is 1,200 a big enough sample? Most critics of sample surveys find it illogical that 1,200 respondents could represent millions. It turns out that 1,200 is indeed a reasonable sample size for estimating approval ratings. If the true approval rating of the President were 50%, then with a sample size of 1,200, the standard error for the proportion is about 1.4 percentage points, and we could have a margin of error of about 3 percentage points at 95% confidence. This shows why we ought to be able to accurately assess the winner of a presidential race even before election day itself, unless the proportions of votes for two candidates in an election are very close.

Suppose we are required to construct a 95% confidence interval for the proportion so that it has a width of 5%. What sample size would be required? In the previous chapter, we noted that a conservative estimate of the standard error of the sample proportion is

$$\sqrt{\frac{0.5(0.5)}{n}}$$

since the maximum value p(1-p) can take is when p = ½. Since we want the length of the confidence interval to be 5%, we would like the 95% confidence interval for the proportion to take the form

Sample Proportion ± 2.5%

This means that we want 1.96 (Estimate of Standard Error) = 0.025 or equivalently

$$1.96 \sqrt{\frac{0.5(0.5)}{n}} = 0.025$$

Solving this algebraic equation yields:

$$n = \left( 1.96 \cdot \frac{0.5}{0.025} \right)^2 \approx 1537$$

Ask learners whether they should account for sampling without replacement. Theoretically, the required sample size of 1,537 has to be adjusted by incorporating the population size N. For a population of 100,000,000, we would have to obtain a sample of size:

$$\frac{n}{1 + \frac{n-1}{N}} = \frac{1537}{1 + \frac{1536}{100{,}000{,}000}} \approx 1537$$
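The sample-size reasoning above can be sketched as follows. The function names are hypothetical, and the finite-population adjustment shown, $n / (1 + (n-1)/N)$, is one commonly used form that is assumed here rather than taken from the guide. The z value is computed rather than rounded to 1.96, which does not change the rounded-up answer.

```python
import math
from statistics import NormalDist


def z_for(confidence):
    """z tabular value with area (1 - confidence) / 2 to its right."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def margin_of_error(n, confidence=0.95, p=0.5):
    """Margin of error for a proportion, using p = 0.5 as the conservative case."""
    return z_for(confidence) * math.sqrt(p * (1 - p) / n)


def sample_size_for_proportion(margin, confidence=0.95, p=0.5):
    """Smallest n whose margin of error does not exceed `margin`."""
    return math.ceil((z_for(confidence) * math.sqrt(p * (1 - p)) / margin) ** 2)


def adjust_for_population(n0, population_size):
    """Finite-population adjustment (assumed form): n0 / (1 + (n0 - 1) / N)."""
    return math.ceil(n0 / (1 + (n0 - 1) / population_size))


n0 = sample_size_for_proportion(0.025)           # half-width of 2.5 points
adjusted = adjust_for_population(n0, 100_000_000)
poll_margin = margin_of_error(1200)              # typical national poll

print(n0, adjusted, round(poll_margin, 3))
```

For a small population the adjustment matters, but for a population of 100 million it leaves the required sample size unchanged.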

The adjusted sample size is the same as that obtained for sampling with replacement. This numerical result explains why nationwide polls typically use only 1,200 to 1,600 respondents. Emphasize to learners that a large population size has virtually no effect on the choice of the sample size when estimating a population proportion.

REFERENCES
Albert, J. R. G. (2008). Basic Statistics for the Tertiary Level (ed. Roberto Padua, Welfredo Patungan, Nelia Marquez). Philippines: Rex Bookstore.
De Veaux, R. D., Velleman, P. F., and Bock, D. E. (2006). Intro Stats. Pearson Ed. Inc.
Freedman, D., Pisani, R., and Purves, R. (2007). Statistics, Fourth Edition. New York: W. W. Norton & Company.
Workbooks in Statistics 1: 11th Edition. Institute of Statistics, UP Los Baños, College Laguna 4031.


ASSESSMENT
The following are some problems that could serve as computational exercises on point and confidence interval estimation of the population proportion.

1. Some government officials are proposing that the country’s academic calendar be moved from June-March to August-May. According to the officials, this proposal, if approved, can further improve education by synchronizing our calendar with that of other countries. Government officials would push through with the proposal if at least 85% of the student population favor it. To know the opinion of learners regarding the proposal, a simple random sample of learners was obtained and they were asked if they were in favor of it. Of the 1,000 surveyed learners, 892 said they were in favor of the approval of the proposal.

a. Find a point estimate of the true proportion of learners who are in favor of the approval of the proposal and find its standard error.

Answer: With a = 892, $\hat{p} = \frac{892}{1000} = 0.892$ and its standard error is equal to $\sqrt{\frac{0.892(0.108)}{1000}} \approx 0.0098$.

b. Construct a 99% confidence interval for the true proportion of learners who are in favor of the approval of the proposal. Interpret the confidence interval obtained.

Answer: The 99% confidence interval estimate of the true proportion of learners who are in favor of the approval of the proposal is expressed as

$$0.892 \pm 2.58(0.0098) = (0.867,\; 0.917)$$

We then say that we are 99% confident that the true proportion of learners who are in favor of the approval of the proposal is between 0.87 and 0.92.

2. Because of several political problems the country is experiencing right now, a lawyer became interested in knowing the opinion of the residents of a certain municipality about plunder issues. She came up with a proposed program regarding the resolution of plunder cases, to be pursued if the majority of the population were not satisfied with the result of plunder cases filed in the country. She randomly selected 500 individuals from the complete list of registered voters of the 2013 National Election in their municipality. Each respondent was asked if he/she was satisfied with the outcome of plunder cases filed in the country. Of the surveyed citizens, 180 said they were satisfied with the result of plunder cases filed in the country.

a. Find a point estimate of the true proportion of citizens who are not satisfied with the result of plunder cases filed in the country. Calculate the standard error of the estimate.

Answer: With a = 320, $\hat{p} = \frac{320}{500} = 0.64$ and its standard error is equal to $\sqrt{\frac{0.64(0.36)}{500}} \approx 0.0215$.

b. Construct a 95% confidence interval for the true proportion of citizens who are not satisfied with the result of plunder cases filed in the country. Interpret the confidence interval obtained.

Answer: The 95% confidence interval estimate of the true proportion of citizens who are not satisfied with the result of plunder cases filed in the country is expressed as

$$0.64 \pm 1.96(0.0215) = (0.598,\; 0.682)$$

We then say that we are 95% confident that the true proportion of citizens who are not satisfied with the result of plunder cases filed in the country is between 0.60 and 0.68.

ENRICHMENT
For the problems described above, discuss how the confidence interval estimates could be used to address the objective of each problem.

TEACHER TIPS
• Use the same numerical example for future lessons.


CHAPTER 4: ESTIMATION OF PARAMETERS Lesson 6: More on Point Estimates and Confidence Intervals TIM E FRAM E: 60 minutes OVERVIEW OF LESSON: In this lesson, learners undertake an activity to deepen their understanding of point and interval estimation. This lesson is largely taken from a STatistics Education Web (STEW) lesson plan called “Did I Trap the Median?” Learners recall the information provided at the beginning of Chapter 1, particularly the rating score they gave (from 1 to 10) about their state of happiness. Learners collect a random sample of 10 of their classmates’ records on their respective states of happiness in order to obtain point and interval estimates for the median level of state of happiness in the entire class. LEARNING COM PETENCIES At the end of the lesson, the learner should be able to: •

Calculate a point estimator of the population median number of text messages sent in a day in class



Construct a (1-α)100% confidence interval estimator of the population proportion using large sample



Interpret point and confidence interval estimates of the population proportion

M ATERIALS REQ UIRED : Ruler and Pencil, Calculators, Activity sheet LESSON OUTLINE A. Introduction B. Data Collection C. Data Analysis D. Enrichment DEVELOPM ENT OF THE LESSON A. Introduction This lesson involves an activity where learners collect sample data from their class to estimate the median state of happiness among the population of Grade 11 learners in the entire class. Each student obtains a point estimate and constructs an interval estimate for the median state of happiness in the entire class by using a simple random sample of 10 learners in the class. Numeric summaries and graphs are used to obtain point and interval estimates, respectively, for the median state of happiness in the entire class. The teacher examines the database on the state of happiness of all learners in class that was collected in the first lesson of the first chapter in order to obtain the population median state of happiness of the entire class. The confidence level for the interval estimates computed by

!

351$

the learners is estimated by obtaining the proportion of learners’ sample interval estimates that trap the population median state of happiness of the entire class. Ask learners to hypothesize the answers to some of these questions: 1. What are the advantages of collecting a sample of only 10 records of the state of happiness and not of the entire class to obtain the median state of happiness in the entire class? 2. What are the advantages and disadvantages of using the sample median to estimate the population median? 3. Is there any advantage to constructing an interval estimate as opposed to a point estimate (the sample median) for the population median? 4. Is it possible to ascribe a reliability value to the interval estimate (ascribe a probability that the interval contains the median)? 5. What are the factors that may affect the length and the reliability of an interval estimate? B. Data Collection Have the Learnerslearners recall that they reported their state of happiness to the teacher at the beginning of Lesson 1-01. This was put in a database. Now, let them get similar records of the state of happiness of 10 randomly selected learners in class. To ensure that each student will use a random sample of 10 records from the class database: 1. Ask learners to generate 10 random numbers from one to the total number of learners in class using a table of random digits. They may use the Table of Random Digits from Lesson 3-06. 2. Have learners write down the generated numbers from least to greatest on the data table. Explain that each number corresponds to a classmate. Next, list all the records of the state of happiness of each student (from the database collected in Lesson 101), together with the respective student numbers. Tell each student to write down only the records for each randomly generated number that corresponds to a classmate’s state of happiness. A sample student data set is shown in Table 4-06.1 below. 
A blank data table is provided in the Activity Sheet.

Table 4-06.1. Example Student Data Sample

Sample Student    State of Happiness
1                 7
2                 7
3                 8.5
4                 4
5                 6.5
6                 7
7                 7.5
8                 5.5
9                 7.5
10                9.5
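Where a table of random digits is not at hand, the selection in steps 1 and 2 can be imitated in software. The sketch below is only an illustration: the function name `draw_sample_ids`, the class size of 20, and the seed are assumptions, and the draw is made without replacement so that no classmate is picked twice.

```python
import random


def draw_sample_ids(class_size, sample_size=10, seed=None):
    """Pick distinct student numbers from 1..class_size and sort them.

    Hypothetical helper: sampling without replacement mirrors discarding
    repeated numbers when reading a table of random digits.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, class_size + 1), sample_size))


# Assumed class of 20 learners; the seed only makes the draw repeatable.
ids = draw_sample_ids(class_size=20, sample_size=10, seed=2016)
print(ids)
```

Each learner would use a different seed (or none at all) so that the samples differ across the class.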


An example class data set is shown in Table 4-06.2 below.

Table 4-06.2. Example Class Data

Student Number    State of Happiness
1                 7
2                 7
3                 8.5
4                 4
5                 6.5
6                 7
7                 7.5
8                 5.5
9                 7.5
10                9.5
11                6.5
12                6
13                8
14                5
15                4
16                8.5
17                5.5
18                8
19                5
20                6

1. Computing and Displaying Numerical Summaries

Different statistical tools are used for estimating numerical values in a population. For example, when drawing a random sample one can calculate the sample mean (or average), or the median (50th percentile) to obtain an estimate of a measure of the center of the distribution of values pertaining to the entire population. Also, the range, the inter-quartile range (difference from the 25th percentile or first quartile, to the 75th percentile or third quartile) as well as the standard deviation from sample data can be computed to estimate the spread of the values in a population (measures of variation).
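As a rough illustration of these summaries, the sketch below applies Python's standard `statistics` module to the records of students 1 to 10 in Table 4-06.2, which serve as the example sample in this lesson. The variable names are assumptions made for the illustration.

```python
import statistics

# Records of students 1-10 from Table 4-06.2 (the example sample).
sample = [7, 7, 8.5, 4, 6.5, 7, 7.5, 5.5, 7.5, 9.5]

# Measures of center
center = {
    "mean": statistics.mean(sample),
    "median": statistics.median(sample),
}

# Measures of variation
spread = {
    "range": max(sample) - min(sample),
    "stdev": statistics.stdev(sample),  # sample standard deviation
}

print(center)
print(spread)
```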

2. Visualizing the Distribution

A box and whiskers plot is a graphical summary proposed by John Tukey that uses the 5-number summary (minimum, 25th percentile, median, 75th percentile, and maximum) to graphically display the distribution of a data set while highlighting measures of the center (median), other positions (25th and 75th percentiles), and measures of variation (range, interquartile range). This plot also allows us to identify outliers, numbers that are very different from the rest of the data. Some features of the box plot of a sample data set will be used to construct interval estimates for the median of the population.


To construct a box and whiskers plot, learners should compute the 5-number summary of their sample data. Ask learners to order the values in their sample from smallest to largest. Now, learners can readily identify the minimum and maximum values in their sample data and proceed to compute the quartiles. The median or second quartile (Q2) is found by locating the midpoint of the entire ordered sample data set. Since we have an even number of data points in the example used here, there are two middle values, so we find the median by averaging these two values. The 25th percentile or first quartile (Q1) is found by calculating the median of the lower half of the sample data (first five numbers). For the sample data, Q1 is the sole value in the middle position (third data point) of the first five numbers. The 75th percentile or third quartile (Q3) is similarly found by calculating the median of the upper half of the sample data (last five numbers). In this case, the third quartile is in the eighth position.

The steps to draw the box plot using the sample data to construct an interval estimate for a population median can be better described by means of an example. This is done in the succeeding paragraphs using the sample data in Table 4-06.1. In addition, the teacher should construct a box plot for the data of the entire class for a later discussion.

The median state of happiness in the entire class in the example in this lesson plan (Table 4-06.2) is 6.75, while for the student data sample (Table 4-06.1), it is 7.0. Note that the sample median of 7.0 can be used as a point estimate of the population median of 6.75. Point estimates are obtained with the hope that they are close to the population value that they are meant to estimate. Confidence intervals have the extra advantage of providing a sense of uncertainty in the estimation process.

With confidence intervals, we can be quite confident of the accuracy in estimation; i.e., that the exact population value that is being estimated (the population median in this case) is captured or “trapped” by an interval constructed using sample data. To place an interval estimate for the population median using the features of a box plot, start by having each student obtain the 5-number summary of his/her sample data as described above. Notice that the smallest level of happiness for the sample data in Table 4-06.1 is 4.0, while the largest level of happiness is 9.5. The median value (Q2) of 7.0 indicates that about half of the learners in the data set have stated levels of happiness less than or equal to 7, and that about half of the learners have levels of happiness greater than or equal to 7. The first quartile of the student data sample is 6.5 and the third quartile is 7.5. These values for Q1 and Q3 indicate that about 25% of the learners in this sample have levels of happiness less than or equal to 6.5, and about 25% of learners have levels of happiness greater than or equal to 7.5. These values also indicate that about 50% or half of the learners in the sample have levels of happiness between 6.5 and 7.5.
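The quartile rule described above (the median of the lower half and the median of the upper half of the ordered data) can be sketched in a few lines. The function name `five_number_summary` is an assumption, and so is the handling of an odd sample size, where the middle value is left out of both halves; the lesson itself only works with an even sample size.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using medians of halves.

    Assumption: for an odd number of values the middle value is excluded
    from both halves; the lesson only covers the even case.
    """
    data = sorted(values)
    half = len(data) // 2
    lower_half = data[:half]
    upper_half = data[len(data) - half:]
    return (
        data[0],
        statistics.median(lower_half),
        statistics.median(data),
        statistics.median(upper_half),
        data[-1],
    )


# Records of students 1-10 from Table 4-06.2 (the example sample).
sample = [7, 7, 8.5, 4, 6.5, 7, 7.5, 5.5, 7.5, 9.5]
summary = five_number_summary(sample)
print(summary)
```

Note that statistical software often uses other quartile conventions, so its output may differ slightly from the values obtained by hand with this rule.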


To construct a box plot, follow these steps:

1. Mark the values of Q1 = 6.5, Q2 = 7.0, and Q3 = 7.5 on a horizontal scale that spans across all the values in the sample data. Then, construct a box above the scaled line using these values as indicated in Figure 4-06.1.

Figure 4-06.1. Box plot: Step 1 (a box above the scaled line spanning Q1 = 6.5 to Q3 = 7.5, with Q2 = 7.0 marked inside)

2. To find if there are any outliers or extreme values in a data set, compute the inter-quartile range (IQR), which is the difference between the third and first quartiles. Any data point beyond what are called the lower outlier bound, Q1 – 1.5(IQR), or the upper outlier bound, Q3 + 1.5(IQR), is considered to be an outlier. In this case, IQR = 7.5 – 6.5 = 1; therefore any level of happiness smaller than Q1 − 1.5(IQR) = 6.5 − (1.5)(1) = 5, or larger than Q3 + 1.5(IQR) = 7.5 + 1.5(1) = 9.0 is an outlier. There are two outliers in this data set, 4.0 and 9.5. These outliers are indicated by drawing stars above the scaled line at about half the height of the box as shown in Figure 4-06.2 below.

Figure 4-06.2. Box plot: Step 2 (the box from Q1 = 6.5 to Q3 = 7.5 with Q2 = 7.0 marked, and stars above the scaled line at the outliers 4.0 and 9.5)

3. Finally, find the minimum value that is not an outlier and the maximum value that is not an outlier. Here, the minimum value that is not an outlier is 5.5, and the maximum value that is not an outlier is 8.5. Then, add what are called the whiskers to the box by drawing horizontal lines at about half the height of the box, first from Q1 down to the minimum value that is not an outlier, and second from Q3 up to the maximum value that is not an outlier, as indicated in Figure 4-06.3 below. Only when there are no outliers would the whiskers go as far as the minimum and maximum values in the data set. To avoid drawing the whiskers incorrectly, make sure to draw them after the outliers (if any) have been added to the graph.


Figure 4-06.3. Box plot: Step 3 (the box from Q1 = 6.5 to Q3 = 7.5 with Q2 = 7.0 marked, whiskers extending to 5.5 and 8.5, and stars at the outliers 4.0 and 9.5)
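Steps 2 and 3 above can be checked with a short sketch. The function name `box_plot_parts` and the dictionary keys are assumptions made for this illustration; the quartiles are passed in as already computed by hand (Q1 = 6.5, Q3 = 7.5).

```python
def box_plot_parts(values, q1, q3):
    """Return outlier bounds, outliers, and whisker ends for a box plot."""
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    # Points strictly beyond the bounds are treated as outliers.
    outliers = sorted(v for v in values if v < lower_bound or v > upper_bound)
    inside = [v for v in values if lower_bound <= v <= upper_bound]
    return {
        "bounds": (lower_bound, upper_bound),
        "outliers": outliers,
        "whiskers": (min(inside), max(inside)),
    }


# Records of students 1-10 from Table 4-06.2 (the example sample).
sample = [7, 7, 8.5, 4, 6.5, 7, 7.5, 5.5, 7.5, 9.5]
parts = box_plot_parts(sample, q1=6.5, q3=7.5)
print(parts)
```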

Notice that the box and whiskers plot for the student sample data is quite symmetric. The distributions of random sample data tend to reflect the distribution of the population. At this point, you can draw on the board the box plot you obtained for the data of the entire class, and ask learners if their sample data box plots resemble that of the population. There may be a small proportion of learners whose box plot may be quite different from the box plot of the data of the entire class. This is due to random variation in the samples. However, most of the learners should have a box plot that resembles that of the population.

3. Constructing an Interval Estimate

Ask learners to discuss how much their sample median differs from the population median. In the above example, the sample median of 7.0 is off by 0.25 from the population median of 6.75. Learners should note the wide variability in estimation error when using their sample median as an estimate of the population median (compared to the variability in estimation error when estimating the population mean by the sample mean). Now ask learners if they would consider it reasonable to provide an interval estimate that has a high probability of capturing or trapping the exact median of the population. If they could provide an interval that captures or traps the population median by using their own sample data, what would this interval be? One suggestion might be to use the endpoints of the whiskers of their box plots as an interval that has a high probability of trapping the population median. However, learners may also realize that this interval is too wide to help home in on the value of the population median (that is, that this interval has a large margin of error). Then, ask learners whether the shorter interval from Q1 to Q3 (endpoints of the box instead of endpoints of the whiskers) would be more reasonable to estimate the location of the population median.

Now, you can ask learners how confident they are that each time they obtain a random sample of 10 learners and obtain the first and third quartiles of this sample, the interval (Q1, Q3) captures or traps the population median. It would not be surprising to have learners whose intervals (Q1, Q3) did not capture the population median. If so, this would prevent learners from saying that they are 100% confident that each time they take a sample of 10 learners and obtain the first and third quartiles of their sample, the interval (Q1, Q3) will trap the population median. So what is the level of confidence that learners have for capturing the population median with the interval (Q1, Q3) from a random sample of 10 learners? To answer this question, learners can obtain the reliability, or level of confidence, of using (Q1, Q3) from their sample of 10 learners as an interval estimate for the population median. Simply obtain the proportion of learners in class whose interval estimate trapped the population median (class median). For example, if 15 of the 20 learners (75%) in the class obtained an interval (Q1, Q3) that trapped the class median of 6.75, then each time someone takes a sample of 10 learners from the class, we expect about 75% of the intervals (Q1, Q3) to trap the population median level of happiness.

C. Data Analysis

Learners should now have an idea of the advantages of using interval estimates, which, once their level of reliability is known, are called confidence intervals. However, learners may agree that a sample interval (Q1, Q3) is still too wide (that is, the interval has a large margin of error) as a predictor of the location of the median. Ask learners questions pertaining to possible refinements for these confidence intervals, such as the following questions, which will be explored on the second day of this lesson plan:

1. What do you think would happen to the sample interval (Q1, Q3) if the sample size increased from 10 to 15 (or 20)?
2. What do you think would happen to the sample interval (Q1, Q3) if the population distribution is not a symmetric distribution?
3. Do you have any idea how to construct interval estimates that are shorter than the interval (Q1, Q3)? Would a shorter interval necessarily change the level of reliability?
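The classroom estimate of the confidence level, and the first question above, can also be explored by simulation. The sketch below is only an illustration under stated assumptions: the function names, the number of trials, and the seed are invented for the example; samples are drawn without replacement from the class data of Table 4-06.2; quartiles follow the median-of-halves rule used in this lesson; and an interval counts as trapping the median when Q1 ≤ median ≤ Q3.

```python
import random
import statistics

# Example class data from Table 4-06.2 (20 learners).
CLASS_DATA = [7, 7, 8.5, 4, 6.5, 7, 7.5, 5.5, 7.5, 9.5,
              6.5, 6, 8, 5, 4, 8.5, 5.5, 8, 5, 6]


def quartile_interval(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    data = sorted(values)
    half = len(data) // 2
    return (statistics.median(data[:half]),
            statistics.median(data[len(data) - half:]))


def estimated_confidence(population, sample_size, trials=2000, seed=1):
    """Proportion of simulated (Q1, Q3) intervals that trap the median."""
    rng = random.Random(seed)
    target = statistics.median(population)
    hits = 0
    for _ in range(trials):
        q1, q3 = quartile_interval(rng.sample(population, sample_size))
        if q1 <= target <= q3:  # assumed rule: endpoints count as trapping
            hits += 1
    return hits / trials


for n in (10, 15):
    print(n, estimated_confidence(CLASS_DATA, n))
```

The printed proportions depend on the seed and the number of trials, so they should be read as rough estimates of the reliability rather than exact values.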

D. Enrichment

Extend this lesson and activity by increasing the sample size to 15 or 20, and let learners see how increasing the sample size produces tighter intervals (Q1, Q3). Also, they should explore how symmetric distributions produce interval estimates with smaller reliability (lower level of confidence) than non-symmetric distributions.

REFERENCES

Many of the materials in this lesson were adapted from: Parks, S. (California State University Sacramento, Dept. of Mathematics and Statistics), Steinwachs, M. (University of California Davis, iAMSTEM Hub), Diaz, R. (California State University Sacramento, Dept. of Mathematics and Statistics), and Molinaro, M. (University of California Davis). Did I Trap the Median? STatistics Education Web (STEW). Retrieved from https://www.amstat.org/education/stew/pdfs/DidITrapTheMedian.docx

Albert, J. R. G. (2008). Basic Statistics for the Tertiary Level (ed. Roberto Padua, Welfredo Patungan, Nelia Marquez). Philippines: Rex Bookstore.

De Veaux, R. D., Velleman, P. F., and Bock, D. E. (2006). Intro Stats. Pearson Ed. Inc.

Freedman, D., Pisani, R., and Purves, R. (2007). Statistics, Fourth Edition. New York: W. W. Norton & Company.

Workbooks in Statistics 1: 11th Edition. Institute of Statistics, UP Los Baños, College, Laguna 4031.

ASSESSMENT

A class of 25 learners is selected and their exam scores are recorded. A random sample of 10 learners is taken from the class. The data are shown in the tables below.

Class Data Table

Student    Exam Score
1          90
2          101
3          106
4          108
5          125
6          130
7          115
8          91
9          112
10         107
11         76
12         103
13         69
14         94
15         106
16         78
17         121
18         80
19         85
20         80
21         99
22         76
23         92
24         89
25         121

Sample Data Table

Student    Exam Score
1          96
2          101
3          106
4          108
5          125
6          130
7          115
8          93
9          112
10         107

Using the above tables, answer the following questions:

a) Calculate the 5-number summary for the sample data table.
Answer: 5-number summary: minimum = 93, first quartile Q1 = 101, median = 107.5, third quartile Q3 = 115, maximum = 130.

b) Determine the lower and upper outlier bounds. Are there any outliers?
Answer: Q1 = 101, Q3 = 115, IQR = Q3 – Q1 = 115 – 101 = 14; Lower outlier bound: Q1 – 1.5(IQR) = 101 – 1.5(14) = 80; Upper outlier bound: Q3 + 1.5(IQR) = 115 + 1.5(14) = 136; No outliers (no points located beyond the outlier bounds).

c) What are the minimum and maximum values that are not outliers? Note: If there are no outliers below (above) the lower (upper) outlier bound, then the minimum (maximum) value that is not an outlier matches the minimum (maximum) value of the data set.
Answer: Minimum that is not an outlier = Minimum of the sample data (no outliers below the lower outlier bound) = 93. Maximum that is not an outlier = Maximum of the sample data (no outliers above the upper outlier bound) = 130.

d) Construct a box plot of the exam scores for the sample data table.
Answer: See box plot below.

Sample Boxplot (horizontal axis: exam score, with tick marks at 100, 110, 120, and 130)

e) Is the distribution of the data set symmetric or asymmetric?
Answer: The box plot indicates that the distribution of the sample data is asymmetric due to a longer upper whisker, and a larger spread for the values between the third quartile and the median. (In the second day of this lesson plan, learners will learn that when a box plot shows an asymmetry in this direction, the distribution of the data is said to be skewed right or positively skewed.)

f) Compute the population median (median of the entire class of 25 learners).
Answer: The median of the entire class is 99.

g) Does the interval (Q1, Q3) trap the median of the class data?
Answer: No, the interval (Q1, Q3) from the sample box plot does not trap the class median of 99. The class median does not fall between 101 (first quartile) and 115 (third quartile).
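The answers to the assessment can be verified with a short script. This is a sketch only: the variable names are assumptions, and the quartiles are computed with the median-of-halves rule used in this lesson.

```python
import statistics

class_scores = [90, 101, 106, 108, 125, 130, 115, 91, 112, 107, 76, 103, 69,
                94, 106, 78, 121, 80, 85, 80, 99, 76, 92, 89, 121]
sample_scores = [96, 101, 106, 108, 125, 130, 115, 93, 112, 107]

data = sorted(sample_scores)
half = len(data) // 2  # even sample size, so the halves do not overlap

q1 = statistics.median(data[:half])
q2 = statistics.median(data)
q3 = statistics.median(data[half:])
iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_bound or x > upper_bound]

class_median = statistics.median(class_scores)
traps_median = q1 <= class_median <= q3

print("5-number summary:", (data[0], q1, q2, q3, data[-1]))
print("outlier bounds:", (lower_bound, upper_bound), "outliers:", outliers)
print("class median:", class_median, "trapped:", traps_median)
```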

Activity Sheet 4-06

1. Describe the data collection process that will be used.
2. Recall your answer to Activity Sheet 1-01a: On a scale from 1 (very unhappy) to 10 (happiest), how do you feel today? ________
3. Record the state of happiness of 10 randomly chosen learners in your class.

Name    State of Happiness

4. Arrange the values of the state of happiness from smallest to largest.
5. Complete the table below showing numeric summaries for the state of happiness for your ten randomly chosen classmates.

Mean    Minimum    First Quartile (Q1)    Median    Third Quartile (Q3)    Maximum

6. Determine what values would be considered to be outliers for your 10 randomly chosen classmates. Are there any outliers?
7. Construct a horizontal box plot for your 10 randomly chosen classmates. In the event of having outliers for your data set, do not use outliers for the minimum or maximum values. For the minimum and maximum values, plot the minimum value that is not an outlier and the maximum value that is not an outlier.

(Horizontal axis for the box plot: state of happiness)

8. What is the class median state of happiness? Does your Q1 to Q3 interval estimate trap the median for the entire class?
9. Based on the median of the entire class given by your teacher and the median of your particular 10 randomly chosen classmates, calculate what proportion (percent) of box plots trap the median for the entire class. This is the reliability (confidence level) of using interval estimates from Q1 to Q3.
10. Think about what would happen if the sample size is increased. Would the proportion of box plots that would trap the median increase or decrease? Why?

Simulation Worksheet

1. Hypothesize what the answers to the following questions might be and state why.
a) What happens to the width of the confidence intervals when the sample size increases? Do the bounds of the intervals vary more? Why?
b) What happens to the level of confidence (reliability or percentage of sample intervals that trap the population median) of the interval estimate when the sample size increases? Why?
c) What happens to the width of the interval estimate when the population distribution shape changes? Do the bounds of the intervals vary more? Why?
d) What happens to the level of confidence (reliability or percentage of sample intervals that trap the population median) when the population distribution shape changes? Why?


CHAPTER 5: TESTS OF HYPOTHESIS
Lesson 1: Basic Concepts in Hypothesis Testing

TIME FRAME: 60 minutes

LEARNING COMPETENCIES
At the end of the lesson, the learners should be able to:
• Illustrate a statistical hypothesis
• Differentiate a null hypothesis from an alternative hypothesis
• Differentiate Type I from Type II error
• Illustrate consequences of committing errors

LESSON OUTLINE
a. Definition of statistical hypothesis
b. The difference of null hypothesis from alternative hypothesis
c. Consequences of making a decision
d. Two possible errors that could be committed in a test of hypothesis

DEVELOPMENT OF THE LESSON

Recall that statistical inference is concerned with either estimation or evaluation of a statement or claim about a parameter or a distribution. The latter is the focus of this chapter. Evaluation of a claim about a parameter or a distribution is done through a statistical test of hypothesis.

As a motivational activity, ask learners to react to the following government pronouncement about the El Niño phenomenon: “The country will experience El Niño phenomenon in the next few months.” Describe the El Niño phenomenon and its possible consequences further. Write learners’ reactions on the board. Their reactions may include the following:
1. The occurrence of El Niño phenomenon is not sure.
2. There is a possibility that El Niño phenomenon may not occur.
3. The effects of El Niño phenomenon are devastating to the country.
4. Some of the consequences of the El Niño phenomenon are tolerable while other consequences are not.
5. The validity of the statement could be tested based on some empirical facts.

Discuss the results of this activity with learners, with emphasis on the following points:

• The pronouncement is a claim that may be true or false. Such a claim could be referred to as a statistical hypothesis. A statistical hypothesis is a claim or a conjecture that may either be true or false. The claim is usually expressed in terms of the value of a parameter or the distribution of the population values.








• There are two possible actions that one can take with the statement. These actions are either to accept the statement or to reject it. These actions are brought about by a decision on whether the statement is true or false. Some of the learners may believe that the statement is true, hence they accept the pronouncement. Others may think that the statement is false, hence they reject the claim.

• The actions we take have consequences. Possible consequences of accepting that the statement is true include: (a) increase the importation of rice in anticipation of supply shortage; (b) buy materials for water storage; (c) use drought-resistant varieties of rice; (d) invest in programs to make Filipinos ready; and the like. On the other hand, when the statement is rejected because we think it is false, possible consequences are: (a) we are not prepared for rice and water shortage; (b) farmers experience great loss on production; or (c) we do not do anything.

• Some of the consequences are tolerable while other consequences are severe. Experiencing a few days of water shortage is tolerable but having rice shortage for a month or two is unbearable. The degree of the possible consequence is the basis in making the decision. If the consequences of accepting the claim that El Niño phenomenon is going to happen are tolerable, then we may not reject the pronouncement. However, if the consequences are severe, then we reject the claim.

Consider another statement or claim, but this time regarding a parameter. Consider the average number of text messages that a Grade 11 student sends in a day. The statement could be stated as follows: “The average daily number of text messages that a Grade 11 student sends is equal to 100.” As discussed earlier, this statement can either be true or false. Hence, one can accept or reject this statement. The validity of this statement can be assessed through a series of steps known as a test of hypothesis. A test of hypothesis is a procedure based on a random sample of observations with a given level of probability of committing an error in making the decision, whether the hypothesis is true or false. In hypothesis testing, we first formulate the hypotheses to be tested. In the formulation of the hypotheses, we take note of the following:

• There are two kinds of statistical hypothesis: the null and the alternative hypothesis. A null hypothesis is the statement or claim or conjecture to be tested, while an alternative hypothesis is the claim that is accepted in case the null hypothesis is rejected. The symbol “Ho” is used to represent a null hypothesis while “Ha” is used to represent an alternative hypothesis. The statement “The average daily number of text messages that a Grade 11 student sends is equal to 100.” is considered a null hypothesis. In the event that we reject this claim, we can accept another statement which states otherwise, that is, “The average daily number of text messages that a Grade 11 student sends is not equal to 100.” This statement is our alternative hypothesis.




In formulating the hypotheses, we can use the following guidelines:

1. A null hypothesis is generally a statement of no change. Thus, a statement of equality, or one which involves the equality, is usually considered in the null hypothesis. Possible forms of the null hypothesis include (a) equality; (b) less than or equal; and (c) greater than or equal.

2. The statistical hypothesis is about a parameter or distribution of the population values. For example, the parameter in the statement is the average daily number of text messages that a Grade 11 student sends. Usually, the parameter is represented by a symbol; for the population mean, we use µ. Hence, the null and alternative hypotheses could be stated using symbols as “Ho: µ = 100 against Ha: µ ≠ 100.”

3. The null and alternative hypotheses are complementary and must not overlap. The usual pairs are as follows:
(a) Ho: Parameter = Value versus Ha: Parameter ≠ Value;
(b) Ho: Parameter = Value versus Ha: Parameter < Value;
(c) Ho: Parameter = Value versus Ha: Parameter > Value;
(d) Ho: Parameter ≤ Value versus Ha: Parameter > Value; and
(e) Ho: Parameter ≥ Value versus Ha: Parameter < Value.



As discussed earlier, there are two actions that one can make on the hypothesis. One can either reject or fail to reject (accept) a hypothesis. The table below shows these actions:

Action                                    Hypothesis is TRUE     Hypothesis is FALSE
Reject the hypothesis                     Error Committed        No Error Committed
Fail to reject (Accept) the hypothesis    No Error Committed     Error Committed

The table shows that there are no errors committed when we reject a false hypothesis and when we fail to reject a true hypothesis. On the other hand, an error is committed when we reject a true hypothesis, and such an error is called a Type I error. Also, when we fail to reject (accept) a false hypothesis, we are committing a Type II error. As mentioned earlier, for every action that one takes, there are consequences. When we commit an error, there are consequences, too. Since it is an error in decision making, the consequences may be tolerable or too severe, severe enough to cost lives. In Statistics, we measure the chance of committing the error so we will have a basis in making a decision.
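The four cells of the table can be restated as a small function, which learners may find helpful when labeling the outcomes in the assessment problems. The function name `classify_decision` and its argument names are assumptions made for this illustration.

```python
def classify_decision(reject_h0: bool, h0_is_true: bool) -> str:
    """Label the outcome of a decision about the null hypothesis."""
    if reject_h0 and h0_is_true:
        return "Type I error"   # rejected a true hypothesis
    if not reject_h0 and not h0_is_true:
        return "Type II error"  # failed to reject a false hypothesis
    return "No error"


# Print all four combinations of decision and truth.
for reject in (True, False):
    for truth in (True, False):
        print(f"reject={reject}, Ho true={truth}: {classify_decision(reject, truth)}")
```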

ASSESSMENT

As an assessment, choose one of the following problems and ask learners to formulate the appropriate null and alternative hypotheses. You can also ask them to identify situations where Type I and Type II errors are committed. Have them state the possible consequences of these errors.


1. A manufacturer of IT gadgets recently announced they had developed a new battery for a tablet and claimed that it has an average life of at least 24 hours. Would you buy this battery?
Answer: The null hypothesis can be stated as Ho: The average life of the newly developed battery for a tablet is at least 24 hours, while the alternative hypothesis is Ha: The average life of the newly developed battery for a tablet is less than 24 hours. A Type I error is committed when you do not buy the battery (you reject Ho) even though its average life is in fact at least 24 hours; a possible consequence is that you lose the opportunity to have a battery that could last for at least 24 hours. On the other hand, a Type II error is committed when you buy the battery (you fail to reject Ho) and find out later that the battery’s life is less than 24 hours. A possible consequence of this Type II error is that you wasted your money in buying the battery.

2. A teenager who wants to lose weight is contemplating following a diet she read about on Facebook. She wants to adopt it but, unfortunately, following the diet requires buying nutritious, low-calorie yet expensive food. Help her decide.
Answer: The null hypothesis can be stated as Ho: The diet will not result in a change in her weight, while the alternative hypothesis is Ha: The diet will induce a reduction in her weight. A Type I error is committed when the teenager follows the diet (she rejects Ho) although the diet does not in fact change her weight; a possible consequence is that she spent unnecessarily on a diet that did not help her reduce weight. On the other hand, a Type II error is committed when the teenager does not follow the diet (she fails to reject Ho) although the diet would in fact reduce her weight. A possible consequence of this Type II error is that the teenager lost the opportunity to attain her goal of weight reduction.

3. Alden is exclusively dating Maine. He remembers that on their first date, Maine told him that her birthday was this month. However, he forgot the exact date. Ashamed to admit that he did not remember, he decides to use hypothesis testing to make an educated guess that today is Maine’s birthday. Help Alden do it.
Answer: The null hypothesis can be stated as Ho: Today is Maine’s birthday, while the alternative hypothesis is Ha: Maine’s birthday is on another day and not today. A Type I error is committed when Alden concludes that today is not Maine’s birthday when in fact it is; a possible consequence is that Alden fails to greet Maine or give her a birthday gift today. On the other hand, a Type II error is committed when Alden concludes that today is Maine’s birthday when in fact it is on another day. A possible consequence of this Type II error is that Alden makes the mistake of greeting Maine a happy birthday on the wrong day.

4. After senior high school, Lilifut is pondering whether or not to pursue a degree in Statistics. She was told that if she graduates with a degree in Statistics, a life of fulfillment and happiness awaits her. Assist her in making a decision.
Answer: The null hypothesis can be stated as Ho: A life of fulfillment and happiness awaits her after obtaining a degree in Statistics, while the alternative hypothesis is Ha: A life of fulfillment and happiness does not happen after obtaining a degree in Statistics. A Type I error is committed when Lilifut does not pursue a degree in Statistics (she rejects Ho) although a life of fulfillment and happiness would in fact have awaited her; a possible consequence is that she misses the promised life of fulfillment and happiness. On the other hand, a Type II error is committed when Lilifut decides to obtain a degree in Statistics (she fails to reject Ho) although a life of fulfillment and happiness does not in fact follow. A possible consequence of this Type II error is that Lilifut will not experience a life of fulfillment and happiness after obtaining a degree in Statistics.

5. An airline company regularly does quality control checks on airplanes. Tire inspection is included since tires are sensitive to the heat produced when the airplane passes through the airport’s runway. The company, since its operation, uses a particular type of tire which is guaranteed to perform even at a maximum surface temperature of 107 °C. However, the tires cannot be used and need to be replaced when surface temperature exceeds a mean of 107 °C. Help the company decide whether or not to do a complete tire replacement.
Answer: The null hypothesis can be stated as Ho: The mean surface temperature of the tires is at most 107 °C, while the alternative hypothesis is Ha: The mean surface temperature of the tires is greater than 107 °C. A Type I error is committed when the airline company orders a tire replacement when in fact it is not needed. A possible consequence of this is that the company will waste money in replacing the tires. On the other hand, a Type II error is committed when the airline company does not order a tire replacement when in fact it is needed. A possible consequence of this Type II error is an accident that may happen because of non-replacement of the tires.

Multiple Choice

1. Which of the following would be an appropriate null hypothesis?
a) The mean of a population is equal to 50.
b) The mean of a sample is equal to 50.
c) The mean of a population is greater than 50.
d) Only (a) and (c) are true.
ANSWER: A

2. Which of the following would be an appropriate null hypothesis?
a) The population proportion is less than 0.45.
b) The sample proportion is less than 0.45.
c) The population proportion is no less than 0.45.
d) The sample proportion is no less than 0.45.
ANSWER: C

3. Which of the following would be an appropriate alternative hypothesis?
a) The mean of a population is equal to 50.
b) The mean of a sample is equal to 50.
c) The mean of a population is greater than 50.
d) The mean of a sample is greater than 50.
ANSWER: C

4. Which of the following would be an appropriate alternative hypothesis?
a) The population proportion is less than 0.45.
b) The sample proportion is less than 0.45.
c) The population proportion is no less than 0.45.
d) The sample proportion is no less than 0.45.
ANSWER: A

5. A Type II error is committed when
a) we reject a null hypothesis that is true.
b) we don't reject a null hypothesis that is true.
c) we reject a null hypothesis that is false.
d) we don't reject a null hypothesis that is false.
ANSWER: D

6. A Type I error is committed when
a) we reject a null hypothesis that is true.
b) we don't reject a null hypothesis that is true.
c) we reject a null hypothesis that is false.
d) we don't reject a null hypothesis that is false.
ANSWER: A

7. Suppose we wish to test H0: µ ≤ 47 versus H1: µ > 47. What will result if we conclude that the mean is greater than 47 when its true value is really 52?
a) We have made a Type I error.
b) We have made a Type II error.
c) We have made a correct decision.
d) None of the above is correct.
ANSWER: C

8. If, as a result of a hypothesis test, we reject the null hypothesis when it is false, then we have committed
a) a Type II error.
b) a Type I error.
c) no error.
d) an acceptance error.
ANSWER: C

9. The owner of a local restaurant has recently surveyed a random sample of n = 250 customers of the restaurant. She would now like to determine whether or not the mean age of her customers is over 30. If so, she plans to provide background music to appeal to an older crowd. If not, no changes would be made to the background music in the restaurant. The appropriate hypotheses to test are:
a) H0: µ ≥ 30 versus H1: µ < 30.
b) H0: µ ≤ 30 versus H1: µ > 30.
c) H0: X̄ ≥ 30 versus H1: X̄ < 30.
d) H0: X̄ ≤ 30 versus H1: X̄ > 30.
ANSWER: B

10. A major telco is considering opening a new telecom center in an area that currently does not have any such centers. The telco will open the center if there is evidence that more than 5,000 of the 20,000 households in the area use the telco. It conducts a poll of 300 randomly selected households in the area and finds that 96 subscribe to the telco. State the test of interest to the telco.
a) H0: p ≤ 0.32 versus H1: p > 0.32
b) H0: p ≤ 0.25 versus H1: p > 0.25
c) H0: p ≤ 5,000 versus H1: p > 5,000
d) H0: µ ≤ 5,000 versus H1: µ > 5,000
ANSWER: B


CHAPTER 5: TESTS OF HYPOTHESIS
Lesson 2: Steps in Hypothesis Testing

TIME FRAME: 60 minutes

LEARNING COMPETENCIES
At the end of the lesson, the learners should be able to:
• Identify the steps in hypothesis testing
• Illustrate level of significance and corresponding rejection region
• Calculate the probabilities of committing an error in a test of hypothesis

LESSON OUTLINE
A. Introduce the steps in hypothesis testing procedure
B. Define level of significance and its role in hypothesis testing
C. Illustrate the corresponding rejection region based on a given level of significance
D. Compute the probabilities of committing an error in a test of hypothesis

DEVELOPMENT OF THE LESSON
A test of hypothesis is a series of steps that starts with the formulation of the null and the alternative hypotheses and ends with stating the conclusion. Each step has several components to consider. The procedure parallels court proceedings, which can be used as a motivational activity.

As a motivational activity, ask learners how a court trial proceeds, based on their knowledge. Guide them by citing a popular case and letting them identify the steps needed to come up with a verdict. For example, take the ill-gotten wealth case of former President Marcos. List the steps that the learners identify. They may mention the following:
1. State the accusation against the family of former President Marcos.
2. Choose the jury. Set or review the guidelines to be used in the decision-making process.
3. Present the evidence.
4. Decide on the matter, based on the evidence.
5. State the verdict, based on the decision made.

Discuss the results of this activity with learners, emphasizing that the steps in a court proceeding are similar to those followed in a test of hypothesis.

• As in a court proceeding, the first step is to state the accusation or the statement that will be evaluated as true or false. In parallel, in hypothesis testing we first formulate the hypotheses to be tested. Remember that we do not know the true state of nature of the hypothesis, that is, whether the hypothesis is true or false. Likewise, in a court trial, we do not know whether the accused is guilty or not.





As discussed in the previous lesson, there are two types of hypotheses to state: the null and the alternative. In a court proceeding, we can state the null hypothesis as Ho: The accused is not guilty, while the alternative is stated as Ha: The accused is guilty.

The second step is to state the decision rule that we will follow in deciding whether to reject or fail to reject (accept) the null hypothesis. In a court proceeding, this is the guideline that the court uses to evaluate the quantity and quality of the evidence to be presented. Based on this guideline, the court decides whether to reject or accept the hypothesis that the accused is not guilty. To specify the decision rule in a hypothesis testing procedure, we need to specify the components of the rule. These components include the following:
1. We specify a level of significance, usually denoted as α. It is the same α that we encountered in the discussion of the (1-α)100% confidence interval estimate. The level of significance is the probability of rejecting a true null hypothesis, that is, of committing a Type I error in the test of hypothesis. Since it is the probability of committing an error, it lies between 0 and 1 and is usually set to a small value.
2. We identify the test statistic to use in the decision rule. Usually, the test statistic is a standardized expression of the point estimator of the parameter identified in the hypothesis. The distribution of the test statistic also needs to be specified.
3. Part of the decision rule is the specification of the rejection region. The rejection region is the part of the distribution of the test statistic where we reject the null hypothesis.

An example of a decision rule is stated as follows: “At a given α = 0.05, we reject Ho if the computed test statistic (denoted as tc) is greater than a tabular value of the t distribution with n-1 degrees of freedom. Otherwise, we fail to reject Ho.” In this decision rule, the level of significance is set at α = 0.05 and the test statistic, denoted by tc, is assumed to follow the Student’s t-distribution with n-1 degrees of freedom. The rejection region is the area to the right of the tabular value obtained from the Student’s t-distribution with n-1 degrees of freedom. This rejection region is illustrated in the following figure.

[Figure: Student’s t-distribution with n-1 degrees of freedom; the rejection region is the right tail beyond the tabular value ttab.]

The third step is to compute the value of the test statistic using a random sample of observations gathered or collected for the purpose of the test of hypothesis. In a court proceeding, this is the time when the gathered evidence is presented.

With the computed value of the test statistic, the next step is to use the decision rule to make a decision whether to reject or fail to reject (accept) the null hypothesis. In a court proceeding, the jury or the court decides whether the accused is guilty or not based on the evidence presented.

Lastly, as a consequence of the decision, conclusions are made in relation to the purpose of the test of hypothesis. In a court proceeding, this is the time when the court gives its verdict on the accused. In both scenarios, this last step is the most awaited part of the procedure.
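The decision step described above can be sketched in a few lines of Python. This is only an illustration: the function name `decide` and its `tail` argument are hypothetical and not part of the guide, and the tabular (critical) value is assumed to be looked up from a table and passed in as a positive number.

```python
def decide(test_statistic: float, critical_value: float, tail: str) -> str:
    """Apply a rejection-region decision rule.

    critical_value is the positive tabular value (for example, 1.645).
    tail is "right", "left", or "two", matching where the rejection region lies.
    """
    if critical_value <= 0:
        raise ValueError("critical_value must be the positive tabular value")
    if tail == "right":
        reject = test_statistic > critical_value
    elif tail == "left":
        reject = test_statistic < -critical_value
    elif tail == "two":
        reject = abs(test_statistic) > critical_value
    else:
        raise ValueError("tail must be 'right', 'left', or 'two'")
    return "reject Ho" if reject else "fail to reject Ho"


# A right-tail rule with tabular value 1.645 and a computed statistic of 2.0
print(decide(2.0, 1.645, "right"))
```

Keeping the tabular value as an input mirrors the procedure in the lesson, where the decision rule is stated first and the test statistic is computed afterwards.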

At this point, we can summarize the steps as follows:
1. Formulate the null and alternative hypotheses.
2. Identify the test statistic to use. With the given level of significance and the distribution of the test statistic, state the decision rule and specify the rejection region.
3. Using a simple random sample of observations, compute the value of the test statistic.
4. Make a decision whether to reject or fail to reject (accept) Ho.
5. State the conclusion.

Note that in drawing conclusions based on the test of hypothesis, we are not 100% sure of our decisions or of our conclusions. In a court proceeding, when the court declares the accused guilty, the decision is not made with certainty. In other words, the court is not 100% sure that the accused is guilty. There is still the possibility that the accused is not guilty and the court is making an error with its decision. Recall that there are two types of errors that one can commit in decision making: Type I and Type II errors. A Type I error is rejecting a hypothesis when in fact the hypothesis is true. Thus, the court commits a Type I error when it declares that the accused is guilty when in fact the accused is not guilty of the crime. In other words, the court has convicted an innocent person. On the other hand, a Type II error is committed when a false hypothesis is accepted, that is, the court frees a person guilty of the crime.

Note: If there is still time, let learners discuss the consequences of committing the two types of error and determine which of the two, (1) convicting an innocent person or (2) freeing a person guilty of the crime, has the greater consequence.

However, unlike in a court proceeding, statistics allows us to compute the probability of committing an error in decision making. The probability of committing a Type I error was defined earlier as the level of significance and is denoted by α.
On the other hand, the probability of committing a Type II error is usually denoted by β. In statistics, decisions on the test of hypothesis are made with given probabilities of Type I and Type II errors. We have the assurance that the test procedures that we use in statistics were formulated with minimum probabilities of committing an error.

The probability of committing an error is a conditional probability. It is the probability of making a decision given the uncertain true state of nature of the hypothesis being tested. You may use the numerical example provided in the previous lesson in this chapter to illustrate the computation of the probabilities of committing Type I and Type II errors.

Numerical Example: We test the null hypothesis “The average daily number of text messages that a Grade 11 student sends is equal to 100” against the alternative hypothesis “The average daily number of text messages that a Grade 11 student sends is greater than 100.” A random sample of 16 students was selected and interviewed, and the daily number of text messages each student sends was obtained. The null hypothesis is rejected if the sample mean is at least 102; otherwise, the null hypothesis is accepted, or we fail to reject Ho. It is assumed that the number of text messages that a Grade 11 student sends in a day follows a normal distribution with standard deviation equal to 5 text messages. Computing the probability of committing a Type I error, we have

$$\alpha = P(\bar{X} \ge 102 \mid \mu = 100) = P\left(Z \ge \frac{102 - 100}{5/\sqrt{16}}\right) = P(Z \ge 1.60) = 0.0548$$

Thus, we say the probability of rejecting a true null hypothesis is 0.0548, or we say that, on the average, we are assured with 94.52% (1 - 0.0548 = 0.9452) confidence that we are making a correct decision in accepting a true null hypothesis.

The alternative hypothesis is stated as “The average daily number of text messages that a Grade 11 student sends is greater than 100.” If we assume that the number of text messages that a Grade 11 student sends in a day actually follows a normal distribution with a mean of 103 and a standard deviation equal to 5 text messages, then the computed probability of a Type II error is

$$\beta = P(\bar{X} < 102 \mid \mu = 103) = P\left(Z < \frac{102 - 103}{5/\sqrt{16}}\right) = P(Z < -0.80) = 0.2119$$

In this case, the probability of accepting a false null hypothesis, that is, of accepting Ho given that the average number of text messages that a Grade 11 student sends in a day is indeed 103 (greater than 100), is 0.2119.
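The two error probabilities in the numerical example can be checked with a short script. This is a minimal sketch that assumes Python's standard-library `statistics.NormalDist`; the variable names are illustrative and not part of the guide.

```python
from math import sqrt
from statistics import NormalDist

n = 16           # sample size
sigma = 5.0      # population standard deviation (text messages)
cutoff = 102.0   # reject Ho when the sample mean is at least this value
mu_null = 100.0  # mean under Ho
mu_true = 103.0  # assumed true mean used for the Type II error

# Standard error of the sample mean: 5 / sqrt(16) = 1.25
se = sigma / sqrt(n)

# Type I error: P(sample mean >= 102 | mu = 100)
alpha = 1 - NormalDist(mu_null, se).cdf(cutoff)

# Type II error: P(sample mean < 102 | mu = 103)
beta = NormalDist(mu_true, se).cdf(cutoff)

print(round(alpha, 4), round(beta, 4))  # 0.0548 0.2119
```

Modelling the sample mean directly as a normal variable with standard deviation σ/√n avoids a separate standardization step, but it gives the same areas as the Z-table.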

ASSESSMENT


As an assessment, consider the following problems and ask learners to do what is asked.

1. The Graduate Record Exam (GRE) is a standardized test required for admission to many graduate schools in the United States. A high score in the GRE makes admission more likely. According to the Educational Testing Service, the mean score for takers of the GRE who do not have training courses is 555 with a standard deviation of 139. Brain Philippines (BP) offers expensive GRE training courses, claiming that its graduates score better than those who have not taken any training courses. To test the company’s claim, a statistician randomly selected 30 graduates of BP and asked for their GRE scores.

a. Formulate the appropriate null and alternative hypotheses.
Answer: Ho: Graduates of BP courses did not score better than 555, while Ha: Graduates of BP courses did score better than 555.

b. Identify situations when Type I and Type II errors are committed and state their possible consequences.
Answer: A Type I error is committed when we declare that the company’s claim is true when in fact BP graduates do not perform better than 555. A possible consequence is that the tuition fee paid for the training is wasted. On the other hand, a Type II error is committed when we declare that BP’s claim is false when in fact BP graduates do score better than 555. A possible consequence is that the opportunity to score better than 555 is lost.

c. Suppose the decision rule is “Reject Ho if the mean score of the sampled BP graduates is greater than 590; otherwise, fail to reject Ho.” Compute the level of significance for this test. Also, find the risk of concluding that the BP graduates did not score better than 555 when in fact the mean score is 600.
Answer: The probability of Type I error is the same as the level of significance denoted by α:

$$\alpha = P(\bar{X} > 590 \mid \mu = 555) = P\left(Z > \frac{590 - 555}{139/\sqrt{30}}\right) = P(Z > 1.38) = 0.0838$$

On the other hand, the risk of concluding that the BP graduates did not score better than 555 when in fact their mean score is 600 is the probability of committing a Type II error, and this risk or probability is computed as follows:

$$\beta = P(\bar{X} \le 590 \mid \mu = 600) = P\left(Z \le \frac{590 - 600}{139/\sqrt{30}}\right) = P(Z \le -0.39) = 0.3483$$
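The two probabilities asked for in part (c) can be checked numerically. The sketch below assumes Python's standard-library `statistics.NormalDist` and uses illustrative variable names. Because it does not round z to two decimal places, its results can differ in the third decimal place from values read off a Z-table.

```python
from math import sqrt
from statistics import NormalDist

n = 30           # number of sampled BP graduates
sigma = 139.0    # standard deviation of GRE scores
cutoff = 590.0   # reject Ho when the sample mean exceeds this value
mu_null = 555.0  # mean score without training (Ho)
mu_true = 600.0  # assumed true mean for the Type II error

se = sigma / sqrt(n)  # standard error of the sample mean

# Level of significance: P(sample mean > 590 | mu = 555)
alpha = 1 - NormalDist(mu_null, se).cdf(cutoff)

# Type II error: P(sample mean <= 590 | mu = 600)
beta = NormalDist(mu_true, se).cdf(cutoff)

print(f"alpha = {alpha:.4f}, beta = {beta:.4f}")
```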


2. Consider a manufacturing process that is known to produce bulbs that have life lengths with a standard deviation of 75 days. A potential customer will purchase bulbs from the company that manufactures them if she is convinced that the average life of the bulbs is at least 1550 days.

a. Formulate the appropriate null and alternative hypotheses.
Answer: Ho: The average life of the bulbs is at least 1550 days, against Ha: The average life of the bulbs is less than 1550 days.

b. Identify situations when Type I and Type II errors are committed and state their possible consequences.
Answer: A Type I error is committed when we declare that the average life is less than 1550 days when in fact the average life is 1550 days or more. On the other hand, a Type II error is committed when we declare that the average is at least 1550 days when in fact it is less than 1550 days.

c. Suppose the decision rule is “Reject Ho if a random sample of 50 bulbs has a mean life of less than 1532 days; otherwise, fail to reject Ho.” Compute the level of significance for this test. Also, find the risk of concluding that the average is at least 1550 days when in fact the mean life is 1500 days.
Answer: The probability of Type I error is the same as the level of significance denoted by α:

$$\alpha = P(\bar{X} < 1532 \mid \mu = 1550) = P\left(Z < \frac{1532 - 1550}{75/\sqrt{50}}\right) = P(Z < -1.70) = 0.0446$$

On the other hand, the risk of concluding that the mean is at least 1550 days, when in fact it is 1500 days, is:

$$\beta = P(\bar{X} \ge 1532 \mid \mu = 1500) = P\left(Z \ge \frac{1532 - 1500}{75/\sqrt{50}}\right) = P(Z \ge 3.02) = 0.0013$$
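For the bulb problem the rejection region is in the left tail, so the roles of the two tail areas are swapped relative to a right-tail test. The sketch below assumes Python's standard-library `statistics.NormalDist` and a true mean of exactly 1500 days for the Type II error; the variable names are illustrative. Results may differ slightly from Z-table lookups because z is not rounded.

```python
from math import sqrt
from statistics import NormalDist

n = 50            # number of sampled bulbs
sigma = 75.0      # standard deviation of bulb life (days)
cutoff = 1532.0   # reject Ho when the sample mean is below this value
mu_null = 1550.0  # boundary value of the mean under Ho
mu_true = 1500.0  # assumed true mean for the Type II error

se = sigma / sqrt(n)  # standard error of the sample mean

# Level of significance: P(sample mean < 1532 | mu = 1550)
alpha = NormalDist(mu_null, se).cdf(cutoff)

# Type II error: P(sample mean >= 1532 | mu = 1500)
beta = 1 - NormalDist(mu_true, se).cdf(cutoff)

print(f"alpha = {alpha:.4f}, beta = {beta:.4f}")
```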


CHAPTER 5: TESTS OF HYPOTHESIS
Lesson 3: Test on Population Mean (Part 1)

TIME FRAME: 60 minutes

LEARNING COMPETENCIES
At the end of the lesson, the learners should be able to:
• Formulate appropriate null and alternative hypotheses on the population mean
• Identify the appropriate form of the test statistic on the population mean when the population variance is assumed to be known
• Identify the appropriate rejection region for a given level of significance when the population variance is assumed to be known
• Conduct the test of hypothesis on population mean when the population variance is assumed to be known

LESSON OUTLINE
A. Introduction of the possible null and alternative hypotheses on population mean
B. Steps in hypothesis testing on population mean when the population variance is assumed to be known
C. Illustration of the test of hypothesis on the population mean when the population variance is assumed to be known

DEVELOPMENT OF THE LESSON
As a review, recall the steps of the hypothesis testing procedure discussed in the previous lesson. You may give it as a quiz or in the form of recitation, or write it on the board to serve as a guide in the development of the lesson. The following are the steps identified in the previous lesson:
1. Formulate the null and alternative hypotheses.
2. Identify the test statistic to use. With the given level of significance and the distribution of the test statistic, state the decision rule and specify the rejection region.
3. Using a simple random sample of observations, compute the value of the test statistic.
4. Make a decision whether to reject or fail to reject Ho.
5. State the conclusion.

As a motivational activity, you may use the first step of the procedure, which is to formulate the appropriate null and alternative hypotheses. This way, you can also review with the learners how to formulate the null and alternative hypotheses by asking them to do it for each of the following real-life problems. Through this, learners will be able to identify the population parameter of interest in the problem.


Here are some real-life problem situations that you can use:

1. The father of a senior high school student lists down the expenses he will incur when he sends his daughter to the university. At the university where he wants his daughter to study, he hears that the average tuition fee is at least Php20,000 per semester. He wants to do a test of hypothesis. In this problem, the parameter of interest is the average tuition fee, or the true population mean of the tuition fee. In symbols, this parameter is denoted as µ. As applied to the problem, the appropriate null and alternative hypotheses are:
Ho: The average tuition fee in the targeted university is at least Php20,000. In symbols, Ho: µ ≥ Php20,000.
Ha: The average tuition fee in the targeted university is less than Php20,000. In symbols, Ha: µ < Php20,000.

2. The principal of an elementary school believes that this year more students from the school will pass the National Achievement Test (NAT), so that the proportion of students who pass the NAT is greater than the proportion obtained in the previous year, which is 0.75. What are the appropriate null and alternative hypotheses to test this belief? In this problem, the parameter of interest is the proportion of students of the school who passed the NAT this year. In symbols, this parameter is denoted as P. As applied to the problem, the appropriate null and alternative hypotheses are:
Ho: The proportion of students of the school who passed the NAT this year is equal to 0.75. In symbols, Ho: P = 0.75.
Ha: The proportion of students of the school who passed the NAT this year is greater than 0.75. In symbols, Ha: P > 0.75.

Discuss the results of this activity with learners, pointing out the following:
• A statistical hypothesis is a statement about a parameter and deals with evaluating the value of the parameter.
• The null and alternative hypotheses should be complementary and non-overlapping.
• Generally, the null hypothesis is a statement of equality or includes the equality condition, as in the case of ‘at least’ (greater than or equal) or ‘at most’ (less than or equal).

Choose the first problem, the one on the average tuition fee, to further develop the lesson. In the problem, the parameter is the population mean. To identify the test statistic, which is part of the second step, certain assumptions have to be made.

• With the assumption of known population variance (σ²), and with the variable of interest measured at least in the interval scale and following the normal distribution, the appropriate test statistic, denoted as ZC, is computed as

$$Z_C = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$$

where $\bar{X}$ is the sample mean computed from a simple random sample of n observations; µ0 is the hypothesized value of the parameter; and σ is the population standard deviation. The test statistic follows the standard normal distribution, which means the tabular value in the Z-table will be used as the critical or tabular value. With this, the decision rule can be one of the following possibilities:
1. Reject the null hypothesis (Ho) if ZC < -Zα. Otherwise, we fail to reject Ho.
2. Reject the null hypothesis (Ho) if ZC > Zα. Otherwise, we fail to reject Ho.
3. Reject the null hypothesis (Ho) if |ZC| > Zα/2. Otherwise, we fail to reject Ho.
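The tabular values used in the three decision rules can also be obtained from the inverse of the standard normal distribution function instead of a printed Z-table. The following is a minimal sketch assuming Python's standard-library `statistics.NormalDist`; the helper name `z_critical` is hypothetical.

```python
from statistics import NormalDist


def z_critical(alpha: float, two_tailed: bool = False) -> float:
    """Return the positive tabular value Z_alpha, or Z_(alpha/2) if two_tailed."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must be strictly between 0 and 1")
    tail_area = alpha / 2 if two_tailed else alpha
    # The value with area `tail_area` to its right under the standard normal curve
    return NormalDist().inv_cdf(1 - tail_area)


alpha = 0.05
z_one = z_critical(alpha)                   # used in rules 1 and 2
z_two = z_critical(alpha, two_tailed=True)  # used in rule 3

print(f"Rule 1: reject Ho if Z_C < {-z_one:.3f}")
print(f"Rule 2: reject Ho if Z_C > {z_one:.3f}")
print(f"Rule 3: reject Ho if |Z_C| > {z_two:.3f}")
```

At α = 0.05 this gives 1.645 for the one-tail rules and 1.960 for the two-tail rule, matching the usual Z-table entries.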


For the problem, the first is the appropriate decision rule. Suppose the level of significance (α) is set at 0.05; then the decision rule for the problem can be stated as “Reject Ho if ZC < -Z0.05 = -1.645. Otherwise, we fail to reject Ho.” Note that this test procedure is referred to as the “one-tail Z-test for the population mean when the population variance is known,” and the rejection region is illustrated as follows:

[Figure: standard normal distribution with the rejection region in the left tail, to the left of -Zα = -1.645.]

The third step is to compute the value of the test statistic using a random sample of observations gathered or collected for the purpose of the test of hypothesis. Suppose that from a simple random sample of 16 students, a sample mean of Php19,750 was obtained. Further, the variable of interest, which is the tuition fee in the university, is said to be normally distributed with an assumed population variance equal to 160,000 (that is, σ = Php400). Hence, the computed test statistic is

$$Z_C = \frac{19{,}750 - 20{,}000}{400/\sqrt{16}} = \frac{-250}{100} = -2.50$$

With the computed value of the test statistic equal to -2.50, which is less than -1.645, the next step is to use the decision rule to make a decision, and this is to reject Ho. Lastly, as a consequence of the decision, conclusions are made in relation to the purpose of the test of hypothesis. With the rejection of the null hypothesis, the father can then say that the average tuition fee in the university where he wants his daughter to study is less than Php20,000.
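The whole one-tail Z-test for the tuition fee problem can be reproduced in a few lines. This is a sketch assuming Python's standard-library `statistics.NormalDist` for the critical value; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu0 = 20000.0          # hypothesized mean tuition fee (Php)
sample_mean = 19750.0  # mean of the 16 sampled students
variance = 160000.0    # assumed known population variance
n = 16
alpha = 0.05

# Step 3: compute the test statistic, (19750 - 20000) / (400 / 4)
z_c = (sample_mean - mu0) / (sqrt(variance) / sqrt(n))

# Step 2: left-tail critical value, about -1.645 at alpha = 0.05
z_crit = NormalDist().inv_cdf(alpha)

# Step 4: reject Ho when the statistic falls in the left-tail rejection region
reject = z_c < z_crit

print(f"Z_C = {z_c:.2f}, critical value = {z_crit:.3f}")
print("Reject Ho" if reject else "Fail to reject Ho")
```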



To summarize the lessons learned today, present the following table:

Null Hypothesis (Ho) | Alternative Hypothesis (Ha) | Appropriate Test Statistic | Assumptions | Decision Rule and Rejection Region
µ = µ0 | µ ≠ µ0 | $Z_C = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$ | Variable of interest follows the normal distribution with known population variance (σ²) | Reject Ho if |ZC| > Zα/2. Otherwise, we fail to reject Ho. Rejection regions: both tails of the standard normal distribution, to the left of -Zα/2 and to the right of Zα/2.
µ = µ0 or µ ≤ µ0 | µ > µ0 | $Z_C = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$ | Variable of interest follows the normal distribution with known population variance (σ²) | Reject Ho if ZC > Zα. Otherwise, we fail to reject Ho. Rejection region: right tail of the standard normal distribution, to the right of Zα.
µ = µ0 or µ ≥ µ0 | µ < µ0 | $Z_C = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$ | Variable of interest follows the normal distribution with known population variance (σ²) | Reject Ho if ZC < -Zα. Otherwise, we fail to reject Ho. Rejection region: left tail of the standard normal distribution, to the left of -Zα.

ASSESSMENT
Using the problem in the assessment of the previous lesson, perform the test of hypothesis for the statistician who wants to test the claim of Brain Philippines. The random sample of 30 graduates he obtained recorded a mean GRE score of 560.

Step 1: Formulate the appropriate null and alternative hypotheses.
Answer: Ho: Graduates of BP courses did not score better than 555, or in symbols, µ ≤ 555, while Ha: Graduates of BP courses score better than 555, or in symbols, µ > 555.

Step 2: Identify the test statistic to use. With the given level of significance and the distribution of the test statistic, state the decision rule and specify the rejection region.
Answer: The appropriate test statistic is $Z_C = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$. With a 5% level of significance, the decision rule is “Reject the null hypothesis (Ho) if ZC > Z0.05 = 1.645. Otherwise, we fail to reject Ho.” The rejection region is found on the right tail of the standard normal distribution as shown below:

[Figure: standard normal distribution with the rejection region in the right tail, to the right of Zα = 1.645.]

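Step 3: Using a simple random sample of observations, compute the value of the test statistic.
Answer: With the sample mean of 560, the hypothesized mean of 555, the standard deviation of 139, and n = 30, the computed test statistic is

$$Z_C = \frac{560 - 555}{139/\sqrt{30}} = \frac{5}{25.38} \approx 0.20$$

Step 4: Make a decision whether to reject or fail to reject Ho.
Answer: With the computed test statistic equal to 0.20, which is not greater than 1.645, we fail to reject the null hypothesis.

Step 5: State the conclusion.
Answer: At the 5% level of significance, the sample does not provide sufficient evidence to say that the graduates of Brain Philippines do better in the GRE.

MEETING LEARNERS’ NEEDS
• Continue to use the examples/assessments in other lessons in the future.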


CHAPTER 5: TESTS OF HYPOTHESIS
Lesson 4: Test on Population Mean (Part 2)

TIME FRAME: 60 minutes

LEARNING COMPETENCIES
At the end of the lesson, learners should be able to:
• Identify the appropriate form of test statistic on the population mean when the population variance is assumed to be unknown
• Identify the appropriate rejection region for a given level of significance when the population variance is assumed to be unknown
• Conduct the test of hypothesis on the population mean when the population variance is assumed to be unknown
• Identify the appropriate form of test statistic on the population mean when the population variance is assumed to be unknown and the sample size is large enough to invoke the Central Limit Theorem
• Identify the appropriate rejection region for a given level of significance when the population variance is assumed to be unknown and the sample size is large enough to invoke the Central Limit Theorem
• Conduct the test of hypothesis on the population mean when the population variance is assumed to be unknown and the sample size is large enough to invoke the Central Limit Theorem

LESSON OUTLINE
A. Steps in hypothesis testing on population mean when the population variance is assumed to be unknown
B. Illustration of the test of hypothesis on the population mean when the population variance is assumed to be unknown
C. Steps in hypothesis testing on population mean when the population variance is assumed to be unknown and the sample size is large enough to invoke the Central Limit Theorem
D. Illustration of the test of hypothesis on the population mean when the population variance is assumed to be unknown and the sample size is large enough to invoke the Central Limit Theorem

DEVELOPMENT OF THE LESSON
As in the previous lesson, you may start the lesson by reviewing the steps of the hypothesis testing procedure:
1. Formulate the null and alternative hypotheses.
2. Identify the test statistic to use. With the given level of significance and the distribution of the test statistic, state the decision rule and specify the rejection region.
3. Using a simple random sample of observations, compute the value of the test statistic.
4. Make a decision whether to reject or fail to reject (accept) Ho.
5. State the conclusion.

As a motivational activity, use the problem in the previous lesson but, in this case, emphasize that the population variance is unknown. The problem can be stated as follows:


The father of a senior high school student lists down the expenses he will incur when he sends his daughter to the university where he wants her to study. He hypothesizes that the average tuition fee is at least Php20,000 per semester. He knows the variable of interest, which is the tuition fee, is measured at least in the interval scale, or specifically in the ratio scale. He assumes that the variable of interest follows the normal distribution but both the population mean and variance are unknown. The father asks, at random, 25 students of the university about their tuition fee per semester. He is able to get an average of Php20,050 with a standard deviation of Php500.

• In this problem, the appropriate null and alternative hypotheses remain the same as in the previous lesson and are stated as:
Ho: The average tuition fee in the targeted university is at least Php20,000. In symbols, Ho: µ ≥ 20,000 pesos.
Ha: The average tuition fee in the targeted university is less than Php20,000. In symbols, Ha: µ < 20,000 pesos.

• With the assumption of unknown population variance (σ²), and with the variable of interest measured at least in the interval scale and following the normal distribution, the appropriate test statistic, denoted as tC, is computed as

$$t_C = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}$$

where $\bar{X}$ and s are the sample mean and sample standard deviation, respectively, computed from a simple random sample of n observations; and µ0 is the hypothesized value of the parameter. The test statistic follows the Student’s t-distribution with n-1 degrees of freedom, which means the tabular value in the Student’s t-table will be used as the critical or tabular value. With this, the decision rule can be one of the following possibilities:
1. Reject Ho if tC < -tα, n-1. Otherwise, we fail to reject Ho.
2. Reject Ho if tC > tα, n-1. Otherwise, we fail to reject Ho.
3. Reject Ho if |tC| > tα/2, n-1. Otherwise, we fail to reject Ho.

For the problem, the first is the appropriate decision rule. Suppose the level of significance (α) is set at 0.05; then the decision rule for the problem can be stated as “Reject Ho if tC < -t0.05, 24 = -1.711. Otherwise, we fail to reject Ho.” (The value 2.064 found in the same row of the t-table is t0.025, 24, which applies to a two-tail test.) Note that this test procedure is referred to as the “one-tail t-test for the population mean,” and the rejection region is illustrated as follows:

[Figure: Student’s t-distribution with 24 degrees of freedom; the rejection region is the left tail, to the left of -tα, n-1 = -1.711.]



The third step in the hypothesis testing procedure is to compute the value of the test statistic based on a random sample of observations collected. It was stated in the problem that from a simple random sample of 25 students, a sample mean of Php20,050 with a standard deviation of 500 pesos was obtained. Hence, the computed test statistic is

$$t_C = \frac{20{,}050 - 20{,}000}{500/\sqrt{25}} = \frac{50}{100} = 0.50$$

The next step is to use the decision rule to make a decision. With the computed value of the test statistic equal to 0.50, which is not less than the negative tabular value, the rule dictates that we fail to reject the null hypothesis. Lastly, as a consequence of the decision, conclusions are to be stated. Having failed to reject (accepted) the null hypothesis, the father can say that the sample is consistent with an average tuition fee of at least Php20,000 at the university where he wants his daughter to study.
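The computation and the decision for this one-tail t-test can be sketched as follows. Python's standard library has no Student's t-distribution, so the tabular value is hard-coded here as an assumption: 1.711 is the standard t-table entry for a one-tail area of 0.05 with 24 degrees of freedom. The variable names are illustrative.

```python
from math import sqrt

mu0 = 20000.0          # hypothesized mean tuition fee (Php)
sample_mean = 20050.0  # mean reported by the 25 sampled students
s = 500.0              # sample standard deviation
n = 25

# Assumption: tabular value t(0.05, 24) copied from a t-table, not computed
t_tab = 1.711

# Test statistic: (20050 - 20000) / (500 / 5)
t_c = (sample_mean - mu0) / (s / sqrt(n))

# Left-tail rule: reject Ho only if t_C is below the negative tabular value
reject = t_c < -t_tab

print(f"t_C = {t_c:.2f}")
print("Reject Ho" if reject else "Fail to reject Ho")
```

Because the sample mean is above the hypothesized value, the statistic is positive and cannot fall in a left-tail rejection region, whatever the level of significance.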

We proceed to the next part of the lesson by asking learners what they will do in case the variable of interest cannot be assumed to follow a normal distribution. Is there a way to test the hypotheses? The answer to this question is: Yes, there is a way to do it, but they must be assured that the sample size is large enough to invoke the Central Limit Theorem, which they learned under the lesson on the sampling distribution of the sample mean. Let us say that for the given problem, a random sample of size 36 is sufficient for us to invoke the theorem. Hence, we could restate the problem as follows. Notice that we emphasize the change in the sample size to invoke the theorem.

The father of a senior high school student lists down the expenses he will incur when he sends his daughter to the university where he wants her to study. He hypothesizes that the average tuition fee is at least Php20,000 per semester. He knows the variable of interest, which is the tuition fee, is measured at least in the interval scale, or specifically in the ratio scale. He assumes that the variable of interest follows a distribution with unknown population mean and variance. The father asks, at random, 36 students of the university about their tuition fee per semester. He is able to get an average of Php20,250 with a standard deviation of 400 pesos.

• In this problem, the appropriate null and alternative hypotheses remain the same as in the previous lesson and are stated as follows:
Ho: The average tuition fee in the targeted university is at least Php20,000. In symbols, Ho: µ ≥ 20,000 pesos.
Ha: The average tuition fee in the targeted university is less than Php20,000. In symbols, Ha: µ < 20,000 pesos.

• With the assumption of unknown distribution of the variable of interest as well as its population variance (σ²), but with a sample size large enough to invoke the Central Limit Theorem, the test statistic denoted as tC, which was used earlier, is still appropriate to use. This test statistic is computed as

$$t_C = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}$$

where $\bar{X}$ and s are the sample mean and sample standard deviation, respectively, computed from a simple random sample of n observations; and µ0 is the hypothesized value of the parameter. However, this time, with the Central Limit Theorem, we can assume that the test statistic follows the standard normal distribution, which means the tabular value in the Z-table will be used as the critical or tabular value. With this, the decision rule can be one of the following possibilities:
1. Reject Ho if tC < -Zα. Otherwise, we fail to reject Ho.
2. Reject Ho if tC > Zα. Otherwise, we fail to reject Ho.
3. Reject Ho if |tC| > Zα/2. Otherwise, we fail to reject Ho.

For the problem, the first is the appropriate decision rule. Suppose the level of significance (α) is set at 0.05; then the decision rule for the problem can be stated as “Reject Ho if tC < -Z0.05 = -1.645. Otherwise, we fail to reject Ho.” The rejection region is illustrated as follows:


[Figure: standard normal distribution with the rejection region in the left tail, to the left of -Zα = -1.645.]

The third step in the hypothesis testing procedure is to compute the value of the test statistic based on a random sample of observations collected. It was stated in the problem that from a simple random sample of 36 students, a sample mean of Php20,250 with a standard deviation of 400 pesos was obtained. Hence, the computed test statistic is

$$t_C = \frac{20{,}250 - 20{,}000}{400/\sqrt{36}} = \frac{250}{66.67} = 3.75$$

The next step is to use the decision rule to make a decision. With the computed value of the test statistic equal to 3.75, which is not less than -1.645, the rule dictates that we fail to reject the null hypothesis. Lastly, as a consequence of the decision, conclusions are to be stated. Having failed to reject the null hypothesis, the father cannot say that the average tuition fee at the university where he wants his daughter to study is less than Php20,000; the sample is consistent with an average of at least Php20,000.

To summarize the lessons learned today, present the following table:

| Null Hypothesis (Ho) | Alternative Hypothesis (Ha) | Assumptions | Appropriate Test Statistic | Decision Rule and Rejection Region |
| --- | --- | --- | --- | --- |
| µ = µ0 | µ ≠ µ0 | Variable of interest follows a normal distribution with unknown population variance (σ²). | $t_C = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$ | Reject Ho if $\lvert t_C \rvert > t_{\alpha/2,\,n-1}$. Otherwise, we fail to reject Ho. Rejection regions: both tails of the Student's t distribution with n-1 df, beyond $-t_{\alpha/2,\,n-1}$ and $t_{\alpha/2,\,n-1}$. |
| µ = µ0 or µ ≤ µ0 | µ > µ0 | Variable of interest follows a normal distribution with unknown population variance (σ²). | $t_C = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$ | Reject Ho if $t_C > t_{\alpha,\,n-1}$. Otherwise, we fail to reject Ho. Rejection region: right tail of the Student's t distribution with n-1 df, beyond $t_{\alpha,\,n-1}$. |
| µ = µ0 or µ ≥ µ0 | µ < µ0 | Variable of interest follows a normal distribution with unknown population variance (σ²). | $t_C = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$ | Reject Ho if $t_C < -t_{\alpha,\,n-1}$. Otherwise, we fail to reject Ho. Rejection region: left tail of the Student's t distribution with n-1 df, beyond $-t_{\alpha,\,n-1}$. |
| µ = µ0 | µ ≠ µ0 | Variable of interest follows an unknown distribution but uses a large sample to invoke the CLT. | $t_C = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$ | Reject Ho if $\lvert t_C \rvert > Z_{\alpha/2}$. Otherwise, we fail to reject Ho. Rejection regions: both tails of the standard normal distribution, beyond $-Z_{\alpha/2}$ and $Z_{\alpha/2}$. |
| µ = µ0 or µ ≤ µ0 | µ > µ0 | Variable of interest follows an unknown distribution but uses a large sample to invoke the CLT. | $t_C = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$ | Reject Ho if $t_C > Z_{\alpha}$. Otherwise, we fail to reject Ho. Rejection region: right tail of the standard normal distribution, beyond $Z_{\alpha}$. |
| µ = µ0 or µ ≥ µ0 | µ < µ0 | Variable of interest follows an unknown distribution but uses a large sample to invoke the CLT. | $t_C = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$ | Reject Ho if $t_C < -Z_{\alpha}$. Otherwise, we fail to reject Ho. Rejection region: left tail of the standard normal distribution, beyond $-Z_{\alpha}$. |

ASSESSMENT
The purpose of the assessment is to conduct the test of hypothesis using the appropriate components of the test procedure. Hence, ask learners to conduct the test of hypothesis for each of the following real-life problems.

1. The minimum wage earners of the National Capital Region are believed to be receiving less than Php500 per day. The CEO of a large supermarket chain in the region claims to be paying its contractual employees more than the minimum daily wage rate of Php500. To check this claim, a labour union leader obtained a random sample of 144 contractual employees from this supermarket chain. The survey of their daily wage earnings resulted in an average wage of Php510 per day with a standard deviation of Php100. The daily wage of the region is assumed to follow a distribution with an unknown population variance. Perform a test of hypothesis at the 5% level of significance to help the labour union leader make an empirically based conclusion on the CEO’s claim.

Step 1: Formulate the appropriate null and alternative hypotheses.
Answer: Ho: The CEO’s claim is not true, or the average daily wage rate of the contractual employees at the supermarket is less than or equal to Php500. In symbols, µ ≤ 500. Ha: The CEO’s claim is true, or the average daily wage rate of the contractual employees at the supermarket is higher than Php500. In symbols, µ > 500.

Step 2: Identify the test statistic to use. With the given level of significance and the distribution of the test statistic, state the decision rule and specify the rejection region.
Answer: The appropriate test statistic is $t_C = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$. With 5% level of significance, the decision rule is “Reject the null hypothesis (Ho) if $t_C > Z_{0.05} = 1.645$. Otherwise, fail to reject Ho.” The rejection region is found on the right tail of the standard normal distribution, to the right of $Z_{\alpha} = 1.645$.

Step 3: Using the sample statistics obtained from a random sample of size 144, compute the value of the test statistic.
Answer: The computed test statistic is $t_C = \frac{510 - 500}{100/\sqrt{144}} = 1.20$.

Step 4: Make a decision whether to reject or fail to reject Ho.
Answer: With the computed test statistic equal to 1.20, which is not greater than 1.645, the null hypothesis is not rejected.

Step 5: State the conclusion.
Answer: At the 5% level of significance, there is not enough evidence to support the CEO’s claim that the average daily wage rate of the contractual workers at the supermarket chain is higher than Php500.
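Teachers who want to check the arithmetic in Problem 1 can use a short script such as the sketch below. It is only an illustration: the function name `large_sample_mean_test` and the labels "less", "greater", and "two-sided" are assumptions made for this example, and the critical value is taken from Python's standard normal distribution instead of a printed Z-table.

```python
import math
from statistics import NormalDist


def large_sample_mean_test(x_bar, s, n, mu_0, alpha, alternative):
    """Large-sample test on a population mean from summary statistics.

    `alternative` is "less", "greater", or "two-sided" (labels assumed
    for this sketch). Returns (test statistic, critical value, reject?).
    """
    t_c = (x_bar - mu_0) / (s / math.sqrt(n))
    std_normal = NormalDist()  # standard normal, mean 0 and sd 1
    if alternative == "less":
        crit = std_normal.inv_cdf(1 - alpha)
        reject = t_c < -crit
    elif alternative == "greater":
        crit = std_normal.inv_cdf(1 - alpha)
        reject = t_c > crit
    elif alternative == "two-sided":
        crit = std_normal.inv_cdf(1 - alpha / 2)
        reject = abs(t_c) > crit
    else:
        raise ValueError("alternative must be 'less', 'greater', or 'two-sided'")
    return t_c, crit, reject


# Problem 1: n = 144, sample mean 510, sample sd 100, Ha: mu > 500, alpha = 0.05
t_c, crit, reject = large_sample_mean_test(510, 100, 144, 500, 0.05, "greater")
print(round(t_c, 2), round(crit, 3), reject)  # prints: 1.2 1.645 False
```

The last value printed is `False`, which corresponds to failing to reject Ho in Step 4.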

2. A brand of powdered milk is advertised as having a net weight of 250 grams. A curious consumer obtained the net weight of 10 randomly selected cans. The values obtained are: 256, 248, 242, 245, 246, 248, 250, 255, 243 and 249 grams. Is there reason to believe that the average net weight of the powdered milk cans is less than 250 grams at 10% level of significance? Assume the net weight is normally distributed with unknown population variance.


Step 1: Formulate the appropriate null and alternative hypotheses.
Answer: Ho: The average net weight of the powdered milk cans is equal to 250 grams. In symbols, µ = 250. Ha: The average net weight of the powdered milk cans is less than 250 grams. In symbols, µ < 250.

Step 2: Identify the test statistic to use. With the given level of significance and the distribution of the test statistic, state the decision rule and specify the rejection region.
Answer: The appropriate test statistic is $t_C = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$. With 10% level of significance, the decision rule is “Reject the null hypothesis (Ho) if $t_C < -t_{0.10,\,9} = -1.383$. Otherwise, we fail to reject Ho.” The rejection region is found at the left tail of the Student’s t-distribution with 9 df, to the left of $-t_{0.10,\,9} = -1.383$.

Step 3: Using the 10 observations, compute the value of the test statistic.
Answer: The sample mean and sample standard deviation are computed as $\bar{x} = 248.2$ and $s \approx 4.61$, respectively. The computed test statistic is $t_C = \frac{248.2 - 250}{4.61/\sqrt{10}} \approx -1.23$.

Step 4: Make a decision whether to reject or fail to reject Ho.
Answer: With the computed test statistic equal to -1.23, which is not less than -1.383, the null hypothesis is not rejected.

Step 5: State the conclusion.
Answer: At the 10% level of significance, there is not enough evidence to say that the average net weight of the powdered milk cans is less than the advertised 250 grams.

MEETING LEARNERS’ NEEDS
• Continue to use the examples/assessments in other lessons in the future.
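The sample mean, sample standard deviation, and test statistic for the powdered milk problem (Problem 2 above) can be checked from the ten recorded weights. The sketch below uses only Python's standard library; the variable names are chosen for this illustration. The resulting statistic is then compared with the tabular t value for 9 df.

```python
import math
from statistics import mean, stdev

# Net weights (grams) of the 10 randomly selected cans from Problem 2
weights = [256, 248, 242, 245, 246, 248, 250, 255, 243, 249]
mu_0 = 250  # hypothesized (advertised) mean net weight

n = len(weights)
x_bar = mean(weights)   # sample mean
s = stdev(weights)      # sample standard deviation (divides by n - 1)
t_c = (x_bar - mu_0) / (s / math.sqrt(n))

print(n, x_bar, round(s, 2), round(t_c, 2))  # prints: 10 248.2 4.61 -1.23
```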


CHAPTER 5: TESTS OF HYPOTHESIS
Lesson 5: Test on Population Proportion
TIME FRAME: 60 minutes

LEARNING COMPETENCIES
At the end of the lesson, the learners should be able to:
• Formulate appropriate null and alternative hypotheses on the population proportion
• Identify the appropriate form of the test statistic on the population proportion when the sample size is large enough to invoke the Central Limit Theorem
• Identify the appropriate rejection region for a given level of significance when the sample size is large enough to invoke the Central Limit Theorem
• Conduct the test of hypothesis on population proportion when the sample size is large enough to invoke the Central Limit Theorem

LESSON OUTLINE
a. Introduction of the possible null and alternative hypotheses on population proportion
b. Steps in hypothesis testing on population proportion when the sample size is large enough to invoke the Central Limit Theorem
c. Illustration of the test of hypothesis on the population proportion when the sample size is large enough to invoke the Central Limit Theorem

DEVELOPMENT OF THE LESSON
As in the previous lesson, start the lesson by reviewing the steps of the hypothesis testing procedure:
1. Formulate the null and alternative hypotheses.
2. Identify the test statistic to use. With the given level of significance and the distribution of the test statistic, state the decision rule and specify the rejection region.
3. Using a simple random sample of observations, compute the value of the test statistic.
4. Make a decision on whether to reject or fail to reject (accept) Ho.
5. State the conclusion.

As a motivational activity, you may use a problem in Lesson 3, which is stated as follows: The principal of an elementary school believes that this year there would be more students from the school who would pass the National Achievement Test (NAT), so that the proportion of students who passed the NAT is greater than the proportion obtained in the previous year, which is 0.75. What will be the appropriate null and alternative hypotheses to test this belief?


In this problem, the parameter of interest is the proportion of students of the school who will pass the NAT this year. In symbols, this parameter is denoted as P. As applied to the problem, the appropriate null and alternative hypotheses are:
Ho: The proportion of students of the school who will pass the NAT this year is equal to 0.75. In symbols, Ho: P = 0.75.
Ha: The proportion of students of the school who will pass the NAT this year is greater than 0.75. In symbols, Ha: P > 0.75.

The variable indicating whether or not a student passes the NAT this year is said to follow a Bernoulli distribution with parameter P. If we further take as the variable of interest the number of students, out of n students, who will pass the NAT this year, then this variable follows a binomial distribution with parameters n and P. With the assumption of a large sample to be able to invoke the Central Limit Theorem, the appropriate test statistic, denoted as ZC, is computed as

$$Z_C = \frac{\hat{p} - P_0}{\sqrt{P_0(1 - P_0)/n}}$$

where $\hat{p}$ is the sample proportion computed from a simple random sample of n observations; and P0 is the hypothesized value of the parameter. The test statistic follows the standard normal distribution, which means the tabular value in the Z-table will be used as the critical value. With this, the decision rule can be one of the following possibilities:
1. Reject the null hypothesis (Ho) if $Z_C < -Z_{\alpha}$. Otherwise, we fail to reject Ho.
2. Reject the null hypothesis (Ho) if $Z_C > Z_{\alpha}$. Otherwise, we fail to reject Ho.
3. Reject the null hypothesis (Ho) if $|Z_C| > Z_{\alpha/2}$. Otherwise, we fail to reject Ho.

For the problem, the second option is the appropriate decision rule. Suppose the level of significance (α) is set at 0.05; then the decision rule for the problem can be stated as “Reject Ho if $Z_C > Z_{0.05} = 1.645$. Otherwise, we fail to reject Ho.” Note that this test procedure is referred to as “one-tail Z-test for population proportion” and the rejection region is illustrated as follows:

[Figure: standard normal curve with the rejection region in the right tail, to the right of $Z_{0.05} = 1.645$]



The third step is to compute the value of the test statistic using a random sample of observations gathered or collected for the purpose of the test of hypothesis. Suppose from a simple random sample of 100 students of the school, 78 students were able to pass the NAT. Hence, the computed test statistic is

$$Z_C = \frac{0.78 - 0.75}{\sqrt{0.75(1 - 0.75)/100}} \approx 0.6928$$

With the computed value of the test statistic equal to 0.6928, the next step is to use the decision rule to make a decision: to reject or fail to reject Ho. Since 0.6928 is not greater than 1.645, we fail to reject Ho. Lastly, as a consequence of the decision, conclusions are made in relation to the purpose of the test of hypothesis. With the non-rejection of the null hypothesis, it can be concluded that, at the 5% level of significance, there is not enough evidence that the proportion of students of the school who passed the NAT this year is greater than 0.75.

To summarize the lessons learned today, present the following table:

| Null Hypothesis (Ho) | Alternative Hypothesis (Ha) | Assumptions | Appropriate Test Statistic | Decision Rule and Rejection Region |
| --- | --- | --- | --- | --- |
| P = P0 | P ≠ P0 | Variable of interest follows the binomial distribution with n and P as parameters | $Z_C = \frac{\hat{p} - P_0}{\sqrt{P_0(1 - P_0)/n}}$ | Reject Ho if $\lvert Z_C \rvert > Z_{\alpha/2}$. Otherwise, we fail to reject Ho. Rejection regions: both tails of the standard normal distribution, beyond $-Z_{\alpha/2}$ and $Z_{\alpha/2}$. |
| P = P0 or P ≤ P0 | P > P0 | Variable of interest follows the binomial distribution with n and P as parameters | $Z_C = \frac{\hat{p} - P_0}{\sqrt{P_0(1 - P_0)/n}}$ | Reject Ho if $Z_C > Z_{\alpha}$. Otherwise, we fail to reject Ho. Rejection region: right tail of the standard normal distribution, beyond $Z_{\alpha}$. |
| P = P0 or P ≥ P0 | P < P0 | Variable of interest follows the binomial distribution with n and P as parameters | $Z_C = \frac{\hat{p} - P_0}{\sqrt{P_0(1 - P_0)/n}}$ | Reject Ho if $Z_C < -Z_{\alpha}$. Otherwise, we fail to reject Ho. Rejection region: left tail of the standard normal distribution, beyond $-Z_{\alpha}$. |

ASSESSMENT
Carry out a test of hypothesis to draw conclusions in relation to each of the following problems:

1. Previous evidence shows that the majority of the students are happy and contented with the university’s policies. This year, a random sample of 100 students was drawn. They were asked if they were happy and contented with the university’s policies. Out of 100 students, 65 said so. What conclusions could be made at 10% level of significance?

Step 1: Formulate the appropriate null and alternative hypotheses.
Answer: Ho: At most, half of the student population are happy and contented with the university’s policies. In symbols, P ≤ 0.50. Ha: The majority of the student population are happy and contented with the university’s policies. In symbols, P > 0.50.

Step 2: Identify the test statistic to use. With the given level of significance and the distribution of the test statistic, state the decision rule and specify the rejection region.
Answer: Having the variable of interest defined as the number of students, out of n students, who are happy and contented with the university’s policies, the appropriate test statistic is $Z_C = \frac{\hat{p} - P_0}{\sqrt{P_0(1 - P_0)/n}}$. With 10% level of significance, the decision rule is “Reject the null hypothesis (Ho) if $Z_C > Z_{0.10} = 1.28$. Otherwise, we fail to reject Ho.” The rejection region is found on the right tail of the standard normal distribution, to the right of $Z_{\alpha} = 1.28$.

Step 3: Using a simple random sample of observations, compute the value of the test statistic.
Answer: The computed test statistic is $Z_C = \frac{0.65 - 0.50}{\sqrt{0.50(1 - 0.50)/100}} = 3.0$.

Step 4: Make a decision whether to reject or fail to reject Ho.
Answer: With the computed test statistic equal to 3.0, the null hypothesis is rejected.

Step 5: State the conclusion.
Answer: We then say that the majority of the student population are happy and contented with the university’s policies.

2. An independent research group is interested in showing that the percentage of babies delivered through Caesarean Section is decreasing. For the past years, 20% of the babies were delivered through Caesarean Section. The research group randomly inspects the medical records of 144 births and finds that 25 of the births were by Caesarean Section. Can the research group conclude that the percentage of births by Caesarean Section has decreased at 5% level of significance?

Step 1: Formulate the appropriate null and alternative hypotheses.
Answer: Ho: The proportion of births that were delivered by Caesarean Section is not decreasing, that is, it is still at least equal to 0.20. In symbols, P ≥ 0.20. Ha: The proportion of births that were delivered by Caesarean Section is decreasing, that is, it is less than 0.20. In symbols, P < 0.20.

Step 2: Identify the test statistic to use. With the given level of significance and the distribution of the test statistic, state the decision rule and specify the rejection region.
Answer: Having the variable of interest defined as the number of births out of n that were delivered through Caesarean Section, the appropriate test statistic is $Z_C = \frac{\hat{p} - P_0}{\sqrt{P_0(1 - P_0)/n}}$. With 5% level of significance, the decision rule is “Reject the null hypothesis (Ho) if $Z_C < -Z_{0.05} = -1.645$. Otherwise, we fail to reject Ho.” The rejection region is found on the left tail of the standard normal distribution, to the left of $-Z_{\alpha} = -1.645$.

Step 3: Using a simple random sample of observations, compute the value of the test statistic.
Answer: With $\hat{p} = 25/144 \approx 0.1736$, the computed test statistic is $Z_C = \frac{0.1736 - 0.20}{\sqrt{0.20(1 - 0.20)/144}} \approx -0.79$.

Step 4: Make a decision whether to reject or fail to reject Ho.
Answer: With the computed test statistic equal to -0.79, which is not less than -1.645, we fail to reject the null hypothesis.

Step 5: State the conclusion.
Answer: At the 5% level of significance, there is not enough evidence to conclude that the proportion of births delivered by Caesarean Section has decreased below 0.20.

MEETING LEARNERS’ NEEDS • Continue to use the examples/assessments in other lessons in the future.
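The Z-test on a population proportion used throughout this lesson can also be sketched in a few lines of Python, which may help teachers verify learners' computations. This is a minimal illustration under the lesson's large-sample assumption: the function name `proportion_z_test` and the labels "less", "greater", and "two-sided" are assumptions for this example, and critical values come from Python's standard normal distribution instead of the Z-table.

```python
import math
from statistics import NormalDist


def proportion_z_test(successes, n, p_0, alpha, alternative):
    """Large-sample Z-test on a population proportion.

    `alternative` is "less", "greater", or "two-sided" (labels assumed
    for this sketch). Returns (Z_C, critical value, reject Ho?).
    """
    p_hat = successes / n  # sample proportion
    z_c = (p_hat - p_0) / math.sqrt(p_0 * (1 - p_0) / n)
    std_normal = NormalDist()
    if alternative == "less":
        crit = std_normal.inv_cdf(1 - alpha)
        reject = z_c < -crit
    elif alternative == "greater":
        crit = std_normal.inv_cdf(1 - alpha)
        reject = z_c > crit
    elif alternative == "two-sided":
        crit = std_normal.inv_cdf(1 - alpha / 2)
        reject = abs(z_c) > crit
    else:
        raise ValueError("alternative must be 'less', 'greater', or 'two-sided'")
    return z_c, crit, reject


# NAT example: 78 passers out of 100, Ho: P = 0.75, Ha: P > 0.75, alpha = 0.05
nat = proportion_z_test(78, 100, 0.75, 0.05, "greater")
# Assessment 1: 65 of 100 students, Ho: P <= 0.50, Ha: P > 0.50, alpha = 0.10
happy = proportion_z_test(65, 100, 0.50, 0.10, "greater")

print(round(nat[0], 4), nat[2])      # prints: 0.6928 False
print(round(happy[0], 1), happy[2])  # prints: 3.0 True
```

`False` corresponds to failing to reject Ho and `True` to rejecting Ho, matching the decisions reached in the NAT example and in Assessment Problem 1.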


CHAPTER 5: TEST OF HYPOTHESIS
Lesson 6: More on Hypothesis Tests Regarding The Population Proportion
TIME FRAME: 60 minutes

OVERVIEW OF LESSON:
In this lesson, learners participate in an activity to reinforce their understanding of hypothesis testing. The lesson is largely taken from a STatistics Education Web (STEW) lesson plan called “I Always Feel Like Somebody's Watching Me.” Learners perform an experiment in order to test the “Psychic Staring Effect,” i.e., the idea that people can sense they are being stared at. The activity can be used to illustrate large-sample confidence intervals and hypothesis tests on proportions.

LEARNING COMPETENCIES
At the end of the lesson, learners should be able to:
• Calculate a confidence interval estimate for the population proportion
• Formulate appropriate null and alternative hypotheses on the population proportion
• Identify the appropriate form of the test statistic on the population proportion when the sample size is large enough to invoke the Central Limit Theorem
• Identify the appropriate rejection region for a given level of significance when the sample size is large enough to invoke the Central Limit Theorem
• Conduct the test of hypothesis on population proportion when the sample size is large enough to invoke the Central Limit Theorem

MATERIALS REQUIRED
• Scientific calculators
• Activity sheet (found at the end of this lesson)
• Some mechanism or instructions for incorporating randomness (the scientific calculator can be used for this task).

LESSON OUTLINE
A. Point estimator of the population proportion
B. Properties of the sample proportion as point estimator of population proportion
C. Construction and interpretation of a (1-α)100% confidence interval estimator of the population proportion using large sample
D. Illustration on the computation of a point and interval estimates of the population proportion and its interpretation

DEVELOPMENT OF THE LESSON
A. ACTIVITY: “PSYCHIC STARING EFFECT”
This lesson involves an activity where learners collect and explore data. Students perform an experiment in order to test the “Psychic Staring Effect.” Explain first to students that the “Psychic Staring Effect” is the idea that people can sense they are being stared at. This has been studied heavily by many different researchers, but


with different results. In 2003, Rupert Sheldrake wrote a book entitled “The Sense of Being Stared At,” which contained anecdotal evidence for the phenomenon: “Many people have had the experience of feeling that they are being looked at, and, on turning around, find that they really are. Conversely, many people have stared at other people's backs, for example in a lecture theater, and watched them become restless and then turn round.”

In the late 19th century, psychologist Edward B. Titchener suggested that the effect was an illusion, and that when a person turned to check whether they were being watched, the initial movement of their head might attract the focus of somebody behind them who was previously only looking in their general direction. By the time the person had turned their head fully, the second person would be looking directly at them, giving the mistaken impression that they had been staring at them all along.

After introducing the “Psychic Staring Effect,” explain to learners that the goal of the activity is to perform one of Sheldrake’s experiments, and perform a statistical analysis of the data collected to either support or refute Sheldrake’s claims.

Data Collection
Put learners into pairs and explain the data collection procedure. One person will be the Looker and the other, the Subject.
• The Subject should sit with his or her back to the Looker and keep his or her eyes closed.
• The Looker either “looks” or does “not look” at the Subject in a series of 10 trials, according to a random sequence. This random sequence can be generated by a random number generator on a calculator or by tossing a fair coin.

The Looker should stand about 1 meter behind the Subject’s back, and either “looks” or does “not look” at the Subject in accordance with a random sequence of trials. The teacher might wish to instruct all the Lookers to look down at the data collection sheet if they are not looking at the Subject on a particular trial.

To signal the beginning of each trial, the teacher should give a signal to the entire class, so that all trials are performed simultaneously. The teacher should say “Trial one: Begin.” (Since all Lookers are following different random sequences of instructions, the teacher’s voice will give no relevant clues to the Subjects.) The Subject then says “looking” or “not looking,” and the Looker records the looking status and the Subject’s response on the Data Collection Sheet. The Subject should not spend long thinking, but guess quite quickly; 10 seconds are long enough. The Looker should record the Subject’s guess and then proceed to the next trial. For the next trial, the teacher should say, “Trial two: Begin.” And so on. The entire procedure is repeated for 10 trials.

After the series of 10 trials has been completed, the Lookers hand in their Data Collection Sheets. The Lookers and Subjects then trade places. Each new Looker starts with a new data sheet. Assuming a class of 40 students is split into 20 pairs, with each student within a pair taking a turn as the Looker, 400 total trials would be produced for the class. After all the trials have been completed, collect the results from the pairs via the Data Collection Sheets and tally the class results on the white board.
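If a computer is available in place of a calculator or coin, the random look/not-look sequence for one Looker could be produced as in the sketch below. The function name `looker_sequence` and the labels "look" and "not look" are assumptions for this illustration; each trial is treated like the toss of a fair coin.

```python
import random


def looker_sequence(trials=10, seed=None):
    """Return a random list of 'look' / 'not look' instructions.

    Each trial is decided independently with probability 1/2, like a
    fair coin toss. Passing a seed makes the sequence reproducible.
    """
    rng = random.Random(seed)
    return [rng.choice(["look", "not look"]) for _ in range(trials)]


# One Looker's instructions for the 10 trials
sequence = looker_sequence(trials=10)
for trial, instruction in enumerate(sequence, start=1):
    print(f"Trial {trial}: {instruction}")
```

Each Looker should generate his or her own sequence so that different Lookers follow different instructions on the same trial.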

Note: 400 total trials will be performed. However, we expect the Looker to be looking at the Subject in only about half of these trials. It is this half that will be used for a part of the data analysis that follows. Table 1 contains sample results obtained when this activity is performed with 28 students.

Table 1. Example two-way frequency table for class data.

| Status of Looker | Answer of Subject: Looking | Answer of Subject: Not Looking | Row Totals |
| --- | --- | --- | --- |
| Looking | 66 | 68 | 134 |
| Not Looking | 76 | 70 | 146 |
| Column Totals | 142 | 138 | 280 |

After the class results are tallied onto the white board, students use the data to answer a series of questions designed to determine if the data support the existence of the “Psychic Staring Effect.”

B. DATA ANALYSIS
Here are two ways to analyze the data:

1. Confidence Intervals
Explain to learners that they need to make inferences regarding the population proportion, p. In the context of Sheldrake’s experiment, p is the proportion of the time that someone who is being looked at can correctly identify that they are being looked at. Help learners recall the following (large sample) formula that can be used to construct a confidence interval for p:

$$\hat{p} \pm z\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$

where $\hat{p}$ is the sample proportion of successes, n is the sample size, and z is the multiplier (critical value). The assumptions for using this procedure are: (1) The random variable of interest is categorical; (2) The data are obtained using randomization; (3) The sample size is sufficiently large so that the sampling distribution of the sample proportion is approximately normal. Specifically, $n\hat{p} \geq 15$ and $n(1 - \hat{p}) \geq 15$.

Discuss the assumptions with the learners. Have them identify the random variable of interest. On each trial, a success occurs if a Subject correctly identifies being looked at. Thus, the variable of interest is whether or not a correct response is given by a Subject who


is being looked at. Whether or not the Looker is looking at the Subject on a particular trial is obtained using randomization. The randomization is either done by flipping a fair coin or generating a random number in order to determine whether or not the Looker does look at a Subject on an individual trial. For a typical class of 40 students, the number of trials will approach 200, so the sample size requirement is very likely to be met.

Translate the problem from one concerning the random variable of interest to one that involves a population proportion. Ask students to identify what the unknown proportion is in this experiment’s context. In this experiment, the population proportion p is the proportion of those being looked at who correctly identify that they are being looked at.

Instruct learners to compute the sample proportion for the class results. Discuss what the value of the sample proportion indicates about the validity of the “Psychic Staring Effect.” Ask learners to construct a 95% confidence interval for p based upon the class data. Once the interval has been constructed, ask them to interpret the interval and to state whether or not they believe that the class data indicate that there is a “Psychic Staring Effect.”

For the sample class data, $\hat{p}$, the sample proportion of the time that someone who is being looked at can identify that they are being looked at, is $\hat{p} = \frac{66}{134} \approx .4925$ or .49. For 95% confidence, the z multiplier is 1.96. Thus, inserting the class values into the confidence interval formula gives:

$$.49 \pm 1.96\sqrt{\frac{.49(1 - .49)}{134}} = .49 \pm .08 = (.41, .57)$$
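The interval for the sample class data can be reproduced with a few lines of Python. The sketch below is only an illustration: the function name `proportion_confidence_interval` is an assumption, and the sample proportion is entered as .49, rounded as in the computation above, so that the output matches the text.

```python
import math


def proportion_confidence_interval(p_hat, n, z=1.96):
    """Large-sample confidence interval for a population proportion.

    Returns (lower, upper) = p_hat -/+ z * sqrt(p_hat * (1 - p_hat) / n).
    The default multiplier z = 1.96 corresponds to 95% confidence.
    """
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# Sample class data: 66 correct answers in 134 "looking" trials, p_hat rounded to .49
lower, upper = proportion_confidence_interval(p_hat=0.49, n=134)
print(round(lower, 2), round(upper, 2))  # prints: 0.41 0.57
```

Entering the unrounded proportion 66/134 instead of .49 shifts the endpoints slightly, so learners' answers may differ in the second decimal place depending on when they round.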

Based upon this confidence interval, we can say, with 95% confidence, that the proportion of the time that someone who is being looked at can identify that they are being looked at is between .41 and .57. Notice that this interval includes the value of .50 (the probability of getting a head in flipping a fair coin), which does not support the existence of the “Psychic Staring Effect.”

2. Hypothesis Tests
Next, discuss with learners the procedure for performing a large-sample hypothesis test on the value of a population proportion. If they wish to perform a hypothesis test on a proportion, p, the corresponding test statistic formula to test $H_0: p = p_0$ is

$$z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1 - p_0)}{n}}}$$

where $p_0$ is the null hypothesized value, $\hat{p}$ is the sample proportion of successes, and n is the sample size. The assumptions for this are the same as the ones for the large-sample confidence interval for p. Have a class discussion about the appropriate value to use for $p_0$ and the appropriate alternative hypothesis to use in order to use the class data to test for the existence of the

“Psychic Staring Effect.” Learners should agree that the null hypothesis should be $H_0: p = .50$ and the upper-tailed alternative hypothesis should be $H_A: p > .50$. Once the test statistic value has been calculated, ask learners to calculate and interpret the p-value and to make a conclusion in the context of the problem. That is, do the class data indicate the presence of the “Psychic Staring Effect”? For the sample class data, the test statistic is:

$$z = \frac{.49 - .50}{\sqrt{\dfrac{.50(1 - .50)}{134}}} \approx -.23$$

The teacher may wish to note that this test statistic value is very close to zero (0). A z test statistic of zero (0) is obtained if the sample proportion is exactly equal to 0.50. Since the alternative hypothesis here is upper tailed, the p-value is equal to the area to the right of -.23 under the standard normal distribution curve. This area is approximately equal to 0.59. Based upon this p-value, the null hypothesis cannot be rejected. Therefore, the data do not provide significant evidence to indicate that the proportion of the time that someone who is being looked at can identify that they are being looked at exceeds .50. In other words, the data do not support the existence of the “Psychic Staring Effect.”

REFERENCES
Many of the materials in this lesson were adapted from:
Albert, J. R. G. (2008). Basic Statistics for the Tertiary Level (ed. Roberto Padua, Welfredo Patungan, Nelia Marquez). Philippines: Rex Bookstore.
De Veaux, R. D., Velleman, P. F., and Bock, D. E. (2006). Intro Stats. Pearson Ed. Inc.
Freedman, D., Pisani, R., and Purves, R. (2007). Statistics, Fourth Edition. New York: W. W. Norton & Company.
Richardson, M. and Stephenson, P. I Always Feel Like Somebody’s Watching Me. Grand Valley State University in Statistics Education Web (STEW). Retrieved from https://www.amstat.org/education/stew/pdfs/IAlwaysFeelLikeSomebodysWatchingMe.doc
Schneiter, K. Exploring Geometric Probabilities with Buffon’s Coin Problem. Utah State University in Statistics Education Web (STEW) Online Journal of K-12 Statistics Lesson Plans. Retrieved from http://www.amstat.org/education/stew/pdfs/EGPBCP.pdf
Workbooks in Statistics 1: 11th Edition. Institute of Statistics, UP Los Baños, College Laguna 4031


Activity Sheet 5-06

Background
The Psychic Staring Effect (Adapted from: http://en.wikipedia.org/wiki/Staring)
The “Psychic Staring Effect” is the idea that people can sense that they are being stared at. It has been studied heavily, by many different researchers, with different results. In 2003, Rupert Sheldrake wrote the controversial book “The Sense of Being Stared At,” which contained a great deal of anecdotal evidence for the phenomenon: “Many people have had the experience of feeling that they are being looked at, and, on turning around, find that they really are. Conversely, many people have stared at other people's backs, for example in a lecture theater, and watched them become restless and then turn round.”

After students reported the phenomenon to him in the late 19th century, psychologist Edward B. Titchener suggested that the effect was an illusion, and that when a person turned to check whether they were being watched, the initial movement of their head might attract the focus of somebody behind them who was previously only looking in their general direction. By the time the person had turned their head fully, the second person would be looking directly at them, giving the mistaken impression that they had been staring at them all along.

Goal: Perform one of Sheldrake’s experiments to see if a statistical analysis of the resulting data supports Sheldrake’s claims. Learners will perform Sheldrake’s “Method 1,” which is an experiment with Lookers and Subjects within the same room.

The Experiment:
The experiment involves working in pairs. One person is the Looker and the other the Subject. The Subject sits with his or her back to the Looker and keeps his or her eyes closed. Lookers either “look” or do “not look” at the Subjects in a series of 10 trials according to a random sequence. This random sequence can be generated by a random number generator on a calculator or a fair coin toss.

The Looker stands about 3 feet behind the Subject’s back, and either “looks” or does “not look” at the Subject in accordance with the random sequence of trials. To signal the beginning of each trial, the teacher will give a signal to the entire class, so that all trials are performed simultaneously. The teacher will say “Trial one: Begin.” Since all Lookers are following different random sequences of instructions, the teacher’s voice will give no relevant clues to the Subjects. Then for the next trial, the teacher will say, “Trial two: Begin.” And so on.

The Subject then says “looking” or “not looking,” and the Looker records the looking status and the Subject’s response on the Data Collection Sheet. The Subject should not spend a long time thinking about whether the Looker is looking or not looking, but guess quite quickly; 10 seconds are long enough. The Looker records the Subject’s guess and then proceeds to the next trial. The same procedure is repeated for all 10 trials.


After the series of 10 trials has been completed, the Lookers hand in the Data Collection Sheets. The Lookers and Subjects then trade places. Each new Looker starts with a new data sheet.

Questions:

A. Large-Sample Confidence Interval on a Proportion

1. If a person can in fact determine whether they are being stared at, then what can we say about the numerical value of the proportion of those being looked at who should guess correctly and say “yes”?

2. For the class data:
(a) Identify the sample size. n = __________
(b) Identify the number of successes (recall, a trial results in a success if the Subject has correctly identified being looked at). x = __________
(c) Calculate the sample proportion of those being looked at who correctly answered “yes.” p̂ = __________

3. Construct a 95% confidence interval for the proportion of correct guesses for those being looked at (that is, the proportion of time that those who were being looked at said “yes”).
(a) Give the formula:
(b) Plug the appropriate class data into the formula:
(c) Give the confidence interval:
(d) Interpret the interval. State whether or not the interval supports the “Psychic Staring Effect”:

B. Hypothesis Test on a Proportion

4. Does the class data provide significant evidence to indicate that the proportion of “yes” guesses is higher than 50% for those being looked at?
(a) State the appropriate hypotheses:
(b) Summarize the data into an appropriate test statistic:
(c) Find the p-value:
(d) Report a conclusion in the context of this problem:
(e) In the context of this problem, explain what a Type 1 error would be:
(f) In the context of this problem, explain what a Type 2 error would be:
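Teachers who wish to check the class answers to Parts A and B may find a short script useful. The sketch below assumes the large-sample (normal approximation) methods named in the activity; the function name `proportion_summary` and the counts `x = 70`, `n = 120` are hypothetical and stand in for the class data.

```python
import math
from statistics import NormalDist


def proportion_summary(x, n, p0=0.5, confidence=0.95):
    """Large-sample confidence interval and one-sided z test for a proportion.

    x is the number of successes and n the number of trials.
    The test is H0: p = p0 against Ha: p > p0.
    """
    p_hat = x / n
    # Multiplier for the requested confidence level (about 1.96 for 95%).
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se_hat = math.sqrt(p_hat * (1 - p_hat) / n)
    interval = (p_hat - z_star * se_hat, p_hat + z_star * se_hat)
    # The test statistic uses the standard error computed under H0.
    se_null = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se_null
    p_value = 1 - NormalDist().cdf(z)
    return p_hat, interval, z, p_value


# Hypothetical class totals, for illustration only.
p_hat, interval, z, p_value = proportion_summary(x=70, n=120)
print(f"sample proportion: {p_hat:.3f}")
print(f"95% confidence interval: {interval[0]:.3f} to {interval[1]:.3f}")
print(f"z = {z:.2f}, one-sided p-value = {p_value:.4f}")
```

The confidence interval uses the sample proportion in its standard error, while the test uses the hypothesized value 0.5; this is the usual convention, but classes that follow a different textbook rule should adjust accordingly.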


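Data Collection Sheet

Name of Looker ____________________ Name of Subject ____________________

Trial | Looker: Looking (yes) / Not Looking (no) | Subject: Says Being Looked At (yes) / Says Not Being Looked At (no)
1 | ________ | ________
2 | ________ | ________
3 | ________ | ________
4 | ________ | ________
5 | ________ | ________
6 | ________ | ________
7 | ________ | ________
8 | ________ | ________
9 | ________ | ________
10 | ________ | ________

Summary:

Looking | Guess by Subject: Yes | Guess by Subject: No
Yes | ________ | ________
No | ________ | ________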

ASSESSMENT

1. Suppose a new treatment for a certain disease is given to a sample of 100 patients. The treatment was successful for 81 of the patients. Assume that these patients are representative of the population of individuals who have this disease.

(a) Calculate the sample proportion that were successfully treated.
Answer: The sample proportion is p̂ = 81/100 = .81.

(b) Determine a 95% confidence interval for the proportion of the population for whom the treatment would be successful. Write a sentence that interprets this interval.
Answer: The formula to use is Sample estimate ± Multiplier × Standard error. The sample estimate is p̂ = .81 and the standard error is √(p̂(1 − p̂)/n) = √(.81 × .19/100) ≈ .0392. For 95% confidence the multiplier is 1.96. The 95% confidence interval is .81 ± (1.96 × .0392), or .81 ± .077, which is .733 to .887. With 95% confidence, we can say that if the whole population with this disease received the treatment, the proportion successfully treated would be between .733 and .887 (or 73.3% to 88.7%).

2. An ESP experiment is done wherein a participant guesses which of 4 cards the researcher has randomly picked, where each card is equally likely. This is repeated for 100 trials. The null hypothesis is that the subject is guessing, while the alternative is that the subject has ESP and can guess at higher than the chance rate.

(i) What is the correct statement of the null hypothesis that the person does not have ESP?
A. H0: p = 0.5
B. H0: p = 4/100
C. H0: p = 1/4
D. H0: p > 1/4
Answer: C

(ii) The subject actually gets 35 correct answers. Which of the following describes the probability represented by the p-value for this test?
A. The probability that the subject has ESP.
B. The probability that the subject is just guessing.
C. The probability of 35 or more correct guesses if the subject is guessing at the chance rate.
D. The probability of 35 or more correct guesses if the subject has ESP.

Answer: C

(iii) Which of the following would be a Type 1 error in this situation?
A. Declaring somebody does not have ESP when they actually do.
B. Declaring somebody has ESP when they actually don’t have ESP.
C. Analyzing the data with a confidence interval rather than a significance test.
D. Making a mistake in the calculations of the significance test.

Answer: B

3. The probability that a patient recovers from a certain stomach disease is .75. Suppose that 20 people who have contracted this disease are randomly selected. What is the probability that exactly 12 will recover from the disease?

Answer: Let the random variable Y be the number of the selected people who recover from the disease. Then Y has a binomial distribution with n = 20 and p = 0.75. Thus

\[ P(Y = y) = \binom{20}{y} (0.75)^{y} (0.25)^{20-y} \quad \text{for } y = 0, 1, 2, \ldots, 20. \]

So

\[ P(Y = 12) = \binom{20}{12} (0.75)^{12} (0.25)^{8} \approx 0.0609. \]
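The binomial probabilities that appear in the assessment items can be checked with exact arithmetic rather than tables. The following is a sketch using only the Python standard library; the helper names `binom_pmf` and `binom_upper_tail` are illustrative, and the exact upper tail is offered as one way of looking at the p-value in item 2(ii), not as the method the guide prescribes.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(Y = k) when Y is binomial with n trials and success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_upper_tail(k, n, p):
    """P(Y >= k) for the same binomial distribution."""
    return sum(binom_pmf(j, n, p) for j in range(k, n + 1))


# Item 3: probability that exactly 12 of 20 patients recover when p = 0.75.
print(f"P(Y = 12) = {binom_pmf(12, 20, 0.75):.6f}")

# Item 2(ii): probability of 35 or more correct guesses in 100 trials
# if the subject is guessing at the chance rate of 1/4.
print(f"P(X >= 35) = {binom_upper_tail(35, 100, 0.25):.6f}")
```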

Chapter 6: Correlation and Regression Analysis

Lesson 1: Examining Relationships with Correlation

TIME FRAME: 120 minutes

OVERVIEW OF LESSON: In this lesson, learners discuss basic concepts and tools in exploring relationships between two variables. In addition, an activity is conducted to give learners hands-on experience in exploring relationships between two variables using a random sample of data collected from Lesson 1-01. Each student chooses a question to work on, then generates a scatter plot to show the relationship of interest, and finally estimates the correlation coefficient. After all learners have interpreted their correlation, they pool their findings as a class to explore the variability in the correlations found. As a class, they construct approximations to the sampling distributions of the correlation coefficient and use the sampling distributions to make assertions about the values of the population parameters.

LEARNING COMPETENCIES

At the end of the lesson, the learners should be able to:
• illustrate the nature of bivariate data
• construct a scatter plot
• describe shape (form), trend (direction), and variation (strength) of bivariate relationships based on a scatter plot
• estimate strength of association between two variables based on a scatter plot
• calculate the Pearson’s sample correlation coefficient
• solve problems involving correlation analysis

LESSON OUTLINE
1. Motivation
2. Preliminary Lesson: Correlation Measures Linear Association
3. Main Lesson: How to Generate the Scatterplot and Calculate the Correlation Coefficient
4. Enrichment: Sampling Distribution of Correlations
5. Advance Lesson: Rank Correlation

DEVELOPMENT OF THE LESSON

(A) Motivation

In Chapter 1, learners were guided through the basic tools used for describing data pertaining to one variable. In practice, a number of variables are collected per data item, such as information on an individual, household, establishment, farm, country, etc. Although data can be described and explored one variable at a time, it is also important to explore the relationship between two or among many variables. For instance, learners may be asked how information about a student’s daily allowance compares with the number of text messages he/she sends in a day, or how the weight of a student compares with the height of the student. We might wish to compare information on poverty in a particular region with information about crime. Such pairs of measurements are called bivariate data. Observations of two or more variables per individual (or object) are called multivariate data.

Inform learners that Sir Francis Galton was one of the first scientists who investigated the relationships of variables within the context of studying family resemblances, particularly the degree to which children resemble their parents. Galton’s disciple Karl Pearson further worked on this topic through an extensive study on family resemblances. Part of this study generated the heights of 1,078 fathers and those of their respective first-born sons when they reached the age of maturity. A plot of these data is shown in Figure 6-01.1, where each dot represents one father-son pair, with the father’s height on the horizontal axis and the son’s height on the vertical axis. This is known as a scatterplot or scatter diagram. The scatterplot suggests a positive association between the father’s height and that of his son, i.e., taller-than-average fathers tend to have taller-than-average sons and short fathers tend to have short sons.

Figure 6-01.1 Heights of 1,078 fathers and sons. Reproduced from Figure 1, Chapter 8, p. 120, Freedman, Pisani, Purves, 2007.

Studies that involve comparing two variables are conducted to find some connection (perhaps even some suggestion of causality) between them. This analysis helps us establish whether we can estimate the height of the son given the height of the father. If there is a weak association between the variables, then information about one variable will not help us in estimating the other variable. If there is a strong association between the variables, one way by which we can estimate one of them given the other is to fit a line passing through the point of averages (the point consisting of the averages of the two variables) with a slope equal to the ratio of the standard deviations of the variables. An alternative to this line is the regression line, which will be discussed in more detail in the next set of lessons in this chapter.

(B) Preliminary Lesson: Correlation Measures Linear Association

Let learners know that when the relationship between two numerical variables is of interest, a scatterplot of these variables should be drawn. Tell learners that a scatterplot allows one to visualize an association between two variables, if and when it exists. Some of the questions that can be answered with the use of a scatterplot include:
• Does one variable tend to be large when another is large?
• Does one variable tend to be small when the other is large?
• Does the relationship between the variables more or less follow a straight line?
• Is the scatter in one variable the same, regardless of the value of the other variable?

(C) Main Lesson: How to Generate the Scatterplot

Before starting with the activity, the teacher may review the concepts of population, sample, population parameter and sample statistic to reinforce the learners’ knowledge of and competence in these basic concepts. Data obtained from all the learners in Lesson 1-01 may be used to create a scatterplot and to conduct a correlation analysis. For this lesson, the population of interest is the population of senior high school learners who have filled out Activity Sheet Number 1-01a in Lesson 1-01. The statistical questions of interest for the activity are:
1. Does your daily allowance (x) increase or decrease with the number of text messages you send in a day (y)?
2. Does your daily allowance (x) increase or decrease with how happy you are (y)?
3. Does your weight (x) increase or decrease with your height (y)?

Learners will be asked to choose the statistical question they would like to focus on.

Note for teachers on classroom organization: For small classes with fewer than 30 learners, give only two questions to choose from. Make sure that the class is evenly divided among the questions. It would be ideal to have at least 10-15 people working on the same question. If the class size is too small, reduce the number of questions or have each student choose two out of the three questions to explore.

The teacher should share the entire database collected from Lesson 1-01 and divide the learners into groups. Ask the groups to take a random sample of 30 records from the entire database. Suppose that we have the following snapshot of a sample data set (values are listed column by column; the Height and text-message columns have fewer than 30 entries because some records are incomplete):

Student Number: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

Sex: F F F F F F F F F F F F F F F M M M M M M M M M M M M M M M

Height (in meters): 1.64 1.52 1.52 1.65 1.02 1.626 1.5 1.6 1.42 1.52 1.48 1.62 1.5 1.54 1.67 1.72 1.65 1.56
1.7 1.53 1.62 1.79 1.57 1.7 1.77 1.478 1.727 1.56 1.75

Weight (in kg): 40 50 49 45 60 45 38 51 42.2 54 46 54 36 50 63 55 61 60 52 90 50 90 80 58 68 27 50 94 66 50

Daily allowance in school: 0 50 0 150 0 0 200 100 500 0 100 20 0 0 0 200 0 50 250 0 250 100 100 50 20 100 300 100 50 0

Usual number of text messages per day: 10 5 18 4 60 20 200 15 10 2 25 30 60 80 30 1 80 30 100 0 6 0 0 55 50 5 3

Happiness: 5 7 5 9 5 7 7 6 9 4 8 4 6 7 9 8 5 6 8 4 9 6 7 7 4 8 7 7 6 5

Note for teachers regarding Outlier Analysis: The class should discuss whether the data are realistic or whether it will be necessary to delete some cases from the analysis (especially if these cases seem incorrect).

Show learners how to obtain a scatterplot by hand.
• Tell them to draw a rectangular coordinate system and to label the x- and y-axes. Learners may be asked if they still remember that René Descartes (1596-1650) was the first to represent pairs of numbers with points in a coordinate system, and this is why the x and y coordinates are also called the “Cartesian coordinates.”
• Suppose that x and y represent the daily allowance of learners and the typical number of text messages sent by learners, respectively.
• Choose a range that includes the maximums and minimums of the two variables. For the sample data, our x-values go from 0 to 500 (pesos), while our y-values range from 0 to 200.

• Draw the first point, i.e. (0, 10), on the graph, and then the remaining points as shown in Figure 6-01.3.

Figure 6-01.2 Dataset in Excel 2013

Figure 6-01.3 Drawing Points on a Cartesian Coordinate System (a) First Point; (b) All Points

To obtain a scatterplot of the dataset using spreadsheet applications, such as Excel 2013, merely put the dataset into a spreadsheet and highlight the columns of data that need to be used (here, columns F and G). Then on the Insert tab, under the Charts group, choose Scatter and then select the Scatter icon. This yields a default scatterplot with a title and legend (mentioning the y variable), shown in Figure 6-01.4 (a). This can be improved by deleting the title and legend, clicking on Layout, selecting Axis Titles, and then adding a Primary Horizontal Axis Title (with the title “allowance”) and a Primary Vertical Axis Title (with the title “text messages”), thus yielding Figure 6-01.4 (b).

Figure 6-01.4 Scatterplots of Usual Number of Text Messages in a Day versus Daily Allowance (Obtained from Excel 2013)

The scatterplot is a very informative tool for determining the association between two variables, say X and Y. Let mx and my be the respective means of the variables X and Y, and sx and sy be the respective standard deviations of the variables X and Y. In Chapter 1, learners were reminded of statistical concepts learned in previous grades, namely that the mean and standard deviation of a list of data measure the center of and the scattering in the list, respectively. Furthermore, Chebyshev's inequality guarantees that at least 75% of the x-values (pertaining to the X variable) will be within ± 2 standard deviations of the mean of X, and likewise, that the y-coordinates of at least 75% of the points will be within ± 2 standard deviations of the mean of Y.

Inform learners that two variables are said to be associated if knowing the value of one variable provides information about the value of the other variable. More precisely, the two variables X and Y are (linearly) associated if the standard deviation of the y-values within a narrow range of x-values (i.e., a "vertical slice" through the scatterplot) is smaller than the overall standard deviation sy of the variable Y; or if the standard deviation of the x-values within a narrow range of y-values (i.e., a "horizontal slice" through the scatterplot) is smaller than the overall standard deviation sx of the variable X.

For example, consider the plot of allowance against the number of texts sent. The two variables are said to be associated if for particular values of X, say for the learners with an allowance of Php 50 (or thereabouts), the spread of the values of Y is less than the spread of all the values of Y, and this is true for all the values of X. In that case, you may see that as the allowance of the student increases, the number of texts sent also increases.

• If points with larger-than-average values of one variable tend to have larger-than-average values of the other, and points with smaller-than-average values of one variable tend to have smaller-than-average values of the other, the scattering of the values of Y in vertical slices through the scatterplot will be smaller than sy. The scatterplot of the variables X and Y would show positive association. Simply put, the values of Y tend to increase as the values of X increase.

• If points with larger-than-average values of one variable tend to have smaller-than-average values of the other, and points with smaller-than-average values of one variable tend to have larger-than-average values of the other, the scattering of the values of Y in vertical slices through the scatterplot will again be smaller than sy; in this instance, there is negative association. Simply put, the values of Y tend to decrease as the values of X increase.
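The "vertical slice" idea can be checked numerically by comparing the standard deviation of Y inside each slice with the overall standard deviation of Y. The sketch below uses a small set of made-up (allowance, texts) pairs, not the class data, and the helper name `slice_sd` is an illustrative choice.

```python
from statistics import pstdev

# Hypothetical (daily allowance, texts per day) pairs, for illustration only.
pairs = [
    (0, 5), (0, 10), (0, 8),
    (50, 20), (50, 26), (50, 23),
    (100, 40), (100, 46), (100, 43),
    (200, 80), (200, 88), (200, 84),
]


def slice_sd(pairs, x_low, x_high):
    """Standard deviation of the y-values whose x-value lies in [x_low, x_high]."""
    ys = [y for x, y in pairs if x_low <= x <= x_high]
    return pstdev(ys)


overall_sd = pstdev([y for _, y in pairs])
print(f"overall SD of y: {overall_sd:.2f}")
for x_value in (0, 50, 100, 200):
    sd = slice_sd(pairs, x_value, x_value)
    print(f"slice at allowance {x_value}: SD of y = {sd:.2f}")
```

In this made-up data every slice has a much smaller spread than the whole list of y-values, which is what association means in the sense described above.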

In conjunction with such an analysis of the scatterplot, we may need a summary measure that would inform us of whether or not there appears to be a relationship between two variables in our data set. One such measure is called the correlation coefficient (or correlation, for short), which, together with the other four summary measures mx, my, sx, and sy, provides a basic description of two variables and their connection.

The correlation coefficient between two variables X and Y is a measure of association between the variables. It is obtained by first getting the product of the standardized scores of the X’s and Y’s, and then taking the average of the resulting products.

The correlation coefficient is denoted by the Greek letter ρ (rho). The correlation (often ascribed to Pearson) serves to measure linear association between two variables.

TECHNICAL NOTES

For a set of points (x_1, y_1), (x_2, y_2), …, (x_n, y_n), the correlation coefficient is

\[ \rho = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - \mu_x}{\sigma_x} \right) \left( \frac{y_i - \mu_y}{\sigma_y} \right) \]

In practice, what we have is a sample data set. The sample correlation coefficient, denoted as r, is computed in exactly the form given above, with the data still treated as though we had a population.

Show learners that the correlation coefficient between daily allowance (x) and the typical number of text messages (y) can be calculated by:
• first, obtaining the standardized values of these variables;
• then, computing the product of the standardized values; and
• finally, obtaining the average of these products, thus yielding the correlation coefficient.
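The three steps above can be sketched in a few lines of code. This is only an illustration: the function name `correlation` and the five data pairs are hypothetical, and the data are treated as a population (divisor n), as in the technical notes.

```python
from statistics import fmean, pstdev


def correlation(xs, ys):
    """Average of the products of standardized scores (data treated as a population)."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    mean_x, mean_y = fmean(xs), fmean(ys)
    sd_x, sd_y = pstdev(xs), pstdev(ys)
    # Step 1: standardize each variable.
    z_x = [(x - mean_x) / sd_x for x in xs]
    z_y = [(y - mean_y) / sd_y for y in ys]
    # Step 2: multiply the standardized values pair by pair.
    products = [a * b for a, b in zip(z_x, z_y)]
    # Step 3: average the products.
    return fmean(products)


# Hypothetical allowances and text counts, for illustration only.
allowance = [0, 50, 100, 150, 200]
texts = [10, 20, 25, 40, 60]
print(f"r = {correlation(allowance, texts):.4f}")
```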

Calculation of this whole process with Excel 2013 is shown in Figure 6-01.5, for pairs of data with full records of x and y. Here, the data on daily allowance (x) and the number of text messages (y) are put in the fifth and sixth columns. The eighth and ninth columns of the worksheet represent the standardized values of daily allowance (x) and the number of text messages (y), respectively.

The last two entries of the fifth and sixth columns represent the mean and standard deviation of the 27 data entries in the said columns, which are treated as a population. For instance, the item in cell E29, the mean of x, is obtained using the command =AVERAGE(E2:E28), while the item in cell E30 (the population standard deviation of x) is obtained using the command =STDEVP(E2:E28).

The eighth and ninth columns are the standardized values obtained by subtracting the mean from the data and dividing the result by the standard deviation. For instance, cell H2, which contains the standardized value of the first observation of x, is obtained using the command =(E2-$E$29)/$E$30.

Each entry in the tenth column is the product of the corresponding entries from the eighth and ninth columns. The final cell in the tenth column, which shows the average of the entries in the column, viz., 0.780283153, represents the correlation coefficient.

Figure 6-01.5 Computing correlation with Excel 2013

While the above calculations are based on the definition of the correlation coefficient, there is a much quicker way of obtaining the correlation coefficient using Excel 2013. Construct the database (i.e. the first two columns in the worksheet shown in Figure 6-01.6) and use the Excel function CORREL by entering =CORREL(A2:A41, B2:B41) into any cell outside of the database. This generates the correlation 0.780283153 between daily allowance (x) and the number of text messages (y).

Figure 6-01.6 Alternative and faster computation of correlation with Excel 2013

Alternative way of computing the correlation coefficient (in case the use of Excel is not possible)

List down all the values of X, Y, XY, X², and Y². On the bottommost row, calculate the total for each column.

X | Y | XY | X² | Y²
0 | 10 | 0 | 0 | 100
50 | 0 | 0 | 2500 | 0
150 | 18 | 2700 | 22500 | 324
… | … | … | … | …

The correlation coefficient is given below:

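\[ r = \frac{n \sum XY - \left( \sum X \right) \left( \sum Y \right)}{\sqrt{\left[ n \sum X^2 - \left( \sum X \right)^2 \right] \left[ n \sum Y^2 - \left( \sum Y \right)^2 \right]}} \]

If the numerator of the previous expression is divided by n(n − 1), you will get the sample covariance, denoted by s_{xy}. Dividing the denominator by the same quantity gives the product of the standard deviations of X and Y. Thus, the correlation coefficient can also be computed as:

\[ r = \frac{s_{xy}}{s_x s_y} \]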
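The column totals described above (the sums of X, Y, XY, X², and Y²) are all that is needed to obtain the correlation. The sketch below assumes the standard computational formula for Pearson's r; the function name `correlation_from_sums` and the four data pairs are hypothetical.

```python
import math


def correlation_from_sums(xs, ys):
    """Pearson's r computed from the column totals of X, Y, XY, X^2 and Y^2."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2)
    )
    return numerator / denominator


# Hypothetical data, small enough to verify by hand:
# totals are 10, 10, 29, 30, 30, so r = (116 - 100) / 20.
xs = [1, 2, 3, 4]
ys = [1, 3, 2, 4]
print(f"r = {correlation_from_sums(xs, ys):.4f}")
```

Because the factor n(n − 1) cancels between numerator and denominator, this gives the same value whether the standard deviations are computed with divisor n or n − 1.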

If the linear association between the variables is strong, as shown in the worked example, we would be able to predict the variable y (here, typical number of text messages) from the variable x (here, daily allowance). Ask learners why daily allowance would be a good predictor of number of text messages sent. Learners should be able to say that the number of text messages sent daily is based on the socio-economic status of the student (that could be measured by daily allowance of the student). Learners should be guided that the correlation coefficient is unit-free because it is based on the standard units of the variables. Mention to learners also that correlation can only range from –1 to 1.







• When the correlation is zero, the cloud of points is either without form (as in Figure 6-01.7) or has a nonlinear pattern (as in Figure 6-01.8). The latter is a result of correlation being a measure of linear association (and not just association) between the variables. Variables that have zero correlation are said to be uncorrelated.
• When the correlation coefficient is positive, the variables tend to increase together. As the (positive-valued) correlation coefficient gets closer to 1, the linear association between the variables gets stronger, and the points in the scatterplot cluster more tightly around a line sloping upward.
  o A correlation of exactly 1, referred to as a perfect correlation, has all the points falling exactly on a line that is sloping upward.
• The variables tend to move in opposite directions if the correlation between them is negative. When the correlation is negative, either: (a) as X increases, Y tends to decrease; or (b) as X decreases, Y tends to increase. Also, as the (negative-valued) correlation coefficient gets closer to –1, the linear association between the variables gets stronger.
  o A perfect –1 correlation will have all the points lying exactly on a line sloping downward.

Figure 6-01.7 No apparent association between X and Y

Figure 6-01.8 A nonlinear relationship between X and Y

The correlation is a unit-free measure of linear association between two variables, i.e., of the clustering of the values of the two variables around a line, and it takes values between –1 and +1. The closer the correlation between the variables is to –1, the stronger is the negative linear relationship; the closer it is to +1, the stronger is the positive linear relationship; and the closer it is to 0, the weaker the linear relationship.

Ask learners if they think that the correlation between two variables is sensitive to the order of the variables. The answer here should be no. Interchanging the variables would yield the same value for the correlation between the variables. Switching y (here, typical number of text messages) and x (here, daily allowance) in the worked example would still generate the same value for the correlation between the variables.

Help learners notice that adding a constant to all the values of one variable will not change the correlation coefficient. If one is to add 100 to all values of daily allowance in the worked example, the correlation coefficient will not be affected.

Let learners know also that multiplying one variable by a positive constant does not affect the correlation. In particular, had all learners been given an across-the-board increase of 10 percent in their daily allowances (which is equivalent to multiplying the current values by 1.10), then the correlation we calculated would remain unchanged. (But if one variable is multiplied by a negative constant, the correlation keeps its absolute value but changes direction, from positive to negative or from negative to positive.)

Learners may want to know when a correlation coefficient suggests strong association between the variables. Although there are no hard rules in determining the strength of the linear relationship based on the correlation coefficient, learners may want to use the following guide in order to interpret the correlation:

0 < |r| < 0.3 | Weak Correlation
0.3 < |r| < 0.7 | Moderate Correlation
|r| > 0.7 | Strong Correlation
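The invariance properties discussed before the guide (interchanging the variables, adding a constant, multiplying by a positive or negative constant) can be demonstrated numerically. The sketch below is illustrative only: the data are hypothetical, the helper `strength_label` applies the guide to the absolute value of r, and how it treats values exactly equal to 0.3 or 0.7 is an assumption, since the guide does not say.

```python
from statistics import fmean, pstdev


def correlation(xs, ys):
    """Average product of standardized scores (data treated as a population)."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    sd_x, sd_y = pstdev(xs), pstdev(ys)
    return fmean(
        [((x - mean_x) / sd_x) * ((y - mean_y) / sd_y) for x, y in zip(xs, ys)]
    )


def strength_label(r):
    """Rough guide from the lesson; boundary handling here is an assumption."""
    size = abs(r)
    if size < 0.3:
        return "weak"
    if size < 0.7:
        return "moderate"
    return "strong"


# Hypothetical allowances and text counts, for illustration only.
allowance = [0, 50, 100, 150, 200, 300]
texts = [10, 25, 20, 45, 40, 80]

r = correlation(allowance, texts)
print(f"original:                 {r:.6f} ({strength_label(r)})")
print(f"variables interchanged:   {correlation(texts, allowance):.6f}")
print(f"allowance + 100:          {correlation([x + 100 for x in allowance], texts):.6f}")
print(f"allowance x 1.10:         {correlation([x * 1.10 for x in allowance], texts):.6f}")
print(f"allowance x (-1):         {correlation([-x for x in allowance], texts):.6f}")
```

The first four printed values agree (up to rounding), while the last has the same size with the opposite sign.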

For instance, the correlation coefficient of 0.780283153 between daily allowance (x) and the number of text messages (y) indicates a strong positive correlation between the variables.

Notes on interpreting the correlation coefficient:

(i) A correlation of 70% does not mean that 70% of the points are clustered around a line. Nor does it mean that there is twice as much linear association as in a set of points that has a correlation of 35%.

(ii) Furthermore, a correlation analysis does not imply that the variable X causes the variable Y; that is, association is not necessarily causation (although it may be indicative of cause-and-effect relationships). Even if polio incidence correlates strongly with soft drink consumption, this need not mean that soft drink consumption causes polio. If the population of ants increases (in time) with the population of persons (and thus these numbers strongly correlate), we cannot adopt a population control program for people based on controlling the number of ants!

(iii) The presence of outliers (see Figure 6-01.9) easily affects the correlation of a set of data, so it is important to take the correlation figure with a grain of salt if we detect one or more outliers in the data. In some situations, we ought to remove these outliers from the data set (especially those that are suspected to be poor-quality data) and redo the correlation analysis. In other instances, these outliers ought not to be removed, as these records may be “correct data” and they contain information that should not be deleted. In any scatterplot, there will usually be some points detached from the main bulk of the data, and these seeming outliers should not be rejected without due cause.

Figure 6-01.9 Outliers in Data

(iv) Moving from correlation to causation is often problematic as there are several possible explanations for a correlation between X and Y (excluding the possibility of chance). It may be that X influences (or causes) Y; Y influences X; or both X and Y are influenced by some other variable. Thus, when performing correlation analysis of variables without being given any background knowledge or theory, inferring a causal link cannot be justified regardless of the magnitude of the correlation. While there may be a causal link between alcohol consumption and deaths from liver cirrhosis, it would be difficult to infer a causal link from the high correlation between pork consumption and cirrhosis mortality. To further illustrate why correlation and causation are not equivalent, consider the high correlation of ice cream sales and the number of drowning cases; the high correlation of the absolute number of unemployed in a country across the years with the number of sunspots observed across the years; or the high correlation of the number of diseases with the number of health professionals. Based on these examples, it is flawed to conclude that eating ice cream causes drowning, that the sun causes unemployment, or that health professionals are causing diseases. The seemingly high correlation of ice cream sales and the number of drowning cases is due to the fact that ice cream sales increase as the temperature increases (during summer!). During this time, more people go to the beach, which also increases the chance of drowning. Therefore, the increase in drowning cases coincides with the increase in ice cream sales because of the season.

(D) Enrichment: Sampling Distribution of Correlations

Note: The teacher may opt to break the lesson here and continue on the following session day if class sessions run for at most 60 minutes a day.

1. Using the “samples” obtained from the database, learners should be asked to analyze the sample on their own. Each student should create a scatterplot, examine whether a linear relationship exists, point out any outliers, and compute the correlation coefficient. Once the individual activities have been completed, learners may want to work together as a class. The teacher may arrange the class into groups according to the questions they worked on. The groups must generate dot plots for the three different questions for the slope, intercept, and correlation coefficient. The following is a sample interpretation of the dot plot in Figure 6-01.10. The dot plot of the correlations reveals that in most samples the correlation is very high. The distribution illustrates that except for one outlier value of .63, the correlation tends to be greater than .85. Moreover, the majority of the samples revealed a correlation greater than .95. Thus, a good guess for the population correlation coefficient would be around .96. This would be a good guess because although the mode of the dot plot is .99, there are still some values that are a bit lower. A .96 correlation was found in three samples as well. The range of the dot plot is .36.

Figure 6-01.10 Dot Plot of Correlation of Usual Number of Text Messages with Daily Allowances

2. Break up each sample by gender and repeat the activity for each gender separately. Compare whether there are differences between males’ and females’ text messaging experiences.

(E) Advance Lesson: Rank Correlation

A measure that is less sensitive to outliers is the rank correlation, i.e., the correlation of the ranks of the x-values with the ranks of the y-values. The rank correlation may be computed instead of the typical (Pearson product-moment) correlation coefficient, especially if there are outliers in the data. This rank correlation is called Spearman’s rho. Learners may want to determine Spearman’s rho using the data they have worked on. Make sure to determine ranks properly for ties, e.g., if the fourth, fifth and sixth smallest have the same values, then they are all given a rank of 5.

KEY POINTS

• A scatterplot (or scatter diagram) can be used to show the relationship between two numerical variables.
• Correlation analysis is used to detect whether two variables are “linearly” related (or associated), i.e., does one variable increase when the other variable increases, or does one decrease when the other increases?
• Correlation measures the strength of association (linear relationship) between two numerical variables:
  o The correlation coefficient is a unit-free number that ranges from –1 to 1. It is not affected by interchanging the order of the variables, by adding the same number to all the values of one variable, or by multiplying all the values of one variable by the same positive number.
  o Correlation is only concerned with the strength of the relationship, and a causal effect is not implied. In some cases, e.g., the correlation of the absolute number of unemployed in a country across the years and the number of sunspots observed across the years, both variables may be simultaneously influenced by a third variable (in this example, time is the third variable).
  o The correlation coefficient can be misleading when the data have outliers or when the underlying empirical relationship between the variables is nonlinear. Whenever possible, inspect the scatterplot for issues such as these.
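The rank correlation from the Advance Lesson, including the rule that tied values share the average of their ranks, can be sketched as follows. The function names `average_ranks` and `spearman_rho` and the six data pairs (with one deliberately unusual point) are hypothetical and only meant as an illustration.

```python
from statistics import fmean, pstdev


def average_ranks(values):
    """Rank values from smallest (rank 1); tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover every value tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Average product of standardized scores (data treated as a population)."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    sd_x, sd_y = pstdev(xs), pstdev(ys)
    return fmean(
        [((x - mean_x) / sd_x) * ((y - mean_y) / sd_y) for x, y in zip(xs, ys)]
    )


def spearman_rho(xs, ys):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical data; the last pair is an outlier.
allowance = [0, 50, 100, 150, 200, 500]
texts = [10, 20, 25, 40, 60, 5]
print(f"Pearson r    = {pearson(allowance, texts):.4f}")
print(f"Spearman rho = {spearman_rho(allowance, texts):.4f}")
```

For example, `average_ranks([10, 20, 30, 40, 40, 40, 70])` assigns rank 5 to each of the three tied values, matching the rule stated in the lesson.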

REFERENCES

Many of the materials in this lesson were adapted from:

Gibson, J., McNelis, M., and Bargagliotti, A. “Text Messaging is Time Consuming! What Gives?” Retrieved from STatistics Education Web (STEW): https://www.amstat.org/education/stew/pdfs/TextMessagingisTimeConsumingWhatGives.doc

See also:

Albert, J. R. G. (2008). Basic Statistics for the Tertiary Level (ed. Roberto Padua, Welfredo Patungan, Nelia Marquez). Philippines: Rex Bookstore.
De Veaux, R. D., Velleman, P. F., and Bock, D. E. (2006). Intro Stats. Pearson Ed. Inc.
Freedman, D., Pisani, R., and Purves, R. (2007). Statistics. Fourth Edition. New York: W.W. Norton & Company.
Workbooks in Statistics 1: 11th Edition, Institute of Statistics, UP Los Baños, College, Laguna 4031.


ACTIVITY SHEET 6-01

Introduction

The advent of cell phones has changed our lives. Many of us send lots of text messages throughout the day. What factors could be related to the number of text messages a senior high school learner sends in a day? In this activity, we will explore the relationship between the number of text messages one sends in a day and a few other potential explanatory factors, such as daily allowance and the happiness of a learner. We will also explore the relationship between the height and weight of senior high school learners. Each learner will work individually on a random sample of data collected from the class (or the whole school) and then later share the results with the class. The class will then be grouped to work together in drawing further conclusions.

Choose one of the following questions to explore:
a. Does your daily allowance (x) increase or decrease with the number of text messages you send in a day (y)?
b. Does your daily allowance (x) increase or decrease with how happy you are (y)?
c. Does your weight (x) increase or decrease with your height (y)?

To answer the question you chose, you are going to take a random sample of 30 observations from the database collected at the beginning of the Statistics and Probability course. You will carry out the following steps on your own:
1. Based on the question you chose, generate a scatterplot to create a visual representation of the data. Does the relationship appear to be linear? Are there any outliers? What are some possible explanations for the existence of outliers? Should you eliminate the outliers in your data set? Why or why not?
2. Based on the question you chose, compute the correlation coefficient.
3. Interpret the correlation coefficient in the context of the question.

Now, the class will be divided into three groups and will be asked to choose one question to work on from the three provided above. Each group will complete the following table for their chosen question.
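The sampling and computation in the individual steps above can be sketched in Python as follows. The class database here is hypothetical: `class_database` is filled with made-up (daily allowance, text messages) pairs from a seeded random generator only so that the sketch runs, and `sample_correlation` is an illustrative helper name. With real class data, the list would simply hold the recorded values.

```python
import math
import random


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def sample_correlation(database, n, rng):
    """Draw n records without replacement and return (sample, r)."""
    sample = rng.sample(database, n)
    xs = [record[0] for record in sample]
    ys = [record[1] for record in sample]
    return sample, pearson_r(xs, ys)


# Hypothetical class database: (daily allowance, texts per day) for 200 learners.
generator = random.Random(2016)
class_database = []
for _ in range(200):
    allowance = generator.randint(20, 200)
    texts = max(0, round(0.3 * allowance + generator.gauss(0, 15)))
    class_database.append((allowance, texts))

# Each learner draws an individual random sample of 30 observations.
sample_a, r_a = sample_correlation(class_database, 30, random.Random(1))
sample_b, r_b = sample_correlation(class_database, 30, random.Random(2))
print(f"Learner A: r = {r_a:.3f}")
print(f"Learner B: r = {r_b:.3f}")
```

Because each learner works with a different random sample, the computed correlations will generally differ from learner to learner, which is why the results are later pooled and compared across the class.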


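ENRICHMENT

4. Collect the correlation coefficient found by each learner in the class and summarize the results in a table:

Question Chosen: _________________________ vs. _________________________________

Student   Correlation   Student   Correlation   Student   Correlation   Student   Correlation
1         ___________   6         ___________   11        ___________   16        ___________
2         ___________   7         ___________   12        ___________   17        ___________
3         ___________   8         ___________   13        ___________   18        ___________
4         ___________   9         ___________   14        ___________   19        ___________
5         ___________   10        ___________   15        ___________   20        ___________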

5. Create a dot plot for the correlations.
6. Look at the dot plot that pertains to your question. This dot plot represents an approximation to the sampling distribution of the correlation coefficient. What do you notice about the dot plot? What is the range of the correlation coefficient? What seems to be the most common correlation? If you were to guess what the correlation was for the entire population, what would be your guess? Explain why.

ASSESSMENT 6-01

I. A study was done to investigate the relationship between the amount of protix (a new protein-vitamin-mineral supplement) in fortified-vitamin rice, known as FVR, and the weight gain of children. Ten randomly chosen sections of grade one pupils were fed with FVR containing protix; different amounts (X) of protix were used for the 10 sections. The increase in weight of each child was measured after a given period. The average gain (Y) in weight for each section with a prescribed protix level (X) is as follows:

Section   Protix (X)   Gain (Y)
1         50           92.6
2         60           97.5
3         70           96.5
4         80           102.3
5         90           105.8
6         100          106.2
7         110          108.9
8         120          108.4
9         130          110.2
10        140          110.8

a. Create a scatter diagram based on the data.

b. Does the scatter diagram suggest a linear relationship? What other relationships may be tenable?
c. What is the correlation between the protix level and the average weight gain?

Answers:
(a) [Figure: Scatter diagram of average weight gain (Y) against protix level (X)]
(b) Yes, a linear relationship is tenable. A quadratic relationship may be more tenable for the data.
(c) 0.954608

II. At a large local high school, the principal wanted to ensure that her learners would perform well on this year’s standardized tests. The principal therefore came up with a list of factors that may negatively or positively affect test scores, and aimed to demonstrate their effect to the learners by giving a practice test scored out of 100 points. A month before the practice test, the principal asked learners to fill out a survey asking them how many hours per week they hung out with their friends and how many hours per week they spent in the study hall. Because the high school was very large, the principal only surveyed a sample of the learners. Below are the two scatterplots showing the survey results versus the students’ scores on the practice exam.

[Figure: Scatter plot of practice test scores (vertical axis, 50 to 110) against Hours_With_Friends (horizontal axis, 0 to 35)]
Ŷ = −2.69X + 122.87, R² = 0.71

[Figure: Scatter plot of practice test scores (vertical axis, 50 to 110) against Hours_in_Study_Hall (horizontal axis, 0.0 to 3.5)]
Ŷ = 2.85X + 76.183, R² = 0.02

Based on these two scatterplots, answer the following questions:
1. Is there a positive or negative relationship between the hours students spend with their friends and their test scores? How about the hours spent in the study hall and their test scores?
2. On average, what would the students’ scores be if they spent zero hours per week hanging out with friends? In the study hall?
3. On average, by how many points would the test score increase or decrease if each student spent one extra hour in the study hall? Hanging out with friends?

When the learners heard the results of the study, they asked the principal to look at different samples of learners in the high school. To accommodate the request of the learners, the principal decided to randomly sample groups of 20 learners at a time, for 15 more times. The following dot plots provide the summary of the results.

[Figure: Dot plot of sampled slopes for Hours in Study Hall; horizontal axis "Slope" from 1.5 to 4.5]

[Figure: Dot plot of sampled slopes for Hours with Friends; horizontal axis "Slope" from -4.0 to -1.5]

Based on the dot plots above, answer the following questions:
4. Should the learners believe that the principal’s decision to mandate spending an extra hour in the study hall every week would increase their scores on the test? Explain.
5. Should the learners try to decrease the number of hours they spend hanging out with friends before the test? Explain.

Answers
1. There appears to be a negative linear relationship between the amount of time a student spends hanging out with friends and the student’s test score. There does not seem to be a clear positive or negative relationship between the number of hours spent in the study hall and the test scores.
2. On average, a student would score 122.87 on the test if they spent zero hours per week hanging out with friends. This y-intercept does not have a practical interpretation since there is no way to score more than 100 on the test. Also note that 0 is not within the range of the collected data values for hours spent with friends. On average, a student would score 76.183 on the test if they spent zero hours per week in the study hall.
3. On average, a student’s score will change by -2.69 points for every hour they spend hanging out with friends. On average, a student’s score will increase by 2.85 points for every hour they spend in the study hall.
4. The dot plot illustrates that all the sampled slopes are positive. This means that for every one of the repeated samples of 20 learners, the slope of the regression line was positive, showing that as the number of hours spent in the study hall increases, the scores on the test increase. In particular, the dot plot shows that the slopes are, for the most part, between 2.6 and 3.6, meaning that on average, scores would rise by between 2.6 and 3.6 points for every extra hour spent in the study hall.
5. The dot plot illustrates that all the sampled slopes are negative. This means that for every one of the repeated samples of 20 learners, the slope of the regression line was negative, showing that as the number of hours spent with friends increases, the scores on the test decrease. In particular, the dot plot shows that the slopes tend to be centered around −2.5, meaning that on average, the scores would decrease by about 2.5 points for every extra hour spent hanging out with friends.
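The arithmetic behind answers 2 and 3 can be checked with a short Python sketch. The slopes, intercepts, and R² values are the ones printed under the two scatterplots; the helper names `predict` and `correlation_from_fit` are illustrative assumptions. The last helper relies on the fact that, for a straight-line fit with a single explanatory variable, R² is the square of the correlation coefficient and the correlation takes the sign of the slope.

```python
import math

# Fitted lines reported under the scatterplots: score = slope * hours + intercept.
FRIENDS = {"slope": -2.69, "intercept": 122.87, "r_squared": 0.71}
STUDY_HALL = {"slope": 2.85, "intercept": 76.183, "r_squared": 0.02}


def predict(fit, hours):
    """Average test score predicted by the fitted line at the given hours."""
    return fit["slope"] * hours + fit["intercept"]


def correlation_from_fit(fit):
    """Correlation implied by R^2, carrying the sign of the slope."""
    return math.copysign(math.sqrt(fit["r_squared"]), fit["slope"])


for name, fit in [("Hours with friends", FRIENDS), ("Hours in study hall", STUDY_HALL)]:
    at_zero = predict(fit, 0)                       # y-intercept (answer 2)
    per_hour = predict(fit, 1) - predict(fit, 0)    # slope (answer 3)
    r = correlation_from_fit(fit)
    print(f"{name}: score at 0 hours = {at_zero:.3f}, "
          f"change per extra hour = {per_hour:+.2f}, r = {r:+.2f}")
```

The implied correlation is about -0.84 for hours with friends and only about 0.14 for hours in the study hall, which is consistent with answer 1: a clear negative relationship in the first scatterplot and no clear relationship in the second.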


Biographical Notes

Jose Ramon G. Albert, Ph.D.
Team Leader
Dr. Jose Ramon Albert is a Senior Research Fellow of the Philippine Institute for Development Studies. He is a professional statistician who has written on topics spanning poverty measurement and analysis, education statistics, and statistical analysis of missing data. He has been a consultant for development agencies, including the United Nations Statistical Institute for Asia and the Pacific and the World Bank Group. He has also served in government agencies such as the Philippines’ Commission on Population, Malaysia's Economic Planning Unit, and Laos' Department of Statistics. Dr. Albert has taught in different universities, including the University of the Philippines and the De La Salle University. Dr. Albert served as President of the Philippine Statistical Association, Inc. For over fifteen years, Dr. Albert has written and co-authored various monographs, papers, and journal articles. He earned his Doctorate of Philosophy in Statistics and his Master of Science in Statistics from the State University of New York at Stony Brook. He was a Philippine Department of Science and Technology Scholar and graduated with a degree in Mathematics (Summa Cum Laude and Awardee for Excellence in Mathematics) from the De La Salle University.

Zita V.J. Albacea, Ph.D.
Writer
Dr. Zita Albacea is the current Executive Director of the Philippine Statistical Research and Training Institute. She has been teaching Statistics, spanning survey operations, special problems, and statistical theory, at the University of the Philippines Los Baños for 34 years. She served as Dean of the UPLB College of Arts and Sciences from 2011 to 2014 and UPLB Vice Chancellor for Administration from 2002 to 2004. She co-authored several teaching workbooks, lecture handbooks, and laboratory manuals for various topics in Statistics. Dr. Albacea completed her doctorate degree in Statistics at the University of the Philippines Los Baños. She also finished her master’s and bachelor’s (cum laude) degrees in Statistics at the same university.

Mark John V. Ayaay
Writer
Mr. Mark John V. Ayaay is the lead teacher in Statistics 1 at Philippine Science High School, Diliman, where he has been teaching for five years and has been recognized as an Outstanding Teacher for three school years. Prior to teaching at PSHS, he was an Instructor at the Ateneo de Manila University, where he taught Calculus and Applied Mathematics. He finished his bachelor’s degree in Mathematics at the Ateneo de Manila University, and is currently taking his Bachelor of Science in Public Health major in Biostatistics at the University of the Philippines Manila.

Isidoro P. David, Ph.D.
Writer
Dr. Isidoro P. David is one of the frontrunners of the Statistics community in the country. He served as President of the Philippine Statistical Association for two terms. He was a consultant for the Asian Development Bank, the National Statistics Office, and the Statistical Research and Training Center, among others. He also held a teaching position at the University of the Philippines. Dr. David finished his doctorate in Statistics at the Iowa State University of Science and Technology. He received his master’s degree in Statistics and his bachelor’s degree (cum laude) in Agriculture, majoring in Statistics, at the University of the Philippines Diliman. He received different local and international citations, including the Outstanding Social Scientist Award and the Outstanding Researcher Award in Mathematical Sciences, and won the 3rd Mahalanobis International Award.

Imelda E. de Mesa, Ph.D.
Writer
Dr. Imelda E. de Mesa is a Senior Lecturer at the University of the Philippines School of Statistics, Diliman, where she teaches undergraduate courses in Statistics. Before teaching at UP Diliman, she was an Assistant Professor VI at the De La Salle University Manila, where she taught courses such as Statistical Inference, Multivariate Analysis, and Nonparametric Statistics at both the graduate and undergraduate levels. She also held various leadership positions at the Philippine Statistical Association from 2003 to 2011, most notably as Board Director for five years.

Nancy E. Añez-Tandang, Ph.D.
Technical Editor
Dr. Nancy Tandang is Assistant Professor at the University of the Philippines Los Baños. She has served as Chairperson of the Research Committee at the Institute of Statistics in UPLB and has collaborated on research and lectures with partner universities and agencies, including the Ateneo de Naga, TESDA, DA, and DepEd. Dr. Tandang co-authored academic resources on Statistics such as the Training Manual on Microcomputer based Statistics for the Social Science Research and all twelve editions of Workbook on Statistics. Dr. Tandang completed her doctorate in Statistics, her master’s degree in Statistics, and her bachelor’s degree in Statistics (cum laude) at the University of the Philippines Los Baños.

Roselle V. Collado
Technical Editor
Prof. Roselle V. Collado is Assistant Professor III at the University of the Philippines Los Baños and is currently the Program Development Associate at the Office of Institutional Linkages in UPLB. She completed her master’s degree in Statistics at the University of the Philippines Los Baños under the DOST-ESEP scholarship grant. She also earned her bachelor’s degree in Statistics, cum laude, at the same university. Prof. Collado has served as statistician for different partner groups, including the Resources, Environment, and Economic Center for Studies, DAR, Inc. and SEAMEO-SEARCA. She has also served as resource speaker in lectures on Statistics around the Philippines.

Rea Uy-Epistola
Copyreader
Rea Uy-Epistola is currently Proprietress for the Material Recovery Facilities. She held consultant posts for writing projects, including those of UNICEF Philippines, Save the Children Federation, Inc., and the Climate Change Commission. Rea Uy-Epistola graduated cum laude from the University of Santo Tomas with a bachelor’s degree in Political Science.

Michael Rey O. Santos
Layout Artist
Michael Rey Santos is a freelance illustrator and graphic artist specializing in both traditional and digital media. He worked as an illustrator for publications such as Summit Media’s K-Zone Magazine, and for government agencies including PAG-IBIG and CHED. Mr. Santos also taught graphic applications and animated media at the First Academy of Computer Arts for 7 years. He graduated from the De La Salle University with a bachelor’s degree in Psychology.
