STAT 1010
STATISTICS STAT 1010
Centre for Professional Development and Lifelong Learning UNIVERSITY OF MAURITIUS
SUPPORT MATERIALS
AUTHORS
STATISTICS – STAT 1010 was prepared for the Centre for Professional Development and Lifelong Learning, University of Mauritius. The Pro-Vice-Chancellor – Teaching and Learning acknowledges the contribution of the following course team members:
Dr V Jowaheer

Faculty of Science
Mr S Kalasopatan

Faculty of Social Studies and Humanities
Dr F Khodabacus

Faculty of Engineering
Dr A Ruggoo

Faculty of Agriculture
Assoc. Prof P Veerapen

Faculty of Social Studies and Humanities
Assoc. Prof M J Pochun
August 2008
All rights reserved. No part of this work may be reproduced in any form, without the written permission from the University of Mauritius, Réduit, Mauritius.
TABLE OF CONTENTS
STUDY GUIDE:
Support Materials
How to Proceed
How to Use the Support Materials
How to Use the Textbook
Suggested Coursework
Suggested Course Map
Final Examination
Suggested Grading Scheme
Unit 1 Introduction
Unit 2 Data Collection I, OJ, Chapter 16
Unit 3 Organisation and Presentation of Data I, OJ, Chapter 1
Unit 4 Organisation and Presentation of Data II, OJ, Chapter 2
Unit 5 Organisation and Presentation of Data III, OJ, Chapter 3
Unit 6 Measures of Central Tendency, OJ, Chapter 5
Unit 7 Measures of Dispersion, OJ, Chapter 9
Unit 8 Time Series Analysis, OJ, Chapters 6 and 7
Unit 9 Index Numbers, OJ, Chapter 8
Unit 10 Probability, OJ, Chapter 11
Unit 11 Data Collection II, OJ, Chapters 15 and 16
Unit 12 Linear Relationship Between Variables – I: Correlation, OJ, Chapter 23
Unit 13 Linear Relationship Between Variables – II: Regression, OJ, Chapter 23
STUDY GUIDE:
Welcome to STATISTICS. This is a one-semester course designed to cover first-year syllabuses of programmes of studies in the various faculties. The course provides an introduction to Statistics and the manual is designed to guide you through the course. The Study Guide contains important information on materials and procedures. We suggest that you spend some time reading it, and familiarise yourself with what you will have to do to complete STATISTICS successfully. The suggested course map, page (vii), indicates what you should be working on each week. If you have any questions arising from the instructions in the support materials, do not hesitate to contact your tutor.
SUPPORT MATERIALS AND TEXTBOOK
This document can be used as SUPPORT MATERIALS. The module also includes the following TEXTBOOK: Owen, Frank and Jones, Ron. (4th Edition) Statistics. Pitman. The textbook will be referred to as (OJ) in the Support Materials.
HOW TO PROCEED?
You should begin by taking a look at the TABLE OF CONTENTS in both the SUPPORT MATERIALS and the TEXTBOOK. These tables provide you with a framework for the entire course because they outline the organisation and structure of the material you will be covering. You will notice that the Units in the support materials do not follow the same sequence as the Chapters in the textbook. However, in the Support Materials, you will be referred to the relevant parts of the various Chapters. The guidelines that follow are designed to help you work your way through the materials in this course as effectively as possible. So, before you begin Unit 1 of the course, read the guidelines below carefully.
How to Use the Support Materials?
The Support Materials provide you with study plans and commentaries on the textbook presentation. They introduce additional concepts and information, advise you to do particular practice activities, and offer clarification, examples and solutions. Take a few minutes now to glance through the entire manual to get an idea of its structure. Notice that the format of each unit is fairly consistent throughout the support materials. For example, each unit begins with a UNIT STRUCTURE, an OVERVIEW and a list of LEARNING OBJECTIVES. The UNIT STRUCTURE and OVERVIEW identify the main topics in the Unit. You should begin your study of each unit by reading this brief introduction. You should then read the LEARNING OBJECTIVES. The importance of these objectives cannot be overstated. They identify the knowledge and skills you will have acquired once you have successfully completed the study of a particular unit. Keep the objectives in mind as you read the corresponding content in your textbook. The learning objectives also provide a useful guide for review.
How to Use the Textbook?
Studying requires that you take an active role. Therefore, use your textbook actively, recognising it for the useful “learning tool” that it is. You should be studying pencil in hand, circling important concepts and making summary notes to crystallise your understanding. You may like to highlight or underline the key ideas. If so, remember that a rule of thumb is to mark no more than one quarter to one third of the material. If you over-highlight, you may be extracting more than the key ideas.
Suggested Coursework
STATISTICS is designed to be completed in one semester, with weeks one to thirteen for instruction, weeks fourteen and fifteen for review and with the final examination as scheduled by the Faculty. Although you are free to work at your own pace, you should try to distribute your workload according to the suggested course map on page (vii). In order to complete STATISTICS you must read the instructional units. Generally, each of these will direct you to study specific Chapters in (OJ), though some of the units will be almost self-contained. The objectives are tied to particular sections of the textbook. Review these objectives when you have completed a section to confirm that you have achieved the learning goals for it. If you realise that you are not clear about some aspects of the section, go back and redo the relevant readings and exercises. It is important to build your understanding of Statistics patiently and thoroughly. The units contain directions to do various practice activities. You will find answers to these exercises either in the textbook or in the unit itself. The practice exercises are designed to reinforce the learning objectives for each part of the course. Thinking through these activities will train you in the skills you need for the examination and for later applications of Statistics.
SUGGESTED COURSE MAP
Week     Unit    Topic                                      Tutorial
1        1       Introduction                               1
2        2       Data Collection I                          2
3        3       Organisation & Presentation of Data I      3
         4       Organisation & Presentation of Data II
4        5       Organisation & Presentation of Data III    4
5        6       Measures of Central Tendency               5
6        7       Measures of Dispersion                     6
7        8       Time Series Analysis                       7
8                CLASS TEST*
9        9       Index Numbers                              8
10       10      Probability                                9
11       11      Data Collection II                         10
12       12      Relationship between Variables I           11
13       13      Relationship between Variables II          12
14, 15           REVISION
*Week/date for Class Test to be confirmed during the semester.
FINAL EXAMINATION
♦ Scheduled and administered by the Registrar’s Office
♦ A two-hour paper at the end of the Semester.
SUGGESTED GRADING SCHEME
Invigilated class test : 30%
Final Examination : 70%
Now, it is time to get to work. Good luck and we hope you enjoy the course.
UNIT 1
INTRODUCTION
Unit Structure
1.1 What Is Statistics?
1.2 Definition and Measurement
1.3 Nature of Statistical Data
1.4 A Last Word
1.1 WHAT IS STATISTICS?
In various aspects of life, we come across many questions whose answers are not immediately and accurately available. Very often, there is insufficient information, a lack of knowledge or no information at all: there may exist varying degrees of uncertainty with regard to possible answers to these questions. For example, we may ask ourselves: Shall we have enough rainfall this summer? How many cyclones shall we have during this summer? What is the pass rate for the B.Sc. Management or B.Sc. Economics course? What is the level of unemployment in Mauritius? Are University students satisfied with the canteen facilities available on campus? Are our industries able to compete on the world market?
Statistics is that branch of knowledge which provides us with tools/techniques to answer, at least to some extent, the above questions and many more such questions. To do so, we need, on the one hand, a minimum level of knowledge (i.e. understanding) and, on the other, information/data already available. If information/data are not available, then the first step is to collect them. Statistics deals thoroughly with the collection of data/information with a prime objective in mind: the data collected should be of a high standard. The data collected constitute the raw materials of any statistical analysis. Thus, if we would like to know whether university students are satisfied with available canteen facilities, we may choose to collect the views of all students or of a small percentage of students, provided that this small percentage of students is selected in an unbiased manner and is representative of the whole student body. Whatever the approach, much can be learnt from the data, provided we are sufficiently careful about what is being collected and about how data are being collected.
Once data are collected, there is the need to organise and present them in a manner calculated to reveal their salient features and any underlying pattern. Thus, the organisation and presentation of data are most important for the interpretation of data. This interpretation may be very basic or, at times, rather advanced. We then have to analyse the data to uncover, with precision, patterns which exist in the data set and relationships previously unknown. Uncertainties can then be handled with some precision and can even be assessed, using probability theory and related ideas. Sophisticated analysis of the data can be carried out if necessary; statistical models are developed. Finally, comprehensive reports together with conclusions and recommendations are produced so that the relevant authorities may, in turn, take appropriate policy decisions. Statistical data and their statistical analysis are essential ingredients for decision making in almost any sphere of life: for government, business, community and individuals. The above definition of Statistics can be summarised by the following diagrammatic representation.
Our starting point is always the need to study a specific issue/problem/phenomenon concerning people/society at large (e.g. students’ problems) or nature (e.g. the weather) or any interaction between people and nature (e.g. agriculture).
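[Diagram: a cycle in which issues concerning People and Nature lead to Collection of Data, then Organisation and Presentation of Data, then Analysis of Data, then Conclusion and Recommendation, which in turn feed back to People and Nature.]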
Figure 1.1 : Diagrammatic Definition of Statistics
This process continues indefinitely since implementation of the recommendations will ultimately create a new situation and most probably a better understanding of the problem/issue/phenomenon under consideration. Then, at a later stage, the need for new information/data will be felt, if only to assess the impact of these same recommendations over time. In a similar manner, scientific experiments/observations are carried out to help us to study and understand the world around us and to develop science and technology in general. The scientific data collected are then analysed accordingly. Statistics indeed plays a key role not only in the collection of scientific data but in the very development of scientific knowledge. So much so that Professor A.F.M. Smith of Imperial College of Science, Technology and Medicine, UK defines Statistics as “The Science of doing Science” (1996).
1.2 DEFINITION AND MEASUREMENT
Let us give some thought to one of the questions raised in the previous section: What is the level of unemployment in Mauritius? Statisticians, scientists and many other people take much time to measure a particular variable or set of variables. It is relatively easy to measure the length of a table; but it is an entirely different matter to ‘measure’ the level of unemployment in Mauritius. To be able to do so, we must know what the term ‘unemployment’ means, not only in broad general terms but in precise operational terms. In other words, ‘unemployment’ must be precisely defined before it can be ‘measured’. Thus, how do we define an unemployed person? The Central Statistical Office would define someone as unemployed if that person was not employed and was available for work and looking for work. But then, this raises other questions. For how long was the person not employed: for a day, for a week, ...? Is a full-time student who holds no job unemployed? Is a worker on strike unemployed?
In this introduction, we are not going to provide answers to all these questions. But they drive home the point that, in Statistics, precise definition of a variable is most important, not only in broad general terms but in operational terms as mentioned above. The definition must be such that measurement is then possible. Sometimes a good theoretical definition of a variable does not lend itself easily to measurement; it has to be adapted from a practical point of view so that measurement is possible. Furthermore, definitions of a given variable may vary over time and methods of measurement may vary too! Hence particular care must be given to the problems of definition and measurement of a given variable so that these measurements are comparable over time and space as well.
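To see what an operational definition looks like in practice, the three-part rule quoted above (not employed, available for work, looking for work) can be written as an explicit test. This is only a minimal sketch: the `Person` record, its field names and the two example persons are hypothetical, and the rule deliberately leaves open the questions raised in the text, such as the length of the reference period.

```python
from dataclasses import dataclass


@dataclass
class Person:
    """Hypothetical survey record; the field names are illustrative assumptions."""
    employed: bool            # held a job during the reference period
    available_for_work: bool  # could take up work if a job were offered
    looking_for_work: bool    # actively sought work during the reference period


def is_unemployed(person: Person) -> bool:
    """Apply the three-part rule: not employed, available for work, looking for work."""
    return (not person.employed
            and person.available_for_work
            and person.looking_for_work)


# Assumed example: a full-time student who holds no job and is not seeking one.
student = Person(employed=False, available_for_work=False, looking_for_work=False)
# Assumed example: a person with no job who is available and searching.
job_seeker = Person(employed=False, available_for_work=True, looking_for_work=True)

print(is_unemployed(student))
print(is_unemployed(job_seeker))
```

Once the rule is written down in this way, every record is classified in the same manner, which is what makes the resulting counts comparable over time and space.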
1.3 NATURE OF STATISTICAL DATA
The discussion in the two preceding sections will most probably help us to become aware of the fact that available data must be used with some caution. For this reason, data are categorised into two types: primary data and secondary data. Primary data are data which have been collected for a specific purpose and are being used for that purpose. That would include, amongst others, data collected by someone by means of a sample survey or an experiment with some clearly defined objectives in mind. Secondary data are data available in many statistical publications produced by the Central Statistical Office and by other institutions, whether governmental or from the private sector. They include data which have been collected for a specific purpose but which are being used for various other studies. Thus, government departments may collect data for administrative reasons, not gathered specifically for the particular study which is being carried out.
It is obvious that secondary data must be used with much caution. To start with, the sources of secondary data must be known. This helps to ascertain that the data are genuine and that they have been produced by competent institutions having the required expertise. Various valuable pieces of information would then be available:
(i) the definitions of variables used and problems of measurement involved;
(ii) the method of data collection, for this will give us an idea of the degree of accuracy of the available data;
(iii) the date of collection, which would be relevant with respect to possible changes in the definitions used over time and with respect to how up to date the data are;
(iv) the units of measurement used; for example, the average monthly salary of a Mauritian in rupees is not comparable to the average monthly salary of an English person in pounds sterling. Similarly, the month as a measure of time is not constant since each month does not have the same number of working days.
Finally, it may be appropriate to note that data may be collected on, for example, the whole student body or on a fraction of the student body, as mentioned in section 1.1. Sometimes a statistical investigation is carried out on the entire group of units/individuals about which information is wanted; such an entire group is known as the statistical population. We have thus the population of students, population of cattle, population of buildings etc. A sample, however, is a part of the population used to gain information which, after proper statistical analysis, can be generalised to the whole population. More will be said on samples and different types of statistical investigations in other units.
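The remark in point (iv) that the month is not a constant unit of time can be illustrated with a small calculation. The figures below are made up purely for illustration; the sketch simply divides each monthly total by the number of working days in that month before comparing.

```python
# Made-up monthly totals and working-day counts, for illustration only.
monthly_output = {"February": 4000, "March": 4400}
working_days = {"February": 20, "March": 22}

# Express each total per working day so that the months become comparable.
per_day_output = {
    month: monthly_output[month] / working_days[month]
    for month in monthly_output
}

for month, total in monthly_output.items():
    print(f"{month}: total = {total}, per working day = {per_day_output[month]:.1f}")
```

In this invented example the raw monthly totals differ, yet the output per working day is the same in both months, so the apparent increase is due only to the change in the unit of measurement.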
1.4 A LAST WORD
Statistics is a fast-developing subject with a wide range of applications: biometrics, econometrics, psychometrics, statistical quality control, etc. Over the last sixty years or so, there has been a constant flow of new ideas in Applied Statistics as well as in Theoretical Statistics and probability. So much so that different schools of thought have emerged in Statistics. This is a healthy sign in a developing subject. For our purposes, we may say, in simple terms, that the objective of Statistics is the understanding of information contained in data characterised mainly by uncertainty. That understanding demands one essential ingredient on your part: common sense! Everything else would be straightforward. In fact, the psychologist S. S. Stevens referred to Statistics as “... a straightforward discipline designed to amplify the power of common sense in the discernment of order amid complexity.”
UNIT 2
DATA COLLECTION I
Unit Structure
2.0 Overview
2.1 Learning Objectives
2.2 The Collection of Data I
2.2.1 Introduction
2.2.2 Quantitative v/s Qualitative Approach
2.3 Routine Data Collection (as By-product of Administrative Procedures) v/s Special Investigations
2.4 Censuses v/s Sample Surveys
2.4.1 Introduction
2.4.2 Comparative Advantages of Sample Surveys over Censuses
2.4.3 Sources of Errors in Censuses and Sample Surveys
2.4.3.1 Sampling Errors
2.4.3.2 Non-Sampling Errors
2.5 Mode of Administration of a Questionnaire
2.5.1 Face to Face Interviewing
2.5.2 The Postal Method
2.5.3 The Telephone Method
2.6 Stages in a Sample Survey
2.7 Summary
2.0 OVERVIEW
This unit introduces you to the various approaches to data collection, the basic principles and various ways of collecting quantitative data. Comparisons of the relative strengths and weaknesses of alternative methods are included. Data collection is covered in OJ in Chapters 15 and 16. However, note that the material in OJ on sampling (Chapter 15) is not considered appropriate for this course. You may find Chapter 16 of OJ useful supplementary reading to the material in this manual.
2.1 LEARNING OBJECTIVES
When you have successfully completed this Unit, you should be able to do the following:
1. Identify the various methods of collecting quantitative data
2. Differentiate between censuses and sample surveys as means of collecting quantitative data
3. Explain the various ways of administering a survey questionnaire and analyse their relative strengths and weaknesses
4. Identify the various stages involved in a sample survey
2.2 THE COLLECTION OF DATA I
2.2.1 Introduction
In Unit 1, the importance of statistical data for informed decision making and planning was mentioned. However, data do not just exist. They have to be collected. And data collection can be a complex and technical task. It can also be very costly and time-consuming. The coverage of data collection in this course therefore is not intended to equip you to embark on a complex and large-scale data collection exercise on your own (much further study will be required for this!) but rather to provide you with a basic appreciation of the general principles of data collection, the various stages involved, the dangers to avoid and the precautions to take. Additionally, this unit should encourage you to examine published data with a more critical mind, to appreciate their limitations as well as their strengths and to exercise caution in their use.
2.2.2 Quantitative v/s Qualitative Approach
There are two broad approaches to collecting data: the qualitative approach and the quantitative one. Each of these approaches has its merits and limitations. The distinguishing feature of the quantitative approach is that it uses standard instruments and procedures (e.g. a standard questionnaire, with fixed sequence and phrasing of questions and uniform field procedures). This makes responses comparable and allows them to be aggregated so as to produce percentages, rates, averages etc. Hence it is possible, for example, to estimate the proportion of a given population who possess a certain characteristic or to quantify the extent to which specific views or attitudes are held. Sample surveys using standard questionnaires and uniform field procedures represent a major example of the quantitative approach. By uniform field procedures we mean, for example, that questionnaires are administered to all respondents in the same way, say by face to face interview, and that interviewers are trained to ask the questions and to deal with any problems arising in the field in exactly the same way. The great advantage of the quantitative approach is that the results are quantifiable and generalisable.
In the qualitative approach, instruments and procedures are more flexible and informal. There is usually no standard questionnaire: the ordering of questions may vary and the phrasing of questions is not rigid. Examples of the qualitative approach are the key informant approach (where persons having specialised knowledge of the subject of interest, by virtue of their occupations, are interviewed) and the focus group approach (where people are interviewed in groups, in a rather informal way). Further examples (by no means an exhaustive list) of the qualitative approach are participant observation and case studies. Certain qualitative approaches have the advantages of low cost, rapidity and depth, but the emphasis with qualitative approaches is not on quantitative information. Thus, for example, interviews of trade union representatives and focus groups of a small number of workers may indicate that the majority of workers are against a proposed measure and that men are more strongly opposed than women. However, it would not be possible, with any confidence, to generalise these conclusions to all workers, and still less to quote percentages of those for and against the measure. In this course we shall deal only with the quantitative approach.
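The aggregation step that the quantitative approach makes possible can be sketched in a few lines. The answers below are invented for illustration; the point is only that, because every respondent was asked the same question in the same way, the replies can be counted and turned into percentages.

```python
from collections import Counter

# Made-up answers to one fixed-wording question, for illustration only.
responses = ["for", "against", "against", "for", "against",
             "undecided", "against", "for", "against", "against"]

counts = Counter(responses)   # number of respondents giving each answer
total = len(responses)
percentages = {answer: 100 * n / total for answer, n in counts.items()}

for answer, pct in sorted(percentages.items()):
    print(f"{answer}: {pct:.0f}%")
```

Whether such percentages can be generalised to the whole population depends on how the respondents were selected, which is taken up in the sections on sample surveys.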
2.3 ROUTINE DATA COLLECTION (AS BY-PRODUCT OF ADMINISTRATIVE PROCEDURES) V/S SPECIAL INVESTIGATIONS
Often there exist opportunities for collecting quantitative data in the course of administrative control procedures. For example, every person entering Mauritius has to go through the immigration authorities, as is the practice in all other countries. This provides an opportunity for collecting information on tourist arrivals, which is in fact done through the well known disembarkation card. Similarly, anyone importing goods into the country has to go through customs for control and taxation purposes, but this also provides an opportunity for collecting data on imports such as the type of product, the origin, etc.
Collection of data as a by-product of administrative control is generally inexpensive. Often the same forms or schedules are used for both administrative and statistical purposes. However, much care must be taken in designing these forms or schedules, as what is suitable for administrative purposes may not always be relevant for statistical purposes. In particular, attention must be given to the definitions of terms used. Also, care must be taken not to burden the administration too much by making the forms too long or complicated. Sometimes separate forms for statistical purposes are necessary.
It is not always possible to obtain the data one needs as a by-product of administrative procedures. It then becomes necessary to conduct special, dedicated investigations, with the specific purpose of collecting the required data. This process can be quite costly, but the importance and potential use of the data may well justify the expenditure. Two alternative approaches are possible. The investigation may involve collecting data in respect of every member of the population of interest (i.e. a census). Alternatively, it may involve collecting data in respect of a sample of the population. We discuss these two approaches next.
2.4 CENSUSES V/S SAMPLE SURVEYS
2.4.1 Introduction
A census involves the collection of data in respect of every member of a population of interest. Familiar examples of censuses are the Housing and Population Censuses carried out in Mauritius by the Central Statistical Office every ten years. A sample survey involves the collection of data in respect of only some of the members of the target population but with the purpose of learning about the whole of that population. Examples of important national sample surveys carried out by the Central Statistical Office in Mauritius are the Family Budget Sample Survey and the Labour Force Sample Survey.
This idea of examining a part to learn about the whole, which is what a sample survey is all about, is familiar and intuitively appealing, and we apply it in our everyday lives, often unwittingly. For example, when buying grain for the household, we usually examine a handful to check the quality before making our purchase. Of course, in order for observations on the part to provide a valid basis for conclusions about the whole, certain precautions must be taken in the selection of that part. We cannot simply use any part. We should ensure that every member of the population has a fair chance of selection, and this is achieved by a method of selection which we call random selection. We should also aim at drawing a sample that is likely to be representative of the whole population. We shall not pursue the matter further here, but in Unit 11 we shall discuss the basic principles involved in selecting valid samples.
2.4.2 Comparative Advantages of Sample Surveys over Censuses
As a means of collecting quantitative data, the sample survey has a number of advantages over the census approach.
(i) Sample surveys require fewer resources and are far less costly than censuses. When the population of interest is large, a census becomes a very costly exercise. For a small population, a census could be considered, as the cost may be moderate. For large populations, it is generally avoided. Nevertheless, for certain purposes, although the population may be very large, a census is absolutely necessary and a sample survey would not be appropriate. In such cases, censuses are carried out at infrequent intervals. The Population and Housing Censuses are carried out in Mauritius at 10-year intervals.
(ii) Sample surveys are less time-consuming and hence results are more timely. For a census, because of the sheer scale of the data collection, the processing of data takes a lot of time. Not so long ago, data from Population Censuses used to take years to process, even in developed countries, at times dragging on almost to the next census. Under these circumstances, the results from the census were largely obsolete by the time they were out. With the advent of electronic processing, things have improved a lot, but it still takes a number of months to process data from a population or housing census.
(iii) In a sample survey, because only a small portion of the population is involved, that portion can be studied intensively. In investigations of human populations, one important consideration is the need to limit the burden on the respondent, i.e. on the individual contacted to provide the data. In a census of a large population, therefore, the questions must be simple and factual and their number must be kept small, because many people would have the burden of answering the questions. In a sample survey, since relatively few people are involved, we can ask more questions and the questions can be more complex if necessary. Furthermore, a census is inappropriate for asking in-depth questions and questions on opinions and attitudes. Highly skilled interviewers are required in these cases. A census of a large population requires a very large number of interviewers and it is usually not easy to find such a large number of highly skilled interviewers. As an illustration of the above, it may be noted that, typically, the Population Census carried out in Mauritius involves 25 to 30 simple factual questions. However, there are sample surveys that have been carried out by the University of Mauritius involving at times over 150 questions, many of them complex ones, often dealing with attitudes, opinions and perceptions.
(iv) In certain contexts, data collection may involve destruction of the individual from whom the data are collected, in which case a census is out of the question. For example, studying the life of electric bulbs would involve lighting them until they burn out. Hence, if a bulb manufacturer used a census to study the life of his bulbs, he would soon be left with no bulbs to sell!
In spite of the above, censuses are sometimes necessary because of the level of detail required. For example, for local planning purposes, detailed information about all towns and villages of the country is required. A national sample survey will not contain enough members of each town or village for accurate information in respect of each of them to be obtained. Indeed, certain towns or villages may not appear in the sample at all. Censuses may also provide the sampling frame for future sample surveys.
2.4.3 Sources of Errors in Censuses and Sample Surveys
2.4.3.1 Sampling Errors
Suppose that we want to find out the average weight of all students of the University of Mauritius, and we do this by selecting a sample of, say, 200 students in accordance with the principles of scientific sampling (to be discussed in Unit 11). We then find the average weight of our sample of students. What we get is an estimate. We may expect this estimate to be close to the true average weight of all students but we cannot expect it to be exactly equal, except by coincidence. Differences between the estimates based on samples and the true population values are called sampling errors.
If we were to start anew and repeat the process, i.e. draw a sample again (putting back the 200 students), we would most likely have a different sample, although there may well be some students who appeared in the first sample as well. If we now compute the average weight of the sampled students again, we expect the result to be different from the first time, except by coincidence. Such differences are called sampling variation. Thus the estimates based on samples are not in general exactly equal to the true population values and they also vary from one sample to another.
However, if the size of the sample is sufficiently large, we can be reasonably certain that the estimate will be close to the true population value. The theory of sampling (which is beyond the scope of this course) gives us this guarantee. This guarantee gives sampling its power and makes it a viable alternative to complete enumeration (census). Thus national surveys using samples of between 1000 and 3000 individuals are carried out in many countries (with populations of several millions). Actually, sampling theory enables one to estimate the required sample size for a given degree of precision. We shall not go into this, but please note that the common belief that the larger the population, the larger the required sample, is not quite true. In fact, the required sample size hardly depends on the population size, provided the population is large relative to the sample. Of course, because censuses involve complete enumeration, they are not subject to sampling errors.
2.4.3.2 Non-Sampling Errors
It is commonly believed that because a census is an exhaustive exercise and is therefore not subject to sampling errors, it must be more reliable than a sample survey. This is not necessarily the case. Both the census and the sample survey are subject to other errors called non-sampling errors.
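Before turning to the individual types of non-sampling error, the ideas of section 2.4.3.1 can be made concrete with a small simulation. This is only a sketch under stated assumptions: the "population" of 10,000 weights is synthetic (drawn from a normal distribution with an assumed mean of 60 kg and standard deviation of 10 kg), the function name `sample_mean` is hypothetical, and simple random sampling without replacement stands in for the principles of scientific sampling discussed in Unit 11.

```python
import random
import statistics

random.seed(1)  # fixed seed so that the run is repeatable

# Synthetic "population" of student weights in kg (invented, not real data).
population = [random.gauss(60, 10) for _ in range(10_000)]
true_mean = statistics.fmean(population)


def sample_mean(n: int) -> float:
    """Mean weight of a simple random sample of n students, drawn without replacement."""
    return statistics.fmean(random.sample(population, n))


first = sample_mean(200)
second = sample_mean(200)  # a fresh sample, as if the first 200 had been put back

print(f"true mean       : {true_mean:.2f}")
print(f"first estimate  : {first:.2f} (sampling error {first - true_mean:+.2f})")
print(f"second estimate : {second:.2f} (sampling error {second - true_mean:+.2f})")

# Sampling variation: spread of the estimate over many repeated samples,
# for two different sample sizes.
spread_200 = statistics.stdev([sample_mean(200) for _ in range(300)])
spread_800 = statistics.stdev([sample_mean(800) for _ in range(300)])
print(f"spread of estimates with n = 200: {spread_200:.2f}")
print(f"spread of estimates with n = 800: {spread_800:.2f}")
```

The two estimates differ from the true mean (sampling error) and from each other (sampling variation), and the spread of the estimates shrinks as the sample size grows, which is the sense in which a sufficiently large sample gives an estimate close to the population value.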
Important types of non-sampling errors are the following:
(i) errors of omission: omission occurs when individuals who belong to the target population are forgotten or somehow not reached. For example, homeless people or people who have no fixed abode may easily be missed.
(ii) non-response: non-response occurs when people contacted are not at home or refuse to participate in the survey. Non-response is a serious problem because people who refuse to participate may have different opinions on aspects pertinent to the subject of the survey from those who cooperate. For example, suppose we carry out a survey on leisure and we have a lot of non-response. It is quite possible that those who did not respond are very busy and have little leisure time. Therefore conclusions based on those who responded would be misleading.
(iii) interviewer bias: interviewer bias occurs when the responses obtained are influenced by the interviewers. This may happen in a number of ways: an unskilled interviewer may by his/her intonation or facial expression during interviewing, by the way he or she tries to clarify a question which has not been understood or probes for more information in case of an ambiguous or incomplete answer, influence the respondent to answer in a particular way. It may also occur by misinterpretation and misrecording of answers, caused by the interviewer’s preconceptions.
(iv) coder bias: coder bias may occur when answers to questions which have been recorded verbatim by the interviewer are coded in the office for the purposes of analysis. Interpretation given to answers and the codes assigned as a result may be influenced by the coders’ preconceptions.
All of these errors can occur with a census, as well as with a sample survey. However, because a sample survey involves only a small number of respondents, the efforts made to minimise non-sampling errors can be more intensive than they could be for a census.
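As a numerical footnote to the remark in the discussion of sampling errors that the required sample size hardly depends on the population size, the sketch below uses one standard textbook formula for estimating a proportion, with a finite population correction. The formula is not derived in this course, and the function name and the default values (a margin of 3 percentage points, 95% confidence, p = 0.5) are assumptions chosen purely for illustration.

```python
import math

def required_sample_size(population_size, margin=0.03, p=0.5, z=1.96):
    """Approximate sample size needed to estimate a proportion p
    to within +/- margin (illustrative defaults, not from the manual)."""
    # Sample size if the population were infinitely large
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Finite population correction: shrinks n0 slightly for small populations
    n = n0 / (1 + (n0 - 1) / population_size)
    return math.ceil(n)

for N in (10_000, 100_000, 1_000_000, 10_000_000):
    print(f"population {N:>10,}: sample of about {required_sample_size(N):,}")
```

Under these assumptions the required sample grows only marginally as the population increases a thousandfold, which is why samples of between 1000 and 3000 individuals can serve populations of several millions.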
2.5
MODE OF ADMINISTRATION OF A QUESTIONNAIRE
The process of implementing a questionnaire designed for data collection, i.e. of getting the questionnaire completed, is called administering the questionnaire. There are four basic ways of doing this:
(i) by observation
(ii) by face to face interviewing using interviewers
(iii) by mailing the questionnaire to all individuals from whom the data are to be collected and asking them to complete and return it to the investigator
(iv) by interviewing individuals by telephone.
The scope for collecting information by observation is rather limited as the method requires that the phenomenon being studied be observable. Some interesting possibilities do nevertheless exist. It is thus possible to study the intensity of traffic flow by standing at a particular spot and observing the number of vehicles that go by. However, in the discussion which follows, we shall restrict ourselves to the other three modes. Choosing among these alternative modes requires a thorough knowledge of their relative strengths and limitations.
2.5.1
Face to Face Interviewing
The face to face method of administering a questionnaire has a number of advantages:
(i) The response rate tends to be high, possibly because people find it hard to refuse when the interviewer is standing right in front of them. Several sample surveys carried out by the University of Mauritius using face to face interviewing have easily reached 95% response.
(ii) The face to face approach, because it uses trained interviewers, makes it possible to administer a complex questionnaire (e.g. a questionnaire which contains attitude and opinion questions and a lot of skip instructions). When the questionnaire is self-administered (i.e. filled in by the respondent), as in a postal survey, the questionnaire must be kept simple.
(iii) The face to face method provides an opportunity for the interviewer to find out the reasons for any reticence on the part of the person contacted and to persuade the person to cooperate.
(iv) The face to face method has practically no restrictions on the type of population that can be investigated. With the face to face method, the interviewer reads out the questions and records the answers. It is therefore not necessary for the respondent to be literate, as is the case with the postal method. The telephone interview method, however, requires that the respondent be reachable by phone.
(v) With the face to face method, there is more control over the identity of the respondent. In a postal survey, the person to whom the questionnaire is addressed may decide to pass the questionnaire on to someone else to complete in his/her place.
(vi) The face to face method can be used for practically any topic of enquiry. Some people believe that for sensitive or embarrassing topics, postal surveys are better because of the relative anonymity. However, experience shows that, given trained interviewers and the appropriate precautions, the face to face method works very well even for sensitive topics. Moreover, it is difficult to see why people would bother to answer an embarrassing questionnaire sent by mail.
(vii) The face to face method provides an opportunity for clarifying questions which the respondent finds to be unclear.
(viii) The face to face method provides an opportunity for probing (i.e. asking for additional information) if the answer given by the respondent is incomplete or ambiguous.
(ix) With the face to face method, the interviewer can ensure that the sequence of the questions as it appears on the questionnaire is respected. This is usually very important. With the postal method, respondents have the opportunity to see all the questions before answering any of them.
The great disadvantage of the face to face method is that it is costly, much more costly than either the postal method or the telephone interview method. It also requires trained interviewers.
2.5.2
The Postal Method
The main advantage of the postal method (also called the mail method) of administering a questionnaire is its relatively low cost. It must be noted, however, that the cost is not limited to the initial cost of mailing out the questionnaires: usually reminders have to be sent out and sometimes there are follow-up phone calls and even personal visits, which raise the cost. Also, the postal method does not require trained interviewers. The postal method has a number of disadvantages, relative to the face to face method:
(i) The response rate is usually low, often of the order of 30 to 40%, if not less. This is a very serious disadvantage.
(ii) The questionnaire must be kept simple.
(iii) There is less opportunity to persuade people who are reticent to answer the questions. Follow-up by phone is a possibility but it is not as effective as the face to face presence of an interviewer.
(iv) Respondents can see all the questions before they answer any. This is usually not desirable.
(v) The method is of course restricted to a target population that is literate.
(vi) It is important that the information obtained relates to the person selected and not someone else. However, with the postal method, control over who actually answers the questions is difficult. The person to whom the questionnaire is addressed may pass it on to another family member or a friend for completion. It is not certain, in that case, that the family member or friend would provide the same information as the selected person would have given. Especially in the case of opinion or attitude questions, there is a high risk that the friend or family member would substitute his or her own views in the questionnaire.
(vii) If a respondent finds a question unclear, he or she may ignore it or give an irrelevant answer. Unlike with the face to face method, there is no opportunity to detect that a respondent has misunderstood a question.
(viii) If the answer to a question is incomplete or ambiguous, there is no opportunity for probing as in the case of the face to face method.
2.5.3
The Telephone Method
In terms of advantages and disadvantages, the telephone method is intermediate between the face to face method and the postal method in many respects:
(i) The telephone method is less costly than the face to face method. However, it is generally more costly than the postal method.
(ii) The telephone method does require trained interviewers. However, travel costs and travel time are eliminated. Interviewers spend all their time in an office doing interviews by phone. Each interviewer can thus do more interviews.
(iii) The questionnaire can be more complex than with the postal method but it is not advisable to attempt to administer a very long questionnaire by phone.
(iv) There is more control over the identity of the respondent than with the postal method, although less than with the face to face method.
(v) There is opportunity for persuading reticent respondents, although the face to face method is probably more effective in doing that.
(vi) There is opportunity for clarification if questions are not clear to respondents, although, here again, this is more difficult to do over the phone than face to face.
(vii) There is opportunity for probing if respondents’ answers are incomplete or ambiguous but the same qualification as for (vi) applies.
(viii) The sequence of questions on the questionnaire can be respected. Respondents do not have the opportunity of knowing all questions appearing on the questionnaire before they start answering as in the case of the postal method.
The great disadvantage of the telephone method is that it can only be used when all members of the target population are reachable by phone.
2.6
STAGES IN A SAMPLE SURVEY
From earlier discussion, it is clear that sample surveys are an important means of collecting data. We conclude this unit with a list of the main stages involved in a sample survey:
(i) Clear definition of the objective of the survey
A clear definition of the objective is fundamental for a survey. This will help make key decisions in the subsequent stages. It is not sufficient just to define a broad objective, although one must start with that. It is necessary to break down this broad objective into finer objectives for subsequent operationalisation.
(ii) Clear definition of the target population
It is necessary to be clear about what constitutes the target population and the unit of investigation. For example, if we are doing a survey among the students of the University, do we wish to cover part time students or only full time ones? If we are doing a survey on consumer expenditure, is our unit of enquiry the household or the individual?
(iii) Sample design and determination of sample size
This will be dealt with in detail in Unit 11.
(iv) Questionnaire design
This will be dealt with in detail in Unit 11.
(v) Recruitment and training of field staff
The quality of data collected depends critically on the competence and dedication of the field staff involved. Therefore great care should be applied in the recruitment and training of such staff.
(vi) Pilot survey and pretesting the questionnaire
A pilot survey consists of a rehearsal of all the survey procedures on a small number of respondents. This process is very important as it permits the identification of any flaws or weaknesses in the questionnaire, which can thus be remedied. It also provides a lot of information about field procedures, e.g. whether the method of approaching the respondent is satisfactory, how long it takes to administer the questionnaire, how easy it is to locate the respondents, how many call backs are required on average, etc. This information helps to organise the full scale survey.
(vii) Conduct of interviews
This stage applies when face to face interviewing is used. We have mentioned the danger of interviewer bias before. Interviewers need to possess a variety of skills: approaching the respondent, establishing rapport, persuading respondents to cooperate, asking questions in a neutral manner and recording the answers correctly. Training and experience are important but supervision and control are also necessary.
(viii) Editing of completed questionnaires
Completed questionnaires may contain a number of problems, such as blanks (i.e. questions which have not been answered) and ambiguous, irrelevant or inconsistent answers. Therefore, before the data are processed and analysed, it is necessary to screen the questionnaires for such problems and remedy them. It is advisable to have a first edit carried out in the field by the interviewer immediately after the interview, as any problems can then be remedied immediately. A second edit needs to be done by field supervisors to detect any mistakes that may have gone unnoticed by the interviewer. Further edits can be done in the office, including a computer edit stage.
(ix) Coding of answers where required for data entry
Where a questionnaire contains open ended questions, i.e. questions where no precoded answers are proposed, the answers must be coded before processing. Such coding must be done carefully, ensuring that there is consistency both across and within coders, i.e. the same code is used for similar answers by different coders or by the same coder on different occasions.
(x) Data Entry
The data collected must in general be captured on computer for eventual processing and analysis. At this stage, it must be ensured that no errors are made during the transfer.
(xi) Data Processing and analysis
The data processing and analysis are usually done with the help of appropriate statistical software. The objectives as defined in the very first stage will guide the analysis.
(xii) Interpretation and report writing
Care must be taken to ensure correct interpretation and the report writing needs to take into consideration the readers targeted.
The success of a survey depends on strict observance of precautions and meticulous attention to quality control at every stage.
2.7
SUMMARY
In this unit you have studied the various methods of collecting quantitative data, the differences between censuses and sample surveys, the various ways of administering a survey questionnaire (including their strengths and weaknesses) and the various stages of a sample survey.
UNIT 3
ORGANISATION AND PRESENTATION OF DATA I
Unit Structure
3.0 The Aim and Forms of Presenting Data
3.1 Overview
3.2 Learning Objectives
3.3 Organisation and Presentation of Data I
3.3.1 Data Types
3.3.2 Tabulations
3.3.3 The Stem and Leaf Diagram
3.3.4 The Time Series
3.4 Secondary Statistics
3.5 Interpretation of Tables
3.6 Summary
3.0
THE AIM AND FORMS OF PRESENTING DATA
The aim of presenting figures is to communicate information. Therefore the type of presentation depends on the requirements and interests of the people receiving the information. There are, in effect, different types of presentation:
• Tabulation is covered in Unit 3.
• Charts and diagrams are covered in Unit 4.
• Graphs are covered in Unit 5.
3.1
OVERVIEW
Chapter 1 of your textbook (OJ) introduces the methods of arranging data in tabular form. The Textbook, as well as Unit 3 of this course manual, cover the key aspects of tabular presentation, the different types of tables and secondary statistics.
3.2
LEARNING OBJECTIVES
When you have successfully completed this Unit, you should be able to do the following:
1. Explain the importance of tabular presentation.
2. Identify the general principles of general tabulation.
3. Use the different types of tabulation.
4. Explain the importance of secondary statistics.
5. Use correctly the different secondary statistics to shed light on data.
6. Interpret information contained in tables and other forms of presentation.
3.3
ORGANISATION AND PRESENTATION OF DATA I
3.3.1
Data Types
Read pp 1-4 of textbook (OJ).
Activity 1
Attempt Questions 1.2(a), 1.3 from textbook (OJ).
3.3.2
Tabulations
Read pp 4-13 of textbook (OJ).
3.3.2.1 Construction of tables
In the construction of tables, there are important guidelines to consider:
• Be sure what you want the table to show.
• All tables should have an explanatory title.
• The source of the data must be included (usually below the table) so that the original sources can be checked.
• Tables should be neat and tidy; if drawn by hand, use good handwriting.
• To improve the quality of the table, make judicious use of different types of print.
• Column and row headings should be brief but self-explanatory.
• Units of measurement should be shown clearly.
• Approximations and omissions can be explained in footnotes. However, footnotes should be kept to a strict minimum.
• Double lines or thick lines can be used to break up a large table and make it easier to read.
• Two or three simple tables are often better than one very large table.
• Sets of data which are to be compared should be close together.
• Secondary statistics, such as percentages and averages, should be beside the figures to which they relate.
• In the particular case of frequency tables, the construction of classes should be done judiciously, with particular attention to the class boundaries and class widths.
3.3.2.2 Class boundaries, class limits, class widths and class midpoints
Two important principles must be observed when classifying data into categories: the categories should be (i) mutually exclusive, i.e. there must be no overlap among categories, and (ii) jointly exhaustive, i.e. together the various categories should cover the whole range of the data. These principles apply to the construction of frequency tables. Conversely, when studying frequency tables prepared by others, it is important to be clear about the boundaries of each class. The correct determination of class boundaries, and hence of class widths and class midpoints, is, as you will discover later on, pertinent for the computation of the mean, median etc. These boundaries are often not what they seem to be at first sight. At this stage, it is useful to draw a distinction between class boundaries and class limits:
• Class limits identify the inclusive values in a class of a frequency table.
• Class boundaries are the specific points along a measurement scale that separate adjoining classes. These can be different from the class limits.
We cannot give general rules for determining class boundaries. These have to be determined on a case by case basis, applying some common sense. The key is to try and figure out what are the smallest and largest values that would have been placed in each of the classes when the table was compiled. A consideration of whether the variable involved is discrete or continuous is also useful. Once the class boundaries have been correctly determined, the class width is obtained simply from the difference between the upper and the lower boundaries, whereas the class midpoint is obtained by averaging the same boundaries.
Example 1
Table 3.1
Length of rod (nearest cm.)    Number of rods
11 - 15                          5
16 - 20                         12
21 - 25                         23
etc.                            etc.
Since the lengths are given to the nearest centimetre, the boundaries of the first class extend from 10.5 to 15.4999....., which for practical purposes you can take as 10.5 to 15.5 so that the class width is 5 and the class midpoint is 13.0.
Example 2
Table 3.2
Hours of sunshine    Number of days
0 and under 2          3
2 and under 4         15
4 and under 6         59
6 and under 8         92
etc.                  etc.
In this case, the class boundaries coincide with the class limits.
Example 3
Table 3.3
Number of calls made    Number of subscribers
1 - 10                    9
11 - 15                  12
16 - 20                  24
21 - 25                  16
26 - 40                  14
Total                    75
Here the variable is intrinsically discrete (i.e. the number of calls can take only whole-number values) and only class limits are visible; for example, the lower class limit (l.c.l) and upper class limit (u.c.l) of the first class are respectively 1 and 10. Those for the second class are respectively 11 and 15, and so on for the other classes.
Often, for analytical purposes, discrete variables are converted into continuous variables, i.e. rather than the variable taking a countable number of values in a particular interval, we assume that it can take every possible value in this interval. Later, you will see why we do that, especially when we construct the histogram and compute the median. So, for this case, the lower class boundary (l.c.b) and upper class boundary (u.c.b) of the second class, for example, are taken respectively as 10.5 and 15.5. Similarly those of the third class are 15.5 and 20.5. Note that the class boundaries are obtained by subtracting 0.5 from the lower class limit and adding 0.5 to the upper class limit, so that the u.c.b of a class coincides with the l.c.b of the next class.
Class width = u.c.b - l.c.b
Class midpoint = (u.c.b + l.c.b) / 2 = (u.c.l + l.c.l) / 2
For example, the class width of the second class is 5 and the class midpoint is 13.
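The rule above can be sketched in a few lines of Python for the classes of Table 3.3. The function name `class_summary` and the `unit` parameter (the step between consecutive values of the discrete variable, assumed here to be 1) are illustrative choices, not notation from the textbook.

```python
def class_summary(lower_limit, upper_limit, unit=1):
    """Boundaries, width and midpoint of a class given its inclusive limits.

    Assumes a discrete variable measured in steps of `unit`, so each
    boundary lies half a unit beyond the corresponding class limit.
    """
    half = unit / 2
    lcb = lower_limit - half          # lower class boundary
    ucb = upper_limit + half          # upper class boundary
    return {
        "lcb": lcb,
        "ucb": ucb,
        "width": ucb - lcb,           # class width = u.c.b - l.c.b
        "midpoint": (ucb + lcb) / 2,  # equals (u.c.l + l.c.l) / 2
    }

# Class limits from Table 3.3 (number of calls made)
classes = [(1, 10), (11, 15), (16, 20), (21, 25), (26, 40)]
for lo, hi in classes:
    s = class_summary(lo, hi)
    print(f"{lo:>2} - {hi:<2}  boundaries {s['lcb']} to {s['ucb']}, "
          f"width {s['width']}, midpoint {s['midpoint']}")
```

For the second class (11 - 15) this reproduces the boundaries 10.5 and 15.5, the width 5 and the midpoint 13 quoted above. The same function would not apply to Table 3.2 or Table 3.4, where the boundaries have to be reasoned out from how the data were recorded.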
Example 4
Table 3.4
Age        Number of club members
10 - 19    185
20 - 29    263
30 - 39    325
40 - 49    442
etc.       etc.
The boundaries of the first class are 10 and 20 respectively. Try to figure out why. Note that ‘age’ is usually quoted as ‘age at last birthday’.
3.3.2.3 Types of tables
There are many types of tables, as you may have noticed in publications, journals and magazines and in company reports. Tables can be divided into
• Frequency tables
• Two-way tables (also called contingency tables or cross tabulations)
• General tables
Examples of frequency tables are clearly illustrated in your textbook (OJ). An example of a two-way table is provided in this unit.
Two-way tables
Example 5
Table 3.5 Student Marks in English and Maths
Student  English  Maths      Student  English  Maths
1        35       40         11       47       49
2        32       41         12       61       54
3        41       50         13       63       61
4        31       27         14       58       73
5        65       66         15       72       82
6        42       66         16       69       76
7        58       72         17       58       69
8        71       80         18       55       54
9        82       58         19       48       58
10       64       59         20       50       44
Source: University X, 1971
The table above gives the marks in English and Mathematics gained by twenty students. Arrange these results into a two-way grouped frequency distribution.
Answer to Example 5
Table 3.6 Student Marks in English and Maths
English \ Maths     0-20  21-30  31-40  41-50  51-60  61-70  71-80  81-90  91-100  Total
0-20                   A                                                        D      0
21-30                                                                                  0
31-40                         1      1      1                                          3
41-50                                       3      1      1                            5
51-60                                              1      1      2                     4
61-70                                              2      2      1                     5
71-80                                                            1      1              2
81-90                                              1                                   1
91-100                 C                                                        B      0
Total                  0      1      1      4      5      4      4      1       0     20
Source: University X, 1971
(A, B, C and D mark the four corner cells of the table; blank cells contain no students.)
We observe a direct relationship between the scores in English and Maths, as the entries cluster along the diagonal running from A to B, i.e. students doing well in Maths tend to do well in English. Had the trend been from C to D, we would have said that an inverse relationship exists, i.e. students scoring high marks in English tend to score low marks in Maths.
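The tallying behind a two-way grouped frequency distribution can be checked with a short Python sketch. The marks are those of Table 3.5; the class intervals follow the column headings of Table 3.6, and the helper name `class_index` is an illustrative choice.

```python
from collections import Counter

# Marks of the twenty students in Table 3.5, in student order
english = [35, 32, 41, 31, 65, 42, 58, 71, 82, 64,
           47, 61, 63, 58, 72, 69, 58, 55, 48, 50]
maths = [40, 41, 50, 27, 66, 66, 72, 80, 58, 59,
         49, 54, 61, 73, 82, 76, 69, 54, 58, 44]

# Inclusive class limits used for both subjects
CLASSES = [(0, 20), (21, 30), (31, 40), (41, 50), (51, 60),
           (61, 70), (71, 80), (81, 90), (91, 100)]

def class_index(mark):
    """Return the position of the class that contains the given mark."""
    for i, (lo, hi) in enumerate(CLASSES):
        if lo <= mark <= hi:
            return i
    raise ValueError(f"mark {mark} is outside every class")

# counts[(row, col)] = number of students in English class `row`
# and Maths class `col`; a missing key counts as zero
counts = Counter((class_index(e), class_index(m)) for e, m in zip(english, maths))

k = len(CLASSES)
row_totals = [sum(counts[(r, c)] for c in range(k)) for r in range(k)]
col_totals = [sum(counts[(r, c)] for r in range(k)) for c in range(k)]

for r, (lo, hi) in enumerate(CLASSES):
    cells = " ".join(f"{counts[(r, c)]:>3}" for c in range(k))
    print(f"{lo:>3}-{hi:<3} {cells}   total {row_totals[r]}")
print("Column totals:", col_totals, "grand total", sum(row_totals))
```

Because the classes are mutually exclusive and jointly exhaustive, every student falls in exactly one cell, so the row totals and the column totals must each add up to 20.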
• General Tabulation
Example 6
(a) According to the 1972 Census data published by the Central Statistical Office, out of a total of 246,000 males aged 15 and over, 169,000 were employed and 35,000 were unemployed. The remainder were inactive (i.e. were either retired, rentiers, homemakers, students, disabled or voluntarily idle). According to the same data, out of a total of 249,000 females aged 15 and over, 44,000 were in employment, 7,000 were unemployed and the rest inactive. The Central Statistical Office estimated that in 1986, there were 238,000 employed males and 106,000 employed females. The numbers of unemployed males and females were 37,000 and 18,000 respectively. The total number of males aged 15 and over was estimated at 339,000. The corresponding number of females was estimated at 343,000. (Note: The data have been rounded to the nearest thousand).
Tabulate the above information, including in your table any secondary statistics you consider useful for the interpretation of the data.
Comment on the data, especially in relation to what they reflect on the role of women. What are the main social and economic implications?
Answer to Example 6
Table 3.7 Population aged 15 and over by activity status and sex, Mauritius, 1972-1986
                   1972 †                              1986 ‡
                   Male             Female             Male             Female
Activity status    Number      %    Number      %      Number      %    Number      %
                   ('000)           ('000)             ('000)           ('000)
Employed            169     68.7      44     17.7       238     70.2     106     30.9
Unemployed           35     14.2       7      2.8        37     10.9      18      5.2
Total Active        204     82.9      51     20.5       275     81.1     124     36.2
Inactive             42     17.1     198     79.5        64     18.9     219     63.8
Total               246    100.0     249    100.0       339    100.0     343    100.0
Source: Central Statistical Office. † Census figures (74) ‡ Estimates (86)
The table reflects the considerable changes that have taken place between 1972 and 1986, in particular the large number of jobs created and the increased demand for female employment. The reduction in male unemployment probably implies a reduction in the social evils associated with unemployment : crime, violence, drug abuse, alcoholism, suicides etc. The greater participation of women in economic activity implies a changing role for women, showing a movement away from the traditional idea of home as the proper place for women. The greater employment of women also probably means increased prosperity for households but may be accompanied by difficulty in reconciling domestic and occupational responsibilities with the attendant consequences: strained relationships between spouses, neglect of children, etc. (The increased female unemployment is due not to low job creation but rather to the increased demand for jobs among women).
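The secondary statistics in Table 3.7 (the totals for the active and inactive population and all the percentages) follow from the figures quoted in Example 6. A minimal Python sketch of the arithmetic is given below; the function name `activity_table` and the layout of the `DATA` dictionary are illustrative choices.

```python
# Figures in thousands from Example 6:
# (employed, unemployed, total aged 15 and over)
DATA = {
    ("1972", "Male"): (169, 35, 246),
    ("1972", "Female"): (44, 7, 249),
    ("1986", "Male"): (238, 37, 339),
    ("1986", "Female"): (106, 18, 343),
}

def activity_table(employed, unemployed, total):
    """Return {status: (number, percentage of total)} for one year and sex."""
    active = employed + unemployed
    numbers = {
        "Employed": employed,
        "Unemployed": unemployed,
        "Total Active": active,
        "Inactive": total - active,   # the remainder of the population
        "Total": total,
    }
    # Each percentage is taken out of the total aged 15 and over
    return {status: (n, round(100 * n / total, 1)) for status, n in numbers.items()}

table = {group: activity_table(*figures) for group, figures in DATA.items()}

for (year, sex), rows in table.items():
    print(f"{year} {sex}")
    for status, (number, pct) in rows.items():
        print(f"  {status:<13}{number:>5}{pct:>7.1f}")
```

Expressing each figure as a percentage of its own column total is what makes the four columns comparable, since the totals for the two years and the two sexes differ.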
Activity 2
(a) In a recent survey, 7381 children were studied, of whom 219 attended private schools. 78% were the children of manual workers but only 40 of these children attended private schools. 1 out of every 9 children was the only child in the family (“enfant unique”); among private school attenders, the proportion of children from families with only one child was 20.1%, of whom 7 were the children of manual workers. Of the families with only one child, 567 came from the manual class. Arrange these figures in a table, calculating any secondary statistics you consider necessary, and comment on the results.
(b) Attempt Questions 1.5, 1.6, 1.17 from textbook (OJ).
3.3.3
The Stem and Leaf Diagram
Read p 14 of textbook (OJ).
Activity 3
Attempt Questions 1.15 and 1.16 from textbook (OJ).
3.3.4
The Time Series
Read pp 14-15 of textbook (OJ).
3.4
SECONDARY STATISTICS
Secondary statistics are those simple calculations which are performed using given data, to help us in our interpretation. Some examples of secondary statistics are subtotals, totals, rates, ratio and percentage.
Ratio
A ratio expresses the relationship between two quantities as a number of units of one for each unit of the other, to enable comparison.
Example 7
Three-quarters of the annual output of a factory consists of product A and one-quarter of product B. The ratio of the output of A to that of B is then 3:1, i.e. for every 3 units of A produced in a year, 1 unit of B is produced.
Percentage
"Percentage" (or per cent) means per hundred. Therefore 50 per cent is 50 out of a hundred, that is, one half. The symbol for percentage is %. To convert a fraction to a percentage, multiply by 100: for example, ¼ equals 25% (¼ x 100 = 25).
3.5
INTERPRETATION OF TABLES
When data are presented, it is important that tables provide information clearly and at the same time make an impact. Interpretation is a matter of judgement based on knowledge of the terms used in the table. It is not enough that a figure or the result of a calculation is accurate; the result also has to be understood. There is little point in arriving at a correct answer to a calculation if it is not known what it means.
3.6
SUMMARY
In this unit, you have learnt about presentation of data using the different types of tables namely frequency tables, two way tables and general tabulation.
UNIT 4
ORGANISATION AND PRESENTATION OF DATA II
Unit Structure
4.0 Overview
4.1 Learning Objectives
4.2 Organisation and Presentation of Data II
4.2.1 Introduction
4.2.2 The Bar Chart
4.2.3 The Pie Chart
4.2.4 The Histogram
4.3 Summary
4.0
OVERVIEW
This unit introduces you to the methods of organising and presenting data, using various charts and diagrams. Part of Chapter 2 of your textbook (OJ), pp. 28-39, covers the relevant topics.
4.1
LEARNING OBJECTIVES
When you have successfully completed this Unit, you will be able to construct, interpret and use the following:
1. the Bar chart.
2. the Pie chart.
3. the Histogram and Frequency Polygon.
4.2
ORGANISATION AND PRESENTATION OF DATA II
4.2.1
Introduction
Study pp 28-29 of your textbook (OJ). There are some guidelines which are important for the construction of various charts, diagrams and graphs, in the same way as we discussed for the construction of tables in Section 3.3.2.1 of Unit 3. Some of these guidelines are common:
• Be sure what you want your chart or diagram or graph to show.
• All charts, diagrams or graphs must have a title which is, as far as possible, self-explanatory.
• The source of the data must always be included (usually below the chart/diagram/graph).
• Units of measurement must be shown clearly.
• Axes should be labelled clearly and scales must be made convenient, explicit and clear.
• Where appropriate, a key must be given so as to explain clearly what each shading etc. represents.
• Charts, diagrams or graphs must be neat and tidy.
4.2.2
The Bar Chart
Study pp 29-33 of your textbook (OJ). Your textbook covers the bar chart adequately; however, certain points need to be added with regard to various charts developed from the idea of a bar chart.
It is desirable that the compound or component bar chart does not contain too many components, or else the impact on the reader may be blurred. Whenever there is a need to compare two data sets using component bar charts, it is advisable to use percentages rather than actual numbers: percentages make comparison easier, especially when charts or diagrams are used. Think why!
The example given in Fig. 2.5 on p 32 of your textbook is an example of what is commonly known as a multiple bar chart. Multiple bar charts are very useful when different characteristics (e.g. % of labour force employed in agriculture, agrarian output as % of GNP) of various units of interest (e.g. countries) need to be presented simultaneously. It is however desirable that not too many characteristics are included in the diagram; the chart might otherwise contain too much information and become rather confusing.
Sometimes, bar charts or component bar charts are drawn with the bars horizontal; in some cases, the variable on the horizontal axis is time. Such an adaptation of the bar chart is known as a Gantt chart. It is used especially for planning a project over time and for monitoring the implementation of the project against the assigned time schedule.
Activity 1
Attempt Questions 2.8 and 2.9 of your textbook (OJ).
4.2.3
The Pie Chart
Study pp 33-34 of your textbook (OJ). Your textbook tends to be too sceptical about the pie chart. In fact, the main objective of the pie chart is to show the relative importance of the component parts of a total, and the pie chart does this extremely well, provided there are not too many components. The pie chart is used widely to present statistical data to the general public as well as to highlight any shift in the relative importance of the component parts of a total over time. In the latter case, two pie charts can be drawn for data available at two different points in time.
Activity 2
The urban population, as enumerated at the 1972 and 1983 censuses, was as follows:
Table 4.1 URBAN POPULATION FOR ISLAND OF MAURITIUS
Municipal Council Area       1972       1983
Port-Louis                   133,996    133,702
Beau-Bassin - Rose-Hill       80,318     90,577
Quatre-Bornes                 50,770     63,682
Vacoas-Phoenix                47,638     53,090
Curepipe                      51,956     62,200
TOTAL                        364,678    403,251
Source: Annual Digest of Statistics, C.S.O., 1988
Represent the above information by means of pie charts.
4.2.4
The Histogram
Study pp 3439 of your textbook (OJ). Note that a histogram can only be constructed for continuous variables; thus a given discrete variable needs to be transformed into the appropriate continuous form before the histogram is constructed. Consider the following example of a simple frequency distribution in the discrete form:
39
Table 4.2 Number of faults 1 2 3 4 5 6
Number of cars (frequency)
or more
18 25 19 8 3 0 __ 73 
The variable ‘number of faults’ is discrete and is first transformed into the continuous form as follows:
Table 4.3 Number of faults 0.5 and under 1.5 1.5 and under 2.5 2.5 and under 3.5 3.5 and under 4.5 4.5 and under 5.5 5.5 and above The histogram is then constructed with the first rectangle having its base between 0.5 and 1.5 inclusive. The second rectangle will have its base between 1.5 and 2.5 inclusive, etc. Thus there is no gap between the rectangles. The rectangles must be contiguous i.e. touching each other. Similarly, a discrete grouped frequency distribution should first be transformed in its continuous form. Thus the following discrete grouped frequency distribution from p 38 of your textbook (OJ) can be transformed in the continuous form as follows:
Table 4.4
Discrete form        Continuous form
Number of calls      Number of calls
10 - 19              9.5 and under 19.5
20 - 29              19.5 and under 29.5
30 - 39              29.5 and under 39.5, etc.
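The half-unit adjustment used in Tables 4.3 and 4.4 can be sketched in a few lines of Python. This is only a minimal illustration, assuming the variable is recorded in whole units; the function name `to_continuous` and its `unit` parameter are illustrative and are not taken from the textbook.

```python
def to_continuous(classes, unit=1):
    """Turn discrete class limits (lower, upper) into continuous class boundaries.

    Assumes the variable is recorded to the nearest `unit`, so each limit is
    widened by half a unit on either side.
    """
    half = unit / 2
    return [(lower - half, upper + half) for lower, upper in classes]


calls = [(10, 19), (20, 29), (30, 39)]  # discrete form of Table 4.4
boundaries = to_continuous(calls)

for (lower, upper), (lcb, ucb) in zip(calls, boundaries):
    print(f"{lower} - {upper}  becomes  {lcb} and under {ucb}")

# Each upper boundary equals the next lower boundary, so the rectangles touch.
contiguous = all(a[1] == b[0] for a, b in zip(boundaries, boundaries[1:]))
```

The `contiguous` check mirrors the requirement that the rectangles of the histogram leave no gaps.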
Another important point needs to be highlighted in the construction of the histogram. Occasionally, we come across frequency distributions whose class intervals have very different widths. Consider the following data relating to infant deaths in Table 4.5.
Table 4.5 Infant Deaths (Deaths of Children Under 1 Year of Age) by Age and Sex, Island of Mauritius, 1986-1988
                                     1986                  1987                  1988
Age                           Both  Male  Female    Both  Male  Female    Both  Male  Female
Under 1 day                     91    48      43      60    31      29      87    53      34
1 - 6 days                     191   118      73     183   111      72     174   117      57
7 - 27 days                     75    40      35      91    63      28      51    30      21
28 days - under 2 months        25    13      12      38    21      17      28    12      16
2 - 3 months                    35    28       7      35    19      16      37    17      20
4 - 5 months                    22    12      10      21    16       5      25    14      11
6 - 7 months                    11     4       7      13     6       7      13     8       5
8 - 9 months                    16     8       8      12     8       4      16     9       7
10 - 11 months                  14     8       6      10     8       2      10     6       4
Under 1 year                   480   279     201     463   283     180     441   266     175
Under 7 days                   282   166     116     243   142     101     261   170      91
Under 28 days                  357   206     151     334   205     129     312   200     112
28 days - under 1 year         123    73      50     129    78      51     129    66      63
("Both" denotes both sexes.)
Source: Central Statistical Office, Annual Digest of Statistics 1989.
In such cases, we first compute the frequency density, which is defined as follows:
frequency density = frequency ÷ class width.
The frequency density is then used on the vertical axis, and the variable of interest is used on the horizontal axis as usual.
Table 4.6
Age            Number of deaths for both sexes, 1986 (frequency)    Frequency density
Under 1 day     91                                                  91
1 - 6 days     191                                                  191 ÷ 6 = 31.8
7 - 27 days     75                                                  75 ÷ 21 = 3.6
etc.           etc.                                                 etc.
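The first three rows of Table 4.6 can be reproduced with the short Python sketch below. The class widths (1, 6 and 21 days) are those used in Table 4.6; the variable names are illustrative only.

```python
# (class, width in days, deaths for both sexes in 1986), from Tables 4.5 and 4.6
classes = [
    ("Under 1 day", 1, 91),
    ("1 - 6 days", 6, 191),
    ("7 - 27 days", 21, 75),
]

# frequency density = frequency / class width
densities = {label: freq / width for label, width, freq in classes}

for label, width, freq in classes:
    print(f"{label}: width {width} days, frequency {freq}, density {densities[label]:.1f}")

# Rectangle area = density x width, which recovers the class frequency.
areas = {label: densities[label] * width for label, width, _ in classes}
```

The last line shows why the frequency density is plotted: the area of each rectangle, not its height, equals the class frequency.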
Thus the frequency density gives the number of deaths per unit of time (i.e. per day) and renders all frequencies comparable. The fundamental principle underlying the histogram is that what matters is the area of each rectangle and not its height. The examples considered in your textbook are merely specific applications of this fundamental principle. In these examples, most class intervals have the same width, with only two or three exceptions. Can you see the link?
The histogram and the frequency polygon give us a view of the shape of a given frequency distribution. In particular, they help to
(i) identify to what extent a particular distribution is asymmetrical, and
(ii) compare two distributions.
For the latter case, we may use, for example, two histograms (drawn to the same scales) to highlight the change in the age structure of the population of Mauritius which occurred between 1972 and 1990 (the years in which population censuses were carried out).
Activity 3
(i) Attempt questions 2.7 and 2.14 of your textbook (OJ).
(ii) The age distributions of the population as enumerated at the censuses of 1972 and 1990 for Island of Mauritius are as follows:
Table 4.7 AGE DISTRIBUTION OF POPULATION FOR ISLAND OF MAURITIUS
Age Group (years)    1972 (000's)    1990 (000's)
9 and less           220.0           191.2
10 - 19              211.9           201.6
20 - 29              133.0           202.5
30 - 39               83.9           171.0
40 - 49               74.5           102.5
50 - 59               52.8            68.1
60 - 69               31.8     }
70 - 79               13.4     }      85.4
80 and above           3.8     }
TOTAL                825.1         1,022.3
Source: (a) Annual Digest of Statistics (b) 1990 Census report, Volume II
(a) Illustrate, by means of histograms, the age distributions of the Island of Mauritius for 1972 and 1990. Comment on your findings.
(b) Draw the respective frequency polygons.
4.3 SUMMARY
In this unit, you have learnt about the presentation of data by using some charts/diagrams, namely the bar chart and its various adaptations, the pie chart and the histogram.
UNIT 5
ORGANISATION AND PRESENTATION OF DATA III
Unit Structure
5.0 Overview
5.1 Learning Objectives
5.2 Organisation and Presentation of Data III
    5.2.1 The Ogive Curve
    5.2.2 Plotting the Time Series
    5.2.3 Logarithmic Graphs
    5.2.4 The Lorenz Curve
    5.2.5 The Z-Chart
    5.2.6 The Scatter Diagram
    5.2.7 Some Examples of Bad Practice
5.3 Summary
5.0 OVERVIEW
This unit further introduces you to some graphical methods of presenting data and to their interpretation.
5.1 LEARNING OBJECTIVES
When you have successfully completed this Unit, you should be able to do the following:
1. Construct, interpret and use
   a. ogive curves.
   b. time series graphs.
   c. logarithmic graphs.
   d. Lorenz curves.
   e. Z-charts.
   f. scatter diagrams.
2. Identify some examples of bad practice whilst displaying data.
5.2 ORGANISATION AND PRESENTATION OF DATA III
5.2.1 The Ogive Curve
Study pp 39-41 of your textbook (OJ).
A minor point that has been overlooked in the textbook (example on the distribution of 100 metal pipes on p 39 of OJ) is what has actually been plotted on the x-axis of the ogive curve. Though it may be clear to some of you, for the sake of completeness, we mention that we usually plot the cumulative frequencies against the corresponding upper class boundaries (that is why the curve is sometimes called the “less than” ogive). There is, equivalently, a “more than” ogive, but for our discussion we shall refer to the one in your textbook. Ogive curves can be used to compute various secondary statistics of importance, e.g. the median, quartiles, etc. (you will learn about these later). You should also note that the point (10, 0) has been plotted; this is because no pipes are under 10 cm, so that the c.f. at that point is zero.
It might be more useful to plot the cumulative % frequency rather than the cumulative frequency. The cumulative % frequency is obtained by converting the c.f. into percentages of the total frequency. Doing so enables comparison of various frequency distributions (try to think why!), and secondary statistics such as the median, quartiles and percentiles can be read off directly from the graph.
A further point to bear in mind concerning the ogive curve is that a discrete grouped frequency distribution should first be transformed into its continuous form before the curve is constructed, e.g. (refer to Table 4.4, page 32 of manual):
Discrete form        Continuous form
Number of calls      Number of calls
10 - 19              9.5 and under 19.5
20 - 29              19.5 and under 29.5
30 - 39              29.5 and under 39.5, etc.
Solved Example
On Time Ltd has a unit of 130 workers, all performing exactly the same task: the assembly of watches. The output of the workers for the first week of October 1997 was recorded and is reproduced below:
Watches Assembled    Number of workers
up to 449             3
450 - 469             7
470 - 489            18
490 - 509            25
510 - 529            36
530 - 549            27
550 - 569            10
570 - 589             4
(a) Construct a proper presentation table from the data including percentages, cumulative frequencies and cumulative percentages.
(b) Construct an ogive from the data.
(c) Using your ogive curve, estimate
    (i) the number of workers producing less than 500 watches
    (ii) the value of x, if 40% of the workers produced x watches or more.
SOLUTION:
(a) Output of workers at On Time Ltd, First week of October 1997
Watches      Number of    Percentage    Cumulative    Cumulative
Assembled    workers                    frequency     percentage
up to 449      3            2.3            3             2.3
450 - 469      7            5.4           10             7.7
470 - 489     18           13.8           28            21.5
490 - 509     25           19.2           53            40.8
510 - 529     36           27.7           89            68.5
530 - 549     27           20.8          116            89.2
550 - 569     10            7.7          126            96.9
570 - 589      4            3.1          130           100.0
Total        130          100.0
Source: Company Records, On Time Ltd
(b) To construct the ogive curve, we first convert the discrete grouped frequency distribution into its continuous form (see section 3.3.2.2, Ex 3).
Watches Assembled    c.f.
up to 449.5            3
449.5 - 469.5         10
469.5 - 489.5         28
489.5 - 509.5         53
509.5 - 529.5         89
529.5 - 549.5        116
549.5 - 569.5        126
569.5 - 589.5        130
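The two tables above, and a reading taken from the ogive, can be sketched in Python as follows. This is an illustrative sketch rather than the method of the textbook: it assumes that successive points of the ogive are joined by straight lines, and the function name `workers_below` is hypothetical.

```python
from itertools import accumulate

# Upper class boundaries (continuous form) and class frequencies for On Time Ltd
upper_boundaries = [449.5, 469.5, 489.5, 509.5, 529.5, 549.5, 569.5, 589.5]
frequencies = [3, 7, 18, 25, 36, 27, 10, 4]

cum_freq = list(accumulate(frequencies))          # cumulative frequencies
total = cum_freq[-1]
cum_pct = [round(100 * c / total, 1) for c in cum_freq]  # cumulative percentages


def workers_below(x):
    """Read the 'less than' ogive at x, assuming straight lines between points."""
    points = list(zip(upper_boundaries, cum_freq))
    for (x0, c0), (x1, c1) in zip(points, points[1:]):
        if x1 >= x >= x0:
            return c0 + (c1 - c0) * (x - x0) / (x1 - x0)
    raise ValueError("x lies outside the plotted range")


print(cum_freq)
print(cum_pct)
print(workers_below(500))
```

A value read off a hand-drawn smooth curve will differ slightly from this straight-line reading.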
The ogive curve is shown below:
[Figure: Ogive Curve for On Time Ltd. Vertical axis: No of Workers (0 to 140); horizontal axis: Watches Assembled (440 to 600).]
(c) (i) Around 42 (using the ogive curve as above)
    (ii) Please try yourself. (Ans: x ≅ 524)
Activity 1
The following is a record of marks scored by candidates in an examination:
Table 5.1
77  59  84  73  51  43  50  81  61  53
69  37  58  63  67  61  90  61  50  60
84  56  77  57  42  43  41  49  37  21
24  35  34  50  11  52  30  16  33  67
87  64  47  59  37  92  88  30  38  22
22  49  46  50  64  23  73  73  48  26
36  51  85  71  57  45  49
(a) Tabulate the marks in the form of a frequency distribution, grouping by suitable class intervals.
(b) Construct an ogive curve for the data.
5.2.2 Plotting the Time Series
Read pp 41-45 of your textbook (OJ).
Activity 2
A large food store is open six days a week. Its sales, in thousands of kilograms, during a five-week period are as follows:
Table 5.2
Week    Monday    Tuesday    Wednesday    Thursday    Friday    Saturday
1       45.8      47.4       49.8         49.9        53.5      53.6
2       45.4      48.5       49.9         49.4        50.9      52.1
3       44.2      45.4       48.5         46.2        49.3      49.7
4       41.4      45.9       46.7         46.0        51.3      48.4
5       43.7      45.5       46.0         45.1        50.1      48.4
(a) Plot the values as a time series.
(b) Comment on your graph.
5.2.3 Logarithmic Graphs
Read pp 56-63 of your textbook (OJ).
5.2.4 The Lorenz Curve
Read pp 63-65 of your textbook (OJ).
You have learnt how the Lorenz curve is useful to illustrate the inequality prevailing in the distribution of income. In that case, equality is perceived as follows: 10% of households earn 10% of total income, 50% of households earn 50% of total income or, more generally, x% of households earn x% of total income. In a similar manner, the Lorenz curve is used to illustrate the disparity which exists in the distribution of a certain variable in relation to the distribution of another variable. Intuitively, the further the Lorenz curve is from the ‘line of equality’, the greater the disparity or inequality. We shall now introduce Gini’s coefficient, which is an index measuring the disparity of income. Consider the following diagram, which illustrates the Lorenz Curve for income distribution:
[Figure 5.2: Lorenz Curve for Income Distribution, showing the Line of Equal Distribution, the points A, B, O, D and C, and the regions X and Y.]
Gini’s Coefficient is defined as
\[ \rho = \frac{\text{Area } BDOB}{\text{Area } \triangle BOC} = \frac{X}{X + Y} \qquad \text{(Equation 1)} \]
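Equation 1 can be approximated numerically when only a few points on the Lorenz curve are known. The sketch below is an illustration under stated assumptions: X is taken as the area between the line of equal distribution and the curve, Y as the area under the curve, both axes are expressed as fractions (so that X + Y = 0.5), and the area under the curve is approximated by the trapezium rule. The cumulative shares used are hypothetical and serve only to show the arithmetic.

```python
def gini(cum_households, cum_income):
    """Approximate X / (X + Y) from points on a Lorenz curve.

    Both lists hold cumulative fractions, starting at 0 and ending at 1.
    """
    points = list(zip(cum_households, cum_income))
    # Y: area under the Lorenz curve, by the trapezium rule
    area_under = sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
    triangle = 0.5  # X + Y: area under the line of equal distribution
    return (triangle - area_under) / triangle


# Hypothetical cumulative shares of households and of income
households = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
income = [0.0, 0.05, 0.15, 0.30, 0.55, 1.0]

rho = gini(households, income)
rho_equal = gini(households, households)  # curve coincides with the line of equality
print(round(rho, 2), round(rho_equal, 2))
```

When the curve coincides with the line of equal distribution, the computed value is zero, in line with the discussion that follows.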
As the curve tends towards the line of equal distribution, X → 0, so that in turn ρ → 0 from (Equation 1) above. Thus the greater the value of ρ, the greater the disparity of income distribution. Hence the Gini Coefficient can be perceived as an Index of Inequality, as it measures the degree of departure from the ‘line of equality’.
Activity 3
Attempt questions 3.8 and 3.9 from (OJ).
5.2.5 The Z-chart
Read pp 66-67 of your textbook (OJ). As you have noted on p 66 in your textbook, the Z-chart consists of three curves on the same axes, as shown in Figure 3.8 on p 67. Usually, the chart covers a period of one year, by months. One curve shows the monthly figures, another shows the cumulative figures from the beginning of the year, while the third shows the total for the twelve months ending with each month. This last curve is generally called the moving annual total curve; more specifically, it is a 12-month moving total for the period of twelve months ending with each designated month.
The concept of a moving annual total is important: it tends to smooth out fluctuations to some extent. Note that the previous year's figures are used only for computing the moving annual total. To illustrate the computation of the Moving Annual Total, we shall refer to the solved example on p 66 in your textbook, which represents the output of ABC Limited. Let us see how the January moving annual total is obtained.
As stated in the textbook, the January figure is the total sales achieved during the period 1st February last year to 31st January this year, i.e.
\[ \text{January figure} = \underbrace{8 + 9 + 13 + \dots + 11}_{\text{previous yr. figures from Feb. to Dec.}} + \underbrace{11}_{\text{current yr. figure for Jan.}} \]
Similarly,
\[ \text{February figure} = \underbrace{9 + 13 + \dots + 11}_{\text{previous yr. figures from Mar. to Dec.}} + \underbrace{11 + 14}_{\text{current yr. figures for Jan. and Feb.}} \]
You can try for yourself and get the figures for the other months. One last point before we leave this topic, is that it is imperative that you understand and be able to interpret the three arms of the Z chart. Note: There is a slight modification in Fig. 3.8 (p. 67, OJ). The monthly totals have been wrongly plotted at the mid points of the respective months. In fact, they should be plotted at the end of the respective months. Thus, both the cumulative total and monthly total for January should coincide and subsequently, figures for the remaining months should be plotted accordingly.
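The three arms of the Z-chart can be computed as in the Python sketch below. The monthly figures are hypothetical (they are not the ABC Limited figures from the textbook), and the function name `z_chart_series` is illustrative only.

```python
from itertools import accumulate

# Hypothetical monthly figures, January to December
previous_year = [10, 8, 9, 13, 12, 14, 15, 13, 12, 11, 12, 11]
current_year = [11, 14, 12, 15, 14, 16, 17, 15, 14, 13, 14, 13]


def z_chart_series(previous, current):
    """Return the monthly, cumulative and moving annual total series."""
    monthly = list(current)
    cumulative = list(accumulate(current))
    # For month m: the remaining months of the previous year plus the
    # current year's months up to and including m (twelve months in all).
    moving_annual_total = [
        sum(previous[m + 1:]) + sum(current[:m + 1]) for m in range(12)
    ]
    return monthly, cumulative, moving_annual_total


monthly, cumulative, mat = z_chart_series(previous_year, current_year)
print(monthly)
print(cumulative)
print(mat)
```

Note that the cumulative total and the moving annual total coincide in December, which is where the top and the diagonal of the "Z" meet, and that the cumulative and monthly figures coincide in January.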
Activity 4
1. Attempt Question 3.11 from (OJ).
2. The table that follows refers to the monthly production of electricity (in gigawatt-hours) by the Central Electricity Board (CEB) for the years 1993 and 1994.
Table 5.3 Monthly production of electricity by the CEB for the years 1993 & 1994
Month        1993    1994
January       69      80
February      68      58
March         74      83
April         73      81
May           73      79
June          70      77
July          69      76
August        70      79
September     69      77
October       74      83
November      78      83
December      82      90
TOTAL        869     946
Source: Digest of Industrial Statistics, CSO
Attempt the questions that follow:
(a) From the information given in Table 5.3, construct a Z-chart for the year 1994.
(b) Explain briefly the three components of the chart.
5.2.6 The Scatter Diagram
Read pp 67-68 of your textbook (OJ).
Example 1
You are advised to attempt the example that follows. It will help you to understand further the usefulness of the scatter diagram for exploratory data analysis, i.e. as a prelude to further statistical analysis.
Efforts are being made to cultivate a variety of upland cotton in Bangladesh. Cotton yield is known to be directly related to the time of planting. Previous work suggests that the optimum time of planting is May in the wet season and September in the dry season. However, the dry-season crop is more economical, as it takes six months to mature as against one year for the wet-season crop. A study was therefore undertaken to find out whether late planting can be economical, especially because heavy rains occasionally interfere with cotton planting in September. The variety D5 was planted at fortnightly intervals between September 1973 and January 1974. The yields of cotton obtained, in kilograms/hectare, are given below (1st September is taken as Day 1).
Table 5.4 Yield of cotton planted at different fortnightly intervals between Sep 1973 and Jan 1974
Date of Seeding    Day Number    Cotton yield
1 Sep 73             1           17.39
16 Sep 73           16           17.74
1 Oct 73            31           16.02
31 Oct 73           61           13.88
15 Nov 73           76            9.78
30 Nov 73           91            7.38
15 Dec 73          106            6.09
30 Dec 73          121            4.29
14 Jan 74          136            3.92
Source: X
(a) Plot a scatter diagram to illustrate the above figures.
(b) Comment on what the diagram reveals.
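As a numerical companion to the scatter diagram, the Python sketch below lists the points to be plotted from Table 5.4 and computes the correlation coefficient between day number and yield from first principles. The correlation coefficient is only introduced formally in Unit 12, so this is offered as an optional check under that assumption, not as part of the exercise.

```python
from math import sqrt

# Day number and cotton yield from Table 5.4
day = [1, 16, 31, 61, 76, 91, 106, 121, 136]
cotton_yield = [17.39, 17.74, 16.02, 13.88, 9.78, 7.38, 6.09, 4.29, 3.92]

points = list(zip(day, cotton_yield))  # the (x, y) pairs to plot

n = len(day)
mean_x = sum(day) / n
mean_y = sum(cotton_yield) / n

# Sums of products and squares of deviations from the means
sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
sxx = sum((x - mean_x) ** 2 for x in day)
syy = sum((y - mean_y) ** 2 for y in cotton_yield)

r = sxy / sqrt(sxx * syy)  # a value near -1 indicates a strong inverse relationship
print(points)
print(round(r, 2))
```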
You can clearly note that a plot of yield against day number shows evidence of an inverse relationship, i.e. as the day number increases, yield decreases (the later the planting, the smaller the yield). So you can see that, though the scatter diagram is simple, it can reveal some interesting results.
5.2.7 Some Examples of Bad Practice
Read pp 69-71 of your textbook (OJ).
5.3 SUMMARY
In this unit, you have learnt about the construction, interpretation and usage of various graphical methods of presenting data.
UNIT 6
MEASURES OF CENTRAL TENDENCY
Unit Structure
6.0 Overview
6.1 Learning Objectives
6.2 The ∑ Notation
6.3 Measures of Central Tendency
    6.3.1 The Arithmetic Mean
    6.3.2 The Median
    6.3.3 The Mode
    6.3.4 The Harmonic Mean
    6.3.5 The Geometric Mean
6.4 Summary
6.0 OVERVIEW
Often in real life, we are confronted with a mass of data. We are then interested in finding a single representative value that captures the order of magnitude of the whole. It seems reasonable, in some sense, that the data should tend to cluster around some central value. This unit will introduce you to different measures that will enable you to find such a value.
6.1 LEARNING OBJECTIVES
When you have successfully completed this Unit, you should be able to compute, interpret and use the following:
1. the arithmetic mean.
2. the median.
3. the mode.
4. the harmonic mean.
5. the geometric mean.
6.2 THE ∑ NOTATION
As addition is a commonly used mathematical operation in Statistics, a special notation is useful to represent it, especially when we need to add many numbers. Let x be a variable which takes values $x_1, x_2, x_3, \dots, x_n$. Then the sum of the $x_i$'s, starting with $x_1$, ending with $x_n$, and including all values in between these limits, is given by $x_1 + x_2 + \dots + x_n$. The special notation ∑ (the Greek capital letter sigma) is used instead. Thus, we have
\[ \sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + \dots + x_n , \]
where the limits of the summation are incorporated in the notation itself.
Note: Sometimes the ∑ sign appears without the limits of summation: the latter should be taken as extending over all the values, from the first to the last one; i.e. in this case
\[ \sum x = \sum_{i=1}^{n} x_i \]
SOME PROPERTIES OF THE ∑ NOTATION:
• \[ \sum_{i=1}^{n} x_i = \sum_{i=1}^{m} x_i + \sum_{i=m+1}^{n} x_i \]
  For example,
  \[ \sum_{i=1}^{6} x_i = x_1 + x_2 + x_3 + x_4 + x_5 + x_6 = (x_1 + x_2) + (x_3 + \dots + x_6) = \sum_{i=1}^{2} x_i + \sum_{i=3}^{6} x_i \]
  i.e. in this case, m = 2.
• \[ \sum_{i=1}^{n} k = n \cdot k \qquad \text{(k is a constant)} \]
• \[ \sum_{i=1}^{n} (x_i \pm y_i) = \sum_{i=1}^{n} x_i \pm \sum_{i=1}^{n} y_i \]
• \[ \sum_{i=1}^{n} k x_i = k \sum_{i=1}^{n} x_i \qquad \text{(k is a constant)} \]
CAUTION
(i) \[ \sum_{i=1}^{n} x_i y_i \neq \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i \]
(ii) \[ \sum_{i=1}^{n} x_i^2 \neq \left( \sum_{i=1}^{n} x_i \right)^2 \]
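The properties and cautions above can be checked on a small set of numbers. The Python sketch below uses illustrative values chosen only for this check (they are not taken from the textbook), with k = 3 and m = 2 as arbitrary choices.

```python
# Illustrative values, chosen only to check the properties numerically
x = [2, 4, 6, 8]
y = [1, 3, 5, 7]
k, m, n = 3, 2, len(x)

# Splitting the range of summation at m
split_ok = sum(x) == sum(x[:m]) + sum(x[m:])

# Summing a constant n times gives n.k
constant_ok = sum(k for _ in range(n)) == n * k

# Sum of sums / differences
plus_ok = sum(a + b for a, b in zip(x, y)) == sum(x) + sum(y)
minus_ok = sum(a - b for a, b in zip(x, y)) == sum(x) - sum(y)

# A constant factor can be taken outside the summation
factor_ok = sum(k * a for a in x) == k * sum(x)

# CAUTION: these pairs are generally NOT equal
sum_of_products = sum(a * b for a, b in zip(x, y))
product_of_sums = sum(x) * sum(y)
sum_of_squares = sum(a * a for a in x)
square_of_sum = sum(x) ** 2

print(split_ok, constant_ok, plus_ok, minus_ok, factor_ok)
print(sum_of_products, product_of_sums, sum_of_squares, square_of_sum)
```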
Activity 1
Let x, y and f be variables taking the following values:
$x_i$ : 5, 1, 0, 4, 7
$y_i$ : 0, 2, 7, 8, 5
$f_i$ : 0, 1, 3, 4, 7
Calculate
(i) $\sum_{i=1}^{5} x_i$ , $\sum_{i=1}^{5} y_i$ , $\sum_{i=1}^{5} f_i$
4
(ii)
4
∑
i =1
i =1
i =1
8∑ xi
,
i =1
5
i =1
∑
,
i =1
5
∑ ( xi − yi )
5
2
∑(x
,
i =1
i
i =1
+ yi )
2
i =1
5
∑k ,
(vii)
where k is a constant
i =1
5
5
∑
(viii)
,
f i xi
i =1 5
yi
5
∑f
i
i =1
5
∑f
i
i =1
∑f
(ix)
∑f
i
i
i =1
5
xi 2
,
i =1
∑f
i
yi 2
5
,
i =1
∑f
i
i =1 5
∑f i =1
5
(x)
∑(x
− k ) , where k is a constant 2
i
i =1
5
∑ yi2
,
5
4 ∑ yi
,
i =1
5
∑ xi 2
yi
5
∑ 4 yi
,
i
i =1
5
∑ 8 xi
(vi)
∑x
,
i =1
5
(v)
5
∑ ( xi − yi )
,
i
i =1
5
∑ ( xi + yi )
(iv)
∑x
,
i=2
5
(iii)
8
∑ xi
,
xi
i =1
i
xi
2
5
f i xi
,
∑f i =1
i
yi
6.3 MEASURES OF CENTRAL TENDENCY
Read pp 93-94 of your textbook (OJ). To gain further insight into why averages are called indicators of central tendency, consider the following example: we rarely find an adult who is as tall as 7 feet or as short as 4 feet, the height of most people ranging around a point located centrally between these extremes. Because so many measurements cluster near the middle of the distribution, we say they have a central tendency. Since so large a portion of the group clusters near this central level, we think of it as representing the typical characteristic for the group; the most typical point is what we compute when we find an average.
6.3.1 The Arithmetic Mean
You are strongly advised to go to section 3.3.2.2 on class widths and class midpoints for a better understanding of the topic. Read pp 94-103 of your textbook (OJ). An important issue that needs to be highlighted is the case when we are dealing with open-ended class intervals. We take this up in the next section.
6.3.1.1 Open Ended Class Intervals
As clearly stated on p 101 of your textbook (OJ), there are no hard and fast rules for dealing with open-ended classes. There is obviously a degree of arbitrariness in the choice of the boundaries of the open-ended classes. You should thus reflect on the specific situation you are dealing with and decide on the limits using some common sense.
You may feel a bit perplexed by what has just been said, so let us consider the following example, which gives the age distribution of the management department of a large private company.
Table 6.1
Age (years)      Frequency
Under 20           2
20 - 29           12
30 - 39           31
40 - 49           39
50 - 59           26
60 and above      10
Suppose you have to compute the mean of the distribution. Since there are open-ended classes, you have to make some assumptions before carrying out the calculation, and justify the choices you make. It would be ridiculous to take the first class interval as 10-19, as realistically we would not expect managers of the company to be aged around 10! It would appear reasonable to take the first class interval, viz. under 20, as, say, 18-19. For the class interval 60 and above, 60-69 seems appropriate, taking into account that the age of retirement is around 69 in the private sector. You can now compute the mean by the methods you have just learnt (Direct method or using an Assumed Mean). Since you will probably be using a calculator to compute the mean, you may not find it necessary to use the Assumed Mean, but the idea behind this method should be kept in mind.
To conclude, p 103 of your textbook mentions that “Any alternative to the Arithmetic Mean cannot be used for advanced analysis, so their uses are descriptive rather than analytical.” This statement may be valid for distributions that are symmetrical, but may not be so for skewed (non-symmetrical) distributions, where the median (which you shall learn about very soon) is widely used for advanced statistical analysis.
Activity 2
(i) Show that
\[ \sum_{i=1}^{n} (x_i - \bar{x}) = 0, \qquad \text{where } \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \text{ is the mean.} \]
(ii) Show that
\[ \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left( \sum_{i=1}^{n} x_i \right)^2}{n} \]
HINT FOR (ii): Expand the term in the bracket on the L.H.S. of the equality and simply use the various properties of the ∑ notation you learnt previously to simplify it.
Activity 3
Attempt Questions 5.1, 5.6, 5.9 in textbook (OJ).
6.3.2 The Median
Read pp 103-107 of your textbook (OJ).
Pay particular attention to pp 106-107 of OJ, where the limitations of the mean are addressed, together with the situations where the median would be a more appropriate measure of central tendency than the mean. We further illustrate the use of the ogive curve to estimate the median. Let us use the solved example provided in subsection 5.2.1, p 37 of your course manual. We reproduce the data and the ogive curve below:
Watches Assembled    c.f.
up to 449.5            3
449.5 - 469.5         10
469.5 - 489.5         28
489.5 - 509.5         53
509.5 - 529.5         89
529.5 - 549.5        116
549.5 - 569.5        126
569.5 - 589.5        130
The ogive curve is shown below:
[Figure: Ogive Curve for On Time Ltd. Vertical axis: No of Workers (0 to 140); horizontal axis: Watches Assembled (440 to 600).]
We have n = 130
∴ Rank of median = 130/2 = 65
Using the ogive curve above, we have
Estimate of median = 516
Example 1
We illustrate below, using a further example, the computation of the median. The data below give the time taken by 200 female students to solve a problem.
Table 6.2
Time (to the nearest second)    Frequency
118 - 126                       15
127 - 135                       25
136 - 144                       45
145 - 153                       60
154 - 162                       25
163 - 171                       20
172 - 180                       10
You are required to calculate (a) the Mean and (b) the Median time taken to solve the problem.
(a) You can calculate the Mean yourself (Ans: 147.0).
(b) Calculation of Median
We first construct the cumulative frequency column:
Table 6.3
Time (to the nearest second)    f      c.f.
118 - 126                       15      15
127 - 135                       25      40
136 - 144                       45      85
145 - 153                       60     145
154 - 162                       25     170
163 - 171                       20     190
172 - 180                       10     200
Total                          200
For the calculation of the median, the variable under study is assumed to be continuously distributed; thus, in this case, the time variable is redefined as follows (as per section 3.3.2.2, Example 1):
117.5 and under 126.5
126.5 and under 135.5
135.5 and under 144.5
144.5 and under 153.5
etc.
The next step is to identify the median class, i.e. the class which contains the median value. The rank of the median value is given by 200/2 = 100 (i.e. total frequency divided by 2). From the above table, and the cumulative frequency column in particular, we notice that the fourth class interval, 144.5 and under 153.5, contains the 100th value, i.e. the median value. Hence, it is the median class. The problem now is to locate the median value, i.e. the 100th value for the whole data set, within the median class. Make sure this is clear to you; what follows will then be simple and straightforward!
Thus, the median value = 144.5 + “something”. What is that “something”? We note that
a) in the median class, 144.5 and under 153.5, there are 60 values and the class width = 9 units;
b) the cumulative frequency for the first three class intervals is 85;
c) the rank of the median value for the whole data set is 100, so that the rank of the median value within the median class is 100 - 85 = 15. Thus we need to locate the 15th value within the median class out of the 60 values it contains.
To do that, we make an assumption: the 60 values within the median class are evenly spread out, this assumption being sensible when the data set is large. Then, using simple direct proportion, we can locate the 15th value within the median class as follows:
If 60 values are spread over 9 units,
then 1 value is spread over 9/60 unit,
and 15 values are spread over (9/60) × 15 = 2.25 units.
Hence the median value = 144.5 + 2.25 = 146.75 ≈ 146.8 (to one decimal place)
It should be clear to you that this answer corresponds to the working out of the formula given in your textbook (p 105 of OJ), viz.
\[ \text{Median} = \text{LCB} + \frac{\left( \dfrac{n+1}{2} - \text{c.f. to median group} \right) \times \text{class interval}}{\text{frequency of median class}} \]
Note that since n is large, $\frac{n+1}{2} \approx \frac{n}{2}$, so that we have
\[ \text{median} = 144.5 + \frac{9}{60}\,[100 - 85] = 146.8 \quad \text{(same as above)} \]
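The interpolation just described can be written as a short Python function. This is a sketch under the same assumptions as the worked example: the rank of the median is taken as n/2, all classes have the same width, and values are evenly spread within the median class. The function name `grouped_median` is illustrative.

```python
def grouped_median(lower_boundaries, width, frequencies):
    """Median of a grouped distribution by linear interpolation.

    lower_boundaries: lower class boundary (LCB) of each class, in order
    width: common class width
    frequencies: class frequencies, in the same order
    """
    rank = sum(frequencies) / 2      # rank of the median, taken as n/2
    cumulative = 0                   # c.f. up to the class being examined
    for lcb, f in zip(lower_boundaries, frequencies):
        if cumulative + f >= rank:   # this is the median class
            return lcb + (rank - cumulative) / f * width
        cumulative += f
    raise ValueError("no data")


# Table 6.2 in continuous form: 117.5 and under 126.5, 126.5 and under 135.5, ...
lower_boundaries = [117.5, 126.5, 135.5, 144.5, 153.5, 162.5, 171.5]
frequencies = [15, 25, 45, 60, 25, 20, 10]

median = grouped_median(lower_boundaries, 9, frequencies)
print(median)
```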
You may also try estimating the median by first constructing the ogive curve. As you notice, the mean and the median nearly coincide. What does this indicate?
Activity 4
Attempt Questions 5.14, 5.22 (a) and 5.23 in textbook (OJ).
6.3.3 The Mode
Read pp 107-109 of your textbook (OJ). An important condition when computing the mode using the histogram is that the modal class and the two classes adjacent to it must be of equal width.
Activity 5
Attempt Question 5.24 in textbook (OJ).
6.3.4 The Harmonic Mean
Read pp 109-110 of your textbook (OJ).
6.3.5 The Geometric Mean
Read pp 110-112 of your textbook (OJ).
Activity 6
Attempt Question 5.27 in textbook (OJ).
6.4 SUMMARY
In this unit, you have learnt about the different measures of central tendency, viz. the arithmetic mean, the median, the mode and the harmonic and geometric means. These numerical descriptive measures enable us to create a mental image of, and to summarise, the data sets we usually encounter in practice.
UNIT 7
MEASURES OF DISPERSION
Unit Structure
7.0 Overview
7.1 Learning Objectives
7.2 Measures of Dispersion
    7.2.1 Introduction
    7.2.2 Measures of Range
    7.2.3 Measures of Average Deviation
        7.2.3.1 Mean Deviation
        7.2.3.2 Standard Deviation
    7.2.4 Coefficient of Variation/Measure of Relative Dispersion
7.3 Measures of Skewness
    7.3.1 General Considerations
    7.3.2 Coefficients of Skewness
7.4 Summary
7.0 OVERVIEW
Variation is an important consideration in life. In this unit, you study its importance as well as various ways of measuring its magnitude and understanding its nature.
7.1 LEARNING OBJECTIVES
When you have successfully completed this unit, you should be able to do the following:
1. Explain the importance of studying variation: both its magnitude and its nature.
2. Compute, interpret, and use
   a. the range.
   b. the interquartile range.
   c. the quartile deviation.
   d. the mean deviation.
   e. the standard deviation.
   f. the coefficient of variation.
   g. the quartile coefficient of variation.
   h. Pearson’s coefficient of skewness.
   i. Bowley’s coefficient of skewness.
7.2 MEASURES OF DISPERSION
7.2.1 Introduction
Study the top of p 181 of (OJ). This section draws attention to the inadequacy of measures of central tendency as summary measures for data, and highlights the need for measures of dispersion as well. Measures of central tendency try to capture a sense of the order of magnitude of the data, whereas measures of dispersion attempt to capture a sense of the variability in the data. Homogeneity (small variation) and heterogeneity (wide variation) are important considerations in many situations encountered in real life as they have implications for decisions and action. The various measures of the magnitude of dispersion are presented in your book in a rational progression, from the crudest to the most refined. Therefore, in studying the various measures, pay attention to how each successive measure presented improves on the previous one.
Activity 1
Ponder over the following:
(i) What would be the relevance of examining the variation in the marks scored by students from a given class at an examination? What would be the relevance of comparing such variation with that of marks for students from another class taking the same exam? What would be the potential implications of your findings for decision and action?
(ii) What would be the relevance of examining the variation in the output (measured by number of items produced) of workers of a factory, given that all workers are manufacturing the same item? What would be the relevance of comparing such variation with that of output of workers of another branch producing the same item? What would be the potential implications of your findings for decision and action?
7.2.2 Measures of Range
Study pp 180-183 of textbook (OJ). Ensure that you can define the range, the upper and lower quartiles, the interquartile range and the quartile deviation. Note that the quartiles can be obtained either graphically from the cumulative frequency curve (as illustrated on p 53) or by calculation, using the same kind of reasoning as for determining the median value. The latter happens to be the second quartile Q2. Ensure that you can compute the range and the quartile deviation (a fuller name for the latter is semi-interquartile deviation, for obvious reasons). Pay attention to the strengths and weaknesses of the range, the interquartile range and the quartile deviation as measures of dispersion. One of the weaknesses of all three is that they are each based on only two values. Because of this, we sometimes say that they are not comprehensive measures, as not all observations are taken into consideration in their computation. From this point of view, their representativeness is questionable.
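The calculation of the quartiles "using the same kind of reasoning as for determining the median value" can be sketched in Python as follows. The sketch reuses the data of Table 6.2 (Unit 6) in continuous form and assumes that the ranks of the lower and upper quartiles are taken as n/4 and 3n/4, by analogy with n/2 for the median; the function name `value_at_rank` is illustrative.

```python
# Table 6.2 (Unit 6) in continuous form, common class width 9
lower_boundaries = [117.5, 126.5, 135.5, 144.5, 153.5, 162.5, 171.5]
frequencies = [15, 25, 45, 60, 25, 20, 10]
width = 9


def value_at_rank(rank):
    """Locate the value of given rank by linear interpolation within its class."""
    cumulative = 0
    for lcb, f in zip(lower_boundaries, frequencies):
        if cumulative + f >= rank:
            return lcb + (rank - cumulative) / f * width
        cumulative += f
    raise ValueError("rank exceeds the total frequency")


n = sum(frequencies)
q1 = value_at_rank(n / 4)        # lower quartile
q2 = value_at_rank(n / 2)        # median (second quartile)
q3 = value_at_rank(3 * n / 4)    # upper quartile

interquartile_range = q3 - q1
quartile_deviation = interquartile_range / 2   # semi-interquartile deviation
print(q1, q2, q3, interquartile_range, quartile_deviation)
```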
Activity 2
1. Answer the following: In what way does the quartile deviation improve on the range as a measure of dispersion?
2. Attempt parts (a), (d), (e) and (f) of Question 9.2 in (OJ).
3. Attempt Question 9.3 in (OJ).
7.2.3 Measures of Average Deviation
7.2.3.1 Mean Deviation
Study pp 184-185 of OJ. Note that where the Σ sign appears without the limits of summation, the latter should be taken as extending over all the values, from the first to the last one.
Activity 3
Attempt part (g) of Question 9.2 in textbook (OJ).
7.2.3.2 Standard Deviation
Study pp 186-189 of textbook (OJ). The underlying idea for this measure of dispersion is the same as for the mean deviation, viz. the averaging of deviations from the mean. However, the problem that arises from the fact that the sum of such deviations is zero is overcome in a different way: not by ignoring the signs of the deviations but by squaring them before averaging. This process inflates the order of magnitude and, to offset this, we then take the square root. The term “standard deviation” has become the established appellation because of its simplicity, but a more explicit name would be the root mean squared deviation. The logic underlying the calculation of the standard deviation is well brought out in the form of the formula given at the bottom of p 186 of OJ. However, as pointed out in OJ, this form of the formula is computationally more difficult than the alternative, but equivalent, formula given at the top of p 187. Note that both these formulae apply to ungrouped data (i.e. data where the individual values of the variable are given). For grouped data, the appropriate formula is
\[ \sqrt{\frac{\sum f (x - \bar{x})^2}{\sum f}} \]
where x is the centre point of the class, $\bar{x}$ the overall mean and f the frequency. Try to figure out the underlying logic of this formula. Again, there is an alternative formula which is computationally easier. This formula is given in your book on p 187. Note that just before the formula, there is a slight misprint: you should read “.....simply replace x² with fx², x with fx and n with Σf.”
Note that the square of the standard deviation, i.e. the average of the squared deviations from the mean (without taking the square root), is also used as a measure of variation and is called the variance.
Activity 4
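1. Prove the equivalence of the alternative formulae for the standard deviation:

(a) in the case of ungrouped data

$$\sqrt{\frac{\sum (x - \bar{x})^2}{n}} \quad \text{and} \quad \sqrt{\frac{\sum x^2}{n} - \left(\frac{\sum x}{n}\right)^2}$$

(b) in the case of grouped data

$$\sqrt{\frac{\sum f (x - \bar{x})^2}{\sum f}} \quad \text{and} \quad \sqrt{\frac{\sum f x^2}{\sum f} - \left(\frac{\sum f x}{\sum f}\right)^2}$$

[Hint: Refer to Unit 6 Activity 2(ii)].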
2. On Time Ltd has a unit of 130 workers, all performing exactly the same task: the assembly of watches. The output of the workers for the first week of October 1997 was recorded and is reproduced below:

Watches Assembled    Number of Workers
up to 449             3
450 - 469             7
470 - 489            18
490 - 509            25
510 - 529            36
530 - 549            27
550 - 569            10
570 - 589             4

In Unit 5, you were asked to construct a table and an ogive for the above data.
(a) Using the ogive, estimate
    (i) the interquartile range
    (ii) the semi-interquartile deviation
(b) Also, obtain by calculation the above measures as well as the standard deviation for the same data.

3. Attempt Questions 9.13 and 9.16 in textbook (OJ).
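The two grouped-data formulae for the standard deviation in subsection 7.2.3.2 can also be compared numerically before attempting the algebraic proof. The sketch below is only an illustration: the class mid-points and frequencies are hypothetical (they are not taken from OJ or from the On Time Ltd data), and the function names are our own.

```python
import math

def grouped_sd_definition(midpoints, freqs):
    """Square root of the frequency-weighted mean of squared deviations from the mean."""
    total_f = sum(freqs)
    mean = sum(f * x for x, f in zip(midpoints, freqs)) / total_f
    squared_devs = sum(f * (x - mean) ** 2 for x, f in zip(midpoints, freqs))
    return math.sqrt(squared_devs / total_f)

def grouped_sd_shortcut(midpoints, freqs):
    """Computational form: sqrt(sum(f x^2)/sum(f) - (sum(f x)/sum(f))^2)."""
    total_f = sum(freqs)
    mean_of_squares = sum(f * x * x for x, f in zip(midpoints, freqs)) / total_f
    mean = sum(f * x for x, f in zip(midpoints, freqs)) / total_f
    return math.sqrt(mean_of_squares - mean ** 2)

# Hypothetical grouped data: class mid-points and their frequencies.
midpoints = [5, 15, 25, 35]
freqs = [2, 5, 8, 5]

sd_definition = grouped_sd_definition(midpoints, freqs)
sd_shortcut = grouped_sd_shortcut(midpoints, freqs)
variance = sd_definition ** 2  # the variance is the square of the standard deviation

print(round(sd_definition, 4), round(sd_shortcut, 4), round(variance, 4))
```

Both functions should agree up to rounding error, which is what the proof in Question 1 establishes in general.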
7.2.4
Coefficient of Variation/Measure of Relative Dispersion
Study p 189 (second half) and p 190 (except last five lines). The standard deviation has a major weakness: it may have the same value for very different data sets. Thus, from the example given in your textbook (OJ) on p 189, the two data sets

A: 8, 9, 10, 11, 12, 13, 14
B: 1008, 1009, 1010, 1011, 1012, 1013, 1014

have the same standard deviation, 2. Common sense would suggest that there is something wrong: intuitively, we feel they do not have the same degree of spread. Thus, for data set A, the increase from the smallest to the largest value is

$$\frac{14 - 8}{8} \times 100 = 75\%$$

but the corresponding increase for data set B is only

$$\frac{1014 - 1008}{1008} \times 100 = 0.595\%$$

This is because the standard deviation takes no account of the order of magnitude of the data: adding the same constant to every value leaves it unchanged. To correct this weakness, the standard deviation is related to a measure of the order of magnitude of the data set. This is the idea underlying the Coefficient of Variation. We thus have

$$\text{Coefficient of variation, C.V.} = \frac{\text{S.D.}}{\text{Arithmetic Mean}} \times 100$$

(provided that the Arithmetic Mean is not 0).
Can you compute the coefficient of variation for the two data sets A and B given above? Sometimes, the coefficient of variation is referred to as a measure of relative dispersion. Additionally, the standard deviation is not “dimensionless” : it is expressed in appropriate units such as rupees, centimetres, grams etc. Thus, if we were measuring variation in temperatures, the value of the standard deviation would differ depending on whether temperatures were measured in degrees Celsius or in degrees Fahrenheit. We call this scale dependence. The dependence on units of measurement also makes it impossible to use the standard deviation to compare the variation of variables measured in different and mutually inconvertible units. It is interesting to note that division by the mean also removes the dependence on units. Thus in contrast to the standard deviation, the coefficient of variation is “dimensionless”. Moreover, from the practical point of view, we need to be cautious when using the standard deviation or the coefficient of variation as measures of dispersion or spread. Common sense must prevail! Three illustrative examples are provided below:
Example 1 Suppose that cylindrical pins (and their corresponding sockets) are being manufactured to different diameters, say 5 and 15 millimetres and that it is desired to achieve the same fit irrespective of the diameters. Then the tolerable absolute variation in the diameters is the same and the standard deviation is appropriate in comparing the variability in diameter of a batch of 5 mm pins to that of a batch of 15 mm ones.
Example 2 Suppose we are comparing two groups of households: a low income group that spends around Rs 1,000 a month and a high income group that spends around Rs 10,000 a month. A variation of Rs 200 in the first group would be considered substantial but a similar variation among the second group would be considered minor. When this kind of consideration applies, it is obvious that we are interested in the dispersion relative to the order of magnitude of the values and not in the absolute dispersion. The standard deviation, which is a measure of absolute dispersion, is therefore inappropriate in such situations.
Example 3 Suppose that we are interested in comparing the variability in the weights of a group of students with the variability in their heights. It is impossible to use the standard deviation here, as it depends on the units of measurement (rupees, centimetres, grams etc.) and there is no way to convert kilograms, say, to centimetres. However, the coefficient of variation is dimensionless, i.e. independent of units, and is therefore appropriate here.
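The comparison of data sets A and B from p 189 of OJ can be sketched in a few lines. This is only an illustration; the function name is our own, and we assume the standard deviation is the population form (division by n), as in the formulae of this unit.

```python
import statistics

def coefficient_of_variation(values):
    """Standard deviation expressed as a percentage of the arithmetic mean."""
    mean = statistics.fmean(values)
    if mean == 0:
        raise ValueError("the coefficient of variation is undefined when the mean is 0")
    return statistics.pstdev(values) / mean * 100

data_a = [8, 9, 10, 11, 12, 13, 14]
data_b = [1008, 1009, 1010, 1011, 1012, 1013, 1014]

sd_a = statistics.pstdev(data_a)  # population standard deviation (divide by n)
sd_b = statistics.pstdev(data_b)
cv_a = coefficient_of_variation(data_a)
cv_b = coefficient_of_variation(data_b)

print(sd_a, sd_b)                      # the two standard deviations coincide
print(round(cv_a, 2), round(cv_b, 2))  # the two coefficients of variation do not
```

The standard deviations are equal, while the coefficient of variation of A is far larger than that of B, in line with the intuition that A is relatively more spread out.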
Activity 5
1. Answer the following: What would happen to (a) the mean (b) the standard deviation (c) the coefficient of variation of wages, if the wages of all workers of a factory were to be increased (i) by a uniform amount of Rs. 400 (ii) by 10%? Justify your answers.
2. Attempt Questions 9.19 and 9.21 in (OJ).
7.3
MEASURES OF SKEWNESS
Study from last but one paragraph of p190 to top of p 191 in (OJ).
7.3.1
General Considerations
You were introduced to the notion of skewness in Unit 6. Skewness is another aspect of the variation in the values of a variable. Whereas measures of dispersion focus on the extent of the variation, measures of skewness focus on the nature of that variation: as the values vary, do they tend to be symmetrically distributed around the centre, or do they tend to cluster more at the lower end or more at the upper end of the range of values? Such considerations are of interest as they have important implications. For example, income distributions are notoriously positively skewed. Excessive skewness in an income distribution is often criticised as reflecting an unfair distribution of income (Figure out why!) although some skewness in such distributions usually exists.
Activity 6 Governments usually have a means of reducing the skewness of income distributions if this is perceived as excessive. What is the instrument for doing that and how does it operate?
7.3.2
Coefficients of Skewness
You know from Unit 6 that:
(i) for a perfectly symmetrical distribution, the median and the mean coincide.
(ii) for a positively skewed distribution, the mean exceeds the median.
(iii) for a negatively skewed distribution, the mean is less than the median.
A measure of skewness can therefore be based on the deviation of the mean from the median. However, it is considered desirable that the value of the measure of skewness should not change merely because of a change in location ( e.g. if the wages of all workers in a factory were increased by the same amount ) or a change in scale (e.g. if the wages of all workers in the factory were increased by a constant percentage). It is also considered desirable that the measure of skewness should be independent of units.
These objectives are achieved by dividing the difference between the mean and the median by the standard deviation, hence the Pearson’s coefficient of skewness defined in OJ at the bottom of p 190.

$$\text{Pearson's Coefficient of Skewness} = \frac{3(\text{mean} - \text{median})}{\text{Standard Deviation}}$$

An alternative measure of skewness (known as Bowley’s coefficient of skewness) is based on the fact that in a symmetrical distribution, the lower and upper quartiles are equidistant from the median. However, in a skewed distribution, the deviations of the upper and lower quartiles from the median will be unequal. Thus the difference between such deviations can be used as a measure of skewness. Again, it is desirable that the value of the measure of skewness be independent of a change in location or scale and be free of units. These objectives are achieved by dividing the difference by the interquartile range.

Bowley’s Coefficient of Skewness:

$$sk = \frac{(Q_3 - Me) - (Me - Q_1)}{Q_3 - Q_1} = \frac{Q_3 + Q_1 - 2Me}{Q_3 - Q_1}$$
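The two coefficients can be sketched for ungrouped data as follows. The data are hypothetical, the function names are our own, and the quartiles are obtained with Python's `statistics.quantiles` (inclusive method), which is an assumption: it need not match the ogive or interpolation method used in OJ for grouped data. The last lines check the property discussed above, namely that a change of location and scale leaves both coefficients unchanged.

```python
import statistics

def pearson_skewness(values):
    """3 * (mean - median) / standard deviation (population form)."""
    mean = statistics.fmean(values)
    median = statistics.median(values)
    return 3 * (mean - median) / statistics.pstdev(values)

def bowley_skewness(values):
    """(Q3 + Q1 - 2*Me) / (Q3 - Q1), with quartiles from statistics.quantiles."""
    q1, me, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (q3 + q1 - 2 * me) / (q3 - q1)

# Hypothetical positively skewed data (a long tail towards high values).
wages = [2, 3, 3, 4, 5, 7, 12]

# Same data after a 10% increase followed by a uniform increase of 400.
wages_changed = [1.1 * w + 400 for w in wages]

pearson_before = pearson_skewness(wages)
pearson_after = pearson_skewness(wages_changed)
bowley_before = bowley_skewness(wages)
bowley_after = bowley_skewness(wages_changed)

print(round(pearson_before, 4), round(pearson_after, 4))
print(round(bowley_before, 4), round(bowley_after, 4))
```

Both coefficients come out positive for this data set, as expected when the mean exceeds the median.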
Activity 7
Refer to the data in Question 2 of Activity 4. Calculate
(i) Pearson’s coefficient of skewness and the coefficient of variation for the data.
(ii) Bowley’s coefficient of skewness.
7.4
SUMMARY
In this unit, you have appreciated the universality of variation and learnt about measures of dispersion and skewness: their importance, computation, application and interpretation. The following measures have been covered: range, interquartile range, quartile deviation, mean deviation, standard deviation, coefficient of variation, quartile coefficient of variation, Pearson’s coefficient of skewness, Bowley’s coefficient of skewness. In Unit 6, you saw how measures of central tendency provide a partial summary of certain data sets. In this unit, you have seen how measures of dispersion complement that summary.
UNIT 8
TIME SERIES ANALYSIS

Unit Structure
8.0 Overview
8.1 Learning Objectives
8.2 Components of a Time Series and Time Series Models
    8.2.1 Components of a Time Series
    8.2.2 Time Series Models
8.3 Calculation of Trend
    8.3.1 Using Moving Average Method
    8.3.2 Using Exponential Smoothing
8.4 Seasonal Variation
8.5 The Residual Component
8.6 Forecasting from the Time Series
    8.6.1 Projecting the Trend
    8.6.2 Forecasting the Trend Using the Additive Model
    8.6.3 Using the Average Rate of Change
8.7 The Multiplicative Model
8.8 Summary
8.0
OVERVIEW
A time series is a data set which tells us how a given variable varies over time; such data sets exhibit very particular patterns and fluctuations. They exist in almost every sphere of human activity: in economics, business, engineering, medicine, meteorology, agriculture, etc. In this unit, you will study the analysis of time series and you will be introduced to elementary forecasting.
8.1
LEARNING OBJECTIVES
When you have successfully completed this Unit, you should be able to do the following:
1. Explain
   a. the components of a time series.
   b. a time series model.
   c. hence, the underlying structure of a time series.
2. Calculate, interpret and use
   d. the trend.
   e. the seasonal component.
   f. the residual component.
3. Carry out elementary forecasting from time series analysis.
8.2
COMPONENTS OF A TIME SERIES AND TIME SERIES MODELS
8.2.1
Components of a Time Series
Study pp 125-128 of your textbook (OJ).
8.2.2
Time Series Models
From subsection 8.2.1, you have learnt that there are four components which interact among themselves in some way to generate the data set under consideration.
Let
X denote the time series variable (e.g. imports)
T the trend
S the seasonal component
C the cyclical component
R the residual component.

For our purposes, as explained in subsection 8.2.1, the time series under consideration will not cover a sufficiently long period for us to be able to capture the contribution of the cyclical component. We shall therefore ignore this component in our discussion; note, however, that this unavoidable omission will influence our calculations. Can you think why? The three remaining components are then assumed to interact in one of two main ways to produce the time series, giving rise to two time series models for short-term data:
(i) The Additive Model: X = T + S + R
(ii) The Multiplicative Model: X = T × S × R
The terms are self-explanatory. In this unit, we shall consider mainly the additive model; the multiplicative model will be dealt with briefly in section 8.7.
8.3
CALCULATION OF TREND
8.3.1
Using Moving Average Method
Study pp 128-131 of your textbook (OJ).
Your attention is drawn to two main points. Firstly, the idea of a moving total, used in the construction of a Zchart (subsection 5.2.5), is used here again to smooth out the fluctuations before a moving average is computed. Secondly, sometimes it may be desirable to draw a line of best fit by the “eye” from the values obtained, for the trend. For example, from the graph given on p 129 of your textbook (OJ), a straight line which fits best the points (hence the line of best fit) can be drawn. A line of best fit is drawn such that, overall, the points are ‘uniformly’ distributed around the line.
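The moving total and moving average described above can be sketched as follows. This is a minimal illustration under stated assumptions: the quarterly series is hypothetical (a straight-line trend plus a fixed seasonal pattern that sums to zero), the function name is our own, and for an even period we centre the averages by pairing adjacent moving totals, which may differ in layout from the table in OJ.

```python
def centred_moving_average(series, period=4):
    """Trend estimates from moving totals.

    For an odd period each moving total is already centred on an observation.
    For an even period, two adjacent totals are combined so that the average
    lines up with an actual time point.
    """
    totals = [sum(series[i:i + period]) for i in range(len(series) - period + 1)]
    if period % 2 == 1:
        return [total / period for total in totals]
    return [(totals[i] + totals[i + 1]) / (2 * period) for i in range(len(totals) - 1)]

# Hypothetical quarterly data: trend 100 + 2t plus a seasonal pattern summing to zero.
seasonal_pattern = [5, -3, -4, 2]
series = [100 + 2 * t + seasonal_pattern[t % 4] for t in range(12)]

trend = centred_moving_average(series, period=4)

# The first trend value corresponds to the third observation (t = 2), and so on.
for t, value in enumerate(trend, start=2):
    print(t, value)
```

Because the seasonal pattern sums to zero over each year, the moving average removes it and the straight-line trend is recovered for the interior time points; real data would of course leave some residual fluctuation.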
Activity 1
1. Draw the graphs shown on p 129 of your textbook (OJ). Then draw the line of best fit for the trend values. Can you forecast the trend values for years 16, 17 and 25? Comment on your results.
2. Attempt Questions 6.4 and 6.7 in your textbook (OJ).
8.3.2
Using Exponential Smoothing
Study pp 131-135 of your textbook (OJ). It is appropriate to note that if $X_i$ and $T_i$ denote the values of the time series and of the trend respectively at time i, then the trend values are given as follows:

$$T_1 = X_1$$

$$T_2 = \alpha X_2 + (1-\alpha) T_1 = \alpha X_2 + (1-\alpha) X_1$$

Note that $\alpha + (1-\alpha) = 1$.

$$T_3 = \alpha X_3 + (1-\alpha) T_2 = \alpha X_3 + (1-\alpha)\{\alpha X_2 + (1-\alpha) T_1\} = \alpha X_3 + \alpha(1-\alpha) X_2 + (1-\alpha)^2 X_1$$

Note that $\alpha + \alpha(1-\alpha) + (1-\alpha)^2 = 1$.

In general,

$$T_t = \alpha X_t + \alpha(1-\alpha) X_{t-1} + \alpha(1-\alpha)^2 X_{t-2} + \dots + \alpha(1-\alpha)^r X_{t-r} + \dots + (1-\alpha)^{t-1} X_1$$

Note that the sum of the weights $\alpha, \alpha(1-\alpha), \alpha(1-\alpha)^2, \dots, (1-\alpha)^{t-1}$ assigned to the X's is equal to 1 (and, if the final term is neglected, the sum of the remaining weights tends to 1 as t becomes very large).

Thus, when using exponential smoothing, we use all relevant values of the time series to compute the trend, with the weight $\alpha(1-\alpha)^r$ becoming smaller the more distant the value is. Furthermore, since the weights add up to 1, the trend value is a weighted average of all relevant values of the time series.
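The recursion and the expanded weighted-average form can be checked against each other numerically. In the sketch below the series is hypothetical and the function names are our own; α = 0.2 is chosen only because it is the value used in Activity 2.

```python
def exponential_smoothing(series, alpha):
    """Trend values T1 = X1, Tt = alpha * Xt + (1 - alpha) * T(t-1)."""
    trend = [series[0]]
    for x in series[1:]:
        trend.append(alpha * x + (1 - alpha) * trend[-1])
    return trend

def smoothing_weights(t, alpha):
    """Weights attached to Xt, X(t-1), ..., X1 in the expanded form of Tt."""
    weights = [alpha * (1 - alpha) ** r for r in range(t - 1)]
    weights.append((1 - alpha) ** (t - 1))  # weight on the first value X1
    return weights

# Hypothetical time series values X1, ..., X5.
series = [50, 52, 47, 55, 60]
alpha = 0.2

trend = exponential_smoothing(series, alpha)
weights = smoothing_weights(len(series), alpha)

# Expanded form: weights applied to the series taken from most recent to oldest.
expanded_last = sum(w * x for w, x in zip(weights, reversed(series)))

print([round(value, 4) for value in trend])
print(round(sum(weights), 10), round(expanded_last, 4))
```

The last trend value from the recursion and the weighted average from the expanded form should coincide, and the weights should sum to 1.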
Activity 2
1. Attempt Questions 6.4 and 6.7 in your textbook (OJ).
2. For the data in Question 6.4, obtain the exponentially smoothed trend line with α = 0.2.
8.4
SEASONAL VARIATION
Study pp 140-143 of your textbook (OJ). Recall that our additive model is given by X = T + S + R. Thus, we can write X − T = S + R, and the values of X − T give the “deviation from trend” as mentioned on p 139 of your textbook (OJ); these values are also known as detrended values. Similarly, we can write X − S = T + R, and the values of X − S give the “series with seasonal variation eliminated” as mentioned on p 142 of your textbook; these values are also known as the seasonally adjusted series or deseasonalised series. Further, in addition to the assumption made regarding the residuals (p 140 of your textbook), which underlies the calculation of the seasonal component, two more assumptions should be made explicit:
(i) With reference to the example discussed on p 140 of your textbook, the sum of the four seasonal components for a given year is equal to zero. This seems to be a sensible assumption, in line with the very notion of seasonal variation.
(ii) Each seasonal component (i.e. each one of the four considered in the example) is assumed to be constant over time. This may not always be true, the more so when the time series is spread over a long period of time. Such cases are somewhat difficult to handle and are not within the scope of this unit (and course). It is to be noted that the multiplicative model referred to in 8.7 can cope to a certain extent with this situation.
It is to be noted that, as mentioned on p. 141 (OJ), we expect the sum of the averages of S + R for each quarter to be equal to 0. This is due to the fact that
(i) in a given year, the sum of the four seasonal components is expected to be equal to 0;
(ii) the residual fluctuations are assumed to be random so that, in the long run, we expect them to cancel each other out.
Thus, if the sum is not zero, then an adjustment is necessary as explained on p. 141 (OJ).
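The averaging and adjustment steps can be sketched as follows. The detrended values (X − T) are hypothetical and the function name is our own. The adjustment used here, subtracting an equal share of the discrepancy from each quarterly average, is an assumption made for illustration; refer to p. 141 of OJ for the procedure actually adopted in the textbook.

```python
def seasonal_components(deviations_by_quarter):
    """Average the detrended values (X - T) per quarter, then adjust to sum to zero."""
    averages = {
        quarter: sum(values) / len(values)
        for quarter, values in deviations_by_quarter.items()
    }
    # Assumed adjustment: spread the discrepancy equally over the quarters.
    correction = sum(averages.values()) / len(averages)
    return {quarter: average - correction for quarter, average in averages.items()}

# Hypothetical detrended values (X - T) for three years of quarterly data.
deviations = {
    "Q1": [5.2, 4.6, 5.5],
    "Q2": [-3.1, -2.7, -3.5],
    "Q3": [-4.4, -3.8, -4.1],
    "Q4": [2.0, 2.9, 2.6],
}

seasonal = seasonal_components(deviations)

for quarter, value in seasonal.items():
    print(quarter, round(value, 2))
print("sum:", round(sum(seasonal.values()), 10))
```

Here the unadjusted quarterly averages sum to 0.4, so 0.1 is subtracted from each to obtain seasonal components that sum to zero.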
Activity 3
(i) Obtain seasonal components for the data set from Question 6.4 in your textbook (OJ).
(ii) Attempt Question 7.6 in your textbook (OJ).
8.5
THE RESIDUAL COMPONENT
Study pp 143-144 of your textbook (OJ). Recall that our assumed time series model is defined by X = T + S + R. We must bear in mind that
(i) the exclusion of the cyclical fluctuations from our model and
(ii) the possible failure, at times, of the assumptions underlying the computation of the seasonal component
have a bearing, in one way or another, on the accuracy of the calculation of the various components. In particular, in addition to the various points made in your textbook regarding the nature and content of the residual component, the residual component captures the resulting errors to some extent, and this may lead to further inaccuracy. For our purposes, however, the model and the method are satisfactory.
8.6
FORECASTING FROM THE TIME SERIES
8.6.1
Projecting the Trend
Study pp 144-147 of your textbook (OJ). Some comments on the method used to project the trend in your textbook are pertinent. As is already stated in your textbook on p 147, projecting the trend by eye is neither accurate nor consistent: different people may give different forecasts. The alternative is to draw a line of best fit as explained in subsection 8.3.1 and then to carry out the needed forecasts. This method is obviously approximate and holds provided the trend is linear, or at least approximately linear. Moreover, it sometimes happens that there is a marked change in the slope of the trend at a given point in time, so that the trend may be approximated by two linear parts, as shown in Figure 8.1:
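[Figure 8.1: Trend plotted against Time. The plotted points lie close to one straight line from A to B and to a second straight line, with a different slope, from B to C.]

Figure 8.1

In such cases, the trend is estimated by extrapolating only the second linear part (BC) of the graph.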
Activity 4 Reproduce the graph found on p 145 of your textbook (OJ). Carry out the forecast of the trend for 196 Quarters 1 and 2 over again, using a line of best fit by eye.
8.6.2
Forecasting the Trend Using the Additive Model
The trend values obtained from the moving average method are plotted against time on a graph. We shall assume, for our purposes, that we have a linear trend. Thus a line of best fit is drawn by eye. That line is then extrapolated and the corresponding forecasts for the trend can be read off. In your textbook, the method used to obtain forecasts of imports from the forecast trend values (pp 146-147) has the merit that we do not have to disentangle the seasonal component from the residual component. This is particularly useful when the residual fluctuations are believed to be pronounced. However, if they are of minor importance (as is normally expected), then we can simply forecast the imports by using our model X = T + S. Substituting the estimated trend value and the corresponding seasonal component in the model gives the forecast value for the time series.
8.6.3
Using the Average Rate of Change
Study pp 147-149 of your textbook (OJ).
8.7
THE MULTIPLICATIVE MODEL
Study pp 149-151 of your textbook (OJ). In a multiplicative model, as mentioned on p. 150 (OJ), the total of the average ratios is expected to be four in the example under consideration. It is appropriate to understand why this is so. As in the additive model, the seasonal components at times raise the series above the trend and at times lower it below the trend. In a given year (as per the example), we expect these increasing and decreasing effects to cancel each other out. But in the multiplicative model, the seasonal components do not add up to 0 as in the additive model; instead, they add up to 4. Why? The ratios are calculated as Actual data/Trend. If there were no seasonal fluctuations, this ratio would be equal to 1 (assuming the residual fluctuations to be insignificant). Some seasonal components produce an increasing effect, so that the ratio is more than 1 (e.g. 1.0633 or 1.1071 as per p. 150 OJ). Other seasonal components produce a decreasing effect, so that the ratio is less than 1 (e.g. 0.8471 or 0.9873). Thus, as per the definition of seasonal fluctuation, we expect the excesses over 1 and the shortfalls below 1 to cancel each other out in one year (as per the example), so that the total of the four ratios is expected to be 4. If it is not, an adjustment is made as explained on p. 150 (OJ).
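The adjustment can be sketched as below. The four ratios quoted above are used purely as illustrative inputs; treating them as the four quarterly average ratios is our assumption. The rescaling rule (multiplying each ratio by 4 divided by their total) is a common convention that we adopt here for illustration and may differ from the adjustment described on p. 150 of OJ.

```python
def adjust_seasonal_ratios(ratios):
    """Rescale the average ratios (Actual/Trend) so that they total the number of seasons."""
    factor = len(ratios) / sum(ratios)  # assumed proportional adjustment
    return [ratio * factor for ratio in ratios]

# Illustrative inputs: the ratios quoted in the text, assumed to be quarterly averages.
average_ratios = [1.0633, 1.1071, 0.8471, 0.9873]

adjusted_ratios = adjust_seasonal_ratios(average_ratios)

print(round(sum(average_ratios), 4))   # total before adjustment
print([round(ratio, 4) for ratio in adjusted_ratios])
print(round(sum(adjusted_ratios), 4))  # total after adjustment
```

The unadjusted total is slightly above 4, so each ratio is scaled down by the same small proportion.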
Activity 5 Attempt Question 7.10 of your textbook (OJ).
8.8
SUMMARY
In this unit, you have learnt about time series analysis. You should now have a good grasp of the nature of the various components of a time series and of their calculation and interpretation. You have also learnt how to carry out elementary forecasting from time series analysis.
UNIT 9
INDEX NUMBERS

Unit Structure
9.0 Overview
9.1 Learning Objectives
9.2 Index Numbers
9.3 Methods of Construction of Index Numbers
    9.3.1 Index Numbers of Prices
    9.3.2 Two Approaches
    9.3.3 Worked Examples
    9.3.4 Fisher’s Index of Prices
    9.3.5 General Formulae for Index Numbers
    9.3.6 Index Numbers of Quantities (or Volume)
9.4 Further Concepts
    9.4.1 Splicing Two Series of Index Numbers
    9.4.2 Chain-Based Index Numbers
    9.4.3 Using an Index to Deflate a Time Series
9.5 General Problems of Index Number Construction
9.6 Uses and Limitations of Index Numbers
9.7 Summary

9.0
OVERVIEW
This Unit introduces you to statistical tools called index numbers, which attempt to measure the magnitude of changes in any variable over time. Here, we shall be more concerned with changes in economic variables over time. The unit will cover the different types of price and quantity index numbers, the general problems of index number construction, and the interpretation, uses and limitations of index numbers. Chapter 8 of your textbook covers some of these topics. However, the different types of index numbers are not introduced in a proper order and the different types of index number construction are not well defined (pp 158-164). This unit therefore consists of a complete write-up of the topics covered in these pages in your textbook as well as of other topics which are omitted in the chapter. We shall make special reference to these pages where necessary. Treatment of topics on pp 165-170 is quite satisfactory and, therefore, these topics are not rewritten in the unit.
9.1
LEARNING OBJECTIVES
When you have successfully completed this Unit, you should be able to do the following:
1. Compute, interpret and compare the different types of price and quantity index numbers.
2. Explain the importance of weights in an index number.
3. Identify the main practical issues to be considered when constructing an index number.
4. Change from fixed base to chain base and vice versa, splice and deflate an index number series.
5. Identify the uses and limitations of index numbers.
9.2
INDEX NUMBERS
As mentioned in the overview, index numbers are devices for measuring the magnitude of changes in a variable over time. Such changes could be in the price of commodities, in the quantity of goods produced, marketed, or consumed, or in such concepts as productivity, efficiency, etc. The comparisons may be between different time periods, between places, or between like categories. In many of these situations, the volume of data that has to be analysed is huge and also has other characteristics that you did not come across in data for averages. Index numbers are special types of averages which can make such masses of complex data more manageable and better understood, and thus enable us to compare different sets of data. Thus, we may have index numbers comparing the consumer prices in different years or in different countries, the volume of production in different years, the productivity of different sectors of the economy, or the efficiency of different school systems. Read pp. 157-158 of your textbook for further details.

This unit is mainly concerned with index numbers of prices and of quantities comparing changes over time. At first, methods of construction of index numbers of prices and of quantities are considered. We then introduce further concepts such as chain-based index numbers, splicing of index numbers and use of an index number to deflate a time series. You can read about splicing and deflating in your textbook (OJ). Finally, we deal with general problems of index number construction, and with the uses and limitations of index numbers.
9.3
METHODS OF CONSTRUCTION OF INDEX NUMBERS
We shall consider methods of construction of price indices at first and later on show how quantity indices can be obtained similarly.
9.3.1
Index Numbers of Prices
To illustrate the construction and interpretation of index numbers of prices, we have used the data of example on p. 158 (OJ), which uses a list of three commodities.
Al Coholic throws a (rather unusual) party each Christmas for his friends. Details of prices and quantities of the three food and drink items purchased by him in 1992 and 1993 are as follows:
Table 9.1

                        1992                  1993
Commodity               Price     Quantity    Price     Quantity
                        p0        q0          pn        qn
Lager (per bottle)      £1.00     40          £1.15     50
Crisps (per packet)     £0.20     100         £0.27     90
Cake                    £2.00     1           £2.20     1
The objective of Al Coholic is to know how the prices of these commodities, taken as a whole, changed in 1993 as compared with 1992. The time period that serves as the basis for comparison is called the base period, whereas the time period that is compared with the base period is called the current period. Thus, here 1992 is the base year and 1993 is the current year.

Note: The base period is sometimes indicated by setting that period equal to 100. For example, here, 1992 = 100 shows that 1992 is the base year.
9.3.2
Two Approaches
There are two main approaches to handle the problem of determining the changes in the prices of a group of commodities taken as a whole in a given year as compared with those in another year:

One approach is to consider the change in the price of each commodity initially and then to try, in some way, to bring together these changes, by, for example, using an average.

The second approach is to consider the prices of all the commodities at one point in time and then relate them, in some way, to those at the other point in time under consideration.
We shall now consider these two methods for the rest of this section.
Method I – Price Relatives Method
A very simple way of finding the change in the price of each commodity would be by calculating what is known as a price relative for each commodity.
The price relative of a commodity is defined as its price in the current period expressed as a percentage of its price in the base period.

If $p_0$ and $p_n$ denote the prices of a commodity during the base period and the current period respectively, then, symbolically,

$$\text{Price relative} = \frac{p_n}{p_0} \times 100 \qquad \text{(Equation 9.1)}$$

In the above example, the price relatives for the three commodities in 1993 with 1992 as the base year are then given by:

$$\text{Price relative for Lager} = \frac{1.15}{1.00} \times 100 = 115$$

$$\text{Price relative for Crisps} = \frac{0.27}{0.20} \times 100 = 135$$

$$\text{Price relative for Cake} = \frac{2.20}{2.00} \times 100 = 110$$

Thus it is observed from the above price relatives that the prices of Lager, Crisps and Cake have risen by 15, 35 and 10 percent respectively.
However, as mentioned earlier, Al Coholic is interested in a single measure which would compare the change in prices, of all the three commodities taken as a whole, in the two years.
We are tempted to believe that one way of combining the changes in the price of a group of commodities is to find an average of price relatives of these commodities.
An average of price relatives is calculated, most frequently using the arithmetic mean, as follows:

$$\text{Arithmetic Mean of Price Relatives} = \frac{\sum \left(\frac{p_n}{p_0} \times 100\right)}{N} \qquad \text{(Equation 9.2)}$$

where N = the number of commodities included in the index.
Table 9.2

                        Price (£)             Price Relative
Commodity               1992      1993        (pn / p0) × 100
                        p0        pn
Lager (per bottle)      1.00      1.15        115.0
Crisps (per packet)     0.20      0.27        135.0
Cake                    2.00      2.20        110.0
Total                                         360.0

Substituting the above calculated value in (9.2), the arithmetic mean of price relatives of 1993 with 1992 as the base year is

$$\frac{\sum \left(\frac{p_n}{p_0} \times 100\right)}{N} = \frac{360.0}{3} = 120.0$$
Thus, in 1993, the average percentage increase in the prices for this group of commodities as compared with 1992 is 20%.
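The calculation in Table 9.2 can be reproduced with a short sketch. The prices are those of Table 9.1; the variable and function names are our own.

```python
# Prices (in pounds) from Table 9.1.
prices_1992 = {"Lager": 1.00, "Crisps": 0.20, "Cake": 2.00}  # base year, p0
prices_1993 = {"Lager": 1.15, "Crisps": 0.27, "Cake": 2.20}  # current year, pn

def price_relative(p0, pn):
    """Current price expressed as a percentage of the base price (Equation 9.1)."""
    return pn / p0 * 100

relatives = {
    item: price_relative(prices_1992[item], prices_1993[item])
    for item in prices_1992
}

# Unweighted arithmetic mean of the price relatives (Equation 9.2).
mean_of_relatives = sum(relatives.values()) / len(relatives)

for item, value in relatives.items():
    print(item, round(value, 1))
print("Mean of price relatives:", round(mean_of_relatives, 1))
```

Every commodity counts equally in this average, whatever its share of total spending.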
However, this method does not recognise the relative importance of different commodities in the consumption pattern. Thus, according to this method, the weight or importance given to Lager, Crisps and Cake is equal. Before we tackle this weakness, let us consider the second approach.
Method II – Aggregative Method
The second approach is to relate the sum (aggregate) of the unit prices of a group of commodities in the current year to the sum (aggregate) of the unit prices of these commodities in the base year in some way. We are tempted to consider the following expression:

$$\frac{\sum p_n}{\sum p_0} \times 100 \qquad \text{(Equation 9.3)}$$

In the example given above, substituting the calculated values from Table 9.3 in (9.3), the value of this expression for the three commodities is

$$\frac{\sum p_n}{\sum p_0} \times 100 = \frac{3.62}{3.20} \times 100 = 113.1$$

Table 9.3

                        Price (£)
Commodity               1992      1993
                        p0        pn
Lager (per bottle)      1.00      1.15
Crisps (per packet)     0.20      0.27
Cake                    2.00      2.20
Total                   3.20      3.62
Thus, in 1993 Al Coholic’s total cost of one bottle of Lager, one packet of crisps and one cake was 113.1% of the total cost of these commodities in 1992. In terms of percentage change, it cost 13.1% more than in 1992 to purchase this ‘basket’ of goods.
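The simple aggregative calculation of Equation 9.3 can be sketched in the same way, again using the prices of Table 9.1 with names of our own choosing.

```python
# Unit prices (in pounds) from Table 9.3.
base_prices = {"Lager": 1.00, "Crisps": 0.20, "Cake": 2.00}     # 1992, p0
current_prices = {"Lager": 1.15, "Crisps": 0.27, "Cake": 2.20}  # 1993, pn

total_base = sum(base_prices.values())
total_current = sum(current_prices.values())

# Simple (unweighted) aggregative index, Equation 9.3.
simple_aggregative_index = total_current / total_base * 100

print(round(total_base, 2), round(total_current, 2))
print(round(simple_aggregative_index, 1))
```

Note that the result depends on the units in which each price is quoted (per bottle, per packet, per cake), which is one reason the text goes on to introduce weights.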
Yet again, this method does not recognise the relative importance of different commodities in the consumption pattern.
WEIGHTS
As mentioned above, the various items in a group do not generally have equal relative importance, and both the methods considered above suffer from the drawback that they do not take into account the relative importance of the different items.
Thus, although we were tempted to use these simple methods to determine the average change in a variable over time, we should not use them; we should include, in each method, measures of the relative importance of the various items known as weights, denoted by w, as done below.
Method I (Relative Method)

$$\frac{\sum \frac{p_n}{p_0}\, w}{\sum w} \times 100$$

Method II (Aggregative Method)

$$\frac{\sum p_n w}{\sum p_0 w} \times 100$$
Which Weights to Use ?
In the consumer price example used here, the weights considered are usually:

Method I (Relative Method): expenditure, i.e. the product of price and quantity, pq. Thus w = pq.

Method II (Aggregative Method): quantities, i.e. q. Thus w = q.
Obviously, there exist two sets of values of prices, quantities and expenditures: those of the base year (i.e. $p_0$, $q_0$, $p_0 q_0$) and those of the current year (i.e. $p_n$, $q_n$, $p_n q_n$). Effectively, each set of values of expenditures and quantities can be used as weights within the corresponding methods, giving the following formulae for the index numbers:

Method I (Relative Method)

(a) Base Weighting

$$\frac{\sum \frac{p_n}{p_0}\, p_0 q_0}{\sum p_0 q_0} \times 100$$

(b) Current Weighting

$$\frac{\sum \frac{p_n}{p_0}\, p_n q_n}{\sum p_n q_n} \times 100$$

Method II (Aggregative Method)

(a) Base Weighting

$$\frac{\sum p_n q_0}{\sum p_0 q_0} \times 100$$

(b) Current Weighting

$$\frac{\sum p_n q_n}{\sum p_0 q_n} \times 100$$
We note that when weights of the base year are used, we refer to base weighting. Laspeyres is the person who introduced this technique and hence the corresponding formulae are referred to as Laspeyres’ formulae.

Similarly, we have current weighting; Paasche is the person who introduced this technique and hence the corresponding formulae are referred to as Paasche’s formulae.
9.3.3
Worked Examples
The corresponding calculations for the different formulae are given below:
Method I (Relative Method)

(a) Laspeyres’ Index Number (Using Base Weighting)

As given in subsection 9.3.2, the Price Index is

$$\frac{\sum \frac{p_n}{p_0}\, p_0 q_0}{\sum p_0 q_0} \times 100$$

The table below shows the calculations for the above formula.

Table 9.4

                        Price (£)           Quantity   Price Relative   Base Expenditure   PR × Weight
Commodity               1992      1993      1992       (PR)             (Weight)
                        p0        pn        q0         pn / p0          p0 q0
Lager (per bottle)      1.00      1.15      40         1.15             40                 46.00
Crisps (per packet)     0.20      0.27      100        1.35             20                 27.00
Cake                    2.00      2.20      1          1.10             2                  2.20
Total                                                                   62                 75.20

Thus,

$$\frac{\sum \frac{p_n}{p_0}\, p_0 q_0}{\sum p_0 q_0} \times 100 = \frac{75.20}{62} \times 100 = 121.3$$
Read relevant parts of p 161 of your textbook (OJ).
(b) Paasche’s Index Number (Using Current Weighting)

We compute the corresponding current year expenditures $p_n q_n$ and then we obtain the Price Index (as given in subsection 9.3.2) from

$$\frac{\sum \frac{p_n}{p_0}\, p_n q_n}{\sum p_n q_n} \times 100$$

Table 9.5

                        Price Relative   Current Expenditure   PR × Weight
Commodity               (PR)             (Weight)
                        pn / p0          pn qn
Lager (per bottle)      1.15             57.50                 66.125
Crisps (per packet)     1.35             24.30                 32.805
Cake                    1.10             2.20                  2.420
Total                                    84.00                 101.350

$$\text{The Price Index (current weighting)} = \frac{101.35}{84} \times 100 = 120.65$$
Method II (Aggregative Method)
(a) Laspeyre’s Index Number (Base Weighting)
The index number is given by (as per page 83)
$$\frac{\sum p_n q_0}{\sum p_0 q_0} \times 100,$$
and is calculated below:
Table 9.6
Commodity             Price (£)        Quantity   Price × Base Quantity
                      1992     1993    1992
                      p0       pn      q0         p0 q0      pn q0
Lager (per bottle)    1.00     1.15    40         40.00      46.00
Crisps (per packet)   0.20     0.27    100        20.00      27.00
Cake                  2.00     2.20    1          2.00       2.20
Total                                             62.00      75.20
Laspeyre’s Price Index =
$$\frac{75.20}{62.00} \times 100 = 121.3$$
i.e. prices in 1993 have increased by 21.3% as compared with those in 1992.
(b) Paasche’s Index Number (Current Weighting)
Paasche’s Index Number is here given (as per page 83) by
$$\frac{\sum p_n q_n}{\sum p_0 q_n} \times 100$$
Table 9.7
Commodity             Price (£)        Quantity   Price × Current Quantity
                      1992     1993    1993
                      p0       pn      qn         p0 qn      pn qn
Lager (per bottle)    1.00     1.15    50         50         57.5
Crisps (per packet)   0.20     0.27    90         18         24.3
Cake                  2.00     2.20    1          2          2.2
Total                                             70         84.0
Paasche’s Price Index =
$$\frac{84}{70} \times 100 = 120.0$$
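The four index values obtained in the worked examples can be cross-checked numerically. The Python sketch below is an illustration only and is not part of the course text: the function names `relative_index` and `aggregative_index` are hypothetical, and the data are the prices and quantities of Tables 9.4 to 9.7.

```python
# Prices (in pounds) and quantities from Tables 9.4 to 9.7.
# Order of commodities: lager, crisps, cake.
p0 = [1.00, 0.20, 2.00]   # 1992 (base year) prices
pn = [1.15, 0.27, 2.20]   # 1993 (current year) prices
q0 = [40, 100, 1]         # 1992 quantities
qn = [50, 90, 1]          # 1993 quantities


def relative_index(p0, pn, weights):
    """Method I: weighted mean of price relatives pn/p0, times 100."""
    numerator = sum((b / a) * w for a, b, w in zip(p0, pn, weights))
    return 100 * numerator / sum(weights)


def aggregative_index(p0, pn, quantities):
    """Method II: ratio of quantity-weighted price aggregates, times 100."""
    numerator = sum(b * q for b, q in zip(pn, quantities))
    denominator = sum(a * q for a, q in zip(p0, quantities))
    return 100 * numerator / denominator


base_expenditure = [a * q for a, q in zip(p0, q0)]      # p0 * q0
current_expenditure = [b * q for b, q in zip(pn, qn)]   # pn * qn

laspeyres_rel = relative_index(p0, pn, base_expenditure)
paasche_rel = relative_index(p0, pn, current_expenditure)
laspeyres_agg = aggregative_index(p0, pn, q0)
paasche_agg = aggregative_index(p0, pn, qn)

print(round(laspeyres_rel, 1), round(paasche_rel, 2),
      round(laspeyres_agg, 1), round(paasche_agg, 1))
```

Keeping the weights as an explicit argument makes it easy to see that only the choice of weights distinguishes base weighting from current weighting within each method.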
Comments:
1. We note that the answers for Laspeyre’s Price indices, whether using expenditures or quantities as weights, are exactly equal to 121.3. This is not a coincidence. In fact, if we study the corresponding formulae carefully, we find that the formula for base weighting in the case of Method I (Relative Method) simplifies to the formula used for base weighting in the case of Method II (Aggregative Method). Thus
$$\frac{\sum \frac{p_n}{p_0}\, p_0 q_0}{\sum p_0 q_0} \times 100 = \frac{\sum p_n q_0}{\sum p_0 q_0} \times 100$$
2. The choice of weights is of utmost importance in the construction of index numbers. Here, where the variable under study is price, the weights chosen are expenditures and quantities. For other variables, different weights would be used. They should be appropriate to the purpose for which they are meant: they should measure the relative importance of the items under consideration.
3. There is a tendency to prefer base weighting to current weighting because it helps in the comparability of the indices over time and also because, with current weighting, there may be problems of interpretation of the index over time. Also, it is easier to use and understand base weighted indices. However, when prices are changing sharply over a short period of time, causing changes in the pattern of consumption, current weighting will be more appropriate.
4. There is a tendency to use aggregative indices more often because they are rather easy to compute, use and understand. Bearing in mind the point made in (3) above, aggregative indices with base weighting tend to be used quite often.
5. Finally, it is to be noted that indices obtained by the relative method are independent of the units of measurement whilst those obtained by the aggregative method are not.
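The equality noted in Comment 1 can be checked term by term. The short derivation below is a sketch in the notation already used for the price formulae; it relies on nothing beyond cancelling $p_0$ inside each term of the sum.

```latex
\sum \frac{p_n}{p_0}\,(p_0 q_0) \;=\; \sum p_n q_0
\quad\Longrightarrow\quad
\frac{\sum \frac{p_n}{p_0}\, p_0 q_0}{\sum p_0 q_0} \times 100
\;=\;
\frac{\sum p_n q_0}{\sum p_0 q_0} \times 100 .
```

No such cancellation occurs with current weighting, because each term of $\sum \frac{p_n}{p_0}\, p_n q_n$ keeps $p_0$ in its denominator. This is consistent with the two Paasche results in the worked examples (120.65 and 120.0) being different.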
Interpretation/Discussion:
1. Generally speaking, Laspeyre’s Index tends to overstate and Paasche’s Index tends to understate changes in prices or quantities. Read pp 161-162 (OJ), Section on Paasche’s Index.
2. If we were to consider the formulae for the aggregative price indices, it would be of interest to give some thought to the possible interpretation of these indices. Thus consider the aggregative Laspeyre’s Price Index
$$\frac{\sum p_n q_0}{\sum p_0 q_0}$$
The denominator, $\sum p_0 q_0$, is in fact the effective expenditure incurred in the base period, whilst the numerator, $\sum p_n q_0$, represents the expenditure that would have been incurred in the current period if the pattern of consumption in year n were the same as that of year 0 (as measured by the quantities $q_0$).
We can therefore interpret the index number as the ratio of the “would have been expenditure in current year keeping pattern of consumption constant” to “effective expenditure in base year”.
3. Consider now the aggregative Paasche’s Price Index
$$\frac{\sum p_n q_n}{\sum p_0 q_n}$$
Can you interpret this index number?
4. One last word: Given the different approaches involved in the construction of index numbers and given the different prevailing systems of weights, it is inevitable that there is more than one possible index number to measure the change in a variable over time. The preceding discussion and your common sense should guide you in choosing and interpreting the appropriate index number in a given context.
Activity 1
1. Attempt Questions 8.3 and 8.10 from your textbook (OJ).
2. Using data of Question 8.13 in OJ and, taking 194 = 100, calculate Laspeyre’s and Paasche’s index numbers for 199 prices, using both approaches.
3. A basic Food Price Index (F.P.I) comprises the undermentioned items, weighted for the average family taking a normal diet as follows:
Item        Price          Weighting
Bread       Rs 1/loaf      7 loaves
Potatoes    Rs 4/lb        20 lbs
Milk        Rs 5/pint      15 pints
Eggs        Rs 18/dozen    2 dozen
Meat        Rs 40/lb       10 lbs
It is expected that during the next year, the cost of bread will rise by 10%, potatoes will rise by 25%, milk will fall by 10%, eggs will fall by 5% and meat will increase by 30%.
Calculate the F.P.I expected in one year’s time, if the present F.P.I is 112.
Calculate the F.P.I expected in three years’ time if prices continue to change at the same average rate.
Suppose that it is predicted that people will spend rather more on milk and eggs and somewhat less on meat during the coming year. In what way would you expect your answer to part (1) to be affected, if a current weighted index were used?
Why could a current weighted F.P.I be unsatisfactory?
4. A factory produces togs, clogs and pegs, each of these three products having a different work content. The proportions of these products vary from month to month and the factory requires an index for assessing productivity changes. Each tog, clog and peg produced is to be weighted according to its work content, these weights being 6, 8 and 5 respectively. Also, because some months contain more working days than others, the index should offset the effect of this.
Data for the months of May, June and July are as follows:
                      May    June    July
Working days          23     22      16 (due to factory closure for 2 weeks)
Output (thousands)
  togs                19     16      10
  clogs               12     20      15
  pegs                22     15      10
It is intended that May should be the base month for comparison, with a productivity index of 100. Design a simple productivity index, calculate its value for June and July, and comment briefly on the results.
Now, due to a change in the type of peg produced, a new weight is required. Production data are shown below for two days when productivity was judged to be about equal.
Output      Day 1    Day 2
Togs        921      811
Clogs       800      747
Pegs        1042     1206
Use these data to estimate a suitable weight for the new pegs, to 1 decimal place, assuming that the weightings of 6 for togs and 8 for clogs are as before.
9.3.4 Fisher’s Index of Prices
Read pp 165-166 of your textbook (OJ).
9.3.5 General Formulae for Index Numbers
The examples considered above in measuring changes in the prices of a group of items can now be generalized to the measuring of changes in any other variable (quantities, productivity, efficiency, examination marks etc.) for a group of items. The two methods used are always applicable, together with the possibility of base and current weighting. Thus suppose we need an index number to measure the changes in a variable X for a group of items over time. Then using subscripts n and o for current and base periods respectively, and denoting weights by w, in a manner similar to that used on page 83, we have
General form
Method I (Relative Method):
$$\frac{\sum \frac{x_n}{x_0}\, w}{\sum w} \times 100$$
Method II (Aggregative Method):
$$\frac{\sum x_n w}{\sum x_0 w} \times 100$$
(a) Base Weighting
Method I (Relative Method):
$$\frac{\sum \frac{x_n}{x_0}\, w_0}{\sum w_0} \times 100$$
Method II (Aggregative Method):
$$\frac{\sum x_n w_0}{\sum x_0 w_0} \times 100$$
(b) Current Weighting
Method I (Relative Method):
$$\frac{\sum \frac{x_n}{x_0}\, w_n}{\sum w_n} \times 100$$
Method II (Aggregative Method):
$$\frac{\sum x_n w_n}{\sum x_0 w_n} \times 100$$
The effective choice of weights will depend on the variable under consideration.
Activity 2
Attempt Question 8.14 from your textbook (OJ).
9.3.6 Index Numbers of Quantities (or Volume)
Index numbers of quantities (or volume) measure changes in physical quantities such as the quantities of goods and services consumed, volume of industrial production, volume of imports and exports, etc. As per the discussion of Section 9.3.5, these indices are calculated using methods similar to those you have learnt so far in the case of price indices. You will note that now quantity is the variable for which the magnitude of changes over time is to be measured. Consequently, weights are values of commodities (or expenditures), or prices, depending upon the index used. Thus the quantity relative for a commodity =
$$\frac{q_n}{q_0} \times 100$$
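The formulae for the various quantity indices are as follows:
General form
Method I (Relative Method):
$$\frac{\sum \frac{q_n}{q_0}\, w}{\sum w} \times 100$$
Method II (Aggregative Method):
$$\frac{\sum q_n w}{\sum q_0 w} \times 100$$
(a) Base Weighting
Method I (Relative Method):
$$\frac{\sum \frac{q_n}{q_0}\, p_0 q_0}{\sum p_0 q_0} \times 100$$
Method II (Aggregative Method):
$$\frac{\sum q_n p_0}{\sum q_0 p_0} \times 100$$
(b) Current Weighting
Method I (Relative Method):
$$\frac{\sum \frac{q_n}{q_0}\, p_n q_n}{\sum p_n q_n} \times 100$$
Method II (Aggregative Method):
$$\frac{\sum q_n p_n}{\sum q_0 p_n} \times 100$$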
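As an illustration of the quantity formulae, the sketch below applies them to the prices and quantities of Tables 9.4 to 9.7. Reusing that data set for a quantity index is an assumption made here for illustration (the course text does not work this example), and the function names are hypothetical.

```python
# Prices and quantities taken from Tables 9.4 to 9.7 (lager, crisps, cake).
p0 = [1.00, 0.20, 2.00]   # base year (1992) prices
pn = [1.15, 0.27, 2.20]   # current year (1993) prices
q0 = [40, 100, 1]         # base year quantities
qn = [50, 90, 1]          # current year quantities


def aggregative_quantity_index(q0, qn, prices):
    """Method II: sum(qn * p) / sum(q0 * p) * 100, with prices as weights."""
    numerator = sum(q * p for q, p in zip(qn, prices))
    denominator = sum(q * p for q, p in zip(q0, prices))
    return 100 * numerator / denominator


def relative_quantity_index(q0, qn, weights):
    """Method I: weighted mean of quantity relatives qn/q0, times 100."""
    numerator = sum((b / a) * w for a, b, w in zip(q0, qn, weights))
    return 100 * numerator / sum(weights)


# Base weighting uses base year prices (or base expenditures p0*q0).
laspeyres_q = aggregative_quantity_index(q0, qn, p0)
laspeyres_q_rel = relative_quantity_index(
    q0, qn, [p * q for p, q in zip(p0, q0)])

# Current weighting uses current year prices.
paasche_q = aggregative_quantity_index(q0, qn, pn)

print(round(laspeyres_q, 2), round(laspeyres_q_rel, 2), round(paasche_q, 2))
```

As with prices, the two base-weighted quantity indices coincide, since the factor $q_0$ cancels in each term of the relative formula.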
Activity 3
Attempt Question 8.20 from your textbook (OJ).
9.4 FURTHER CONCEPTS
9.4.1 Splicing Two Series of Index Numbers
It is general practice to change the base year after a certain time period to take into account any changes in the consumption pattern, or in the weights (i.e. the relative importance of the different items included), or both. Thus, in Mauritius, for the Consumer Price Index (CPI) calculated over the last 20 years, the base period has been regularly changed at an interval of five years. For the purpose of historical comparison, however, it may be desirable to have a single series of index numbers with either the ‘old’ base period or the ‘new’ base period. The process by which the two series of index numbers with different base years/periods are combined is called splicing. For splicing two such time series of index numbers to form one continuous series, it is necessary that the two series have one common year so that both types of index numbers have been calculated for that year (or period). The index numbers revised in the process of splicing are generally the index numbers of the old series, with the overlapping year being used as the base for the combined series. The first step in the process of splicing is to determine the quotient obtained by dividing the new index number for the overlap year by the old index for this year. The overlap year is generally the new base year, so that the new index = 100, and the quotient is determined by
$$Q = \frac{I(\text{new})_{\text{overlap}}}{I(\text{old})_{\text{overlap}}} = \frac{100}{I(\text{old})_{\text{overlap}}}$$
The next step in the process of splicing is to multiply each of the index numbers in the old time series by the quotient given by the formula above to calculate the new index numbers as follows:
$$I(\text{new}) = I(\text{old}) \times Q$$
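The two splicing steps can be sketched in a few lines of Python. The series below are hypothetical figures invented for illustration (they are not CPI data), and the function name `splice` is likewise an assumption; the overlap year is taken to be the new base year, so the new index there is 100.

```python
def splice(old_series, overlap_year):
    """Rescale an old index series so that the overlap year equals 100.

    Step 1: Q = 100 / I(old) at the overlap year.
    Step 2: I(new) = I(old) * Q for every year of the old series.
    """
    q = 100 / old_series[overlap_year]
    return {year: value * q for year, value in old_series.items()}


# Hypothetical series: old base 2000 = 100, new base 2002 = 100.
old_series = {2000: 100.0, 2001: 110.0, 2002: 125.0}
new_series = {2002: 100.0, 2003: 104.0}

# One continuous series on the new base; the overlap year appears once.
spliced = {**splice(old_series, 2002), **new_series}
print(spliced)
```

Rescaling the old series rather than the new one follows the practice described above, where the overlapping year becomes the base of the combined series.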
Read pp 166-167 of your textbook (OJ).
Activity 4
Attempt Question 8.19 from your textbook (OJ).
9.4.2 Using an Index to Deflate a Time Series
Read pp 168-170 of your textbook (OJ).
Activity 5
Attempt Question 8.22 from your textbook (OJ).
9.4.3 Chain-Based Index Numbers
The index numbers calculated so far are called fixed-base indices. Read the relevant parts of pp 163, 164 and 165 of your textbook (OJ) to know about the chain-based index numbers. This method is suitable when the relative importance of items and the consumption pattern are changing rapidly because, with this method, new items can be introduced and old ones removed with ease. Its disadvantage is that comparisons over long periods of time are not possible.
Activity 6
Convert the fixed-base system of the Index Number of Prices given in Question 8.19 of your textbook (OJ) to a chain-based system for the years 1985 to 1991.
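A conversion from a fixed-base series to a chain-based series can be sketched as follows, taking each year's index as a percentage of the preceding year's index. The figures are hypothetical and are not the data of Question 8.19; the function name `fixed_to_chain` is an assumption.

```python
def fixed_to_chain(fixed_series):
    """Express each year's index with the preceding year taken as 100."""
    years = sorted(fixed_series)
    return {
        current: 100 * fixed_series[current] / fixed_series[previous]
        for previous, current in zip(years, years[1:])
    }


# Hypothetical fixed-base series (first year = 100).
fixed_series = {2001: 100.0, 2002: 110.0, 2003: 121.0}
chain_series = fixed_to_chain(fixed_series)
print(chain_series)
```

The first year has no chain-based value because there is no preceding year in the series to serve as its base.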
9.5 GENERAL PROBLEMS OF INDEX NUMBER CONSTRUCTION
The first essential point to be considered is the purpose for which the index number is to be constructed. What is the index number intended to measure? For example, the Consumer Price Index (CPI) attempts to answer a question concerning the average movement of certain prices over time. The index of industrial production, among other things, is constructed to show the trend of economic activity. In general, there are four main problems that can arise in the construction of a new index number. They are:
1. Selection of items to be included.
2. Selection of a suitable base period.
3. Choice of appropriate weights measuring the relative importance of various items included in the index.
4. Choice of a suitable average or index number formula.
These major problems need to be tackled very cautiously in the light of the object of the index number. Here, we shall consider the construction of a price index and mostly refer to the CPI as an illustration. However, you should know that similar problems can occur in the construction of quantity or volume indices.
• Selection of Items to be Included
Since it is impracticable, if not impossible, on grounds of cost or time, to measure changes in the prices of all the relevant commodities, a selection must be made of items to be included in the index. The movements in the prices of the selected items should be representative of the movements of prices of all the relevant commodities. The items selected for a CPI should also be representative of tastes, habits, customs and necessities of the people to whom the index relates. The number of items should be fairly large, yet consistent with ease of handling. The CPI of Mauritius includes 230 item classes, selected after ascertaining that these items accurately reflect the average change in the cost of the entire market basket. By means of the periodic Household Expenditure Surveys, the Central Statistical Office (C.S.O) determines the representative basket of goods and services purchased by households on average and also how the total expenditure is spread over these items. The items selected are classified in nine major commodity groups (for example, Food, Fuel and Lighting, Housing, Medical care, etc.) so that separate indices can be calculated for these major groups, in addition to an overall CPI.
• Selection of a Suitable Base Period
A second problem in the construction of a price index is the choice of a base or reference period, that is, a period relative to which the prices in the current year are compared. For a general-purpose price index, such as a wholesale price index or a consumer price index, it is desirable to have a base period of relative economic stability that is not too distant in the past. Thus the time period selected as base should be one with ‘normal’ price levels, since the use of a base with unusually high or low price levels could distort comparisons of price changes for subsequent years. On the other hand, the problem with a distant base is that the economic conditions prevailing at that time could be quite different and the comparisons with such remote periods are not of any interest. Further, the base period once selected should be regularly shifted to a more recent period. Thus the base year for the CPI of Mauritius has been shifted every five years since 1976. Finally, a relatively recent base facilitates the inclusion of new commodities, as well as the dropping of obsolete ones. You will note that the above discussion relates to the fixed-base method. As seen earlier, in specific cases index numbers are constructed using the chain-base method. If this method is used, then the year preceding the current year is automatically taken as the base year. Thus in the chain-base method, the problem of selection of base does not arise.
• The Choice of Appropriate Weights
As mentioned earlier, it is necessary to give weights to individual items included in an index number so as to show their relative importance in the comparison of price changes. Surely, in the construction of the CPI a 25% increase in the price of bread will have more significance than a 25% increase in the price of jam. The weights to be used depend on the purpose of the index to be constructed. It is necessary to adopt some system of ‘rational’ weighting, that is, according to some logical basis. Thus, for the CPI, proportionate expenditure upon different items found from a Household Expenditure Survey would constitute appropriate weights if price relatives of different commodities are to be averaged; the weights in this case are known as value weights. If, on the other hand, prices rather than price relatives are used, reasonable weights would be given by the quantities of individual items purchased, and are known as quantity weights. The types of quantities to be used in a price index would depend on the nature of the index computed. Thus, an index of export prices would use quantities of commodities and services exported, whereas an index of import prices would use quantities imported. To conclude, weighting of an index number is essential; weights should be rational and should be renewed after a few years. In the case of the CPI, the Household Expenditure Surveys carried out regularly at intervals of about five years would provide changes in weights of the commodities, if any.
• The Choice of a Suitable Average
As seen earlier in the index number calculations, there are different types of averages from which a suitable average can be selected. The form of the average selected generally depends more on practical considerations than on mathematical properties. Thus the Weighted Price Index with base period quantity weights (qo), i.e. Laspeyre’s Price Index, and the Weighted Arithmetic Mean of Price Relatives with base period value weights (poqo) are the averages which are mostly used.
You will recall that both these methods are actually equivalent. The CPI is constructed by using the Weighted Arithmetic Mean of Price Relatives with base period value weights, since such value weights are easily available from the Household Expenditure Survey and can be used for a few years till the weights are changed following the next Household Expenditure Survey. Current year weights, although theoretically better in rapidly changing economic situations, pose the problem of data collection every time the index is to be calculated. Thus the base-weighted index is preferred to the current-weighted index. Similarly, the Arithmetic Mean is preferred to the Geometric Mean due to its simplicity in calculations.
9.6 THE USES AND LIMITATIONS OF INDEX NUMBERS
• Uses of the Index Numbers
Index number series are very useful for analysis of economic activity and for decision making. Thus the CPI is used to calculate the rate of inflation in a country, and as a basis for wage negotiations in collective bargaining processes. Again, it can also be used to deflate current incomes with a view to ascertaining the real incomes and for adjusting National Income Accounts. Index numbers are useful for showing trends in the economic activity of a country. Thus comparisons can be made between movements in the levels of prices of different groups of commodities, or between the price levels and wages, or between the levels of production and wages, between the import prices and the consumer prices, etc. Index numbers are also used for international comparisons of socio-economic development. Thus index numbers are extremely useful tools for governments, businessmen, economists, as well as in other fields of human endeavour.
• Limitations of Index Numbers
Index numbers, however, have their own limitations. An index is only an approximate indicator of the change it is attempting to measure. Errors can be committed at the various selection processes mentioned earlier concerning problems of construction of an index number. The index number is usually based on a sample, so that sampling errors are bound to occur. It is not possible to include all changes in quality or product. Unless the base period is a fairly recent one, comparisons are not reliable. Different methods of computation give different results, some overstating the upward movement in prices while others understate it. However, unless an index is deliberately distorted, it will show correctly at least the trend of the phenomenon which it is measuring, except when there are rapid changes in conditions.
9.7 SUMMARY
In this unit, you have learnt about statistical tools called index numbers. You should now have a clear understanding of the different types of price and quantity index numbers, of their calculation and interpretation, and of the main practical issues involved in the construction of an index number. You have also learnt how to change from fixed base to chain base and vice versa, splice, and deflate an index number series. Lastly, you should know that any index number has its merits and limitations.
Recommended Readings
1. Household Budget Survey, July 1991-June 1992. Vol. I, Methodological Report, July 1993.
2. Household Budget Survey, July 1991-June 1992. Vol. II, Analytical Report, July 1994.
UNIT 10 PROBABILITY
Unit Structure
10.0 Overview
10.1 Learning Objectives
10.2 Introduction
10.3 Mathematical Preliminary: Elementary Theory of Sets and Venn Diagrams
10.4 The Sample Space and Events
10.5 The Probability of an Event
10.6 General Law of Addition
10.7 Conditional Probability
10.8 Law of Multiplication of Probability
10.9 Independent Events
10.10 Tree Diagrams
10.11 Joint Probability Tables
10.12 Summary
10.0 OVERVIEW
This unit introduces you to the concept of uncertainty involved in almost every real-world problem. Probability is a measure of uncertainty associated with the occurrence or non-occurrence of an event. In everyday language, it is synonymous with chance. This unit covers the mathematical concepts of Sets, Venn Diagrams, basic concepts in probability, different rules of probability and methods of analysis useful in statistical applications. Chapter 11 of your textbook (OJ) covers some of these topics. However, the presentation of different topics and the explanation provided are not found to be appropriate for the present course. This unit, therefore, consists of a complete write-up of the topics covered in Chapter 11 (except Bayes’ Theorem, which is not included in the present syllabus) as well as of those topics which are omitted in the chapter. Reference will therefore be made only to the solved examples and to the exercises in the textbook.
10.1 LEARNING OBJECTIVES
When you have successfully completed this Unit, you should be able to do the following:
1. Compute and interpret probabilities of different types of events.
2. Use Venn Diagrams for illustrations, where appropriate.
3. Compute and interpret conditional probability.
4. Draw Tree Diagrams.
5. Use Joint Probability Tables.
10.2 INTRODUCTION
Probability is a measure of uncertainty associated with the occurrence or non-occurrence of an event. It is a concept used in our everyday life, for example, the chance of obtaining a head when a coin is tossed, or the chance that it rains today. Probability plays an important role in all advanced statistical methods which deal with decision-making in situations involving an element of risk and uncertainty.
10.3 MATHEMATICAL PRELIMINARY: ELEMENTARY THEORY OF SETS AND VENN DIAGRAMS
Set Theory and Venn Diagrams are very useful mathematical tools, both in describing the basic concepts in probability as well as in understanding the different rules of probability. This section, therefore, provides a brief introduction to the theory of sets and Venn Diagrams.
10.3.1 Elementary Theory of Sets
• Sets
A set is a collection of distinct objects, normally referred to as elements or members. A set is usually denoted by a capital letter, and the elements by small letters.
Example 1
A = {x : x is an even number}, i.e., Set A consists of elements x where x is an even number.
B = {2, 4, 6, 8}, i.e., Set B consists of the four elements 2, 4, 6, 8.
• Subsets
A subset of a set A is a set which consists of some or all of the elements of A. If B is a subset of A, then B ⊆ A.
Example 2
In Ex 1, B is a subset of A. This is denoted by B ⊂ A.
• The Number of a Set
The number of a set A, written as n(A), is defined as the number of elements that A contains.
Example 3
In Ex 1, n(B) = 4
• The Universal Set
The set of all objects relevant to a particular application is called the Universal Set. A Universal Set is usually denoted by the capital letter U. Any other set defined with respect to the particular application will necessarily be a subset of the Universal Set.
Example 4
If U = {a, b, c, d, e, f} then A = {a, d, e} is a subset of U.
• Disjoint Sets
Two sets are said to be disjoint if they have no elements in common.
Example 5
Sets {2, 3, 4, 5, 6} and {1, 7, 8, 9} are disjoint, as they have no elements in common.
• The Complement of a Set
If A is any subset of the Universal Set U, then all those elements that belong to U, but are not contained in A, form the complement of A, denoted by A′ or by A^c.
Example 6
In Example 4, A′ = {b, c, f}
• The Empty Set or Null Set
A set which has no elements is called an empty set or null set and is denoted by φ (phi).
• Set Operations
(i) Set Union
The union of two sets A and B is written as A ∪ B and defined as that set which contains all the elements of A or B or both. Thus A ∪ B = {x : x ∈ A or x ∈ B or x ∈ (both A and B)}, where ∈ means ‘belongs to’.
Example 7
If A = {1, 2, 3, 4} and B = {4, 5, 6, 7}, then A ∪ B = {1, 2, 3, 4, 5, 6, 7}
(ii) Set Intersection
The intersection of two sets A and B is written as A ∩ B and is the set of elements belonging to both A and B. Thus A ∩ B = {x : x ∈ A and x ∈ B}.
Example 8
In Example 7, A ∩ B = {4}.
10.3.2 Venn Diagrams
A Venn Diagram is a diagram associated with set theory. Venn Diagrams provide pictorial descriptions of sets, subsets, intersections and unions. In a Venn diagram, the universal set U is represented by a rectangle and its subsets are usually represented by circles or ovals. Figure 10.1 shows, within a universal set, the union of A and B and the intersection of A and B respectively, each indicated by the shaded area.
Figure 10.1. Venn diagrams showing (a) the union of A and B, and (b) the intersection of A and B
Note:
(i) n(A) + n(A′) = n(U), where A′ is the complement of the set A in U.
(ii) n(A) + n(B) = n(A ∪ B), when A and B are disjoint, and,
(iii) n(A) + n(B) − n(A ∩ B) = n(A ∪ B) whether A and B are disjoint or not.
10.4 THE SAMPLE SPACE AND EVENTS
Before defining probability, it is necessary to define several terms that are used in the process of determining the probability.
• An experiment is a process or phenomenon which is being studied or observed. Here we are concerned with an experiment whose outcome depends on chance, i.e. the outcome is not predictable at the outset as there are many possible outcomes. For example, tossing a coin is an experiment whose outcome depends on chance as there are two possible outcomes, namely, a head or a tail.
• The set of all possible distinct outcomes of an experiment is called the sample space or possibility space of the experiment, which is usually denoted by S. For example, the number of defectives in a batch of 5 components is given by the set S = {0, 1, 2, 3, 4, 5}
• An event is a subset of the sample space S. In the preceding example, if A is the event of getting at most one defective, then A = {0, 1}, which is a subset of S.
• When each outcome is as likely to occur as any other, the outcomes are called equally likely. For example, when tossing a fair dice, S = {1, 2, 3, 4, 5, 6}, where all outcomes are equally likely.
• When the occurrence of one event, say A, precludes the occurrence of another event B, events A and B are said to be mutually exclusive events. Note: The subsets representing two mutually exclusive events A and B are disjoint.
For example, if A is the event of getting an even number, and B is the event of getting an odd number, when a fair dice is tossed, then
S = {1, 2, 3, 4, 5, 6}, A = {2, 4, 6} and B = {1, 3, 5}
Since getting an even number precludes the occurrence of an odd number on the same trial, or observation, events A and B are mutually exclusive, and are represented by the two disjoint sets A and B given above. That is, they have no common points.
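Because events are subsets of the sample space, the set notions of Section 10.3 can be tried out directly with Python's built-in `set` type. The sketch below uses the dice example above; the third event C (a score of at least 4) is a hypothetical addition so that two overlapping events are available.

```python
S = {1, 2, 3, 4, 5, 6}             # sample space for one toss of a dice
A = {x for x in S if x % 2 == 0}   # event: even number  -> {2, 4, 6}
B = {x for x in S if x % 2 == 1}   # event: odd number   -> {1, 3, 5}
C = {x for x in S if x >= 4}       # hypothetical event: at least 4

# A and B are mutually exclusive: their sets are disjoint.
mutually_exclusive = A.isdisjoint(B)

# Counting rule for disjoint sets: n(A) + n(B) = n(A ∪ B).
disjoint_rule = len(A) + len(B) == len(A | B)

# General counting rule: n(A) + n(C) − n(A ∩ C) = n(A ∪ C).
general_rule = len(A) + len(C) - len(A & C) == len(A | C)

# Complement rule: n(A) + n(A′) = n(S), with A′ = S − A.
complement_rule = len(A) + len(S - A) == len(S)

print(mutually_exclusive, disjoint_rule, general_rule, complement_rule)
```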
10.5 THE PROBABILITY OF AN EVENT
10.5.1 Definition of Probability
Historically, different approaches have been developed for defining probability. These approaches determine how probability is defined and computed. One of the definitions of probability is given below. If all n(S) outcomes of a sample space S are equally likely and mutually exclusive and n(A) of these are favourable to the occurrence of an event A, i.e., an event A consists of a subset of n(A) of these n(S) outcomes, then the probability of occurrence of A, denoted by P(A), is given by
$$P(A) = \frac{n(A)}{n(S)}$$
Thus, the probability that event A will occur is the ratio of the number of outcomes in the subset A to the number of outcomes in the sample space S. It is to be noted that with this definition, the probability can be determined without actually carrying out the experiment and observing the sample events. For example, the probability of drawing a king from a well-shuffled pack of cards is given by
$$P(K) = \frac{4}{52} = \frac{1}{13}$$
where K represents the event that the card drawn is a king, n(K) = 4 and n(S) = 52.
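The ratio n(A)/n(S) can be computed exactly with Python's standard `fractions` module. The following is a minimal sketch, assuming a standard 52-card pack; the helper name `probability` and the way cards are encoded as (rank, suit) pairs are choices made here for illustration.

```python
from fractions import Fraction


def probability(event, sample_space):
    """Classical probability n(A) / n(S) for equally likely outcomes."""
    return Fraction(len(event & sample_space), len(sample_space))


ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]

# Sample space S: all 52 (rank, suit) pairs.
pack = {(rank, suit) for rank in ranks for suit in suits}

# Event K: the card drawn is a king.
kings = {card for card in pack if card[0] == "K"}

p_king = probability(kings, pack)
print(p_king)
```

Using `Fraction` rather than floating-point division keeps the answer in the same reduced form as the hand calculation.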
Example 9
The data collected by a supermarket showed that 161 of the 253 women who entered the supermarket on a Saturday morning made at least one purchase. Estimate the probability that a woman entering this supermarket on a Saturday morning will make at least one purchase.
Let A be the event that a woman entering the supermarket makes at least one purchase. The estimate of P(A) is then given by
$$P(A) = \frac{161}{253},$$
since n = 253 and A occurred 161 times.
Activity 1
1. A fair dice is tossed once. What is the probability of obtaining an even number?
2. A retailer has 12 TV sets out of which 4 sets are known to be defective. If one set is selected at random, what is the probability that it turns out to be defective?
3. An inspector randomly samples 50 components manufactured during one day and finds that 2 components are defective. What is the probability that an electronic device containing one component will be inoperative because the component is defective?
4. Attempt Question 11.2 in textbook (OJ).
10.5.2 Axioms of Probability
From the definition of probability given above, the probability of an event is a proportion. Probability should thus possess the essential properties of a proportion. Therefore, probability should be a number between 0 and 1. Furthermore, the probability of the event S, where S is the sample space, should be 1 because one of the possible outcomes is certain to occur when the experiment is carried out. These ideas are contained in the axioms, or postulates, of probability which follow:
AXIOMS OF PROBABILITY: Probability is a function, defined on a sample space S, that satisfies
1. P(A) ≥ 0, for any event A, i.e., probability is nonnegative.
2. P(S) = 1, i.e., the probability of a certain event is equal to 1.
3. If A1, A2, … is a sequence of mutually exclusive events, the probability that A1 or A2 or … occurs equals the sum of their separate probabilities. Symbolically,
P(A1 ∪ A2 ∪ …) = P(A1) + P(A2) + …
Note: If events A1, A2, … are mutually exclusive, then n(A1) + n(A2) + … = n(A1 ∪ A2 ∪ …), since n(Ai ∩ Aj) = 0 for i ≠ j and i, j = 1, 2, … This shows that probability as defined in Section 10.5.1 satisfies Axiom 3.
10.5.3
Further Properties of Probability derived from Axioms
By using the three axioms of probability, we can derive more rules which the probability measure must satisfy.
1. If A1, A2, …, An are n mutually exclusive events, then the probability that A1 or A2 or … or An occurs equals the sum of their separate probabilities. Symbolically,
P(A1 ∪ A2 ∪ … ∪ An) = P(A1) + P(A2) + … + P(An)
This follows from the third Axiom of Probability. Applying this result to the individual outcomes of an experiment, the probability of any event A is given by the sum of the probabilities of the individual outcomes Ai comprising A. Symbolically,
P(A) = ∑ P(Ai), the sum being taken over all i
Further, if n = 2, i.e., if two events A1 and A2 are mutually exclusive, then the probability that A1 or A2 occurs equals the sum of their separate probabilities. Symbolically,
P(A1 ∪ A2) = P(A1) + P(A2)
This is known as the Law of Addition for two mutually exclusive events. Two consequences follow: an event which cannot occur has probability zero, and the probabilities that an event will occur and that it will not occur add up to 1. These are derived in 2 and 3 below.
2. As φ ∪ S = S, we have P(φ ∪ S) = P(S). Also, P(φ ∪ S) = P(φ) + P(S) from Axiom 3, since φ and S are mutually exclusive events. Thus P(φ) + P(S) = P(S). Since P(S) = 1 from Axiom 2, it follows that
P(φ) = 0
3. As A ∪ A′ = S, we have P(A ∪ A′) = P(S) = 1. Also, because A and A′ are mutually exclusive, from the third Axiom of Probability we have P(A ∪ A′) = P(A) + P(A′). Thus, P(A) + P(A′) = 1, or
P(A′) = 1 − P(A)
for any event A.
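The derived rules can be checked numerically on a small sample space. The sketch below assumes a single roll of a fair dice as the experiment; the helper name `prob` and the chosen event are illustrative and are not part of the course text.

```python
from fractions import Fraction

# Assumed experiment: one roll of a fair dice, each outcome with probability 1/6.
S = {1, 2, 3, 4, 5, 6}
p = {outcome: Fraction(1, 6) for outcome in S}

def prob(event):
    """P(A) as the sum of the probabilities of the outcomes comprising A."""
    return sum((p[o] for o in event), Fraction(0))

A = {2, 4, 6}          # the event "an even number"
A_complement = S - A   # A' = "not A"

print(prob(S))                        # Axiom 2: P(S) = 1
print(prob(set()))                    # Property 2: P(φ) = 0
print(prob(A) + prob(A_complement))   # Property 3: P(A) + P(A') = 1
```

Exact fractions are used so that the sums come out as exactly 1 or 0 rather than as rounded decimals.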
Example 10 A card is drawn from a well-shuffled pack of cards. Find the probability of drawing either an ace (A) or a king (B). The events A and B are mutually exclusive. Therefore the probability of drawing either an ace or a king in a single draw is
P(A ∪ B) = P(A) + P(B)
= 4/52 + 4/52
= 8/52 = 2/13
Example 11 A pair of dice is rolled once. Determine the probability of obtaining a total of 7. The total number of possible outcomes is 36, as any one of the 6 outcomes on the first dice can be combined with any of the 6 outcomes on the second dice. Assuming each of these 36 possible outcomes is equally probable, the probability of any individual outcome is 1/36. The probability of any event is therefore given by 1/36 times the number of individual outcomes comprising the event.
The sum of 7 points is obtained for the 6 individual outcomes: (1,6), (2,5), (3,4), (4,3), (5,2) and (6,1)
Thus if A is the event of obtaining a sum of 7 points then
P(A) = (1/36)(6) = 1/6
This can be clearly seen from Figure 10.2 as the sum of probabilities of points inside the dotted line. The following figure can also be used to determine the probabilities of any of the possible totals when two dice are rolled.
Figure 10.2 The 36 possible outcomes when two dice are rolled, plotted as a 6 × 6 grid of points (horizontal axis: First Dice, 1 to 6; vertical axis: Second Dice, 1 to 6)
Thus if B is the event of obtaining a total of 2 or 3, then the probability of B is the sum of the probabilities of the three circled points:
P(B) = (1/36)(3) = 1/12
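The counting argument of Example 11 can also be carried out by listing the 36 outcomes explicitly. The following is a minimal sketch, assuming fair dice; the function name `prob_total` is illustrative.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes, written as (first dice, second dice).
outcomes = list(product(range(1, 7), repeat=2))

def prob_total(*totals):
    """Probability that the sum of the two dice is one of `totals`."""
    favourable = [o for o in outcomes if sum(o) in totals]
    return Fraction(len(favourable), len(outcomes))

print(prob_total(7))      # 1/6, the event A of Example 11
print(prob_total(2, 3))   # 1/12, the event B above
```

The same function can be used to check the probability of any other total read off Figure 10.2.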
Activity 2
1. Using your answer to Question 1 of Activity 1, find the probability of obtaining an odd number.
2. Using your answer to Question 2 of Activity 1, find the probability of obtaining a good TV set.
3. In a given week the probability that the price of a product will increase (A), remain unchanged (B), or decline (C) is estimated to be 0.30, 0.20, and 0.50, respectively. What is the probability that in a given week the price of the product will
(a) increase or remain unchanged?
(b) change during the week?
4. The delivery of an item of raw material from a supplier may take up to five weeks from the time the order is placed. The probabilities of various delivery times are as follows:

Delivery Time        Probability
< 1 week             0.12
> 1, < 2 weeks       0.27
> 2, < 3 weeks       0.22
> 3, < 4 weeks       0.22
> 4, < 5 weeks       0.17
Total                1.00

What is the probability that a delivery will take the following times?
(a) Two weeks or less
(b) Three or four weeks
(c) More than four weeks
(d) More than two weeks
(e) More than three weeks
5. A pair of dice is rolled once. Determine the probability of obtaining:
(a) a total of 8
(b) a total of 9

10.6 GENERAL LAW OF ADDITION
General Law of Addition for any two events
For any two events A and B,
P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = probability that at least one of A and B occurs.
Recall from Section 10.3.2 (iii) that n(A ∪ B) = n(A) + n(B) − n(A ∩ B) for any two sets A and B. Thus, for any two events A and B,
P(A ∪ B) = n(A ∪ B) / n(S)
= [n(A) + n(B) − n(A ∩ B)] / n(S)
= n(A)/n(S) + n(B)/n(S) − n(A ∩ B)/n(S)
= P(A) + P(B) − P(A ∩ B)
Note: This law also applies to mutually exclusive events, because then A ∩ B = φ so that P(A ∩ B) = 0.
Example 12 From Example 10 above, what is the probability of drawing an ace (A) or a spade (C)? The events “ace” and “spade” are not mutually exclusive. Therefore, the probability of drawing an ace (A) or a spade (C), or both, in a single draw is
P(A ∪ C) = P(A) + P(C) − P(A ∩ C)
= n(A)/n(S) + n(C)/n(S) − n(A ∩ C)/n(S)
= 4/52 + 13/52 − 1/52
= 16/52
= 4/13
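The result of Example 12 can be confirmed by building the pack of cards as a set and counting directly. This is only an illustrative sketch; the rank and suit labels and the helper `prob` are assumptions made for the example.

```python
from fractions import Fraction

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["spades", "hearts", "diamonds", "clubs"]
deck = {(rank, suit) for rank in ranks for suit in suits}   # sample space S, n(S) = 52

aces = {card for card in deck if card[0] == "A"}            # event A
spades = {card for card in deck if card[1] == "spades"}     # event C

def prob(event):
    """Probability of an event when all 52 cards are equally likely."""
    return Fraction(len(event), len(deck))

direct = prob(aces | spades)                                   # count the union directly
by_law = prob(aces) + prob(spades) - prob(aces & spades)       # General Law of Addition
print(direct, by_law)   # 4/13 4/13
```

Subtracting P(A ∩ C) removes the ace of spades, which would otherwise be counted twice.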
Note: The Venn diagram can be used to show the union of two events, A and B, denoted by A ∪ B, the intersection of two events A and B, denoted by A ∩ B, the complement of event A, denoted by A ′ , etc., as shown earlier.
Activity 3
1. Out of 300 business students, 100 are enrolled in Marketing and 50 are enrolled in Finance. In fact, 30 of these students are enrolled in both Marketing and Finance.
(a) Draw a Venn diagram of the data.
(b) What is the probability that a randomly chosen student will be enrolled in either Marketing (M) or Finance (F) or both?
(c) What is the probability that a randomly chosen student will be enrolled in either Marketing (M) or Finance (F), but not both?
2. A hotel has a total of 75 rooms, of which 65 contain a radio. Of the 65 rooms with a radio, 10 have a refrigerator and 43 have a bath. In the entire hotel, 12 rooms have a refrigerator, but only one of these has neither a radio nor a bath. Of all the rooms with baths, 8 have both a radio and a refrigerator, 2 have neither a radio nor a refrigerator, and one has a refrigerator only. Represent the above information by means of a Venn diagram. Calculate the probability that
(a) a room contains a bath,
(b) a room does not contain either a refrigerator or a bath,
(c) a room contains a bath and a radio but no refrigerator.
10.7 CONDITIONAL PROBABILITY
Suppose A and B are two non-mutually exclusive events in a sample space S.
Figure 10.3 Venn diagram of two overlapping events A and B in a sample space S
As shown earlier,
P(A) = n(A)/n(S) and P(B) = n(B)/n(S)
If, however, we know that B has occurred, then the sample space is reduced to B, instead of the original S. Thus, if we are interested in knowing whether A will occur, given that B has occurred, then the sample space to be considered is the reduced sample space B. The probability that A will occur, given that B has occurred, denoted by P(A|B), is then given by
P(A|B) = n(A ∩ B)/n(B)
since n(A ∩ B), i.e. the number of outcomes common to both A and B, gives the number of outcomes in the sample space B that are favourable to the event A.
Consider the following example to illustrate the probability P(A|B).
Example 13 There are 200 applicants for a secretarial position in a large company. It is known that among the 200 applicants, some have had previous experience in secretarial work and some have had formal training in such work, as shown in Table 10.1:

Table 10.1
                          Formal      No Formal
Experience / Training     Training    Training     Total
Previous Experience         34          48           82
No Previous Experience      41          77          118
Total                       75         125          200
Let E denote the event of selection of an applicant with previous experience, and T denote the event of selection of an applicant with formal training. As can be seen from the table, E and T are non-mutually exclusive events since there are some applicants with both previous experience and formal training. If an applicant is randomly selected from these 200 applicants, the probability that the selected applicant has some previous experience is given by
P(E) = n(E)/n(S) = 82/200 = 0.41
and
P(T) = n(T)/n(S) = 75/200 = 0.375
It is assumed that each of the 200 applicants has the same chance of being selected. Suppose now that the management decides to limit the selection to only those applicants who have had some formal training. As a result of this decision, the number of applicants to be considered, i.e. the new sample space, is now reduced to 75, i.e., T. Assuming that each of these 75 has an equal chance of being selected,
P(E|T) = 34/75 = 0.45 (to two decimal places)
i.e.
P(E|T) = n(E ∩ T)/n(T)
This is called the Conditional Probability of selecting an applicant with previous experience given that the applicant has had some formal training. Note that this conditional probability can also be written as
P(E|T) = (34/200)/(75/200) = P(E ∩ T)/P(T)
Thus, P(E|T) is the ratio of the probability of selecting an applicant with previous experience and formal training to the probability of selecting an applicant with formal training. Generalising from the above example, it can be seen that for any two events A and B belonging to a given sample space S, the revised probability of A when it is known that B has occurred, called the conditional probability of A given B and denoted by P(A|B), is defined by the formula
P(A|B) = P(A ∩ B)/P(B), provided P(B) > 0
A and B are said to be dependent events if the probability of occurrence of one must be modified in the light of information as to whether or not the other event has taken place. Note that in Example 13, P(E) ≠ P(E|T), since P(E) = 0.41 and P(E|T) = 0.45.
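The calculations of Example 13 can be reproduced directly from the counts in Table 10.1. The sketch below is illustrative only; the dictionary layout and the variable names are assumptions, not part of the course text.

```python
from fractions import Fraction

# Counts from Table 10.1, keyed by (experience, training);
# a trailing ' marks the complement (no experience / no formal training).
counts = {
    ("E", "T"): 34, ("E", "T'"): 48,
    ("E'", "T"): 41, ("E'", "T'"): 77,
}
n_S = sum(counts.values())                        # 200 applicants
n_E = counts[("E", "T")] + counts[("E", "T'")]    # 82 with previous experience
n_T = counts[("E", "T")] + counts[("E'", "T")]    # 75 with formal training

p_E = Fraction(n_E, n_S)
p_T = Fraction(n_T, n_S)
p_E_and_T = Fraction(counts[("E", "T")], n_S)

# Conditional probability from the formula P(E|T) = P(E ∩ T) / P(T).
p_E_given_T = p_E_and_T / p_T

print(float(p_E))     # 0.41
print(float(p_T))     # 0.375
print(p_E_given_T)    # 34/75, about 0.45
```

Because P(E|T) differs from P(E), the events E and T are dependent, as noted above.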
Activity 4
1. For Example 13, find P(T|E), using both the table and the formula for the conditional probability. Explain what it means in terms of the data. What do you observe? Is P(T|E) the same as P(T)?
2. An electrical component consisting of two elements that operate in sequence will work only if both elements are good elements. From previous records it is found that 80 percent of the components produced work properly. Occasional tests on the first element indicate that 10 percent of these elements are likely to be defective. The second element cannot be tested separately. What is the probability that the second element of a component will be a good one if the first element is a good one?
3. Attempt Question 11.23 (b) in textbook (OJ).

10.8 LAW OF MULTIPLICATION OF PROBABILITY
If we multiply the formula for conditional probability by P(B) on both sides, we have
P(A ∩ B) = P(B) . P(A|B)
This is called the Law of Multiplication of probability. It enables us to calculate the probability that two events, A and B, will both occur. The formula states that the probability that two events will both occur is the product of the probability that one of the events will occur and the conditional probability that the other event will occur given that the first event has occurred (occurs, or will occur).
Note: This formula can also be written as
P(A ∩ B) = P(A) . P(B|A)
The law of multiplication can be generalised as follows:
P(A ∩ B ∩ C) = P(A ∩ B) . P(C|A ∩ B) = P(A) . P(B|A) . P(C|A ∩ B)
Example 14 A set of 10 spare parts is known to contain seven good parts (G) and three defective parts (D). Two parts are selected at random without replacement. Find the probability that both the parts drawn are good.
P(G1 ∩ G2) = P(G1) . P(G2|G1)
= (7/10) . (6/9)
= 42/90
= 7/15
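As a cross-check on Example 14, the same probability can be obtained by listing every ordered pair of parts that could be drawn without replacement and counting those in which both parts are good. This is a sketch under the stated assumption that every ordered pair is equally likely; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import permutations

parts = ["G"] * 7 + ["D"] * 3   # seven good parts, three defective

# Every ordered pair of distinct parts (first draw, second draw): 10 x 9 = 90 pairs.
draws = list(permutations(range(len(parts)), 2))
both_good = [d for d in draws if parts[d[0]] == "G" and parts[d[1]] == "G"]

by_counting = Fraction(len(both_good), len(draws))
by_formula = Fraction(7, 10) * Fraction(6, 9)   # P(G1) . P(G2|G1)

print(by_counting, by_formula)   # 7/15 7/15
```

Both routes agree because 7 × 6 = 42 of the 90 ordered pairs consist of two good parts.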
Activity 5
1. For Question 1 of Activity 3, determine the conditional probability that a randomly chosen business student is enrolled in Finance given that the student is enrolled in Marketing.
2. Of 12 letters kept in a file, 4 contain typing errors.
(a) If a clerk randomly selects two of these letters (without replacement), what is the probability that neither letter will contain typing errors?
(b) If the clerk samples three letters, what is the probability that none of the letters have typing errors?

10.9 INDEPENDENT EVENTS
When information about the occurrence of B has no effect on the probability of occurrence or non-occurrence of A (or vice versa), then A and B are said to be independent events. In this case, the conditional probability P(A|B) is the same as the unconditional probability P(A).
Thus, when
P(A|B) = P(A) or P(B|A) = P(B) or P(A ∩ B) = P(A) . P(B)
A and B are independent events. The probability of the occurrence of both A and B, when they are independent, is therefore the product of their separate probabilities.
Note: Two mutually exclusive events that each have nonzero probability cannot be independent, because when A and B are mutually exclusive, P(A ∩ B) = 0, whereas P(A) . P(B) > 0.
Example 15 A card is drawn at random from a pack of well-shuffled cards. Let A denote an ace and B denote a spade. Show that events A and B are independent.
P(A) = 4/52 and P(B) = 13/52, so that
P(A) . P(B) = (4/52) . (13/52) = 1/52
Also, P(A ∩ B) = 1/52, since there is only one card which is both an ace (A) and a spade (B). Thus
P(A ∩ B) = P(A) . P(B)
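The product test for independence is easy to apply mechanically. The sketch below applies it to Example 15 and, for contrast, to the events E and T of Example 13; the function name `is_independent` is an illustrative assumption.

```python
from fractions import Fraction

def is_independent(p_a, p_b, p_a_and_b):
    """A and B are independent exactly when P(A ∩ B) = P(A) . P(B)."""
    return p_a_and_b == p_a * p_b

# Example 15: ace (A) and spade (B) in a single draw from 52 cards.
print(is_independent(Fraction(4, 52), Fraction(13, 52), Fraction(1, 52)))      # True

# Example 13: previous experience (E) and formal training (T), from Table 10.1.
print(is_independent(Fraction(82, 200), Fraction(75, 200), Fraction(34, 200)))  # False
```

Exact fractions are used because comparing rounded decimals could wrongly report a small difference as equality, or the reverse.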
Activity 6
1. For Question 1 of Activity 3, apply an appropriate test to determine if Marketing and Finance are independent events.
2. Ex. 11.3 (b) in your textbook (OJ).
3. In general, the probability that a client will make a purchase when he is contacted by a salesman is P = 0.40. If a salesman selects two clients randomly from a file and contacts them, what is the probability that both the clients will make a purchase?
4. Attempt Questions 11.9 and 11.26 in your textbook.
5. In a certain hospital, 43% of the patients examined are found to suffer from a heart problem and 17% from a respiratory problem. Furthermore, it is observed that 52% of patients examined suffer from either one or both of these problems. Find:
(a) the probability that a patient has neither a heart problem nor a respiratory problem.
(b) the probability that a patient has a respiratory problem but not a heart problem.
(c) the probability that a patient has a respiratory problem given that he/she has no heart problem.
(d) Determine whether there is any association between the event of suffering from a heart problem and that of suffering from a respiratory problem. Explain briefly the implication of your answer.
10.10 TREE DIAGRAMS
A tree diagram is very useful as a method of displaying the possible events associated with sequential observations, or sequential trials, e.g. the events associated with tossing a coin twice.
A tree diagram shows the outcomes of successive trials, joint events and the probabilities of the joint events. Since the joint events thus obtained are exhaustive and mutually exclusive, the sum of the probabilities of all joint events is 1.
Example 16 Construct a tree diagram to represent the sequential sampling process in the problem of Example 14.

First part       Second part      Joint Event     Probability
G1 (7/10)        G2 (6/9)         G1 ∩ G2         42/90
                 D2 (3/9)         G1 ∩ D2         21/90
D1 (3/10)        G2 (7/9)         D1 ∩ G2         21/90
                 D2 (2/9)         D1 ∩ D2          6/90
                                  Total           90/90 = 1

Figure 10.4 Tree Diagram
Note: The Tree Diagram of Example 16 can be used to calculate conditional probabilities. Thus the probability that the second spare part is good given that the first spare part is good is given by
P(G2|G1) = 6/9
Similarly,
P(G2|D1) = 7/9
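The joint probabilities at the ends of the branches in Figure 10.4 are obtained by multiplying along each path. The following sketch stores the branch probabilities from the figure in dictionaries (an assumed layout chosen for illustration) and checks that the joint probabilities add up to 1.

```python
from fractions import Fraction

# Branch probabilities read from Figure 10.4.
first = {"G1": Fraction(7, 10), "D1": Fraction(3, 10)}
second_given_first = {
    "G1": {"G2": Fraction(6, 9), "D2": Fraction(3, 9)},
    "D1": {"G2": Fraction(7, 9), "D2": Fraction(2, 9)},
}

# Law of Multiplication along each path: P(first ∩ second) = P(first) . P(second|first).
joint = {
    (f, s): first[f] * p
    for f, branches in second_given_first.items()
    for s, p in branches.items()
}

for event, p in joint.items():
    print(event, p)   # fractions are shown in lowest terms, e.g. 42/90 appears as 7/15

total = sum(joint.values())
print(total)          # 1
```

Note that `Fraction` reduces each result automatically, so 21/90 is printed as 7/30 and 6/90 as 1/15.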
Activity 7
1. Construct a tree diagram to represent the possible events associated with three tosses of a fair coin.
2. For Example 16, find the conditional probabilities P(D2|G1) and P(D2|D1).
10.11 JOINT PROBABILITY TABLES
A joint probability table is a table in which all possible events for one variable are listed as column headings, all possible events for a second variable are listed as row headings, and the value entered in each resulting cell is the probability of each joint occurrence. A table of joint-occurrence frequencies which can be used as the basis for constructing a joint probability table is called a contingency table.
Example 17 The Table below is a contingency table.

Table 10.2 Number of Boxes of 100 units containing Defective Electron Tubes
                       No. of Defective Tubes
Firm                 0       1       2    3 or more    Marginal Total
Supplier A          500     200     200      100           1000
Supplier B          320     160      80       40            600
Supplier C          600     100      50       50            800
Marginal Total     1420     460     330      190           2400
The joint probability table is as shown below.

Table 10.3 Joint Probability Table for Boxes of 100 units containing Defective Electron Tubes
                             No. of Defective Tubes
Firm                       0            1            2         3 or more     Marginal Probability
Supplier A              500/2400     200/2400     200/2400     100/2400      1000/2400
Supplier B              320/2400     160/2400      80/2400      40/2400       600/2400
Supplier C              600/2400     100/2400      50/2400      50/2400       800/2400
Marginal Probability   1420/2400     460/2400     330/2400     190/2400      2400/2400 = 1

Note:
A marginal probability is so named because it is a marginal total of a column or a row. The marginal probabilities are unconditional probabilities of particular events. For example,
P(A) = 1000/2400, P(1) = 460/2400, etc.
Conditional probabilities can now be calculated from the above table. For example,
P(2|B) = P(2 ∩ B)/P(B) = (80/2400) ÷ (600/2400) = 80/600
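The step from the contingency table (Table 10.2) to the joint probability table (Table 10.3) amounts to dividing every cell by the grand total. The sketch below shows one way of doing this; the dictionary layout and the names `joint`, `p_firm` and `p_defects` are assumptions made for illustration.

```python
from fractions import Fraction

columns = ["0", "1", "2", "3 or more"]   # number of defective tubes
counts = {                               # rows of Table 10.2
    "A": [500, 200, 200, 100],
    "B": [320, 160, 80, 40],
    "C": [600, 100, 50, 50],
}
grand_total = sum(sum(row) for row in counts.values())   # 2400 boxes

# Joint probabilities: each cell of the contingency table divided by the grand total.
joint = {
    (firm, col): Fraction(n, grand_total)
    for firm, row in counts.items()
    for col, n in zip(columns, row)
}

# Marginal probabilities: row sums and column sums of the joint probabilities.
p_firm = {firm: sum(joint[(firm, c)] for c in columns) for firm in counts}
p_defects = {c: sum(joint[(firm, c)] for firm in counts) for c in columns}

# Conditional probability P(2|B) = P(2 ∩ B) / P(B).
p_2_given_B = joint[("B", "2")] / p_firm["B"]

print(p_firm["A"])      # 5/12, i.e. 1000/2400
print(p_defects["1"])   # 23/120, i.e. 460/2400
print(p_2_given_B)      # 2/15, i.e. 80/600
```

The grand total of 2400 cancels in the conditional probability, which is why P(2|B) reduces to 80/600.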
Activity 8
1. For Example 17 above, calculate the following probabilities.
(a) If one box had been selected at random, what is the probability that
(i) it came from supplier B?
(ii) it would contain two defective tubes?
(iii) it would have no defective tubes and would have come from supplier A?
(b) Given that a box selected at random came from supplier B, what is the probability that it contained one or two defective tubes?
(c) If a box came from supplier A, what is the probability that the box would have two or fewer defective tubes?
2. Attempt Question 11.8 in your textbook (OJ).
3. The contingency table below describes a sample of 350 people who made a purchase in a large store selling sports shoes, according to age and gender.

Customers in a Sports Shoes Store, by Age and Gender
                      Gender
Age              Male     Female     Total
Under 30          125       100       225
30 and over        75        50       125
Total             200       150       350
Source: Survey Report of the Store

(a) Construct the joint probability table for the above data.
(b) If one customer had been selected at random, what is the probability that the customer selected was
(i) under 30?
(ii) a female?
(iii) a male 30 and over?
(c) Given that the customer selected was under 30, what is the probability that the customer was a female?
(d) Given that the customer selected was a male, what is the probability that the customer was 30 and over?
10.12 SUMMARY
In this unit, you have learnt about the probability of occurrence of one or more uncertain events. You should now have a clear understanding of the different types of events and the calculation and interpretation of their probabilities. You have also learnt about the use of Sets, the different laws of probability, Venn Diagrams, Tree Diagrams and Joint Probability Tables.
UNIT 11 DATA COLLECTION II
Unit Structure
11.0 Overview
11.1 Learning Objectives
11.2 Sample Design
11.2.1 Introduction
11.2.2 Requirements of a Good Sample
11.2.3 The Importance of Random Selection
11.2.4 Representativeness
11.2.5 Table of Random Numbers
11.2.6 Sampling Frames
11.2.7 Methods of Random Sampling
11.2.7.1 Simple Random Sampling
11.2.7.2 Systematic Sampling
11.2.7.3 Stratified Random Sampling
11.2.7.4 Cluster Sampling
11.2.8 Sample Size
11.2.9 Quota Sampling
11.3 The Questionnaire
11.3.1 Question Construction
11.3.2 Concluding Remarks
11.4 Summary

11.0 OVERVIEW
In unit 2, you were introduced to data collection. In this unit we take up two very important aspects of data collection for further discussion. We repeat the cautionary note which appeared at the beginning of Unit 2. The material in OJ on sampling (Chapter 15) is not considered appropriate for this course. However, you may find Chapter 16 of OJ useful supplementary reading.
11.1 LEARNING OBJECTIVES
When you have successfully completed this Unit, you should be able to do the following:
1. Explain the importance of random selection
2. Use a table of random numbers to draw a random sample
3. Explain the strengths and weaknesses of the following sample designs and be able to apply them in simple situations:
(i) simple random
(ii) systematic
(iii) stratified random
(iv) cluster sample
4. Explain the strengths and weaknesses of quota sampling
5. Explain the general principles of questionnaire design and the precautions to be applied in the wording of questions, and be able to construct a simple questionnaire.

11.2 SAMPLE DESIGN
11.2.1 Introduction
In unit 2, we noted that the idea of studying or examining a part in order to learn about the whole is familiar and often applied in everyday life. We also noted that sampling has a number of advantages over exhaustive studies. However, we pointed out that the findings relating to a sample are only generalisable to the whole population provided the sample has been selected according to certain principles. Here we discuss the basic principles of valid sample design and the considerations that govern the choice among alternative designs.
11.2.2 Requirements of a Good Sample
The dual characteristics of a good sample are:
(i) randomness
(ii) representativeness
These two notions will be elaborated respectively in the following two subsections.

11.2.3 The Importance of Random Selection
Selection on grounds of convenience is not appropriate
Suppose that a number of people are assembled in a large theatre, say, the University auditorium, for a lecture and it is desired to draw a sample from the audience for the purpose of eliciting their views about the presentation. One way that might suggest itself to us would be to simply select any convenient group, say, people in the front row. Now, it is not hard to think of reasons why people in the front row might not be representative of the whole audience. Can you think of one or two? This method of selection is therefore not appropriate.

Haphazard selection is also not appropriate
An alternative method that might suggest itself to us would be to stand on the rostrum and point “haphazardly” at persons to form part of the sample. It is perhaps not as easy to find fault with this approach, but it is nevertheless flawed. Can you think why? The reason is that the individual doing the selection may, for example, have a preference, conscious or unconscious, for young persons, so that once again the sample would not be representative of the whole population. It has been found, time and again, that when the task of picking a sample is left to a human being, biases tend to creep in, one way or another. Even if a particular researcher felt that he or she was free from biases or prejudices of any sort and could therefore safely proceed personally to the selection of a required sample, it would still be impossible for him or her to prove to the rest of the world that the sample was free from any bias whatsoever. Findings based thereon and extended to the whole population would be subject to challenge on grounds of subjectivity.

A Random Selection Procedure
Suppose that, instead of the above, all members of the audience are asked to write their names on bits of paper which are then crumpled and placed in a basket. The bits of paper are mixed thoroughly by shaking the basket and a sample is then drawn by picking out some bits of paper from the basket. Now this is a truly random procedure which is not subject to biases arising in the way described above. This method of carrying out random selection is not very practical, especially if the size of the population is large. However, we shall see that there exist other, more practical ones.
Why is random selection required?
In spite of its being random, the procedure just described could nevertheless occasionally yield samples similar to the ones obtainable under either the first or second procedures described above. So what has been gained? There are three very important advantages:
(i) With the first two procedures, any bias present would persist, always in the same direction, even if the selection process were repeated many times. We refer to such biases as systematic biases. For example, if an assistant was asked to select animals for an experiment and, for one reason or another, he tended to over-represent larger animals in his selection, this bias would persist in repetitions of the selection with the same assistant. With the third procedure described, there is no systematic bias. In a single sample, larger animals may well be over-represented, but in repetitions of the selection the biases will not always be in the same direction but will rather tend to cancel out.
(ii) Even more interesting is the fact that, with the third procedure described, as the sample size increases, the risk of having an unrepresentative sample decreases. On the other hand, the potential biases in the first two procedures do not diminish with increases in the sample size. Can you think why?
(iii) When sampling has been carried out by a random procedure, it is possible to assess how far the result based on the sample reflects the corresponding true population characteristic. In other words, it is possible to indicate the likely margins of error in the sample result. (This is done through the notion of confidence intervals, by application of sampling theory, and is outside the scope of this course.) No such margins can be quoted when sampling is non-random.
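Points (i) and (ii) can be illustrated by a small simulation. The sketch below is purely hypothetical: the population of 1000 animal weights is invented, and the biased assistant is modelled, as an assumption, by someone who looks at two animals at a time and always keeps the heavier one. None of the numbers come from the course text.

```python
import random
import statistics

rng = random.Random(2008)  # fixed seed so that the run is repeatable

# Hypothetical population: weights (kg) of 1000 animals.
population = [rng.gauss(50, 10) for _ in range(1000)]
true_mean = statistics.mean(population)

def random_selection(n):
    """Mean weight of a simple random sample of n animals."""
    return statistics.mean(rng.sample(population, n))

def biased_selection(n):
    """Mean weight when the selector compares two animals at a time
    and always keeps the heavier one (an assumed model of the bias)."""
    return statistics.mean(max(rng.sample(population, 2)) for _ in range(n))

def average_error(selection, n, repetitions=200):
    """Average of (sample mean - population mean) over repeated selections."""
    return statistics.mean(selection(n) - true_mean for _ in range(repetitions))

for n in (10, 100):
    print(n,
          round(average_error(random_selection, n), 2),
          round(average_error(biased_selection, n), 2))
```

In a typical run the average error of the random procedure is close to zero for both sample sizes, whereas the error of the biased procedure stays well above zero and does not shrink as the sample size grows.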
Definition of random selection
Definition: Random (or probability) sampling is a method of sampling where every member of the target population has a known, nonzero probability of selection.
Note that equal chance of selection is not a requirement of random selection. In fact, as we shall see later, there are instances where there are valid reasons for wanting certain sections of a population to be over-represented and others to be under-represented. So long as everyone has a chance of being selected and that chance is known, such over- or under-representation is not a problem and can be compensated for at the analysis stage by a technique known as reweighting.

11.2.4 Representativeness
As we have seen, although randomisation eliminates systematic and persistent biases, it does not guarantee a representative sample, particularly when the sample is small. For example, consider a large population made up of men and women in equal proportions. If we draw a sample of ten persons from this population by simple random sampling, we may very well end up with eight men and two women. This is clearly a sample that is unrepresentative in terms of gender. If we had drawn a sample of 100, the risk of having a similarly unrepresentative sample, i.e. 80 men and 20 women, would be considerably less. However, whatever the size of our sample, we could easily have forced it to be representative on the gender criterion by selecting half of our sample from men and the other half from women.
The objective of sampling is to try to capture in the sample the variation in the target population in respect of the characteristic or characteristics under study. To be certain of doing this requires prior knowledge of the distribution of such characteristics in the population which is, of course, not available. What we can do, however, is to ensure representativeness in terms of other known characteristics of the population, such as age, gender, occupation, etc., which we suspect may be correlated with the characteristic or characteristics under study. This is the idea underlying stratified sampling, which is discussed in 11.2.7.3 below. Randomisation alone does ensure a valid sample and produces valid estimates with margins of error that can be calculated by the application of statistical theory. Stratified sampling using relevant stratification factors gives an additional guarantee of representativeness, resulting in smaller margins of error.
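The claim that a lopsided sample is much less likely when the sample is larger can be quantified. The sketch below assumes that the population is large enough for each selection to be treated as an independent draw with probability 0.5 of picking a man, so that the number of men in the sample follows a binomial distribution; the function name `prob_at_least` is illustrative.

```python
from math import comb

def prob_at_least(k, n, p=0.5):
    """P(at least k men in a sample of n), assuming a binomial model
    with probability p of selecting a man on each draw."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k, n + 1))

print(prob_at_least(8, 10))     # 8 or more men out of 10
print(prob_at_least(80, 100))   # 80 or more men out of 100
```

Under these assumptions the first probability is 56/1024, or about 5.5%, while the second is smaller than one in a hundred million.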
11.2.5 Table of Random Numbers
Drawing a random sample by writing names on bits of paper, crumpling the latter and dropping them in a basket, mixing them thoroughly and then picking out the required number of names is, as we pointed out, not very practical. It may be acceptable for a small sample from a small population but it would be tedious, for instance, for a sample of, say, 200 from the student population of the University of Mauritius. A more practical way of drawing random samples is to use a table of random numbers. The numbers tabulated in a random number table have been generated by a truly random process. No pattern can be discerned in such a table when examining the succession of numbers, in whatever direction we proceed with the examination, whether horizontally, vertically or diagonally. However, if we analyse a sufficiently large block of the table, it will be found that the latter has certain properties. Thus, for example, the frequencies of all digits would be found to be similar. In other words, there is no bias towards any digit or sequence of digits. An extract from a random number table is presented below:
92294 46614 50948 64886 20002 97365 35774 16249 75019 21145 05217 47286 83091 91530 36466 39981 62481 49177 85966 62800 70326 84740 62660 77379 41180 10089 41757 78258 96488 88629
Note that the numbers are presented in groups of five separated by spaces for greater readability. The spaces have no significance whatsoever. Suppose we need to draw a simple random sample of 200 from a population of 2000. First we number the individual members of the population from 0001 to 2000. We would need a bigger table than the extract presented above, but for the sake of illustration suppose we start at its beginning and read the numbers horizontally. (We can start anywhere we like in a random number table and it is good practice not to always start at the same spot.) The first two four-digit numbers (9229 and 4466) lie outside the range 0001 to 2000 and are ignored. The third one is within the range and selects the 1450th individual on our list. The next two four-digit numbers (9486 and 4886) are again out of range, but the following one selects the 2000th individual.
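The reading procedure just described can be written out step by step. The sketch below applies it to the extract above; the function name `draw_from_table` is illustrative, and the decision to skip a number that has already been selected is an assumption (the text does not discuss repeats).

```python
extract = (
    "92294 46614 50948 64886 20002 97365 35774 16249 75019 21145 "
    "05217 47286 83091 91530 36466 39981 62481 49177 85966 62800 "
    "70326 84740 62660 77379 41180 10089 41757 78258 96488 88629"
)

def draw_from_table(table, population_size, sample_size, width=4):
    """Read successive `width`-digit numbers from the table and keep those
    between 1 and population_size, skipping any number already selected."""
    digits = table.replace(" ", "")   # the spaces have no significance
    selected = []
    for i in range(0, len(digits) - width + 1, width):
        number = int(digits[i:i + width])
        if 1 <= number <= population_size and number not in selected:
            selected.append(number)
        if len(selected) == sample_size:
            break
    return selected

print(draw_from_table(extract, population_size=2000, sample_size=200))
# [1450, 2000, 1915, 1491, 100]
```

The extract yields only five usable numbers, which shows why a much bigger table is needed to reach a sample of 200: with a population of 2000, about four out of every five four-digit numbers fall outside the range and are discarded.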
11.2.6 Sampling Frames
A list from which we select a sample is called a sampling frame. Sampling frames are often not available. And when they are, they are not always perfect. They may be subject to problems of inaccuracies, omission, duplication, inclusion of irrelevant individuals etc. When no sampling frame exists or available ones carry imperfections which cannot be remedied, it may be necessary to compile one. This can be both costly and time consuming. Fortunately there are methods of random sampling which require no list or only a partial list. This will be elaborated upon below.

11.2.7 Methods of Random Sampling
It is possible to devise various alternative random sample designs. These alternative designs will vary in terms of precision achievable, convenience and cost. In the following subsections we discuss several basic designs. Sample designs for large surveys, e.g. national surveys, may be more complex than any of these basic designs but will incorporate the same ideas: they may be made up of several stages and involve combinations of the strategies of the basic designs discussed here. The choice of sample design in a practical situation is a question of choosing the design that gives maximum precision subject to the constraints of cost and resources (including the availability of a sampling frame). In studying the alternative basic designs described below, you should ensure that you understand
(i) how to apply each of them
(ii) their relative advantages and disadvantages

11.2.7.1 Simple Random Sampling
Definition: Simple random sampling (s.r.s.) is a method of random selection where every subset of the chosen sample size has the same chance of selection.
It can be proved that s.r.s. has the property that every member of the sampled population has the same chance of selection, but this property is not unique to s.r.s. and therefore should not be used as a definition thereof.
The application of simple random sampling is very simple. We need a list of all members of the target population. This list is numbered serially (e.g. if the population consists of 10,000 individuals, we number them 00001, 00002, 00003 etc., up to 10000). Suppose we need a sample of size 400. We select 400 random numbers (lying between 00001 and 10000) from a table of random numbers using the method described in 11.2.5 above.
Simple random sampling is important for two main reasons:
(i) it serves as a yardstick against which to compare other sample designs in terms of precision, convenience and cost.
(ii) it is a component of other sample designs. This latter statement will become clear as we discuss these other designs.
Note that the implementation of simple random sampling requires a sampling frame. Another important point about simple random sampling is that if the population of interest is geographically scattered, the sample selected by this method will also be physically dispersed. This is not only inconvenient from the point of view of organisation of the field work; it is also costly as it inflates field costs, e.g. the travel costs of field staff if face-to-face interviewing is used. S.r.s. gives more precision than cluster sampling but less than stratified sampling.
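In practice, a computer's pseudo-random number generator is commonly used in place of a printed table. The sketch below draws a simple random sample of 400 serial numbers from a population of 10,000 using Python's standard `random.sample`, which selects without replacement; the seed value is an arbitrary assumption made only so that the run can be repeated.

```python
import random

rng = random.Random(42)   # arbitrary seed, used only to make the run repeatable

population = list(range(1, 10001))    # serial numbers 00001 to 10000
sample = rng.sample(population, 400)  # every subset of size 400 is equally likely

print(len(sample))          # 400
print(sorted(sample)[:5])   # the five smallest serial numbers selected
```

The serial numbers in `sample` are then looked up in the sampling frame to identify the individuals selected.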
11.2.7.2 Systematic Sampling
Drawing numbers from a random number table, although more practical than picking slips of paper out of a basket, can nevertheless prove tedious if the sample to be drawn is large. Given a sampling frame, it is possible to select a sample in a very practical way, as follows. Suppose we have a list of N people and wish to select a sample of n from it. Regard the N units as arranged in a circle, and let k be the nearest integer to N/n. We first select a random number between 1 and N; this identifies our first selection. We then select every kth individual after that first selection, going round the circle, until n units have been selected.
For example, suppose we have a population of 2000 and we need a sample of 175. Then N/n = 11.43, so k = 11. We select a random number (from a random number table) between 0001 and 2000. Suppose this number is 0379. Then the first selection into our sample is the 379th individual from the start of our (original) list. Referring to our imaginary circle, we then select every 11th individual after that first selection (the 390th, the 401st, and so on), going round the circle until 175 units have been selected.
If our list of individuals is not arranged in any particular order, then systematic sampling is almost (but not quite) equivalent to simple random sampling. The difference is that, unlike in s.r.s., not all subsets are possible. (Think why.) However, this is not a very serious drawback, and the convenience of systematic sampling constitutes a great practical advantage. If the list is arranged, say, by age starting from the youngest, then the systematic sampling procedure spreads the sample across ages more evenly than simple random sampling would normally do. Systematic sampling in these conditions becomes almost equivalent to stratified sampling (discussed in the next subsection) with stratification according to age. The procedure, however, should be avoided if there are cyclical patterns in the list, as may happen if there are several subgroups in the list with each subgroup ordered by age. The selections may then coincide with a particular age range, thus biasing the sample.
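The circular procedure described above can be sketched in a few lines of Python. This is only an illustration: the function name `circular_systematic_sample` and its parameters are assumptions made for this sketch, and a computer-generated random start stands in for the random number table. The sketch also assumes that going round the circle never returns to a unit already selected, which holds for the example of N = 2000 and n = 175.

```python
import random


def circular_systematic_sample(N, n, start=None, rng=random):
    """Return the positions (1..N) of n units chosen by circular systematic sampling."""
    # k is the nearest integer to N/n (halves rounded up), and at least 1
    k = max(1, int(N / n + 0.5))
    # Random start between 1 and N, unless one is supplied (e.g. read from a table)
    if start is None:
        start = rng.randint(1, N)
    # Step round the "circle" in jumps of k, wrapping past N back to 1
    return [((start - 1 + i * k) % N) + 1 for i in range(n)]


# The example from the text: N = 2000, n = 175, random start 0379
sample = circular_systematic_sample(2000, 175, start=379)
print(sample[:3])    # first three selections
print(sample[-1])    # last selection, after wrapping round the circle
```

With the start fixed at 379, the selections are 379, 390, 401 and so on; once the end of the list is passed, counting continues from the beginning of the list.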
11.2.7.3
Stratified Random Sampling
Stratified sampling consists of dividing the target population into subgroups called strata and taking a separate sample within each stratum. The sample within each stratum is usually drawn by simple random sampling. Stratification, i.e. the division of the target population into groups, must be based on a criterion or on criteria relevant to the survey topic. For example, it is pointless to ensure representativeness in terms of religion if respondents’ answers are unlikely to be influenced by religion. But provided the stratification criterion or criteria are appropriate, the representativeness (on these criteria) achieved by stratified sampling produces greater precision than simple random sampling. In practice this is reflected in smaller margins of error.
The data prerequisites for implementing stratified sampling are, however, more demanding than for simple random sampling. We need not only a list of every member of the target population but also, in respect of each such member, information on the stratification criterion. For example, if we are stratifying by educational level, we need to know every individual’s educational level.
There are a number of options for allocating the total sample to the various strata, i.e. for deciding how many to sample from each stratum. One simple option is to share the total sample among the various strata in proportion to their sizes (i.e. a stratum which has 40% of the population gets 40% of the sample, another which has 25% of the population gets 25% of the sample, and so on). We refer to this as stratified sampling with proportionate allocation. This allocation has the advantage that it gives every member of the population the same chance of selection and hence eliminates the need for reweighting at the analysis stage. However, there are sometimes good reasons for not using proportionate allocation. For instance, if it is desired to make comparisons among strata, and certain strata are small, it is possible that proportionate allocation will produce samples from such strata that are too small for meaningful comparisons.
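Proportionate allocation amounts to a simple calculation, sketched below in Python. The stratum names and sizes are hypothetical, chosen only to match the 40% and 25% shares mentioned above, and the function name `proportionate_allocation` is an assumption of this sketch. Rounding each share to a whole number can make the parts add up to slightly more or less than the intended total, so the result should be checked.

```python
def proportionate_allocation(stratum_sizes, n):
    """Share a total sample of n among strata in proportion to their sizes."""
    total = sum(stratum_sizes.values())
    # Each stratum receives the same fraction of the sample as it has of the population
    return {name: round(n * size / total) for name, size in stratum_sizes.items()}


# Hypothetical population of 2000 in three strata (40%, 25% and 35%)
strata = {"A": 800, "B": 500, "C": 700}
allocation = proportionate_allocation(strata, 200)
print(allocation)                 # sample size per stratum
print(sum(allocation.values()))   # check against the intended total of 200
```

Here every stratum is sampled at the same rate (1 in 10), which is why no reweighting is needed at the analysis stage.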
11.2.7.4
Cluster Sampling
Sometimes populations occur in, or can be conveniently divided into, groups or clusters. For example, school children are located in schools. Households in a country can be grouped into geographical clusters made of blocks of houses bounded by streets or natural boundaries. This fact provides an alternative random sampling strategy, which consists of drawing up a list of clusters that together comprise the whole population and then selecting a sample of clusters. This can be done by simple random sampling. We can then include in our sample all individuals in each selected cluster. This is referred to as sampling of whole clusters. It gives an equal chance of selection to every member of the population.
The method of sampling just described has the great advantage that it does not require a list of all members of the target population. It only requires a list of the clusters, and this is usually not hard to obtain. Furthermore, it concentrates the field work (think why, and contrast with s.r.s.!) and this reduces field costs. However, if the clusters are of unequal size, the method provides no control over the overall sample size. Note that sample size has a direct bearing on costs. In addition, for the same size of sample, the precision with cluster sampling is less than with either s.r.s. or stratified sampling.
Instead of including every individual in the selected clusters in the sample, it is possible to take only a sample from each selected cluster. This method is referred to as cluster sampling with subsampling. Note that it requires lists of individuals, but only for the selected clusters. Various strategies are available for both the selection of clusters and the selection of individuals within clusters.
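Sampling of whole clusters can be sketched as follows. The schools and pupil labels are invented for illustration, and the function name `whole_cluster_sample` is an assumption of this sketch. Only the list of clusters is needed to make the selection, and the overall sample size depends on which clusters happen to be drawn.

```python
import random


def whole_cluster_sample(clusters, m, rng=random):
    """Select m clusters by simple random sampling and include all their members."""
    # Only the list of cluster names is needed to make the selection
    chosen = rng.sample(sorted(clusters), m)
    # Every individual in each selected cluster enters the sample
    members = [person for name in chosen for person in clusters[name]]
    return chosen, members


# Hypothetical schools of unequal size (pupil labels are placeholders)
schools = {
    "School 1": ["p1", "p2", "p3"],
    "School 2": ["p4", "p5"],
    "School 3": ["p6", "p7", "p8", "p9", "p10"],
    "School 4": ["p11"],
}
chosen, members = whole_cluster_sample(schools, 2)
print(chosen, len(members))   # the sample size varies with the clusters drawn
```

Running the last three lines several times shows the lack of control over sample size when clusters are unequal.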
Activity 1
Explain how you would select a random sample of students from the University of Mauritius for the purpose of eliciting their views on the University Library services
(a) by simple random sampling
(b) by stratified random sampling (choose appropriate stratification criteria and justify your choice)
(c) by cluster sampling
(d) by systematic sampling
11.2.8
Sample Size
What size of sample do I need for my survey? This is an often-asked question. There is no magic answer. Certain items of information are needed before an answer can be attempted. We need an indication of the acceptable margins of error, of the maximum acceptable risk of exceeding these margins, and some advance information about the population to be sampled. The latter requirement may seem difficult to satisfy, but it is often possible to make certain estimates or assumptions. Given these items of information, there exists theory that enables one to determine the required sample size, but this is beyond the scope of this course.
Suppose that you are doing a survey to find out what percentage of your target population would be interested in purchasing a certain product, and you would like to be quite confident that your survey finding is not off the mark by more than 5 percentage points on either side (i.e. if your survey finds that 50% are interested, for example, you want to be reasonably sure that the true percentage lies between 45% and 55%). Your required sample size would be of the order of 400. If you can relax your error margins to 10 percentage points on either side, then your required sample size would be of the order of 100.
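For readers curious about where figures of this order come from, the sketch below uses a standard textbook approximation for estimating a percentage by simple random sampling from a large population: n = z²p(1 − p)/e². This formula is not part of the course material. The 95% confidence level (z = 1.96) and the "worst case" p = 0.5 are assumptions of the sketch, as is the function name.

```python
import math


def sample_size_for_proportion(margin, p=0.5, z=1.96):
    """Approximate sample size needed to estimate a proportion within +/- margin.

    Assumes simple random sampling from a large population; z = 1.96 corresponds
    to roughly 95% confidence and p = 0.5 is the most cautious advance guess.
    """
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)


print(sample_size_for_proportion(0.05))   # margin of 5 percentage points
print(sample_size_for_proportion(0.10))   # margin of 10 percentage points
```

Under these assumptions the two results are 385 and 97, in line with the "order of 400" and "order of 100" quoted above. Note that halving the margin of error roughly quadruples the sample size.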
11.2.9
Quota Sampling
Quota sampling is a non-random method of selection. It attempts to ensure representativeness on criteria that are considered important in the same way that stratified random sampling does, i.e. by ensuring that the proportions in the various strata in the sample are the same as in the population. However, no sampling frame is used and the selection of respondents is left to the interviewers. Thus, for example, if it is desired to ensure representativeness by age group and the target population consists of 45% under 25, 35% between 25 and 44, and 20% aged 45 and over, then interviewers would be sent out with “quotas” relating to the number of persons in the various age groups that each one should interview. There is then the danger that, although the sample would be representative in terms of age group, it could be biased in terms of other characteristics. Can you think why? Certain precautions can be taken, but the risk can neither be eliminated nor controlled statistically (by specifying margins of error as in the case of random sampling).
Opinion poll organisations usually use quota sampling because of its low cost and convenience. Over the years these organisations have considerably refined the procedure by incorporating more stratification criteria and more controls so that, nowadays, they are usually able to produce reliable results. Surveys using quota sampling have become a common feature of modern society, as the general public is very fond of information on a variety of subjects. The results of such surveys often appear in newspapers or are presented on television. Watch for the next one!
11.3
THE QUESTIONNAIRE
As was noted in Unit 2, observational methods are less effective in providing information about personal beliefs, feelings, motivations, expectations or future plans. The signal advantages of the mail questionnaire and the personal interview as principal ways of collecting survey data have been discussed. We now turn to the instrument on which both approaches depend: the questionnaire or recording schedule. Both contain a set of questions logically related to the problem under study, but whilst a schedule is used as a tool for interviewing, the questionnaire is used for mailing. The process of constructing a schedule and a questionnaire is almost the same, except for some minor technical differences. Whilst the questionnaire itself is simpler, shorter and carefully and clearly laid out, the requirements for the recording schedule are in some respects different, as it is handled by interviewers.
Having made the distinction between questionnaires and recording schedules, we shall now concentrate on some basic points to be kept in mind whilst designing them. We shall, however, use the term ‘questionnaire’ in discussing both types of documents.
Read pp 317-321 of your textbook (OJ).
As you see, several considerations must be borne in mind while designing a questionnaire. Careful planning, the physical design of the questions, and careful selection and phrasing of the questions affect the number of returns as well as the quality and accuracy of the findings. The entire process of questionnaire construction can be divided into the following steps:
(i) Information to be sought
(ii) Type of questionnaire to be used
(iii) Writing a first draft
(iv) Re-examining the questions
(v) Pre-testing and editing the questionnaire
(vi) Specifying procedure for its use.
So, the first step of questionnaire design is to define the problem to be tackled and hence decide on what questions to ask. The temptation is to cover too much, but this has to be resisted as lengthy questionnaires can prove to be demoralising for both the interviewer and the respondent.
11.3.1
Question Construction
Once the information needs and size of the questionnaire have been agreed on, we can begin question construction. This involves the following:
(a) Question Relevance and Content
In considering any question, it is wise to ponder whether respondents are likely to possess the required knowledge, or have access to the appropriate information, necessary for giving a correct answer. Further, it must be made clear whether questioning would secure the required information or not. If we find that our objectives are not met by questioning, then we should think of alternative procedures.
(b) Question wording
Obviously, great care is required in formulating the questions. Reliable and meaningful returns depend to a large extent on this. Naturally, if questions are beyond the understanding of the respondent, he/she may choose one of the alternative responses without any idea as to the meaning of his/her response. Some suggestions for wording questions are given below:
(i) Simple words which can be expected to be familiar to all potential informants should be used. Avoid multiple-meaning questions, as they tend to give rise to confusion on both sides; they should be formulated as two or more questions. Avoid ambiguity and vague words, as they encourage ambiguous and vague answers.
(ii) Caution must be exercised in the use of phrases which reflect upon the prestige of the respondent. Embarrassing questions, leading questions, those involving memory, catchwords or words with emotional connotations should be avoided. Further, the question must allow for all possible responses; thus provision for such indefinite answers as “don’t know”, “no choice”, “other (specify)” should be made. But, at the same time, to avoid abuse of these indefinite answers, the range of answers should be exhaustive and well established as far as possible.
(iii) Questions should not, generally speaking, presume anything about the respondent. For example, questions such as
• ‘How many cigarettes a day do you smoke?’
are best asked only after a ‘filter’ question, e.g. ‘Do you smoke? Yes / No’. The filter reveals whether the respondent smokes cigarettes or not. Once filters have been formulated, skip instructions are necessary. For instance, in the above case, a respondent who does not smoke may be directed to skip the questions addressed to those who smoke.
Question wording remains a matter of experience and common sense, and what we have discussed above is in no way complete.
(c) Response form or types of questions
The third major area in question construction is the type of questions to be included in the instrument. They may be classified into open questions and closed questions. Closed (sometimes called Pre-coded or Fixed Alternative) questions are structured ones with two or more alternative responses from which the respondent can choose. They are efficient where the possible alternative replies are known, limited and clear-cut, as in the case of factual information. They have the advantage of being ‘standardisable’, simple to administer, and quick and relatively inexpensive to analyse. On the other hand, they may tend to force a statement of opinion on an issue, or the respondent may be led to choose a response even when he/she has no knowledge of it, or the limited alternatives may not cover his/her viewpoints. Open-ended questions are unstructured ones, providing free scope to the respondents to reply in their own choice of words, e.g.
• What do you propose to do after leaving the University?
While they present a major strength in the sense that the informant is given the chance of answering in his/her own terms and frame of reference, their analysis is often complex, difficult and expensive. Open-ended questions are desirable when the issue is complex or when the interest of the researcher is the exploration of a process, but in other cases closed questions are preferable.
(d) Question order/sequence
The order in which questions are arranged is as important as question wording, as it may affect the refusal rate, and there is evidence that it may even influence the answers obtained. As mentioned in the book by Goode and Hatt (1952), “Methods in Social Research”, McGraw-Hill, NY, there should be a logical progression in the sequence so that the respondent is
(i) drawn into the questioning process by awakening his/her interest
(ii) not confronted by an early and sudden request for personal information
(iii) brought along easily from items which are simple to answer to those which are complex
(iv) never asked to give an answer which could be embarrassing without being given an opportunity to explain
(v) brought smoothly from one frame of reference to another rather than made to jump back and forth.
The overall sequence in a questionnaire is of paramount importance, as usually the interviewer is a stranger to the respondent and the latter is under no obligation to comply. So, the interviewer should try to awaken the respondent’s interest in the study and motivate participation. There was a tendency in the past to begin the questionnaire with an easy-to-answer demographic profile of the respondent, such as age, marital status, religion etc., but there is a school of thought that sees this practice as undesirable because people do not like to furnish such information so abruptly to strangers. It may thus be more desirable to put these questions at the end, as by that time the interviewer has evoked the interest of the respondent in the study and the latter is more willing to give such information.
(e) Pilot studies/Pretesting
A pilot study is a full-fledged miniature study of a problem, while a pretest is a trial test of a specific aspect of the study, such as the method of data collection, the data collection instrument, the interview schedule etc. The draft questionnaire must be pretested in order to find out how it works before launching a full-scale survey. This often solves unforeseen problems in field work and indicates any necessary changes in the questions and other problems with the questionnaire. After the editing is done, other pretests might be necessary before administering the questionnaire, depending on the complexity of the study. Finally, a pilot study, which is a rehearsal of the main study, is vital for the proper running of the survey later.
11.3.2
Concluding Remarks
For the purpose of this course, the coverage of questionnaire design has been brief. The interested reader is directed to the specialised books listed in the Recommended Readings for a comprehensive discussion of the topic. To conclude, questionnaire design remains a matter of common sense, experience and avoiding known pitfalls. And detailed pretests and pilot studies, more than anything else, are the essence of a good questionnaire.
Activity 2
(i) You have been requested by University management to conduct a small study on the adequacy of students’ access to computer facilities on the University campus. Describe and justify what sampling procedure you would adopt and also design a short questionnaire of about 10-15 questions for the purpose.
(ii) Concern has been expressed in various quarters about the difficulty experienced by working women in reconciling their domestic responsibilities with their work. Issues of interest are the extent of the domestic responsibilities, whether any help in coping with them is obtained, the amount and type of leisure enjoyed, and the stresses generated by the dual responsibilities. Differences in the level of difficulty experienced and in ways of coping across different categories of women are also of interest. Design a suitable short questionnaire (of about 15 to 20 questions) for carrying out a national sample survey to address the issue described. The survey will use face to face interviewing.
Recommended Readings:
1. Payne, S.L.B., The Art of Asking Questions, Princeton: Princeton University Press, 1951.
2. Moser, C.A. and Kalton, G., Survey Methods in Social Investigation, ELBS and Heinemann Educational Books Ltd: University Press, 1971.
11.4
SUMMARY
In this unit, you have studied the importance of randomness and representativeness in sample selection, the use of random number tables, the application, strengths and weaknesses of simple random sampling, systematic sampling, stratified sampling, cluster sampling and quota sampling. You have also studied the principles of questionnaire design.
UNIT 12 LINEAR RELATIONSHIP BETWEEN VARIABLES - I: CORRELATION
Unit Structure
12.0 Overview
12.1 Learning Objectives
12.2 Bivariate Data
12.2.1 Scatter Diagrams
12.3 Measures of Correlation
12.3.1 Product Moment Correlation Coefficient
12.3.2 Rank Correlation Coefficient
12.4 Interpretation of the Coefficient of Correlation and Problems Related Thereof
12.5 Coefficient of Determination
12.6 Summary
12.0
OVERVIEW
In many situations it is of interest to find out whether two or more variables are related, and if so, to investigate the nature and strength of these relationships. For instance, one might be interested in studying the relationship between Yield, Temperature, Humidity, Rainfall, etc. Such relationships are studied using the techniques of Correlation and Regression. In this unit, you shall study the concept of Correlation, different ways of measuring it, its importance and limitations.
12.1
LEARNING OBJECTIVES
When you have successfully completed this Unit, you should be able to do the following:
1. Identify bivariate relationships.
2. Construct and interpret Scatter Diagrams.
3. Compute, interpret and use the following:
(i) Product moment correlation coefficient.
(ii) Rank correlation coefficient.
(iii) Coefficient of Determination.
12.2
BIVARIATE DATA
So far, we have confined ourselves to univariate data i.e. the data concerning only one variable. We may, however, come across data involving two or more variables, for example, the marks of students in various subjects. The data involving two variables is known as Bivariate Data. Table 12.1 is an example of bivariate data:
Table 12.1
Student    % Marks in Maths    % Marks in Statistics
A          40                  65
B          68                  75
C          35                  35
D          52                  48
E          70                  50
Note: Bivariate data must always be in pairs, and the two sets of data should correspond to the same units of observation. For instance, in Table 12.1, the marks in Maths and Statistics should correspond to the same set of students. We are often interested in finding out the nature and strength of the relationship between two variables under study. In the above example, we might be interested in knowing the type of relationship that exists between Marks in Maths (X) and Marks in Statistics (Y), and whether high values of X tend to be associated with high or low values of Y, or vice versa. The coming sections of this unit are devoted to the analysis of bivariate data using correlation techniques.
12.2.1
Scatter Diagrams
You have been introduced to scatter diagrams in Section 5.2.6 of the manual and you have seen their usefulness. We further develop scatter diagrams in this section, especially in the context of discerning the relationship between the variables. As mentioned earlier, if the paired values of variables X and Y are plotted along the x-axis and y-axis respectively in the xy-plane, the diagram of points so obtained is known as a Scatter Diagram. From the scatter diagram, we can form a fairly good idea about the relationship between X and Y. Read pp 453-454 of your textbook (OJ).
Study the following scatter diagrams carefully and try to identify the nature of relationship represented by each of them.
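[Figures 12.1 to 12.6: six scatter diagrams, each plotting Y (vertical axis) against X (horizontal axis).]
Fig. 12.1: Height of sons (Y) against Height of father (X)
Fig. 12.2: Interest (Y) against Savings (X)
Fig. 12.3: Price (Y) against Demand (X)
Fig. 12.4: Number of errors made (Y) against Number of weeks experience (X)
Fig. 12.5: Yield (Y) against Rainfall (X)
Fig. 12.6: Consumption of Cigarettes (Y) against Height (X)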
You will note that in Fig. 12.1 and Fig. 12.2, high values of variable X are associated with high values of variable Y, indicating a positive relationship. Since the points in Fig. 12.2 are less scattered than those in Fig. 12.1, the positive relationship exhibited in Fig. 12.2 is stronger than that in Fig. 12.1.
Activity 1
A machine will run at different speeds, but the higher the speed the sooner a certain part has to be replaced. Trial observation gives the following data:
Table 12.2
Speed (revolutions per minute)    Life of drillhead
18                                162
20                                154
20                                171
21                                165
23                                128
26                                138
26                                140
28                                129
31                                125
32                                106
32                                97
40                                95
41                                103
42                                109
43                                69
Plot the figures on a scatter diagram and comment.
12.3
MEASURES OF CORRELATION
You have seen that the scatter diagram provides a useful aid in discerning the nature of the relationship between two variables, but it cannot supply a quantitative measure of the extent of that relationship. In addition to examining the scatter diagram, it is therefore necessary to compute a descriptive measure that reflects the strength of the existing relationship. Correlation does so: it gives us a measure of the strength of the linear relationship that exists between two or more variables. In this unit, we consider the linear relationship between two variables, i.e. simple correlation. In this section, you study two measures of correlation:
1. Product Moment Correlation Coefficient (r).
2. Rank Correlation Coefficient (P).
12.3.1
Product Moment Correlation Coefficient
Consider the following table representing the volume of Sales and Total expenses for ten firms.
Table 12.3
Volume of Sales (in thousands of units), Y    Total Expenses (£000), X
20                                            60
2                                             25
4                                             26
23                                            66
18                                            49
14                                            48
10                                            41
8                                             18
13                                            40
18                                            33
The scatter diagram is produced below:
[Fig. 12.7: Graph of Volume of Sales v/s Total Expenses. Vertical axis: Volume of sales, 0 to 25; horizontal axis: Total Expenses, 0 to 80.]
The scatter diagram indicates a positive relationship between the two variables, but it is insufficient to give us a measure of the strength of the relationship between the two. Let us consider the problem: compute $\bar{X}$ and $\bar{Y}$ and form the columns $(X - \bar{X})$ and $(Y - \bar{Y})$. Calculate $(X - \bar{X})(Y - \bar{Y})$. Plot the points $Y - \bar{Y}$ against $X - \bar{X}$.
Table 12.4
Note: $n = 10$, $\sum X = 406$, $\sum Y = 130$, $\therefore \bar{X} = 40.6$, $\bar{Y} = 13$
$X - \bar{X}$    $Y - \bar{Y}$    $(X - \bar{X})(Y - \bar{Y})$
 19.4              7              135.8
-15.6            -11              171.6
-14.6             -9              131.4
 25.4             10              254
  8.4              5               42
  7.4              1                7.4
  0.4             -3               -1.2
-22.6             -5              113
 -0.6              0                0
 -7.6              5              -38
                                  ___
                                  816
                                  ===
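[Figure 12.8: Plot of $Y - \bar{Y}$ (vertical axis, -15 to 15) against $X - \bar{X}$ (horizontal axis, -30 to 30), with the four quadrants marked: Ist Quadrant (upper right), IInd Quadrant (upper left), IIIrd Quadrant (lower left), IVth Quadrant (lower right).]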
The scatter diagram (Fig. 12.7) indicates that high values of X are associated with high values of Y, and Fig. 12.8 shows that most of the points lie in the Ist and IIIrd Quadrants, where the product $(X - \bar{X})(Y - \bar{Y})$ is positive. Hence, we expect $\sum (X - \bar{X})(Y - \bar{Y})$ to be positive if most of the points lie in the first and third quadrants of the $(X - \bar{X}, Y - \bar{Y})$ plane. It implies that there is a direct or positive correlation between X and Y.
Similarly, if most of the points lie in the second and fourth quadrants, $\sum (X - \bar{X})(Y - \bar{Y})$ will be negative, thereby implying negative or inverse correlation.
If the points are rather evenly distributed in all the four quadrants, the sum of the positive products $(X - \bar{X})(Y - \bar{Y})$ would roughly equal, in size, the sum of the negative products. Thus, $\sum (X - \bar{X})(Y - \bar{Y})$ is expected to be close to zero, indicating a very weak linear relationship between X and Y.
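The quadrant argument can be checked numerically for the ten firms. The short Python sketch below uses the data of Table 12.3 to count how many points fall in each quadrant of the $(X - \bar{X}, Y - \bar{Y})$ plane and to add up the products. The function name `quadrant_counts` and the label "on an axis" (for a point whose deviation in X or Y is exactly zero) are choices made for this illustration only.

```python
# Data from Table 12.3: X = total expenses (£000), Y = volume of sales (thousands of units)
X = [60, 25, 26, 66, 49, 48, 41, 18, 40, 33]
Y = [20, 2, 4, 23, 18, 14, 10, 8, 13, 18]


def quadrant_counts(xs, ys):
    """Count points by quadrant of the deviation plane and sum the products."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    counts = {"I": 0, "II": 0, "III": 0, "IV": 0, "on an axis": 0}
    total = 0.0
    for x, y in zip(xs, ys):
        dx, dy = x - x_bar, y - y_bar
        total += dx * dy              # positive in quadrants I and III
        if dx == 0 or dy == 0:
            counts["on an axis"] += 1
        elif dx > 0 and dy > 0:
            counts["I"] += 1
        elif dx < 0 and dy > 0:
            counts["II"] += 1
        elif dx < 0 and dy < 0:
            counts["III"] += 1
        else:
            counts["IV"] += 1
    return counts, total


counts, total = quadrant_counts(X, Y)
print(counts)
print(round(total, 1))   # the sum of products, 816 as in Table 12.4
```

Seven of the ten points lie in the Ist and IIIrd quadrants, which is why the sum of products is large and positive.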
Alternatively, by shifting the origin from (0, 0) to $(\bar{X}, \bar{Y})$, i.e. (40.6, 13), in Fig. 12.7, we can draw similar conclusions as above. This is shown in Fig. 12.9.
[Figure 12.9: The scatter diagram of Fig. 12.7 (Volume of sales against Total Expenses) with new axes drawn through $(\bar{X}, \bar{Y})$; in the Ist and IIIrd Quadrants, $(X - \bar{X})(Y - \bar{Y})$ is +ve.]
You may be wondering if we could use $\sum (X - \bar{X})(Y - \bar{Y})$ as a measure of the degree of association. Well, it could be used, but this term has two deficiencies:
(1) it is influenced by the variability of X and Y, and
(2) the magnitude of this term depends upon the size of the sample.
To overcome the first deficiency, we divide $\sum (X - \bar{X})(Y - \bar{Y})$ by the measures of dispersion $\sigma_X$, $\sigma_Y$. The measure thus derived is also a dimensionless ratio. A remedy for the second deficiency is to divide the ratio by n, the sample size.
Hence the Product Moment Correlation Coefficient is given by
$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n \sigma_X \sigma_Y}$$ ............................... (Formula 12.1)
From Table 12.4, we have
$$\sum (X - \bar{X})(Y - \bar{Y}) = 816$$
Also, you recall
$$\sigma_X^2 = \frac{1}{n} \sum (X - \bar{X})^2 \quad \text{and} \quad \sigma_Y^2 = \frac{1}{n} \sum (Y - \bar{Y})^2$$
Using Table 12.4, we obtain
$$\sigma_X^2 = 217.24 \quad \text{and} \quad \sigma_Y^2 = 43.6$$
Using Formula 12.1,
$$\therefore r = \frac{816}{10 \times \sqrt{217.24 \times 43.6}} = 0.8384$$
r measures the strength of the linear relationship between X and Y. This formula was developed by Karl Pearson; hence this coefficient of correlation is commonly known as Karl Pearson’s product moment correlation coefficient or simply Pearson’s coefficient of correlation.
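As a check on the arithmetic, Formula 12.1 can be applied step by step in Python to the data of Table 12.3. This is a minimal sketch and the function name `pearson_r` is chosen for illustration. The standard deviations divide by n, as in the definitions recalled above.

```python
import math

# Data from Table 12.3: X = total expenses (£000), Y = volume of sales (thousands of units)
X = [60, 25, 26, 66, 49, 48, 41, 18, 40, 33]
Y = [20, 2, 4, 23, 18, 14, 10, 8, 13, 18]


def pearson_r(xs, ys):
    """Product moment correlation coefficient computed from Formula 12.1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of products of deviations from the means (816 for these data)
    sum_products = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Standard deviations with divisor n
    sigma_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / n)
    sigma_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / n)
    return sum_products / (n * sigma_x * sigma_y)


print(round(pearson_r(X, Y), 4))   # 0.8384
```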
Note: For calculation purposes, Formula 12.1 is often expressed as
$$r = \frac{n \sum XY - (\sum X)(\sum Y)}{\sqrt{\left[ n \sum X^2 - (\sum X)^2 \right] \left[ n \sum Y^2 - (\sum Y)^2 \right]}}$$ ..................(Formula 12.2)
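Activity 2 (Optional)
(1) Show that
$$\sum (X - \bar{X})(Y - \bar{Y}) = \sum XY - \frac{\sum X \sum Y}{n}$$
(2) Recall from Unit 6 Section 6.3.1.1, Activity 2(ii) that
$$\sum (X - \bar{X})^2 = \sum X^2 - \frac{(\sum X)^2}{n} \quad \text{and} \quad \sum (Y - \bar{Y})^2 = \sum Y^2 - \frac{(\sum Y)^2}{n}$$
(3) Hence show that
$$\frac{\sum (X - \bar{X})(Y - \bar{Y})}{n \sigma_X \sigma_Y} = \frac{n \sum XY - (\sum X)(\sum Y)}{\sqrt{\left[ n \sum X^2 - (\sum X)^2 \right] \left[ n \sum Y^2 - (\sum Y)^2 \right]}}$$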
To illustrate the calculation of r, using the computational formula (12.2), we reconsider the data from Table 12.3, and compute the columns XY, X² and Y²
Table 12.5
  X      Y       XY        X²       Y²
 60     20    1 200     3 600      400
 25      2       50       625        4
 26      4      104       676       16
 66     23    1 518     4 356      529
 49     18      882     2 401      324
 48     14      672     2 304      196
 41     10      410     1 681      100
 18      8      144       324       64
 40     13      520     1 600      169
 33     18      594     1 089      324
___    ___    _____    ______    _____
406    130    6 094    18 656    2 126
$$r = \frac{10 \times 6\,094 - 406 \times 130}{\sqrt{\left[ 10 \times 18\,656 - (406)^2 \right] \left[ 10 \times 2\,126 - (130)^2 \right]}} = \frac{60\,940 - 52\,780}{\sqrt{(21\,724)(4\,360)}} = 0.8384$$
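The computational formula (12.2) needs only the five column totals of Table 12.5, which makes it convenient to program. The sketch below recomputes those totals from the raw data and substitutes them; the function name `pearson_r_from_sums` is an illustrative choice.

```python
import math

# Data from Table 12.3 / Table 12.5
X = [60, 25, 26, 66, 49, 48, 41, 18, 40, 33]
Y = [20, 2, 4, 23, 18, 14, 10, 8, 13, 18]


def pearson_r_from_sums(xs, ys):
    """Product moment correlation coefficient computed from Formula 12.2."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y                      # 60 940 - 52 780
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) *         # 21 724
                            (n * sum_y2 - sum_y ** 2))          # 4 360
    return numerator / denominator


print(round(pearson_r_from_sums(X, Y), 4))   # 0.8384
```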
Note that we get the same answer as before, but the computations are simpler.
Activity 3
Attempt Questions 23.6 (b) and 23.18 in (OJ).
12.3.2
Rank Correlation Coefficient
Read pp 471-473 of your textbook (OJ).
Note that the equation at the bottom of p. 472 (OJ) should read as:
$$P = 1 - \frac{6 \times 34.5}{9(81 - 1)} = 1 - \frac{207}{720} = 0.7125$$
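The corrected equation has the shape of the usual rank correlation formula $P = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$, where d is the difference between the two ranks of each item. The sketch below assumes that this is the formula used in the textbook, with $\sum d^2 = 34.5$ and $n = 9$. The function names, and the use of the marks in Table 12.1 as a second example, are choices made for this illustration. Tied values are given the average of the ranks they occupy.

```python
def rank_correlation_from_d2(sum_d2, n):
    """Rank correlation coefficient from the sum of squared rank differences."""
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


def ranks(values):
    """Rank values from 1 (smallest) upwards; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def rank_correlation(xs, ys):
    """Rank correlation coefficient computed from two lists of paired observations."""
    rx, ry = ranks(xs), ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return rank_correlation_from_d2(sum_d2, len(xs))


# The corrected textbook equation: sum of d squared = 34.5, n = 9
print(round(rank_correlation_from_d2(34.5, 9), 4))   # 0.7125

# Marks in Maths and Statistics from Table 12.1 (students A to E)
print(rank_correlation([40, 68, 35, 52, 70], [65, 75, 35, 48, 50]))
```

For the Table 12.1 marks the rank differences are -2, -1, 0, 1 and 2, so the sum of squared differences is 10 and P = 1 - 60/120 = 0.5.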
Activity 4
Attempt Question 23.24 (b) from (OJ).
12.4
INTERPRETATION OF THE COEFFICIENT OF CORRELATION AND PROBLEMS RELATED THEREOF
• The correlation coefficient r lies between -1 and +1, i.e. $-1 \le r \le 1$.
r = -1 implies a perfect inverse linear relationship between X and Y, that is, all the sample points will fall on a straight line with negative slope.
r = 0 implies no linear relationship between X and Y.
r = +1 implies a perfect direct linear relationship between X and Y, that is, all the points (X, Y) will fall on a straight line which has a positive slope.
It follows that a value of r near to -1 indicates a high degree of negative association, i.e. high values of one variable are associated with low values of the other. The negative sign shows that the relationship is inverse. On the other hand, a value of r close to +1 implies a high degree of positive association, i.e. high values of one variable are associated with high values of the other. The positive sign shows that the relationship is direct.
Note:
• The Rank Correlation Coefficient (P) is nothing but the product moment correlation coefficient between the ranks. Hence, it can be interpreted in the same way as r.
• Correlation might exist between two variables, and it could be strong, yet there may be no logical or causal relationship. A ‘Causal’ or ‘Cause and effect’ relationship is said to exist between two variables if a change in one variable causes a change in the other. For example: Age of Machine (cause) v/s Maintenance Cost (effect); Rainfall (cause) v/s Yield (effect). Some relationships are even purely accidental. This is known as a spurious or nonsense correlation. For example, consider average working hours per week and percentage of fibre in diet. Automation is responsible for cutting down the hours required to work, while medical awareness is causing diets to become healthier. The correlation in this case will obviously be high but spurious. One would surely not wish to make the comment “Healthy eating causes laziness”!
• Two series may vary together, being under the influence of one or more other variables. You might find a close relationship between jewellery sales and sales of colour TV sets. Here, changes in both sales are probably a result of changes in consumer income.
• Zero correlation doesn’t always mean that there is no relationship between the variables. All it says is that there is no linear relationship between the variables; there may be a strong relationship but of a non-linear kind.
Activity 5
Consider the following bivariate data
X    -4    -3    -2    -1     1     2     3     4
Y    16     9     4     1     1     4     9    16
For the above data, compute the following:
$\sum X$, $\sum Y$, $\sum X^2$, $\sum Y^2$, $\sum XY$
Hence, calculate Karl Pearson’s correlation coefficient, r.
• What do you notice?
• Based on the value of r, what conclusions can you make?
• Plot a scatter diagram of Y against X.
• Is there any relationship between Y and X?
• Is there anything special about the form of the relationship between Y and X? (HINT!! Compare Y with the computed values of X².)
• According to you, what important issue on the interpretation of r does this activity bring out?
12.5
COEFFICIENT OF DETERMINATION
The Coefficient of Determination is equal to the square of the Coefficient of Correlation and is denoted by $r^2$ ($0 \le r^2 \le 1$). It gives the percentage of variation in one variable explained by variation in the other variable. For example, if r = 0.5, r² = 0.25, which implies that only 25 percent of the variation in Y (or X) is explained by the variation in X (or Y), thereby indicating a rather weak linear relationship between the two variables.
Similarly, r = 0.7 seems to indicate a high positive correlation, yet in fact only (0.7)², i.e. 49%, of the variation in one variable is explained by the variation in the other. Hence, there is only a moderate relationship between the two variables. Thus, we should take care in interpreting the values of the correlation coefficient.
Note: The concept of the coefficient of determination, its applications and uses are further developed in Unit 13, in connection with regression analysis.
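The point is easy to see by tabulating r² for a few values of r. The values 0.5 and 0.7 are those discussed above and 0.8384 is the coefficient found for Table 12.3. The value -0.7 is added here only to show that the sign of r is lost on squaring. The function name is chosen for illustration.

```python
def coefficient_of_determination(r):
    """Square of the correlation coefficient: the share of variation explained."""
    return r ** 2


for r in (0.5, 0.7, -0.7, 0.8384):
    share = coefficient_of_determination(r)
    print(f"r = {r:+.4f}   r^2 = {share:.4f}   ({share:.1%} of variation explained)")
```

For the ten firms of Table 12.3, r = 0.8384 gives r² of about 0.70, so roughly 70% of the variation in one variable is explained by variation in the other.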
12.6
SUMMARY
In this unit, you have learnt about the linear relationships that exist between two variables using scatter diagrams and the measures of correlation viz. Product Moment Correlation Coefficient and the Coefficient of Rank Correlation. You should now be able to interpret the results properly.
UNIT 13 LINEAR RELATIONSHIP BETWEEN VARIABLES - II: REGRESSION
Unit Structure
13.0 Overview
13.1 Learning Objectives
13.2 Modelling the Relationship Between X and Y
13.3 The Simple Linear Regression
13.4 Functional Forms
13.5 Linear Regression and the Time Series
13.6 Coefficient of Determination
13.7 Summary
13.0
OVERVIEW
Regression analysis is used to study the nature of the relationship that exists between two or more variables, as well as to serve as a basis for prediction. This unit focuses on computing and analysing the linear regression that describes the relationship between two variables.
13.1
LEARNING OBJECTIVES
When you have successfully completed the Unit, you should be able to do the following:
1. Explain the purpose of regression analysis.
2. Compute, interpret and use the simple linear regression equation for elementary forecasting purposes.
3. Interpret the coefficient of determination.
13.2
MODELLING THE RELATIONSHIP BETWEEN X AND Y
Consider the following example. A home repair service charges Rs 40 for the service call and Rs 10 per hour spent on location. This situation can be modelled exactly as
Y = 40 + 10 X ............................ (Formula 13.1)
where Y (the labour charges) is the dependent variable and X (the number of hours) is the independent variable.
Figure 13.1: Graph of Labour Charges v/s Number of Hours
When a graph is drawn as shown in Fig. 13.1, you find that all the points lie on a straight line, so that for a given value of X you can calculate an exact value for Y. For example, if a worker spends 2 hours on a particular job, then the labour charges would be 40 + 10(2) = Rs 60. Such a model is generally represented by Y = α + βX and is referred to as an Exact Model.
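As a small illustration of the exact model, the Python sketch below evaluates Formula 13.1 for a few values of X. The function name labour_charge and its parameter names are illustrative choices, not part of the course materials.

```python
def labour_charge(hours, call_fee=40, hourly_rate=10):
    """Exact model Y = 40 + 10 X: labour charge in Rs for `hours` hours on location."""
    return call_fee + hourly_rate * hours


# Every (X, Y) pair produced here lies exactly on the straight line.
for hours in (0, 1, 2, 3):
    print(f"X = {hours} hours  ->  Y = Rs {labour_charge(hours)}")
```

Because there is no error term, the same X always gives the same Y; this is what distinguishes the exact model from the statistical model introduced next.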
But in real life, more often, we come across situations which cannot be modelled exactly as discussed above. For example, consider the scatter diagram of advertising expenditure and sales on page 453 of textbook (OJ) and reproduced below.
Figure 13.2: Graph of Sales (£m) v/s Advertising Expenditure (£000), Brown's Department Store
Note that although the relationship between advertising expenditure (X) and sales (Y) may be strong, we cannot calculate sales exactly for a given advertising expenditure, for instance if we wished to calculate sales for an expenditure of £71,000 or £102,000.
You note that in Figure 13.2, a straight line cannot be placed through all the points in the scatter diagram. However it is plausible to suppose that there exists a linear relationship between the two variables except that there are also some unpredictable deviations. So there is a need to include an error term in our exact model.
Hence, in such situations, we can fit a statistical model of the form Y = α + β X + ε where ε is the error term. When you read your textbook later, you shall learn more about the error term and how by minimising the sum of squares of the errors, we can obtain estimates of α and β.
13.3
THE SIMPLE LINEAR REGRESSION
Read pp 454–461 of textbook (OJ). The simple linear regression model is
Y = α + βX + ε
where
Y is the dependent variable,
X is the independent variable,
α and β are the population parameters, and
ε is the random error term.
The fitted regression line of y on x is thus ŷ = a + bx, where a and b are estimates of α and β respectively. Pages 458–459 provide you with formulae to calculate a and b directly, rather than having to solve the normal equations simultaneously. The normal equations (p. 456, OJ) are
$$\sum y = na + b\sum x$$
$$\sum xy = a\sum x + b\sum x^2$$
giving
$$b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$
which can be further simplified to
$$b = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} \qquad (1)$$
and
$$a = \bar{y} - b\bar{x} \qquad (2)$$
where $\bar{x} = \frac{1}{n}\sum x$ and $\bar{y} = \frac{1}{n}\sum y$.
(The formulae (1) and (2) are provided in exams.)
Assumptions
There are a number of assumptions underlying the simple linear regression model but we mention only some of them:
(i) The independent variable X is measured without errors.
(ii) The model has been correctly specified.
(iii) Errors are independently distributed, their standard deviation is constant and the average of all errors is zero.
SOLVED EXAMPLE
We take up the data used in 12.3.1 to fit a regression line by the method of least squares. The fitted regression line is ŷ = a + bx, where a and b are as above.
To compute a and b, we need ∑x, ∑y, ∑x² and ∑xy. We refer to Table 12.5, where these quantities have already been computed. We have
n = 10, ∑x = 406, ∑y = 130, ∑x² = 18656, ∑xy = 6094
Thus
$$b = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} = \frac{10(6094) - (406)(130)}{10(18656) - (406)^2} = 0.3756$$
and
$$a = \bar{y} - b\bar{x} = \frac{130}{10} - 0.3756\left(\frac{406}{10}\right) = -2.25$$
Therefore the fitted line is: ŷ = −2.25 + 0.3756x
Activity 1
Attempt Questions 23.3 and 23.5 of textbook (OJ).
Note: The equation Ŷ = a + bX gives the line of regression of Y on X. It is used to estimate or predict the value of Y for any given value of X, i.e., when Y is the dependent variable and X is the independent variable. Moreover, the estimate obtained is the best in the least-squares sense, since this regression equation minimises the sum of squares of errors.
Caution!!! This regression equation cannot be used to estimate or predict the value of X for any given value of Y; that would require the line of regression of X on Y.
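The solved example in this section can be reproduced with a few lines of Python. The sketch below applies formulae (1) and (2) to the summary totals quoted from Table 12.5; the function name least_squares_from_sums is a hypothetical helper introduced here for illustration.

```python
def least_squares_from_sums(n, sum_x, sum_y, sum_x2, sum_xy):
    """Return (a, b) for the fitted line y = a + b x, using formulae (1) and (2)."""
    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("all x values are equal, so the slope is undefined")
    b = (n * sum_xy - sum_x * sum_y) / denominator   # formula (1)
    a = sum_y / n - b * (sum_x / n)                  # formula (2): a = y-bar - b * x-bar
    return a, b


# Totals taken from the solved example (Table 12.5)
a, b = least_squares_from_sums(n=10, sum_x=406, sum_y=130, sum_x2=18656, sum_xy=6094)
print(f"b = {b:.4f}")
print(f"a = {a:.2f}")
```

Working from the totals mirrors the hand calculation, so each intermediate quantity can be compared with the figures above.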
13.4
FUNCTIONAL FORMS
It may be possible to use the techniques you have learnt so far to fit other functional forms which are different from the usual form,
Y = a + bX ...... (Formula 13.2)
but which can easily be converted to the form of Formula 13.2 by some appropriate transformation.
Example 1
Consider
$$Y = a + \frac{b}{X}$$
If we let $Z = \frac{1}{X}$, i.e., treat $\frac{1}{X}$ as the independent variable, we get
Y = a + bZ
which is of the form of Formula 13.2 and can be fitted by the method of least squares.
Similarly, consider
$$y = ab^x$$
Taking ln (log_e) on both sides, we get
$$\ln y = \ln(ab^x) = \ln a + \ln b^x = \ln a + x \ln b$$
i.e. Y = A + xB, where Y = ln y is the dependent variable, A = ln a and B = ln b.
Try $y = ax^b$ and $y = ae^{-x}$.
SOLVED EXAMPLE
The number y of bacteria per unit volume present in a culture after x hours is given in the table below.
Number of Hours (x):                        0    1    2    3    4     5     6
Number of Bacteria per Unit Volume (y):    32   47   65   92   132   190   275
(a) It is suggested that for this type of situation, the curve $y = A \cdot B^x$, where A and B are constants, gives a good fit. Show that the curve above can be transformed into the usual standard linear regression equation. Hence obtain estimates of A and B (correct to 2 d.p.).
(b) Compare the fitted values of y obtained from the above equation with the actual values and comment on your findings.
(c) Estimate the value of y for x = 3.5 and x = 8. Which one of the two estimates would you expect to be more reliable and why?
SOLUTION
$$y = A \cdot B^x$$
(a) Taking ln (log_e) on both sides, we have
$$\ln y = \ln(A \cdot B^x)$$
$$\Rightarrow \ln y = \ln A + \ln B^x$$
$$\Rightarrow \ln y = \ln A + x \ln B \qquad (1)$$
Let
$$Y = \ln y, \quad \alpha = \ln A \quad \text{and} \quad \beta = \ln B \qquad (2)$$
Thus (1) is reduced to
$$Y = \alpha + \beta x \qquad (3)$$
which is the usual standard linear regression equation, and thus estimates of α and β, say a and b respectively, can be obtained by the method of least squares.
x        y      Y = ln y    x²     x·Y
0        32     3.466       0      0
1        47     3.850       1      3.850
2        65     4.174       4      8.349
3        92     4.522       9      13.565
4        132    4.883       16     19.531
5        190    5.247       25     26.235
6        275    5.617       36     33.701
Total
21              31.759      91     105.231
We have n = 7, ∑x = 21, ∑Y = 31.759, ∑x² = 91 and ∑x·Y = 105.231
Thus we have
$$b = \frac{n\sum xY - (\sum x)(\sum Y)}{n\sum x^2 - (\sum x)^2} = \frac{7(105.231) - (21)(31.759)}{7(91) - (21)^2} = 0.355$$
$$a = \bar{Y} - b\bar{x} = \frac{1}{7}\left(\sum Y - b\sum x\right) = \frac{1}{7}\left(31.759 - 0.355(21)\right) = 3.472$$
Hence, from (2),
$$\hat{A} = e^{a} = e^{3.472} = 32.20 \text{ (to 2 d.p.)} \quad \text{and} \quad \hat{B} = e^{b} = e^{0.355} = 1.43 \text{ (to 2 d.p.)}$$
$$\therefore \hat{y} = 32.20\,(1.43)^x$$
(b)
x      y      ŷ = 32.20 (1.43)^x
0      32     32.2
1      47     46.0
2      65     65.8
3      92     94.1
4      132    134.6
5      190    192.5
6      275    275.3
The fitted values are close to the actual values throughout, so the curve $y = A \cdot B^x$ gives a very good fit for this type of situation.
(c) Try this part yourself.
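Parts (a) and (b) of the solved example can also be checked numerically. The Python sketch below takes natural logarithms of y, fits the straight line Y = a + bx by least squares, and transforms back to obtain A and B. The names fit_exponential, hours and bacteria are illustrative and not part of the course materials.

```python
import math

hours = [0, 1, 2, 3, 4, 5, 6]
bacteria = [32, 47, 65, 92, 132, 190, 275]


def fit_exponential(x, y):
    """Fit y = A * B**x by least squares on (x, ln y); return (A, B)."""
    n = len(x)
    log_y = [math.log(value) for value in y]          # Y = ln y
    sum_x = sum(x)
    sum_log_y = sum(log_y)
    sum_x2 = sum(xi * xi for xi in x)
    sum_x_log_y = sum(xi * yi for xi, yi in zip(x, log_y))
    b = (n * sum_x_log_y - sum_x * sum_log_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_log_y - b * sum_x) / n
    return math.exp(a), math.exp(b)                   # A = e^a, B = e^b


A, B = fit_exponential(hours, bacteria)
print(f"A = {A:.2f}, B = {B:.2f}")

# Part (b): actual against fitted values
for x, y in zip(hours, bacteria):
    print(x, y, round(A * B ** x, 1))
```

The code keeps full precision at every step, whereas the hand calculation rounds b to 0.355 before computing a, so the printed estimate of A may differ slightly in the decimal places from the value obtained above.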
Activity 2
The table below gives experimental values of the pressure P of a given mass of gas corresponding to various values of the volume, V. According to thermodynamic principles, a relationship of the type $PV^a = c$, where a and c are constants, should exist between the variables.
Volume, V:      54.3   61.8   72.4   88.7   118.6   194.0
Pressure, P:    61.2   49.5   37.6   28.4   19.2    10.1
(a) Using the method of least squares, estimate the values of a and c.
(b) Write down the equation connecting P and V.
(c) Estimate P when V = 100.0
13.5
LINEAR REGRESSION AND THE TIME SERIES
Read pp 461–462 of textbook (OJ).
Activity 3
Attempt Question 23.8 of textbook (OJ).
13.6
COEFFICIENT OF DETERMINATION
Read pp 464–468 of textbook (OJ).
Activity 4
For the data set used in 12.3.1,
(a) compute the coefficient of determination and interpret its value.
(b) compare with the value of the correlation coefficient that you previously calculated for the data and comment.
13.7
SUMMARY
In this unit, you have learnt how to use sample data to fit the simple linear regression, relating a dependent variable y to a single independent variable x, and also how it can be used for prediction. Further, you have understood the importance of the coefficient of determination in regression.