statistics for management
January 14, 2017 | Author: Sreenivas Kodamasimham | Category: N/A
LESSON – 1 STATISTICS FOR MANAGEMENT
Session – 1
Duration: 1 hr
Meaning of Statistics
The term statistics refers both to numerical statements and to statistical methodology. When used in the sense of statistical data, it refers to the quantitative aspects of things and is a numerical description. Examples: income of a family, production of the automobile industry, sales of cars. These quantities are numerical. Some quantities are not numerical in themselves but can be made so by counting. The sex of a baby is not a number, but by counting the number of boys we can attach a numerical description to the sex of all newborn babies, for example by saying that 60% of all live-born babies are boys. This information then comes within the realm of statistics.
Definition
The word statistics can be used in two senses, singular and plural. In the narrow, plural sense, statistics denotes numerical data (statistical data). In the wide, singular sense, statistics refers to statistical methods. Definitions have therefore been grouped under two heads: “Statistics as data” and “Statistics as methods”.
Statistics as Data
Some definitions of statistics as data are:
a) Statistics are numerical statements of facts in any department of enquiry placed in relation to each other. - Bowley
b) By statistics we mean quantitative data affected to a marked extent by multiplicity of causes. - Yule and Kendall
c) By statistics we mean aggregates of facts affected to a marked extent by multiplicity of causes, numerically expressed, enumerated or estimated according to a reasonable standard of accuracy, collected in a systematic manner for a predetermined purpose and placed in relation to each other. - H. Secrist
The last definition is the most comprehensive and exhaustive. It throws more light on the characteristics of statistics and covers different aspects. The characteristics that statistics should possess according to H. Secrist can be listed as follows.
1. Statistics are aggregates of facts.
2. Statistics are affected to a marked extent by multiplicity of causes.
3. Statistics are numerically expressed.
4. Statistics should be enumerated or estimated.
5. Statistics should be collected with a reasonable standard of accuracy.
6. Statistics should be placed in relation to each other.
Statistics as Methods
Definition
a) “Statistics may be called the science of counting.” - A.L. Bowley
b) “Statistics is the science of estimates and probabilities.” - Boddington
c) Croxton and Cowden have given a clear and concise definition: “Statistics may be defined as the collection, presentation, analysis and interpretation of numerical data.”
Croxton and Cowden's definition names four stages. With the organisation of data added after collection, a statistical investigation passes through the following steps.
a) Collection of data
A statistical investigation is built on a systematic collection of data. The data are classified into two groups: i) internal data and ii) external data. Internal data are obtained from internal records related to the operations of a business organisation, such as production, sources of income and expenditure, inventory, purchases and accounts. External data are collected by, or purchased from, external agencies. External data can be either primary or secondary. Primary data are collected for the first time and are original, while secondary data have already been collected and published by some other agency.
b) Organisation of data
The collected data are a large mass of figures that needs to be organised. The data must be edited to rectify omissions, irrelevant answers and wrong computations. The edited data must then be classified and tabulated to suit further analysis.
c) Presentation of data
A large body of collected data cannot be understood and analysed easily and quickly. The data therefore need to be presented in tabular or graphic form. This systematic ordering and graphical presentation helps further analysis.
d) Analysis of data
Analysis requires establishing the relationship between one or more variables. It includes condensation, abstraction, summarisation, conclusion etc. Analysis can be done with statistical tools and techniques such as measures of central tendency, measures of dispersion, correlation and analysis of variance.
e) Interpretation of data
Interpretation requires deep insight into the subject. It involves drawing valid conclusions on the basis of the analysis of data. This work requires good experience and skill. The step is very important because the conclusions of the study rest on the interpretation.
We can define statistics as per Seligman as follows. “Statistics is a science which deals with the methods of collecting, classifying, presenting, comparing and interpreting the numerical data collected to throw light on an enquiry.”
Importance of statistics
In today's context statistics is indispensable. As the use of statistics has extended to various fields of experiment to draw valid conclusions, its importance and usage have increased. Research investigations in economics and commerce are largely statistical. The importance of statistics in various fields is listed below.
a) State affairs: In state affairs, statistics is useful in the following ways.
1. To collect information and study the economic condition of people in the state.
2. To assess the resources available in the state.
3. To help the state decide whether to accept or reject a policy based on statistics.
4. To provide information and analysis on various factors of the state such as wealth, crime, agricultural exports, education etc.
b) Economics: In economics, statistics is useful in the following ways.
1. Helps in the formulation of economic laws and policies.
2. Helps in studying economic problems.
3. Helps in compiling the national income accounts.
4. Helps in economic planning.
c) Business
1. Helps to take decisions on location and size.
2. Helps to study demand and supply.
3. Helps in forecasting and planning.
4. Helps in controlling the quality of the product or process.
5. Helps in making marketing decisions.
6. Helps in production planning and inventory management.
7. Helps in business risk analysis.
8. Helps in estimating long-term resource requirements and consumer preferences, and in business research.
d) Education: Statistics is necessary to formulate policies on starting new courses and to assess the facilities available for proposed courses.
e) Accounts and audits
1. Helps to study the correlation between profits and dividends, which indicates the trend of future profits.
2. In auditing, sampling techniques are followed.
Functions of statistics
Some important functions of statistics are as follows.
1. Collects and presents facts in a systematic manner.
2. Helps in the formulation and testing of hypotheses.
3. Facilitates the comparison of data.
4. Helps in predicting future trends.
5. Helps to find the relationship between variables.
6. Simplifies a mass of complex data.
7. Helps to formulate policies.
8. Helps the Government to take decisions.
Limitations of statistics
1. Statistics does not study qualitative phenomena.
2. Statistics does not deal with individual items.
3. Statistical results are true only on an average.
4. Statistical data should be uniform and homogeneous.
5. Statistical results depend on the accuracy of the data.
6. Statistical conclusions are not universally true.
7. Statistical results can be interpreted only by a person with a sound knowledge of statistics.
Distrust of Statistics
Distrust of statistics arises from lack of knowledge and from the limitations of its use, not from statistical science itself. It is due to the following reasons.
a) Figures are manipulated or incomplete.
b) Figures are quoted without their context.
c) Inconsistent definitions.
d) Selection of non-representative statistical units.
e) Inappropriate comparisons.
f) Wrong inferences drawn.
g) Errors in data collection.
Statistical Data
Statistical investigation is a long and comprehensive process and requires the systematic collection of data on a large scale. The validity and accuracy of the conclusions of a study depend on how well the data were gathered. Since the quality of the data greatly influences the conclusions, importance must be given to the data collection process. Statistical data may be classified as primary data and secondary data based on the source of data collection.
♦ Primary data Primary data are those which are collected for the first time by the investigator / researchers and are thus original in character. Thus, data collected by investigator may be for the specific purpose / study at hand. Primary data are usually in the shape of raw materials to which statistical methods are applied for the purpose of analysis and interpretation.
♦ Secondary data
Secondary data have already been collected for a purpose other than the problem at hand. They are data which have been collected by some other person and which have passed through statistical analysis at least once. Secondary data are usually in the shape of finished products, since they have already been treated statistically in one form or another. After statistical treatment the primary data lose their original shape and become secondary data. Data that are secondary for one organisation are primary data for the organisation that first collected and published them.
Primary Vs Secondary Data
Primary data are originated by the researcher for the specific purpose / study at hand, while secondary data have already been collected for a purpose other than the research work at hand. Primary data collection requires considerably more time and is relatively expensive, while secondary data are easily accessible, inexpensive and quickly obtained.
Table – A comparison of primary and secondary data
Aspect              Primary data                                   Secondary data
Collection purpose  For the problem at hand                        For other problems
Collection process  Very involved                                  Rapid and easy
Collection cost     High                                           Relatively low
Collection time     Long                                           Short
Suitability         Suits the object of the survey                 May or may not suit the object of the survey
Originality         Original                                       Not original
Precautions         No extra precautions required to use the data  Should be used with extra care
Limitations of secondary data
a) Since secondary data are collected for some other purpose, their usefulness to the current problem may be limited in several important ways, including relevance and accuracy.
b) The objectives, nature and methods used to collect secondary data may not be appropriate to the present situation.
c) Secondary data may not be accurate, or they may not be completely current or dependable.
Criteria for evaluating secondary data
Before using secondary data it is important to evaluate them on the following factors.
a) Specification and methodology used to collect the data
b) Error and accuracy of the data
c) The currency
d) The objective – the purpose for which the data were collected
e) The nature – the content of the data
f) The dependability
Sources of data
Primary source – The methods of collecting primary data
When data are neither internally available nor exist as a secondary source, primary sources of data would be appropriate. The various methods of collecting primary data are as follows.
a) Direct personal investigation
- Interview
- Observation
b) Indirect or oral investigation
c) Information from local agents and correspondents
d) Mailed questionnaires and schedules
e) Through enumerators
Secondary source – The methods of collecting secondary data
i) Published statistics: official publications of the Central Government
Ex:
- Central Statistical Organisation (CSO) – Ministry of Planning
- National Sample Survey Organisation (NSSO)
- Office of the Registrar General and Census Commissioner – GOI
- Directorate of Economics and Statistics – Ministry of Agriculture
- Labour Bureau – Ministry of Labour etc.
ii) Publications of semi-government organisations
Ex:
- The Institute of Foreign Trade, New Delhi
- The Institute of Economic Growth, New Delhi
iii) Publications of research institutes
Ex:
- Indian Statistical Institute
- Indian Agriculture Statistical Institute
- NCERT publications
- Indian Standards Institute etc.
iv) Publications of business and financial institutions
Ex:
- Trade association publications such as those of sugar factories, textile mills and the Indian Chamber of Industry and Commerce
- Stock exchange reports, co-operative society reports etc.
v) Newspapers and periodicals
Ex:
- The Financial Express, Eastern Economist, Economic Times, Indian Finance etc.
vi) Reports of various committees and commissions
Ex:
- Kothari commission report on education
- Pay commission reports
- Land reform committee reports etc.
vii) Unpublished statistics
- Internal and administrative data such as periodic profit and loss, sales, production rate, balance sheet, labour turnover, budgets etc.
Classification and Tabulation
The data collected for a statistical inquiry sometimes consist of a few fairly simple figures which can be easily understood without any special treatment. More often there is an overwhelming mass of raw data without any structure. Such an unwieldy, unorganised and shapeless mass of collected data cannot be rapidly or easily assimilated or interpreted, and is not fit for further analysis. To make the data simple and easily understandable, the first task is to condense and simplify them in such a way that irrelevant data are removed and the significant features stand out prominently. The procedure adopted for this purpose is known as the method of classification and tabulation. Classification helps proper tabulation.
“Classified and arranged facts speak for themselves; unarranged, unorganised they are dead as mutton.” - Prof. J.R. Hicks
♦ Meaning of Classification
Classification is the process of arranging things or data in groups or classes according to their resemblances and affinities. It gives expression to the unity of attributes that may subsist among a diversity of individuals.
♦ Definition of Classification
Classification is the process of arranging data into sequences and groups according to their common characteristics or separating them into different but related parts. - Secrist
The process of grouping a large number of individual facts and observations on the basis of similarity among the items is called classification. - Stockton & Clark
Characteristics of classification
a) Classification performs homogeneous grouping of data.
b) It brings out points of similarity and dissimilarity.
c) The classification may be either real or imaginary.
d) Classification is flexible enough to accommodate adjustments.
Objectives / purposes of classification
i) To simplify and condense large data
ii) To present the facts in an easily understandable form
iii) To allow comparisons
iv) To help to draw valid inferences
v) To relate the variables among the data
vi) To help further analysis
vii) To eliminate unwanted data
viii) To prepare for tabulation
Guiding principles (rules) of classification
The following are the general guiding principles for good classification.
a) Exhaustive: Classification should be exhaustive. Each and every item in the data must belong to one of the classes. Introduction of a residual class (e.g. 'others', 'miscellaneous') should be avoided.
b) Mutually exclusive: Each item should be placed in only one class.
c) Suitability: The classification should conform to the object of the inquiry.
d) Stability: Only one principle must be maintained throughout the classification and analysis.
e) Homogeneity: The items included in each class must be homogeneous.
f) Flexibility: A good classification should be flexible enough to accommodate new or changed situations.
Modes / Types of Classification
Modes or types of classification refer to the class categories into which the data can be sorted and tabulated. These categories depend on the nature of the data and the purpose for which the data are sought.
Important types of classification
a) Geographical (on the basis of area or region)
b) Chronological (on the basis of time, i.e. temporal / historical)
c) Qualitative (on the basis of character / attributes)
d) Numerical, quantitative (on the basis of magnitude)
a) Geographical Classification
In geographical classification, the classification is based on geographical regions.
Ex: Sales of the company, region-wise (in million rupees)
Region  Sales
North   285
South   300
East    185
West    235
b) Chronological Classification
If statistical data are classified according to the time of their occurrence, the classification is called chronological classification.
Ex: Sales reported by a departmental store
Month     Sales (Rs. in lakhs)
January   22
February  26
March     32
April     25
May       27
June      29
July      30
August    30
c) Qualitative Classification
In qualitative classification, the data are classified according to the presence or absence of attributes in the given units. Thus, the classification is based on some quality characteristic / attribute. Ex: Sex, Literacy, Education, Class grade etc. Further, it may be classified as
i) Simple classification
ii) Manifold classification
i) Simple classification: If the data are classified into only two classes, the classification is known as simple classification.
Ex:
a) Population into Male / Female
b) Population into Educated / Uneducated
ii) Manifold classification: In this classification, the classification is based on more than one attribute at a time.
Ex:
Population
  Smokers
    Literate: Male, Female
    Illiterate: Male, Female
  Non-smokers
    Literate: Male, Female
    Illiterate: Male, Female
d) Quantitative Classification
In quantitative classification, the classification is based on quantitative measurements of some characteristic, such as age, marks, income, production, sales etc. The quantitative phenomenon under study is known as a variable, and hence this classification is also called classification by variable.
Ex: For a 50 marks test, the marks obtained by students are classified as follows.
Marks    No. of students
0 – 10   5
10 – 20  7
20 – 30  10
30 – 40  25
40 – 50  3
Total students = 50
In this classification, the marks obtained by the students are the variable, and the number of students in each class represents the frequency.
Meaning and Definition of Tabulation
Tabulation may be defined as the systematic arrangement of data in columns and rows. It is designed to simplify the presentation of data for the purpose of analysis and statistical inference.
Major Objectives of Tabulation
1. To simplify complex data
2. To facilitate comparison
3. To economise space
4. To draw valid inferences / conclusions
5. To help further analysis
Differences between Classification and Tabulation
1. Data are first classified and then presented in tables; classification is the basis for tabulation.
2. Tabulation is a mechanical function of classification, because in tabulation the classified data are placed in rows and columns.
3. Classification is a process of statistical analysis, while tabulation is a process of presenting data in a suitable structure.
Classification of tables
Tables are classified on the basis of:
1. Coverage (simple and complex tables)
2. Objective / purpose (general purpose or reference table; special purpose or summary table)
3. Nature of inquiry (primary and derived tables)
Ex:
a) Simple table: Data are classified based on only one characteristic.
Distribution of marks
Class marks  No. of students
30 – 40      20
40 – 50      20
50 – 60      10
Total        50
b) Two-way table: Classification is based on two characteristics.
No. of students
Class marks  Boys  Girls  Total
30 – 40      10    10     20
40 – 50      15    5      20
50 – 60      3     7      10
Total        28    22     50
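The row and column totals of a two-way table can be checked mechanically. The following is only a minimal sketch: the function name `add_margins` and the nested-dictionary layout are illustrative assumptions, and the counts are the ones from the two-way table above.

```python
def add_margins(table):
    """Return (row_totals, column_totals, grand_total) for a nested dict.

    `table` maps each row label to a dict of {column label: count}.
    """
    row_totals = {row: sum(cells.values()) for row, cells in table.items()}
    column_totals = {}
    for cells in table.values():
        for column, count in cells.items():
            column_totals[column] = column_totals.get(column, 0) + count
    grand_total = sum(row_totals.values())
    return row_totals, column_totals, grand_total


# Counts taken from the two-way table of class marks by boys and girls.
marks_by_sex = {
    "30 - 40": {"Boys": 10, "Girls": 10},
    "40 - 50": {"Boys": 15, "Girls": 5},
    "50 - 60": {"Boys": 3, "Girls": 7},
}

rows, columns, total = add_margins(marks_by_sex)
print(rows)
print(columns)
print(total)
```

The grand total obtained from the rows must equal the one obtained from the columns, which is a quick consistency check on any two-way table.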
Frequency Distribution Frequency distribution is a table used to organize the data. The left column (called classes or groups) includes numerical intervals on a variable under study. The right column contains the list of frequencies, or number of occurrences of each class/group. Intervals are normally of equal size covering the sample observations range. It is simply a table in which the gathered data are grouped into classes and the number of occurrences which fall in each class is recorded.
♦ Definition
A frequency distribution is a statistical table which shows the set of all distinct values of the variable arranged in order of magnitude, either individually or in groups, with their corresponding frequencies. - Croxton and Cowden
A frequency distribution can be classified as
a) Series of individual observations
b) Discrete frequency distribution
c) Continuous frequency distribution
a) Series of individual observations
A series of individual observations is a series in which the items are listed one after another, one entry per observation. For statistical calculations, these observations can be arranged in either ascending or descending order. Such an arrangement is called an array.
Ex:
Roll No.  Marks obtained in statistics paper
1         83
2         80
3         75
4         92
5         65
The above list is raw data. Presented in this form, the data reveal little. Arranging the data in ascending or descending order of magnitude gives a better presentation and is called arraying of data.
Discrete (ungrouped) Frequency Distribution
If the data series is presented in such a way that it indicates the exact measurement of each unit, it is called a discrete frequency distribution. A discrete variable is one whose values differ from each other by definite amounts.
Ex: Assume that a survey has been made to find the number of post-graduates in 10 families chosen at random. The resulting raw data could be as follows.
0, 1, 3, 1, 0, 2, 2, 2, 2, 4
These data can be classified into an ungrouped frequency distribution. The number of post-graduates becomes the variable (x), for which we can list the frequency of occurrence (f) in tabular form as follows.
Number of post-graduates (x)  Frequency (f)
0                             2
1                             2
2                             4
3                             1
4                             1
The above example shows a discrete frequency distribution, where the variable takes discrete numerical values.
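The counting behind the ungrouped table can be sketched in a few lines of Python. This is an illustrative sketch, not part of the original lesson; the function name `discrete_frequency` is an assumption, and the data are the ten survey values listed above.

```python
from collections import Counter


def discrete_frequency(values):
    """Return (value, frequency) pairs sorted by value."""
    counts = Counter(values)        # tallies each distinct value
    return sorted(counts.items())   # order the rows by the variable x


# Number of post-graduates in the 10 surveyed families.
post_graduates = [0, 1, 3, 1, 0, 2, 2, 2, 2, 4]

for x, f in discrete_frequency(post_graduates):
    print(x, f)
```

Each printed pair is one row (x, f) of the discrete frequency distribution, and the frequencies add up to the number of families surveyed.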
Continuous frequency distribution (grouped frequency distribution)
A continuous data series is one where the measurements are only approximations and are expressed in class intervals within certain limits. In a continuous frequency distribution the class intervals theoretically continue from the start of the frequency distribution to the end without a break. According to Boddington, a variable which can take every intermediate value between the smallest and largest values in the distribution gives a continuous frequency distribution.
Ex: Marks obtained by 20 students in an exam for 50 marks are given below. Convert the data into continuous frequency distribution form.
18  23  28  29  44  28  48  33  32  43
24  29  32  39  49  42  27  33  28  29
By grouping the marks into class intervals of 5, the following frequency distribution table can be formed.
Marks    No. of students
0 – 5    0
5 – 10   0
10 – 15  0
15 – 20  1
20 – 25  2
25 – 30  7
30 – 35  4
35 – 40  1
40 – 45  3
45 – 50  2
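The grouping of the 20 marks into classes of width 5 can be reproduced with a short sketch. It assumes exclusive classes, that is, each class contains its lower limit but not its upper limit; the function name `grouped_frequency` and its parameters are illustrative assumptions.

```python
def grouped_frequency(values, start, width, n_classes):
    """Count values in exclusive classes [lower, lower + width).

    Returns a list of (lower limit, upper limit, frequency) rows.
    Assumption: a value equal to a class's upper limit belongs to the
    next class, so a value at or beyond the last upper limit is rejected.
    """
    counts = [0] * n_classes
    for v in values:
        index = int((v - start) // width)
        if not 0 <= index < n_classes:
            raise ValueError(f"{v} lies outside the classes")
        counts[index] += 1
    return [(start + k * width, start + (k + 1) * width, counts[k])
            for k in range(n_classes)]


# Marks of the 20 students from the example above.
marks = [18, 23, 28, 29, 44, 28, 48, 33, 32, 43,
         24, 29, 32, 39, 49, 42, 27, 33, 28, 29]

for lower, upper, freq in grouped_frequency(marks, start=0, width=5, n_classes=10):
    print(f"{lower} - {upper}: {freq}")
```

Under this convention a mark of exactly 50 would need either an extra class or a rule that closes the last interval; the sketch raises an error rather than guess.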
Technical terms used in formulating a frequency distribution
a) Class limits: The class limits are the smallest and largest values in the class. Ex: In the class 0 – 10, the lowest value is 0 and the highest value is 10. The two boundaries of the class are called the lower and upper limits of the class. Class limits are also called class boundaries.
b) Class interval: The difference between the upper and lower limits of a class is known as the class interval. Ex: In the class 0 – 10, the class interval is (10 – 0) = 10. The formula to find the class interval is given below.
$$ i = \frac{L - S}{K} $$
L = Largest value
S = Smallest value
K = Number of classes
Ex: If the marks of 60 students in a class vary between 40 and 100 and we want to form 6 classes, then L = 100, S = 40, K = 6 and the class interval is
$$ i = \frac{L - S}{K} = \frac{100 - 40}{6} = \frac{60}{6} = 10 $$
Therefore, the class intervals would be 40 – 50, 50 – 60, 60 – 70, 70 – 80, 80 – 90 and 90 – 100.
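As a small illustration of the formula, the sketch below computes the class interval from the largest value, the smallest value and the number of classes, and lists the resulting classes. The function name `class_intervals` is an assumption made for this example.

```python
def class_intervals(smallest, largest, n_classes):
    """Split the span from smallest to largest into n_classes equal classes.

    Returns a list of (lower limit, upper limit) pairs.
    """
    width = (largest - smallest) / n_classes   # i = (L - S) / number of classes
    return [(smallest + k * width, smallest + (k + 1) * width)
            for k in range(n_classes)]


# Marks between 40 and 100 grouped into 6 classes.
for lower, upper in class_intervals(40, 100, 6):
    print(f"{lower:g} - {upper:g}")
```

For the marks example this reproduces the six classes of width 10 from 40 – 50 up to 90 – 100.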
♦ Methods of forming class intervals
a) Exclusive method (overlapping)
In this method, the upper limit of one class interval is the lower limit of the next class. This method ensures continuity of the data.
Ex:
Marks    No. of students
20 – 30  5
30 – 40  15
40 – 50  25
A student whose mark is from 20 to 29.9 is included in the 20 – 30 class; a mark of exactly 30 goes into the 30 – 40 class. A better way of expressing this is
Marks                                              No. of students
20 to less than 30 (20 or more but less than 30)   5
30 to less than 40                                 15
40 to less than 50                                 25
Total students                                     45
b) Inclusive method (non-overlapping)
In this method, both the lower and the upper limit are included in the class.
Ex:
Marks    No. of students
20 – 29  5
30 – 39  15
40 – 49  25
A student whose mark is 29 is included in the 20 – 29 class interval, and a student whose mark is 39 is included in the 30 – 39 class interval.
♦ Class Frequency
The number of observations falling within a class interval is called its class frequency. Ex: A class frequency of 5 for the class 90 – 100 means that 5 students scored between 90 and 100. If we add the frequencies of all the individual classes, the total frequency represents the total number of items studied.
♦ Magnitude of class interval
The magnitude of the class interval depends on the range and the number of classes. The range is the difference between the highest and lowest values in the data series. A class interval is generally a multiple of 5, such as 5, 10, 15 or 20. Sturges' formula for the number of classes is given below.
K = 1 + 3.322 log N
K = Number of classes
log N = Logarithm (base 10) of the total number of observations
Ex: If the total number of observations is 100, the number of classes could be
K = 1 + 3.322 log 100
K = 1 + 3.322 x 2
K = 1 + 6.644
K = 7.644 ≅ 8 (rounded off)
NOTE: Under this formula the number of classes should not be less than 4 or greater than 20.
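Sturges' formula is easy to evaluate programmatically. The sketch below is an illustration only; the function name `sturges_classes` is an assumption, the logarithm is taken to base 10, and rounding to the nearest whole number is assumed for "rounded off".

```python
import math


def sturges_classes(n_observations):
    """Return (K, K rounded to the nearest whole number) by Sturges' formula."""
    k = 1 + 3.322 * math.log10(n_observations)
    return k, round(k)


k_exact, k_rounded = sturges_classes(100)
print(f"K = {k_exact:.3f}, use {k_rounded} classes")
```

For 100 observations the unrounded value is 7.644, which rounds to 8 classes as in the worked example.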
♦ Class mid point or class mark
The mid value or central value of the class interval is called the mid point.
$$ \text{Mid point of a class} = \frac{\text{lower limit of class} + \text{upper limit of class}}{2} $$
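♦ Sturges formula to find size of class interval
$$ h = \frac{\text{Range}}{1 + 3.322 \log N} $$
where h is the size of the class interval and N is the total number of observations.
Ex: In a group of 50 workers, the highest wage is Rs. 250 and the lowest wage is Rs. 100 per day. Find the size of the interval.
$$ h = \frac{\text{Range}}{1 + 3.322 \log N} = \frac{250 - 100}{1 + 3.322 \log 50} = \frac{150}{6.644} = 22.58 \cong 23 $$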
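The size of the class interval from Sturges' formula can be evaluated directly. This is a minimal sketch, assuming base-10 logarithms and N = 50 workers (the value whose logarithm appears in the example); the function name `sturges_width` is illustrative.

```python
import math


def sturges_width(highest, lowest, n_observations):
    """Size of class interval h = Range / (1 + 3.322 log10 N)."""
    value_range = highest - lowest
    return value_range / (1 + 3.322 * math.log10(n_observations))


h = sturges_width(highest=250, lowest=100, n_observations=50)
print(f"h = {h:.2f}")
```

With a range of 150 and a denominator of 1 + 3.322 x 1.699 = 6.644, the interval size comes to about 22.58, which would be rounded to a convenient width such as 23 or 25.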
Constructing a frequency distribution
The following guidelines may be considered for the construction of a frequency distribution.
a) The classes should be clearly defined, and each observation must belong to one and only one class interval. The class intervals must be exhaustive and non-overlapping.
b) The number of classes should be neither too large nor too small. Too few classes result in wider intervals and a loss of accuracy. Too many class intervals result in complexity.
c) All intervals should be of the same width. This is preferred for easy computation.
$$ \text{Width of interval} = \frac{\text{Range}}{\text{Number of classes}} $$
d) Open-end classes should be avoided, since they create difficulty in analysis and interpretation.
e) Intervals should be continuous throughout the distribution. This is important for a continuous distribution.
f) The lower limits of the class intervals should be simple multiples of the interval.
Ex: The weights of a sample of 30 students of a particular class are as follows. Construct a frequency distribution for the given data.
62  58  58  52  48  53  54  63  69  63
57  56  46  48  53  56  57  59  58  53
52  56  57  52  52  53  54  58  61  63
♦ Steps of construction
Step 1
Find the range of the data.
(H) Highest value = 69
(L) Lowest value = 46
Range = H – L = 69 – 46 = 23
Step 2
Find the number of class intervals using Sturges' formula.
K = 1 + 3.322 log N
K = 1 + 3.322 log 30
K = 5.91
Say K = 6
∴ No. of classes = 6
Step 3
Find the width of the class interval.
$$ \text{Width of class interval} = \frac{\text{Range}}{\text{Number of classes}} = \frac{23}{6} = 3.833 \cong 4 $$
Step 4
Count the observations that belong to each class interval and assign the total frequency to the corresponding class interval as follows.
Class interval  Tally bars  Frequency
46 – 50         |||         3
50 – 54         |||| |||    8
54 – 58         |||| |||    8
58 – 62         |||| |      6
62 – 66         ||||        4
66 – 70         |           1
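The four construction steps can be strung together in one short sketch. It is only an illustration under stated assumptions: the classes are exclusive (lower limit included, upper limit excluded), they start at the lowest observed value, the number of classes is Sturges' value rounded to the nearest integer, and the width is rounded up to a whole number. The function name `build_frequency_table` is hypothetical.

```python
import math


def build_frequency_table(values):
    """Range -> number of classes -> class width -> class frequencies."""
    low, high = min(values), max(values)
    value_range = high - low                                  # Step 1
    k = round(1 + 3.322 * math.log10(len(values)))            # Step 2 (Sturges)
    width = max(1, math.ceil(value_range / k))                # Step 3
    if high >= low + k * width:
        k += 1   # keep the highest value inside the last exclusive class
    table = []
    for c in range(k):                                        # Step 4
        lower = low + c * width
        upper = lower + width
        freq = sum(1 for v in values if lower <= v < upper)
        table.append((lower, upper, freq))
    return table


# Weights of the 30 students from the example above.
weights = [62, 58, 58, 52, 48, 53, 54, 63, 69, 63,
           57, 56, 46, 48, 53, 56, 57, 59, 58, 53,
           52, 56, 57, 52, 52, 53, 54, 58, 61, 63]

for lower, upper, freq in build_frequency_table(weights):
    print(f"{lower} - {upper}: {freq}")
```

For the 30 weights this gives a range of 23, six classes of width 4 starting at 46, and frequencies that add up to 30.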
Cumulative frequency distribution
A cumulative frequency distribution indicates directly the number of units that lie above or below specified values of the class intervals. When the investigator is interested in the number of cases below a specified value, that value is taken as the upper limit of a class interval, and the result is known as a ‘less than’ cumulative frequency distribution. When the interest lies in the number of cases above a specified value, that value is taken as the lower limit of a class interval, and the result is known as a ‘more than’ cumulative frequency distribution. The cumulative frequency is obtained simply by summing up the consecutive frequencies.
Ex:
Marks    No. of students  ‘Less than’ cumulative frequency
0 – 10   5                5
10 – 20  3                8
20 – 30  10               18
30 – 40  20               38
40 – 50  12               50
In the above ‘less than’ cumulative frequency distribution, 5 students scored less than 10, 8 less than 20, 18 less than 30 and so on. Similarly, the following table shows the ‘more than’ (greater than) cumulative frequency distribution.
Ex:
Marks    No. of students  ‘More than’ cumulative frequency
0 – 10   5                50
10 – 20  3                45
20 – 30  10               42
30 – 40  20               32
40 – 50  12               12
In the above ‘more than’ cumulative frequency distribution, 50 students scored 0 or more, 45 scored 10 or more, 42 scored 20 or more and so on.
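Both kinds of cumulative frequency are running totals, taken from the top of the table for ‘less than’ and from the bottom for ‘more than’. The sketch below illustrates this with the class frequencies from the marks example; the function names are assumptions made for the illustration.

```python
from itertools import accumulate


def less_than_cumulative(frequencies):
    """Running total from the first class downwards."""
    return list(accumulate(frequencies))


def more_than_cumulative(frequencies):
    """Running total from the last class upwards."""
    return list(accumulate(reversed(frequencies)))[::-1]


# Class frequencies for marks 0-10, 10-20, 20-30, 30-40, 40-50.
frequencies = [5, 3, 10, 20, 12]

print(less_than_cumulative(frequencies))
print(more_than_cumulative(frequencies))
```

The last ‘less than’ value and the first ‘more than’ value both equal the total number of students, which is a useful check on the arithmetic.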
Diagrammatic and Graphic Representation
The data collected can be presented graphically or pictorially for easy understanding and quick interpretation. Diagrams and graphs give visual indications of magnitudes, groupings, trends and patterns in the data. These features can be presented more simply in graphical form. Diagrams and graphs also help in comparing variables.
Diagrammatic presentation
A diagram is a visual form for the presentation of statistical data. The term covers various types of devices such as bars, circles, maps, pictorials and cartograms.
Importance of Diagrams
1. They are simple, attractive and easily understood.
2. They give quick information.
3. They help to compare variables.
4. They are more suitable for illustrating discrete data.
5. They leave a more lasting impression on the reader's mind.
Limitations of diagrams
1. Diagrams show only approximate values.
2. Diagrams are not suitable for further analysis.
3. Some diagrams (multidimensional) can be understood only by experts.
4. Details cannot be provided fully.
5. They are useful only for comparison.
General Rules for drawing diagrams
i) Each diagram should have a suitable title, at the top or bottom, indicating the theme it is intended to present.
ii) The size of the diagram should emphasise the important characteristics of the data.
iii) An appropriate proportion should be maintained between the length and breadth of the diagram.
iv) A proper / suitable scale should be adopted for the diagram.
v) Selection of an appropriate diagram is important; a wrong selection may mislead the reader.
vi) The source of the data should be mentioned at the bottom.
vii) The diagram should be simple and attractive.
viii) The diagram should be effective rather than complex.
Some important types of diagrams
a) One-dimensional diagrams (line and bar)
b) Two-dimensional diagrams (rectangle, square, circle)
c) Three-dimensional diagrams (cube, sphere, cylinder etc.)
d) Pictogram
e) Cartogram
a) One-dimensional diagrams (line and bar)
In one-dimensional diagrams, only the length of the bars or lines is taken into account; the width of the bars is not considered. One-dimensional diagrams are classified mainly as follows.
i) Line diagram
ii) Bar diagram
- Vertical bar diagram
- Horizontal bar diagram
- Multiple (compound) bar diagram
- Sub-divided (component) bar diagram
- Percentage sub-divided bar diagram
i) Line diagram
This is the simplest type of one-dimensional diagram. The heights of the lines are drawn in proportion to the size of the figures, and the distance between the lines is kept uniform. Its limitations are that it is not attractive and that it cannot show more than one item of information.
Ex: Draw the line diagram for the following data.
Year                                                      2001  2002  2003  2004  2005  2006
No. of students passed in first class with distinction       5     7    12     5    13    15
[Figure: line diagram. Vertical axis: No. of students passed in FCD; horizontal axis: Year (2001 to 2006). Line heights: 5, 7, 12, 5, 13, 15.]
Indication of diagram: The highest FCD is in 2006 and the lowest FCD is in 2001 and 2004.
b) Simple bar diagram
A simple bar diagram can be drawn using horizontal or vertical bars. It is a very common diagram in business and economics.
Vertical bar diagram
Ex: The annual expenses of maintaining cars of various types are given below. Draw the vertical bar diagram. The annual expenses include fuel + maintenance + repair + assistance + insurance.
Type of the car    Expense in Rs. / Year
Maruthi Udyog      47533
Hyundai            59230
Tata Motors        63270
Source: 2005 TNS TCS Study. Published at: Vijaya Karnataka, dated: 03.08.2006
[Figure: vertical bar diagram. Vertical axis: annual expense in Rs. (30000 to 70000); bars: Maruthi Udyog 47533, Hyundai 59230, Tata Motors 63270.]
Indication of diagram
a) The annual expenses of a Maruthi Udyog car are lower than those of the other brands depicted.
b) The high annual expenses of a Tata Motors car can be seen from the diagram.
♦ Horizontal bar diagram
Ex: Production data for the world's top 10 steel makers are given below. Draw the horizontal bar diagram.
Steel maker       Prodn. in million tonnes
Arcelor Mittal    110
Nippon            32
POSCO             31
JFE               30
BAO Steel         24
US Steel          20
NUCOR             18
RIVA              18
Thyssen-krupp     17
Tangshan          16
[Figure: horizontal bar diagram "Top - 10 Steel Makers". Horizontal axis: Production of Steel (Million Tonnes), 0 to 120; one bar per steel maker.]
Source: ISSB. Published by India Today
♦ Compound bar diagram (Multiple bar diagram)
Multiple bar diagrams are used to provide more information than a simple bar diagram. A multiple bar diagram shows more than one phenomenon and is highly useful for direct comparison. The bars are drawn side by side, and different colours, shades or hatches can be used to indicate each variable.
Ex: Draw the bar diagram for the following data. The resale values of the cars (Rs. 000) are as follows.
Year (Model)    Santro    Zen    Wagonr
2003            208       252    248
2004            240       278    274
2005            261       296    302
[Figure: multiple bar diagram. Vertical axis: Value in Rs. (000), 0 to 350; horizontal axis: Model of Car (Santro, Zen, Wagonr); one bar per year, 2003 to 2005.]
Source: True value used car purchase data. Published by: Vijaya Karnataka, dated: 03.08.2006
Ex: Represent the following in a suitable diagram.
Class      A       B       C
Male       1000    1500    1500
Female     500     800     1000
Total      1500    2300    2500
[Figure: bar diagram. Vertical axis: Population (in Nos.), 0 to 2500; horizontal axis: Class; legend: Male, Female.]
Ex: Draw the suitable diagram for the following data.
                       Investment in 2004 in Rs.    Investment in 2005 in Rs.
Mode of investment     Investment    %age           Investment    %age
NSC                    25000         43.10          30000         45.45
MIS                    15000         25.86          10000         15.15
Mutual Fund            15000         25.86          25000         37.87
LIC                    3000          5.17           1000          1.52
Total                  58000         100            66000         100
[Figure: percentage sub-divided bar diagram. Vertical axis: % of Investment, 0 to 100; horizontal axis: Year (2004, 2005); each bar is divided in proportion to the percentages above.]
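The percentage columns of the investment table can be reproduced with a short calculation. The sketch below is only illustrative: the function name `percentage_shares` is an assumption, not part of the original notes, and it rounds to two decimals, so the Mutual Fund share for 2005 comes out as 37.88 where the table shows the truncated 37.87.

```python
# Investment by mode in Rs. for (2004, 2005), taken from the table above.
investments = {
    "NSC": (25000, 30000),
    "MIS": (15000, 10000),
    "Mutual Fund": (15000, 25000),
    "LIC": (3000, 1000),
}


def percentage_shares(values):
    """Express each value as a percentage of the total, rounded to 2 decimals."""
    total = sum(values)
    return [round(100 * v / total, 2) for v in values]


shares_2004 = percentage_shares([v[0] for v in investments.values()])
shares_2005 = percentage_shares([v[1] for v in investments.values()])

for mode, s04, s05 in zip(investments, shares_2004, shares_2005):
    print(f"{mode:12s} {s04:6.2f} {s05:6.2f}")
```

Each bar of a percentage sub-divided diagram is then 100 units high and is cut at the running totals of these shares.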
Two-dimensional diagram
In a two-dimensional diagram both the breadth and the length of the diagram are considered, since the area of the diagram represents the data. The important two-dimensional diagrams are
a) Rectangular diagram
b) Square diagram
a) Rectangular diagram
Rectangular diagrams are used to depict two or more variables. This diagram helps direct comparison. The areas of the rectangles are kept in proportion to the values. It may be of two types.
i) Percentage sub-divided rectangular diagram
ii) Sub-divided rectangular diagram
In the former case, the widths of the rectangles are proportional to the values, the various components of the values are converted into percentages, and the rectangles are divided according to them. The latter case is used to show related phenomena such as cost per unit, quality of production etc.
Ex: Draw the rectangular diagram for the following data.
Expenditure in Rs.
Item Expenditure      Family A    Family B
Provisional stores    1000        2000
Education             250         500
Electricity           300         700
House Rent            1500        2800
Vehicle Fuel          500         1000
Total                 3500        7000
The total expenditure of each family is taken as 100 and the expenditure on individual items is expressed as a percentage. The widths of the two rectangles are in proportion to the total expenses of the two families, i.e. 3500 : 7000 or 1 : 2. The heights within each rectangle follow the percentages of expenses.
Monthly expenditure
                      Family A (Rs. 3500)    Family B (Rs. 7000)
Item Expenditure      Rs.       %age         Rs.       %age
Provisional stores    1000      28.57        2000      28.57
Education             250       7.14         500       7.14
Electricity           300       8.57         700       10
House Rent            1500      42.85        2800      40
Vehicle Fuel          500       12.85        1000      14.28
Total                 3500      100          7000      100
[Figure: percentage sub-divided rectangular diagram. Vertical axis: % of Expenditure, 0 to 100; horizontal axis: Family (A, B); legend: Provisional Stores, Education, Electricity, House Rent, Vehicle Fuel.]
b) Square diagram
To draw square diagrams, the square root of the value of each item is taken and used as the side of the square. A suitable scale may be used, and the ratios between the sides must be maintained.
Ex: Draw the square diagram for the following data: 4900, 2500, 1600.
Solution: The square roots of the items are 70, 50 and 40. Dividing each by 10 gives sides of 7, 5 and 4.
[Figure: square diagram with three squares of sides 7, 5 and 4, representing the values 4900, 2500 and 1600.]
Pie diagram
A pie diagram shows the partitioning of a total into its component parts. It is used to show classes or groups of data in proportion to the whole data set. The entire pie represents all the data, while each slice represents a different class or group within the whole. The following illustration shows the construction of a pie diagram.
Ex: Revenue collections from petroleum products by the government for the year 2005-2006, in Rs. crores, are as follows. Draw the pie diagram.
Customs                       9600
Excise                        49300
Corporate Tax and dividend    18900
State's taking                48800
Total                         126600
Solution:
Item / Source                 Value in crores    Angle of circle                        %ge
Customs                       9600               9600 × 360 / 126600 = 27.30°          7.58
Excise                        49300              49300 × 360 / 126600 = 140.20°        39.00
Corporate Tax and Dividend    18900              18900 × 360 / 126600 = 53.70°         14.92
State's taking                48800              48800 × 360 / 126600 = 138.80°        38.50
Total                         126600             360°                                   100
[Figure: pie diagram with four slices: Customs 7.58, Excise 39, Corporate Tax and Dividend 14.92, State's taking 38.5.]
Source: India Today 19 June, 2006
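The angle and percentage of each slice follow directly from value / total. A minimal sketch, assuming the revenue figures above; the helper name `pie_slices` is hypothetical. Exact arithmetic gives values that differ slightly in the last digit from the rounded figures in the solution table (for example 53.74° rather than 53.70°).

```python
# Revenue collections in Rs. crores, taken from the example above.
revenue = {
    "Customs": 9600,
    "Excise": 49300,
    "Corporate Tax and Dividend": 18900,
    "State's taking": 48800,
}


def pie_slices(data):
    """Map each item to (angle in degrees, percentage of total)."""
    total = sum(data.values())
    return {name: (360 * v / total, 100 * v / total) for name, v in data.items()}


slices = pie_slices(revenue)
for name, (angle, pct) in slices.items():
    print(f"{name:28s} {angle:7.2f} deg {pct:6.2f} %")
```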
Choice or selection of diagram
There are many methods of depicting statistical data through diagrams. No single diagram is suited to all purposes. Choosing a diagram to suit a given set of data requires skill, knowledge and experience. Primarily, the choice depends upon the nature of the data, the purpose of the presentation and the audience for whom it is meant. The nature of the data helps in deciding between a one-dimensional, two-dimensional or three-dimensional diagram. The following points are to be kept in mind when choosing a diagram.
1. For the common man, who has less knowledge of statistics, cartograms and pictograms are suited.
2. To present the components apart from the magnitude of values, a sub-divided bar diagram can be used.
3. When a large number of components are to be shown, a pie diagram is suitable.
Graphic presentation
A graphic presentation is a visual form of presentation. Graphs are drawn on a special type of paper known as graph paper. Common graphic representations are
a) Histogram
b) Frequency polygon
c) Cumulative frequency curve (ogive)
Advantages of graphic presentation
1. It provides an attractive and impressive view.
2. It simplifies the complexity of data.
3. It helps direct comparison.
4. It helps further statistical analysis.
5. It is the simplest method of presenting data.
6. It shows the trend and pattern of data.
Difference between graph and diagram
Diagram                                                          Graph
1. Ordinary paper can be used                                    1. Graph paper is required
2. It is attractive and easily understandable                    2. Needs some effort to understand
3. It is appropriate and effective to measure more variables     3. It creates problems
4. It can't be used for further analysis                         4. Can be used for further analysis
5. It gives comparison                                           5. It shows relationship between variables
6. Data are represented by bars, rectangles                      6. Points and lines are used to represent data
Frequency Histogram
In this type of representation the given data are plotted in the form of a series of rectangles. Class intervals are marked along the x-axis and the frequencies along the y-axis, according to a suitable scale. Unlike the bar chart, which is one-dimensional, a histogram is two-dimensional: both the length and the width are important. A histogram is constructed from a frequency distribution of grouped data, where the height of each rectangle is proportional to the respective frequency and the width represents the class interval. Adjacent rectangles touch each other; a blank space between rectangles means that the class is empty and there are no values in that class interval.
Ex: Construct a histogram for the following data.
Marks obtained (x)    No. of students (f)    Mid point
15 – 25               5                      20
25 – 35               3                      30
35 – 45               7                      40
45 – 55               5                      50
55 – 65               3                      60
65 – 75               7                      70
Total                 30
For convenience, the frequency distribution is presented along with the mid-point of each class interval, where the mid-point is simply the average of the lower and upper boundaries of the class interval.
[Figure: histogram. Vertical axis: Frequency (No. of students), 0 to 7; horizontal axis: Class Interval (Marks), 15 to 75.]
Frequency polygon
A frequency polygon is a line chart of a frequency distribution in which either the values of a discrete variable or the mid-points of the class intervals are plotted against the frequencies, and the plotted points are joined by straight lines. Since the frequencies do not start or end at zero, the diagram as such would not touch the horizontal axis. However, the area under the entire curve must be the same as that of the histogram, which is 100%, so the curve must be 'enclosed'. The first mid-point is joined to a 'fictitious' preceding mid-point whose frequency is zero, so that the curve begins on the horizontal axis, and the last mid-point is joined to a 'fictitious' succeeding mid-point whose frequency is also zero, so that the curve ends on the horizontal axis. This enclosed diagram is known as a 'frequency polygon'.
Ex: Construct a frequency polygon for the following data.
Marks (CI)    No. of frequencies (f)    Mid-point
15 – 25       5                         20
25 – 35       3                         30
35 – 45       7                         40
45 – 55       5                         50
55 – 65       3                         60
65 – 75       7                         70
[Figure: frequency polygon. Vertical axis: Frequency, 0 to 10; horizontal axis: Mid point (x), 0 to 100.]
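The 'enclosing' step can be made concrete by listing the points that would be plotted. This is a sketch under the assumption that all classes have equal width, so the fictitious mid-points lie one class width before the first mid-point and one class width after the last; the function name `polygon_points` is illustrative only.

```python
# Class intervals and frequencies from the example above.
classes = [(15, 25), (25, 35), (35, 45), (45, 55), (55, 65), (65, 75)]
freqs = [5, 3, 7, 5, 3, 7]


def polygon_points(classes, freqs):
    """Return (mid-point, frequency) pairs, closed at both ends with zero frequency."""
    width = classes[0][1] - classes[0][0]          # assumes equal class widths
    mids = [(lo + hi) / 2 for lo, hi in classes]
    xs = [mids[0] - width] + mids + [mids[-1] + width]
    ys = [0] + list(freqs) + [0]
    return list(zip(xs, ys))


points = polygon_points(classes, freqs)
print(points)
```

For this data the polygon starts at mid-point 10 and ends at mid-point 80, both with frequency zero.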
Cumulative frequency curve (ogive)
Ogives are graphic representations of a cumulative frequency distribution. They are classified as 'less than' and 'more than' ogives. In the 'less than' case, cumulative frequencies are plotted against the upper boundaries of their respective class intervals. In the 'more than' case, cumulative frequencies are plotted against the lower boundaries of their respective class intervals. Ogives are used for comparison purposes. Several ogives can be compared on the same grid, using different colours for easier visualisation and differentiation.
Ex:
Marks (CI)    No. of frequencies (f)    Mid-point    Cum. Freq. Less than    Cum. Freq. More than
15 – 25       5                         20           5                       30
25 – 35       3                         30           8                       25
35 – 45       7                         40           15                      22
45 – 55       5                         50           20                      15
55 – 65       3                         60           23                      10
65 – 75       7                         70           30                      7
[Figure: 'less than' ogive. Vertical axis: Less than Cumulative Frequency; horizontal axis: Upper Boundary (CI).]
[Figure: 'more than' ogive. Vertical axis: More than Cumulative Frequency; horizontal axis: Lower Boundary (CI).]
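Both cumulative columns can be derived from the class frequencies alone. The following sketch uses `itertools.accumulate` from the Python standard library; the variable names are illustrative, and pairing 'less than' totals with upper boundaries and 'more than' totals with lower boundaries follows the axis labels of the two ogives.

```python
from itertools import accumulate

# Class intervals and frequencies from the example above.
classes = [(15, 25), (25, 35), (35, 45), (45, 55), (55, 65), (65, 75)]
freqs = [5, 3, 7, 5, 3, 7]

# 'Less than': running total from the first class downwards.
less_than = list(accumulate(freqs))
# 'More than': running total from the last class upwards.
more_than = list(accumulate(freqs[::-1]))[::-1]

# Points to plot for each ogive.
less_than_points = [(hi, cf) for (_, hi), cf in zip(classes, less_than)]
more_than_points = [(lo, cf) for (lo, _), cf in zip(classes, more_than)]

print(less_than_points)
print(more_than_points)
```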
LESSON – 1 STATISTICS FOR MANAGEMENT
Session – 2
Duration: 1 hr
Classification and Tabulation
The data collected for the purpose of a statistical inquiry sometimes consist of a few fairly simple figures, which can be easily understood without any special treatment. More often, however, there is an overwhelming mass of raw data without any structure. Such an unwieldy, unorganised and shapeless mass of collected data cannot be rapidly or easily assimilated or interpreted. Unorganised data are not fit for further analysis and interpretation. To make the data simple and easily understandable, the first task is to condense and simplify them in such a way that irrelevant data are removed and their significant features stand out prominently. The procedure adopted for this purpose is known as the method of classification and tabulation. Classification helps proper tabulation. “Classified and arranged facts speak for themselves; unarranged, unorganised they are dead as mutton”. - Prof. J.R. Hicks
♦ Meaning of Classification
Classification is a process of arranging things or data in groups or classes according to their resemblances and affinities; it gives expression to the unity of attributes that may subsist among a diversity of individuals.
♦ Definition of Classification Classification is the process of arranging data into sequences and groups according to their common characteristics or separating them into different but related parts. - Secrist The process of grouping large number of individual facts and observations on the basis of similarity among the items is called classification. - Stockton & Clark
Characteristics of classification
a) Classification performs homogeneous grouping of data.
b) It brings out points of similarity and dissimilarity.
c) The classification may be either real or imaginary.
d) Classification is flexible enough to accommodate adjustments.
Objectives / purposes of classification
i) To simplify and condense large data
ii) To present the facts in an easily understandable form
iii) To allow comparisons
iv) To help to draw valid inferences
v) To relate the variables among the data
vi) To help further analysis
vii) To eliminate unwanted data
viii) To prepare for tabulation
Guiding principles (rules) of classification
The following are the general guiding principles for good classification.
a) Exhaustive: Classification should be exhaustive. Each and every item in the data must belong to one of the classes. Introduction of a residual class (i.e. others, miscellaneous etc.) should be avoided.
b) Mutually exclusive: Each item should be placed in only one class.
c) Suitability: The classification should conform to the object of the inquiry.
d) Stability: Only one principle must be maintained throughout the classification and analysis.
e) Homogeneity: The items included in each class must be homogeneous.
f) Flexibility: A good classification should be flexible enough to accommodate new or changed situations.
Modes / Types of Classification Modes / Types of classification refers to the class categories into which the data could be sorted out and tabulated. These categories depend on the nature of data and purpose for which data is being sought.
Important types of classification
a) Geographical (i.e. on the basis of area or region)
b) Chronological (on a temporal / historical basis, i.e. with respect to time)
c) Qualitative (on the basis of character / attributes)
d) Numerical, quantitative (on the basis of magnitude)
a) Geographical Classification
In geographical classification, the classification is based on geographical regions.
Ex: Sales of the company (in million rupees), region-wise
Region    Sales
North     285
South     300
East      185
West      235
b) Chronological Classification
If the statistical data are classified according to the time of their occurrence, the classification is called chronological classification.
Ex: Sales reported by a departmental store
Month       Sales (Rs.) in lakhs
January     22
February    26
March       32
April       25
May         27
June        30
c) Qualitative Classification
In qualitative classification, the data are classified according to the presence or absence of attributes in the given units. Thus, the classification is based on some quality characteristics / attributes. Ex: Sex, Literacy, Education, Class grade etc. Further, it may be classified as
a) Simple classification
b) Manifold classification
i) Simple classification: If the classification is done into only two classes, it is known as simple classification.
Ex: a) Population into Male / Female
b) Population into Educated / Uneducated
ii) Manifold classification: In this classification, the classification is based on more than one attribute at a time.
Ex:
Population
    Smokers
        Literate: Male, Female
        Illiterate: Male, Female
    Non-smokers
        Literate: Male, Female
        Illiterate: Male, Female
d) Quantitative Classification
In quantitative classification, the classification is based on quantitative measurements of some characteristic, such as age, marks, income, production, sales etc. The quantitative phenomenon under study is known as a variable, and hence this classification is also called classification by variable.
Ex: For a 50 marks test, the marks obtained by students are classified as follows.
Marks      No. of students
0 – 10     5
10 – 20    7
20 – 30    10
30 – 40    25
40 – 50    3
Total students = 50
In this classification the marks obtained by students is the variable and the number of students in each class represents the frequency.
Tabulation
Meaning and Definition of Tabulation
Tabulation may be defined as the systematic arrangement of data in columns and rows. It is designed to simplify the presentation of data for the purpose of analysis and statistical inference.
Major Objectives of Tabulation
1. To simplify complex data
2. To facilitate comparison
3. To economize on space
4. To draw valid inferences / conclusions
5. To help further analysis
Differences between Classification and Tabulation
1. Data are first classified and then presented in tables; classification is the basis for tabulation.
2. Tabulation is a mechanical function of classification, because in tabulation classified data are placed in rows and columns.
3. Classification is a process of statistical analysis, while tabulation is a process of presenting data in a suitable structure.
Classification of tables
Classification is done based on
1. Coverage (simple and complex table)
2. Objective / purpose (general purpose / reference table, special table or summary table)
3. Nature of inquiry (primary and derived table)
Ex:
a) Simple table: Data are classified based on only one characteristic.
Distribution of marks
Class Marks    No. of students
30 – 40        20
40 – 50        20
50 – 60        10
Total          50
b) Two-way table: Classification is based on two characteristics.
               No. of students
Class Marks    Boys    Girls    Total
30 – 40        10      10       20
40 – 50        15      5        20
50 – 60        3       7        10
Total          28      22       50
Frequency Distribution Frequency distribution is a table used to organize the data. The left column (called classes or groups) includes numerical intervals on a variable under study. The right column contains the list of frequencies, or number of occurrences of each class/group. Intervals are normally of equal size covering the sample observations range. It is simply a table in which the gathered data are grouped into classes and the number of occurrences, which fall in each class, is recorded.
♦ Definition
A frequency distribution is a statistical table which shows the set of all distinct values of the variable arranged in order of magnitude, either individually or in groups, with their corresponding frequencies. - Croxton and Cowden
A frequency distribution can be classified as
a) Series of individual observations
b) Discrete frequency distribution
c) Continuous frequency distribution
a) Series of individual observations
A series of individual observations is a series in which the items are listed one after another, one entry per observation. For statistical calculations, these observations can be arranged in either ascending or descending order. This is called an array.
Ex:
Roll No.    Marks obtained in statistics paper
1           83
2           80
3           75
4           92
5           65
The above list is raw data. Presented in this form, the data do not reveal much information. Arranging the data in ascending or descending order of magnitude gives a better presentation; this is called arraying the data.
Discrete (ungrouped) Frequency Distribution
If the data series is presented in such a way that the exact measurement of each unit is indicated, it is called a discrete frequency distribution. A discrete variable is one whose values differ from each other by definite amounts.
Ex: Assume that a survey has been made to find the number of post-graduates in 10 families selected at random; the resulting raw data could be as follows.
0, 1, 3, 1, 0, 2, 2, 2, 2, 4
These data can be classified into an ungrouped frequency distribution. The number of post-graduates becomes the variable (x), for which we can list the frequency of occurrence (f) in tabular form as follows.
Number of post graduates (x)    Frequency (f)
0                               2
1                               2
2                               4
3                               1
4                               1
The above example shows a discrete frequency distribution, where the variable has discrete numerical values.
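Counting how often each distinct value occurs is exactly what an ungrouped frequency distribution does. A minimal sketch using `collections.Counter` from the Python standard library; the variable names are illustrative.

```python
from collections import Counter

# Number of post-graduates in each of the 10 surveyed families.
data = [0, 1, 3, 1, 0, 2, 2, 2, 2, 4]

# Counter tallies the occurrences of each distinct value (x -> f).
frequency = Counter(data)

# Sort by the value of the variable to get the table rows.
table = sorted(frequency.items())
for x, f in table:
    print(x, f)
```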
Continuous frequency distribution (grouped frequency distribution)
A continuous data series is one where the measurements are only approximations and are expressed in class intervals within certain limits. In a continuous frequency distribution the class intervals theoretically continue from the start of the frequency distribution to the end without a break. According to Boddington, a variable which can take every intermediate value between the smallest and largest values in the distribution gives a continuous frequency distribution.
Ex: The marks obtained by 20 students in an exam carrying 50 marks are given below. Convert the data into a continuous frequency distribution.
18  23  28  29  44  28  48  33  32  43
24  29  32  39  49  42  27  33  28  29
By grouping the marks into class intervals of width 5, the following frequency distribution table can be formed.
Marks      No. of students
0 – 5      0
5 – 10     0
10 – 15    0
15 – 20    1
20 – 25    2
25 – 30    7
30 – 35    4
35 – 40    1
40 – 45    3
45 – 50    2
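The grouping can be sketched in a few lines. The assumptions here are that each class includes its lower limit and excludes its upper limit, and that every value lies between the first lower limit and the last upper limit; the function name `grouped_frequencies` is hypothetical.

```python
# Marks of the 20 students from the example above.
marks = [18, 23, 28, 29, 44, 28, 48, 33, 32, 43,
         24, 29, 32, 39, 49, 42, 27, 33, 28, 29]


def grouped_frequencies(values, start, stop, width):
    """Count values per class [lo, lo + width); values must lie in [start, stop)."""
    counts = {(lo, lo + width): 0 for lo in range(start, stop, width)}
    for v in values:
        # Lower limit of the class that contains v.
        lo = start + ((v - start) // width) * width
        counts[(lo, lo + width)] += 1
    return counts


distribution = grouped_frequencies(marks, 0, 50, 5)
for (lo, hi), f in distribution.items():
    print(f"{lo} - {hi}: {f}")
```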
LESSON – 1 STATISTICS FOR MANAGEMENT
Session – 3
Duration: 1 hr
Technical terms used in formulating a frequency distribution
a) Class limits: The class limits are the smallest and largest values in the class. Ex: In the class 0 – 10, the lowest value is zero and the highest value is 10. The two boundaries of the class are called the lower and upper limits of the class. Class limits are also called class boundaries.
b) Class intervals: The difference between the upper and lower limits of a class is known as the class interval. Ex: In the class 0 – 10, the class interval is (10 – 0) = 10. The formula to find the class interval is given below.
$i = \frac{L - S}{K}$
L = Largest value, S = Smallest value, K = the no. of classes
Ex: If the marks of 60 students in a class vary between 40 and 100 and we want to form 6 classes, the class interval would be
$i = \frac{L - S}{K} = \frac{100 - 40}{6} = \frac{60}{6} = 10$
with L = 100, S = 40 and K = 6.
Therefore, the class intervals would be 40 – 50, 50 – 60, 60 – 70, 70 – 80, 80 – 90 and 90 – 100.
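The same calculation can be written as a small helper that returns both the class width and the resulting intervals. The function name `class_intervals` and its argument order are assumptions made for illustration.

```python
def class_intervals(largest, smallest, k):
    """Return the class width (L - S) / K and the K consecutive class intervals."""
    width = (largest - smallest) / k
    intervals = [(smallest + j * width, smallest + (j + 1) * width) for j in range(k)]
    return width, intervals


# Marks between 40 and 100 divided into 6 classes.
width, intervals = class_intervals(100, 40, 6)
print(width)
print(intervals)
```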
♦ Methods of forming class-interval
a) Exclusive method (overlapping)
In this method, the upper limit of one class interval is the lower limit of the next class. This method maintains continuity of the data.
Ex:
Marks      No. of students
20 – 30    5
30 – 40    15
40 – 50    25
A student whose mark is between 20 and 29.9 will be included in the 20 – 30 class. A better way of expressing this is
Marks                                                       No. of students
20 to less than 30 (20 or more but less than 30)            5
30 to less than 40                                          15
40 to less than 50                                          25
Total Students                                              50
b) Inclusive method (non-overlapping)
Ex:
Marks      No. of students
20 – 29    5
30 – 39    15
40 – 49    25
A student whose mark is 29 is included in the 20 – 29 class interval and a student whose mark is 39 is included in the 30 – 39 class interval.
♦ Class Frequency
The number of observations falling within a class interval is called its class frequency.
Ex: A class frequency of 5 for the class 90 – 100 means that 5 students scored between 90 and 100. If we add the frequencies of all the individual classes, the total frequency gives the total number of items studied.
♦ Magnitude of class interval
The magnitude of the class interval depends on the range and the number of classes. The range is the difference between the highest and smallest values in the data series. A class interval is generally a multiple of 5, 10, 15 or 20. Sturges' formula for the number of classes is
K = 1 + 3.322 log N
where K = no. of classes and log N = logarithm (base 10) of the total no. of observations.
Ex: If the total number of observations is 100, then the number of classes could be
K = 1 + 3.322 log 100
K = 1 + 3.322 × 2
K = 1 + 6.644
K = 7.644 ≅ 8 (rounded off)
NOTE: Under this formula the number of classes can't be less than 4 or greater than 20.
♦ Class mid point or class mark
The mid value or central value of the class interval is called the mid point.
$\text{Mid point of a class} = \frac{\text{lower limit of class} + \text{upper limit of class}}{2}$
♦ Sturges formula to find size of class interval
$\text{Size of class interval } (h) = \frac{\text{Range}}{1 + 3.322 \log N}$
Ex: In a group of 50 workers, the highest wage is Rs. 250 and the lowest wage is Rs. 100 per day. Find the size of the interval.
$h = \frac{\text{Range}}{1 + 3.322 \log N} = \frac{250 - 100}{1 + 3.322 \log 50} = \frac{150}{6.644} = 22.58 \cong 23$
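Both uses of Sturges' formula can be checked numerically. This sketch assumes base-10 logarithms, as in the worked example with log 100 = 2; the function names are illustrative. Evaluating the size formula for a range of 250 − 100 = 150 and N = 50 gives a denominator of about 6.644 and hence a size of about 22.58.

```python
import math


def sturges_classes(n):
    """Number of classes K = 1 + 3.322 log10(N), before rounding."""
    return 1 + 3.322 * math.log10(n)


def sturges_width(highest, lowest, n):
    """Size of class interval h = Range / (1 + 3.322 log10(N))."""
    return (highest - lowest) / sturges_classes(n)


k = sturges_classes(100)          # about 7.644, rounded up to 8 classes
h = sturges_width(250, 100, 50)   # about 22.58

print(round(k, 3), math.ceil(k))
print(round(h, 2))
```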
Constructing a frequency distribution
The following guidelines may be considered for the construction of a frequency distribution.
a) The classes should be clearly defined and each observation must belong to one and only one class interval. The class intervals must be inclusive and non-overlapping.
b) The number of classes should be neither too large nor too small. Too few classes result in wider intervals and a loss of accuracy; too many class intervals result in complexity.
c) All intervals should be of the same width. This is preferred for easy computations.
$\text{Width of interval} = \frac{\text{Range}}{\text{Number of classes}}$
d) Open-end classes should be avoided, since they create difficulty in analysis and interpretation.
e) Intervals should be continuous throughout the distribution. This is important for a continuous distribution.
f) The lower limits of the class intervals should be simple multiples of the interval.
Ex: The weights of a sample of 30 students of a particular class are as follows. Construct a frequency distribution for the given data.
62  58  58  52  48  53  54  63  69  63
57  56  46  48  53  56  57  59  58  53
52  56  57  52  52  53  54  58  61  63
♦ Steps of construction
Step 1: Find the range of the data.
(H) Highest value = 69
(L) Lowest value = 46
Range = H – L = 69 – 46 = 23
Step 2: Find the number of class intervals using Sturges' formula.
K = 1 + 3.322 log N
K = 1 + 3.322 log 30
K = 5.90, say K = 6
∴ No. of classes = 6
Step 3: Find the width of the class interval.
$\text{Width of class interval} = \frac{\text{Range}}{\text{Number of classes}} = \frac{23}{6} = 3.833 \cong 4$
Step 4: Count the observations belonging to each class interval and assign this frequency to the corresponding class interval as follows.
Class interval    Tally bars    Frequency
46 – 50           |||           3
50 – 54           |||| |||      8
54 – 58           |||| |||      8
58 – 62           |||| |        6
62 – 66           ||||          4
66 – 70           |             1
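The four steps can be traced end to end in code. This is a sketch under the assumptions that K is rounded to the nearest whole number, the width is rounded up, and each class includes its lower limit but not its upper limit, which is how the table above places values such as 54 and 58; the variable names are illustrative.

```python
import math

# Weights of the 30 students from the example above.
weights = [62, 58, 58, 52, 48, 53, 54, 63, 69, 63,
           57, 56, 46, 48, 53, 56, 57, 59, 58, 53,
           52, 56, 57, 52, 52, 53, 54, 58, 61, 63]

# Step 1: range of the data.
highest, lowest = max(weights), min(weights)
data_range = highest - lowest

# Step 2: number of classes by Sturges' formula, rounded to a whole number.
k = round(1 + 3.322 * math.log10(len(weights)))

# Step 3: width of each class interval, rounded up.
width = math.ceil(data_range / k)

# Step 4: count the observations falling in each class [lo, lo + width).
frequencies = [0] * k
for w in weights:
    frequencies[(w - lowest) // width] += 1

for j, f in enumerate(frequencies):
    lo = lowest + j * width
    print(f"{lo} - {lo + width}: {f}")
```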
Cumulative frequency distribution
A cumulative frequency distribution indicates directly the number of units that lie above or below specified values of the class intervals. When the investigator is interested in the number of cases below a specified value, that value is taken as the upper limit of the class interval; this is known as a 'less than' cumulative frequency distribution. When the interest lies in finding the number of cases above a specified value, that value is taken as the lower limit of the class interval; this is known as a 'more than' cumulative frequency distribution. The cumulative frequency is simply the running sum of the consecutive frequencies.
Ex:
Marks      No. of students    'Less than' cumulative frequency
0 – 10     5                  5
10 – 20    3                  8
20 – 30    10                 18
30 – 40    20                 38
40 – 50    12                 50
In the above 'less than' cumulative frequency distribution, 5 students scored less than 10, 8 scored less than 20, 18 scored less than 30 and so on. Similarly, the following table shows the 'more than' cumulative frequency distribution.
Ex:
Marks      No. of students    'More than' cumulative frequency
0 – 10     5                  50
10 – 20    3                  45
20 – 30    10                 42
30 – 40    20                 32
40 – 50    12                 12
In the above 'more than' cumulative frequency distribution, 50 students scored more than 0, 45 more than 10, 42 more than 20 and so on.
Diagrammatic and Graphic Representation
The data collected can be presented graphically or pictorially for easy understanding and quick interpretation. Diagrams and graphs give visual indications of magnitudes, groupings, trends and patterns in the data. These features can be presented more simply in graphical form. Diagrams and graphs also help in comparing the variables.
Diagrammatic presentation
A diagram is a visual form for the presentation of statistical data. The term diagram covers various types of devices such as bars, circles, maps, pictorials and cartograms.
Importance of Diagrams
1. They are simple, attractive and easily understandable.
2. They give quick information.
3. They help to compare the variables.
4. Diagrams are more suitable for illustrating discrete data.
5. They have a more lasting effect on the reader's mind.
Limitations of diagrams
1. Diagrams show only approximate values.
2. Diagrams are not suitable for further analysis.
3. Some diagrams (multidimensional ones) can be understood only by experts.
4. Details cannot be provided fully.
5. They are useful only for comparison.
General Rules for drawing the diagrams
i) Each diagram should have a suitable title, placed at the top or bottom, indicating the theme the diagram is intended to show.
ii) The size of the diagram should emphasize the important characteristics of the data.
iii) An appropriate proportion should be maintained between the length and breadth of the diagram.
iv) A proper / suitable scale should be adopted for the diagram.
v) Selection of the appropriate diagram is important; a wrong selection may mislead the reader.
vi) The source of the data should be mentioned at the bottom.
vii) The diagram should be simple and attractive.
viii) The diagram should be effective rather than complex.
Some important types of diagrams
a) One-dimensional diagrams (line and bar)
b) Two-dimensional diagrams (rectangle, square, circle)
c) Three-dimensional diagrams (cube, sphere, cylinder etc.)
d) Pictogram
e) Cartogram
a) One-dimensional diagrams (line and bar)
In one-dimensional diagrams, only the length of the bars or lines is taken into account; the width of the bars is not considered. These diagrams are classified mainly as follows.
i) Line diagram
ii) Bar diagram
- Vertical bar diagram
- Horizontal bar diagram
- Multiple (compound) bar diagram
- Sub-divided (component) bar diagram
- Percentage subdivided bar diagram
i) Line diagram
This is the simplest type of one-dimensional diagram. The heights of the lines are drawn in proportion to the size of the figures, and the distance between the lines is kept uniform. Its limitations are that it is not attractive and that it cannot show more than one item of information.
Ex: Draw the line diagram for the following data.
Year                                                      2001  2002  2003  2004  2005  2006
No. of students passed in first class with distinction       5     7    12     5    13    15
[Figure: line diagram. Vertical axis: No. of students passed in FCD; horizontal axis: Year (2001 to 2006). Line heights: 5, 7, 12, 5, 13, 15.]
Indication of diagram: The highest FCD is in 2006 and the lowest FCD is in 2001 and 2004.
b) Simple bar diagram
A simple bar diagram can be drawn using horizontal or vertical bars. It is a very common diagram in business and economics.
Vertical bar diagram
Ex: The annual expenses of maintaining cars of various types are given below. Draw the vertical bar diagram. The annual expenses include fuel + maintenance + repair + assistance + insurance.
Type of the car    Expense in Rs. / Year
Maruthi Udyog      47533
Hyundai            59230
Tata Motors        63270
Source: 2005 TNS TCS Study. Published at: Vijaya Karnataka, dated: 03.08.2006
[Figure: vertical bar diagram. Vertical axis: annual expense in Rs. (30000 to 70000); bars: Maruthi Udyog 47533, Hyundai 59230, Tata Motors 63270.]
Indication of diagram
a) The annual expenses of a Maruthi Udyog car are lower than those of the other brands depicted.
b) The high annual expenses of a Tata Motors car can be seen from the diagram.
♦ Horizontal bar diagram
Ex: Production data for the world's top 10 steel makers are given below. Draw the horizontal bar diagram.
Steel maker       Prodn. in million tonnes
Arcelor Mittal    110
Nippon            32
POSCO             31
JFE               30
BAO Steel         24
US Steel          20
NUCOR             18
RIVA              18
Thyssen-krupp     17
Tangshan          16
[Figure: horizontal bar diagram "Top - 10 Steel Makers". Horizontal axis: Production of Steel (Million Tonnes), 0 to 120; one bar per steel maker.]
Source: ISSB. Published by India Today
Compound bar diagram (Multiple bar diagram)
Multiple bar diagrams are used to provide more information than a simple bar diagram. A multiple bar diagram presents more than one phenomenon and is highly useful for direct comparison. The bars are drawn side by side, and different colours, shades or hatches can be used to indicate each variable.
Ex: Draw the bar diagram for the following data. The resale values of the cars (Rs. 000) are as follows.
Year (Model) | Santro | Zen | Wagonr
2003 | 208 | 252 | 248
2004 | 240 | 278 | 274
2005 | 261 | 296 | 302
[Figure: multiple bar diagram. x-axis: Model of Car (year); y-axis: Value in Rs. (000), 0 to 350; three bars per year for Santro, Zen and Wagonr.]
Source: True value used car purchase data. Published by: Vijaya Karnataka, dated 03.08.2006

Ex: Represent the following in a suitable diagram.
Class | A | B | C
Male | 1000 | 1500 | 1500
Female | 500 | 800 | 1000
Total | 1500 | 2300 | 2500
[Figure: sub-divided bar diagram. x-axis: Class (A, B, C); y-axis: Population (in Nos.), 0 to 2500; each bar split into Male and Female components.]
Ex: Draw a suitable diagram for the following data.
Mode of investment | Investment in 2004 (Rs.) | %age | Investment in 2005 (Rs.) | %age
NSC | 25000 | 43.10 | 30000 | 45.45
MIS | 15000 | 25.86 | 10000 | 15.15
Mutual Fund | 15000 | 25.86 | 25000 | 37.88
LIC | 3000 | 5.17 | 1000 | 1.52
Total | 58000 | 100 | 66000 | 100
[Figure: percentage sub-divided bar diagram. x-axis: Year (2004, 2005); y-axis: % of Investment, 0 to 100; each bar split into NSC, MIS, Mutual Fund and LIC components.]
Two-dimensional diagram
In a two-dimensional diagram both the breadth and the length of the diagram are considered, because the area of the diagram represents the data. The important two-dimensional diagrams are
a) Rectangular diagram
b) Square diagram

Rectangular diagram
Rectangular diagrams are used to depict two or more variables and help in direct comparison. The areas of the rectangles are kept in proportion to the values. They may be of two types.
i) Percentage sub-divided rectangular diagram
ii) Sub-divided rectangular diagram
In the former case the widths of the rectangles are proportional to the values; the various components of the values are converted into percentages and the rectangles are divided according to them. The latter type is used to show related phenomena such as cost per unit, quality of production, etc.
Ex: Draw the rectangular diagram for the following data.
Item of expenditure | Family A (Rs.) | Family B (Rs.)
Provisional stores | 1000 | 2000
Education | 250 | 500
Electricity | 300 | 700
House Rent | 1500 | 2800
Vehicle Fuel | 450 | 1000
Total | 3500 | 7000
The total expenditure of each family is taken as 100 and the expenditure on individual items is expressed as a percentage of it. The widths of the two rectangles are in proportion to the total expenses of the two families, i.e. 3500 : 7000 or 1 : 2. The heights of the components are according to the percentage of expenses.
Monthly expenditure
Item of expenditure | Family A (Rs. 3500): Rs. | %age | Family B (Rs. 7000): Rs. | %age
Provisional stores | 1000 | 28.57 | 2000 | 28.57
Education | 250 | 7.14 | 500 | 7.14
Electricity | 300 | 8.57 | 700 | 10.00
House Rent | 1500 | 42.86 | 2800 | 40.00
Vehicle Fuel | 450 | 12.86 | 1000 | 14.29
Total | 3500 | 100 | 7000 | 100
[Figure: percentage sub-divided rectangular diagram. x-axis: Family (A, B), with widths in the ratio 1 : 2; y-axis: % of Expenditure, 0 to 100; components: Provisional Stores, Education, Electricity, House Rent, Vehicle Fuel.]
Square diagram
To draw square diagrams, the square root of the value of each item is taken. A suitable scale may then be used to draw the diagram, keeping the ratios between the sides of the squares.
Ex: Draw the square diagram for the following data: 4900, 2500, 1600.
Solution: The square roots of the items are 70, 50 and 40. Dividing each by 10 gives sides of 7, 5 and 4 units.
[Figure: three squares with sides 7, 5 and 4 units, representing 4900, 2500 and 1600 respectively.]
Pie diagram
A pie diagram shows the partitioning of a total into its component parts. It is used to show classes or groups of data in proportion to the whole data set. The entire pie represents all the data, while each slice represents a different class or group within the whole. The following illustration shows the construction of a pie diagram.
Ex: Revenue collections by government from petroleum products for the year 2005-2006, in Rs. crore, are as follows. Draw the pie diagram.
Source | Revenue (Rs. crore)
Customs | 9600
Excise | 49300
Corporate Tax and Dividend | 18900
States' taking | 48800
Total | 126600
Solution: The angle of each slice is (value / total) x 360°.
Item / Source | Value in crores | Angle of circle | %age
Customs | 9600 | (9600 / 126600) x 360 = 27.30° | 7.58
Excise | 49300 | (49300 / 126600) x 360 = 140.19° | 38.94
Corporate Tax and Dividend | 18900 | (18900 / 126600) x 360 = 53.74° | 14.93
States' taking | 48800 | (48800 / 126600) x 360 = 138.77° | 38.55
Total | 126600 | 360° | 100
[Figure: pie diagram with slices for Customs (7.58%), Excise (38.94%), Corporate Tax and Dividend (14.93%) and States' taking (38.55%).]
Source: India Today, 19 June 2006
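The angle and percentage columns of the pie-diagram table can be checked with a few lines of Python. This is only a minimal sketch: the function name `pie_slices` is an illustrative choice, and the figures are the revenue data from the example above.

```python
# Revenue figures (Rs. crore) from the pie-diagram example.
revenue = {
    "Customs": 9600,
    "Excise": 49300,
    "Corporate Tax and Dividend": 18900,
    "States' taking": 48800,
}


def pie_slices(values):
    """Return {label: (angle in degrees, percentage)} for each slice."""
    total = sum(values.values())
    return {
        label: (value * 360 / total, value * 100 / total)
        for label, value in values.items()
    }


slices = pie_slices(revenue)
for label, (angle, pct) in slices.items():
    print(f"{label:28s} {angle:7.2f} deg {pct:6.2f} %")
```

The angles always add up to 360° and the percentages to 100, which is a quick check on a hand-computed table.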
Choice or selection of diagram
There are many methods of depicting statistical data through diagrams. No single diagram is suited for all purposes. The choice of a diagram to suit a given set of data requires skill, knowledge and experience. Primarily, the choice depends upon the nature of the data and the purpose of the presentation. The nature of the data helps in deciding between a one-dimensional, two-dimensional or three-dimensional diagram. It is also necessary to know the audience for whom the diagram is intended. The following points are to be kept in mind when choosing a diagram.
1. For the common man, who has less knowledge of statistics, cartograms and pictograms are suited.
2. To present the components in addition to the magnitude of the values, a sub-divided bar diagram can be used.
3. When a large number of components are to be shown, a pie diagram is suitable.
Graphic presentation
A graphic presentation is a visual form of presentation. Graphs are drawn on a special type of paper known as graph paper. Common graphic representations are
a) Histogram
b) Frequency polygon
c) Cumulative frequency curve (ogive)

Advantages of graphic presentation
1. It provides an attractive and impressive view.
2. It simplifies the complexity of data.
3. It helps in direct comparison.
4. It helps in further statistical analysis.
5. It is the simplest method of presenting data.
6. It shows the trend and pattern of data.

Difference between graph and diagram
Diagram | Graph
1. Ordinary paper can be used | 1. Graph paper is required
2. It is attractive and easily understandable | 2. Needs some effort to understand
3. It is appropriate and effective to measure more variables | 3. It creates problems
4. It cannot be used for further analysis | 4. Can be used for further analysis
5. It gives comparison | 5. It shows the relationship between variables
6. Data are represented by bars and rectangles | 6. Points and lines are used to represent data
Frequency Histogram
In this type of representation the given data are plotted in the form of a series of rectangles. Class intervals are marked along the x-axis and the frequencies along the y-axis, according to a suitable scale. Unlike the bar chart, which is one-dimensional, a histogram is two-dimensional: both the length and the width are important. A histogram is constructed from a frequency distribution of grouped data, where the height of each rectangle is proportional to the respective frequency and the width represents the class interval. Each rectangle is joined to the next, and a blank space between rectangles means that the category is empty and there are no values in that class interval.
Ex: Construct a histogram for the following data.
For convenience, the frequency distribution is presented along with the mid-point of each class interval, where the mid-point is simply the average of the lower and upper boundaries of the class interval.
Marks obtained (x) | No. of students (f) | Mid point
15 – 25 | 5 | 20
25 – 35 | 3 | 30
35 – 45 | 7 | 40
45 – 55 | 5 | 50
55 – 65 | 3 | 60
65 – 75 | 7 | 70
Total | 30 |
[Figure: histogram. x-axis: Class Interval (Marks), 15 to 75; y-axis: Frequency (No. of students), 0 to 7.]
Frequency polygon
A frequency polygon is a line chart of a frequency distribution in which either the values of a discrete variable or the mid-points of the class intervals are plotted against the frequencies, and the plotted points are joined by straight lines. Since the frequencies do not start or end at zero, the diagram as such would not touch the horizontal axis. However, the area under the entire curve is the same as that of the histogram, which is 100%, so the curve must be 'enclosed'. The first mid-point is joined to a 'fictitious' preceding mid-point whose frequency is zero, so that the beginning of the curve touches the horizontal axis, and the last mid-point is joined to a 'fictitious' succeeding mid-point whose frequency is also zero, so that the curve ends on the horizontal axis. This enclosed diagram is known as a 'frequency polygon'.
Ex: Construct a frequency polygon for the following data.
Marks (CI) | Frequency (f) | Mid-point
15 – 25 | 5 | 20
25 – 35 | 3 | 30
35 – 45 | 7 | 40
45 – 55 | 5 | 50
55 – 65 | 3 | 60
65 – 75 | 7 | 70
[Figure: a frequency polygon. x-axis: Mid point (x), 0 to 100; y-axis: Frequency, 0 to 10.]
Cumulative frequency curve (ogive)
Ogives are graphic representations of a cumulative frequency distribution. They are classified as 'less than' and 'more than' ogives. For a 'less than' ogive, the cumulative frequencies are plotted against the upper boundaries of their respective class intervals. For a 'more than' ogive, the cumulative frequencies are plotted against the lower boundaries of their respective class intervals. Ogives are used for comparison purposes: several ogives can be drawn on the same grid in different colours for easier visualisation and differentiation.
Ex:
Marks (CI) | Frequency (f) | Mid-point | Cum. Freq. Less than | Cum. Freq. More than
15 – 25 | 5 | 20 | 5 | 30
25 – 35 | 3 | 30 | 8 | 25
35 – 45 | 7 | 40 | 15 | 22
45 – 55 | 5 | 50 | 20 | 15
55 – 65 | 3 | 60 | 23 | 10
65 – 75 | 7 | 70 | 30 | 7
[Figure: 'less than' ogive. x-axis: Upper Boundary (CI); y-axis: Less than Cumulative Frequency, 5 to 30.]
[Figure: 'more than' ogive. x-axis: Lower Boundary (CI); y-axis: More than Cumulative Frequency, 10 to 35.]
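The two cumulative-frequency columns of the ogive table can be derived mechanically from the class frequencies. The sketch below is illustrative only; the variable names are assumptions, and the data are those of the example above.

```python
from itertools import accumulate

# Class intervals and frequencies from the ogive example.
classes = [(15, 25), (25, 35), (35, 45), (45, 55), (55, 65), (65, 75)]
freqs = [5, 3, 7, 5, 3, 7]

# 'Less than' cumulative frequency: running total from the first class.
less_than = list(accumulate(freqs))

# 'More than' cumulative frequency: running total from the last class.
more_than = list(accumulate(reversed(freqs)))[::-1]

# Points to plot: 'less than' against upper boundaries,
# 'more than' against lower boundaries.
less_points = [(upper, cf) for (_, upper), cf in zip(classes, less_than)]
more_points = [(lower, cf) for (lower, _), cf in zip(classes, more_than)]

print(less_points)
print(more_points)
```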
Session – 4 Measures of Central Tendency
Classified statistical data may be described as distributed around some value, called the central value or average in some sense. This value is the most representative value of the entire data. Thus, one of the most important objectives of statistical analysis is to determine a single value that represents the characteristics of the entire raw data. This single value is called the 'central value' or an 'average'. It is the point around which all other values of the data cluster. It is therefore known as a measure of location, and since it is located at a central point nearest to the other values of the data, it is also called a measure of central tendency. Different methods give different central values, and these are referred to as the measures of central tendency. The common measures of central tendency are a) Mean b) Median c) Mode. These values are very useful not only for presenting an overall picture of the entire data, but also for comparing two or more sets of data.
Average
Definition
Average is a value which is typical or representative of a set of data. - Murray R. Spiegel
Average is an attempt to find one single figure to describe the whole of the figures. - Clark & Schkade
From the above definitions it is clear that an average is a typical value of the entire data and is a measure of central tendency.
Functions of an average
• It represents complex or large data.
• It facilitates comparative study of two variables.
• It helps to study a population from sample data.
• It helps in decision making.
• It represents a series of data by a single value.
• It helps to establish mathematical relationships.
Characteristics of a typical average
• It should be rigidly defined and easily understandable.
• It should be simple to compute and in the form of a mathematical formula.
• It should be based on all the items in the data.
• It should not be unduly influenced by any single item.
• It should be capable of further mathematical treatment.
• It should have sampling stability.
Types of average
Averages, or measures of central tendency, are of the following types.
1. Mathematical averages
a. Arithmetic mean
i. Simple mean
ii. Weighted mean
b. Geometric mean
c. Harmonic mean
2. Positional averages
a. Median
b. Mode

Arithmetic mean
The arithmetic mean is also called the arithmetic average. It is the most commonly used measure of central tendency. The arithmetic average of a series is the value obtained by dividing the total value of the items by their number. Arithmetic averages are of two types:
a. Simple arithmetic average
b. Weighted arithmetic average

Simple arithmetic average (Mean)
The arithmetic mean is sometimes referred to simply as the 'mean'. Ex: mean income, mean expenses, mean marks, etc. Unlike other averages, the mean has to be computed by considering each and every observation in the series. Hence the mean cannot be found by inspection or observation of the items. The simple arithmetic mean is equal to the sum of the values of the variable divided by the number of observations in the sample.
Let x be a variable which takes the values x1, x2, x3, ..., xn over n items. Then the arithmetic mean (or simply the mean) of x, denoted by a bar over the variable, is given by
\[ \bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} = \frac{\sum x_i}{n} \]
where Σ is the Greek letter sigma and denotes the summation of all x_i values.
For direct observations of individual items, the arithmetic mean can be computed by the following two methods.
a. Direct method
b. Short-cut method
The direct method uses the above equation; the steps of the short-cut method are illustrated in the subsequent topic.
Ex (direct method):
1. Calculate the mean for the following data. The marks obtained by 6 students are 20, 15, 23, 22, 25, 20.
\[ \text{Mean marks} = \bar{x} = \frac{20 + 15 + 23 + 22 + 25 + 20}{6} = \frac{125}{6} = 20.83 \]
2. The income of a departmental store for six months is given below. Find the mean income of the store.
Month | Jan | Feb | Mar | Apr | May | June
Income (Rs.) | 25000 | 30000 | 45000 | 20000 | 25000 | 20000
n = total number of items (observations) = 6
Total income = Σx_i = 25000 + 30000 + 45000 + 20000 + 25000 + 20000 = 165000
\[ \text{Mean income} = \frac{\sum x_i}{n} = \frac{165000}{6} = \text{Rs. } 27500 \]
The above example shows that when the data set is large, or the figures in it are large, the computation required to get the mean is heavy. To reduce the computation one can use the short-cut method, illustrated below.
Shortcut method
The steps of this method are given below.
Step 1: Assume any one value as the mean; this is called the arbitrary average (A).
Step 2: Find the difference (deviation) of each value from the arbitrary average: d = x_i – A.
Step 3: Add all the deviations to get Σd.
Step 4: Compute the mean using the following equation.
\[ \bar{x} = A + \frac{\sum d}{n} \]
where n = total number of observations, Σd = total of the deviations and A = arbitrary mean.
Example: Find the mean marks obtained by the students for the following data.
20, 25, 20, 22, 20, 21, 23, 25, 22, 18
Let A = 20 and n = 10.
Marks (x) | d = (x – 20)
20 | 0
25 | 5
20 | 0
22 | 2
20 | 0
21 | 1
23 | 3
25 | 5
22 | 2
18 | -2
Total | Σd = 16
\[ \bar{x} = A + \frac{\sum d}{n} = 20 + \frac{16}{10} = 20 + 1.6 = 21.6 \]
Mean marks = 21.6
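The direct and short-cut methods always give the same mean, whatever value is assumed for A. A minimal Python sketch, using the marks from the example above (the function names are illustrative, not part of the text):

```python
marks = [20, 25, 20, 22, 20, 21, 23, 25, 22, 18]


def mean_direct(values):
    """Direct method: sum of the values divided by their number."""
    return sum(values) / len(values)


def mean_shortcut(values, assumed_mean):
    """Short-cut method: A + (sum of deviations from A) / n."""
    deviations = [x - assumed_mean for x in values]
    return assumed_mean + sum(deviations) / len(values)


print(mean_direct(marks))
print(mean_shortcut(marks, assumed_mean=20))
print(mean_shortcut(marks, assumed_mean=23))  # any other A gives the same mean
```

Choosing A close to the true mean keeps the deviations small, which is what makes the method convenient for hand calculation.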
1. Mathematical characteristics of mean
a. The algebraic sum of the deviations of all observations from their arithmetic mean is zero, i.e. \( \sum (x_i - \bar{x}) = 0 \).
b. The sum of the squared deviations of the items from the mean is a minimum, that is, it is less than the sum of the squared deviations of the items from any other value: \( \sum d^2 = \text{minimum} \).
c. Since \( \bar{x} = \frac{\sum x}{n} \), if any two of the three values \( \bar{x} \), Σx and n are given, the third can be computed.
d. If all the items of a set are increased or decreased by a constant value, the arithmetic mean also increases or decreases by the same constant.
2. Weighted arithmetic mean
The weighted mean is computed by considering the relative importance of each value to the total. The simple arithmetic mean gives equal importance to all the items of a distribution. In certain cases the relative importance of the items is not the same. To reflect this, a weight may be given to each value, so that the weight represents the relative importance of the item. Let x1, x2, x3, ..., xn be the values and w1, w2, w3, ..., wn the respective weights assigned. Then the weighted mean is given by
\[ \bar{x}_w = \frac{x_1 w_1 + x_2 w_2 + x_3 w_3 + \cdots + x_n w_n}{w_1 + w_2 + w_3 + \cdots + w_n} = \frac{\sum xw}{\sum w} \]
i.e. the weighted average is the ratio of the sum of the products of the values and their respective weights to the sum of the weights.
Ex: Compute the simple and weighted arithmetic means and comment on them.
Designation | Monthly salary in Rs. (x) | Strength of cadre (w) | xw
General Manager | 25000 | 10 | 250000
Managers | 19000 | 20 | 380000
Supervisors | 14000 | 10 | 140000
Office Assistant | 10000 | 50 | 500000
Helpers | 8000 | 25 | 200000
Total (N = 5) | Σx = 76000 | Σw = 115 | Σxw = 1470000
a. Simple arithmetic mean:
\[ \frac{\sum x}{N} = \frac{76000}{5} = \text{Rs. } 15200 \]
b. Weighted arithmetic mean:
\[ \frac{\sum xw}{\sum w} = \frac{1470000}{115} = \text{Rs. } 12782.61 \]
In this example the simple arithmetic mean does not take into account the number of staff in each salary range; it gives equal importance to every cadre. The salaries of the General Manager and the Managers have inflated the value of the simple mean. The weighted mean gives importance to the number of persons in each salary range.
Ex: Comment on the performance of the students of the two universities given below.
Course | Bombay: % of pass (x) | No. of students in 000 (w) | wx | Madras: % of pass (x) | No. of students in 000 (w) | wx
MBA | 71 | 3 | 213 | 81 | 5 | 405
MCA | 83 | 2 | 166 | 76 | 3 | 228
MA | 73 | 5 | 365 | 58 | 3 | 174
M.Sc. | 75 | 2 | 150 | 76 | 1 | 76
M.Com. | 70 | 2 | 140 | 81 | 2 | 162
Total (Σ) | Σx = 372 | Σw = 14 | Σwx = 1034 | Σx = 372 | Σw = 14 | Σwx = 1045
a. Since Σx is the same, the simple arithmetic average is the same for both universities:
\[ \frac{\sum x}{N} = \frac{372}{5} = 74.4 \]
b. Weighted mean for Bombay University:
\[ \frac{\sum wx}{\sum w} = \frac{1034}{14} = 73.86 \]
c. Weighted mean for Madras University:
\[ \frac{\sum wx}{\sum w} = \frac{1045}{14} = 74.64 \]
Comment: The performance of Madras University students is better than that of Bombay University students.
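The two weighted-mean examples can be reproduced with a short Python sketch. The helper name `weighted_mean` is an illustrative choice; the data are taken from the tables above.

```python
def weighted_mean(values, weights):
    """Sum of value x weight divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(x * w for x, w in zip(values, weights)) / sum(weights)


# Salary example: monthly salary (x) and strength of cadre (w).
salaries = [25000, 19000, 14000, 10000, 8000]
strength = [10, 20, 10, 50, 25]
simple = sum(salaries) / len(salaries)
weighted = weighted_mean(salaries, strength)
print(simple, round(weighted, 2))

# University example: % of pass (x) and number of students in thousands (w).
bombay = weighted_mean([71, 83, 73, 75, 70], [3, 2, 5, 2, 2])
madras = weighted_mean([81, 76, 58, 76, 81], [5, 3, 3, 1, 2])
print(round(bombay, 2), round(madras, 2))
```

When all weights are equal the weighted mean reduces to the simple mean, which is why the two only differ when the group sizes differ.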
Discrete Series
In a discrete series, each value is multiplied by its frequency, the products are totalled, and this total is divided by the total of the frequencies to obtain the arithmetic mean. This can be done by the direct method or by the short-cut method.
Ex: Calculate the mean for the following data.
Value (x) | 1 | 2 | 3 | 4 | 5
Frequency (f) | 10 | 15 | 10 | 9 | 5
Steps:
1. Multiply each value by its frequency and add the products to get Σfx.
2. Add all the frequencies (Σf = N).
3. Use the following formula to get the mean value.
\[ \bar{x} = \frac{\sum fx}{\sum f} = \frac{\sum fx}{N} \]
Solution: By direct method
Value (x) | Frequency (f) | fx
1 | 10 | 10
2 | 15 | 30
3 | 10 | 30
4 | 9 | 36
5 | 5 | 25
Total | Σf = 49 | Σfx = 131
\[ \bar{x} = \frac{\sum fx}{N} = \frac{131}{49} = 2.67 \]
By short-cut method
Let A = 3 (assumed mean = 3).
Value (x) | Frequency (f) | d = (x – A) | fd
1 | 10 | -2 | -20
2 | 15 | -1 | -15
3 | 10 | 0 | 0
4 | 9 | 1 | 9
5 | 5 | 2 | 10
Total | Σf = 49 | | Σfd = -16
\[ \bar{x} = A + \frac{\sum fd}{N} = 3 + \frac{-16}{49} = 2.67 \]
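For a discrete series the same comparison of the two methods can be sketched in Python. The function names are illustrative assumptions; the values and frequencies are those of the example above.

```python
values = [1, 2, 3, 4, 5]
freqs = [10, 15, 10, 9, 5]


def discrete_mean(values, freqs):
    """Direct method: sum(f * x) / sum(f)."""
    return sum(f * x for x, f in zip(values, freqs)) / sum(freqs)


def discrete_mean_shortcut(values, freqs, assumed_mean):
    """Short-cut method: A + sum(f * d) / N, where d = x - A."""
    n = sum(freqs)
    sum_fd = sum(f * (x - assumed_mean) for x, f in zip(values, freqs))
    return assumed_mean + sum_fd / n


print(round(discrete_mean(values, freqs), 2))
print(round(discrete_mean_shortcut(values, freqs, assumed_mean=3), 2))
```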
Continuous series
In a continuous frequency distribution the individual value of each item is not known. The mid-points of the class intervals are therefore written down to replace the class intervals. In a continuous series the mean can be calculated by any of the following methods.
a. Direct method
b. Short-cut method
c. Step deviation method

a. Direct method
The steps of this method are as follows.
1. Find the mid value of each class. Ex: for the class interval 20 – 30 the mid value is (20 + 30) / 2 = 50 / 2 = 25. The mid value is denoted by 'm'.
2. Multiply the mid value 'm' by the frequency 'f' of each class and sum up to get Σfm.
3. Use the following formula, where N = Σf, to get the mean value.
\[ \bar{x} = \frac{\sum fm}{N} \]
Ex: Compute the mean for the following data.
Age group (CI) | No. of persons (f) | Mid point (m) | fm
0 – 10 | 5 | 5 | 25
10 – 20 | 15 | 15 | 225
20 – 30 | 25 | 25 | 625
30 – 40 | 8 | 35 | 280
40 – 50 | 7 | 45 | 315
Total | Σf = 60 = N | | Σfm = 1470
\[ \text{Mean age} = \bar{x} = \frac{\sum fm}{N} = \frac{1470}{60} = 24.5 \]
b. Short-cut method
The steps of this method are described below.
1. Find the mid value 'm' of each class.
2. Assume any one of the mid values as the arbitrary average (A).
3. Find the deviation d = m – A of each mid value from A.
4. Multiply each deviation 'd' by the frequency 'f' of its class and sum up to get Σfd.
5. Find the mean value using the formula
\[ \bar{x} = A + \frac{\sum fd}{N} \]
Ex: Find the mean age of the patients visiting a hospital on a particular day, using the following data.
Age group (CI) | No. of patients (f) | Mid value (m) | d = (m – 25) | fd
0 – 10 | 5 | 5 | -20 | -100
10 – 20 | 15 | 15 | -10 | -150
20 – 30 | 25 | 25 | 0 | 0
30 – 40 | 8 | 35 | 10 | 80
40 – 50 | 7 | 45 | 20 | 140
Total | Σf = 60 = N | | | Σfd = -30
Let the arbitrary average A = 25.
\[ \text{Mean age} = \bar{x} = A + \frac{\sum fd}{N} = 25 + \frac{-30}{60} = 25 - \frac{1}{2} = 24.5 \]

c. Step deviation method
In this method, after the deviations from the arbitrary mean are found, they are divided by a common factor. Scaling down the deviations by a 'step' reduces the calculation to a minimum.
Steps of the step deviation method
1. Find the mid value 'm' of each class.
2. Select the arbitrary mean 'A'.
3. Find the deviation d = m – A of each mid value from 'A'.
4. Divide each deviation 'd' by a common factor C to get d' = d / C.
5. Multiply d' of each class by the frequency 'f' to get fd', and sum up over all classes to get Σfd'.
6. Calculate the mean value using the formula (where C is the common factor)
\[ \bar{x} = A + \frac{\sum fd'}{N} \times C \]
Ex: Find the mean age for the following data.
Age (CI) | No. of persons (f) | Mid value (m) | d = (m – 25) | d' = d / 10 | fd'
0 – 10 | 5 | 5 | -20 | -2 | -10
10 – 20 | 15 | 15 | -10 | -1 | -15
20 – 30 | 25 | 25 | 0 | 0 | 0
30 – 40 | 8 | 35 | 10 | 1 | 8
40 – 50 | 7 | 45 | 20 | 2 | 14
Total | Σf = 60 = N | | | | Σfd' = -3
Let A = 25 and C = 10.
\[ \bar{x} = A + \frac{\sum fd'}{N} \times C = 25 + \frac{-3}{60} \times 10 = 25 - \frac{1}{2} = 24.5 \]
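All three methods for a continuous series lead to the same mean of 24.5 for the age data. The following Python sketch shows the direct and step deviation methods side by side; the function names are illustrative, and the short-cut method corresponds to calling the second function with a step of 1.

```python
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
freqs = [5, 15, 25, 8, 7]


def grouped_mean(classes, freqs):
    """Direct method: sum(f * m) / N, where m is the class mid value."""
    mids = [(lower + upper) / 2 for lower, upper in classes]
    return sum(f * m for m, f in zip(mids, freqs)) / sum(freqs)


def grouped_mean_step_deviation(classes, freqs, assumed_mean, step):
    """Step deviation method: A + (sum(f * d') / N) * C, d' = (m - A) / C."""
    n = sum(freqs)
    scaled = [((lower + upper) / 2 - assumed_mean) / step
              for lower, upper in classes]
    sum_fd = sum(f * d for d, f in zip(scaled, freqs))
    return assumed_mean + sum_fd / n * step


print(grouped_mean(classes, freqs))
print(grouped_mean_step_deviation(classes, freqs, assumed_mean=25, step=10))
print(grouped_mean_step_deviation(classes, freqs, assumed_mean=25, step=1))
```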
Session – 5 Measures of Central Tendency

Combined Mean
The combined arithmetic mean can be computed if we know the mean and the number of items in each group of the data. Let \( \bar{x}_1 \) and \( \bar{x}_2 \) be the means of the first and second groups of data, containing N1 and N2 items respectively. Then the combined mean is
\[ \bar{x}_{12} = \frac{N_1 \bar{x}_1 + N_2 \bar{x}_2}{N_1 + N_2} \]
If there are 3 groups, then
\[ \bar{x}_{123} = \frac{N_1 \bar{x}_1 + N_2 \bar{x}_2 + N_3 \bar{x}_3}{N_1 + N_2 + N_3} \]
Ex - 1: Find the mean for the entire group of workers from the following data.
 | Group – 1 | Group – 2
Mean wages (Rs.) | 75 | 60
No. of workers | 1000 | 1500
Given data: N1 = 1000, N2 = 1500, \( \bar{x}_1 = 75 \) and \( \bar{x}_2 = 60 \).
\[ \bar{x}_{12} = \frac{N_1 \bar{x}_1 + N_2 \bar{x}_2}{N_1 + N_2} = \frac{1000 \times 75 + 1500 \times 60}{1000 + 1500} = \text{Rs. } 66 \]
Ex - 2: Compute the mean for the entire group.
Medical examination | No. examined | Mean weight (pounds)
A | 50 | 113
B | 60 | 120
C | 90 | 115
Combined mean (grouped mean weight):
\[ \bar{x}_{123} = \frac{N_1 \bar{x}_1 + N_2 \bar{x}_2 + N_3 \bar{x}_3}{N_1 + N_2 + N_3} = \frac{50 \times 113 + 60 \times 120 + 90 \times 115}{50 + 60 + 90} = 116 \text{ pounds} \]

Merits of Arithmetic Mean
1. It is simple and easy to compute.
2. It is rigidly defined.
3. It can be used for further calculation.
4. It is based on all observations in the series.
5. It helps in direct comparison.
6. It is a more stable measure of central tendency (ideal average).

Limitations / Demerits of Mean
1. It is unduly affected by extreme items.
2. It is sometimes unrealistic.
3. It may lead to confusion.
4. It is suitable only for quantitative data (for variables).
5. It cannot be located by a graphical method or by observation.
Geometric Mean (GM)
The GM is the nth root of the product of the n quantities of the series. It is obtained by multiplying the values of the items together and extracting the root of the product corresponding to the number of items. Thus the square root of the product of two items, and the cube root of the product of three items, is their geometric mean. The geometric mean is never larger than the arithmetic mean. If there are zeros or negative numbers in the series, the geometric mean cannot be used. Logarithms can be used to find the geometric mean, to avoid large numbers and to save time.
In business management, problems often arise relating to the average percentage rate of change over a period of time. In such cases the arithmetic mean is not an appropriate average, and the geometric mean is used instead. GMs are also highly useful in the construction of index numbers.
\[ GM = \sqrt[n]{x_1 \times x_2 \times \cdots \times x_n} \]
When the number of items in the series is larger than 3, computing the GM directly is difficult. To overcome this, the logarithm of each item is obtained. The logs of all the values are added up and divided by the number of items; the antilog of the quotient obtained is the required GM.
\[ GM = \text{Antilog}\left( \frac{\log x_1 + \log x_2 + \cdots + \log x_n}{n} \right) \]
Merits of GM
a. It is based on all the observations in the series.
b. It is rigidly defined.
c. It is best suited for averaging ratios and percentages.
d. It is less affected by extreme values.
e. It is useful for studying social and economic data.
Demerits of GM
a. It is not simple to understand.
b. It requires computational skill.
c. GM cannot be computed if any item is zero or negative.
d. It has restricted application.
Ex - 1:
a. Find the GM of the data 2, 4, 8.
x1 = 2, x2 = 4, x3 = 8, n = 3
\[ GM = \sqrt[3]{x_1 \times x_2 \times x_3} = \sqrt[3]{2 \times 4 \times 8} = \sqrt[3]{64} = 4 \]
b. Find the GM of the data 2, 4, 8 using logarithms.
Data: x1 = 2, x2 = 4, x3 = 8, N = 3
\[ GM = \text{Antilog}\left( \frac{\sum_{i=1}^{N} \log x_i}{N} \right) \]
x | log x
2 | 0.301
4 | 0.602
8 | 0.903
Total | Σ log x = 1.806
\[ GM = \text{Antilog}\left( \frac{1.806}{3} \right) = \text{Antilog}(0.6020) = 3.9997 \cong 4 \]
Ex - 2: Compared with the previous year, overhead (OH) expenses went up by 32% in 2003, then increased by 40% in the next year and by 50% in the following year. Calculate the average increase in overhead expenses.
Let the OH expenses of the preceding year be 100% in each case.
Year | OH Expenses (x) | log x
2002 | Base year | –
2003 | 132 | 2.121
2004 | 140 | 2.146
2005 | 150 | 2.176
Total | | Σ log x = 6.443
\[ GM = \text{Antilog}\left( \frac{\sum \log x}{N} \right) = \text{Antilog}\left( \frac{6.443}{3} \right) = \text{Antilog}(2.1477) \approx 140.5 \]
The average increase in overhead expenses is therefore about 40.5% per year.

GM for discrete series
With the usual notation, the GM for a series of N values is
\[ GM = \text{Antilog}\left( \frac{\sum_{i=1}^{N} \log x_i}{N} \right) \]
and, when each value x occurs with frequency f, so that N = Σf,
\[ GM = \text{Antilog}\left( \frac{\sum f \log x}{N} \right) \]
Ex - 3: Consider the following time series of monthly sales of ABC company for 4 months. Find the average rate of change of monthly sales.
Month | Sales (Rs)
I | 10000
II | 8000
III | 12000
IV | 15000
Solution: Let the sales of the preceding month be 100% in each case.
Month | Sales (Rs) | Increase / decrease %age | Conversion (x) | log x
I | 10000 | Base (100%) | – | –
II | 8000 | – 20% | 80 | 1.903
III | 12000 | + 50% | 150 | 2.176
IV | 15000 | + 25% | 125 | 2.097
Total | | | | Σ log x = 6.176
\[ GM = \text{Antilog}\left( \frac{6.176}{3} \right) = \text{Antilog}(2.0587) \approx 114.47 \]
Average rate of change of sales = 114.47 – 100 = 14.47% per month.
Ex - 4: Find the GM for the following data.
Marks (x) | No. of students (f) | log x | f log x
130 | 3 | 2.113 | 6.339
135 | 4 | 2.130 | 8.520
140 | 6 | 2.146 | 12.876
145 | 6 | 2.161 | 12.966
150 | 3 | 2.176 | 6.528
Total | Σf = N = 22 | | Σ f log x = 47.23
\[ GM = \text{Antilog}\left( \frac{\sum f \log x}{N} \right) = \text{Antilog}\left( \frac{47.23}{22} \right) \approx 140.2 \]

Geometric Mean for continuous series
Steps:
1. Find the mid value m of each class and take the log of each mid value.
2. Multiply log m by the frequency 'f' of each class to get f log m, and sum up to obtain Σ f log m.
3. Divide Σ f log m by N and take the antilog to get the GM.
Ex: Find the GM for the data given below.
Yield of wheat in MT | No. of farms (f) | Mid value (m) | log m | f log m
1 – 10 | 3 | 5.5 | 0.740 | 2.220
11 – 20 | 16 | 15.5 | 1.190 | 19.040
21 – 30 | 26 | 25.5 | 1.406 | 36.556
31 – 40 | 31 | 35.5 | 1.550 | 48.050
41 – 50 | 16 | 45.5 | 1.658 | 26.528
51 – 60 | 8 | 55.5 | 1.744 | 13.954
Total | Σf = N = 100 | | | Σ f log m = 146.348
\[ GM = \text{Antilog}\left( \frac{\sum f \log m}{N} \right) = \text{Antilog}\left( \frac{146.348}{100} \right) = 29.07 \]
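The logarithm method for the GM translates directly into code. The sketch below is a minimal illustration (the function name and its optional `freqs` argument are assumptions made for this example); it works for individual observations and, with frequencies, for discrete series or class mid values.

```python
import math


def geometric_mean(values, freqs=None):
    """GM = antilog(sum(f * log10(x)) / N); all values must be positive."""
    if freqs is None:
        freqs = [1] * len(values)
    if any(x <= 0 for x in values):
        raise ValueError("GM is not defined for zero or negative values")
    n = sum(freqs)
    log_sum = sum(f * math.log10(x) for x, f in zip(values, freqs))
    return 10 ** (log_sum / n)


print(geometric_mean([2, 4, 8]))          # individual observations
print(geometric_mean([80, 150, 125]))     # month-on-month sales relatives
print(geometric_mean([5.5, 15.5, 25.5, 35.5, 45.5, 55.5],
                     [3, 16, 26, 31, 16, 8]))  # class mid values with f
```

Because the code uses full-precision logarithms rather than three-figure log tables, its results can differ from hand calculations in the last decimal place.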
Harmonic Mean
The harmonic mean is the total number of items divided by the sum of the reciprocals of the values of the variable. It is a specialised average used for problems involving rates that vary with time. Ex: speed in km/hr, min/day, price/unit. The Harmonic Mean (HM) is suitable only when the time factor is variable and the act being performed remains constant.
\[ HM = \frac{N}{\sum \frac{1}{x}} \]
Merits of Harmonic Mean
1. It is based on all observations.
2. It is rigidly defined.
3. It is suitable for series having wide dispersion.
4. It is suitable for further mathematical treatment.
Demerits of Harmonic Mean
1. It is not easy to compute.
2. It cannot be used when one of the items is zero.
3. It cannot represent a distribution.
Ex:
1. The daily incomes of 5 families in a rural village are given below. Compute the HM.
Family | Income (x) | Reciprocal (1/x)
1 | 85 | 0.0118
2 | 90 | 0.0111
3 | 70 | 0.0143
4 | 50 | 0.0200
5 | 60 | 0.0167
Total | | Σ(1/x) = 0.0738
\[ HM = \frac{N}{\sum \frac{1}{x}} = \frac{5}{0.0738} = 67.72 \]
2. A man travels by car for 3 days, covering 480 km each day. On the first day he drives for 10 hrs at 48 KMPH, on the second day for 12 hrs at 40 KMPH, and on the third day for 15 hrs at 32 KMPH. Compute the HM and the weighted mean and compare them.
Data: 10 hrs @ 48 KMPH, 12 hrs @ 40 KMPH, 15 hrs @ 32 KMPH
Harmonic Mean
x | 1/x
48 | 0.0208
40 | 0.0250
32 | 0.0313
Total | Σ(1/x) = 0.0771
\[ HM = \frac{N}{\sum \frac{1}{x}} = \frac{3}{0.0771} = 38.92 \]
Weighted Mean
w | x | wx
10 | 48 | 480
12 | 40 | 480
15 | 32 | 480
Σw = 37 | | Σwx = 1440
\[ \bar{x}_w = \frac{\sum wx}{\sum w} = \frac{1440}{37} = 38.92 \]
The HM and the weighted mean are the same.
3. Find the HM for the following data.
Class (CI) | Frequency (f) | Mid point (m) | Reciprocal (1/m) | f (1/m)
0 – 10 | 5 | 5 | 0.2000 | 1.0000
10 – 20 | 15 | 15 | 0.0667 | 1.0000
20 – 30 | 25 | 25 | 0.0400 | 1.0000
30 – 40 | 8 | 35 | 0.0286 | 0.2286
40 – 50 | 7 | 45 | 0.0222 | 0.1556
Total | Σf = 60 | | | Σ f (1/m) = 3.3842
\[ HM = \frac{N}{\sum f \frac{1}{m}} = \frac{60}{3.3842} = 17.73 \]
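The three harmonic-mean examples can be checked with one small function. This is a hedged sketch: `harmonic_mean` and its optional `freqs` argument are names chosen for illustration, and exact reciprocals are used instead of rounded table values.

```python
def harmonic_mean(values, freqs=None):
    """HM = N / sum(f / x); no value may be zero."""
    if freqs is None:
        freqs = [1] * len(values)
    if any(x == 0 for x in values):
        raise ValueError("HM cannot be computed when an item is zero")
    return sum(freqs) / sum(f / x for x, f in zip(values, freqs))


# 1. Daily incomes of five families.
print(round(harmonic_mean([85, 90, 70, 50, 60]), 2))

# 2. Equal distances at three speeds: the HM equals the time-weighted mean.
hm_speed = harmonic_mean([48, 40, 32])
weighted_speed = (10 * 48 + 12 * 40 + 15 * 32) / (10 + 12 + 15)
print(round(hm_speed, 2), round(weighted_speed, 2))

# 3. Grouped data: class mid values with their frequencies.
print(round(harmonic_mean([5, 15, 25, 35, 45], [5, 15, 25, 8, 7]), 2))
```

The second case illustrates why the HM is the right average for speeds over equal distances: it is the total distance divided by the total time.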
Relationship between Mean, Geometric Mean and Harmonic Mean
1. If all the items in a series are the same, the arithmetic mean, geometric mean and harmonic mean are equal, i.e. \( \bar{x} = GM = HM \).
2. If the sizes of the (positive) items vary, the mean will be greater than the GM and the GM will be greater than the HM. This is because the geometric mean gives larger weight to smaller items and the HM gives the largest weight to the smallest items. Hence \( \bar{x} > GM > HM \).
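This ordering can be observed numerically with Python's standard `statistics` module, which provides `mean`, `geometric_mean` and `harmonic_mean` (the geometric mean requires Python 3.8 or later). The wrapper `three_means` is only an illustrative helper.

```python
from statistics import geometric_mean, harmonic_mean, mean


def three_means(values):
    """Return (arithmetic mean, geometric mean, harmonic mean)."""
    return mean(values), geometric_mean(values), harmonic_mean(values)


# Unequal positive items: AM > GM > HM.
am, gm, hm = three_means([2, 4, 8])
print(am, gm, hm)

# Identical items: all three averages coincide.
print(three_means([5, 5, 5]))
```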
Median
The median is the value of that item in a series which divides the array into two equal parts, one consisting of all the values less than it and the other consisting of all the values more than it. The median is a positional average. The number of items below it is equal to the number of items above it, so it occupies the central position. Thus the median is defined as the mid value of the variates. If the values are arranged in ascending or descending order of magnitude, the median is the middle value if the number of variates is odd, and the average of the two middle values if the number of variates is even.
Ex: If 9 students stand in order of their heights, the 5th student from either side is the one whose height is the median height of the group.
Thus the median of a group is given by
\[ \text{Median} = \left( \frac{N + 1}{2} \right)^{th} \text{ item} \]
Ex
1. Find the median for the following data.
22, 20, 25, 31, 26, 24, 23
Arrange the given data in the form of an array (either in ascending or descending order).
20, 22, 23, 24, 25, 26, 31
\[ \text{Median} = \left( \frac{N + 1}{2} \right)^{th} \text{ item} = \frac{7 + 1}{2} = 4^{th} \text{ item} \]
The 4th item is 24, so Median = 24.
2. Find the median for the following data.
20, 21, 22, 24, 28, 32
\[ \text{Median} = \left( \frac{N + 1}{2} \right)^{th} \text{ item} = \frac{6 + 1}{2} = 3.5^{th} \text{ item} \]
This item lies between the 3rd and 4th items, whose values are 22 and 24. The median is the mean of these two values.
\[ \text{Median} = \frac{22 + 24}{2} = 23 \]

Discrete Series – Median
In a discrete series the values are already in the form of an array and the frequencies are recorded against each value. To determine the size of the ((N + 1) / 2)th item, a separate column is prepared for the cumulative frequencies. The median item is first located with reference to the first cumulative frequency that covers it. The value against that cumulative frequency is then the median.
Ex: Find the median for the students’ marks. Obtained in statistics Marks (x)
No. of students (f)
Cumulative frequency
10
5
5
20
5
10
30
3
13
40
15
28
50
30
58
60
10
68
Just above 34 is 58. Against 58 c.f. the value is 50 which is median value
N = 68
Ex: In a class of 15 students, 5 students failed in a test. The marks of the 10 students who passed were 9, 6, 7, 8, 9, 6, 5, 4, 7, 8. Find the median marks of the 15 students.
Marks                  No. of students (f)   cf
0, 1, 2, 3 (failed)    5                     5
4                      1                     6
5                      1                     7
6                      2                     9
7                      2                     11
8                      2                     13
9                      2                     15
Σf = 15
Median (Me) = $\frac{N+1}{2}$th item = $\frac{15+1}{2}$ = 8th item
The 8th item is covered by the cf of 9. The marks against the cf of 9 is 6, and hence Median = 6.
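The positional rule used in the examples above can be checked with a short script. The following is a minimal Python sketch and is not part of the original lesson; the function names `median_individual` and `median_discrete` are illustrative assumptions. It takes the (N + 1)/2th item for an individual series and, for a discrete series, the first value whose cumulative frequency covers that position.

```python
def median_individual(values):
    """Median of an individual series using the (N + 1)/2-th item rule."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd N: the middle item itself.
        return ordered[mid]
    # Even N: mean of the two middle items.
    return (ordered[mid - 1] + ordered[mid]) / 2


def median_discrete(values, freqs):
    """Median of a discrete series: the first value whose cumulative
    frequency covers the (N + 1)/2-th item."""
    position = (sum(freqs) + 1) / 2
    cumulative = 0
    for value, freq in sorted(zip(values, freqs)):
        cumulative += freq
        if cumulative >= position:
            return value
    raise ValueError("frequencies must sum to a positive number")


print(median_individual([22, 20, 25, 31, 26, 24, 23]))
print(median_individual([20, 21, 22, 24, 28, 32]))
print(median_discrete([10, 20, 30, 40, 50, 60], [5, 5, 3, 15, 30, 10]))
```

The three calls reproduce Ex 1, Ex 2 and the first discrete-series example.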
Continuous Series
The procedure for finding the median is different for a continuous series. The class intervals are already in the form of an array and the frequencies are recorded against each class interval. To locate the median, take the $\frac{N}{2}$th item; the median class is the class whose cumulative frequency first covers this item. Once the median class is located, the median value is interpolated using the formula given below.
Median = $l + \frac{h}{f}\left(\frac{N}{2} - C\right)$
where $l = \frac{l_0 + l_1}{2}$, $l_0$ is the left end point of the N/2 (median) class and $l_1$ is the right end point of the previous class,
h = class width, f = frequency of the median class, C = cumulative frequency of the class preceding the median class.
Ex: Find the median for the following data. The marks obtained by 50 students are as follows.
CI        Frequency (f)   Cum. frequency (cf)
10 – 15   6               6
15 – 20   18              24
20 – 25   9               33   (N/2 class)
25 – 30   10              43
30 – 35   4               47
35 – 40   3               50
Σf = N = 50
$\frac{N}{2} = \frac{50}{2} = 25$
The cumulative frequency just above 25 is 33 and hence 20 – 25 is the median class.
$l = \frac{l_0 + l_1}{2} = \frac{20 + 20}{2} = 20$, h = 25 – 20 = 5, f = 9, C = 24
Median = $l + \frac{h}{f}\left(\frac{N}{2} - C\right) = 20 + \frac{5}{9}(25 - 24) = 20 + \frac{5}{9}$
Median = 20.56
Ex: Find the median for the following data (mid values m, followed by frequencies f).
Mid values (m):    115   125   135   145   155   165   175   185   195
Frequencies (f):   6     25    48    72    116   60    38    22    3
The mid-values differ by 10, so the class width is 10. Deducting half of 10 from each mid-value gives the lower limit and adding it gives the upper limit. For the first class, 115 – 5 = 110 (lower limit) and 115 + 5 = 120 (upper limit); the other class intervals are obtained similarly.
CI          Frequency (f)   Cum. frequency (cf)
110 – 120   6               6
120 – 130   25              31
130 – 140   48              79
140 – 150   72              151
150 – 160   116             267   (N/2 class)
160 – 170   60              327
170 – 180   38              365
180 – 190   22              387
190 – 200   3               390
Σf = N = 390
$\frac{N}{2} = \frac{390}{2} = 195$
The cumulative frequency just above 195 is 267, so the median class is 150 – 160.
$l = \frac{150 + 150}{2} = 150$, f = 116, N/2 = 195, C = 151, h = 10
Median = $l + \frac{h}{f}\left(\frac{N}{2} - C\right) = 150 + \frac{10}{116}(195 - 151)$
Median = 153.8
Merits of Median
a. It is simple, easy to compute and understand.
b. Its value is not affected by extreme values.
c. It is capable for further algebraic treatment.
d. It can be determined by inspection for arrayed data.
e. It can also be found graphically.
f. It indicates the value of the middle item.
Demerits of Median
a. It may not be a representative value as it ignores extreme values.
b. It cannot be determined precisely when its position falls between two values.
c. It is not useful in cases where large weights are to be given to extreme values.
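The interpolation formula for the median of a continuous series can be sketched in code. The Python function below is an illustrative assumption (the name `grouped_median` and its parameters are not from the lesson) and it assumes classes of equal width listed in ascending order.

```python
def grouped_median(lower_limits, width, freqs):
    """Interpolated median l + (h / f) * (N / 2 - C) for equal-width classes."""
    target = sum(freqs) / 2      # the N/2-th item
    cumulative = 0               # C: cumulative frequency before the current class
    for lower, freq in zip(lower_limits, freqs):
        if freq > 0 and cumulative + freq >= target:
            # This is the median class: interpolate inside it.
            return lower + (width / freq) * (target - cumulative)
        cumulative += freq
    raise ValueError("frequencies must sum to a positive number")


marks = grouped_median([10, 15, 20, 25, 30, 35], 5, [6, 18, 9, 10, 4, 3])
print(round(marks, 2))
```

With the marks of the 50 students, the median class is 20 – 25 and the function returns 20 + 5/9, in agreement with the worked example.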
Session – 6 Measures of Central Tendency
Mode
The mode is the value which occurs with the maximum frequency. It is the most typical or common value, the one that receives the highest frequency. It represents fashion and is often used in business. Thus, it corresponds to the value of the variable which occurs most frequently. The modal class of a frequency distribution is the class with the highest frequency. The mode is denoted by ‘z’. It is the value of the variable which is repeated the greatest number of times in the series. It is the usual, and not casual, size of item in the series. It lies at the position of greatest density.
Ex: If we say the modal marks obtained by students in a class test is 42, it means that the largest number of students have secured 42 marks.
If each observation occurs the same number of times, we say that there is ‘no mode’. If two observations occur the same (highest) number of times, the series is ‘bi-modal’. If 3 or more observations occur the same (highest) number of times, it is a ‘multi-modal’ case. When a single observation occurs the most number of times, it is a ‘uni-modal’ case.
For grouped data, the mode can be computed by the following equations with the usual notation.
Mode = $l + \frac{h(f_m - f_1)}{2f_m - f_1 - f_2}$
where $l$ = lower boundary of the modal class, $f_m$ = maximum frequency (modal class frequency), $f_1$ = frequency of the class preceding the modal class, $f_2$ = frequency of the class succeeding the modal class, h = class width,
or
Mode = $l + \frac{h f_2}{f_1 + f_2}$
Ex: 1. Find the mode for the following data.
Marks (CI)   No. of students (f)
1 – 10       3
11 – 20      16
21 – 30      26
31 – 40      31   (max. frequency)
41 – 50      16
51 – 60      8
Σf = N = 100
The modal class is the class with the maximum frequency, i.e. 31 – 40, where $f_m$ = 31, $f_1$ = 26, $f_2$ = 16, h = 10 and
$l = \frac{30 + 31}{2} = 30.5$
Mode (z) = $l + \frac{h(f_m - f_1)}{2f_m - f_1 - f_2} = 30.5 + \frac{10(31 - 26)}{2 \times 31 - 26 - 16}$
Mode = 33
or
Mode = $l + \frac{h f_2}{f_1 + f_2} = 30.5 + \frac{10 \times 16}{26 + 16}$
Mode = 34.31
Note that the second method gives a slightly different value for the mode.
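The two grouped-data formulas for the mode can be compared numerically. The following Python sketch is an illustration only; the function names `grouped_mode` and `grouped_mode_alt` are assumptions, and the arguments follow the notation of the lesson (l, h, fm, f1, f2).

```python
def grouped_mode(lower, width, f_m, f_1, f_2):
    """Mode = l + h (fm - f1) / (2 fm - f1 - f2)."""
    return lower + width * (f_m - f_1) / (2 * f_m - f_1 - f_2)


def grouped_mode_alt(lower, width, f_1, f_2):
    """Alternative form: Mode = l + h f2 / (f1 + f2)."""
    return lower + width * f_2 / (f_1 + f_2)


# Modal class 31 - 40 with lower boundary 30.5 and width 10.
print(grouped_mode(30.5, 10, 31, 26, 16))
print(round(grouped_mode_alt(30.5, 10, 26, 16), 2))
```

The first formula uses the modal frequency itself, while the second uses only the neighbouring frequencies, which is why the two results differ slightly.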
Partition values
The median divides a series into two equal parts. There are other values which also divide the series into equal parts; these are called partition values (PV). Just as one point divides a series into two equal parts (halves), 3 points divide it into four parts (quartiles), 9 points divide it into 10 parts (deciles) and 99 points divide it into 100 parts (percentiles). The partition values are useful to know the exact composition of a series.
Quartiles
A measure which divides an array into four equal parts is known as a quartile. Each portion contains an equal number of items. The first, second and third points are termed the first quartile (Q1), second quartile (Q2) and third quartile (Q3). The first quartile is also known as the lower quartile, as 25% of the observations of the distribution lie below it. The third quartile is known as the upper quartile, as 75% of the observations lie below it and 25% of the observations lie above it.
Calculation of quartiles
Q1 = size of the $\frac{N+1}{4}$th item
Q3 = size of the $\frac{3(N+1)}{4}$th item
Q2 (median) = $l + \frac{h}{f}\left(\frac{N}{2} - C\right)$
Measures of quartiles
The quartile values are located on a principle similar to that used for locating the median value.
The following table shows the procedure for locating quartiles.
Measure   Individual and discrete series   Continuous series
Q1        $\frac{N+1}{4}$th item           $\frac{N}{4}$th item
Q2        $\frac{2(N+1)}{4}$th item        $\frac{2N}{4}$th item
Q3        $\frac{3(N+1)}{4}$th item        $\frac{3N}{4}$th item
Ex - 1: From the following marks find Q1, the median and Q3: 23, 48, 34, 68, 15, 36, 24, 54, 65, 75, 92, 10, 70, 61, 20, 47, 83, 19, 77
Let us arrange the data in array form.
Sl. No.   x
1.        10
2.        15
3.        19
4.        20
5.        23   (Q1)
6.        24
7.        34
8.        36
9.        47
10.       48   (Q2)
11.       54
12.       61
13.       65
14.       68
15.       70   (Q3)
16.       75
17.       77
18.       83
19.       92
Here, n = 19 items.
a. Q1 = $\frac{1}{4}(n+1)$th item = $\frac{1}{4}(19+1) = \frac{1}{4} \times 20$ = 5th item ∴ Q1 = 23
b. Q2 = $\frac{2}{4}(n+1)$th item = $\frac{2 \times 20}{4}$ = 10th item ∴ Q2 = 48
c. Q3 = $\frac{3}{4}(n+1)$th item = $\frac{3 \times 20}{4}$ = 15th item ∴ Q3 = 70
Ex - 2: Locate the median and quartiles from the following data (sizes of shoes, followed by their frequencies).
Size of shoes:   4    4.5   5    5.5   6    6.5   7    7.5   8
Frequencies:     20   36    44   50    80   30    30   16    14
X     f     cf
4     20    20
4.5   36    56
5     44    100   (Q1)
5.5   50    150
6     80    230   (Q2)
6.5   30    260   (Q3)
7     30    290
7.5   16    306
8     14    320
N = Σf = 320
Q1 = $\frac{1}{4}(n+1)$th item = $\frac{1}{4} \times 321$ = 80.25th item
Just above 80.25, the cf is 100. Against the cf of 100, the value is 5. ∴ Q1 = 5
Q2 = $\frac{1}{2}(n+1)$th item = $\frac{1}{2} \times 321$ = 160.5th item
Just above 160.5, the cf is 230. Against the cf of 230, the value is 6. ∴ Q2 = 6
Q3 = $\frac{3}{4}(n+1)$th item = $\frac{3}{4} \times 321$ = 240.75th item
Just above 240.75, the cf is 260. Against the cf of 260, the value is 6.5. ∴ Q3 = 6.5
Ex - 3: Compute the quartiles from the following data (class intervals CI, followed by frequencies f).
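CI:              0-10   10-20   20-30   30-40   40-50   50-60   60-70   70-80
Frequency (f):   5      8       7       12      28      20      10      10
First quartile (Q1) = $l + \frac{h}{f}\left(\frac{N}{4} - C\right)$, Q3 = $l + \frac{h}{f}\left(\frac{3N}{4} - C\right)$ and Q2 = Median = $l + \frac{h}{f}\left(\frac{N}{2} - C\right)$
CI      f    cf
0-10    5    5
10-20   8    13
20-30   7    20
30-40   12   32    (Q1)
40-50   28   60    (Q2)
50-60   20   80    (Q3)
60-70   10   90
70-80   10   100
N = Σf = 100
a. First locate Q1, which corresponds to ¼ N = 25. The Q1 class is 30-40, with $l = \frac{30 + 30}{2} = 30$, h = 10, f = 12, C = 20.
Q1 = $l + \frac{h}{f}\left(\frac{N}{4} - C\right) = 30 + \frac{10}{12}(25 - 20)$
Q1 = 34.17
b. Locate Q2 (median). Q2 corresponds to N/2 = 50. The Q2 class is 40-50, with $l = \frac{40 + 40}{2} = 40$, f = 28, C = 32.
Q2 = $l + \frac{h}{f}\left(\frac{N}{2} - C\right) = 40 + \frac{10}{28}(50 - 32)$
Q2 = 46.43
c. Q3 corresponds to ¾ N = 75. The Q3 class is 50-60, with $l = \frac{50 + 50}{2} = 50$, f = 20, C = 60.
Q3 = $l + \frac{h}{f}\left(\frac{3N}{4} - C\right) = 50 + \frac{10}{20}(75 - 60)$
Q3 = 57.5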
Deciles
The deciles divide the arrayed set of variates into ten portions of equal frequency and they are sometimes used to characterize the data for some specific purpose. In this process, we get nine decile values. The fifth decile is nothing but the median. We can calculate the other deciles by following the procedure used in computing the quartiles.
Formula to compute deciles:
D1 = $l + \frac{h}{f}\left(\frac{1}{10}N - C\right)$, D2 = $l + \frac{h}{f}\left(\frac{2}{10}N - C\right)$ and so on.
Percentiles
Percentile values divide the distribution into 100 parts of equal frequency. In this process, we get ninety-nine percentile values. The 25th, 50th and 75th percentiles are nothing but the first quartile, median and third quartile respectively.
Formula to compute percentiles:
P25 = $l + \frac{h}{f}\left(\frac{25}{100}N - C\right)$, P26 = $l + \frac{h}{f}\left(\frac{26}{100}N - C\right)$ and so on.
Ex: Find the 7th decile and the 60th percentile for the given data of patients who visited a hospital on a particular day.
CI      f    cf
10-20   1    1
20-30   3    4
30-40   11   15
40-50   21   36
50-60   43   79    (P60)
60-70   32   111   (D7)
70-80   9    120
Σf = N = 120
a. D7 = $l + \frac{h}{f}\left(\frac{7}{10}N - C\right)$
$\frac{7}{10}N = 84$, so the D7 class is 60-70, with $l = \frac{60 + 60}{2} = 60$, h = 10, f = 32, C = 79.
D7 = $60 + \frac{10}{32}(84 - 79)$
7th decile = D7 = 61.56
b. 60th percentile: P60 = $l + \frac{h}{f}\left(\frac{60}{100}N - C\right)$
$\frac{60}{100}N = 72$, so the P60 class is 50-60, with $l = \frac{50 + 50}{2} = 50$, h = 10, f = 43, C = 36.
P60 = $50 + \frac{10}{43}(72 - 36)$
P60 = 58.37
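Quartiles, deciles and percentiles of a continuous series all use the same interpolation, with only the fraction of N changing. The Python sketch below illustrates this under stated assumptions: the name `partition_value` and its parameters `k` and `parts` are hypothetical, and the classes are assumed to be of equal width and listed in ascending order.

```python
def partition_value(lower_limits, width, freqs, k, parts):
    """k-th of `parts` partition values: l + (h / f) * (k * N / parts - C).

    parts = 4 gives quartiles, 10 gives deciles, 100 gives percentiles.
    """
    target = k * sum(freqs) / parts
    cumulative = 0               # C: cumulative frequency before the current class
    for lower, freq in zip(lower_limits, freqs):
        if freq > 0 and cumulative + freq >= target:
            return lower + (width / freq) * (target - cumulative)
        cumulative += freq
    raise ValueError("frequencies must sum to a positive number")


lowers = [10, 20, 30, 40, 50, 60, 70]
patients = [1, 3, 11, 21, 43, 32, 9]
print(round(partition_value(lowers, 10, patients, 7, 10), 2))    # 7th decile
print(round(partition_value(lowers, 10, patients, 60, 100), 2))  # 60th percentile
```

Passing `k=2, parts=4` gives the median, so the same function covers Q2 as a special case.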
SOME NUMERICAL EXAMPLES
1. Show that the following distribution is symmetrical about the average. Also show that the median is mid-way between the lower and upper quartiles.
X:           2   3   4    5    6    7    8    9   10
Frequency:   2   9   29   57   80   57   29   9   2
• To show that the given distribution is symmetrical, the mean, median and mode must be the same.
• To show that the median is mid-way between the lower and upper quartiles, we need Q2 – Q1 = Q3 – Q2.
Mid-point (x)   Class interval (CI)   f    d = (x – 6)   fd    Cum. freq. (cf)
2               1.5 – 2.5             2    -4            -8    2
3               2.5 – 3.5             9    -3            -27   11
4               3.5 – 4.5             29   -2            -58   40
5               4.5 – 5.5             57   -1            -57   97    (Q1 class)
6               5.5 – 6.5             80   0             0     177   (Q2 class)
7               6.5 – 7.5             57   1             57    234   (Q3 class)
8               7.5 – 8.5             29   2             58    263
9               8.5 – 9.5             9    3             27    272
10              9.5 – 10.5            2    4             8     274
N = 274, Σfd = 0
Let A = 6 and h = 1.
Mean = $A + \frac{h \sum fd}{N} = 6 + \frac{1 \times 0}{274} = 6$
Mean = 6.
Median: Q2 = $l + \frac{h}{f}\left(\frac{N}{2} - C\right)$, with $\frac{N}{2} = \frac{274}{2} = 137$, l = 5.5, f = 80, C = 97.
Q2 = $5.5 + \frac{1}{80}(137 - 97) = 5.5 + 0.5$
Median = Q2 = 6.
Mode: the modal class is 5.5 – 6.5.
Mode = $l + \frac{h(f_m - f_1)}{2f_m - f_1 - f_2} = 5.5 + \frac{1(80 - 57)}{2 \times 80 - 57 - 57} = 5.5 + \frac{23}{46}$
Mode = 6.
Since Mean = Median = Mode, the given distribution is symmetrical.
Q1 calculation: $\frac{N}{4} = 68.5$, l = 4.5, f = 57, C = 40.
Q1 = $l + \frac{h}{f}\left(\frac{N}{4} - C\right) = 4.5 + \frac{1}{57}(68.5 - 40)$ ∴ Q1 = 5.
Q3 calculation: $\frac{3N}{4} = 205.5$, l = 6.5, f = 57, C = 177.
Q3 = $l + \frac{h}{f}\left(\frac{3N}{4} - C\right) = 6.5 + \frac{1}{57}(205.5 - 177)$ ∴ Q3 = 7.
Now, Q2 – Q1 = Q3 – Q2, i.e. 6 – 5 = 7 – 6, or 1 = 1. Hence the median is mid-way between the lower and upper quartiles.
2. Find the mean for the set of observations given below. 6, 8, 7, 5, 4
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{N} = \frac{6 + 8 + 7 + 5 + 4}{5} = \frac{30}{5} = 6$
3. Find the mean for the following data.
CI      f    xi     fx
1-10    3    5.5    16.5
11-20   16   15.5   248
21-30   26   25.5   663
31-40   31   35.5   1100.5
41-50   16   45.5   728
51-60   8    55.5   444
N = Σf = 100, Σfx = 3200
$\bar{x} = \frac{\sum fx}{N} = \frac{3200}{100}$
$\bar{x}$ = 32
4. Find the mean profit of the organisation for the data given below.
Profit CI   f    xi    fx
100-200     10   150   1500
200-300     18   250   4500
300-400     20   350   7000
400-500     26   450   11700
500-600     30   550   16500
600-700     28   650   18200
700-800     18   750   13500
N = Σf = 150, Σfx = 72900
Each mid-value is the average of the class limits, e.g. $x_1 = \frac{100 + 200}{2} = \frac{300}{2} = 150$.
$\bar{x} = \frac{\sum fx}{N} = \frac{72900}{150}$
$\bar{x}$ = 486
Step Deviation Method
x = a + hd, so $d = \frac{x - a}{h}$ and
$\bar{x} = a + h\,\frac{\sum fd}{N}$
a = arbitrary constant (assumed mean), h = class width
Here a = 450 and h = 100.
Profit CI   f    xi    d    fd
100-200     10   150   -3   -30
200-300     18   250   -2   -36
300-400     20   350   -1   -20
400-500     26   450   0    0
500-600     30   550   +1   30
600-700     28   650   +2   56
700-800     18   750   +3   54
N = Σf = 150, Σfd = 54
$\bar{x} = a + h\,\frac{\sum fd}{N} = 450 + 100 \times \frac{54}{150}$
$\bar{x}$ = 486
5. In an office there are 84 employees and their salaries are given below.
Salary (Rs.):   2430   2590   2870   3390   4720   5160
Employees:      4      28     31     16     3      2
1. Find the mean salary of the employees.
2. What is the total salary of the employees?
$\bar{x} = \frac{\sum fx}{N} = \frac{2430 \times 4 + 2590 \times 28 + 2870 \times 31 + 3390 \times 16 + 4720 \times 3 + 5160 \times 2}{84} = \frac{249930}{84}$
1. Mean salary $\bar{x}$ = Rs. 2975.36
2. Total salary = Rs. 2,49,930
6. The average marks secured by 36 students was 52, but it was discovered that an item 64 was misread as 46. Find the correct mean of the marks.
$\bar{x} = \frac{\sum fx}{N}$, so $52 = \frac{\sum fx}{36}$
Σfx = 52 x 36 = 1872
Correct Σfx = Σfx – incorrect item + correct item = 1872 – 46 + 64 = 1890
Correct $\bar{x} = \frac{1890}{36}$
$\bar{x}$ = 52.5
7. The mean of 100 items is 46. Later it was discovered that an item 16 was misread as 61 and another item 43 was misread as 34. It was also found that the total number of items is 90, not 100. Find the correct mean value.
$\bar{x} = \frac{\sum fx}{N}$, so $46 = \frac{\sum fx}{100}$
Σfx = 4600
Correct Σfx = Σfx – incorrect items + correct items = 4600 – 61 – 34 + 16 + 43 = 4564
Correct $\bar{x} = \frac{4564}{90}$ = 50.71
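The correction used in problems 6 and 7 (recover the total, swap the misread items, divide by the true count) can be sketched as a small helper. This Python function is illustrative; the name `corrected_mean` and its parameters are assumptions, not part of the lesson.

```python
def corrected_mean(reported_mean, n, wrong_items, right_items, true_n=None):
    """Correct a mean after some items were misread.

    reported_mean * n recovers the (incorrect) total; the misread values are
    removed, the correct ones added, and the result is divided by the true
    number of items (which defaults to n).
    """
    total = reported_mean * n - sum(wrong_items) + sum(right_items)
    count = n if true_n is None else true_n
    return total / count


print(corrected_mean(52, 36, [46], [64]))
print(round(corrected_mean(46, 100, [61, 34], [16, 43], true_n=90), 2))
```

The first call corresponds to problem 6 and the second to problem 7, where the number of items is also corrected from 100 to 90.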
8. Calculate the mean for the following data.
Value          Frequency
Less than 10   4
Less than 20   10
Less than 30   15
Less than 40   25
Less than 50   30
CI      f    ‘m’ mid point   fm
0-10    4    5               20
10-20   10   15              150
20-30   15   25              375
30-40   25   35              875
40-50   30   45              1350
Σf = 84, Σfm = 2770
$\bar{x} = \frac{\sum fm}{N} = \frac{2770}{84}$
$\bar{x}$ = 32.98
9. For the given frequency table, find the missing data. The average number of accidents is 1.46.
No. of accidents (x)   Frequency (f)
0                      46
1                      ?
2                      ?
3                      25
4                      10
5                      5
Let the missing frequencies be f1 and f2, with N = 200.
No. of accidents (x)   Frequency (f)   fx
0                      46              0
1                      f1              f1
2                      f2              2f2
3                      25              75
4                      10              40
5                      5               25
N = 200, Σfx = 140 + f1 + 2f2
$1.46 = \frac{140 + f_1 + 2f_2}{200}$
292 = 140 + f1 + 2f2
∴ f1 + 2f2 = 152 ----(1)
We know that N = Σf:
200 = 86 + f1 + f2
f1 + f2 = 114 ----(2)
Subtracting (2) from (1): f2 = 38
Substituting in (2): f1 = 114 – 38, so f1 = 76
10. Find the missing value of the variate for the following data, given that the mean is 31.87.
xi   f    fx
12   8    96
20   16   320
27   48   1296
33   90   2970
x    30   30x
54   8    432
N = 200, Σfx = 5114 + 30x
$\bar{x}$ = 31.87 and $\bar{x} = \frac{\sum fx}{N}$, so $31.87 = \frac{\sum fx}{200}$
Σfx = 6374 ----(1)
Σfx = 5114 + 30x ----(2)
Equating (1) and (2):
6374 = 5114 + 30x
6374 – 5114 = 30x
∴ 30x = 1260
x = 42.
11. The average rainfall of a city from Monday to Saturday is 0.3 inches. Due to heavy rainfall on Sunday, the average rainfall for the week increased to 0.5 inches. What is the rainfall on Sunday?
Given: average for Mon – Sat (6 days) = 0.3", average for Mon – Sun (7 days) = 0.5"
$0.3 = \frac{\sum fx_1}{6}$, so Σfx1 = 1.8
$0.5 = \frac{\sum fx_2}{7}$, so Σfx2 = 3.5
Rainfall on Sunday = Σfx2 – Σfx1 = 3.5 – 1.8 = 1.7"
12. The average salary of male employees in a firm was Rs. 520 and that of female employees was Rs. 420. The mean salary of all the employees as a whole is Rs. 500. Find the percentage of male and female employees.
Given: $\bar{x}_1$ = 520, $\bar{x}_2$ = 420, $\bar{x}$ = 500
n1 = number of male employees, n2 = number of female employees.
$\bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2}$
$500 = \frac{520 n_1 + 420 n_2}{n_1 + n_2}$
500n1 + 500n2 = 520n1 + 420n2
80n2 = 20n1
n1 = 4n2
Let n1 + n2 = 100. Then 4n2 + n2 = 100, so 5n2 = 100.
n2 = 20% (female)
n1 = 80% (male)
80% of the employees in the firm are male and 20% are female.
13. The AM of two observations is 25 and their GM is 15. Find the HM.
Given: AM = 25, GM = 15, HM = ?
AM = $\frac{a+b}{2} = 25$, so a + b = 50
GM = $\sqrt{ab} = 15$, so ab = $15^2$ = 225
HM = $\frac{2}{\frac{1}{a} + \frac{1}{b}} = \frac{2ab}{a+b} = \frac{2 \times 225}{50}$
HM = 9
14. The GM is 60 and the HM is 28.24. Find the AM for two observations.
GM = $\sqrt{ab} = 60$, so ab = $60^2$ = 3600
HM = $\frac{2ab}{a+b} = 28.24$, so $a + b = \frac{2ab}{28.24} = \frac{2 \times 3600}{28.24}$
a + b = 254.96
AM = $\frac{a+b}{2} = \frac{254.96}{2}$ = 127.48
15. Calculate the missing frequency from the data if the median is 50.
CI      f      cf
10-20   2      2
20-30   8      10
30-40   6      16
40-50   f1     16 + f1
50-60   15     31 + f1   (median class)
60-70   10     41 + f1
N = Σf = 41 + f1
Median = $l + \frac{h}{f}\left(\frac{N}{2} - C\right)$
$50 = 50 + \frac{10}{15}\left(\frac{N}{2} - (16 + f_1)\right)$
$0 = \frac{10}{15}\left(\frac{N}{2} - (16 + f_1)\right)$
$0 = \frac{N}{2} - (16 + f_1)$
$16 + f_1 = \frac{N}{2}$
16 + f1 = ½ (41 + f1)
2 (16 + f1) = 41 + f1
32 + 2f1 = 41 + f1
f1 = 9
SOURCES AND REFERENCES 1. Statistics for Management, Richard I Levin, PHI / 2000. 2. Statistics, RSN Pillai and Bagavathi, S. Chands, Delhi. 3. An Introduction to Statistical Method, C.B. Gupta, & Vijaya Gupta, Vikasa Publications, 23e/2006. 4. Business Statistics, C.M. Chikkodi and Salya Prasad, Himalaya Publications, 2000. 5. Statistics, D.C. Sancheti and Kappor, Sultan Chand and Sons, New Delhi, 2004. 6. Fundamentals of Statistics, D.N. Elhance and Veena and Aggarwal, KITAB Publications, Kolkata, 2003. 7. Business Statistics, Dr. J.S. Chandan, Prof. Jagit Singh and Kanna, Vikas Publications, 2006.
Session – 7 Measures of Dispersions
The measures of central tendency alone will not exhibit the various characteristics of frequency distributions having the same total frequency. Two distributions can have the same mean but can differ significantly. We need to know the extent of variation or deviation of the values in comparison with the central value or average referred to as the measure of central tendency. Measures of dispersion are ‘averages of the second order’. They are based on the average of the deviations of the values from a measure of central tendency: $\bar{x}$, Me or z. Variability is the basic feature of the values of variables. Such variation or dispersion refers to the ‘lack of uniformity’.
Definition: A measure of dispersion may be defined as a statistic signifying the extent of the scatteredness of items around a measure of central tendency.
Absolute and Relative Measures of Dispersion: A measure of dispersion may be expressed in an absolute form or in a relative form. It is said to be in absolute form when it states the actual amount by which the value of an item on average deviates from a measure of central tendency. Absolute measures are expressed in concrete units, i.e. the units in terms of which the data have been expressed, e.g. rupees, centimetres, kilograms etc., and are used to describe a frequency distribution. A relative measure of dispersion is a quotient obtained by dividing the absolute measure by the quantity in respect to which the absolute deviation has been computed. It is as such a pure number and is usually expressed in percentage form. Relative measures are used for making comparisons between two or more distributions. Thus, absolute measures are expressed in terms of the original units and are not suitable for comparative studies. Relative measures are expressed as ratios or percentages and are suitable for comparative studies.
Measures of Dispersion Types
Following are the common measures of dispersion.
a. The Range
b. The Quartile Deviation (QD)
c. The Mean Deviation (MD)
d. The Standard Deviation (SD)
Range
‘Range’ represents the difference between the values of the extremes. The range of any series is the difference between the highest and the lowest values in the series. The values in between the two extremes are not taken into consideration at all. The range is a simple indicator of the variability of a set of observations. It is denoted by ‘R’. In a frequency distribution, the range is taken to be the difference between the lower limit of the class at the lower extreme of the distribution and the upper limit of the class at the upper extreme. The range can be computed using the following equations.
Range = Large value – Small value
Coefficient of Range = $\frac{\text{Large value} - \text{Small value}}{\text{Large value} + \text{Small value}}$
Problems
1. Compute the range and also the coefficient of range of the given series, and state which one is more dispersed and which is more uniform.
Series – I: 9, 10, 15, 19, 21
Series – II: 1, 15, 24, 28, 29
Series – I: R = LV – SV = 21 – 9 = 12, CR = $\frac{12}{21 + 9} = \frac{12}{30}$ = 0.4
Series – II: R = LV – SV = 29 – 1 = 28, CR = $\frac{R}{LV + SV} = \frac{28}{30}$ = 0.933
Series I is less dispersed and more uniform.
Series II is more dispersed and less uniform.
Evaluating Criteria
i. The smaller the CR, the less the dispersion.
ii. The larger the CR, the less uniform the series.
Range Merits
i. It is very simple to measure.
ii. It is rigidly defined.
iii. It is very useful in Statistical Quality Control (SQC).
iv. It is useful in studying variation in the prices of shares and stocks.
Limitations
i. It is not a stable measure of dispersion, as it is affected by extreme values.
ii. It does not consider class intervals and is not suitable for C.I. problems.
iii. It considers only extreme values.
2. Find the range and coefficient of range from the following data.
A:   10    11    12    13    14
B:   40    41    42    43    44
C:   100   101   102   103   104
Series – I (A): R = LV – SV = 14 – 10 = 4, CR = $\frac{R}{LV + SV} = \frac{4}{24}$ = 0.166
Series – II (B): R = 44 – 40 = 4, CR = $\frac{R}{LV + SV} = \frac{4}{84}$ = 0.0476
Series – III (C): R = 104 – 100 = 4, CR = $\frac{R}{LV + SV} = \frac{4}{204}$ = 0.0196
Series III is less dispersed and more uniform.
Series I is more dispersed and less uniform.
3. Compute the range and coefficient of range for the following data.
x:   6    12    18   24   30   36   42
f:   20   130   16   14   20   15   40
Range = LV – SV = 42 – 6 = 36
CR = $\frac{R}{LV + SV} = \frac{36}{48}$ = 0.75
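The range and its coefficient need only the two extreme values, which the following minimal Python sketch makes explicit. The function name `range_and_coefficient` is an assumption introduced for illustration.

```python
def range_and_coefficient(values):
    """Return (range, coefficient of range) = (LV - SV, (LV - SV) / (LV + SV))."""
    largest, smallest = max(values), min(values)
    value_range = largest - smallest
    return value_range, value_range / (largest + smallest)


for name, series in [("I", [9, 10, 15, 19, 21]), ("II", [1, 15, 24, 28, 29])]:
    r, cr = range_and_coefficient(series)
    print(name, r, round(cr, 3))
```

Applied to series A, B and C of problem 2, the range is 4 in every case while the coefficient falls as the values grow, which is why the coefficient, not the range, is used for comparison.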
Quartile Deviation
Quartiles divide the total frequency into four equal parts. The lower quartile Q1 refers to the value of the variate corresponding to the cumulative frequency N/4. The upper quartile Q3 refers to the value of the variate corresponding to the cumulative frequency ¾ N. The middle quartile Q2 is the median, as it corresponds to the value of the variate with cumulative frequency c.f. = $\frac{N}{2}$.
a) Quartile deviation is defined as QD = $\frac{1}{2}(Q_3 - Q_1)$
b) Relative measure of dispersion: coefficient of QD = $\frac{Q_3 - Q_1}{Q_3 + Q_1}$
Problems
1. Find the quartile deviation and coefficient of quartile deviation for the given grouped data. Also compute the middle quartile.
Class     f    Cf
1 – 10    3    3
11 – 20   16   19
21 – 30   26   45    (Q1 class)
31 – 40   31   76    (Q2 & Q3 class)
41 – 50   16   92
51 – 60   8    100
Σf = N = 100
For Q1: $\frac{N}{4} = \frac{100}{4} = 25$
Q1 = $l + \frac{h}{f}\left(\frac{N}{4} - C\right) = 20.5 + \frac{10}{26}(25 - 19)$
Q1 = 22.81
Q2 = $l + \frac{h}{f}\left(\frac{N}{2} - C\right) = 30.5 + \frac{10}{31}(50 - 45)$
Q2 = 32.11
Q3 = $l + \frac{h}{f}\left(\frac{3N}{4} - C\right) = 30.5 + \frac{10}{31}(75 - 45)$
Q3 = 40.18
QD = $\frac{1}{2}(Q_3 - Q_1) = \frac{1}{2}(40.18 - 22.81)$ = 8.685
Coef. QD = $\frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{40.18 - 22.81}{40.18 + 22.81} = \frac{17.37}{62.99}$ = 0.276
2. Find the quartile deviation from the following marks of 12 students and also the coefficient of quartile deviation.
Sl. No.   Marks
1.        25
2.        30
3.        37
4.        43
5.        48
6.        54
7.        61
8.        67
9.        72
10.       80
11.       84
12.       89
Q1 = $\frac{12 + 1}{4}$ = 3.25th item = 3rd item + 0.25 (4th item – 3rd item) = 37 + 0.25 (43 – 37)
Q1 = 38.5
Q3 = $\frac{3(12 + 1)}{4}$ = 9.75th item = 9th item + 0.75 (10th item – 9th item) = 72 + 0.75 (80 – 72)
Q3 = 78
QD = $\frac{1}{2}(Q_3 - Q_1) = \frac{1}{2}(78 - 38.5)$
QD = 19.75
Coef. QD = $\frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{78 - 38.5}{78 + 38.5}$ = 0.339
3. Compute the quartile deviation and its coefficient for the data given below:
x    f    Cf
58   15   15
59   20   35
60   32   67    (Q1 class)
61   35   102
62   33   135
63   22   157   (Q3 class)
64   20   177
65   10   187
66   8    195
N = 195
Q1 = size of the $\frac{n+1}{4}$th item = $\frac{195 + 1}{4}$ = 49th item, which is covered by the cf of 67, giving Q1 = 60.
Q3 = size of the $\frac{3(n+1)}{4}$th item = $\frac{3 \times 196}{4}$ = 147th item, which is covered by the cf of 157, giving Q3 = 63.
QD = $\frac{1}{2}(Q_3 - Q_1) = \frac{1}{2}(63 - 60)$
QD = 1.5
Coef. QD = $\frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{63 - 60}{63 + 60} = \frac{3}{123}$ = 0.024
Merits of Quartile Deviation
• It is very easy to compute.
• It is not affected by extreme values of the variable.
• It is not at all affected by open-end class intervals.
Demerits of Quartile Deviation
• It completely ignores the portions below the lower quartile and above the upper quartile.
• It is not capable of further mathematical treatment.
• It is greatly affected by fluctuations in sampling.
• It is only a positional average, not a mathematical average.
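For an individual series, the quartile positions (n + 1)/4 and 3(n + 1)/4 are often fractional, and the value is found by interpolating between neighbouring items as in problem 2. The Python sketch below shows one way to do this; the names `item_at` and `quartile_deviation` are assumptions made for illustration.

```python
def item_at(ordered, position):
    """Value at a 1-based, possibly fractional, position in a sorted list,
    using linear interpolation between neighbouring items."""
    whole = int(position)
    fraction = position - whole
    if fraction == 0:
        return ordered[whole - 1]
    return ordered[whole - 1] + fraction * (ordered[whole] - ordered[whole - 1])


def quartile_deviation(values):
    """Return (Q1, Q3, QD, coefficient of QD) for an individual series."""
    if len(values) < 3:
        raise ValueError("need at least 3 observations")
    ordered = sorted(values)
    n = len(ordered)
    q1 = item_at(ordered, (n + 1) / 4)
    q3 = item_at(ordered, 3 * (n + 1) / 4)
    return q1, q3, (q3 - q1) / 2, (q3 - q1) / (q3 + q1)


marks = [25, 30, 37, 43, 48, 54, 61, 67, 72, 80, 84, 89]
print(quartile_deviation(marks))
```

Note that statistical libraries offer several different quantile conventions, so their results may differ slightly from the (n + 1)/4 rule used in this lesson.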
Session – 8 Measures of Dispersions
Mean Deviation
Mean deviation is the average of the differences of the items in a series from the mean, median or mode of that series. It is concerned with the extent to which the values are dispersed about the mean, the median or the mode. It is found by averaging all the deviations from the measure of central tendency. These deviations are taken into the computation without regard to their negative signs, i.e. as absolute values. Theoretically, the deviations are preferably taken from the median rather than from the mean or the mode.
Merits of Mean Deviation
• It is rigidly defined and easy to compute.
• It takes all items into consideration and gives weight to deviations according to their size.
• It is less affected by extreme values.
• It removes all irregularities by obtaining deviations and provides a correct measure.
Demerits of Mean Deviation
• It is not suitable for algebraic treatment.
• It treats all deviations as positive, which is not justified mathematically.
• It is not a satisfactory measure when the deviations are taken from the mode.
• It is not suitable when class intervals are open-ended.
Formula to compute Mean Deviation
If $x_i$ is a variate that takes the values $x_1, x_2, x_3, \ldots, x_n$ with average A (mean, median or mode), then the mean deviation from the average A is defined by
MD = $\frac{\sum |x_i - A|}{N}$
For grouped data
MD = $\frac{\sum f\,|x_i - A|}{N}$
Coefficient of MD = $\frac{MD}{A}$, where A is the average (mean, median or mode) from which the deviations are taken.
1. Compute the MD and the coefficient of MD (CMD) from the mean for the data given below.
X    |d| = |xi – x̄|
21   26.55
32   15.55
38   9.55
41   6.55
49   1.45
54   6.45
59   11.45
66   18.45
68   20.45
Σx = 428, Σ|xi – x̄| = Σ|d| = 116.45
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{N} = \frac{428}{9}$ = 47.55
MD = $\frac{\sum |x_i - \bar{x}|}{N} = \frac{116.45}{9}$
MD = δ = 12.938
Coefficient of MD = $\frac{MD}{\text{Avg}} = \frac{12.938}{47.55}$ = 0.272
2. Following are the wages of workers. Find the mean deviation from the median and its coefficient.
Wages (x)   Wages in ascending order   |xi – Me| = |xi – 47|
59          17                         30
32          22                         25
67          25                         22
43          32                         15
22          43                         4
17          47   (Me)                  0
64          55                         8
55          59                         12
47          64                         17
80          67                         20
25          80                         33
Σ|xi – Me| = 186
Median = $\frac{11 + 1}{2}$th item = 6th item, so Me = 47
MD = $\frac{\sum |x_i - Me|}{N} = \frac{186}{11}$ = 16.91
Coefficient of MD = $\frac{MD}{\text{Median}} = \frac{16.91}{47}$ = 0.36
3. Compute the MD about the mode and its coefficient.
x            f                   |d| = |xi – Mode|   f|d|
20           6                   100                 600
40           19                  80                  1520
60           40                  60                  2400
80           23                  40                  920
100          65                  20                  1300
120 (Mode)   83 (modal class)    0                   0
140          55                  20                  1100
160          20                  40                  800
180          9                   60                  540
Σf = 320, Σf|xi – Mode| = 9180
The highest frequency is 83 and hence Z = 120.
MD = $\frac{\sum f\,|x_i - \text{Mode}|}{N} = \frac{9180}{320}$ = 28.69
Coefficient of MD = $\frac{28.69}{120}$ = 0.239
4. Find the mean deviation about the median from the data given below.
Salaries:           40   50   50-100   100-200   200-400
No. of Employees:   22   18   10       8         2
Salaries   No. of Employees (f)   x (mid value)   cf   |d| = |xi – Me|   f|d|
40         22                     40              22   10                220
50         18                     50              40   0                 0
50-100     10                     75              50   25                250
100-200    8                      150             58   100               800
200-400    2                      300             60   250               500
Σf = 60, Σf|xi – Me| = 1770
Median = $\frac{N+1}{2}$th item = $\frac{60 + 1}{2} = \frac{61}{2}$ = 30.5th item
It is covered by the cf of 40, and against the cf of 40 the discrete value is 50, so Me = 50.
MD = $\frac{\sum f\,|x_i - \text{Median}|}{N} = \frac{1770}{60}$
MD = 29.5
Coefficient of MD = $\frac{MD}{\text{Median}} = \frac{29.5}{50}$ = 0.59
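The mean deviation formula works the same way whichever average the deviations are taken from. The Python sketch below is illustrative; the function name `mean_deviation` and the optional `freqs` parameter are assumptions. Passing frequencies gives the grouped-data form.

```python
def mean_deviation(values, center, freqs=None):
    """Mean deviation about `center` (mean, median or mode).

    Without `freqs`: sum |x - A| / N.  With `freqs`: sum f |x - A| / sum f.
    """
    if freqs is None:
        freqs = [1] * len(values)
    total = sum(f * abs(x - center) for x, f in zip(values, freqs))
    return total / sum(freqs)


wages = [59, 32, 67, 43, 22, 17, 64, 55, 47, 80, 25]
md_wages = mean_deviation(wages, 47)                 # about the median, 47
print(round(md_wages, 2), round(md_wages / 47, 2))

mid_values = [40, 50, 75, 150, 300]
employees = [22, 18, 10, 8, 2]
print(mean_deviation(mid_values, 50, employees))     # about the median, 50
```

Dividing the result by the chosen average gives the coefficient of mean deviation.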
Session – 9 Measures of Dispersions
Standard Deviation
Standard deviation is the square root of the sum of the squares of the deviations divided by their number. It is also called the ‘mean error deviation’, the mean square error deviation or the root mean square deviation. It is a second moment of dispersion. Since the sum of the squares of the deviations is a minimum when taken from the mean, the deviations are taken only from the mean (and not from the median or mode). The standard deviation is the Root Mean Square (RMS) average of all the deviations from the mean. It is denoted by sigma (σ).
Characteristics of standard deviation
1. Standard deviation and coefficient of variation possess all the properties which a good measure of dispersion should possess.
2. The process of squaring the deviations eliminates negative signs and makes mathematical computations easy.
Merits
1. It is based on all observations.
2. It can be smoothly handled algebraically.
3. It is a well defined and definite measure of dispersion.
4. It is of great importance when we are making comparisons between the variability of two series.
Demerits
1. It is difficult to calculate and understand.
2. It gives more weightage to extreme values as the deviations are squared.
3. It is not useful in economic studies.
Standard deviation
If the variate $x_i$ takes the values $x_1, x_2, \ldots, x_n$, the standard deviation is denoted by σ and is defined by
σ = $\sqrt{\frac{\sum (x_i - \bar{x})^2}{N}}$
The quantity σ² is called the variance.
Alternate Expressions
For raw data: σ² = $\frac{\sum x^2}{n} - (\bar{x})^2$
For grouped data: σ² = $\frac{\sum fx^2}{N} - (\bar{x})^2$
For grouped data with the step deviation method: σ = $h\sqrt{\frac{\sum fd^2}{N} - \left(\frac{\sum fd}{N}\right)^2}$
Coefficient of variation
It is defined as the ratio of the standard deviation to the mean. The percentage form of CV is given by
CV = $\frac{\sigma}{\bar{x}} \times 100$
Problems
1. Ten students of a class have obtained the following marks in a particular subject out of 100. Calculate the SD and CV for the data given below.
Sl. No.   Marks (x)   d = (xi – x̄) = (xi – 38.5)   d² = (xi – x̄)²
1.        5           -33.5                         1122.25
2.        10          -28.5                         812.25
3.        20          -18.5                         342.25
4.        25          -13.5                         182.25
5.        40          1.5                           2.25
6.        42          3.5                           12.25
7.        45          6.5                           42.25
8.        48          9.5                           90.25
9.        70          31.5                          992.25
10.       80          41.5                          1722.25
Σx = 385, Σ(xi – x̄)² = Σd² = 5320.50
$\bar{x} = \frac{\sum x}{N} = \frac{385}{10}$ = 38.5
σ = $\sqrt{\frac{\sum (x_i - \bar{x})^2}{N}} = \sqrt{\frac{5320.5}{10}}$ = 23.066
CV = $\frac{\sigma}{\bar{x}} \times 100 = \frac{23.066}{38.5} \times 100$
CV = 59.9%
2. Compute the standard deviation and coefficient of variation for the following data of the marks of 100 students.
Class     f    Class boundaries   Mid point (x)   d    fd    fd²
1 – 10    3    0.5 – 10.5         5.5             -2   -6    12
11 – 20   16   10.5 – 20.5        15.5            -1   -16   16
21 – 30   26   20.5 – 30.5        25.5            0    0     0
31 – 40   31   30.5 – 40.5        35.5            1    31    31
41 – 50   16   40.5 – 50.5        45.5            2    32    64
51 – 60   8    50.5 – 60.5        55.5            3    24    72
N = Σf = 100, Σfd = 65, Σfd² = 195
a = 25.5, h = 10
$d = \frac{x - a}{h} = \frac{x - 25.5}{10}$; for example, $d = \frac{15.5 - 25.5}{10} = \frac{-10}{10} = -1$
$\bar{x} = a + h\,\frac{\sum fd}{N} = 25.5 + 10 \times \frac{65}{100} = 25.5 + 6.5$
$\bar{x}$ = 32
σ = $h\sqrt{\frac{\sum fd^2}{N} - \left(\frac{\sum fd}{N}\right)^2} = 10\sqrt{\frac{195}{100} - \left(\frac{65}{100}\right)^2}$ = 12.359
CV = $\frac{\sigma}{\bar{x}} \times 100 = \frac{12.359}{32} \times 100$ = 38.62%
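The step deviation method for grouped data can be sketched as a single function that returns the mean, standard deviation and coefficient of variation together. This Python code is an illustration under stated assumptions: the name `step_deviation_stats` and its parameters are hypothetical, with `assumed_mean` playing the role of a and `width` the role of h.

```python
import math


def step_deviation_stats(midpoints, freqs, assumed_mean, width):
    """Mean, SD and CV (%) of grouped data by the step deviation method.

    d = (x - a) / h, mean = a + h * sum(fd) / N,
    sd = h * sqrt(sum(fd^2) / N - (sum(fd) / N)^2).
    """
    n = sum(freqs)
    d = [(x - assumed_mean) / width for x in midpoints]
    sum_fd = sum(f * di for f, di in zip(freqs, d))
    sum_fd2 = sum(f * di * di for f, di in zip(freqs, d))
    mean = assumed_mean + width * sum_fd / n
    sd = width * math.sqrt(sum_fd2 / n - (sum_fd / n) ** 2)
    return mean, sd, 100 * sd / mean


mids = [5.5, 15.5, 25.5, 35.5, 45.5, 55.5]
students = [3, 16, 26, 31, 16, 8]
mean, sd, cv = step_deviation_stats(mids, students, 25.5, 10)
print(round(mean, 2), round(sd, 3), round(cv, 2))
```

The choice of assumed mean does not change the result; any mid-value may be used for a, which only affects the size of the intermediate numbers.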
3. The AM and SD of a set of nine items are 43 and 5 respectively. If an item of value 63 is added, find the new mean and SD.
$\bar{x} = \frac{\sum x_i}{N}$, so Σxi = $\bar{x}$ x N
Σx = 43 x 9 = 387 for 9 items
Σx = 387 + 63 = 450 for 10 items
Modified mean $\bar{x} = \frac{\sum x}{N} = \frac{450}{10}$
$\bar{x}$ = 45
For 9 items, σ = 5 and $\bar{x}$ = 43:
σ² = $\frac{\sum x^2}{N} - (\bar{x})^2$
$25 = \frac{\sum x^2}{9} - (43)^2 = \frac{\sum x^2}{9} - 1849$
$\frac{\sum x^2}{9} = 25 + 1849 = 1874$
Σx² = 1874 x 9 = 16866 for 9 items
If 63 is added, Σx² = 16866 + (63)² = 20835 for 10 items
Modified σ² = $\frac{\sum x^2}{N} - (\bar{x})^2 = \frac{20835}{10} - (45)^2$ = 2083.5 – 2025 = 58.5
σ = $\sqrt{58.5}$ = 7.65 is the modified SD.
4. The mean of 5 observations is 4.4 and the variance is 8.24. If 3 of the five observations are 1, 2 and 6, find the values of the other two observations.
We know that $\bar{x} = \frac{\sum x}{N}$, so $4.4 = \frac{\sum x}{5}$
Σx = 22
σ² = $\frac{\sum x^2}{N} - (\bar{x})^2$
$8.24 = \frac{\sum x^2}{5} - (4.4)^2 = \frac{\sum x^2}{5} - 19.36$
$8.24 + 19.36 = \frac{\sum x^2}{5}$
Σx² = 138
Σx² = 1² + 2² + 6² + x1² + x2²
138 = 1 + 4 + 36 + x1² + x2²
x1² + x2² = 97 ---- (1)
Σx = 1 + 2 + 6 + x1 + x2
22 = 9 + x1 + x2
x1 + x2 = 13 ---- (2)
From (2), x2 = 13 – x1. Putting this in (1):
x1² + (13 – x1)² = 97
x1² + 169 + x1² – 26x1 = 97
2x1² – 26x1 + 72 = 0
x1² – 13x1 + 36 = 0
$x_1 = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} = \frac{-(-13) \pm \sqrt{169 - 4 \times 36}}{2} = \frac{13 \pm 5}{2}$
x1 = 6.5 ± 2.5, so x1 = 9 or x1 = 4
Hence x1 = 9 and x2 = 4.
5. The mean and S.D. of the frequency distribution of a continuous random variable x are 40.604 and 7.92 respectively. The distribution after a change of origin and scale is given below. Determine the actual class intervals.
d:   -3   -2   -1   0    1    2    3    4
f:   3    15   45   57   50   36   25   9
d    f    fd    fd²   MV     CI
-3   3    -9    27    22.5   20-25
-2   15   -30   60    27.5   25-30
-1   45   -45   45    32.5   30-35
0    57   0     0     37.5   35-40
1    50   50    50    42.5   40-45
2    36   72    144   47.5   45-50
3    25   75    225   52.5   50-55
4    9    36    144   57.5   55-60
N = 240, Σfd = 149, Σfd² = 695
$\bar{x} = a + h\,\frac{\sum fd}{N}$
$40.604 = a + h \times \frac{149}{240}$
40.604 = a + 0.62h ----- (1)
σ = $h\sqrt{\frac{\sum fd^2}{N} - \left(\frac{\sum fd}{N}\right)^2}$
$7.92 = h\sqrt{\frac{695}{240} - \left(\frac{149}{240}\right)^2} = h\sqrt{2.895 - 0.385}$
7.92 = h x 1.584
h = 4.998, i.e. h = 5
Put h = 5 in equation (1):
40.604 = a + 0.62 x 5
a = 37.5
The mid values (MV) are therefore a + hd = 37.5 + 5d, and the class intervals (CI) extend 2.5 on either side of each mid value, as shown in the table.
Combined Standard Deviation
Suppose we have different samples of various sizes n1, n2, n3, …… having means $\bar{x}_1, \bar{x}_2, \bar{x}_3$, …… and standard deviations σ1, σ2, σ3, ……. Then the combined standard deviation can be computed by the following formula (shown for two samples).
σ² (n1 + n2) = n1 (σ1² + d1²) + n2 (σ2² + d2²)
where d1 = $\bar{x}_1 - \bar{x}$, d2 = $\bar{x}_2 - \bar{x}$ and $\bar{x}$ is the combined mean.
1. The means of two samples of sizes 50 and 100 are 54.1 and 50.3 respectively, and their standard deviations are 8 and 7 respectively. Obtain the SD for the combined group.
n1 = 50, n2 = 100
$\bar{x}_1$ = 54.1, $\bar{x}_2$ = 50.3
σ1 = 8, σ2 = 7
$\bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2} = \frac{(50 \times 54.1) + (100 \times 50.3)}{50 + 100}$
$\bar{x}$ = 51.57
d1 = 54.1 – 51.57 = 2.53, d1² ≈ 6.42
d2 = 50.3 – 51.57 = -1.27, d2² ≈ 1.60
σ² (n1 + n2) = n1 (σ1² + d1²) + n2 (σ2² + d2²)
150σ² = 50 (8² + 6.42) + 100 (7² + 1.60)
3σ² = (64 + 6.42) + 2 (49 + 1.60)
3σ² = 70.42 + 2 x 50.60
σ² = 57.21
σ = 7.56
2. The mean wage is Rs. 75 per day and the SD of wages is Rs. 5 per day for a group of 1000 workers, and the same are Rs. 60 and Rs. 4.5 for another group of 1500 workers. Find the mean and standard deviation for the entire group.
We have by data,
x̄1 = 75, σ1 = 5, n1 = 1000
x̄2 = 60, σ2 = 4.5, n2 = 1500
Let x̄ and σ be the mean and SD of the entire group.
Consider x̄ = (n1 x̄1 + n2 x̄2) / (n1 + n2)
i.e., x̄ = (1000 × 75 + 1500 × 60) / (1000 + 1500) = 66
Also we have (n1 + n2) σ² = n1 (σ1² + d1²) + n2 (σ2² + d2²),
where d1 = x̄1 − x̄ = 75 − 66 = 9; d2 = x̄2 − x̄ = 60 − 66 = −6
∴ (1000 + 1500) σ² = 1000 (5² + 9²) + 1500 (4.5² + (−6)²)
∴ σ² = 76.15 or σ = 8.73
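The combined mean and combined standard deviation used in the two problems above can be sketched as a small function. The name `combined_mean_sd` is illustrative only; the formula is the one stated in the text, extended to any number of groups.

```python
from math import sqrt

def combined_mean_sd(sizes, means, sds):
    """Combined mean and SD of several groups from their sizes, means and SDs."""
    n = sum(sizes)
    mean = sum(ni * mi for ni, mi in zip(sizes, means)) / n
    # Σ n_i (σ_i² + d_i²) / Σ n_i, with d_i = group mean − combined mean
    var = sum(ni * (si ** 2 + (mi - mean) ** 2)
              for ni, mi, si in zip(sizes, means, sds)) / n
    return mean, sqrt(var)

# Problem 1: samples of sizes 50 and 100
print(combined_mean_sd([50, 100], [54.1, 50.3], [8, 7]))
# Problem 2: wage groups of 1000 and 1500 workers
print(combined_mean_sd([1000, 1500], [75, 60], [5, 4.5]))
```

Working with unrounded intermediate values, the results agree with the hand calculations (σ = 7.56 and σ = 8.73) to two decimal places.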
3. The arithmetic means of the runs scored by three batsmen A, B and C are 50, 48 and 12 respectively. The SDs of their runs are 15, 12 and 2 respectively. Who is the most consistent of the three batsmen? If one of the three is to be selected, who should be selected?

            A     B     C
AM (x̄)     50    48    12
SD (σ)     15    12     2

CVA = (σA / x̄A) × 100 = (15/50) × 100 = 30%
CVB = (σB / x̄B) × 100 = (12/48) × 100 = 25%
CVC = (σC / x̄C) × 100 = (2/12) × 100 = 16.67%

Evaluation Criteria
1. A lower CV indicates a more consistent player; hence the most consistent player is C.
2. The highest run scorer is A (x̄A = 50).
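The comparison of the three batsmen can be reproduced in a few lines. This is a minimal sketch; the helper name `coefficient_of_variation` is an assumption made for illustration.

```python
def coefficient_of_variation(sd, mean):
    """CV = (σ / x̄) × 100, expressed as a percentage."""
    return sd / mean * 100

# Means and SDs of the runs scored by batsmen A, B and C
batsmen = {"A": (50, 15), "B": (48, 12), "C": (12, 2)}
cv = {name: coefficient_of_variation(sd, mean)
      for name, (mean, sd) in batsmen.items()}

most_consistent = min(cv, key=cv.get)                          # lowest CV
highest_scorer = max(batsmen, key=lambda k: batsmen[k][0])     # highest mean
print(cv, most_consistent, highest_scorer)
```

The two criteria point to different players, which is why the text lists them separately.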
4. The coefficients of variation of two series are 75% and 90%, with SDs 15 and 18 respectively. Compute their means.
CVA = 75%, CVB = 90%
σA = 15, σB = 18
CV = (σ / x̄) × 100
75 = (15 / x̄A) × 100, so x̄A = 20
90 = (18 / x̄B) × 100, so x̄B = 20
5. Goals scored by two teams A and B in a football season are as shown below. By calculating the CV in each case, find which team may be considered more consistent.

No. of goals    No. of matches (f)        fx                  fx²
    (x)         Team A    Team B     Team A    Team B    Team A    Team B
     0            27        17          0         0         0         0
     1             9         9          9         9         9         9
     2             8         6         16        12        32        24
     3             5         5         15        15        45        45
     4             4         3         16        12        64        48
   Total        N = 53    N = 40     Σfx = 56  Σfx = 48  Σfx² = 150  Σfx² = 126

x̄A = Σfx / N = 56/53 = 1.0566
x̄B = Σfx / N = 48/40 = 1.2

σA² = Σfx²/N − (x̄A)² = 150/53 − (1.0566)² = 2.8302 − 1.1164 = 1.7138, so σA = 1.31
σB² = Σfx²/N − (x̄B)² = 126/40 − (1.2)² = 3.15 − 1.44 = 1.71, so σB = 1.31

CVA = (σA / x̄A) × 100 = (1.3091 / 1.0566) × 100 = 123.9%
CVB = (σB / x̄B) × 100 = (1.3077 / 1.2) × 100 = 109%
Since CVB < CVA, team B is more consistent.

6. The prices of shares A and B are given below. State which share is more stable in value.

Price A    x − x̄       (x − x̄)²     Price B    x − x̄        (x − x̄)²
  (x)     (x̄ = 53)                    (x)     (x̄ = 105)
   55        2            4           108        3            9
   54        1            1           107        2            4
   52       -1            1           105        0            0
   53        0            0           105        0            0
   56        3            9           106        1            1
   58        5           25           107        2            4
   52       -1            1           104       -1            1
   50       -3            9           103       -2            4
   51       -2            4           104       -1            1
   49       -4           16           101       -4           16
Σx = 530          Σ(x − x̄)² = 70    Σx = 1050         Σ(x − x̄)² = 40

x̄A = Σx / N = 530/10 = 53
x̄B = Σx / N = 1050/10 = 105
σA = √(70/10) = 2.65
σB = √(40/10) = 2
CVA = (σA / x̄A) × 100 = (2.65/53) × 100 = 4.99%
CVB = (σB / x̄B) × 100 = (2/105) × 100 = 1.90%
Since CVB is less than CVA, share B is more stable.

7. A student, while computing the coefficient of variation, obtained the mean and SD of 100 observations as 40 and 5.1 respectively. It was later discovered that he had wrongly copied an observation as 50 instead of 40. Calculate the correct coefficient of variation.
>> x̄ = Σx / n, i.e. 40 = Σx / 100
∴ Σx (incorrect) = 4000
Now correct Σx = 4000 − 50 + 40 = 3990
∴ correct x̄ = 3990/100 = 39.9
Let us consider σ² = Σx²/n − (x̄)²
(5.1)² = Σx²/100 − (40)²
i.e. (40)² + (5.1)² = Σx²/100, or Σx²/100 = 1626.01
∴ Σx² (incorrect) = 100 × 1626.01 = 162601
Now correct Σx² = 162601 − (50)² + (40)² = 161701
∴ correct σ² = (correct Σx²)/n − (correct x̄)²
i.e., correct σ² = 161701/100 − (39.9)² = 25, so correct σ = 5
Now correct coefficient of variation = (σ / x̄) × 100 = (5/39.9) × 100 = 12.53%
Hence correct C.V. = 12.53%
8. The mean and SD of 21 observations are 30 and 5 respectively. It was subsequently noted that one of the observations, 10, was incorrect. Omit it and determine the mean and SD of the rest.
>> x̄ = Σx / n, i.e. 30 = Σx / 21, or Σx = 630
∴ incorrect Σx = 630
Now omitting the incorrect value 10, new Σx = 630 − 10 = 620
New n = 21 − 1 = 20
New x̄ = 620/20 = 31
Next consider σ² = Σx²/n − (x̄)²
(5)² = Σx²/21 − (30)²
i.e. 900 + 25 = Σx²/21
∴ incorrect Σx² = 925 × 21 = 19425
Again omitting the incorrect value 10,
new Σx² = 19425 − (10)² = 19325, n = 20
Hence new σ² = (new Σx²)/20 − (new x̄)² = 19325/20 − (31)² = 5.25
∴ New σ = √5.25 = 2.29
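The two corrections above (a miscopied observation and an omitted observation) follow the same pattern: rebuild Σx and Σx² from the reported mean and SD, adjust them, and recompute. A minimal sketch, assuming a hypothetical helper `adjust_mean_sd` that is not part of the original notes:

```python
from math import sqrt

def adjust_mean_sd(n, mean, sd, removed=(), added=()):
    """Recompute (n, mean, SD) after removing some values and adding others."""
    sum_x = n * mean                      # Σx  = n * x̄
    sum_x2 = n * (sd ** 2 + mean ** 2)    # Σx² = n * (σ² + x̄²)
    sum_x += sum(added) - sum(removed)
    sum_x2 += sum(v * v for v in added) - sum(v * v for v in removed)
    new_n = n - len(removed) + len(added)
    new_mean = sum_x / new_n
    new_var = sum_x2 / new_n - new_mean ** 2
    return new_n, new_mean, sqrt(new_var)

# Problem 7: 50 was copied instead of 40
n7, mean7, sd7 = adjust_mean_sd(100, 40, 5.1, removed=[50], added=[40])
cv7 = sd7 / mean7 * 100

# Problem 8: the incorrect observation 10 is omitted
n8, mean8, sd8 = adjust_mean_sd(21, 30, 5, removed=[10])
print(mean7, sd7, cv7, mean8, sd8)
```

The helper works from summary statistics only, so it applies whenever the raw observations are unavailable, as in both problems.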
9. The mean of 200 items was 50. Later on it was discovered that two items were misread as 92 and 8 instead of 192 and 88. Find the correct mean.
>> x̄ = Σx / n, i.e. 50 = Σx / 200, or Σx = 10000
∴ incorrect Σx = 10000
Correct Σx = 10000 − 92 − 8 + 192 + 88 = 10180
∴ Correct mean = 10180/200 = 50.9
10. Find the missing frequencies in the following data, given that the median is 137.2.

Class       Frequency
100-110        15
110-120        44
120-130       133
130-140        f1
140-150       125
150-160        f2
160-170        35
170-180        16
            N = 600

>> We prepare the table with a column of cumulative frequencies and use the formula for the median.

Class       Frequency    cf
100-110        15        15
110-120        44        59
120-130       133        192
130-140        f1        192 + f1         (median class)
140-150       125        317 + f1
150-160        f2        317 + f1 + f2
160-170        35        352 + f1 + f2
170-180        16        368 + f1 + f2
            N = 600

Median = l + (h/f) (N/2 − c)
We can take the median class as 130-140, since the median is given to be 137.2. Here l = 130, h = 10, f = f1, c = 192.
∴ 137.2 = 130 + (10/f1) (300 − 192)
i.e., 137.2 − 130 = 1080/f1, i.e., 7.2 f1 = 1080, or f1 = 150
But the last cumulative frequency must be equal to N = 600,
i.e. 368 + f1 + f2 = 600
368 + 150 + f2 = 600
∴ f2 = 82
Thus f1 = 150, f2 = 82
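The solution above inverts the median formula for the unknown frequency of the median class and then uses the total N for the second unknown. A brief sketch of that calculation, with the illustrative (hypothetical) function name `missing_frequencies`:

```python
def missing_frequencies(median, lower, width, cf_before, total, known_sum):
    """Solve median = l + (h/f1)(N/2 − c) for f1, then f2 from the total N."""
    f1 = width * (total / 2 - cf_before) / (median - lower)
    f2 = total - known_sum - f1
    return f1, f2

# l = 130, h = 10, c = 192, N = 600; the known frequencies add up to 368
f1, f2 = missing_frequencies(median=137.2, lower=130, width=10,
                             cf_before=192, total=600, known_sum=368)
print(round(f1), round(f2))
```

The values are rounded before printing because floating-point division may leave a tiny fractional residue.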
Relationship between various measures of dispersion
For a normal distribution, the various measures of dispersion are related as follows:
1. Mean ± QD covers 50% of the observations of the distribution
2. Mean ± MD covers 57.5% of observations
3. Mean ± 1σ includes 68.27% of observations
4. Mean ± 2σ includes 95.45% of observations
5. Mean ± 3σ includes 99.73% of observations
6. QD = 0.6745 σ ≈ (2/3) σ
7. MD = √(2/π) × σ ≈ (4/5) σ
8. QD ≈ (5/6) MD
9. Combining the results we get 3 QD = 2 SD and 5 MD = 4 SD, which is also equal to 6 QD.
10. Range = 6 times SD.

SOURCES AND REFERENCES
8. Statistics for Management, Richard I Levin, PHI / 2000.
9. Statistics, RSN Pillai and Bagavathi, S. Chands, Delhi.
10. An Introduction to Statistical Method, C.B. Gupta, & Vijaya Gupta, Vikasa Publications, 23e/2006.
11. Business Statistics, C.M. Chikkodi and Salya Prasad, Himalaya Publications, 2000.
12. Statistics, D.C. Sancheti and Kappor, Sultan Chand and Sons, New Delhi, 2004.
13. Fundamentals of Statistics, D.N. Elhance and Veena and Aggarwal, KITAB Publications, Kolkata, 2003.
14. Business Statistics, Dr. J.S. Chandan, Prof. Jagit Singh and Kanna, Vikas Publications, 2006.
CORRELATION ANALYSIS
Concept and Importance of Correlation
We may come across certain series in which more than one variable is recorded. A distribution in which two variables are measured on each unit is called a Bivariate Distribution. If we measure more than two variables on each unit of a distribution, it is called a Multivariate Distribution. In a bivariate distribution, we may be interested to find whether there is any relationship between the two variables under study. Correlation is a statistical tool which studies the relationship between two variables, and correlation analysis involves the various methods and techniques used for studying and measuring the extent of that relationship.
“When the relationship is of a quantitative nature, the appropriate statistical tool for discovering & measuring the relationship and expressing it in a brief formula is known as correlation.” - Croxton & Cowden
“Correlation is an analysis of the covariation between two or more variables.” - A. M. Tuttle
“Correlation Analysis contributes to the understanding of economic behaviour, aids in locating the critically important variables on which others depend, may reveal to the economist the connections by which disturbances spread and suggest to him the paths through which stabilizing forces may become effective.” - W. A. Neiswanger
“The effect of correlation is to reduce the range of uncertainty of our prediction.” - Tippett
The problem of analyzing the association between two variables can be broken down into three steps.
o We try to know whether the two variables are related or independent of each other.
o If we find that there is a relationship between the two variables, we try to know its nature and strength. This means whether these variables have a positive or a negative relationship and how close that relationship is.
o We may like to know if there is a causal relationship between them. This means that the variation in one variable causes variation in another.
When data regarding two or more variables are available, we may study the related variation of these variables. For example, in data regarding heights (x) and weights (y) of students of a college, we find that those students who have greater height tend to have greater weight. Also, students who have lesser height tend to have lesser weight. This type of related variation among variables is called correlation.
Correlation may be (i) Simple correlation (ii) Multiple correlation (iii) Partial correlation. Simple correlation is concerned with related variation between two variables. Multiple correlation and partial correlation are concerned with related variation among three or more variables.
Two variables are said to be correlated when they vary such that
a. The higher values of one variable correspond to the higher values of the other and the lower values of the variable correspond to the lower values of the other, or
b. The higher values of one variable correspond to the lower values of the other.
Generally, it can be seen that those who are tall will have greater weight, and those who are short will have lesser weight. Thus height (x) and weight (y) of persons show related variation, and so they are correlated. On the other hand, production (x) and price (y) of vegetables show variation in opposite directions. Here, the higher the production, the lower would be the price.
In both the above examples, the variables x and y show related variation. And so they are correlated. TYPES OF CORRELATION Correlation is positive (direct) if the variables vary in the same directions, that is, if they increase and decrease together. Height (x) and weight (y) of persons are positively correlated. Correlation is negative (inverse) if the variables vary in the opposite directions, that is, if one variable increases the other variable decreases. Production (x) and price (y) of vegetables are negatively correlated. If variables do not show related variation, they are said to be non – correlated. If variables show exact linear relationship, they are said to be perfectly correlated. Perfect correlation may be positive or negative.
Correlation and Causation
A correlation between two variables does not by itself show that one causes the other, for the following reasons:
o The correlation may be due to chance, particularly when the data pertain to a small sample.
o It is possible that both the variables are influenced by one or more other variables.
o There may be another situation where both the variables may be influencing each other so that we cannot say which is the cause and which is the effect.
Types of Correlation
o Positive and Negative: If the values of the two variables deviate in the same direction, i.e., if an increase in the values of one variable results, on an average, in a corresponding increase in the values of the other variable, or if a decrease in the values of one variable results, on an average, in a corresponding decrease in the values of the other variable, correlation is said to be positive or direct. For example: Price & Supply of the commodity. On the other hand, correlation is said to be negative or inverse if the variables deviate in the opposite direction, i.e., if the increase (decrease) in the values of one variable results, on the average, in a corresponding decrease (increase) in the values of the other variable. For example: Temperature and Sale of Woolen Garments.
o Linear and Non-Linear: The correlation between two variables is said to be linear if, corresponding to a unit change in one variable, there is a constant change in the other variable over the entire range of the values. For example: y = ax + b. The relationship between two variables is said to be non-linear or curvilinear if, corresponding to a unit change in one variable, the other variable does not change at a constant rate but at a fluctuating rate. When this is plotted on a graph, it will not be a straight line.
o Simple, Partial and Multiple: The distinction amongst these three types of correlation depends upon the number of variables involved in a study. If only two variables are involved in a study, then the correlation is said to be simple correlation. When three or more variables are involved in a study, then it is a problem of either partial or multiple correlation. In multiple correlation, three or more variables are studied simultaneously. But in partial correlation we consider only two variables influencing each other while the effect of the other variable is held constant. For example: Let us suppose that we have three variables, number of hours studied (x); IQ (y); marks obtained (z). In a multiple correlation we will study the correlation of z with the 2 variables x & y. In contrast, when we study the relationship between x & z, keeping an average IQ as constant, it is said to be a study involving partial correlation.
Methods of Correlation
GRAPHIC: Scatter diagram
ALGEBRAIC: Covariance method, Rank correlation, Concurrent deviation method

Process of Calculating Coefficient of Correlation
o Calculate the means of the two series, X and Y.
o Take deviations in the two series from their respective means, indicated as x and y. The deviation should be taken in each case as the value of the individual item minus the arithmetic mean.
o Square the deviations in both the series and obtain the sums of the squared-deviation columns. This gives ∑x² and ∑y².
o Take the products of the deviations and add them to obtain ∑xy. This means individual deviations are multiplied by the corresponding deviations in the other series and then their sum is obtained.
o The values thus obtained in the preceding steps, ∑xy, ∑x² and ∑y², are used in the formula for correlation.

SCATTER DIAGRAM METHOD
A scatter diagram is a graphic presentation of bivariate data. Here, bivariate data with n pairs of values is represented by n points on the xy-plane. The two variables are taken along the two axes, and every pair of values in the data is represented by a point on the graph. The pattern of distribution of points on the graph can be used for a rough estimation of the degree of correlation between the variables. In the scatter diagram:
a. If the points form a line with positive slope (a line moving upwards), the variables are positively and perfectly correlated.
b. If the points form a line with negative slope (a line moving downwards), the variables are negatively and perfectly correlated.
c. If the points cluster around a line with positive slope, the variables are positively correlated.
d. If the points cluster around a line with negative slope, the variables are negatively correlated.
e. If the points are spread all over the graph, the variables are non-correlated.
f. Any other curved form of spread of points indicates a curvilinear relation between the variables.
The scatter diagram is one of the simplest ways of diagrammatic representation of a bivariate distribution and provides us one of the simplest tools for ascertaining the correlation between two variables. Suppose we are given n pairs of values of two variables X and Y. For example, if the variables X and Y denote the height and weight respectively, then the pairs may represent the heights and weights (in pairs) of n individuals. These n points may be plotted as dots (.) in the xy-plane. (It is customary to take the independent variable along the x-axis and the dependent variable along the y-axis.) The diagram of dots so obtained is known as a scatter diagram. From the scatter diagram we can form a fairly good, though rough, idea about the relationship between the two variables. The following points may be borne in mind in interpreting the scatter diagram regarding the correlation between the two variables:
1. If the points are very dense, i.e. very close to each other, a fairly good amount of correlation may be expected between the two variables. On the other hand, if the points are widely scattered, a poor correlation may be expected between them.
2. If the points on the scatter diagram reveal any trend (either upward or downward), the variables are said to be correlated, and if no trend is revealed, the variables are uncorrelated.
3. If there is an upward trend rising from the lower left hand corner and going upward to the upper right hand corner, the correlation is positive, since this reveals that the values of the two variables move in the same direction. If, on the other hand, the points depict a downward trend from the upper left hand corner, the correlation is negative, since in this case the values of the two variables move in opposite directions.
4. In particular, if all the points lie on a straight line starting from the left bottom and going up towards the right top, the correlation is perfect and positive, and if all the points lie on a straight line starting from the left top and coming down to the right bottom, the correlation is perfect and negative.
5. The method of scatter diagram is readily comprehensible and enables us to form a rough idea of the nature of the relationship between the two variables merely by inspection of the graph. Moreover, this method is not affected by extreme observations, whereas all mathematical formulae for ascertaining correlation between two variables are affected by extreme observations. However, this method is not suitable if the number of observations is fairly large.
6. The method of scatter diagram tells us about the nature of the relationship, whether it is positive or negative and whether it is high or low. It does not provide us an exact measure of the extent of the relationship between the two variables.
7. The scatter diagram enables us to obtain an approximate estimating line or line of best fit by the free hand method. The method generally consists in stretching a piece of thread through the plotted points to locate the best possible line.

KARL PEARSON’S COEFFICIENT OF CORRELATION (COVARIANCE METHOD; PRODUCT MOMENT)
This is a measure of linear relationship between the two variables. It indicates the degree of correlation between the two variables. It is denoted by ‘r’.
INTERPRETATION OF COEFFICIENT OF CORRELATION
a. A positive value of r indicates positive correlation.
b. A negative value of r indicates negative correlation.
c. r = +1 means correlation is perfect positive.
d. r = -1 means correlation is perfect negative.
e. r = 0 (or low) means the variables are non-correlated.
Karl Pearson’s measure, known as the Pearsonian correlation coefficient between two variables (series) X and Y and usually denoted by r, is a numerical measure of linear relationship between them. It is defined as the ratio of the covariance between X and Y, written as Cov (x, y), to the product of the standard deviations of X and Y, i.e. r = Cov (x, y) / [SD (x) × SD (y)].
Assumptions of the Karl Pearson’s Correlation
o The two variables X and Y are linearly related.
o The two variables are affected by several causes, which are independent, so as to form a normal distribution.
Coefficient of Determination
The strength of r is judged by the coefficient of determination, r². For r = 0.9, r² = 0.81. We multiply it by 100, thus getting 81 per cent. This suggests that when r is 0.9, we can say that 81 per cent of the total variation in the Y series can be attributed to the relationship with X.
Rank Correlation Limitations of Spearman’s Method of Correlation o Spearman’s r is a distribution-free or non parametric measure of correlation. o As such, the result may not be as dependable as in the case of ordinary correlation where the distribution is known. o Another limitation of rank correlation is that it cannot be applied to a grouped frequency distribution. o When the number of observations is quite large and one has to assign ranks to the observations in the two series, then such an exercise becomes rather tedious and time-consuming. This becomes a major limitation of rank correlation.
Some Limitations of Correlation Analysis o Correlation analysis cannot determine cause-and-effect relationship. o Another mistake that occurs frequently is on account of misinterpretation of the coefficient of correlation and the coefficient of determination. o Another mistake in the interpretation of the coefficient of correlation occurs when one concludes a positive or negative relationship even though the two variables are actually unrelated.
Properties of Correlation Coefficient
Property 1 - Limits for Correlation Coefficient
The Pearsonian correlation coefficient cannot exceed 1 numerically. In other words, it lies between -1 and +1. Symbolically: -1 ≤ r ≤ 1. r = +1 implies perfect positive correlation between the variables.
Property 2 - Correlation Coefficient is independent of the change of origin and scale.
Mathematically, if X and Y are given and they are transformed to the new variables U and V by the change of origin and scale, viz.
u = (x – A)/h and v = (y – B)/k
where A, B, h > 0 and k > 0 are constants, then the correlation coefficient between x and y is the same as the correlation coefficient between u and v, i.e., r (x, y) = r (u, v), or rxy = ruv.
Property 3 - Two independent variables are uncorrelated but the converse is not true.
Remarks: One should not confuse uncorrelation with independence. rxy = 0, i.e., uncorrelation between the variables x and y, simply implies the absence of any linear (straight line) relationship between them. They may, however, be related in some other form (other than a straight line), e.g., quadratic (as we have seen in the above example), logarithmic or trigonometric form.
Property 4 - If the variables x and y are connected by a linear equation ax + by + c = 0, the correlation coefficient between x and y is (+1) if the signs of a and b are different and (-1) if the signs of a and b are alike.
Interpretation of r
The following general points may be borne in mind while interpreting an observed value of the correlation coefficient r:
If r = -1, there is perfect negative correlation between the variables. In this case the scatter diagram will again be a straight line.
If r = 0, the variables are uncorrelated in other words there is no linear (straight line) relationship between the variables. However, r = 0 does not imply that the variables are independent. For other values of r lying between + 1 and – 1 there are no set guidelines for its interpretation. The maximum we can conclude is that nearer the value of r to 1, the closer is the relationship between the variables and nearer is the value of r to 0 the less close is the relationship between them. One should be very careful in interpreting the value of r as it is often misinterpreted. The reliability or the significance of the value of the correlation depends on a number of factors. One of the ways of testing the significance of r is finding its probable error, which in addition to the value of r takes into account the size of the sample also. Another more useful measure for interpreting the value of r is the coefficient of determination. It is observed there that the closeness of the relation ship between two variables is not proportional to r.
In total the Properties are:
o Limits for Correlation Coefficient.
o Independent of the change of origin & scale.
o Two independent variables are uncorrelated but the converse is not true.
o If variables x & y are connected by a linear equation ax + by + c = 0, then the correlation coefficient between x & y is (+1) if the signs of a, b are different & (-1) if the signs of a, b are alike.
Important Formulas:
r = [nΣdxdy − Σdx.Σdy] / √{[nΣdx² − (Σdx)²] × [nΣdy² − (Σdy)²]}
r = Σxy / √[Σx².Σy²]
r = [Cov (x, y)] / [SD (x) × SD (y)]
The choice of formula depends on the situation. Following are some problems which are solved using different formulas. We can notice that, irrespective of the formula, the answer remains the same. Problems 1, 2 and 3 are solved with different formulas for the same data.

Problem 1: Deviations taken from the actual means (X̄ = 65, Ȳ = 66)

  X     Y    x = X − X̄   y = Y − Ȳ     x²      y²      xy
  39    47      -26         -19        676     361     494
  65    53        0         -13          0     169       0
  62    58       -3          -8          9      64      24
  90    86       25          20        625     400     500
  82    62       17          -4        289      16     -68
  75    68       10           2        100       4      20
  25    60      -40          -6       1600      36     240
  98    91       33          25       1089     625     825
  36    51      -29         -15        841     225     435
  78    84       13          18        169     324     234
 650   660        0           0       5398    2224    2704

r = Σxy / √[Σx².Σy²] = 2704 / √[5398 × 2224]
r = 0.7804

Problem 2: Deviations taken from assumed means (dx = X − 70, dy = Y − 60)

  X     Y     dx     dy     dx²     dy²    dxdy
  39    47   -31    -13     961     169     403
  65    53    -5     -7      25      49      35
  62    58    -8     -2      64       4      16
  90    86    20     26     400     676     520
  82    62    12      2     144       4      24
  75    68     5      8      25      64      40
  25    60   -45      0    2025       0       0
  98    91    28     31     784     961     868
  36    51   -34     -9    1156      81     306
  78    84     8     24      64     576     192
 650   660   -50     60    5648    2584    2404

r = [nΣdxdy − Σdx.Σdy] / √{[nΣdx² − (Σdx)²] × [nΣdy² − (Σdy)²]}
r = [10 × 2404 − (−50)(60)] / √{[10 × 5648 − (−50)²] × [10 × 2584 − (60)²]}
r = 27040 / √[53980 × 22240]
r = 0.7804

Problem 3: Direct method using the original values

  X     Y      X²      Y²      XY
  39    47    1521    2209    1833
  65    53    4225    2809    3445
  62    58    3844    3364    3596
  90    86    8100    7396    7740
  82    62    6724    3844    5084
  75    68    5625    4624    5100
  25    60     625    3600    1500
  98    91    9604    8281    8918
  36    51    1296    2601    1836
  78    84    6084    7056    6552
 650   660   47648   45784   45604

r = [nΣXY − ΣX.ΣY] / √{[nΣX² − (ΣX)²] × [nΣY² − (ΣY)²]}
r = [10 × 45604 − 650 × 660] / √{[10 × 47648 − (650)²] × [10 × 45784 − (660)²]}
r = 27040 / √[53980 × 22240]
r = 0.7804
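That the three formulas give the same coefficient for the same data can be verified numerically. The sketch below uses the ten (X, Y) pairs of Problems 1 to 3; the function names are illustrative, and the assumed means 70 and 60 are the ones implied by the deviation columns of Problem 2.

```python
from math import sqrt

X = [39, 65, 62, 90, 82, 75, 25, 98, 36, 78]
Y = [47, 53, 58, 86, 62, 68, 60, 91, 51, 84]

def r_from_sums(n, sx, sy, sxx, syy, sxy):
    """r = [nΣxy − ΣxΣy] / sqrt([nΣx² − (Σx)²][nΣy² − (Σy)²])."""
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

def r_mean_deviations(xs, ys):
    """Problem 1: r = Σxy / sqrt(Σx² Σy²) with deviations from the actual means."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    x = [v - mx for v in xs]
    y = [v - my for v in ys]
    sxy = sum(a * b for a, b in zip(x, y))
    return sxy / sqrt(sum(a * a for a in x) * sum(b * b for b in y))

def r_assumed_means(xs, ys, a, b):
    """Problem 2: deviations dx = X − a, dy = Y − b from assumed means."""
    dx = [v - a for v in xs]
    dy = [v - b for v in ys]
    return r_from_sums(len(xs), sum(dx), sum(dy),
                       sum(v * v for v in dx), sum(v * v for v in dy),
                       sum(p * q for p, q in zip(dx, dy)))

def r_direct(xs, ys):
    """Problem 3: the same formula applied to the original values."""
    return r_assumed_means(xs, ys, 0, 0)

r1 = r_mean_deviations(X, Y)
r2 = r_assumed_means(X, Y, 70, 60)
r3 = r_direct(X, Y)
print(round(r1, 4), round(r2, 4), round(r3, 4))
```

The agreement is a direct consequence of Property 2: a change of origin does not alter r.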
Problem 4: From the following data, calculate “n”: correlation coefficient = 0.8; summation of products of deviations, Σxy = 60; SD of y = 2.5; Σx² = 90. Here x & y are the deviations from their arithmetic means.
Answer:
r = [Cov (x, y)] / [SD (x) × SD (y)]
0.8 = [(1/n)(60)] / [√(90/n) × 2.5]
Squaring both sides:
0.8 × 0.8 = [(1/n) × (1/n) × 60 × 60] / [(90/n) × 2.5 × 2.5]
0.64 = 3600 / (562.5 n) = 6.4 / n
n = 10

Problem 5: A computer, while calculating the correlation coefficient between x & y from 25 pairs of observations, obtained the following: ΣX = 125, ΣX² = 650, ΣY = 100, ΣY² = 460, ΣXY = 508. Later it was observed that two pairs of observations were taken as (6, 14) and (8, 6) instead of (8, 12) and (6, 8). Prove that the correct correlation coefficient is 0.67.
Answer: Deduct the wrong values [(6, 14) and (8, 6)] from each sum, add the correct values [(8, 12) and (6, 8)], and then apply the formula.
Correct ΣX = 125 − (6 + 8) + (8 + 6) = 125
Correct ΣY = 100 − (14 + 6) + (12 + 8) = 100
Correct ΣX² = 650 − (36 + 64) + (64 + 36) = 650
Correct ΣY² = 460 − (196 + 36) + (144 + 64) = 436
Correct ΣXY = 508 − (84 + 48) + (96 + 48) = 520
r = [25 × 520 − 125 × 100] / √{[25 × 650 − (125)²] × [25 × 436 − (100)²]}
r = 500 / √[625 × 900] = 500/750 = 2/3 = 0.67

Problem 6: If the relation between two random variables x & y is 2x + 3y = 4, then the correlation coefficient is:
Answer: -1 (by the property, since the signs of a and b are alike)

Problem 7: In two sets of variables X & Y with 50 observations each, the following data was observed: AM of X is 10; SD of X is 3; AM of Y is 6; SD of Y is 2; coefficient of correlation is 0.3. However, after subsequent verification, one pair (10, 6) was weeded out. What is the change in the correlation coefficient with the remaining 49 pairs of values?
Answer: As in Problem 5, first find all the terms in the formula, then deduct the weeded-out pair (10, 6) from those terms and apply the formula again.
ΣX = 50 × 10 = 500, ΣY = 50 × 6 = 300
ΣX² = 50 (3² + 10²) = 5450, ΣY² = 50 (2² + 6²) = 2000
ΣXY = 50 (0.3 × 3 × 2 + 10 × 6) = 3090
After removing (10, 6): n = 49, ΣX = 490, ΣY = 294, ΣX² = 5350, ΣY² = 1964, ΣXY = 3030
r = [49 × 3030 − 490 × 294] / √{[49 × 5350 − (490)²] × [49 × 1964 − (294)²]}
r = 4410 / √[22050 × 9800] = 4410/14700 = 0.3
The correlation coefficient is unchanged, because the removed pair coincides with the means of X and Y.
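Problems 5 and 7 both work from summary sums rather than raw data, so they can be checked with one helper that applies the sums formula after the wrong pairs are deducted and the right pairs added. This is a sketch only; `r_from_sums` and `corrected_r` are illustrative names and not part of the original notes.

```python
from math import sqrt

def r_from_sums(n, sx, sy, sxx, syy, sxy):
    """r = [nΣXY − ΣXΣY] / sqrt([nΣX² − (ΣX)²][nΣY² − (ΣY)²])."""
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

def corrected_r(n, sx, sy, sxx, syy, sxy, removed=(), added=()):
    """Deduct the removed (x, y) pairs, add the new ones, then recompute r."""
    for sign, pairs in ((-1, removed), (+1, added)):
        for x, y in pairs:
            n += sign
            sx += sign * x
            sy += sign * y
            sxx += sign * x * x
            syy += sign * y * y
            sxy += sign * x * y
    return r_from_sums(n, sx, sy, sxx, syy, sxy)

# Problem 5: (6, 14) and (8, 6) were recorded instead of (8, 12) and (6, 8)
r5 = corrected_r(25, 125, 100, 650, 460, 508,
                 removed=[(6, 14), (8, 6)], added=[(8, 12), (6, 8)])

# Problem 7: rebuild the sums from the means, SDs and r, then drop (10, 6)
n, mx, sdx, my, sdy, r = 50, 10, 3, 6, 2, 0.3
sums = (n * mx, n * my, n * (sdx ** 2 + mx ** 2), n * (sdy ** 2 + my ** 2),
        n * (r * sdx * sdy + mx * my))
r7 = corrected_r(n, *sums, removed=[(10, 6)])
print(round(r5, 4), round(r7, 4))
```

In Problem 7 the coefficient does not change, since the discarded pair lies exactly at the two means and so contributes nothing to any of the deviation sums.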
PROBABLE ERROR
After computing the value of the correlation coefficient, the next step is to find the extent to which it is dependable. The probable error of the correlation coefficient, usually denoted by P.E. (r), is an old measure for testing the reliability of an observed value of the correlation coefficient in so far as it depends upon the conditions of random sampling. If r is the observed correlation coefficient in a sample of n pairs of observations, then its standard error, usually denoted by S.E. (r), is given by
SE (r) = (1 – r²) / √n
PE (r) = SE (r) × 0.6745
The reason for taking the factor 0.6745 is that in a normal distribution 50% of the distribution lies in the range μ ± 0.6745 σ, where μ is the mean and σ is the s.d.
According to Secrist, “The probable error of the correlation coefficient is an amount which if added to and subtracted from the mean correlation coefficient, produces amounts within which the chances are even that a coefficient of correlation from a series selected at random will fall.”
Uses of probable error
The probable error of the correlation coefficient may be used to determine the limits within which the population correlation coefficient may be expected to lie.
1. The limits for the population correlation coefficient are r ± P.E. (r). This implies that if we take another random sample of the same size n from the same population from which the first sample was taken, then the observed value of the correlation coefficient, say r1, in the second sample can be expected to lie within the limits given.
2. P.E. (r) may be used to test if an observed value of the sample correlation coefficient is significant of any correlation in the population. The following guidelines may be used:
a. If r < P.E. (r), i.e., if the observed value of r is less than its P.E., then the correlation is not at all significant.
b. If r > 6 P.E. (r), i.e., if the observed value of r is greater than 6 times its P.E., then r is definitely significant.
c. In other situations nothing can be concluded with certainty.
Important Remarks 1: Sometimes P.E. may lead to fallacious conclusions, particularly when n, the number of pairs of observations, is small. In order to use P.E. effectively, n should be fairly large. However, a rigorous test for the significance of an observed sample correlation coefficient is provided by Student’s t test.
Important Remarks 2: P.E. can be used only under the following conditions
a. The data must have been drawn from a normal population.
b. The conditions of random sampling should prevail in selecting the sampled observations.
In summary: if r < PE (r), r is not at all significant; if r > 6 PE (r), r is significant; in other cases nothing can be concluded with certainty.

Problem 1: Comment whether the correlation coefficient is significant or not.

  X     Y    dx = (X−60)/5   dy = (Y−65)/5    dx²    dy²    dxdy
  45    35        -3              -6            9     36     18
  70    90         2               5            4     25     10
  65    70         1               1            1      1      1
  30    40        -6              -5           36     25     30
  90    95         6               6           36     36     36
  40    40        -4              -5           16     25     20
  50    60        -2              -1            4      1      2
  75    80         3               3            9      9      9
  85    80         5               3           25      9     15
  60    50         0              -3            0      9      0
 Total             2              -2          140    176    141

SE (r) = (1 – r²) / √n = [1 – (0.9)²] / √10
SE (r) = 0.0600
PE (r) = SE (r) × 0.6745 = 0.06 × 0.6745
PE (r) = 0.0405
0.9 > 6 PE (r) [i.e., 0.2432], so r is highly significant.

Working Note:
r = [nΣdxdy − Σdx.Σdy] / √{[nΣdx² − (Σdx)²] × [nΣdy² − (Σdy)²]}
r = [10 × 141 − (2)(−2)] / √{[10 × 140 − (2)²] × [10 × 176 − (−2)²]}
r = 1414 / √[1396 × 1756]
r = 0.90
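The significance check by probable error can be expressed compactly. The following sketch applies the rule stated in the text; the function names are illustrative only, and r = 0.9 with n = 10 are the values of the problem above.

```python
from math import sqrt

def probable_error(r, n):
    """PE(r) = 0.6745 × SE(r), where SE(r) = (1 − r²) / sqrt(n)."""
    return 0.6745 * (1 - r ** 2) / sqrt(n)

def significance(r, n):
    """Apply the guideline: r < PE, r > 6 PE, or inconclusive."""
    pe = probable_error(r, n)
    if abs(r) < pe:
        return "not significant"
    if abs(r) > 6 * pe:
        return "significant"
    return "inconclusive"

pe = probable_error(0.9, 10)
print(round(pe, 4), round(6 * pe, 4), significance(0.9, 10))
```

The absolute value of r is used so that the same rule also covers negative coefficients; this is an assumption of the sketch, as the text states the rule for r only.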
CORRELATION IN BIVARIATE FREQUENCY TABLE
If in a bivariate distribution the data are fairly large, they may be summarized in the form of a two-way table. Here, for each variable, the values are grouped into various classes (not necessarily the same for both the variables), keeping in view the same considerations as in the case of a univariate distribution. For example, if there are m classes for the X-variable series and n classes for the Y-variable series, then there will be m x n cells in the two-way table. By going through the different pairs of values (x, y) and using tally marks we can find the frequency for each cell and thus obtain the so-called bivariate frequency table.
r = [NΣfxy – (Σfx)(Σfy)] / √{[NΣfx² – (Σfx)²] × [NΣfy² – (Σfy)²]}
With step deviations u and v of the class mid values:
r = [NΣfuv – (Σfu)(Σfv)] / √{[NΣfu² – (Σfu)²] × [NΣfv² – (Σfv)²]}
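Problem 1: Family income (Rs.) and food expenditure (in %)

Food expenditure              Family Income (Rs.)
    (in %)        200-300  300-400  400-500  500-600  600-700
    10-15            -        -        -        3        7
    15-20            -        4        9        4        3
    20-25            7        6       12        5        -
    25-30            3       10       19        8        -

Working table, where x and y are the class mid values, u = (x − 450)/100 and v = (y − 17.5)/5:

                   x:   250    350    450    550    650
  CI      y     v   u:   -2     -1      0      1      2       f     fv    fv²    fuv
 10-15   12.5  -1         -      -      -      3      7      10    -10     10    -17
 15-20   17.5   0         -      4      9      4      3      20      0      0      0
 20-25   22.5   1         7      6     12      5      -      30     30     30    -15
 25-30   27.5   2         3     10     19      8      -      40     80    160    -16
   f                     10     20     40     20     10     100    100    200    -48
   fu                   -20    -20      0     20     20       0
   fu²                   40     20      0     20     40     120
   fuv                  -26    -26      0     18    -14     -48

r = [NΣfuv – (Σfu)(Σfv)] / √{[NΣfu² – (Σfu)²] × [NΣfv² – (Σfv)²]}
r = [100 × (−48) – 0 × 100] / √{[(100 × 120) − 0²] × [(100 × 200) − (100)²]}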
r = -0.4382

Problem 2: Age in years and marks

                    Age in Years
 Marks      18     19     20     21     22
 22.5        3      2      -      -      -
 17.5        -      5      4      -      -
 12.5        -      -      7     10      -
  7.5        -      -      -      3      2
  2.5        -      -      -      3      1

Working table, where x is the age, y is the marks, u = x − 20 and v = (y − 12.5)/5:

             x:    18     19     20     21     22
   y     v   u:    -2     -1      0      1      2       f     fv    fv²    fuv
 22.5    2          3      2      -      -      -       5     10     20    -16
 17.5    1          -      5      4      -      -       9      9      9     -5
 12.5    0          -      -      7     10      -      17      0      0      0
  7.5   -1          -      -      -      3      2       5     -5      5     -7
  2.5   -2          -      -      -      3      1       4     -8     16    -10
   f                3      7     11     16      3      40      6     50    -38
   fu              -6     -7      0     16      6       9
   fu²             12      7      0     16     12      47
   fuv            -12     -9      0     -9     -8     -38

r = [NΣfuv – (Σfu)(Σfv)] / √{[NΣfu² – (Σfu)²] × [NΣfv² – (Σfv)²]}
r = [40 × (−38) – 9 × 6] / √{[(40 × 47) − (9)²] × [(40 × 50) − (6)²]}
r = -0.8373
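The step-deviation formula for a bivariate frequency table can be sketched as a single pass over the cells. The function name and the table layout (one row per v value, one column per u value, empty cells entered as 0) are assumptions made for this illustration.

```python
from math import sqrt

def r_from_bivariate_table(u, v, table):
    """table[i][j] is the frequency of the cell with row value v[i] and column value u[j]."""
    n = sfu = sfv = sfu2 = sfv2 = sfuv = 0
    for vi, row in zip(v, table):
        for uj, f in zip(u, row):
            n += f
            sfu += f * uj
            sfv += f * vi
            sfu2 += f * uj * uj
            sfv2 += f * vi * vi
            sfuv += f * uj * vi
    return (n * sfuv - sfu * sfv) / sqrt((n * sfu2 - sfu ** 2) * (n * sfv2 - sfv ** 2))

# Family income (u) against food expenditure (v)
income_food = [[0, 0, 0, 3, 7],
               [0, 4, 9, 4, 3],
               [7, 6, 12, 5, 0],
               [3, 10, 19, 8, 0]]
r_food = r_from_bivariate_table([-2, -1, 0, 1, 2], [-1, 0, 1, 2], income_food)

# Age in years (u) against marks (v)
age_marks = [[3, 2, 0, 0, 0],
             [0, 5, 4, 0, 0],
             [0, 0, 7, 10, 0],
             [0, 0, 0, 3, 2],
             [0, 0, 0, 3, 1]]
r_marks = r_from_bivariate_table([-2, -1, 0, 1, 2], [2, 1, 0, -1, -2], age_marks)
print(round(r_food, 4), round(r_marks, 4))
```

Because r is independent of origin and scale, the step deviations u and v give the same coefficient as the original mid values would.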
RANK CORRELATION METHOD
Sometimes we come across statistical series in which the variables under consideration are not capable of quantitative measurement but can be arranged in a serial order. This happens when we are dealing with qualitative characteristics (attributes) such as honesty, beauty, character, morality, etc., which cannot be measured quantitatively but can be arranged serially. In such situations Karl Pearson’s coefficient of correlation cannot be used as such. Charles Edward Spearman, a British psychologist, developed a formula in 1904 which consists in obtaining the correlation coefficient between the ranks of n individuals in the two attributes under study, say A and B. The Pearson correlation coefficient between the ranks X and Y is called the rank correlation coefficient between the characteristics A and B for that group of individuals.
For example, students are assigned ranks in Statistics according to their marks in Statistics. Also, they are assigned ranks in Mathematics according to their marks in Mathematics. Then, the correlation between these two sets of ranks is called rank correlation. The coefficient of correlation computed for these ranks is called Spearman’s coefficient of rank correlation.
In bivariate data, if the values of the variables are ranked in decreasing (or increasing) order, the correlation between these ranks is rank correlation. The coefficient of correlation computed for these ranks is Spearman’s coefficient of rank correlation. It is denoted by ρ (Rho).
If R1 and R2 are the ranks in the two characteristics, and d = R1 – R2 is the difference between the ranks, the coefficient of rank correlation is
ρ = 1 – 6∑d² / (n³ – n)
Since ρ is the product moment coefficient of correlation between the ranks, it is a value between -1 and +1.
Karl Pearson’s coefficient of correlation can be calculated only if the characteristics under study are quantitative (they should be numerically measurable), but Spearman’s coefficient of rank correlation can be calculated even if the characteristics under study are qualitative.
If it is possible to assign ranks to the units with regard to the two characteristics, the coefficient of rank correlation can be calculated.
REPEATED RANKS In the case of attributes there may be a tie, i.e. two or more individuals may be placed together in a classification with respect to an attribute; in the case of variable data there may be more than one item with the same value in either or both series. Spearman’s formula then breaks down, because the variables X (the ranks of individuals in characteristic A, the 1st series) and Y (the ranks of individuals in characteristic B, the 2nd series) no longer take the values 1 to n, so the assumptions made in deriving the formula do not hold.
When two or more values are equal, all of them are assigned the same average rank, and the coefficient of rank correlation is then found. These common ranks are the arithmetic mean of the ranks which the items would have got if they had been different from each other, and the next item gets the rank next to the ranks used in computing the common rank. For example, suppose an item is repeated at rank 4. Then the common rank assigned to each item is (4 + 5)/2 = 4.5, the average of 4 and 5, the ranks these observations would have taken if they had been different. The next item is assigned the rank 6. If an item is repeated thrice at rank 7, the common rank assigned to each value is (7 + 8 + 9)/3 = 8, the arithmetic mean of 7, 8 and 9. The next rank to be assigned is 10.
If only a small proportion of the ranks are tied, this technique may be applied together with the ordinary formula. If a large proportion of the ranks are tied, it is advisable to apply a correction factor: in the formula, add m(m² - 1)/12, i.e. (m³ - m)/12, to ∑d², where m is the number of times an item is repeated. This correction factor is added for each repeated value in both series.
REMARKS ON SPEARMAN’S RANK CORRELATION COEFFICIENT
1. Since Spearman’s rank correlation coefficient ρ is nothing but Pearson’s correlation coefficient between the ranks, it can be interpreted in the same way as Karl Pearson’s correlation coefficient.
2. Karl Pearson’s correlation coefficient assumes that the parent population from which the sample observations are drawn is normal. If this assumption is violated, then we need a measure which is distribution-free (or non-parametric). A distribution-free measure is one which does not make any assumptions about the form of the population. Spearman’s ρ is such a measure, since no strict assumptions are made about the form of the population from which the sample observations are drawn.
3. Spearman’s formula is easy to understand and apply compared with Karl Pearson’s formula. The values obtained by the two formulae, Pearsonian r and Spearman’s ρ, are generally different. The differences arise because some information is always lost when ranks are used instead of the full set of observations. Unless many ties exist, the coefficient of rank correlation should be slightly lower than the Pearsonian coefficient.
4. Spearman’s formula is the only formula to be used for finding the correlation coefficient when we are dealing with qualitative characteristics which cannot be measured quantitatively but can be arranged serially. It can also be used where actual data are given. In the case of extreme observations, Spearman’s formula is preferred to Pearson’s formula.
5. Spearman’s formula has its limitations also. It is not practicable in the case of a bivariate frequency distribution. For n > 30, this formula should not be used unless the ranks are given, since otherwise the calculations are quite time consuming.
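When ranks are not repeated:

$$ \rho = 1 - \frac{6\sum D^2}{n(n^2 - 1)} $$

When ranks are repeated:

$$ \rho = 1 - \frac{6\left[\sum D^2 + \sum \frac{m(m^2 - 1)}{12}\right]}{n(n^2 - 1)} $$

Problem 1:

Rank in A (x) | Rank in B (y) | D = x - y | D²
1 | 10 | -9 | 81
2 | 7 | -5 | 25
3 | 2 | 1 | 1
4 | 6 | -2 | 4
5 | 4 | 1 | 1
6 | 8 | -2 | 4
7 | 3 | 4 | 16
8 | 1 | 7 | 49
9 | 9 | 0 | 0
10 | 5 | 5 | 25
Total | | | 206

ρ = 1 - (6 × 206) / [10 × (100 - 1)] = 1 - 1236/990
ρ = -0.25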
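Problem 2:

Cost | Sales | Rank of cost (X) | Rank of sales (Y) | D | D²
39 | 47 | 8 | 10 | -2 | 4
65 | 53 | 6 | 8 | -2 | 4
62 | 58 | 7 | 7 | 0 | 0
90 | 86 | 2 | 2 | 0 | 0
82 | 62 | 3 | 5 | -2 | 4
75 | 68 | 5 | 4 | 1 | 1
25 | 60 | 10 | 6 | 4 | 16
98 | 91 | 1 | 1 | 0 | 0
36 | 51 | 9 | 9 | 0 | 0
78 | 84 | 4 | 3 | 1 | 1
Total | | | | | 30

ρ = 1 - (6 × 30) / [10 × (100 - 1)]
ρ = 0.82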
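Problem 3:

X | Y | R1 | R2 | D | D²
48 | 13 | 3 | 5.5 | -2.5 | 6.25
33 | 13 | 5 | 5.5 | -0.5 | 0.25
40 | 24 | 4 | 1 | 3 | 9
9 | 6 | 10 | 8.5 | 1.5 | 2.25
16 | 15 | 8 | 4 | 4 | 16
16 | 4 | 8 | 10 | -2 | 4
65 | 20 | 1 | 2 | -1 | 1
24 | 9 | 6 | 7 | -1 | 1
16 | 6 | 8 | 8.5 | -0.5 | 0.25
57 | 19 | 2 | 3 | -1 | 1
Total | | | | | 41

In X the value 16 occurs three times (m = 3, correction 3(3² - 1)/12 = 2); in Y the values 13 and 6 each occur twice (m = 2, correction 2(2² - 1)/12 = 0.5 each).

ρ = 1 - 6(∑D² + ∑[m(m² - 1)/12]) / [n(n² - 1)]
ρ = 1 - 6(41 + 2 + 0.5 + 0.5) / [10(10² - 1)]
ρ = 0.7333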
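The ranking and tie-correction steps of Problem 3 can be sketched in Python as follows. This is an illustrative sketch only: the helper names `average_ranks` and `spearman_rho` are assumptions, ranks are assigned in decreasing order (largest value gets rank 1) as in the worked problems, and the correction m(m² - 1)/12 is added once for every tied group in either series.

```python
from collections import Counter

def average_ranks(values):
    """Rank in decreasing order; tied values share the mean of their ranks."""
    ordered = sorted(values, reverse=True)
    positions = {}
    for pos, val in enumerate(ordered, start=1):
        positions.setdefault(val, []).append(pos)
    return [sum(positions[val]) / len(positions[val]) for val in values]

def spearman_rho(x, y):
    """Spearman's rank correlation with the tie correction added to sum of D^2."""
    n = len(x)
    rx, ry = average_ranks(x), average_ranks(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    correction = sum(m * (m * m - 1) / 12
                     for series in (x, y)
                     for m in Counter(series).values() if m > 1)
    return 1 - 6 * (sum_d2 + correction) / (n * (n * n - 1))

# Data of Problem 3 (16 is repeated three times in X; 13 and 6 twice in Y)
x3 = [48, 33, 40, 9, 16, 16, 65, 24, 16, 57]
y3 = [13, 13, 24, 6, 15, 4, 20, 9, 6, 19]
rho3 = spearman_rho(x3, y3)
print(round(rho3, 4))
```

When no value is repeated the correction term is zero, so the same function also reproduces the untied formula used in Problems 1 and 2.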
ALGEBRAIC METHOD (CONCURRENT DEVIATIONS) This is a very casual method of determining the correlation between two series, used when we are not very serious about precision. It is based on the signs of the deviations (i.e. the direction of change) of the values of each variable from its preceding value, and does not take into account the exact magnitude of the values. We put a plus (+) sign, a minus (-) sign or an equality (=) sign for the deviation according to whether the value of the variable is greater than, less than or equal to the preceding value. The deviations in the values of the two variables are said to be concurrent if they have the same sign, i.e. both deviations are positive, both are negative or both are equal. The formula used for computing the correlation coefficient r by this method is

$$ r = \pm\sqrt{\pm\frac{2c - n}{n}} $$

where c is the number of pairs of concurrent deviations and n is the number of pairs of deviations. The plus/minus sign inside and outside the square root is of fundamental importance. Since -1 ≤ r ≤ 1, the quantity inside the square root, ±(2c - n)/n, must be positive, otherwise r would be imaginary, which is not possible. Thus, if (2c - n) is positive, we take the positive sign inside and outside the square root, and if (2c - n) is negative, we take the negative sign inside and outside the square root.
Remark 1: It should be clearly noted that here n is not the number of pairs of observations but the number of pairs of deviations, and as such it is one less than the number of pairs of observations.
Remark 2: r computed by this formula is also known as the coefficient of concurrent deviations.
Remark 3: The coefficient of concurrent deviations is primarily based on the following principle: “If the short time fluctuations of the time series are positively correlated or, in other words, if their deviations are concurrent, their curves would move in the same direction and would indicate positive correlation between them”.
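Year | Supply | Sign of change (x) | Price | Sign of change (y) | Product of signs (xy)
1993 | 160 | | 292 | |
1994 | 164 | + | 280 | - | -
1995 | 172 | + | 260 | - | -
1996 | 182 | + | 234 | - | -
1997 | 166 | - | 266 | + | -
1998 | 170 | + | 254 | - | -
1999 | 178 | + | 230 | - | -
2000 | 192 | + | 190 | - | -
2001 | 186 | - | 200 | + | -

Here n = 8 and c = 0 (no pair of deviations is concurrent).

r = ± √ ± [(2c - n)/n]
r = ± √ ± [(0 - 8)/8]
r = -1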
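The sign-counting procedure above can be sketched in a few lines of Python. This is a minimal illustration under the conventions stated in the text (equal successive values count as an "=" deviation, and two "=" deviations are concurrent); the function names are hypothetical.

```python
import math

def sign(a):
    """Return +1, -1 or 0 for a positive, negative or zero change."""
    return (a > 0) - (a < 0)

def concurrent_deviation_coefficient(x, y):
    """Coefficient of concurrent deviations r = +/- sqrt(+/- (2c - n) / n)."""
    dx = [sign(b - a) for a, b in zip(x, x[1:])]
    dy = [sign(b - a) for a, b in zip(y, y[1:])]
    n = len(dx)                       # number of pairs of deviations
    c = sum(1 for a, b in zip(dx, dy) if a == b)   # concurrent pairs
    inner = (2 * c - n) / n
    s = 1 if inner >= 0 else -1       # same sign inside and outside the root
    return s * math.sqrt(s * inner)

supply = [160, 164, 172, 182, 166, 170, 178, 192, 186]
price = [292, 280, 260, 234, 266, 254, 230, 190, 200]
r_cd = concurrent_deviation_coefficient(supply, price)
print(r_cd)
```

For the supply and price series no deviation is concurrent, so the sketch reproduces the hand result r = -1.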
REGRESSION Literally the word regression means ‘return to the origin’. In statistics, the word is used in a different sense. If two variables are correlated, the unknown value of one of the variables can be estimated by using the known value of the other variable. The value so estimated may not be equal to the actually observed value, but it will be close to it. Regression Analysis, in the general sense, means the estimation or prediction of the unknown value of one variable from the known value of the other variable. Regression Analysis confined to the study of only two variables at a time is termed Simple Regression. But quite often the values of a particular phenomenon may be affected by a multiplicity of causes. Regression Analysis for studying more than two variables at a time is known as Multiple Regression. In Regression Analysis there are two types of variables. The variable whose value is influenced or is to be predicted is called the dependent variable. The variable which influences the values or is used for prediction is called the independent variable. In Regression Analysis the independent variable is also known as the regressor, predictor or explanator, while the dependent variable is also known as the regressed or explained variable.
LINEAR & NON-LINEAR REGRESSION If the given bivariate data are plotted on a graph, the points so obtained will more or less concentrate around a curve, called the “Curve of Regression”. The mathematical equation of the regression curve is called the Regression Equation. If the regression curve is a straight line, we say that there is linear regression between the variables under study. If the curve of regression is not a straight line, the regression is termed curved or non-linear regression. The tendency of the actual value to lie close to the estimated value is called regression.
In a wider usage, regression is the theory of estimation of the unknown value of a variable with the help of known values of other variables. The regression theory was first introduced and developed by Sir Francis Galton in the field of Genetics.
Here, first, a mathematical relation between the two variables is framed. This relation, which is called the regression equation, is obtained by the method of least squares. It may be linear or non-linear. For bivariate data on x and y, the regression equation obtained on the assumption that x is dependent on y is called the regression of x on y:
(x - x̄) = bxy (y - ȳ)
The regression equation obtained on the assumption that y is dependent on x is called the regression of y on x:
(y - ȳ) = byx (x - x̄)
Here x̄ and ȳ are the arithmetic means (AM) of x and y. The regression coefficients bxy and byx are given by the following equivalent formulas, where dx = x - x̄ and dy = y - ȳ:

$$ b_{xy} = r\,\frac{\sigma_x}{\sigma_y} = \frac{\mathrm{Cov}(x,y)}{\sigma_y^2} = \frac{n\sum xy - \sum x\sum y}{n\sum y^2 - (\sum y)^2} = \frac{\sum d_x d_y}{\sum d_y^2} $$

$$ b_{yx} = r\,\frac{\sigma_y}{\sigma_x} = \frac{\mathrm{Cov}(x,y)}{\sigma_x^2} = \frac{n\sum xy - \sum x\sum y}{n\sum x^2 - (\sum x)^2} = \frac{\sum d_x d_y}{\sum d_x^2} $$
The regression of x on y is used for the estimation of x values and the regression of y on x is used for the estimation of y values. The graph of the regression equations are the regression lines.
PROPERTIES OF REGRESSION Regression coefficients are the coefficients of the independent variables in the regression equations.
1. The regression coefficient bxy is the change occurring in x for a unit change in y. The regression coefficient byx is the change occurring in y for a unit change in x.
2. The regression coefficients are independent of the origin of measurement of the variables, but they are dependent on the scale.
3. The geometric mean of the regression coefficients is equal to the coefficient of correlation (numerically).
4. The regression coefficients cannot be of opposite signs. If r is positive, both regression coefficients are positive. If r is negative, both are negative. If r is zero, both are zero.
5. Since the coefficient of correlation cannot numerically be greater than 1, the product of the regression coefficients cannot be greater than 1.
PROPERTIES OF REGRESSION LINES There are two regression lines.
1. The regression lines intersect at (x̄, ȳ).
2. The regression lines have positive slope if the variables are positively correlated. They have negative slope if the variables are negatively correlated.
3. If there is perfect correlation, the regression lines coincide (there is only one regression line).
LINES OF REGRESSION A line of regression is the line which gives the best estimate of one variable for any given value of the other variable. In the case of two variables x and y, we have two regression equations: x on y and y on x. The line of regression of y on x is the line which gives the best estimate of the value of y for any specified value of x. The line of regression of x on y is the line which gives the best estimate of the value of x for any specified value of y.
LINE OF REGRESSION OF y on x
(y - ȳ) = r (σy/σx) (x - x̄)
LINE OF REGRESSION OF x on y
(x - x̄) = r (σx/σy) (y - ȳ)
REMEMBER
a. When r = 0, i.e. when x and y are uncorrelated, the lines of regression of y on x and of x on y are y - ȳ = 0 and x - x̄ = 0. The lines are perpendicular to each other.
b. When r = ±1 the two lines coincide.
c. If the value of r is significant, we can use the lines of regression for estimation and prediction.
d. If r is not significant, the linear model is not a good fit and hence the lines of regression should not be used for prediction.
COEFFICIENTS OF REGRESSION
a. bxy is the coefficient of regression of x on y.
b. byx is the coefficient of regression of y on x.
THEOREMS ON REGRESSION COEFFICIENTS
a. The correlation coefficient is the geometric mean of the regression coefficients, i.e. r² = bxy × byx.
b. The sign to be taken before the square root is the same as that of the regression coefficients.
c. If one of the regression coefficients is greater than one, the other must be less than one.
d. The arithmetic mean of the moduli of the regression coefficients is greater than or equal to the modulus of the correlation coefficient.
e. Regression coefficients are independent of change of origin but not of scale.
Problem 1:

X | Y | dx = X - X̄ | dy = Y - Ȳ | dx² | dy² | dx·dy
91 | 71 | 1 | 1 | 1 | 1 | 1
97 | 75 | 7 | 5 | 49 | 25 | 35
108 | 69 | 18 | -1 | 324 | 1 | -18
121 | 97 | 31 | 27 | 961 | 729 | 837
67 | 70 | -23 | 0 | 529 | 0 | 0
124 | 91 | 34 | 21 | 1156 | 441 | 714
51 | 39 | -39 | -31 | 1521 | 961 | 1209
73 | 61 | -17 | -9 | 289 | 81 | 153
111 | 80 | 21 | 10 | 441 | 100 | 210
57 | 47 | -33 | -23 | 1089 | 529 | 759
900 | 700 | 0 | 0 | 6360 | 2868 | 3900

X̄ = 900/10 = 90 and Ȳ = 700/10 = 70.

bxy = Σdx·dy / Σdy² = 3900/2868 = 1.3598
byx = Σdx·dy / Σdx² = 3900/6360 = 0.6132

Regression of x on y: (x - x̄) = bxy (y - ȳ)
(x - 90) = 1.3598 (y - 70)
x = 1.3598y - 5.19

Regression of y on x: (y - ȳ) = byx (x - x̄)
(y - 70) = 0.6132 (x - 90)
y = 0.6132x + 14.812
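The deviation formulas used in Problem 1 can be verified with a short Python sketch. The function name `regression_coefficients` is illustrative, and the third X value is taken as 108, which is the value implied by the tabulated deviation dx = 18 and by ΣX = 900.

```python
import math

def regression_coefficients(x, y):
    """Return (mean_x, mean_y, byx, bxy) from deviations about the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return mean_x, mean_y, s_xy / s_xx, s_xy / s_yy

X = [91, 97, 108, 121, 67, 124, 51, 73, 111, 57]
Y = [71, 75, 69, 97, 70, 91, 39, 61, 80, 47]
mean_x, mean_y, byx, bxy = regression_coefficients(X, Y)

# Both coefficients are positive here, so r is the positive root of their product.
r_xy = math.sqrt(byx * bxy)

# Intercepts of the two lines: y = byx * x + c_y and x = bxy * y + c_x
c_y = mean_y - byx * mean_x
c_x = mean_x - bxy * mean_y
print(round(byx, 4), round(bxy, 4), round(c_y, 3), round(c_x, 2), round(r_xy, 3))
```

Both lines pass through (x̄, ȳ) = (90, 70), which is why each intercept is obtained by substituting the means.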
Problem 2: The data about the sales and advertisement expenditure of a firm (in crores of rupees) are given below:

 | Sales (x) | Advertisement expenditure (y)
Means | 40 | 6
Standard deviations | 10 | 1.5

The coefficient of correlation is 0.9.
o Estimate the likely sales for a proposed advertisement expenditure of Rs. 10 crores.
o What should be the advertisement expenditure if the firm proposes a sales target of 60 crores of rupees?
Answer:
Regression of x on y: (x - x̄) = bxy (y - ȳ), with bxy = r·σx/σy
(x - 40) = (0.9 × 10/1.5) (y - 6)
x = 6y + 4
For y = 10: x = 6 × 10 + 4 = 64

Regression of y on x: (y - ȳ) = byx (x - x̄), with byx = r·σy/σx
(y - 6) = (0.9 × 1.5/10) (x - 40)
y = 0.135x + 0.6
For x = 60: y = 0.135 × 60 + 0.6 = 8.7
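Problem 3: Point out the inconsistency, if any, in the following statement: “The Regression Equation of y on x is 2y + 3x = 4 and the correlation coefficient between x & y is 0.8” Answer: Refer to the properties. The equation gives y = 2 - 1.5x, so byx = -1.5 is negative while r = 0.8 is positive; since the regression coefficients and r must have the same sign, the statement is inconsistent.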
Problem 4: By using the following data, find out the two lines of regression and from them compute Karl Pearson’s coefficient of correlation: ΣX = 250; ΣY = 300; ΣXY = 7900; ΣX² = 6500; ΣY² = 10000; n = 10
Answer:
bxy = (nΣxy - Σx·Σy) / (nΣy² - (Σy)²) = (10 × 7900 - 250 × 300) / (10 × 10000 - (300)²) = 0.4
byx = (nΣxy - Σx·Σy) / (nΣx² - (Σx)²) = (10 × 7900 - 250 × 300) / (10 × 6500 - (250)²) = 1.6
With x̄ = 25 and ȳ = 30, the lines are (x - 25) = 0.4 (y - 30) and (y - 30) = 1.6 (x - 25).
rxy² = bxy × byx = 0.4 × 1.6 = 0.64
rxy = 0.8
Problem 5: Find the two regression coefficients and hence r. n = 5; X̄ = 10; Ȳ = 20; Σ(X - 4)² = 100; Σ(Y - 10)² = 160; Σ(X - 4)(Y - 10) = 80
Answer: Let U = X - 4, so Ū = X̄ - 4 = 6 and ΣU = nŪ = 30. Similarly, with V = Y - 10, ΣV = 50.
byx = (nΣUV - ΣU·ΣV) / (nΣU² - (ΣU)²) = (5 × 80 - 30 × 50) / (5 × 100 - (30)²) = 11/4
bxy = (nΣUV - ΣU·ΣV) / (nΣV² - (ΣV)²) = (5 × 80 - 30 × 50) / (5 × 160 - (50)²) = 11/17
r = √[(11/4)(11/17)] = 1.33, which is impossible since r cannot exceed 1; the given data are therefore inconsistent.
Time Series Generally, planning of economic and business activities is based on predictions of production, demand, sales etc. The future can be predicted by a detailed study of past variations. Thus, future demand can be predicted by studying the variations in demand over the last few years. A time series may be defined as a collection of readings, belonging to different time periods, of some economic variable or composite of variables. A series of observations of a phenomenon recorded at successive points of time is called a Time Series. It is a chronological arrangement of statistical data regarding the phenomenon. Generally, time series are those of production, demand, sales, price, imports, exports, bank rate, value of money, etc. Usually equidistant points of time are considered; there may be weekly, monthly, yearly, etc. recordings. A graphical presentation of a time series is called a Historigram.
COMPONENTS OF A TIME SERIES In a time series, the observations vary with time. The variation occurring in any period is the result of many factors. The effects of these factors may be summed up as four components:
a. Trend (Secular Trend, Long Term Movement)
b. Seasonal Variation
c. Cyclical Variation (Business Cycle)
d. Irregular Variation (Random Fluctuation, Erratic Variation)
An analytical study of the different components of a time series, the effects of these components, etc. is called analysis of time series. The utility of such analysis is:
a. Understanding the past behaviour of the variable
b. Knowing the existing nature of variation
c. Predicting the future trend
d. Comparison with other similar variables.
Trend (Secular Trend) Trend is the overall change taking place in the time series over a long period of many years. Most time series show a general tendency to increase, decrease or remain constant over a long period of time. Such an overall change is the trend. Examples
a. The steady increase in the population of India over the past many years is an upward trend.
b. The steady increase in the price of gold over the last many years is an upward trend.
c. Due to the availability of greater medical facilities, the death rate is decreasing. Thus, the death rate shows a downward trend.
d. Atmospheric temperature at a place, though it shows short time variation, does not show a significant upward or downward trend.
The root causes of trend are technological advancement, growth of population, change in tastes etc. Trend is measured mainly by the method of moving averages and by the method of least squares.
Seasonal Variation The regular and periodic variation in a time series is called seasonal variation. Generally, the period of seasonal variation is within one year. The factors causing seasonal variation are (1) weather conditions and (2) customs, traditions and habits of people. Seasonal variation is predictable. Examples
a. An increase in the sales of woollen clothes during winter.
b. An increase in the sales of note-books during the months of June, July and August.
c. An increase in atmospheric temperature during summer.
Cyclical Variation (Business Cycle) Cyclical variation is an oscillatory variation which occurs in four stages, viz. prosperity, recession, depression and recovery. Generally, such variation occurs in economic and business activities, with a period of more than one year. One cycle consisting of the four stages occurs over a period of a few years. The period is not definite; generally it is 5 to 10 years. Many economists have explained the causes of cyclical variation, and each explanation is significant.
Irregular Variation (Random Fluctuation) Apart from the regular variations, most time series show variations which are totally unexpected. Irregular variations occur as a result of unexpected happenings such as wars, famines, strikes, floods etc. They are unpredictable. Generally, the effect of such variation lasts for a short period. Examples
a. An increase in the price of vegetables due to a strike by railway employees.
b. A decrease in the number of passengers in city buses as a result of a strike by public sector employees.
c. An increase in the number of deaths due to earthquakes.
Measurement Of Trend
o Graphic (or Free-hand Curve Fitting) Method
o Method of Semi-Averages
o Method of Curve Fitting by the Principle of Least Squares
o Method of Moving Averages
METHOD OF SEMI-AVERAGES Problem 1: Estimate the value for 2000. If the actual sales figure is 35000 units, how do you account for the difference between the figures obtained?

Year | 1993 | 1994 | 1995 | 1996 | 1997 | 1998
Sales ('000s) | 20 | 24 | 22 | 30 | 28 | 32

Answer:

Year | Sales ('000s) | 3-yearly semi-total | Semi-average
1993 | 20 | |
1994 | 24 | 66 | 22
1995 | 22 | |
1996 | 30 | |
1997 | 28 | 90 | 30
1998 | 32 | |

Annual increment: (30 - 22) = 8; 8/3 = 2.667

Year | Calculation | Trend value ('000s)
1993 | 22 - 2.667 | 19.333
1994 | 22 | 22
1995 | 22 + 2.667 | 24.667
1996 | 30 - 2.667 | 27.333
1997 | 30 | 30
1998 | 30 + 2.667 | 32.667
1999 | 32.667 + 2.667 | 35.334
2000 | 35.334 + 2.667 | 38

The estimate for 2000 is 38000 units against actual sales of 35000 units. The difference arises from the assumption that there is a linear relationship between the given time series values. Moreover, the effects of seasonal, cyclical and irregular variations have been completely neglected.
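The semi-average procedure can be sketched in Python as below. This is a minimal sketch, not the notes' own method statement: the function name `semi_average_trend` is hypothetical, the series is split into two equal halves (the middle observation is dropped when the number of years is odd), and each semi-average is centred on the mean year of its half.

```python
def semi_average_trend(years, values):
    """Return a function giving the semi-average trend value for any year."""
    n = len(values)
    half = n // 2                      # middle value is skipped when n is odd
    first_years, second_years = years[:half], years[n - half:]
    first_avg = sum(values[:half]) / half
    second_avg = sum(values[n - half:]) / half
    first_centre = sum(first_years) / half
    second_centre = sum(second_years) / half
    slope = (second_avg - first_avg) / (second_centre - first_centre)

    def trend(year):
        return first_avg + slope * (year - first_centre)

    return trend

years = [1993, 1994, 1995, 1996, 1997, 1998]
sales = [20, 24, 22, 30, 28, 32]
trend = semi_average_trend(years, sales)
print(round(trend(2000), 3))   # trend estimate for 2000, in thousands of units
```

Note that the sketch keeps the semi-averages unrounded, so for a series whose semi-averages are rounded by hand the result can differ slightly from the hand calculation.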
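Problem 2: From the following series find the trend by the Semi-Average method. Estimate the value for the year 1999.

Year | 1990 | 1991 | 1992 | 1993 | 1994 | 1995 | 1996 | 1997 | 1998
Value | 170 | 231 | 261 | 267 | 278 | 302 | 299 | 298 | 340

Answer: Since the number of years is odd, the middle year (1994) is left out and the remaining years form two groups of four.

Year | Value | 4-yearly semi-total | Semi-average
1990 | 170 | |
1991 | 231 | |
1992 | 261 | 929 | 232
1993 | 267 | |
1994 | 278 | |
1995 | 302 | |
1996 | 299 | |
1997 | 298 | 1239 | 310
1998 | 340 | |

The semi-averages (rounded) are centred at the middle of 1991-92 and 1996-97, five years apart. Increase over five years: (310 - 232) = 78, so the annual increment is 78/5. Estimate for the year 1999: 310 + (5/2) × (78/5) = 349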
METHOD OF CURVE FITTING: PRINCIPLE OF LEAST SQUARES
Fitting of Linear Trend: y = a + bx. To find a & b, solve the normal equations:
(i) ∑y = na + b∑x
(ii) ∑xy = a∑x + b∑x²
Fitting of a Second Degree (Parabolic) Trend: y = a + bx + cx². To find a, b & c, solve:
(i) ∑y = na + b∑x + c∑x²
(ii) ∑xy = a∑x + b∑x² + c∑x³
(iii) ∑x²y = a∑x² + b∑x³ + c∑x⁴
Problem 3: Fit a linear trend to the following data. Estimate the production for the year 1999. Verify that ∑(y - ye) = 0, where ye is the corresponding trend value of y.

Year | 1990 | 1992 | 1994 | 1996 | 1998
Production | 18 | 21 | 23 | 27 | 16

Answer: Take the year 1994 as the origin (it is the mid point, as there is an odd number of years), so that x = Year - 1994.

Year | Production (y) | x | x² | xy | Trend value (ye) | (y - ye)
1990 | 18 | -4 | 16 | -72 | 20.6 | -2.6
1992 | 21 | -2 | 4 | -42 | 20.8 | 0.2
1994 | 23 | 0 | 0 | 0 | 21 | 2
1996 | 27 | 2 | 4 | 54 | 21.2 | 5.8
1998 | 16 | 4 | 16 | 64 | 21.4 | -5.4
Total | 105 | 0 | 40 | 4 | | 0

Fitting of Linear Trend: y = a + bx. To find a & b:
∑y = na + b∑x: 105 = a × 5 + b × 0, so a = 21
∑xy = a∑x + b∑x²: 4 = a × 0 + b × 40, so b = 0.1
Therefore the equation is y = 21 + 0.1x. Estimated production for 1999 (x = 5): y = 21 + 0.1 × 5
y = 21.5 thousands of units.
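The two normal equations for the linear trend can be solved directly, as in the following Python sketch. The function name `fit_linear_trend` is an illustrative assumption; the closed-form solution used is the standard one obtained by eliminating a between equations (i) and (ii).

```python
def fit_linear_trend(x, y):
    """Solve sum(y) = n*a + b*sum(x) and sum(xy) = a*sum(x) + b*sum(x^2)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b

years = [1990, 1992, 1994, 1996, 1998]
production = [18, 21, 23, 27, 16]
x = [year - 1994 for year in years]          # origin at the middle year

a, b = fit_linear_trend(x, production)
trend_values = [a + b * xi for xi in x]
residual_sum = sum(y - ye for y, ye in zip(production, trend_values))
estimate_1999 = a + b * (1999 - 1994)
print(a, b, round(estimate_1999, 2))
```

Choosing the origin at the middle year makes ∑x = 0, which is why the hand calculation reduces to a = ∑y/n and b = ∑xy/∑x²; the function does not rely on that and works for any origin.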
Problem 4: Calculate the quarterly trend values by the method of least squares for the following quarterly data for the last 5 years:

Year | I Quarter | II Quarter | III Quarter | IV Quarter
1994 | 60 | 80 | 72 | 68
1995 | 68 | 104 | 100 | 88
1996 | 80 | 116 | 108 | 96
1997 | 108 | 152 | 136 | 124
1998 | 160 | 184 | 172 | 164

Answer:

Year | Total | Average (y) | U | U² | Uy | Trend value
1994 | 280 | 70 | -2 | 4 | -140 | 64
1995 | 360 | 90 | -1 | 1 | -90 | 88
1996 | 400 | 100 | 0 | 0 | 0 | 112
1997 | 520 | 130 | 1 | 1 | 130 | 136
1998 | 680 | 170 | 2 | 4 | 340 | 160
Total | | 560 | 0 | 10 | 240 |

Fitting of Linear Trend: y = a + bU. To find a & b:
∑y = na + b∑U: 560 = a × 5 + b × 0, so a = 112
∑Uy = a∑U + b∑U²: 240 = a × 0 + b × 10, so b = 24
Therefore the equation is y = 112 + 24U, and the quarterly increment is 24/4 = 6. The yearly trend value is centred at the middle of the year, between the second and third quarters. Therefore the values for the second and third quarters of 1994 are 64 - (6/2) = 61 and 64 + (6/2) = 67 respectively, and the remaining quarters follow in steps of 6.

Year | I Quarter | II Quarter | III Quarter | IV Quarter
1994 | 55 | 61 | 67 | 73
1995 | 79 | 85 | 91 | 97
1996 | 103 | 109 | 115 | 121
1997 | 127 | 133 | 139 | 145
1998 | 151 | 157 | 163 | 169
Problem 1: Fit an equation of the form y = a + bx + cx² to the data given below.

X | 1 | 2 | 3 | 4 | 5
Y | 25 | 28 | 33 | 39 | 46

Answer: Take x = X - 3.

X | Y | x | x² | x³ | x⁴ | xY | x²Y | Trend value
1 | 25 | -2 | 4 | -8 | 16 | -50 | 100 | 24.88
2 | 28 | -1 | 1 | -1 | 1 | -28 | 28 | 28.26
3 | 33 | 0 | 0 | 0 | 0 | 0 | 0 | 32.92
4 | 39 | 1 | 1 | 1 | 1 | 39 | 39 | 38.86
5 | 46 | 2 | 4 | 8 | 16 | 92 | 184 | 46.08
Total | 171 | 0 | 10 | 0 | 34 | 53 | 351 |

Fitting of a Second Degree (Parabolic) Trend:
∑y = na + b∑x + c∑x²: 171 = 5a + 0b + 10c …..(i)
∑xy = a∑x + b∑x² + c∑x³: 53 = 0a + 10b + 0c …..(ii)
∑x²y = a∑x² + b∑x³ + c∑x⁴: 351 = 10a + 0b + 34c …..(iii)
By (ii), b = 5.3. Solving (i) and (iii) [multiply (i) by 2 and deduct it from (iii)] we get 14c = 9, so c = 0.64, and 5a = 171 - 10 × 0.64, so a = 32.92. Therefore the equation is: y = 32.92 + 5.3x + 0.64x²
Problem 2: Fit an equation of the form y = A·Bˣ to the data given below.

x | 1 | 2 | 3 | 4 | 5
y | 1.6 | 4.5 | 13.8 | 40.2 | 125

Answer:

x | y | Y = log y | xY | x² | Trend value
1 | 1.6 | 0.2041 | 0.2041 | 1 | 1.6
2 | 4.5 | 0.6532 | 1.3064 | 4 | 4.6
3 | 13.8 | 1.1399 | 3.4197 | 9 | 13.8
4 | 40.2 | 1.6042 | 6.4168 | 16 | 41.1
5 | 125 | 2.0969 | 10.4845 | 25 | 122.3
Total 15 | | 5.6983 | 21.8315 | 55 |

Fitting of an Exponential Curve: y = A·Bˣ …..(i)
Taking logarithms we get: log y = log A + x log B …..(ii)
i.e. Y = a + bx, where Y = log y; a = log A; b = log B …..(iii)
The normal equations for (iii) are:
∑Y = na + b∑x: 5.6983 = 5a + 15b …..(iv)
∑xY = a∑x + b∑x²: 21.8315 = 15a + 55b …..(v)
By solving (iv) & (v) we get b = 0.4737 and a = -0.2814. Taking antilogs we get A = 0.5231 and B = 2.977. Therefore the trend equation is: y = 0.5231 × (2.977)ˣ
METHOD OF MOVING AVERAGES This is a simple and flexible method of measuring trend. A moving average is an averaging process that smooths out the fluctuations and ups & downs in the given data. The moving average of period ‘m’ is a series of successive averages of m overlapping values at a time, starting with the 1st, 2nd, 3rd value and so on.
Problem 3: Calculate the 5-yearly moving average from the data given below: 10; 14; 18; 22; 26; 30; 34; 38; 42; 46; 50
Answer:

Year | Value | 5-yearly moving total | 5-yearly moving average
1 | 10 | |
2 | 14 | |
3 | 18 | 90 | 18
4 | 22 | 110 | 22
5 | 26 | 130 | 26
6 | 30 | 150 | 30
7 | 34 | 170 | 34
8 | 38 | 190 | 38
9 | 42 | 210 | 42
10 | 46 | |
11 | 50 | |
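Problem 4: Calculate the 4-yearly moving average from the following data: 37.4; 31.1; 38.7; 39.5; 47.9; 42.6
Answer: Since the period is even, each 4-yearly total falls between two years and the averages must be centred by averaging them in pairs.

Year | Production | 4-yearly moving total | 4-yearly moving average | 2-period moving total | Centred average
1991 | 37.4 | | | |
1992 | 31.1 | | | |
 | | 146.7 | 36.675 | |
1993 | 38.7 | | | 75.975 | 37.99
 | | 157.2 | 39.300 | |
1994 | 39.5 | | | 81.475 | 40.74
 | | 168.7 | 42.175 | |
1995 | 47.9 | | | |
1996 | 42.6 | | | |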
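The moving-average calculations of Problems 3 and 4 can be reproduced with the following Python sketch. The function names are illustrative; for an even period the sketch centres the averages by taking the mean of each pair of successive moving averages, as done in the table above.

```python
def moving_average(values, m):
    """Simple moving averages of period m (one per window of m values)."""
    return [sum(values[i:i + m]) / m for i in range(len(values) - m + 1)]

def centered_moving_average(values, m):
    """Moving averages aligned with actual periods.

    For odd m the plain moving average is already centred; for even m
    successive averages are averaged in pairs.
    """
    ma = moving_average(values, m)
    if m % 2 == 1:
        return ma
    return [(a + b) / 2 for a, b in zip(ma, ma[1:])]

# Problem 3: 5-yearly moving averages (centred on years 3 to 9)
series = [10, 14, 18, 22, 26, 30, 34, 38, 42, 46, 50]
ma5 = centered_moving_average(series, 5)

# Problem 4: 4-yearly centred moving averages (years 1993 and 1994)
production = [37.4, 31.1, 38.7, 39.5, 47.9, 42.6]
cma4 = centered_moving_average(production, 4)
print(ma5, [round(v, 2) for v in cma4])
```

A moving average of period m loses (m - 1)/2 values at each end for odd m and m/2 values at each end for even m, which is why the tables leave the first and last years blank.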
SEASONAL VARIATIONS These are the variations due to forces which operate in a regular periodic manner with a period of less than one year. The objectives of studying them are as follows:
o To isolate seasonal variations: to determine the effect of seasonal swings on the values of a given phenomenon.
o To eliminate them: to determine the value of the phenomenon if there were no seasonal ups & downs.
Methods:
o Method of “Simple Averages”
o “Ratio to Trend” Method
o “Ratio to Moving Averages” Method
o “Link Relative” Method
SIMPLE AVERAGES This is the simplest method of measuring the seasonal variations in a time series and involves the following steps:
o Arrange the data by years & months (or quarters)
o Compute the average for each month (or quarter)
o Compute the overall average
o Obtain the seasonal indices for the different months (or quarters) by expressing each average as a percentage of the overall average
Problem 5: Compute the seasonal index from the data given:

Quarter | 1990 | 1991 | 1992 | 1993 | 1994 | 1995
I | 3.5 | 3.5 | 3.5 | 4.0 | 4.1 | 4.2
II | 3.9 | 4.1 | 3.9 | 4.6 | 4.4 | 4.6
III | 3.4 | 3.7 | 3.7 | 3.8 | 4.2 | 4.3
IV | 3.6 | 4.8 | 4.0 | 4.5 | 4.5 | 4.7

Answer:

Year | I Qtr. | II Qtr. | III Qtr. | IV Qtr.
1990 | 3.5 | 3.9 | 3.4 | 3.6
1991 | 3.5 | 4.1 | 3.7 | 4.8
1992 | 3.5 | 3.9 | 3.7 | 4.0
1993 | 4.0 | 4.6 | 3.8 | 4.5
1994 | 4.1 | 4.4 | 4.2 | 4.5
1995 | 4.2 | 4.6 | 4.3 | 4.7
TOTAL | 22.8 | 25.5 | 23.1 | 26.1
A.M. | 3.8 | 4.25 | 3.85 | 4.35
Seasonal Index | 93.6 | 104.7 | 94.8 | 107.1

Overall average X̄ = (3.8 + 4.25 + 3.85 + 4.35)/4 = 4.06
o I Quarter: (3.8/4.06) × 100 = 93.6
o II Quarter: (4.25/4.06) × 100 = 104.7
o III Quarter: (3.85/4.06) × 100 = 94.8
o IV Quarter: (4.35/4.06) × 100 = 107.1
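The simple-averages steps can be sketched in Python as follows. The function name `seasonal_indices_simple` and the row-per-year data layout are assumptions made for illustration. The sketch uses the unrounded overall average (4.0625), so its indices differ slightly from the hand values, which use 4.06.

```python
def seasonal_indices_simple(data):
    """Seasonal indices by the method of simple averages.

    data: one row per year, each row holding the values for every season.
    """
    n_seasons = len(data[0])
    season_means = [sum(row[s] for row in data) / len(data)
                    for s in range(n_seasons)]
    overall_mean = sum(season_means) / n_seasons
    return [100 * m / overall_mean for m in season_means]

quarterly = [
    [3.5, 3.9, 3.4, 3.6],   # 1990
    [3.5, 4.1, 3.7, 4.8],   # 1991
    [3.5, 3.9, 3.7, 4.0],   # 1992
    [4.0, 4.6, 3.8, 4.5],   # 1993
    [4.1, 4.4, 4.2, 4.5],   # 1994
    [4.2, 4.6, 4.3, 4.7],   # 1995
]
indices = seasonal_indices_simple(quarterly)
print([round(i, 1) for i in indices])
```

By construction the indices average to 100, i.e. they total 400 for quarterly data.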
RATIO TO TREND This method is an improvement over the previous method. It rests on the assumption that the seasonal fluctuation for any season is a constant factor of the trend. It involves the following steps:
o Compute the trend values by an appropriate method.
o Assuming a multiplicative model, eliminate the trend by expressing each original value as a percentage of its trend value.
o Arrange these values according to years, months or quarters and average them for each month or quarter.
o Adjust these seasonal indices to a total of 1200 for monthly data or 400 for quarterly data.
Problem 6: Using the “Ratio to Trend” method, determine the seasonal indices.

Year | I Quarter | II Quarter | III Quarter | IV Quarter
1 | 68 | 60 | 61 | 63
2 | 70 | 58 | 56 | 60
3 | 68 | 63 | 68 | 67
4 | 65 | 59 | 56 | 62
5 | 60 | 55 | 51 | 58

Answer:

Year | Total | Average (y) | x | x² | xy | Trend value
1 | 252 | 63.0 | -2 | 4 | -126 | 64.3
2 | 244 | 61.0 | -1 | 1 | -61 | 62.85
3 | 266 | 66.5 | 0 | 0 | 0 | 61.4
4 | 242 | 60.5 | 1 | 1 | 60.5 | 59.95
5 | 224 | 56.0 | 2 | 4 | 112 | 58.5
Total | | 307 | 0 | 10 | -14.5 |

Fitting of Linear Trend: y = a + bx. To find a & b:
∑y = na + b∑x: 307 = a × 5 + b × 0, so a = 61.4
∑xy = a∑x + b∑x²: -14.5 = a × 0 + b × 10, so b = -1.45
Therefore the equation is y = 61.4 - 1.45x. The quarterly increment is -1.45/4 = -0.36. The yearly trend value falls midway between the II and III quarters, so the II and III quarter values lie half an increment (0.36/2 = 0.18) above and below it.

Trend values:

Year | I Quarter | II Quarter | III Quarter | IV Quarter
1 | 64.84 | 64.48 | 64.12 | 63.76
2 | 63.39 | 63.03 | 62.67 | 62.31
3 | 61.94 | 61.58 | 61.22 | 60.86
4 | 60.50 | 60.14 | 59.78 | 59.42
5 | 59.06 | 58.70 | 58.34 | 57.98

Trend eliminated values, i.e. (Given value for that quarter / Trend value for that quarter) × 100:

Year | I Quarter | II Quarter | III Quarter | IV Quarter
1 | 104.9 | 93.05 | 95.13 | 98.81
2 | 110.4 | 92.02 | 89.36 | 96.29
3 | 109.8 | 102.3 | 111.1 | 110.1
4 | 107.4 | 98.10 | 93.68 | 104.3
5 | 101.6 | 93.70 | 87.42 | 100.03
Total | 534.1 | 479.2 | 476.7 | 509.6
Average | 106.8 | 95.84 | 95.33 | 101.9
Adjusted Seasonal Indices | 106.9 | 95.9 | 95.4 | 101.9

Sum of the averages: 106.8 + 95.84 + 95.33 + 101.9 = 399.90. Therefore the correction factor is 400/399.90, and each average is multiplied by it to obtain the adjusted seasonal indices.
RATIO TO MOVING AVERAGES This method is an improvement over the previous method. It is a widely used measure which involves the following steps:
o Obtain the centred 12-month (4-quarter) moving average values.
o Express the original values as a percentage of the centred moving average.
o Arrange these percentages according to years/months/quarters and average them for each month or quarter.
o Adjust these indices so that they total 1200 (monthly data) or 400 (quarterly data).
Problem 7: Calculate the seasonal indices.

Year | Quarter | Value | Centred 4-quarter moving average
1991 | I | 68 |
1991 | II | 62 |
1991 | III | 61 | 63.125
1991 | IV | 63 | 62.250
1992 | I | 65 | 62.375
1992 | II | 58 | 62.750
1992 | III | 66 | 62.875
1992 | IV | 61 | 63.875
1993 | I | 68 | 64.125
1993 | II | 63 | 64.500
1993 | III | 63 |
1993 | IV | 67 |

Answer: Ratio to moving averages: (61/63.125) × 100 = 96.63; (63/62.250) × 100 = 101.20; and so on.

Trend eliminated values:

Year | I Quarter | II Quarter | III Quarter | IV Quarter
1991 | - | - | 96.63 | 101.20
1992 | 104.21 | 92.43 | 104.97 | 95.50
1993 | 106.04 | 97.67 | - | -
Total | 210.25 | 190.1 | 201.6 | 196.7
Average | 105.13 | 95.05 | 100.80 | 98.35
Adjusted Seasonal Indices | 105.31 | 95.21 | 100.97 | 98.52

The averages total 399.33, so each is multiplied by 400/399.33 to obtain the adjusted seasonal indices.
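The ratio-to-moving-average steps can be sketched in Python as below. This is an illustrative sketch assuming an even period (4 for quarterly data) and a series that starts at the first season of a year; the function name and the flat list layout are not from the notes.

```python
def ratio_to_moving_average(values, period=4):
    """Seasonal indices by the ratio-to-moving-average method (even period)."""
    # Plain moving averages, then centred by averaging successive pairs
    ma = [sum(values[i:i + period]) / period
          for i in range(len(values) - period + 1)]
    centred = [(a + b) / 2 for a, b in zip(ma, ma[1:])]
    offset = period // 2          # first centred value belongs to this index
    ratios = [[] for _ in range(period)]
    for k, c in enumerate(centred):
        t = k + offset
        ratios[t % period].append(100 * values[t] / c)
    season_means = [sum(r) / len(r) for r in ratios]
    # Adjust so that the indices total 100 * period (400 for quarters)
    factor = 100 * period / sum(season_means)
    return [m * factor for m in season_means]

quarterly_values = [68, 62, 61, 63,    # 1991
                    65, 58, 66, 61,    # 1992
                    68, 63, 63, 67]    # 1993
seasonal = ratio_to_moving_average(quarterly_values)
print([round(s, 2) for s in seasonal])
```

The printed indices should agree with the adjusted seasonal indices of Problem 7 up to rounding of the intermediate averages.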
LINK RELATIVES This is the value of the given phenomenon in any season expressed as a percentage of its value in the preceding season. This involves the following steps: o
Convert the original data into link relatives.
o
Average these link relatives for each month.
o
Convert Link Relatives into Chain relatives.
o
Obtain CR for the first month
o
Obtain Corrected Chain relatives.
Problem 8:
82
Wheat Prices (10 Kgs.) Year
1990
1991
1992
1993
Quarter I Qtr. (Jan- Mar)
75
86
90
100
II Qtr. (Apr – June)
60
65
72
78
III Qtr. (Jul – Sept.)
54
63
66
72
IV Qtr. (Oct. – Dec.)
59
80
85
93
Answer:
Note: Link Relative for any quarter = (Current Quarter's Value / Previous Quarter's Value) * 100
Chain Relative for any quarter = (Link Relative of that quarter * Chain Relative of the preceding quarter) / 100

Link Relatives
Year              I Quarter  II Quarter  III Quarter  IV Quarter
1990              -          80          90           109.26
1991              145.76     75.58       96.92        126.98
1992              112.5      80          91.67        128.79
1993              117.65     78          92.31        129.17
Total             375.91     313.58      370.90       494.20
Average           125.3      78.395      92.725       123.55
Chain Relative    100        78.395      72.69        89.81
Adjusted CR       100        75.26       66.42        80.41     (Total 322.09)
Seasonal Indices  124.2      93.47       82.49        99.87     (Total 400)

New CR for the First Quarter: (LR of I Qtr. * CR of last Qtr.)/100 = (125.303 * 89.81) / 100 = 112.54
d = ¼(New CR of first Qtr. - 100) = ¼(112.54 - 100) = 3.135
Adjusted CR: 78.395 - 3.135 = 75.26; 72.69 - 6.27 = 66.42; 89.81 - 9.405 = 80.41
Each seasonal index is the adjusted CR expressed as a percentage of the average adjusted CR (322.09/4).
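The link relatives procedure can be sketched in Python as below, using the wheat prices of Problem 8. This is a hedged illustration rather than a definitive implementation; the variable names are our own, and the code keeps full precision, so the figures may differ from the worked answer in the second decimal place.

```python
# Quarterly wheat prices from Problem 8, in time order (1990 to 1993).
prices = [75, 60, 54, 59,
          86, 65, 63, 80,
          90, 72, 66, 85,
          100, 78, 72, 93]

# Step 1: link relatives (no link relative exists for the very first quarter).
link = [None] + [100 * prices[i] / prices[i - 1] for i in range(1, len(prices))]

# Step 2: average link relative for each quarter.
by_quarter = [[] for _ in range(4)]
for i, lr in enumerate(link):
    if lr is not None:
        by_quarter[i % 4].append(lr)
average_lr = [sum(q) / len(q) for q in by_quarter]

# Step 3: chain relatives, taking the first quarter as 100.
chain = [100.0]
for q in range(1, 4):
    chain.append(average_lr[q] * chain[-1] / 100)

# Step 4: new chain relative for the first quarter and the correction d.
new_first = average_lr[0] * chain[-1] / 100
d = (new_first - 100) / 4

# Step 5: adjusted chain relatives, then seasonal indices that total 400.
adjusted = [chain[q] - q * d for q in range(4)]
mean_adjusted = sum(adjusted) / 4
seasonal = [100 * a / mean_adjusted for a in adjusted]
print([round(s, 2) for s in seasonal])
```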
CYCLICAL VARIATIONS An approximate, or crude, method of measuring cyclical variations consists of estimating the trend and seasonal components and then eliminating their effect from the given time series.
RANDOM VARIATIONS These cannot be estimated accurately; we cannot obtain an estimate of the variance of the random components.
INDEX NUMBERS An index number is an indicator of the level of a phenomenon at a specific point of time in comparison with its level at some other specific point of time. Index numbers may be of price, production, growth rate, imports, exports, cost of living, etc. Generally, index numbers of various economic activities are found useful. For economists, index numbers are of use at every stage of planning, policy making, decision making, etc., and so index numbers may well be called 'Economic Barometers'. Just as barometers measure atmospheric pressure, index numbers measure changes occurring in the economic field.
An index number is a statistical device designed to measure the relative level of a group of related variables over a period of time and space. In other words, it is a number which expresses the overall level of a group of related variables at a given time, called the 'Current Period', as compared to the level at some other time, called the 'Base Period'. Generally, index numbers are expressed as percentages. Thus, if the index number of wholesale prices of food articles in 1995 as compared to 1990 is 150, the implication is that the overall level of wholesale prices of food articles in 1995 is 150% of the level in 1990. Here, 1995 is the current year and 1990 is the base year.
An index number can very well be calculated for an individual variable. For instance, if the price of a commodity is Rs. 5 in 1992 and Rs. 8 in 1995, the index number of price for the year 1995 with respect to the base 1992 is P = (8/5)*100 = 160. That is, the price of the commodity in 1995 is 160% of its price in 1992. Here, since only a single variable is considered, the index number is called a 'Relative'. In this particular case, it is the 'Price Relative'. The price relative is the price in the current year expressed as a percentage of the price in the base year. If p0 and p1 are the prices of a commodity in the base year and the current year respectively, the price relative is P = (p1/p0)*100.
An index number is thus an indicator which reflects the relative changes in the level of a certain phenomenon in any given period (or over a specified period of time), called the current period, with respect to its value in some fixed period, called the base period, selected for comparison.
DEFINITION "Index Numbers are statistical devices designed to measure the relative change in the level of a phenomenon (variable or group of variables) with respect to time, geographical location or other characteristics such as income, profession etc."
Generally, index numbers are of three types.
1. Price index number
2. Quantity index number
3. Value index number
Various price index numbers in use are the wholesale price index number, the consumer price index number, etc. The price index number may be of different groups of commodities: food articles, laboratory equipment, etc. Price Index Numbers indicate the general level of prices of articles in the current period as compared to that of the base period. Quantity Index Numbers are index numbers of the quantity of goods imported or exported, the quantity of agricultural produce, etc. Value Index Numbers are the index numbers of the total money value of the transactions taking place.
Note 1: A price index of 125 means that the price level in the current year is 125% of the price level in the base year.
Note 2: If the average price level in 1990 is double the average price level in 1980, the index number of price for 1990 with base 1980 is 200.
Note 3: An index number of 325 for 1995 with base 1970 means that the average price level has increased by 225% from 1970 to 1995.
PROBLEMS IN CONSTRUCTION
o The Purpose of Index Numbers
o Selection of Commodities or Items
o Data for Index Numbers
o Selection of Base Period
o Type of Average to be used
o System of Weighting
o Choice of formula
IMPORTANT NOTATIONS
o p0: Price of the Commodity in the Base Period
o p1: Price of the Commodity in the Current Period
o q0: Quantity of a Commodity consumed or purchased during the Base Period
o q1: Quantity of a Commodity consumed or purchased in the Current Period
o w: Weight assigned to a commodity according to its relative importance in the group.
o I: Simple Index Number or Price Relative obtained on expressing the current year price as a percentage of the base year price, given by: I = Price Relative = (p1/p0)*100
o P01: Price Index Number for the Current Year w.r.t. the Base Year
o P10: Price Index Number for the Base Year w.r.t. the Current Year
o Q01: Quantity Index Number for the Current Year w.r.t. the Base Year
o Q10: Quantity Index Number for the Base Year w.r.t. the Current Year
o V01: Value Index Number for the Current Year w.r.t. the Base Year
o p0j: Price for the jth commodity in the Base Year, j = 1, 2, 3, …, n.
o p1j: Price for the jth commodity in the Current Year
USES OF INDEX NUMBERS
1. Index numbers are useful to governments in formulating policies regarding economic activities such as taxation, imports and exports, grant of licences to new firms, bank rate, etc.
2. Index numbers are useful in comparing variations in production, price, etc.
3. Index numbers help industrialists and businessmen in planning their activities such as production of goods, their stock, etc.
4. The consumer price index number is used for the fixation of salaries and the grant of allowances to employees.
5. Consumer price index numbers are used for the evaluation of the purchasing power of money.
LIMITATIONS OF INDEX NUMBERS
1. While constructing index numbers, only some representative items are made use of. The index number so obtained may not indicate the changes in the concerned fields accurately.
2. As customs and habits change from time to time, the use of commodities also varies. And so, it is not possible to assign proper weights to various items.
3. Many formulae are used for the construction of index numbers. These formulae give different values for the index.
4. There is ample scope for bias in the construction of index numbers. By altering the price quotations or by improper selection of items, index numbers can be manipulated.
STEPS IN THE CONSTRUCTION OF INDEX NUMBERS
The various steps in the construction of index numbers are:
o Defining (stating) the purpose of the index number.
o Selecting the base period.
o Selecting the items.
o Obtaining price quotations.
o Selecting the appropriate system of weights.
o Selecting the appropriate formula.
1. Defining (stating) the purpose of the index number.
At the very outset, the purpose of the index number should be decided. As different index numbers are useful for different purposes, the purpose on hand may need a particular index number. A clear definition of purpose will help in the selection of the right index number. While constructing the index number, the selection of items, base period, weights, etc. depends mainly on the purpose. Absence of a clear definition of purpose often leads to the construction of an unsuitable index number.
2. Selecting the base period. While constructing an index number, an appropriate base period should be selected. The base period should be economically stable. There should not be abnormal variations. The period should be free of wars, floods, famines, etc. It should not be too distant from the current period. Again, the consumption pattern during the two periods should not differ much. Depending on the situation, a fixed base index number or a chain base index number may be preferred.
3. Selecting the items. Selection of items is mainly based on the purpose of the index number. Items differ with the purpose. For example, a wholesale price index number requires items which are transacted at the wholesale market. A consumer price index number requires items which are consumed by the particular group of people. However, in a consumer price index number, items differ with the habits, customs and standard of living. Generally, there are many items that could be included in the index number. But the list can be reduced by selecting representative items only.
4. Obtaining price quotations. After selecting the items for constructing an index number, price quotations for these items should be obtained. Since price is likely to vary from place to place, it is better to obtain price quotations from different places. Also, it is advisable to obtain price quotations from different agencies. Then, the prices should be averaged. Again, prices are likely to vary during the span of the base period and also during the span of the current period. Hence, it is better to collect price quotations at regular intervals. These quotations should be averaged and the average should be used in the construction.
5. Selecting the appropriate system of weights. The items considered in constructing index numbers often have varied importance, and so weights are attached to the items. Mostly, these weights are the quantities in the base period, those in the current period or those in any other period. Sometimes, a combination of quantities in different periods may be considered as weights.
6. Selecting the appropriate formula. The selection of the formula is based mainly on the availability of data regarding quantities. Accordingly, Laspeyre's, Paasche's, Fisher's or any other index number is calculated. While selecting the formula, care should be taken to see that maximum use of the available data is made.
PRICE INDEX NUMBER
The various price index numbers in common use are:
o Laspeyre's index number
o Paasche's index number
o Marshall-Edgeworth index number
o Fisher's ideal index number.
QUANTITY INDEX NUMBERS
Generally, quantity index numbers are calculated by adopting prices as weights. Some of the quantity index numbers are:
o Laspeyre's Quantity index number
o Paasche's Quantity index number
o Marshall-Edgeworth Quantity index number
o Fisher's ideal Quantity index number.
TESTS FOR AN INDEX NUMBER
A good index number should satisfy the following tests.
1. Time reversal test
2. Factor reversal test.
Time reversal test. This test was proposed by Irving Fisher. According to him, an index number (formula) should be such that when the base year and the current year are interchanged (reversed), the resulting index number is the reciprocal of the earlier one. The time reversal test requires that the index number computed backwards should be the reciprocal of the index number computed forwards, except for the constant of proportionality. Let P01 be the index number (based on a certain formula) for the period '1' with respect to the base period '0'. Let P10 be the index number (based on the same formula) for the period '0' with respect to the base period '1'. Then, the particular index number (formula) satisfies the time reversal test if P01 x P10 = 1. Here, P01 and P10 are mere ratios; they should not be expressed as percentages. The time reversal test is not satisfied by Laspeyre's and Paasche's index numbers. But it is satisfied by the Marshall-Edgeworth and Fisher's ideal index numbers.
Factor reversal test. This test was also proposed by Irving Fisher. Here, the argument is that the index number (formula) should be such that the price index and the quantity index computed according to the formula are equally effective in indicating changes. The factor reversal test requires that the product of the index number of price (with quantities as weights) and the index number of quantity (with prices as weights) should indicate the net change in value taking place between the two periods. Thus, the test is satisfied if P01 x Q01 = ∑p1q1 / ∑p0q0. Here, P01 and Q01 are mere ratios; they should not be expressed as percentages. Fisher's index number satisfies the factor reversal test. But Laspeyre's, Paasche's and Marshall-Edgeworth index numbers do not satisfy this test.
BIAS IN AN INDEX NUMBER
Generally, if the price of a commodity shows a significantly high increase, its use will decrease. The consumers lessen the use of such commodities. Thus, if base year quantities are used as weights, the greater variations of price will get greater weightage than needed. Therefore, such an index number will be an overestimate of the actual situation. Thus, Laspeyre's index number, which uses the base year quantities as weights, is generally an overestimate. It shows upward bias. On the other hand, if current year quantities are used as weights, the greater variations will be given lesser importance than needed. This leads to a downward bias. Thus, Paasche's index number, which uses current year quantities as weights, is generally an underestimate. It shows downward bias. However, Fisher's and Marshall-Edgeworth index numbers make use of base as well as current year quantities and so, they are free of bias.
FISHER'S INDEX NUMBER IS 'IDEAL'
Fisher's index number is called the 'Ideal Index Number' because of the following reasons.
o It is a geometric mean, which is considered the appropriate average for averaging ratios.
o It takes into account the base year quantities as well as the current year quantities.
o It is free of bias.
o It satisfies both the time reversal and the factor reversal tests.
CONSUMER PRICE INDEX NUMBER
The Consumer Price Index Number is an index number of the cost met by a specified class of consumers in buying a 'basket of goods and services'. Here, 'basket of goods and services' means the goods and services needed in the day to day life of the specified class of consumers. The pattern of consumption of goods is different in different classes. And so, the general index numbers fail to indicate the changes in costs with regard to various classes of consumers. Here, 'class of consumers' means a group of consumers having an almost identical pattern of consumption. Generally, the classes are those of workers of a factory, people belonging to a particular community, government employees, etc.
USES OF CONSUMER PRICE INDEX NUMBERS
1. Consumer Price Index Numbers indicate the changes in consumer prices. And so, they help governments in formulating policies regarding control of prices, taxation, imports and exports of commodities, etc.
2. They are used in granting allowances and other facilities to employees.
3. They are used for the evaluation of the purchasing power of money. They are used for deflating money.
4. They are used for comparing changes in the cost of living of different classes of people.
STEPS IN THE CONSTRUCTION OF CONSUMER PRICE INDEX NUMBER
The steps in the construction of a consumer price index number are:
1. Defining scope and coverage. At the very outset, it is necessary to decide the class of consumers for which the index number is required. The class may be that of bank employees, government employees, merchants, farmers, etc. In any case, the geographical coverage should also be decided. That is, the locality, city or town where the class dwells should be mentioned. The consumers in the class should have almost the same pattern of consumption.
2. Conducting a family budget enquiry and selecting the weights. Having decided about the scope and coverage, the next step is to conduct a sample survey of consumer families regarding their budget on various items. The survey should cover a reasonably good number of representative families. It should be conducted during a period of economic stability. In the survey, information regarding commodities consumed by the families, their quality, and the respective budget is collected. The items included in the index number are classified generally under the heads (1) Food, (2) Clothing, (3) Fuel and lighting, (4) Miscellaneous. A sufficiently large number of representative items is included under each head.
3. Obtaining price quotations. The quotations of retail prices of different commodities are collected from the local market. The quotations are collected from different agencies and from different places. Then, they are averaged and the averages are made use of. The price quotations of the current period and those of the base period should be collected.
4. Computing the index number. There are two methods of computation of the consumer price index number. They are:
a. Aggregative expenditure method.
b. Family budget method.
Aggregative expenditure method: Here, the quantities used in the base year are taken as weights. Thus, the consumer price index number by this method is: P01 = (Total expenditure in the current year / Total expenditure in the base year) x 100 = (∑p1q0 / ∑p0q0) x 100
Family budget method: The consumer price index number by this method is the weighted arithmetic mean of the price relatives. The weights assigned are the expenditures in a normal period. Thus, the consumer price index number is: P01 = ∑WI / ∑W, where W = p0q0 and I = (p1/p0) x 100
METHODS (along with formulas)
o Simple (Unweighted) Aggregate Method:
P01 = (∑p1 / ∑p0) * 100
Q01 = (∑q1 / ∑q0) * 100
o Weighted Aggregate Method:
P01 = (∑wp1 / ∑wp0) * 100
o Laspeyre's Price Index or Base Year Method:
P01La = (∑p1q0 / ∑p0q0) * 100
o Paasche's Price Index:
P01Pa = (∑p1q1 / ∑p0q1) * 100
o Fisher's Price Index:
P01F = [P01La * P01Pa]^(1/2)
o Marshall-Edgeworth Price Index Number:
P01Ma = [(∑p1q1 + ∑p1q0) / (∑p0q1 + ∑p0q0)] * 100
Problem 1: From the following, compute Price Index Numbers using all four methods.

              1970               1980
Commodities   Price   Quantity   Price   Quantity
A             20      8          40      6
B             50      10         60      5
C             40      15         50      15
D             20      20         20      25
Answer:

              1970        1980
Commodities   p0    q0    p1    q1    p0q0   p0q1   p1q0   p1q1
A             20    8     40    6     160    120    320    240
B             50    10    60    5     500    250    600    300
C             40    15    50    15    600    600    750    750
D             20    20    20    25    400    500    400    500
Total                                 1660   1470   2070   1790
Laspeyre's Index Number: P01La = (∑p1q0 / ∑p0q0) * 100 = (2070 / 1660) * 100 = 124.699
Paasche's Index Number: P01Pa = (∑p1q1 / ∑p0q1) * 100 = (1790 / 1470) * 100 = 121.77
Fisher's Ideal Index Number: P01F = [P01La * P01Pa]^(1/2) = [124.699 * 121.77]^(1/2) = 123.23
Marshall-Edgeworth Index Number: P01Ma = [(∑p1q1 + ∑p1q0) / (∑p0q1 + ∑p0q0)] * 100 = [(1790 + 2070) / (1470 + 1660)] * 100 = 123.32
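The four weighted price index formulas can be checked with a short Python sketch on the data of Problem 1. The function name `price_indices` and the dictionary keys are our own choices for illustration, not notation from the notes.

```python
from math import sqrt


def price_indices(p0, q0, p1, q1):
    """Return the four weighted price index numbers as percentages."""
    def total(prices, quantities):
        return sum(p * q for p, q in zip(prices, quantities))

    p0q0, p0q1 = total(p0, q0), total(p0, q1)
    p1q0, p1q1 = total(p1, q0), total(p1, q1)

    laspeyres = 100 * p1q0 / p0q0          # base year quantities as weights
    paasche = 100 * p1q1 / p0q1            # current year quantities as weights
    fisher = sqrt(laspeyres * paasche)     # geometric mean of the two
    marshall_edgeworth = 100 * (p1q1 + p1q0) / (p0q1 + p0q0)
    return {
        "laspeyres": laspeyres,
        "paasche": paasche,
        "fisher": fisher,
        "marshall_edgeworth": marshall_edgeworth,
    }


# Problem 1 data for commodities A, B, C, D (1970 is the base year).
p0, q0 = [20, 50, 40, 20], [8, 10, 15, 20]
p1, q1 = [40, 60, 50, 20], [6, 5, 15, 25]
result = price_indices(p0, q0, p1, q1)
for name, value in result.items():
    print(name, round(value, 2))
```

Fisher's index always lies between Laspeyre's and Paasche's indices because it is their geometric mean.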
Problem 2: From the following, construct the index number of the group of four commodities by using Fisher's Ideal method.

              Base Year             Current Year
Commodities   Price   Expenditure   Price   Expenditure
A             2       40            5       75
B             4       16            8       40
C             1       10            2       24
D             5       25            10      60
Answer: The quantities are obtained by dividing the expenditure by the price: q0 = p0q0/p0 and q1 = p1q1/p1.

              Base Year      Current Year
Commodities   p0    p0q0     p1    p1q1     q0    q1    p1q0   p0q1
A             2     40       5     75       20    15    100    30
B             4     16       8     40       4     5     32     20
C             1     10       2     24       10    12    20     12
D             5     25       10    60       5     6     50     30
Total               91             199                  202    92

Fisher's Ideal Price Index:
P01F = [P01La * P01Pa]^(1/2) = √[(202 * 199) / (91 * 92)] * 100 = 219.12
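When only prices and expenditures are given, as in Problem 2, the quantities must first be recovered as expenditure divided by price. A minimal Python sketch of that step follows; the name `fisher_from_expenditure` and the data layout are assumptions made for illustration.

```python
from math import sqrt

# Problem 2 data: commodity -> (price, expenditure).
base = {"A": (2, 40), "B": (4, 16), "C": (1, 10), "D": (5, 25)}
current = {"A": (5, 75), "B": (8, 40), "C": (2, 24), "D": (10, 60)}


def fisher_from_expenditure(base, current):
    """Fisher's ideal price index (as a percentage) from price/expenditure data."""
    p0q0 = p1q1 = p1q0 = p0q1 = 0.0
    for item, (p0, e0) in base.items():
        p1, e1 = current[item]
        q0, q1 = e0 / p0, e1 / p1   # quantity = expenditure / price
        p0q0 += e0
        p1q1 += e1
        p1q0 += p1 * q0
        p0q1 += p0 * q1
    return 100 * sqrt((p1q0 / p0q0) * (p1q1 / p0q1))


fisher_index = fisher_from_expenditure(base, current)
print(round(fisher_index, 2))
```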
TEST OF CONSISTENCY
o Unit Test: This test requires that the Index Number formula should be independent of the units in which the prices or the quantities of various commodities are quoted. All the formulas discussed earlier, other than the Simple Aggregate of Prices (Quantities), satisfy this test.
o Time Reversal Test: P01 * P10 = 1
Other than Laspeyre's and Paasche's Index Numbers, all others satisfy this test.
o Factor Reversal Test: P01 * Q01 = [∑p1q1 / ∑p0q0]
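The time reversal and factor reversal tests can be checked numerically. The sketch below uses the Problem 1 data and treats each index as a plain ratio (not a percentage), as the tests require. The helper names are our own; the quantity index is obtained from the same formula by interchanging the roles of prices and quantities.

```python
from math import sqrt, isclose


def total(a, b):
    return sum(x * y for x, y in zip(a, b))


def laspeyres(p0, q0, p1, q1):
    return total(p1, q0) / total(p0, q0)


def paasche(p0, q0, p1, q1):
    return total(p1, q1) / total(p0, q1)


def fisher(p0, q0, p1, q1):
    return sqrt(laspeyres(p0, q0, p1, q1) * paasche(p0, q0, p1, q1))


def satisfies_time_reversal(index, p0, q0, p1, q1):
    forward = index(p0, q0, p1, q1)     # P01
    backward = index(p1, q1, p0, q0)    # P10: base and current interchanged
    return isclose(forward * backward, 1.0)


def satisfies_factor_reversal(index, p0, q0, p1, q1):
    price_index = index(p0, q0, p1, q1)       # P01
    quantity_index = index(q0, p0, q1, p1)    # Q01: prices and quantities interchanged
    return isclose(price_index * quantity_index, total(p1, q1) / total(p0, q0))


p0, q0 = [20, 50, 40, 20], [8, 10, 15, 20]
p1, q1 = [40, 60, 50, 20], [6, 5, 15, 25]
for formula in (laspeyres, paasche, fisher):
    print(formula.__name__,
          satisfies_time_reversal(formula, p0, q0, p1, q1),
          satisfies_factor_reversal(formula, p0, q0, p1, q1))
```

On this data only Fisher's formula passes both checks, in line with the statements above.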
Problem 3: From the following, check whether (i) Laspeyre's (ii) Paasche's (iii) Fisher's Index Numbers satisfy the Time and Factor Reversal Tests.

              Base Year          Current Year
Commodities   Price   Quantity   Price   Quantity
A             6.5     500        10.8    560
B             2.8     124        2.9     148
C             4.7     69         8.2     78
D             10.9    38         13.4    24
E             8.6     49         10.8    27
Answer:

Commodities   p0q0     p1q0     p0q1     p1q1
A             3250     5400     3640     6048
B             347.2    359.6    414.4    429.2
C             324.3    565.8    366.6    639.6
D             414.2    509.2    261.6    321.6
E             421.4    529.2    232.2    291.6
Total         4757.1   7363.8   4914.8   7730

Laspeyre's Price Index Number: (7363.8 / 4757.1) * 100 = 154.80
Paasche's Price Index Number: (7730 / 4914.8) * 100 = 157.28
Laspeyre's Quantity Index Number: (4914.8 / 4757.1) * 100 = 103.32
Paasche's Quantity Index Number: (7730 / 7363.8) * 100 = 104.97
Fisher's Ideal Price Index Number: 156.03
Fisher's Ideal Quantity Index Number: 104.14
By trial we can find that Fisher's Index Number satisfies both the tests.

Problem 4: From the following, calculate the Cost of Living Index Number.

Commodities   Base Year Price   Current Year Price   Weights
A             30                47                   4
B             8                 12                   1
C             14                18                   3
D             22                15                   2
E             25                30                   1
Answer: Here P is the price relative, P = (p1/p0) * 100, and W is the weight.

Commodities   P        WP
A             156.67   626.67
B             150      150
C             128.57   385.71
D             68.18    136.36
E             120      120
Total                  1418.74

Cost of Living Index Number = ∑WP / ∑W = 1418.74/11 = 128.98
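The family budget method used in Problem 4 is a weighted arithmetic mean of price relatives, which can be sketched in Python as follows. The function name `cost_of_living_index` is our own label for illustration.

```python
def cost_of_living_index(base_prices, current_prices, weights):
    """Weighted arithmetic mean of price relatives (family budget method)."""
    relatives = [100 * p1 / p0 for p0, p1 in zip(base_prices, current_prices)]
    weighted_sum = sum(w * r for w, r in zip(weights, relatives))
    return weighted_sum / sum(weights)


# Problem 4 data for commodities A to E.
base_prices = [30, 8, 14, 22, 25]
current_prices = [47, 12, 18, 15, 30]
weights = [4, 1, 3, 2, 1]
index = cost_of_living_index(base_prices, current_prices, weights)
print(round(index, 2))
```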
Notes by: Prof.Sudheer Pai, RNSIT, Bangalore
PROBABILITY
INTRODUCTION Suppose a coin is tossed. The toss may result in the occurrence of 'Head' or in the occurrence of 'Tail'. Here, the chances of head and tail are equal. In other words, the probability of occurrence of head is ½ and the probability of occurrence of tail is ½. Thus, probability is a numerical measure which indicates the chance of occurrence.
There are three systematic approaches to the study of probability. They are:
1. The classical approach
2. The empirical approach
3. The axiomatic approach
Each of these approaches has its own merits and demerits.
Chance has a part to play in almost all activities. In every such activity, there is indefiniteness. For example,
1. A new-born child may be male or female.
2. A stone aimed at a mango on a tree may hit it or it may not.
3. A student who takes the P.U.E. examination may score any mark between 0 and 100.
In the midst of such indefiniteness, predictions are made. This necessitates a systematic study of probabilistic happenings.
RANDOM EXPERIMENT (Stochastic experiment, Trial) There are two types of experiments. They are— (i) Deterministic experiment and (ii) Random experiment. A deterministic experiment, when repeated under the same conditions, results in the same outcome. It has a unique outcome. Random experiment is an experiment which may not result in the same outcome when repeated under the same conditions. It is an experiment which does not have a unique outcome. For example, 1. The experiment of 'Toss of a coin' is a random experiment. It is so because when a coin is tossed the result may be 'Head' or it may be 'Tail'. 2. The experiment of 'Drawing a card randomly from a pack of playing cards' is a random experiment. Here, the result of the draw may be any one of the 52 cards.
SAMPLE SPACE The set of all possible outcomes of a random experiment is the Sample space. The sample space is denoted by S. The outcomes of the random experiment (elements of the sample space) are called sample points or outcomes or cases. A sample space with finite number of outcomes is a finite sample space. A sample space with infinite number of outcomes is an infinite sample space. Ex1. While throwing a die, the sample space is S = {1, 2, 3, 4, 5, 6}. This is a finite sample space. Ex2. While tossing two coins simultaneously, the sample space is S = {HH, HT, TH, TT}. This is a finite sample space. Ex3. Consider the toss of a coin successively until a head is obtained. Let the number of tosses be noted. Here, the sample space is S= {1, 2, 3,4....}.This is an infinite sample space.
EVENT An event is a subset of the sample space. Events are denoted by A, B, C etc. An event which does not contain any outcome is a null event (impossible event). It is denoted by Φ. An event which has only one outcome is an elementary event or simple event. An event which has more than one outcome is a compound event. An event which contains all the outcomes is equal to the sample space and is called the sure event or certain event. Ex.1. While throwing a die, A = {2,4,6} is an event. It is the event that the throw results in an even number. Here, A is a compound event. Ex.2. While tossing two coins, A = {TT} is an event. It is the event that the toss results in two tails. Here, A is a simple event. The outcomes which belong to an event are said to be favourable to that event. The event happens whenever the experiment results in a favourable outcome. Otherwise, the event does not happen. While throwing a die, the event A = {2,4,6} has three favourable outcomes, namely, 2, 4 and 6. When the throw results in 2, 4 or 6, event A occurs.
COMPLEMENT OF AN EVENT
Let A be an event. Then, the complement of A is the event of non-occurrence of A. It is the event constituted by the outcomes which are not favourable to A. The complement of A is denoted by A′ or Ā or Ac. While throwing a die, if A = {2,4,6}, its complement is A′ = {1,3,5}. Here, A is the event that the throw results in an even number. A′ is the event that the throw does not result in an even number. That is, A′ is the event that the throw results in an odd number.
SUB-EVENT. Let A and B be two events such that event A occurs whenever event B occurs. Then, event B is sub-event of event A. While throwing a die, let A = {2,4,6} and B = {2}. Here, B is a sub-event of event A. That is, B ⊂ A.
UNION OF EVENTS. Definition: The union of two or more events is the event of occurrence of at least one of these events. Thus, the union of two events A and B is the event of occurrence of at least one of them. The union of A and B is denoted by A∪B or A+B or 'A or B'. Ex1. While tossing two coins simultaneously, let A = {HH} and B = {TT} be two events. Then, their union is A∪B = {HH, TT}. Here, A is the event of occurrence of two heads and B is the event of occurrence of two tails. Their union A∪B is the event of occurrence of two heads or two tails. Ex2. While throwing a die, let A = {2,4,6}, B = {3,6} and C = {4,5,6} be three events. Then, their union is A∪B∪C = {2,3,4,5,6}.
INTERSECTION OF EVENTS The intersection of two or more events is the event of simultaneous occurrence of all these events. Thus, the intersection of two events A and B is the event of occurrence of both of them. The intersection of A and B is denoted by A∩B or AB or 'A and B'. Ex1. While tossing two coins, let A = {HH,TT} and B = {HH,HT,TH} be two events. Then, their intersection is A∩B = {HH}. Ex2. While throwing a die, let A = {2,4,6}, B = {3,6} and C = {4,5,6} be three events. Then, their intersection is A∩B∩C = {6}.
EQUALLY LIKELY EVENTS (Equiprobable events) Two or more events are equally likely if they have equal chance of occurrence. That is, equally likely events are such that none of them has greater chance of occurrence than the others. Ex. 1. While tossing a fair coin, the outcomes 'Head' and 'Tail' are equally likely. Ex.2. While throwing a fair die, the events A={2,4,6}, B = {1,3, 5}&C={ 1,2, 3} are equally likely. A sample space is called an equiprobable space if the outcomes are equally likely. For instance, the sample space S = {1, 2, 3, 4, 5, 6} of throw of a fair die is equiprobable space because the six outcomes are equally likely.
MUTUALLY EXCLUSIVE EVENTS (Disjoint events) Two or more events are mutually exclusive if only one of them can occur at a time. That is, the occurrence of any of these events totally excludes the occurrence of the other events. Mutually exclusive events cannot occur together. Ex.1. While tossing a coin, the outcomes 'Head' and 'Tail' are mutually exclusive because when the coin is tossed once, the result cannot be Head as well as Tail. Ex.2. While throwing a die, the events A = {2,4,6}, B = {3,5} and C = {1} are mutually exclusive. If A is an event, A and A' are mutually exclusive. It should be noted that the intersection of mutually exclusive events is a null event.
EXHAUSTIVE EVENTS (Exhaustive set of events) A set of events is exhaustive if one or the other of the events in the set occurs whenever the experiment is conducted. That is, the set of events exhausts all the outcomes of the experiment. The union of exhaustive events is equal to the sample space. Ex.1. While throwing a die, the six outcomes together are exhaustive. But here, if any one of these outcomes is left out, the remaining five outcomes are not exhaustive. Ex.2. While throwing a die, events A = {2,4,6}, B = {3,6} and C = {1,5,6} together are exhaustive.
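Since events are subsets of the sample space, the operations described above map directly onto set operations. The following Python sketch illustrates this for the throw of a die; the helper names `mutually_exclusive` and `exhaustive` are our own and are offered only as an illustration.

```python
from itertools import combinations

S = {1, 2, 3, 4, 5, 6}            # sample space for the throw of a die
A, B, C = {2, 4, 6}, {3, 6}, {4, 5, 6}

complement_A = S - A              # outcomes not favourable to A
union = A | B | C                 # at least one of A, B, C occurs
intersection = A & B & C          # A, B and C all occur


def mutually_exclusive(*events):
    """True if no two of the events share an outcome."""
    return all(not (x & y) for x, y in combinations(events, 2))


def exhaustive(sample_space, *events):
    """True if the union of the events is the whole sample space."""
    return set().union(*events) == sample_space


print(complement_A, union, intersection)
print(mutually_exclusive({2, 4, 6}, {3, 5}, {1}))
print(exhaustive(S, {2, 4, 6}, {3, 6}, {1, 5, 6}))
```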
THE CLASSICAL APPROACH
CLASSICAL (MATHEMATICAL, PRIORI) DEFINITION Let a random experiment have n equally likely, mutually exclusive and exhaustive outcomes. Let m of these outcomes be favourable to an event A. Then, the probability of A is:
P(A) = m/n = (Number of favourable outcomes) / (Total number of outcomes)
Limitations of classical definition: This definition is applicable only when (i) The outcomes are equally likely, mutually exclusive and exhaustive. (ii) The number of outcomes n is finite.
RESULT 1
P(A) is a value between 0 and 1. That is, 0 ≤ P(A) ≤ 1.
Proof: Let a random experiment have n equally likely, mutually exclusive and exhaustive outcomes. Let m of these outcomes be favourable to event A. Then, P(A) = m/n. Here, the least possible value of m is 0. Also, the highest possible value of m is n. And so, 0 ≤ m ≤ n. Dividing throughout by n, 0/n ≤ m/n ≤ n/n ⇒ 0 ≤ P(A) ≤ 1. Thus, P(A) is a value between 0 and 1.
RESULT 2
P(A') = 1 - P(A). That is, P(A) = 1 - P(A').
Proof: In a random experiment with n equally likely, mutually exclusive and exhaustive outcomes, if m outcomes are favourable to event A, the remaining (n - m) outcomes are favourable to the complementary event A'. Therefore, P(A') = (n - m)/n = 1 - m/n = 1 - P(A). Thus, P(A') = 1 - P(A). That is, P(A) = 1 - P(A').
Exercise 1:
a. Find the probability of head in the toss of a fair coin.
Solution: The sample space is S = {H, T}. There are n = 2 equally likely, mutually exclusive and exhaustive outcomes. One outcome, namely H, is favourable to the event 'A: toss results in head'. Thus, m = 1.
∴ P[head] = P(A) = m/n = ½
b. Find the probability that a throw of an unbiased die results in (i) an ace (number 1) (ii) an even number (iii) a multiple of 3.
Solution: The sample space is S = {1,2,3,4,5,6}. There are n = 6 equally likely, mutually exclusive and exhaustive outcomes. Let events A, B and C be:
A: throw results in an ace (number 1)
B: throw results in an even number
C: throw results in a multiple of 3
(i) Event A has one favourable outcome. ∴ P[ace] = P(A) = m/n = 1/6
(ii) Event B has 3 favourable outcomes, namely, 2, 4 and 6. ∴ P[even number] = P(B) = m/n = 3/6 = ½
(iii) Event C has 2 favourable outcomes, namely, 3 and 6. ∴ P[multiple of 3] = P(C) = m/n = 2/6 = 1/3
c. A bag contains 3 white, 4 red and 2 green balls. One ball is selected at random from the bag. Find the probability that the selected ball is (i) white (ii) non-white (iii) white or green.
Solution: The bag has 9 balls in total. Since the ball drawn can be any one of them, there are 9 equally likely, mutually exclusive and exhaustive outcomes. Let events A, B and C be:
A: selected ball is white
B: selected ball is non-white
C: selected ball is white or green
(i) There are 3 white balls in the bag. Therefore, out of the 9 outcomes, 3 are favourable to event A. ∴ P[white ball] = P(A) = 3/9 = 1/3
(ii) Event B is the complement of event A. Therefore, P[non-white ball] = P(B) = 1 - P(A) = 1 - 1/3 = 2/3
(iii) There are 3 white and 2 green balls in the bag. Therefore, out of 9 outcomes, 5 are either white or green. ∴ P[white or green ball] = P(C) = 5/9
d. One card is drawn from a well-shuffled pack of playing cards. Find the probability that the card drawn (i) is a Heart (ii) is a King (iii) belongs to a red suit (iv) is a King or a Queen (v) is a King or a Heart.
Solution: A pack of playing cards has 52 cards. There are four suits, namely, Spade, Club, Heart and Diamond. In each suit, there are thirteen denominations: Ace (1), 2, 3, ..., 10, Jack (Knave), Queen and King. A card selected at random may be any one of the 52 cards. Therefore, there are 52 equally likely, mutually exclusive and exhaustive outcomes. Let events A, B, C, D and E be:
A: selected card is a Heart
B: selected card is a King
C: selected card belongs to a red suit
D: selected card is a King or a Queen
E: selected card is a King or a Heart
(i) There are 13 Hearts in a pack. Therefore, 13 outcomes are favourable to event A. ∴ P[Heart] = P(A) = 13/52 = ¼
(ii) There are 4 Kings in a pack. Therefore, 4 outcomes are favourable to event B. ∴ P[King] = P(B) = 4/52 = 1/13
(iii) There are 13 Hearts and 13 Diamonds in a pack. Therefore, 26 outcomes are favourable to event C. ∴ P[Red card] = P(C) = 26/52 = ½
(iv) There are 4 Kings and 4 Queens in a pack. Therefore, 8 outcomes are favourable to event D. ∴ P[King or Queen] = P(D) = 8/52 = 2/13
(v) There are 4 Kings and 13 Hearts in a pack. Among these, one card is the King of Hearts. Therefore, (4+13-1) = 16 outcomes are favourable to event E. ∴ P[King or Heart] = P(E) = 16/52 = 4/13
e. A bag contains 8 tickets which are marked with the numbers 1, 2, 3, ..., 8. Find the probability that a ticket drawn at random from the bag is marked with (i) an even number (ii) a multiple of 3.
Solution: The selection can be any one of the eight numbers. Therefore, there are 8 equally likely, mutually exclusive and exhaustive outcomes. Let events A and B be:
A: selected number is even.
B: selected number is a multiple of 3.
(i) Four of the selections, namely, 2, 4, 6 and 8, are favourable to event A. ∴ P[even number] = P(A) = 4/8 = ½
(ii) Two of the selections, namely, 3 and 6, are favourable to event B. ∴ P[multiple of 3] = P(B) = 2/8 = ¼
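The counting in the card problem (d) can be checked by listing the 52 cards and applying P = m/n directly. The following is only a sketch: the names SUITS, RANKS, DECK and classical_probability are illustrative choices, not part of the original notes.

```python
from fractions import Fraction
from itertools import product

SUITS = ["Spade", "Club", "Heart", "Diamond"]
RANKS = ["Ace", "2", "3", "4", "5", "6", "7", "8", "9", "10",
         "Jack", "Queen", "King"]
DECK = list(product(RANKS, SUITS))  # 52 equally likely outcomes


def classical_probability(sample_space, is_favourable):
    """Return m/n: favourable outcomes over all equally likely outcomes."""
    outcomes = list(sample_space)
    m = sum(1 for outcome in outcomes if is_favourable(outcome))
    return Fraction(m, len(outcomes))


p_heart = classical_probability(DECK, lambda c: c[1] == "Heart")
p_king = classical_probability(DECK, lambda c: c[0] == "King")
p_red = classical_probability(DECK, lambda c: c[1] in ("Heart", "Diamond"))
p_king_or_queen = classical_probability(DECK, lambda c: c[0] in ("King", "Queen"))
# The King of Hearts satisfies both conditions but is counted only once.
p_king_or_heart = classical_probability(
    DECK, lambda c: c[0] == "King" or c[1] == "Heart")

print(p_heart, p_king, p_red, p_king_or_queen, p_king_or_heart)
```

Using Fraction keeps the answers in the same reduced form (1/4, 1/13, ...) as the hand calculation.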
Exercise 2:
a. A fair coin is tossed twice. Find the probability that the tosses result in (i) two heads (ii) at least one head.
b. Two fair dice are rolled. Find the probability that (i) both the dice show number 6 (ii) the sum of numbers obtained is 7 or 10 (iii) the sum of the numbers obtained is less than 11 (iv) the sum is divisible by 3.
c. A box has 5 white, 4 red and 3 green balls. Two balls are drawn at random from the box. Find the probability that they are (i) of the same colour (ii) of different colours.
d. Two cards are drawn at random from a pack of cards. Find the probability that (i) both are Spades (ii) both are Kings (iii) one is a Spade and the other is a Heart (iv) the cards belong to the same suit (v) the cards belong to different suits.
e. A bag has 9 tickets marked with numbers 1, 2, 3, ..., 9. Two tickets are drawn at random from the bag. Find the probability that both the numbers drawn are (i) even (ii) odd.
Solution:
a. The sample space is S = {HH, HT, TH, TT}. There are four equally likely, mutually exclusive and exhaustive outcomes. Let events A and B be:
A: the tosses result in 2 heads
B: the tosses result in at least one head.
(i) One outcome, HH, is favourable to event A. ∴ P[two heads] = P(A) = ¼
(ii) 3 outcomes, HH, HT and TH, are favourable to event B. ∴ P[at least one head] = P(B) = ¾
b. The sample space is
S = {(1,1), (1,2), (1,3), ..., (1,6),
(2,1), (2,2), (2,3), ..., (2,6),
...
(6,1), (6,2), (6,3), ..., (6,6)}
There are 6 × 6 = 36 equally likely, mutually exclusive and exhaustive outcomes. Let events A, B, C and D be:
A: both the dice show number 6
B: sum of the numbers obtained is 7 or 10
C: sum of the numbers obtained is less than 11
D: sum of the numbers obtained is divisible by 3.
(i) One outcome, namely, (6, 6), is favourable to event A. ∴ P[6 on both the dice] = P(A) = 1/36
(ii) Nine outcomes, namely, (6,1), (5,2), (4,3), (3,4), (2,5), (1,6), (6,4), (5,5) and (4,6), are favourable to event B. ∴ P[sum is 7 or 10] = P(B) = 9/36 = ¼
(iii) The complement of event C is C': sum is 11 or 12. Event C' has three favourable outcomes, namely, (6,5), (5,6) and (6,6). ∴ P[sum is less than 11] = 1 - P[sum is 11 or 12] = 1 - 3/36 = 1 - 1/12 = 11/12
(iv) The sum is divisible by 3 if it is 3, 6, 9 or 12. Therefore, the outcomes favourable to event D are (2,1), (1,2), (5,1), (4,2), (3,3), (2,4), (1,5), (6,3), (5,4), (4,5), (3,6) and (6,6). Thus, 12 outcomes are favourable. ∴ P[sum is divisible by 3] = P(D) = 12/36 = 1/3.
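The 36-point sample space for two dice is small enough to enumerate, which gives an independent check on the counts 1, 9, 33 and 12 used in part b. This is a minimal sketch; the helper name prob is an assumption made for illustration.

```python
from fractions import Fraction
from itertools import product

# All ordered pairs (first die, second die): 6 x 6 = 36 outcomes.
rolls = list(product(range(1, 7), repeat=2))


def prob(event):
    """Classical probability of an event given as a predicate on a roll."""
    favourable = [r for r in rolls if event(r)]
    return Fraction(len(favourable), len(rolls))


p_double_six = prob(lambda r: r == (6, 6))
p_sum_7_or_10 = prob(lambda r: sum(r) in (7, 10))
p_sum_below_11 = prob(lambda r: sum(r) < 11)
p_sum_div_by_3 = prob(lambda r: sum(r) % 3 == 0)

print(p_double_six, p_sum_7_or_10, p_sum_below_11, p_sum_div_by_3)
```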
c. The box has 12 balls in total. A random draw of two balls has 12C2 equally likely, mutually exclusive and exhaustive outcomes. Let events A and B be:
A: the balls drawn are of the same colour
B: the balls drawn are of different colours.
(i) Event A happens when the drawn balls are both white or both red or both green. Out of 12C2 selections, 5C2 selections are both white, 4C2 selections are both red and 3C2 selections are both green. Thus, 5C2 + 4C2 + 3C2 outcomes are favourable to event A.
P[balls of same colour] = (5C2 + 4C2 + 3C2) / 12C2 = (10 + 6 + 3)/66 = 19/66 = 0.2879
(ii) Event B is the complement of event A. Therefore,
P[balls of different colours] = 1 - P[same colour] = 1 - P(A) = 1 - 19/66 = 47/66
d. A random draw of 2 cards from a pack of 52 cards has 52C2 equally likely, mutually exclusive and exhaustive outcomes. Let events A, B, C, D and E be:
A: both the cards drawn are Spades
B: both the cards drawn are Kings
C: the cards drawn are one Spade and one Heart
D: the cards belong to the same suit
E: the cards belong to different suits.
(i) Since there are 13 Spades in a pack, event A has 13C2 favourable outcomes. Therefore,
P[both Spades] = 13C2 / 52C2 = (13 × 6)/(26 × 51) = 1/17
(ii) Since there are 4 Kings in a pack, event B has 4C2 favourable outcomes. Therefore,
P[both Kings] = 4C2 / 52C2 = 6/(26 × 51) = 1/221
(iii) Here, one card should be a Spade and the other should be a Heart. From 13 Spades, one Spade can be had in 13C1 ways. From 13 Hearts, one Heart can be had in 13C1 ways. Thus, 13C1 × 13C1 outcomes are favourable to event C. Therefore,
P[a Spade and a Heart] = (13C1 × 13C1) / 52C2 = (13 × 13)/(26 × 51) = 13/102
(iv) Here, the cards should be 2 Spades or 2 Clubs or 2 Hearts or 2 Diamonds. There are 13 cards of each suit. In each case, a selection of two cards can be made in 13C2 ways. Thus, the total number of favourable cases is 13C2 + 13C2 + 13C2 + 13C2.
P[cards of same suit] = (13C2 + 13C2 + 13C2 + 13C2) / 52C2 = (4 × 78)/(26 × 51) = 4/17
(v) Event E is the complement of event D. Therefore,
P[cards of different suits] = 1 - P[cards of same suit] = 1 - 4/17 = 13/17
e. There are 9C2 = 36 equally likely, mutually exclusive and exhaustive outcomes. Let events A and B be:
A: both the selected numbers are even.
B: both the selected numbers are odd.
(i) Out of 9 numbers, 4 numbers, namely, 2, 4, 6 and 8, are even. Therefore, 4C2 selections will have two even numbers. Therefore,
P[both even] = P(A) = 4C2 / 9C2 = 6/36 = 1/6
(ii) Out of 9 numbers, 5 numbers, namely, 1, 3, 5, 7 and 9, are odd. Therefore, 5C2 selections will have two odd numbers. Therefore,
P[both odd] = P(B) = 5C2 / 9C2 = 10/36 = 5/18
Exercise 3: A bag contains 3 red, 4 green and 3 yellow marbles. Three marbles are randomly drawn from the bag. What is the probability that they are of (i) the same colour (ii) different colours (one of each colour)?
Solution: There are 10C3 = 120 equally likely, mutually exclusive and exhaustive outcomes. Let events A and B be:
A: selected marbles are of the same colour.
B: selected marbles are of different colours.
(i) The marbles drawn should be all red or all green or all yellow. Therefore, 3C3 + 4C3 + 3C3 outcomes are favourable to event A. Therefore,
P[marbles of the same colour] = (3C3 + 4C3 + 3C3) / 10C3 = (1 + 4 + 1)/120 = 1/20
(ii) The marbles should be one of each colour. Therefore, 3C1 × 4C1 × 3C1 outcomes are favourable to event B. Therefore,
P[marbles of different colours] = (3C1 × 4C1 × 3C1) / 10C3 = 36/120 = 3/10
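The nCr counts in the two-card and three-marble problems can be reproduced with the standard-library function math.comb. The sketch below is illustrative only; the variable names are assumptions, and the colour counts are those stated in Exercise 3.

```python
from fractions import Fraction
from math import comb, prod

# Two cards drawn from a pack of 52.
two_cards = comb(52, 2)
p_both_spades = Fraction(comb(13, 2), two_cards)
p_both_kings = Fraction(comb(4, 2), two_cards)
p_spade_and_heart = Fraction(comb(13, 1) * comb(13, 1), two_cards)
p_same_suit = Fraction(4 * comb(13, 2), two_cards)
p_different_suits = 1 - p_same_suit

# Three marbles drawn from 3 red, 4 green and 3 yellow.
marbles = {"red": 3, "green": 4, "yellow": 3}
three_marbles = comb(sum(marbles.values()), 3)
# Same colour: choose all three from a single colour.
p_same_colour = Fraction(sum(comb(k, 3) for k in marbles.values()), three_marbles)
# One of each colour: the per-colour choices multiply.
p_one_of_each = Fraction(prod(comb(k, 1) for k in marbles.values()), three_marbles)

print(p_both_spades, p_both_kings, p_spade_and_heart, p_same_suit)
print(p_same_colour, p_one_of_each)
```

The "one of each colour" count is a product (3 × 4 × 3 = 36) because a choice is made from every colour, whereas the "same colour" count is a sum because only one of the three cases can occur.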
THE AXIOMATIC APPROACH
Consider a random experiment with sample space S. Associated with this random experiment, many events can be defined. For every event A, let a real number P(A) be assigned. Then, P(A) is the probability of event A if the following axioms are satisfied.
Axiom 1: P(A) ≥ 0
Axiom 2: P(S) = 1, S being the sure event.
Axiom 3: For two mutually exclusive events A and B, P(A ∪ B) = P(A) + P(B)
Note that the third axiom can be generalised to any number of mutually exclusive events.
ADDITION THEOREM PROBABILITY
Exercise 12: (i) Show that P(A) = 1 - P(A') (ii) Show that probability is a value between 0 and 1. (iii) Show that P(Φ) = 0 where Φ is the null event.
Solution:
(i) If A and A' are complementary events, A ∪ A' = S. By axiom 2, P(S) = 1. And so,
P(A ∪ A') = 1 .... Result 1
But A and A' are mutually exclusive events. Therefore, by axiom 3,
P(A ∪ A') = P(A) + P(A') .... Result 2
By results 1 and 2, P(A) + P(A') = 1. That is, P(A) = 1 - P(A')
(ii) Let A be an event. Then, by axiom 1,
P(A) ≥ 0 .... Result 1
If A' is the complementary event of A, P(A') = 1 - P(A). But, by axiom 1, P(A') ≥ 0. Therefore, 1 - P(A) ≥ 0. And so,
P(A) ≤ 1 .... Result 2
By results 1 and 2, 0 ≤ P(A) ≤ 1. That is, probability is a value between 0 and 1.
(iii) If A is an event and if Φ is a null event, A ∪ Φ = A
∴ P(A ∪ Φ) = P(A) .... Result 1
But, A and Φ are mutually exclusive. Therefore, by axiom 3,
P(A ∪ Φ) = P(A) + P(Φ) .... Result 2
By results 1 and 2, P(A) + P(Φ) = P(A). That is, P(Φ) = P(A) - P(A) = 0
ADDITION THEOREM PROBABILITY
For two events A and B, show that
P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
Solution: For events A and B,
A ∪ B = A ∪ (A' ∩ B)
Here, A and A' ∩ B are mutually exclusive. Therefore, by axiom 3,
P(A ∪ B) = P(A) + P(A' ∩ B) .... Result 1
Also,
B = (A ∩ B) ∪ (A' ∩ B)
Here, A ∩ B and A' ∩ B are mutually exclusive. Therefore, by axiom 3,
P(B) = P(A ∩ B) + P(A' ∩ B), that is, P(A' ∩ B) = P(B) - P(A ∩ B) .... Result 2
By result 1 and result 2,
P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
Exercise: Show that (i) P(A ∪ B) ≤ P(A) + P(B) (ii) P(A ∩ B) = P(A) + P(B) - P(A ∪ B)
Solution:
(i) The addition theorem is
P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
Here, P(A ∩ B) ≥ 0. Therefore, P(A ∪ B) ≤ P(A) + P(B).
(ii) The addition theorem is
P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
⇒ P(A ∩ B) = P(A) + P(B) - P(A ∪ B)
Also, note that P(A ∪ B) + P(A ∩ B) = P(A) + P(B)
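The results of this section can be checked numerically on a small equiprobable sample space. The sketch below uses one throw of a fair die with A = "even number" and B = "multiple of 3"; these two events and the helper P are chosen purely for illustration and are not an example from the notes.

```python
from fractions import Fraction

S = set(range(1, 7))                # sample space of one die throw
A = {x for x in S if x % 2 == 0}    # even number: {2, 4, 6}
B = {x for x in S if x % 3 == 0}    # multiple of 3: {3, 6}


def P(event):
    """Classical probability on the equiprobable sample space S."""
    return Fraction(len(event), len(S))


# Addition theorem: P(A u B) = P(A) + P(B) - P(A n B)
lhs = P(A | B)
rhs = P(A) + P(B) - P(A & B)

# Complement rule and null event from Exercise 12.
p_complement = P(S - A)
p_null = P(set())

print(lhs, rhs, p_complement, p_null)
```

Because A and B share the outcome 6, simply adding P(A) and P(B) would count that outcome twice; subtracting P(A ∩ B) corrects for it.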
SOLVED PROBLEMS
Exercise: Write down the sample space for each of the following random experiments.
(i) A coin is tossed three times and the result of each throw is noted.
(ii) A coin is tossed three times and the number of heads obtained is noted.
(iii) A couple goes on producing children until a male child is born. The number of female children born is noted.
(iv) In case (iii) above, instead of noting the 'Number of female children', the 'Number of children born' is noted.
(v) A tetrahedron (a solid with four triangular surfaces) whose sides are painted red, red, blue and green is thrown. The colour of the side which touches the ground is noted.
(vi) Blood of husband and wife are tested and the blood group (whether O, A, B or AB) in each case is identified.
(vii) A person is randomly selected and his religion is noted.
Solution:
(i) S = {HHH, HHT, HTH, THH, HTT, THT, TTH, TTT}
(ii) S = {0, 1, 2, 3}
(iii) S = {0, 1, 2, 3, ...}
(iv) S = {1, 2, 3, 4, ...}
(v) S = {red, blue, green}
(vi) S = {(O,O), (O,A), (O,B), (O,AB), (A,O), (A,A), (A,B), (A,AB), (B,O), (B,A), (B,B), (B,AB), (AB,O), (AB,A), (AB,B), (AB,AB)}
(vii) S = {Hindu, Christian, Muslim, Jain, Jew, ...}
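The finite sample spaces in this exercise can be generated mechanically, which helps confirm that no sample point has been missed. A minimal sketch using itertools.product follows; the variable names are illustrative assumptions.

```python
from itertools import product

# (i) Three tosses, noting each result: every H/T string of length 3.
three_tosses = ["".join(t) for t in product("HT", repeat=3)]

# (ii) Three tosses, noting only the number of heads.
head_counts = sorted({t.count("H") for t in three_tosses})

# (v) Tetrahedron painted red, red, blue, green: the sample space lists
# each distinct colour once, although red is twice as likely as the others.
faces = ["red", "red", "blue", "green"]
colours = sorted(set(faces))

# (vi) Blood groups of husband and wife: ordered pairs of the four groups.
groups = ["O", "A", "B", "AB"]
couples = list(product(groups, repeat=2))

print(three_tosses)
print(head_counts, colours, len(couples))
```

Cases (iii) and (iv) are countably infinite, so they cannot be listed in full; only the finite cases are generated here.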
Exercise:
(i) Given the equiprobable sample space S = {1, 2, 3, 4, 5, 6} and the event A = {1, 3, 5}, find P(A).
(ii) Given the sample space S = {1, 2, 3, 4, 5, 6} and the events A = {1, 3, 5} and B = {2, 4, 6}. If P(A) = 1/3, find P(B).
(iii) If S = {E1, E2} is the sample space and if P(E1) = 0.3, find P(E2).
Solution:
(i) Since the sample space is equiprobable, the mathematical definition can be used for finding probability.
∴ P(A) = Number of favourable outcomes / Total number of outcomes = 3/6 = 1/2
(ii) Here, events A and B are complementary.
∴ P(B) = 1 - P(A) = 1 - 1/3 = 2/3
(iii) Here, E1 and E2 are complementary events.
∴ P(E2) = 1 - P(E1) = 1 - 0.3 = 0.7
Exercise:
(i) If P(A) = 1/3, find P(A').
(ii) If P(A) = 1/2, P(B) = ¾ and P(A ∩ B) = ¼, find P(A ∪ B).
(iii) If P(A) = 1/8, P(B) = 1/6 and P(A ∪ B) = ¼, find P(A ∩ B).
(iv) If P(A) = ½ and P(A ∩ B) = ¼, find P(B|A).
Solution:
(i) P(A') = 1 - P(A) = 1 - 1/3 = 2/3
(ii) P(A ∪ B) = P(A) + P(B) - P(A ∩ B) = 1/2 + 3/4 - 1/4 = 1
(iii) By the addition theorem, P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
⇒ P(A ∩ B) = P(A) + P(B) - P(A ∪ B) = 1/8 + 1/6 - 1/4 = 1/24
(iv) P(B|A) = P(A ∩ B) / P(A) = (1/4) / (1/2) = 1/2
Exercise: If P(A) = 0.8, P(B) = 0.5 and P(A ∪ B) = 0.9, find P(A|B). Are A and B independent events?
Solution: By the addition theorem, P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
∴ P(A ∩ B) = P(A) + P(B) - P(A ∪ B) = 0.8 + 0.5 - 0.9 = 0.4
And so,
P(A|B) = P(A ∩ B) / P(B) = 0.4 / 0.5 = 0.8
Thus, P(A|B) = 0.8. Here, P(A|B) = P(A). Therefore, events A and B are independent.
Exercise: Three unbiased dice are thrown once. Find the probability that all the three dice show the number 6.
Solution: When 3 dice are thrown, there are 6 × 6 × 6 = 216 equally likely, mutually exclusive and exhaustive outcomes. Of these 216 outcomes, 1 outcome, namely, (6, 6, 6), is favourable. Therefore, the probability of all the three dice showing the number 6 is
P[all the three result in the number 6] = 1/216
Exercise: A fair coin is tossed five times. Find the probability of obtaining (i) head in all the tosses, (ii) head in at least one of the tosses.
Solution: There are 2^5 = 32 equally likely, mutually exclusive and exhaustive outcomes. Out of them, one outcome is HHHHH and another outcome is TTTTT. Therefore,
(i) P[head in all tosses] = 1/32
(ii) P[at least one head] = 1 - P[tail in all tosses] = 1 - 1/32 = 31/32
Note: Whenever the probability of the event "at least one" has to be found, it is easier to find it by using the probability of the complementary event as follows.
P[at least one] = 1 - P[none]
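The "at least one" shortcut in the Note can be compared with a direct count over all 32 outcomes of five tosses. The sketch below is illustrative and its variable names are assumptions.

```python
from fractions import Fraction
from itertools import product

tosses = list(product("HT", repeat=5))  # 2**5 = 32 outcomes

p_all_heads = Fraction(sum(1 for t in tosses if "T" not in t), len(tosses))
p_no_head = Fraction(sum(1 for t in tosses if "H" not in t), len(tosses))

# Direct count of the outcomes containing at least one head ...
p_at_least_one_direct = Fraction(sum(1 for t in tosses if "H" in t), len(tosses))
# ... and the same probability through the complementary event.
p_at_least_one = 1 - p_no_head

print(p_all_heads, p_at_least_one_direct, p_at_least_one)
```

The direct count has to identify 31 favourable outcomes, while the complement needs only the single outcome TTTTT, which is why the complement is usually the easier route.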
Exercise: There are 20 persons. 5 of them are graduates. 3 persons are randomly selected from these 20 persons. Find the probability that at least one of the selected persons is a graduate.
Solution: From 20 persons, 3 persons can be selected in 20C3 ways. Thus, there are 20C3 equally likely, mutually exclusive and exhaustive outcomes. Since there are 15 persons who are not graduates,
P[at least one is graduate] = 1 - P[none is graduate] = 1 - 15C3 / 20C3 = 1 - 91/228 = 137/228 = 0.6
Exercise: In a college, there are five lecturers. Among them, three are doctorates. If a committee consisting of three lecturers is formed, what is the probability that at least two of them are doctorates?
Solution: From the five lecturers, three lecturers can be selected in 5C3 ways. Thus, there are 5C3 equally likely, mutually exclusive and exhaustive outcomes. Let events A and B be:
A: Two of the selected lecturers are doctorates.
B: All the three selected lecturers are doctorates.
Then, event A has 3C2 × 2C1 favourable outcomes. And, event B has 3C3 favourable outcomes. Here, events A and B are mutually exclusive.
∴ P[at least two doctorates] = P[two or three doctorates] = P(A ∪ B) = P(A) + P(B)
= (3C2 × 2C1) / 5C3 + 3C3 / 5C3 = (3 × 2)/10 + 1/10 = 7/10 = 0.7
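Both selection problems above follow one counting pattern: choose k items from a marked group and the rest from the unmarked group. The helper p_exactly below is a hypothetical name introduced for illustration; it is a sketch of that pattern, not a formula stated in the notes.

```python
from fractions import Fraction
from math import comb


def p_exactly(k, marked, population, draws):
    """P[exactly k marked items] when `draws` items are selected at random,
    without replacement, from `population` items of which `marked` are marked."""
    favourable = comb(marked, k) * comb(population - marked, draws - k)
    return Fraction(favourable, comb(population, draws))


# 20 persons, 5 graduates, 3 selected: at least one graduate.
p_graduate = 1 - p_exactly(0, marked=5, population=20, draws=3)

# 5 lecturers, 3 doctorates, committee of 3: at least two doctorates.
p_doctorates = (p_exactly(2, marked=3, population=5, draws=3)
                + p_exactly(3, marked=3, population=5, draws=3))

print(p_graduate, float(p_graduate))
print(p_doctorates)
```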
PROBLEMS:3 What is the probability that there will be 53 Sundays in a randomly selected (i) leap year (ii) non-leap year?
Solution:
(i) A leap year has 366 days. Out of them, 7 × 52 = 364 days make 52 complete weeks. The remaining two days may occur in any of the following patterns: (Sunday, Monday), (Monday, Tuesday), (Tuesday, Wednesday), (Wednesday, Thursday), (Thursday, Friday), (Friday, Saturday) and (Saturday, Sunday). Out of these 7 cases, which are equally likely, mutually exclusive and exhaustive, 2 cases, namely (Sunday, Monday) and (Saturday, Sunday), include a Sunday. Therefore,
P[leap year has 53 Sundays] = 2/7
(ii) A non-leap year has 365 days. Out of them, 364 days make 52 complete weeks. The remaining one day may be Sunday, Monday, ..., Saturday. Out of these 7 possibilities, only one is Sunday. Therefore,
P[non-leap year has 53 Sundays] = 1/7
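The 53-Sundays argument can be verified by brute force: for each of the seven possible weekdays on which a year could begin, count the Sundays in the year. The sketch below assumes, as the solution does, that the seven starting weekdays are equally likely; the function name and the weekday numbering are illustrative choices.

```python
from fractions import Fraction

SUNDAY = 6  # weekday numbering assumed here: Monday = 0, ..., Sunday = 6


def p_53_sundays(days_in_year):
    """Probability of 53 Sundays, treating each starting weekday as equally likely."""
    favourable = 0
    for first_weekday in range(7):
        sundays = sum(1 for day in range(days_in_year)
                      if (first_weekday + day) % 7 == SUNDAY)
        if sundays == 53:
            favourable += 1
    return Fraction(favourable, 7)


print(p_53_sundays(366))  # leap year
print(p_53_sundays(365))  # non-leap year
```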
CONDITIONAL PROBABILITY
Let A and B be two events. Then, the conditional probability of B given A is the probability of happening of B when it is known that A has already happened. On the other hand, the probability of happening of B when nothing is known about the happening of A is called the unconditional probability of B. The conditional probability of B given A is denoted by P(B|A). The unconditional probability is P(B).
Let P(A) > 0. Then, the conditional probability of event B given A is defined as
P(B|A) = P(A ∩ B) / P(A)
If P(A) = 0, the conditional probability P(B|A) is not defined.
If A and B are independent events, the occurrence of B will be independent of the occurrence of A. Therefore, the conditional and unconditional probabilities are equal. That is, P(B|A) = P(B), so that
P(B) = P(A ∩ B) / P(A)
That is, P(A ∩ B) = P(A).P(B)
INDEPENDENT EVENTS Two events A and B are independent if and only if P(A ∩ B) = P(A).P(B) If two events are independent, the occurrence or non-occurrence of one does not depend on the occurrence or non-occurrence of the other.
MULTIPLICATION THEOREM
Let A and B be two events with respective probabilities P(A) and P(B). Let P(B|A) be the conditional probability of event B given that event A has happened. Then, the probability of simultaneous occurrence of A and B is
P(A ∩ B) = P(A).P(B|A)
If the events are independent, the statement reduces to
P(A ∩ B) = P(A).P(B)
Proof: By the definition of conditional probability, for P(A) > 0,
P(B|A) = P(A ∩ B) / P(A), and so P(A ∩ B) = P(A).P(B|A)
If A and B are independent, by the definition of independence,
P(B|A) = P(B), and so P(A ∩ B) = P(A).P(B)
Exercise:
a. A card is drawn at random from a pack of cards. (i) What is the probability that it is a heart? (ii) If it is known that the card drawn is red, what is the probability that it is a heart?
b. A fair coin is tossed thrice. What is the probability that all the three tosses result in heads?
Solution:
a. There are 52 equally likely, mutually exclusive and exhaustive outcomes. Let events A and B be:
A: card drawn is red.
B: card drawn is a heart.
There are 26 red cards and 13 hearts in a pack of cards. Therefore, event A has 26 favourable outcomes and event B has 13 favourable outcomes. Event A ∩ B has 13 favourable outcomes because A ∩ B happens when any of the 13 hearts is drawn. Therefore, P(A) = 26/52, P(B) = 13/52 and P(A ∩ B) = 13/52
(i) The unconditional probability of drawing a heart is
P(B) = 13/52 = ¼
(ii) The conditional probability of drawing a heart given that it is a red card is
P(B|A) = P(A ∩ B) / P(A) = (13/52) / (26/52) = 1/2
b. Let events A, B and C be:
A: the first toss results in head
B: the second toss results in head
C: the third toss results in head.
Then, P(A) = P(B) = P(C) = ½
Since A, B and C are results of three different tosses, they are independent. Therefore, the probability that all the three tosses result in head is
P[3 heads] = P(A ∩ B ∩ C) = P(A).P(B).P(C) = 1/2 × 1/2 × 1/2 = 1/8
Exercise: Two fair dice are rolled. If the sum of the numbers obtained is 4, find the probability that the numbers obtained on both the dice are even.
Solution: Let events A and B be:
A: the sum of the numbers is 4
B: the numbers on both the dice are even
Here, we have to find
P(B|A) = P(A ∩ B) / P(A)
Event A has 3 favourable outcomes, namely, (1,3), (2,2) and (3,1).
∴ P[Sum 4] = P(A) = 3/36
Event (A ∩ B) has 1 favourable outcome, namely, (2,2).
∴ P[Sum 4 and numbers even] = P(A ∩ B) = 1/36
Thus, P[Numbers even given Sum 4] = (1/36) / (3/36) = 1/3
Exercise: A box has 1 red and 3 white balls. Balls are drawn one after another from the box. Find the probability that the two balls drawn would be red if
a. the ball drawn first is returned to the box before the second draw is made. (Draw with replacement).
b. the ball drawn first is not returned before the second draw is made. (Draw without replacement).
Solution: Let
A: the first ball drawn is red
B: the second ball drawn is red.
Draw with replacement: Here, P(A) = 1/4. Also, since the first ball is returned before the second draw is made, P(B|A) = 1/4
∴ P[Two balls are red] = P(A ∩ B) = P(A).P(B|A) = 1/4 × 1/4 = 1/16
Draw without replacement: Here, P(A) = 1/4. Since the first ball drawn is not returned before the second draw is made, only the 3 white balls remain once the red ball has been drawn, and so P(B|A) = 0/3 = 0
∴ P[Two balls are red] = P(A ∩ B) = P(A).P(B|A) = 1/4 × 0 = 0
PROBLEMS:1 The probability that a contractor will get a plumbing contract is 2/3 and the probability that he will not get an electrical contract is 5/9. If the probability of getting at least one of these contracts is 4/5, what is the probability that he will get both?
Solution: Let
A: contractor gets the plumbing contract
B: contractor gets the electrical contract
Then, P(A) = 2/3, P(B') = 5/9 and P(A ∪ B) = 4/5
Therefore, P(B) = 1 - P(B') = 4/9
By the addition theorem we have, P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
That is, P(A ∩ B) = P(A) + P(B) - P(A ∪ B)
Therefore, P[he gets both plumbing and electrical contracts] = P(A ∩ B) = P(A) + P(B) - P(A ∪ B) = 2/3 + 4/9 - 4/5 = 14/45
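The conditional-probability exercises above (two dice with sum 4, and two draws from the box with 1 red and 3 white balls) can be checked by enumerating ordered outcomes. The sketch below is illustrative: the helper names conditional and p_both_red, and the labels given to the white balls, are assumptions made so that the balls can be told apart.

```python
from fractions import Fraction
from itertools import permutations, product


def conditional(outcomes, event_b, event_a):
    """P(B|A) = P(A n B) / P(A) on an equiprobable sample space."""
    given = [o for o in outcomes if event_a(o)]
    if not given:
        raise ValueError("P(A) = 0, so P(B|A) is not defined")
    both = [o for o in given if event_b(o)]
    return Fraction(len(both), len(given))


dice = list(product(range(1, 7), repeat=2))
p_even_given_sum4 = conditional(
    dice,
    event_b=lambda r: r[0] % 2 == 0 and r[1] % 2 == 0,
    event_a=lambda r: sum(r) == 4,
)

balls = ["red", "white1", "white2", "white3"]
with_replacement = list(product(balls, repeat=2))    # 4 x 4 = 16 ordered draws
without_replacement = list(permutations(balls, 2))   # 4 x 3 = 12 ordered draws


def p_both_red(draws):
    return Fraction(sum(1 for d in draws if d == ("red", "red")), len(draws))


print(p_even_given_sum4)
print(p_both_red(with_replacement), p_both_red(without_replacement))
```

Without replacement, the pair (red, red) never appears among the 12 ordered draws, which matches P(B|A) = 0 in the worked solution.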
PROBLEMS:3 A can solve 90 percent of the problems given in a book and B can solve 70 percent. What is the probability that at least one of them will solve a problem selected at random?
Solution:
Event A: student A solves the problem
Event B: student B solves the problem.
Assuming that A and B work independently,
P(at least one solves the problem) = 1 - P(none solves the problem) = 1 - P(A' ∩ B') = 1 - P(A').P(B') = 1 - (0.10)(0.30) = 0.97
PROBLEMS:4 The probability that a trainee will remain with a company is 0.6. The probability that an employee earns more than Rs. 10,000 per year is 0.5. The probability that an employee is a trainee who remained with the company or who earns more than Rs. 10,000 per year is 0.7. What is the probability that an employee earns more than Rs. 10,000 per year given that he is a trainee who stayed with the company?
Solution:
Event A: a trainee will remain with the company
Event B: a trainee earns more than Rs. 10,000.
Given P(A) = 0.6, P(B) = 0.5 and P(A ∪ B) = 0.7, we need to find
P(B|A) = P(A ∩ B) / P(A) = [P(A) + P(B) - P(A ∪ B)] / P(A) = 0.4 / 0.6 = 0.67
PROBLEMS:5 Suppose that one of three men, a politician, a bureaucrat and an educationist, will be appointed as VC of the university. The probabilities of their appointment are respectively 0.3, 0.25 and 0.45. The probabilities that these people will promote research activities if they are appointed are 0.4, 0.7 and 0.8 respectively. What is the probability that research will be promoted by the new VC?
Solution:
Event A: politician appointed as VC
Event B: bureaucrat appointed as VC
Event C: educationist appointed as VC
Event D: promotion of research activities
∴ P(D) = P(A ∩ D) + P(B ∩ D) + P(C ∩ D)
= P(D|A).P(A) + P(D|B).P(B) + P(D|C).P(C)
= (0.3)(0.4) + (0.25)(0.7) + (0.45)(0.8) = 0.655
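Problem 5 weights each conditional probability P(D | candidate) by the probability that the candidate is appointed and adds the products. A minimal sketch of that calculation follows; the dictionary keys and variable names are illustrative assumptions, and Fraction is used only to avoid floating-point rounding.

```python
from fractions import Fraction

# P(appointed) for each candidate; the three cases are mutually exclusive
# and exhaustive, so these must add up to 1.
prior = {
    "politician": Fraction("0.3"),
    "bureaucrat": Fraction("0.25"),
    "educationist": Fraction("0.45"),
}
# P(research promoted | candidate appointed)
promotes = {
    "politician": Fraction("0.4"),
    "bureaucrat": Fraction("0.7"),
    "educationist": Fraction("0.8"),
}

# P(D) = sum over candidates of P(D | candidate) * P(candidate)
p_research = sum(prior[name] * promotes[name] for name in prior)

print(p_research, float(p_research))
```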
PROBLEMS:6 A box contains 4 green and 6 white balls. Another box contains 7 green and 8 white balls. Two balls are transferred from box 1 to box 2 and then a ball is drawn from box 2. What is the probability that it is white?
Solution:
Event A: both transferred balls are green
Event B: both transferred balls are white
Event C: among the transferred balls, one is green and one is white
Event D: selection of a white ball from box 2.
After the transfer, box 2 holds 17 balls, of which 8, 10 or 9 are white in cases A, B and C respectively.
∴ P(D) = P(A ∩ D) + P(B ∩ D) + P(C ∩ D)
= P(D|A).P(A) + P(D|B).P(B) + P(D|C).P(C)
= (4C2 / 10C2)(8/17) + (6C2 / 10C2)(10/17) + ((4C1 × 6C1) / 10C2)(9/17)
= 0.5412
PROBLEMS:7 The probabilities of a husband's and wife's selection to a post are 1/5 and 1/7 respectively. What is the probability that
(i) both of them will be selected
(ii) exactly one of them will be selected
(iii) none of them will be selected?
Solution:
Event A: selection of husband, P(A) = 1/5
Event B: selection of wife, P(B) = 1/7
Treating the two selections as independent,
(i) P(both of them will be selected) = P(A ∩ B) = P(A).P(B) = 1/5 × 1/7 = 1/35
(ii) P(exactly one of them will be selected) = P(A' ∩ B) + P(B' ∩ A) = P(A').P(B) + P(B').P(A) = 4/5 × 1/7 + 6/7 × 1/5 = 10/35
(iii) P(none of them will be selected) = P(A' ∩ B') = P(A').P(B') = 4/5 × 6/7 = 24/35
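Problems 6 and 7 can be reproduced with exact fractions. The sketch below is illustrative; the variable names are assumptions, and Problem 7 assumes independent selections, as the worked solution does.

```python
from fractions import Fraction
from math import comb

# Problem 6: two balls move from box 1 (4 green, 6 white) to box 2
# (7 green, 8 white); box 2 then holds 17 balls.
pairs = comb(10, 2)
cases = [
    # (P[transfer case], P[white from box 2 | transfer case])
    (Fraction(comb(4, 2), pairs), Fraction(8, 17)),                # both green
    (Fraction(comb(6, 2), pairs), Fraction(10, 17)),               # both white
    (Fraction(comb(4, 1) * comb(6, 1), pairs), Fraction(9, 17)),   # one of each
]
p_white = sum(p_case * p_white_given_case for p_case, p_white_given_case in cases)

# Problem 7: husband and wife selected independently.
p_husband, p_wife = Fraction(1, 5), Fraction(1, 7)
p_both = p_husband * p_wife
p_exactly_one = p_husband * (1 - p_wife) + (1 - p_husband) * p_wife
p_none = (1 - p_husband) * (1 - p_wife)

print(p_white, round(float(p_white), 4))
print(p_both, p_exactly_one, p_none)
```

In Problem 7 the three answers cover every possibility, so they add up to 1, which is a quick consistency check on the arithmetic.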
Random variable
INTRODUCTION
Suppose two fair coins are tossed. Here, the sample space is S = {TT, TH, HT, HH}. Suppose to each of the four sample points in this sample space, a number is assigned as follows.
Sample point:  TT  TH  HT  HH
Number:        0   1   1   2
Here, the assigned numbers indicate the number of heads obtained in each case. Let 'the number of heads' be denoted by X. Then, X is a function on the sample space. It takes the values 0, 1 and 2 with probabilities
P[X=0] = P[no head] = ¼
P[X=1] = P[one head] = ½
P[X=2] = P[two heads] = ¼
Here, X is called a Random variable or Variate.
RANDOM VARIABLE
A random variable is a function which assigns a real number to every sample point in the sample space. The set of such real values is the range of the random variable. There are two types of random variable, namely, Discrete random variable and Continuous random variable.
A variable X which takes values x1, x2, ..., xn with probabilities p1, p2, ..., pn is a Discrete random variable. Here, the values x1, x2, ..., xn form the range of the random variable. A random variable whose range is uncountably infinite is a Continuous random variable.
Ex. 1. Let X denote the number of heads obtained while tossing two fair coins. Then, X is a random variable which takes the values 0, 1 and 2 with respective probabilities ¼, ½ and ¼. Here, X is a discrete random variable.
Ex. 2. Let X denote the number obtained while throwing a fair die. Then, X is a discrete random variable taking values 1, 2, 3, 4, 5 and 6 with probability 1/6 each.
Ex. 3. Let X denote the weight of apples. Then, X is a continuous random variable.
Generally, random variables are denoted by X, Y, Z, etc. If X is a random variable, the values taken by X are denoted by x (small letter).
PROBABILITY MASS FUNCTION
Let X be a discrete random variable. And let p(x) be a function such that p(x) = P[X=x]. Then, p(x) is the probability mass function of X. Here,
(i) p(x) ≥ 0 for all x
(ii) ∑p(x) = 1
A similar function is defined for a continuous random variable X. It is called the probability density function (p.d.f.). It is denoted by f(x).
PROBABILITY DISTRIBUTION
A systematic presentation of the values taken by a random variable and the corresponding probabilities is called the probability distribution of the random variable.
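A random variable is literally a function on the sample space, and its probability mass function can be built by counting how many sample points map to each value. The sketch below does this for the number of heads in two tosses of a fair coin; the names X, counts and pmf are illustrative assumptions.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space for two tosses: TT, TH, HT, HH (equally likely).
sample_space = ["".join(t) for t in product("TH", repeat=2)]


def X(point):
    """The random variable: number of heads at a sample point."""
    return point.count("H")


# p(x) = P[X = x] = (number of sample points with X = x) / (number of points)
counts = Counter(X(point) for point in sample_space)
pmf = {x: Fraction(c, len(sample_space)) for x, c in sorted(counts.items())}

for x, p in pmf.items():
    print(x, p)
```

The resulting table of x against p(x) is the probability distribution of X, and its probabilities satisfy both conditions p(x) ≥ 0 and ∑p(x) = 1.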
Session 4 MATHEMATICAL EXPECTATION Mathematical expectation of a random variable Let X be a discrete random variable with probability mass function p(x). Then, mathematical expectation of X is --- E(X) = ∑x.p(x)
Mathematical expectation of a function h(X) of X
Let X be a discrete random variable with probability mass function p(x). Then, the mathematical expectation of any function h(X) of X is
E[h(X)] = ∑h(x).p(x)
Exercise 1: Two fair coins are tossed once. Find the mathematical expectation of the number of heads obtained.
Solution: Let X denote the number of heads obtained. Then, X is a random variable which takes the values 0, 1 and 2 with respective probabilities ¼, ½ and ¼. That is,
x:     0   1   2
p(x):  ¼   ½   ¼
The mathematical expectation of the number of heads is
E(X) = ∑x.p(x) = 0 × 1/4 + 1 × 1/2 + 2 × 1/4 = 1
RESULTS
1. For a random variable X, the Arithmetic Mean is E(X).
2. For a random variable X, the Variance is Var(X) = E[X - E(X)]² = E(X²) - [E(X)]²
The Standard Deviation is the square root of the variance.
Exercise: A bag has 3 white and 4 red balls. Two balls are randomly drawn from the bag. Find the expected number of white balls in the draw.
Solution: Let X denote the number of white balls obtained in the draw. Then, X is a random variable which takes the values 0, 1 and 2 with respective probabilities
P(0) = P[both red] = 4C2 / 7C2 = 2/7
P(1) = P[one white and one red] = (3C1 × 4C1) / 7C2 = 4/7
P(2) = P[both white] = 3C2 / 7C2 = 1/7
The probability distribution of X is
x:     0    1    2
p(x):  2/7  4/7  1/7
E(X) = ∑x.p(x) = 0 × 2/7 + 1 × 4/7 + 2 × 1/7 = 6/7 = 1 (approximately)
Thus, about one white ball is expected in the draw.
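The expectation and variance formulas above can be applied directly to the white-ball distribution. In the sketch below the distribution is first rebuilt by listing all 7C2 = 21 pairs of balls; the helper expectation and the other names are illustrative assumptions, and the variance is an extra calculation not asked for in the exercise.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

balls = ["W"] * 3 + ["R"] * 4                       # 3 white, 4 red
draws = list(combinations(range(len(balls)), 2))    # 7C2 = 21 equally likely pairs

# X = number of white balls in the pair drawn.
counts = Counter(sum(1 for i in draw if balls[i] == "W") for draw in draws)
pmf = {x: Fraction(c, len(draws)) for x, c in sorted(counts.items())}


def expectation(pmf, h=lambda x: x):
    """E[h(X)] = sum of h(x) * p(x) over the values x taken by X."""
    return sum(h(x) * p for x, p in pmf.items())


mean = expectation(pmf)                                    # E(X)
variance = expectation(pmf, lambda x: x * x) - mean ** 2   # E(X^2) - [E(X)]^2

print(pmf)
print(mean, variance)
```

Passing h(x) = x² to the same helper gives E(X²), so the variance follows from the second form of the formula without a separate routine.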
THEORETICAL PROBABILITY DISTRIBUTIONS
In day to day life, we come across many random variables such as
1. Number of male children in a family having three children.
2. Number of passengers getting into a bus at the bus stand.
3. I.Q. of children.
4. Number of stones thrown successively at a mango on the tree until the mango is hit.
5. Marks scored by a candidate in the P.U.E. examination.
For a quick analysis of distributions of such random variables, we consider their theoretical equivalents. These equivalent distributions are derived from certain theoretical assumptions and restrictions. Such theoretically designed distributions are called theoretical distributions. There are many types (families) of theoretical distributions. Some of them are (i) Bernoulli distribution (ii) Binomial distribution (iii) Poisson distribution (iv) Hypergeometric distribution (v) Normal distribution.
The Bernoulli distribution and the Binomial distribution were discovered by James Bernoulli during the first decade of the eighteenth century. These works were published posthumously in 1713. The Poisson distribution was introduced by S.D. Poisson in 1837. The Normal distribution was introduced by De Moivre in 1733. This distribution is also called the Gaussian distribution.
BERNOULLI EXPERIMENT
A random variable X which assumes the values 1 and 0 with respective probabilities p and q = 1 - p is called a Bernoulli variable. The Bernoulli distribution is
x:     1   0
p(x):  p   q
Note1: Bernoulli distribution has one constant, namely, p. This constant p is called parameter of the Bernoulli distribution. Different values of p(where 0