Statistics for Management

January 14, 2017 | Author: Sreenivas Kodamasimham | Category: N/A

LESSON – 1 STATISTICS FOR MANAGEMENT

Session – 1

Duration: 1 hr

Meaning of Statistics
The term statistics refers both to numerical statements of fact and to statistical methodology. Used in the sense of statistical data, it refers to the quantitative aspects of things and is a numerical description. Examples: the income of a family, the production of the automobile industry, the sales of cars. These quantities are numerical. Some quantities are not numerical in themselves but can be made so by counting. The sex of a baby is not a number, but by counting the number of boys we can attach a numerical description to the sex of all newborn babies, for example by saying that 60% of all live-born babies are boys. This information then comes within the realm of statistics.

Definition
The word statistics can be used in two senses, singular and plural. In the narrow, plural sense, statistics denotes numerical data (statistical data). In the wide, singular sense, statistics refers to statistical methods. The definitions are therefore grouped under two heads: “Statistics as data” and “Statistics as methods”.

Statistics as Data
Some definitions of statistics as data are:
a) “Statistics are numerical statements of facts in any department of enquiry placed in relation to each other.” - A.L. Bowley
b) “By statistics we mean quantitative data affected to a marked extent by multiplicity of causes.” - Yule and Kendall
c) “By statistics we mean aggregates of facts affected to a marked extent by multiplicity of causes, numerically expressed, enumerated or estimated according to reasonable standards of accuracy, collected in a systematic manner for a predetermined purpose and placed in relation to each other.” - H. Secrist
The last definition is the most comprehensive and exhaustive. It throws more light on the characteristics of statistics and covers different aspects. The characteristics that statistics should possess, according to H. Secrist, can be listed as follows.
- Statistics are aggregates of facts.
- Statistics are affected to a marked extent by multiplicity of causes.
- Statistics are numerically expressed.
- Statistics should be enumerated or estimated.
- Statistics should be collected with a reasonable standard of accuracy.
- Statistics should be placed in relation to each other.

Statistics as Methods
Definition
a) “Statistics may be called the science of counting.” - A.L. Bowley
b) “Statistics is the science of estimates and probabilities.” - Boddington
c) Croxton and Cowden have given a clear and concise definition: “Statistics may be defined as the collection, presentation, analysis and interpretation of numerical data.”
The definition of Croxton and Cowden names four stages. Together with the organisation of data, which comes between collection and presentation, the stages are as follows.
a) Collection of data
A statistical investigation is built on a systematic collection of data. The data are classified into two groups: i) internal data and ii) external data. Internal data are obtained from internal records of the operations of a business organisation, such as production, sources of income and expenditure, inventory, purchases and accounts. External data are collected by, or purchased from, external agencies. External data can be either primary or secondary. Primary data are collected for the first time and are original, while secondary data have already been collected and published by some agency.
b) Organisation of data
The collected data are a large mass of figures that needs to be organised. The data must be edited to rectify omissions, irrelevant answers and wrong computations. The edited data must then be classified and tabulated to suit further analysis.
c) Presentation of data
A large body of collected data cannot be understood and analysed easily and quickly. The data therefore need to be presented in tabular or graphic form. This systematic ordering and graphical presentation helps further analysis.
d) Analysis of data
Analysis requires establishing the relationship between one or more variables. It includes condensation, abstraction, summarisation and drawing conclusions. It is carried out with statistical tools and techniques such as measures of central tendency and dispersion, correlation and analysis of variance.
e) Interpretation of data
Interpretation requires deep insight into the subject. It involves drawing valid conclusions on the basis of the analysis of the data, and requires experience and skill. The step is very important because the conclusions of the study rest on it.
Statistics can also be defined, following Seligman, as: “Statistics is a science which deals with the methods of collecting, classifying, presenting, comparing and interpreting numerical data collected to throw light on an enquiry.”

Importance of statistics
Statistics is indispensable today. As its use has spread to many fields of experiment and enquiry in order to draw valid conclusions, its importance has grown. Research investigations in economics and commerce are largely statistical. The importance of statistics in various fields is listed below.
a) State affairs: In state affairs, statistics is useful in the following ways.
1. To collect information and study the economic condition of people in the state.
2. To assess the resources available in the state.
3. To help the state decide whether to accept or reject a policy.
4. To provide information and analysis on factors such as wealth, crime, agriculture, exports and education.
b) Economics: In economics, statistics is useful in the following ways.
1. Helps in the formulation of economic laws and policies.
2. Helps in studying economic problems.
3. Helps in compiling the national income accounts.
4. Helps in economic planning.
c) Business
1. Helps to take decisions on location and size.
2. Helps to study demand and supply.
3. Helps in forecasting and planning.
4. Helps in controlling the quality of the product or process.
5. Helps in making marketing decisions.
6. Helps in production planning and inventory management.
7. Helps in business risk analysis.
8. Helps in estimating long-term resource requirements and consumer preferences, and supports business research.
d) Education: Statistics is necessary to formulate policies on starting new courses and to assess the facilities available for proposed courses.
e) Accounts and audits
1. Helps to study the correlation between profits and dividends, which indicates the trend of future profits.
2. Sampling techniques are used in auditing.

Functions of statistics
Some important functions of statistics are as follows.
1. Collects and presents facts in a systematic manner.
2. Helps in the formulation and testing of hypotheses.
3. Facilitates the comparison of data.
4. Helps in predicting future trends.
5. Helps to find the relationship between variables.
6. Simplifies a mass of complex data.
7. Helps to formulate policies.
8. Helps the Government to take decisions.

Limitations of statistics
1. It does not study qualitative phenomena.
2. It does not deal with individual items.
3. Statistical results are true only on average.
4. Statistical data should be uniform and homogeneous.
5. Statistical results depend on the accuracy of the data.
6. Statistical conclusions are not universally true.
7. Statistical results can be interpreted only by a person with sound knowledge of statistics.

Distrust of Statistics
Distrust of statistics arises from lack of knowledge and from the limitations of its use, not from statistical science itself. The main reasons are as follows.
a) Figures are manipulated or incomplete.
b) Figures are quoted without their context.
c) Inconsistent definitions.
d) Selection of non-representative statistical units.
e) Inappropriate comparisons.
f) Wrong inferences drawn.
g) Errors in data collection.

Statistical Data
Statistical investigation is a long and comprehensive process that requires the systematic collection of a large amount of data. The validity and accuracy of the conclusions of a study depend on how well the data were gathered. Because the quality of the data greatly influences the conclusions, importance must be given to the data collection process. Statistical data may be classified as primary data and secondary data according to the source of collection.

♦ Primary data
Primary data are those collected for the first time by the investigator or researcher and are thus original in character. They are collected for the specific purpose or study at hand. Primary data are usually in the shape of raw material to which statistical methods are applied for the purpose of analysis and interpretation.

♦ Secondary data
Secondary data have already been collected for a purpose other than the problem at hand. They have been collected by some other person and have passed through statistical analysis at least once. Secondary data are usually in the shape of finished products, since they have already been treated statistically in one form or another. After statistical treatment, primary data lose their original shape and become secondary data. Data that are secondary for one organisation are primary for the organisation that first collected and published them.

Primary Vs Secondary Data
- Primary data are originated by the researcher for the specific purpose or study at hand, while secondary data have already been collected for a purpose other than the research work at hand.
- Primary data collection requires considerably more time and is relatively expensive, while secondary data are easily accessible, inexpensive and quickly obtained.

Table – A comparison of primary and secondary data

Basis                 Primary data                          Secondary data
Collection purpose    For the problem at hand               For other problems
Collection process    Very involved                         Rapid and easy
Collection cost       High                                  Relatively low
Collection time       Long                                  Short
Suitability           Its suitability is positive           It may or may not suit the object of the survey
Originality           It is original                        It is not original
Precautions           No extra precautions are required     It should be used with extra care

Limitations of secondary data
a) Since secondary data were collected for some other purpose, their usefulness to the current problem may be limited in several important ways, including relevance and accuracy.
b) The objectives, nature and methods used to collect the secondary data may not be appropriate to the present situation.
c) The secondary data may not be accurate, or they may not be completely current or dependable.

Criteria for evaluating secondary data
Before using secondary data it is important to evaluate them on the following factors.
a) Specification and methodology used to collect the data
b) Error and accuracy of the data
c) The currency
d) The objective: the purpose for which the data were collected
e) The nature: the content of the data
f) The dependability

Sources of data
Primary source – The methods of collecting primary data
When data are neither available internally nor exist as a secondary source, primary sources of data are appropriate. The various methods of collecting primary data are as follows.
a) Direct personal investigation
   - Interview
   - Observation
b) Indirect oral investigation
c) Information from local agents and correspondents
d) Mailed questionnaires and schedules
e) Through enumerators

Secondary source – The methods of collecting secondary data
i) Published statistics
   a) Official publications of the Central Government
   Ex:
   - Central Statistical Organisation (CSO) – Ministry of Planning
   - National Sample Survey Organisation (NSSO)
   - Office of the Registrar General and Census Commissioner – GOI
   - Directorate of Economics and Statistics – Ministry of Agriculture
   - Labour Bureau – Ministry of Labour etc.
ii) Publications of semi-government organisations
   Ex:
   - The Institute of Foreign Trade, New Delhi
   - The Institute of Economic Growth, New Delhi
iii) Publications of research institutes
   Ex:
   - Indian Statistical Institute
   - Indian Agricultural Statistical Institute
   - NCERT publications
   - Indian Standards Institute etc.
iv) Publications of business and financial institutions
   Ex:
   - Trade association publications, such as those of sugar factories, textile mills and the Indian Chamber of Industry and Commerce
   - Stock exchange reports, co-operative society reports etc.
v) Newspapers and periodicals
   Ex: The Financial Express, Eastern Economist, Economic Times, Indian Finance etc.
vi) Reports of various committees and commissions
   Ex:
   - Kothari Commission report on education
   - Pay commission reports
   - Land reform committee reports etc.
vii) Unpublished statistics
   - Internal and administrative data such as periodical loss, profit, sales, production rate, balance sheet, labour turnover, budgets etc.

Classification and Tabulation
The data collected for a statistical inquiry sometimes consist of a few fairly simple figures that can be understood without any special treatment. More often there is an overwhelming mass of raw data without any structure. Such an unwieldy, unorganised and shapeless mass of collected data cannot be rapidly or easily assimilated or interpreted, and is not fit for further analysis. To make the data simple and easily understandable, the first task is to condense and simplify them so that irrelevant data are removed and the significant features stand out prominently. The procedure adopted for this purpose is known as classification and tabulation. Classification helps proper tabulation.
“Classified and arranged facts speak for themselves; unarranged, unorganised they are dead as mutton.” - Prof. J.R. Hicks

♦ Meaning of Classification
Classification is the process of arranging things or data in groups or classes according to their resemblances and affinities. It gives expression to the unity of attributes that may subsist among a diversity of individuals.

♦ Definition of Classification
“Classification is the process of arranging data into sequences and groups according to their common characteristics or separating them into different but related parts.” - Secrist
“The process of grouping a large number of individual facts and observations on the basis of similarity among the items is called classification.” - Stockton & Clark

Characteristics of classification
a) Classification performs homogeneous grouping of data.
b) It brings out points of similarity and dissimilarity.
c) The classification may be either real or imaginary.
d) Classification is flexible enough to accommodate adjustments.

Objectives / purposes of classification
i) To simplify and condense a large mass of data
ii) To present the facts in an easily understandable form
iii) To allow comparisons
iv) To help draw valid inferences
v) To relate the variables in the data
vi) To help further analysis
vii) To eliminate unwanted data
viii) To prepare for tabulation

Guiding principles (rules) of classification
The following are the general guiding principles for good classification.
a) Exhaustive: Classification should be exhaustive. Every item in the data must belong to one of the classes. Introducing a residual class (e.g. others, miscellaneous) should be avoided.
b) Mutually exclusive: Each item should be placed in only one class.
c) Suitability: The classification should conform to the object of the inquiry.
d) Stability: Only one principle must be maintained throughout the classification and analysis.
e) Homogeneity: The items included in each class must be homogeneous.
f) Flexibility: A good classification should be flexible enough to accommodate new or changed situations.

Modes / Types of Classification
Modes or types of classification refer to the class categories into which the data can be sorted and tabulated. These categories depend on the nature of the data and the purpose for which the data are sought.

Important types of classification
a) Geographical (on the basis of area or region)
b) Chronological (temporal or historical, i.e. with respect to time)
c) Qualitative (on the basis of character or attributes)
d) Numerical, quantitative (on the basis of magnitude)

a) Geographical Classification
In geographical classification, the data are classified by geographical region.
Ex: Sales of the company, region-wise (in million rupees)

Region    Sales
North     285
South     300
East      185
West      235

b) Chronological Classification
If statistical data are classified according to the time of their occurrence, the classification is called chronological classification.
Ex: Sales reported by a departmental store

Month       Sales (Rs. in lakhs)
January     22
February    26
March       32
April       25
May         27
June        29
July        30
August      30

c) Qualitative Classification
In qualitative classification, the data are classified according to the presence or absence of attributes in the given units. The classification is thus based on some quality characteristic or attribute. Ex: sex, literacy, education, class grade etc. It may be further divided into
i) Simple classification
ii) Manifold classification

i) Simple classification: If the data are divided into only two classes, the classification is known as simple classification.
Ex:
a) Population into Male / Female
b) Population into Educated / Uneducated

ii) Manifold classification: Here the classification is based on more than one attribute at a time.
Ex:

Population
   Smokers
      Literate
         Male
         Female
      Illiterate
         Male
         Female
   Non-smokers
      Literate
         Male
         Female
      Illiterate
         Male
         Female

d) Quantitative Classification
In quantitative classification, the data are classified on the basis of quantitative measurements of some characteristic, such as age, marks, income, production or sales. The quantitative phenomenon under study is known as a variable, so this classification is also called classification by variable.
Ex: For a 50-mark test, the marks obtained by students are classified as follows.

Marks      No. of students
0 – 10     5
10 – 20    7
20 – 30    10
30 – 40    25
40 – 50    3

Total students = 50
In this classification the marks obtained by the students are the variable, and the number of students in each class is the frequency.

Meaning and Definition of Tabulation
Tabulation may be defined as the systematic arrangement of data in columns and rows. It is designed to simplify the presentation of data for the purpose of analysis and statistical inference.

Major Objectives of Tabulation
1. To simplify complex data
2. To facilitate comparison
3. To economise on space
4. To draw valid inferences / conclusions
5. To help further analysis

Differences between Classification and Tabulation
1. Data are first classified and then presented in tables; classification is the basis for tabulation.
2. Tabulation is a mechanical function of classification, because in tabulation the classified data are placed in rows and columns.
3. Classification is a process of statistical analysis, while tabulation is a process of presenting data in a suitable structure.

Classification of tables
Tables are classified on the basis of
1. Coverage (simple and complex tables)
2. Objective / purpose (general purpose or reference table; special purpose or summary table)
3. Nature of inquiry (primary and derived tables)
Ex:
a) Simple table: Data are classified on the basis of only one characteristic.

Distribution of marks
Class marks    No. of students
30 – 40        20
40 – 50        20
50 – 60        10
Total          50

b) Two-way table: Classification is based on two characteristics.

               No. of students
Class marks    Boys    Girls    Total
30 – 40        10      10       20
40 – 50        15      5        20
50 – 60        3       7        10
Total          28      22       50
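The row and column totals of the two-way table can be checked mechanically. The following is a minimal Python sketch, not part of the original notes; the dictionary layout and the variable names are assumptions made for illustration.

```python
# Two-way table from the example: class marks by boys and girls.
# The nested-dict layout is an illustrative choice, not a prescribed format.
table = {
    "30 - 40": {"Boys": 10, "Girls": 10},
    "40 - 50": {"Boys": 15, "Girls": 5},
    "50 - 60": {"Boys": 3, "Girls": 7},
}

# Row totals: number of students in each class of marks.
row_totals = {cls: sum(counts.values()) for cls, counts in table.items()}

# Column totals: number of boys and of girls over all classes.
column_totals = {}
for counts in table.values():
    for group, n in counts.items():
        column_totals[group] = column_totals.get(group, 0) + n

grand_total = sum(row_totals.values())

print(row_totals)
print(column_totals)
print(grand_total)
```

The grand total computed from the rows must equal the one computed from the columns; a mismatch signals a tabulation error.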

Frequency Distribution Frequency distribution is a table used to organize the data. The left column (called classes or groups) includes numerical intervals on a variable under study. The right column contains the list of frequencies, or number of occurrences of each class/group. Intervals are normally of equal size covering the sample observations range. It is simply a table in which the gathered data are grouped into classes and the number of occurrences which fall in each class is recorded.

♦ Definition
“A frequency distribution is a statistical table which shows the set of all distinct values of the variable arranged in order of magnitude, either individually or in groups, with their corresponding frequencies.” - Croxton and Cowden
A frequency distribution can be classified as
a) Series of individual observations
b) Discrete frequency distribution
c) Continuous frequency distribution

a) Series of individual observations
A series of individual observations is a series in which the items are listed one after another, one entry per observation. For statistical calculations these observations can be arranged in either ascending or descending order. This arrangement is called an array.
Ex:

Roll No.    Marks obtained in statistics paper
1           83
2           80
3           75
4           92
5           65

The above list is raw data. Presented in this form, the data reveal little information. Arranging the data in ascending or descending order of magnitude gives a better presentation and is called arraying the data.

Discrete (ungrouped) Frequency Distribution
If the data series is presented so that it shows the exact measurement of each unit, it is called a discrete frequency distribution. A discrete variable is one whose values differ from each other by definite amounts.
Ex: Assume that a survey of 10 families chosen at random records the number of post-graduates in each family. The resulting raw data could be as follows.
0, 1, 3, 1, 0, 2, 2, 2, 2, 4
These data can be classified into an ungrouped frequency distribution. The number of post-graduates becomes the variable (x), for which we list the frequency of occurrence (f) in tabular form as follows.

Number of post-graduates (x)    Frequency (f)
0                               2
1                               2
2                               4
3                               1
4                               1

The above example shows a discrete frequency distribution, where the variable takes discrete numerical values.
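The tally for an ungrouped frequency distribution is easy to reproduce in code. The sketch below is illustrative only: it uses Python's standard `collections.Counter` on the ten survey values from the example, and the variable names are assumptions.

```python
from collections import Counter

# Raw data from the example: post-graduates in each of 10 families.
observations = [0, 1, 3, 1, 0, 2, 2, 2, 2, 4]

# Counter maps each distinct value x to its frequency f.
frequency = Counter(observations)

print("x  f")
for x in sorted(frequency):
    print(x, "", frequency[x])
```

The frequencies sum to the number of families surveyed, which is a quick check that no observation was dropped.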

Continuous frequency distribution (grouped frequency distribution)
A continuous data series is one where the measurements are only approximations and are expressed in class intervals within certain limits. In a continuous frequency distribution the class intervals theoretically continue from the start of the frequency distribution to the end without a break. According to Boddington, a variable which can take every intermediate value between the smallest and largest values in the distribution gives a continuous frequency distribution.
Ex: The marks obtained by 20 students in an exam out of 50 marks are given below. Convert the data into a continuous frequency distribution.

18  23  28  29  44  28  48  33  32  43
24  29  32  39  49  42  27  33  28  29

Grouping the marks into class intervals of width 5 gives the following frequency distribution table.

Marks      No. of students
0 – 5      0
5 – 10     0
10 – 15    0
15 – 20    1
20 – 25    2
25 – 30    7
30 – 35    4
35 – 40    1
40 – 45    3
45 – 50    2
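The grouping of the 20 marks can be sketched in Python as follows. This is an illustration, not the notes' own procedure: it assumes classes of width 5 starting at 0, with each class containing its lower limit but not its upper limit, which is consistent with the frequencies in the table.

```python
# Marks of the 20 students from the example.
marks = [18, 23, 28, 29, 44, 28, 48, 33, 32, 43,
         24, 29, 32, 39, 49, 42, 27, 33, 28, 29]

width = 5      # class width used by the table (assumption read off the table)
maximum = 50   # the exam is out of 50 marks

# One class per interval [lower, lower + width), keyed by its limits.
counts = {(lower, lower + width): 0 for lower in range(0, maximum, width)}

for m in marks:
    lower = (m // width) * width   # lower limit of the class containing m
    counts[(lower, lower + width)] += 1

for (lower, upper), f in counts.items():
    print(f"{lower} - {upper}: {f}")
```

A mark of exactly 50 would fall outside the last class under this convention; none occurs in the data, so the sketch does not handle that case.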

Technical terms used in formulating a frequency distribution
a) Class limits: The class limits are the smallest and largest values in the class. Ex: In the class 0 – 10, the lowest value is 0 and the highest value is 10. The two boundaries of the class are called the lower and upper limits of the class. Class limits are also called class boundaries.
b) Class interval: The difference between the upper and lower limits of a class is known as the class interval. Ex: In the class 0 – 10, the class interval is (10 – 0) = 10. The formula to find the class interval is given below.

$i = \frac{L - S}{K}$

L = Largest value
S = Smallest value
K = The number of classes
Ex: If the marks of 60 students in a class vary between 40 and 100 and we want to form 6 classes, then with L = 100, S = 40 and K = 6 the class interval is

$i = \frac{L - S}{K} = \frac{100 - 40}{6} = \frac{60}{6} = 10$

Therefore, the class intervals would be 40 – 50, 50 – 60, 60 – 70, 70 – 80, 80 – 90 and 90 – 100.
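The class-interval calculation for the marks example can be written as a short Python sketch. The function name `class_interval` and its parameter names are hypothetical, chosen only for this illustration.

```python
def class_interval(largest, smallest, num_classes):
    """Class interval: (largest value - smallest value) / number of classes."""
    return (largest - smallest) / num_classes

# Marks vary between 40 and 100 and 6 classes are wanted.
smallest, largest, num_classes = 40, 100, 6
i = class_interval(largest, smallest, num_classes)

# Build the classes by stepping up from the smallest value.
classes = [(smallest + k * i, smallest + (k + 1) * i) for k in range(num_classes)]

print(i)
print(classes)
```

When the range is not an exact multiple of the number of classes, the quotient is usually rounded up to a convenient width so that the largest value still falls inside the last class.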

♦ Methods of forming class intervals
a) Exclusive method (overlapping)
In this method, the upper limit of one class interval is the lower limit of the next class. This method ensures continuity of the data.
Ex:

Marks      No. of students
20 – 30    5
30 – 40    15
40 – 50    25

A student whose mark is between 20 and 29.9 is included in the 20 – 30 class. A better way of expressing this is

Marks                  No. of students
20 to less than 30     5
30 to less than 40     15
40 to less than 50     25
Total students         45

(20 to less than 30 means 20 or more but less than 30.)

b) Inclusive method (non-overlapping)
Ex:

Marks      No. of students
20 – 29    5
30 – 39    15
40 – 49    25

A student whose mark is 29 is included in the 20 – 29 class interval, and a student whose mark is 39 is included in the 30 – 39 class interval.
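The difference between the two methods lies only in how the upper limit is treated. The Python sketch below makes this explicit; the function names are hypothetical and the class limits are those of the two example tables.

```python
def exclusive_class(value, classes):
    """Exclusive method: a class holds values with lower <= value < upper."""
    for lower, upper in classes:
        if lower <= value < upper:
            return (lower, upper)
    return None  # value lies outside every class

def inclusive_class(value, classes):
    """Inclusive method: a class holds values with lower <= value <= upper."""
    for lower, upper in classes:
        if lower <= value <= upper:
            return (lower, upper)
    return None  # value lies outside every class, or in a gap between classes

exclusive = [(20, 30), (30, 40), (40, 50)]
inclusive = [(20, 29), (30, 39), (40, 49)]

print(exclusive_class(30, exclusive))    # a mark of 30 goes to 30 - 40
print(inclusive_class(29, inclusive))    # a mark of 29 goes to 20 - 29
print(inclusive_class(29.5, inclusive))  # falls in the gap between classes
```

The last call shows why the inclusive method suits whole-number data: a fractional value such as 29.5 belongs to no class, whereas the exclusive method places every value in exactly one class.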

♦ Class Frequency
The number of observations falling within a class interval is called its class frequency. Ex: A class frequency of 5 for the class 90 – 100 means that 5 students scored between 90 and 100. Adding the frequencies of all the individual classes gives the total frequency, which is the total number of items studied.

♦ Magnitude of class interval
The magnitude of the class interval depends on the range and the number of classes. The range is the difference between the highest and smallest values in the data series. A class interval is generally a multiple of 5, 10, 15 or 20. Sturges formula to find the number of classes is given below.
K = 1 + 3.322 log N
K = No. of classes
log N = Logarithm (to base 10) of the total no. of observations
Ex: If the total number of observations is 100, then the number of classes is
K = 1 + 3.322 log 100
K = 1 + 3.322 x 2
K = 1 + 6.644
K = 7.644 = 8 (rounded off)
NOTE: Under this formula the number of classes should not be less than 4 or greater than 20.
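Sturges formula can be evaluated directly with Python's `math.log10`. This is a minimal sketch; the function name `sturges_classes` is hypothetical, and rounding to the nearest whole number mirrors the worked example rather than any fixed rule.

```python
import math

def sturges_classes(n):
    """Number of classes K = 1 + 3.322 * log10(N), before rounding."""
    return 1 + 3.322 * math.log10(n)

k_exact = sturges_classes(100)   # 1 + 3.322 * 2 = 7.644
k = round(k_exact)               # rounded off to a whole number of classes

print(k_exact, k)
```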

♦ Class mid point or class mark
The mid value or central value of the class interval is called the mid point.

$\text{Mid point of a class} = \frac{\text{lower limit of class} + \text{upper limit of class}}{2}$

♦ Sturges formula to find size of class interval

$h = \frac{\text{Range}}{1 + 3.322 \log N}$

Ex: In a group of 50 workers, the highest wage is Rs. 250 and the lowest wage is Rs. 100 per day. Find the size of the interval.

$h = \frac{\text{Range}}{1 + 3.322 \log N} = \frac{250 - 100}{1 + 3.322 \log 50} = \frac{150}{6.644} = 22.58 \cong 23$
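The size of the class interval from Sturges formula can be evaluated numerically. The sketch below assumes N = 50 workers, as the term log 50 in the worked example indicates; the function name is hypothetical. With a range of 250 − 100 = 150 it gives a value of about 22.58, which rounds up to 23.

```python
import math

def sturges_interval_size(highest, lowest, n):
    """Size of class interval h = Range / (1 + 3.322 * log10(N))."""
    return (highest - lowest) / (1 + 3.322 * math.log10(n))

# Highest wage Rs. 250, lowest wage Rs. 100, N = 50 workers (assumed).
h = sturges_interval_size(250, 100, 50)

print(round(h, 2))    # about 22.58
print(math.ceil(h))   # rounded up to a whole number of rupees
```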

Constructing a frequency distribution
The following guidelines may be considered when constructing a frequency distribution.
a) The classes should be clearly defined, and each observation must belong to one and only one class interval. The class intervals must be all-inclusive and non-overlapping.
b) The number of classes should be neither too large nor too small. Too few classes give wide intervals with a loss of accuracy. Too many classes make the table complex.
c) All intervals should be of the same width. This is preferred for easy computation.

$\text{Width of interval} = \frac{\text{Range}}{\text{Number of classes}}$

d) Open-end classes should be avoided, since they create difficulty in analysis and interpretation.
e) Intervals should be continuous throughout the distribution. This is important for a continuous distribution.
f) The lower limits of the class intervals should be simple multiples of the interval.
Ex: The weights of a sample of 30 students of a particular class are as follows. Construct a frequency distribution for the given data.

62  58  58  52  48  53  54  63  69  63
57  56  46  48  53  56  57  59  58  53
52  56  57  52  52  53  54  58  61  63

♦ Steps of construction
Step 1: Find the range of the data.
(H) Highest value = 69
(L) Lowest value = 46
Range = H – L = 69 – 46 = 23
Step 2: Find the number of class intervals using Sturges formula.
K = 1 + 3.322 log N
K = 1 + 3.322 log 30
K = 5.90, say K = 6
∴ No. of classes = 6
Step 3: Find the width of the class interval.

$\text{Width of class interval} = \frac{\text{Range}}{\text{Number of classes}} = \frac{23}{6} = 3.833 \cong 4$

Step 4: Count the observations belonging to each class interval and assign the total frequency to the corresponding class interval as follows.

Class interval    Tally bars    Frequency
46 – 50           |||           3
50 – 54           |||| |||      8
54 – 58           |||| |||      8
58 – 62           |||| |        6
62 – 66           ||||          4
66 – 70           |             1
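The four steps can be followed end to end in a short Python sketch using the 30 weights from the example. It is an illustration under two assumptions: the classes start at the lowest value, and each class contains its lower limit but not its upper limit (the exclusive method), which reproduces the frequencies in the table.

```python
import math

weights = [62, 58, 58, 52, 48, 53, 54, 63, 69, 63,
           57, 56, 46, 48, 53, 56, 57, 59, 58, 53,
           52, 56, 57, 52, 52, 53, 54, 58, 61, 63]

# Step 1: range of the data.
highest, lowest = max(weights), min(weights)
data_range = highest - lowest

# Step 2: number of classes by Sturges formula, rounded off.
k = round(1 + 3.322 * math.log10(len(weights)))

# Step 3: width of the class interval, rounded up to a whole number.
width = math.ceil(data_range / k)

# Step 4: count the observations in each class [lower, upper).
frequencies = {}
for c in range(k):
    lower = lowest + c * width
    upper = lower + width
    frequencies[(lower, upper)] = sum(1 for w in weights if lower <= w < upper)

print(data_range, k, width)
for (lower, upper), f in frequencies.items():
    print(f"{lower} - {upper}: {f}")
```

Rounding the width up rather than down guarantees that the highest observation, 69, still falls inside the last class.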

Cumulative frequency distribution
A cumulative frequency distribution shows directly the number of units that lie above or below specified values of the class intervals. When the investigator is interested in the number of cases below a specified value, that value is taken as the upper limit of a class interval, and the result is known as a ‘less than’ cumulative frequency distribution. When the interest lies in the number of cases above a specified value, that value is taken as the lower limit of a class interval, and the result is known as a ‘more than’ cumulative frequency distribution. The cumulative frequency is simply the running sum of the consecutive frequencies.
Ex:

Marks      No. of students    ‘Less than’ cumulative frequency
0 – 10     5                  5
10 – 20    3                  8
20 – 30    10                 18
30 – 40    20                 38
40 – 50    12                 50

In the above ‘less than’ cumulative frequency distribution, 5 students scored less than 10, 8 less than 20, 18 less than 30 and so on. Similarly, the following table shows the ‘more than’ cumulative frequency distribution.
Ex:

Marks      No. of students    ‘More than’ cumulative frequency
0 – 10     5                  50
10 – 20    3                  45
20 – 30    10                 42
30 – 40    20                 32
40 – 50    12                 12

In the above ‘more than’ cumulative frequency distribution, 50 students scored more than 0, 45 more than 10, 42 more than 20 and so on.
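Both cumulative columns are running sums of the class frequencies, taken from opposite ends of the table. The following Python sketch, offered only as an illustration, uses the standard `itertools.accumulate` on the frequencies of the example; the variable names are assumptions.

```python
from itertools import accumulate

# Class frequencies for marks 0-10, 10-20, 20-30, 30-40, 40-50.
frequencies = [5, 3, 10, 20, 12]

# 'Less than': running sum from the first class downwards.
less_than = list(accumulate(frequencies))

# 'More than': running sum from the last class upwards, then put
# back into the original class order.
more_than = list(accumulate(reversed(frequencies)))[::-1]

print(less_than)   # [5, 8, 18, 38, 50]
print(more_than)   # [50, 45, 42, 32, 12]
```

The last ‘less than’ value and the first ‘more than’ value both equal the total number of students, which serves as a check on the arithmetic.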

Diagrammatic and Graphic Representation
Collected data can be presented graphically or pictorially for easy understanding and quick interpretation. Diagrams and graphs give visual indications of magnitudes, groupings, trends and patterns in the data. These features can be presented more simply in graphical form. Diagrams and graphs also help in comparing variables.

Diagrammatic presentation
A diagram is a visual form for the presentation of statistical data. The term covers various devices such as bars, circles, maps, pictorials and cartograms.

Importance of Diagrams
1. They are simple, attractive and easy to understand.
2. They give quick information.
3. They help to compare variables.
4. They are more suitable for illustrating discrete data.
5. They leave a more lasting impression on the reader’s mind.

Limitations of diagrams
1. Diagrams show only approximate values.
2. Diagrams are not suitable for further analysis.
3. Some diagrams (multidimensional ones) can be read only by experts.
4. Details cannot be provided fully.
5. They are useful only for comparison.

General Rules for drawing the diagrams
i) Each diagram should have a suitable title, at the top or bottom, indicating its theme.
ii) The size of the diagram should emphasise the important characteristics of the data.
iii) An appropriate proportion should be maintained between the length and breadth of the diagram.
iv) A proper and suitable scale should be adopted for the diagram.
v) Selecting the appropriate diagram is important; a wrong selection may mislead the reader.
vi) The source of the data should be mentioned at the bottom.
vii) The diagram should be simple and attractive.
viii) The diagram should be effective rather than complex.

Some important types of diagrams
a) One-dimensional diagrams (line and bar)
b) Two-dimensional diagrams (rectangle, square, circle)
c) Three-dimensional diagrams (cube, sphere, cylinder etc.)
d) Pictogram
e) Cartogram

a) One-dimensional diagrams (line and bar)
In one-dimensional diagrams, only the length of the bars or lines is taken into account. The width of the bars is not considered. One-dimensional diagrams are classified mainly as follows.
i) Line diagram
ii) Bar diagram
   - Vertical bar diagram
   - Horizontal bar diagram
   - Multiple (compound) bar diagram
   - Sub-divided (component) bar diagram
   - Percentage sub-divided bar diagram

i) Line diagram
This is the simplest type of one-dimensional diagram. The heights of the lines are drawn in proportion to the size of the figures, and the distance between lines is kept uniform. The limitations of this diagram are that it is not attractive and cannot convey more than one piece of information.
Ex: Draw the line diagram for the following data.

Year                                                      2001  2002  2003  2004  2005  2006
No. of students passed in first class with distinction       5     7    12     5    13    15

[Figure: line diagram. x-axis: Year (2001 to 2006); y-axis: No. of students passed in FCD; line heights 5, 7, 12, 5, 13, 15]

Indication of diagram: The highest FCD count is in 2006 and the lowest is in 2001 and 2004.

ii) Simple bar diagram
A simple bar diagram can be drawn using horizontal or vertical bars. It is a very common diagram in business and economics.

♦ Vertical bar diagram
Ex: The annual expenses of maintaining cars of various makes are given below. Draw the vertical bar diagram. The annual expenses include fuel, maintenance, repair, assistance and insurance.

Type of the car    Expense in Rs. / Year
Maruthi Udyog      47533
Hyundai            59230
Tata Motors        63270

Source: 2005 TNS TCS Study. Published at: Vijaya Karnataka, dated: 03.08.2006

[Figure: vertical bar diagram. x-axis: type of car (Maruthi Udyog, Hyundai, Tata Motors); y-axis: expense in Rs. / year, scale 30000 to 70000; bar heights 47533, 59230, 63270]

Indication of diagram
a) The annual expenses of the Maruthi Udyog brand are comparatively lower than those of the other brands depicted.
b) The Tata Motors brand has the highest annual expenses.

♦ Horizontal bar diagram
Ex: Production data for the world’s top 10 steel makers are given below. Draw the horizontal bar diagram.

Steel maker       Prodn. in million tonnes
Arcelor Mittal    110
Nippon            32
POSCO             31
JFE               30
BAO Steel         24
US Steel          20
NUCOR             18
RIVA              18
Thyssen-krupp     17
Tangshan          16

[Figure: horizontal bar diagram "Top - 10 Steel Makers". x-axis: Production of Steel (Million Tonnes), scale 0 to 120; y-axis: steel maker, from Arcelor Mittal (110) to Tangshan (16)]

Source: ISSB. Published by India Today

♦ Compound bar diagram (Multiple bar diagram)
Multiple bar diagrams are used to provide more information than a simple bar diagram. A multiple bar diagram depicts more than one phenomenon and is highly useful for direct comparison. The bars are drawn side by side, and different colours, shades or hatches can be used to indicate each variable.
Ex: Draw the bar diagram for the following data. Resale values of the cars (Rs. ’000) are as follows.

Year (Model)   Santro   Zen   Wagonr
2003           208      252   248
2004           240      278   274
2005           261      296   302

[Figure: multiple bar diagram. x-axis: Model of Car; y-axis: Value in Rs., scale 0 to 350; legend: Santro, Zen, Wagonr]

Source: True value used car purchase data. Published by: Vijaya Karnataka, dated: 03.08.2006

Ex: Represent the following data in a suitable diagram.

Class     A      B      C
Male      1000   1500   1500
Female    500    800    1000
Total     1500   2300   2500

[Figure: bar diagram of population by class. x-axis: Class; y-axis: Population (in Nos.), scale 0 to 2500; legend: Male, Female]

Ex: Draw a suitable diagram for the following data.

Mode of investment   Investment in 2004 (Rs.)   %age     Investment in 2005 (Rs.)   %age
NSC                  25000                      43.10    30000                      45.45
MIS                  15000                      25.86    10000                      15.15
Mutual Fund          15000                      25.86    25000                      37.88
LIC                  3000                       5.17     1000                       1.52
Total                58000                      100      66000                      100

[Figure: percentage sub-divided bar diagram. x-axis: Year (2004, 2005); y-axis: % of Investment, scale 0 to 100; segments: NSC, MIS, Mutual Fund, LIC]
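The percentage columns in the investment table can be recomputed from the rupee amounts. The following is a minimal sketch, not part of the original notes; the function name `percentage_shares` and the dictionary layout are assumptions made for illustration.

```python
# Investments in Rs., copied from the table above.
investments = {
    2004: {"NSC": 25000, "MIS": 15000, "Mutual Fund": 15000, "LIC": 3000},
    2005: {"NSC": 30000, "MIS": 10000, "Mutual Fund": 25000, "LIC": 1000},
}


def percentage_shares(amounts):
    """Return each component as a percentage of the column total."""
    total = sum(amounts.values())
    return {name: 100 * value / total for name, value in amounts.items()}


# One set of segment heights per bar of the percentage sub-divided diagram.
shares = {year: percentage_shares(amounts) for year, amounts in investments.items()}

for year, parts in shares.items():
    print(year, {name: round(pct, 2) for name, pct in parts.items()})
```

Each bar in the diagram then has the same total height (100), and only the segment heights differ between years.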

Two-dimensional diagram
In two-dimensional diagrams both the length and the breadth are considered, so that the area of the diagram represents the data. The important two-dimensional diagrams are
a) Rectangular diagram
b) Square diagram

a) Rectangular diagram
Rectangular diagrams are used to depict two or more variables and help in direct comparison. The areas of the rectangles are kept in proportion to the values. They may be of two types.
i) Percentage sub-divided rectangular diagram
ii) Sub-divided rectangular diagram
In the former case the widths of the rectangles are proportional to the total values; the various components are converted into percentages and the rectangles are divided accordingly. The latter case is used to show related phenomena such as cost per unit, quality of production etc.
Ex: Draw the rectangular diagram for the following data.

Item Expenditure     Family A (Rs.)   Family B (Rs.)
Provisional stores   1000             2000
Education            250              500
Electricity          300              700
House Rent           1500             2800
Vehicle Fuel         500              1000
Total                3500             7000

The total expenditure of each family is taken as 100 and the expenditure on individual items is expressed as a percentage of it. The widths of the two rectangles are in proportion to the total expenses of the two families, i.e. 3500 : 7000 or 1 : 2. The heights of the segments within each rectangle follow the percentages of expenses.

Monthly expenditure
Item Expenditure     Family A (Rs. 3500)      Family B (Rs. 7000)
                     Rs.      %age            Rs.      %age
Provisional stores   1000     28.57           2000     28.57
Education            250      7.14            500      7.14
Electricity          300      8.57            700      10
House Rent           1500     42.85           2800     40
Vehicle Fuel         500      12.85           1000     14.28
Total                3500     100             7000     100

[Figure: percentage sub-divided rectangular diagram. x-axis: Family (A, B); y-axis: % of Expenditure, scale 0 to 100; legend: Provisional Stores, Education, Electricity, House Rent, Vehicle Fuel]

b) Square diagram
To draw square diagrams, the square root of each value is taken, and the sides of the squares are drawn in proportion to these square roots. A suitable scale may be used to depict the diagram.
Ex: Draw the square diagram for the following data: 4900, 2500, 1600.
Solution: The square roots of the items are 70, 50 and 40; dividing each by 10 gives sides of 7, 5 and 4.

[Figure: square diagram. Three squares with sides 7, 5 and 4 representing 4900, 2500 and 1600]
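The side lengths in the square diagram example can be checked with a few lines of code. This is a small sketch added for illustration; the variable names and the `scale` divisor of 10 (taken from the worked solution) are the only assumptions.

```python
import math

# Values to be represented by squares, from the example above.
values = [4900, 2500, 1600]

# Divisor applied to the square roots, as in the worked solution.
scale = 10

# The area of each square is proportional to its value,
# so the side is proportional to the square root of the value.
sides = [math.sqrt(value) / scale for value in values]

print(sides)
```

Because areas, not sides, carry the comparison, a value four times as large gets a square whose side is only twice as long.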

Pie diagram
A pie diagram shows the partitioning of a total into its component parts. It is used to show classes or groups of data in proportion to the whole data set. The entire pie represents all the data, while each slice represents a different class or group within the whole. The following illustration shows the construction of a pie diagram.

Ex: Revenue collections by the government from petroleum products for the year 2005-2006, in Rs. crores, are as follows. Draw the pie diagram.

Item / Source                 Value in crores
Customs                       9600
Excise                        49300
Corporate Tax and Dividend    18900
State’s taking                48800
Total                         126600

Solution:
Item / Source                 Value in crores   Angle of circle                    %ge
Customs                       9600              9600 × 360 / 126600 = 27.30°       7.58
Excise                        49300             49300 × 360 / 126600 = 140.20°     39.00
Corporate Tax and Dividend    18900             18900 × 360 / 126600 = 53.70°      14.92
State’s taking                48800             48800 × 360 / 126600 = 138.80°     38.50
Total                         126600            360°                               100

[Figure: pie diagram. Slices: Customs 7.58%, Excise 39%, Corporate Tax and Dividend 14.92%, State’s taking 38.5%]

Source: India Today, 19 June, 2006
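The angle of each slice is the component's share of the total multiplied by 360°. The sketch below recomputes the solution table from the revenue figures; the dictionary names are illustrative assumptions, and the table in the notes rounds the angles to one decimal place.

```python
# Revenue collections in Rs. crores, from the example above.
revenue = {
    "Customs": 9600,
    "Excise": 49300,
    "Corporate Tax and Dividend": 18900,
    "State's taking": 48800,
}

total = sum(revenue.values())

# Angle of each slice in degrees and its percentage share of the pie.
angles = {item: value * 360 / total for item, value in revenue.items()}
percentages = {item: value * 100 / total for item, value in revenue.items()}

for item in revenue:
    print(f"{item}: {angles[item]:.1f} degrees, {percentages[item]:.2f} %")
```

The angles always sum to 360 and the percentages to 100, which is a quick check before drawing the slices with a protractor.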

Choice or selection of diagram
There are many methods of depicting statistical data through diagrams. No single diagram is suited to all purposes. Choosing a diagram to suit a given set of data requires skill, knowledge and experience. Primarily, the choice depends upon the nature of the data, the purpose of the presentation and the audience for whom it is meant. The nature of the data helps in deciding between a one-dimensional, two-dimensional or three-dimensional diagram. The following points are to be kept in mind when choosing a diagram.
1. For the common man, who has less knowledge of statistics, cartograms and pictograms are suited.
2. To present the components apart from the magnitude of values, a sub-divided bar diagram can be used.
3. When a large number of components are to be shown, a pie diagram is suitable.

Graphic presentation
A graphic presentation is a visual form of presentation. Graphs are drawn on a special type of paper known as graph paper. Common graphic representations are
a) Histogram
b) Frequency polygon
c) Cumulative frequency curve (ogive)

Advantages of graphic presentation
1. It provides an attractive and impressive view.
2. It simplifies the complexity of data.
3. It helps in direct comparison.
4. It helps in further statistical analysis.
5. It is the simplest method of presenting data.
6. It shows the trend and pattern of data.

Difference between graph and diagram
Diagram                                                         Graph
1. Ordinary paper can be used                                   1. Graph paper is required
2. It is attractive and easily understandable                   2. Needs some effort to understand
3. It is appropriate and effective to measure more variables    3. It creates problems
4. It can’t be used for further analysis                        4. Can be used for further analysis
5. It gives comparison                                          5. It shows relationship between variables
6. Data are represented by bars, rectangles                     6. Points and lines are used to represent data

Frequency Histogram
In this type of representation the given data are plotted as a series of rectangles. Class intervals are marked along the x-axis and frequencies along the y-axis, according to a suitable scale. Unlike the bar chart, which is one-dimensional, a histogram is two-dimensional: both the length and the width are important. A histogram is constructed from a frequency distribution of grouped data, where the height of each rectangle is proportional to the respective frequency and the width represents the class interval. Each rectangle adjoins the next; a blank space between rectangles means that the class is empty, i.e. there are no values in that class interval.
Ex: Construct a histogram for the following data. For convenience, the frequency distribution is presented along with the mid-point of each class interval, where the mid-point is simply the average of the lower and upper boundaries of the class interval.

Marks obtained (x)   No. of students (f)   Mid point
15 – 25              5                     20
25 – 35              3                     30
35 – 45              7                     40
45 – 55              5                     50
55 – 65              3                     60
65 – 75              7                     70
Total                30

[Figure: histogram. x-axis: Class Interval (Marks), 15 to 75; y-axis: Frequency (No. of students), scale 0 to 7]
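As a rough illustration of the histogram example, the sketch below computes the class mid-points and prints a text version of the histogram, one row per class. It is not the author's method; the row format and variable names are assumptions chosen for readability.

```python
# Class intervals (lower, upper) and frequencies from the example above.
classes = [(15, 25), (25, 35), (35, 45), (45, 55), (55, 65), (65, 75)]
frequencies = [5, 3, 7, 5, 3, 7]

# Mid-point = average of the lower and upper boundary of each class.
midpoints = [(lower + upper) / 2 for lower, upper in classes]

# A text histogram: one '#' per student in the class.
rows = [
    f"{lower}-{upper} (mid {mid:g}) | {'#' * freq} {freq}"
    for (lower, upper), mid, freq in zip(classes, midpoints, frequencies)
]

print("\n".join(rows))
```

The text rows are horizontal, whereas the histogram in the notes is drawn with vertical rectangles; the information conveyed is the same.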

Frequency polygon
A frequency polygon is a line chart of a frequency distribution in which either the values of a discrete variable or the mid-points of the class intervals are plotted against the frequencies, and the plotted points are joined by straight lines. Since the frequencies do not start or end at zero, the diagram as such would not touch the horizontal axis. However, the area under the entire curve should be the same as that of the histogram, which is 100%, so the curve must be ‘enclosed’: the first mid-point is joined to a ‘fictitious’ preceding mid-point whose frequency is zero, so that the curve begins on the horizontal axis, and the last mid-point is joined to a ‘fictitious’ succeeding mid-point whose frequency is also zero, so that the curve ends on the horizontal axis. This enclosed diagram is known as a ‘frequency polygon’.
Ex: Construct a frequency polygon for the following data.

Marks (CI)   No. of frequencies (f)   Mid-point
15 – 25      5                        20
25 – 35      3                        30
35 – 45      7                        40
45 – 55      5                        50
55 – 65      3                        60
65 – 75      7                        70

[Figure: frequency polygon. x-axis: Mid point (x), scale 0 to 100; y-axis: Frequency, scale 0 to 10]
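The 'enclosing' step can be made concrete by listing the points that are joined. In this sketch the fictitious mid-points are assumed to lie one class width before the first and after the last real mid-point (10 and 80 for this data); the variable names are illustrative.

```python
# Mid-points and frequencies from the example above.
midpoints = [20, 30, 40, 50, 60, 70]
frequencies = [5, 3, 7, 5, 3, 7]

# Class width, taken as the gap between consecutive mid-points.
width = midpoints[1] - midpoints[0]

# Add a fictitious zero-frequency point before and after the real points
# so that the polygon starts and ends on the horizontal axis.
points = (
    [(midpoints[0] - width, 0)]
    + list(zip(midpoints, frequencies))
    + [(midpoints[-1] + width, 0)]
)

print(points)
```

Joining these eight points in order with straight lines gives the closed polygon described above.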

Cumulative frequency curve (ogive)
Ogives are graphic representations of a cumulative frequency distribution. They are classified as ‘less than’ and ‘more than’ ogives. For a ‘less than’ ogive, the cumulative frequencies are plotted against the upper boundaries of their respective class intervals. For a ‘more than’ ogive, the cumulative frequencies are plotted against the lower boundaries of their respective class intervals. Ogives are used for comparison purposes: several ogives can be drawn on the same grid in different colours for easier visualisation and differentiation.
Ex:

Marks (CI)   No. of frequencies (f)   Mid-point   Cum. Freq. Less than   Cum. Freq. More than
15 – 25      5                        20          5                      30
25 – 35      3                        30          8                      25
35 – 45      7                        40          15                     22
45 – 55      5                        50          20                     15
55 – 65      3                        60          23                     10
65 – 75      7                        70          30                     7

[Figure: ‘less than’ ogive. x-axis: Upper Boundary (CI); y-axis: Less than Cumulative Frequency]

[Figure: ‘more than’ ogive. x-axis: Lower Boundary (CI); y-axis: More than Cumulative Frequency]
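The two cumulative columns and the points plotted for each ogive can be derived from the class frequencies. The following is a minimal sketch using the standard library's `itertools.accumulate`; the variable names are assumptions made for illustration.

```python
from itertools import accumulate

# Class intervals (lower, upper) and frequencies from the table above.
classes = [(15, 25), (25, 35), (35, 45), (45, 55), (55, 65), (65, 75)]
frequencies = [5, 3, 7, 5, 3, 7]

# 'Less than': running total from the first class downwards.
less_than = list(accumulate(frequencies))

# 'More than': running total from the last class upwards.
more_than = list(accumulate(reversed(frequencies)))[::-1]

# 'Less than' ogive: plotted against upper boundaries.
less_than_points = [(upper, cf) for (_, upper), cf in zip(classes, less_than)]

# 'More than' ogive: plotted against lower boundaries.
more_than_points = [(lower, cf) for (lower, _), cf in zip(classes, more_than)]

print(less_than_points)
print(more_than_points)
```

The 'less than' curve rises to the total frequency of 30, while the 'more than' curve starts at 30 and falls.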

LESSON – 1 STATISTICS FOR MANAGEMENT

Session – 2

Duration: 1 hr

Classification and Tabulation
The data collected for a statistical inquiry sometimes consist of a few fairly simple figures, which can be easily understood without any special treatment. More often, however, there is an overwhelming mass of raw data without any structure. Such an unwieldy, unorganised and shapeless mass of collected data cannot be rapidly or easily assimilated or interpreted, and is not fit for further analysis. To make the data simple and easily understandable, the first task is to condense and simplify them in such a way that irrelevant data are removed and their significant features stand out prominently. The procedure adopted for this purpose is known as the method of classification and tabulation. Classification helps proper tabulation.
“Classified and arranged facts speak for themselves; unarranged, unorganised they are dead as mutton.” - Prof. J.R. Hicks

♦ Meaning of Classification
Classification is the process of arranging things or data in groups or classes according to their resemblances and affinities, giving expression to the unity of attributes that may subsist among a diversity of individuals.

♦ Definition of Classification
Classification is the process of arranging data into sequences and groups according to their common characteristics, or separating them into different but related parts. - Secrist
The process of grouping a large number of individual facts and observations on the basis of similarity among the items is called classification. - Stockton & Clark

Characteristics of classification
a) Classification performs homogeneous grouping of data.
b) It brings out points of similarity and dissimilarity.
c) The classification may be either real or imaginary.
d) Classification is flexible enough to accommodate adjustments.

Objectives / purposes of classification
i) To simplify and condense large data
ii) To present the facts in an easily understandable form
iii) To allow comparisons
iv) To help to draw valid inferences
v) To relate the variables among the data
vi) To help further analysis
vii) To eliminate unwanted data
viii) To prepare for tabulation

Guiding principles (rules) of classification
Following are the general guiding principles for good classification.
a) Exhaustive: Classification should be exhaustive. Each and every item in the data must belong to one of the classes. Introduction of a residual class (e.g. others, miscellaneous etc.) should be avoided.
b) Mutually exclusive: Each item should be placed in only one class.
c) Suitability: The classification should conform to the object of inquiry.
d) Stability: Only one principle must be maintained throughout the classification and analysis.
e) Homogeneity: The items included in each class must be homogeneous.
f) Flexibility: A good classification should be flexible enough to accommodate new or changed situations.

Modes / Types of Classification
Modes / types of classification refer to the class categories into which the data can be sorted and tabulated. These categories depend on the nature of the data and the purpose for which the data are sought.

Important types of classification
a) Geographical (i.e. area or region wise)
b) Chronological (temporal / historical, i.e. with respect to time)
c) Qualitative (on the basis of character / attributes)
d) Numerical, quantitative (on the basis of magnitude)

a) Geographical Classification
In geographical classification, the data are classified by geographical region.
Ex: Sales of the company (in million rupees), region-wise

Region   Sales
North    285
South    300
East     185
West     235

b) Chronological Classification
If statistical data are classified according to the time of their occurrence, the classification is called chronological classification.
Ex: Sales reported by a departmental store

Month      Sales (Rs.) in lakhs
January    22
February   26
March      32
April      25
May        27
June       30

c) Qualitative Classification
In qualitative classification, the data are classified according to the presence or absence of attributes in the given units. The classification is thus based on some quality characteristic / attribute.
Ex: Sex, literacy, education, class grade etc.
It may be further classified as
i) Simple classification
ii) Manifold classification

i) Simple classification: If the data are classified into only two classes, the classification is known as simple classification.
Ex: a) Population into Male / Female
    b) Population into Educated / Uneducated

ii) Manifold classification: In this case, the classification is based on more than one attribute at a time.
Ex:
Population
   Smokers
      Literate: Male, Female
      Illiterate: Male, Female
   Non-smokers
      Literate: Male, Female
      Illiterate: Male, Female

d) Quantitative Classification
In quantitative classification, the classification is based on quantitative measurements of some characteristic, such as age, marks, income, production, sales etc. The quantitative phenomenon under study is known as a variable, and hence this classification is also called classification by variable.
Ex: For a 50 marks test, the marks obtained by students are classified as follows.

Marks      No. of students
0 – 10     5
10 – 20    7
20 – 30    10
30 – 40    25
40 – 50    3
Total Students = 50

In this classification the marks obtained by students is the variable and the number of students in each class is the frequency.

Tabulation
Meaning and Definition of Tabulation
Tabulation may be defined as the systematic arrangement of data in columns and rows. It is designed to simplify the presentation of data for the purposes of analysis and statistical inference.

Major Objectives of Tabulation
1. To simplify complex data
2. To facilitate comparison
3. To economize space
4. To draw valid inferences / conclusions
5. To help further analysis

Differences between Classification and Tabulation
1. Data are first classified and then presented in tables; classification is the basis for tabulation.
2. Tabulation is a mechanical function of classification, because in tabulation the classified data are placed in rows and columns.
3. Classification is a process of statistical analysis, while tabulation is a process of presenting data in a suitable structure.

Classification of tables
Tables are classified on the basis of
1. Coverage (simple and complex tables)
2. Objective / purpose (general purpose / reference table, special / summary table)
3. Nature of inquiry (primary and derived tables)

Ex:
a) Simple table: Data are classified on the basis of only one characteristic.

Distribution of marks
Class Marks   No. of students
30 – 40       20
40 – 50       20
50 – 60       10
Total         50

b) Two-way table: Classification is based on two characteristics.

Class Marks   No. of students
              Boys   Girls   Total
30 – 40       10     10      20
40 – 50       15     5       20
50 – 60       3      7       10
Total         28     22      50

Frequency Distribution Frequency distribution is a table used to organize the data. The left column (called classes or groups) includes numerical intervals on a variable under study. The right column contains the list of frequencies, or number of occurrences of each class/group. Intervals are normally of equal size covering the sample observations range. It is simply a table in which the gathered data are grouped into classes and the number of occurrences, which fall in each class, is recorded.

♦ Definition
A frequency distribution is a statistical table which shows the set of all distinct values of the variable arranged in order of magnitude, either individually or in groups, with their corresponding frequencies. - Croxton and Cowden

A frequency distribution can be classified as
a) Series of individual observations
b) Discrete frequency distribution
c) Continuous frequency distribution

Series of individual observations
A series of individual observations is a series in which the items are listed one after another, one entry per observation. For statistical calculations, the observations can be arranged in either ascending or descending order; this is called an array.
Ex:

Roll No.   Marks obtained in statistics paper
1          83
2          80
3          75
4          92
5          65

The above list is raw data; presented in this form it reveals little information. Arranging the data in ascending or descending order of magnitude gives a better presentation and is called arraying of data.

Discrete (ungrouped) Frequency Distribution
If the data series is presented so that it indicates the exact measurement of each unit, it is called a discrete frequency distribution. A discrete variable is one whose values differ from each other by definite amounts.
Ex: Assume that a survey has been made of the number of post-graduates in 10 families chosen at random; the resulting raw data could be as follows.
0, 1, 3, 1, 0, 2, 2, 2, 2, 4

This data can be classified into an ungrouped frequency distribution. The number of post-graduates becomes the variable (x), for which we can list the frequency of occurrence (f) in tabular form as follows.

Number of post graduates (x)   Frequency (f)
0                              2
1                              2
2                              4
3                              1
4                              1

The above example shows a discrete frequency distribution, where the variable has discrete numerical values.
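Counting how often each distinct value occurs is all that is needed to build an ungrouped frequency distribution. A minimal sketch using the standard library's `collections.Counter` follows; the variable names are illustrative assumptions.

```python
from collections import Counter

# Number of post-graduates in each of the 10 surveyed families.
raw = [0, 1, 3, 1, 0, 2, 2, 2, 2, 4]

# Count occurrences of each value and order by the variable x.
frequency = dict(sorted(Counter(raw).items()))

for x, f in frequency.items():
    print(x, f)
```

The frequencies add up to the number of families surveyed, which is a useful check on any hand-made tally.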

Continuous frequency distribution (grouped frequency distribution)
A continuous data series is one where the measurements are only approximations and are expressed in class intervals within certain limits. In a continuous frequency distribution the class intervals theoretically continue from the start of the frequency distribution to the end without a break. According to Boddington, ‘the variable which can take every intermediate value between the smallest and largest value in the distribution is a continuous frequency distribution’.
Ex: The marks obtained by 20 students in an exam out of 50 marks are given below. Convert the data into a continuous frequency distribution.

18  23  28  29  44  28  48  33  32  43
24  29  32  39  49  42  27  33  28  29

By grouping the marks into class intervals of 5, the following frequency distribution table can be formed.

Marks      No. of students
0 – 5      0
5 – 10     0
10 – 15    0
15 – 20    1
20 – 25    2
25 – 30    7
30 – 35    4
35 – 40    1
40 – 45    3
45 – 50    2
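The grouping in this example can be reproduced by assigning each mark to the class whose lower limit is the largest multiple of 5 not exceeding it. This is a sketch under the assumption that the upper limit of each class is excluded (so a mark of 25 would fall in 25 – 30) and that no mark equals 50; the names are illustrative.

```python
# Marks of the 20 students, from the example above.
marks = [18, 23, 28, 29, 44, 28, 48, 33, 32, 43,
         24, 29, 32, 39, 49, 42, 27, 33, 28, 29]

width = 5

# Start every class 0-5, 5-10, ..., 45-50 with a count of zero.
counts = {(lower, lower + width): 0 for lower in range(0, 50, width)}

for mark in marks:
    lower = (mark // width) * width  # lower limit of the class containing the mark
    counts[(lower, lower + width)] += 1

for (lower, upper), count in counts.items():
    print(f"{lower} - {upper}: {count}")
```

Empty classes are kept in the table so that the distribution runs without a break from 0 to 50.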

LESSON – 1 STATISTICS FOR MANAGEMENT

Session – 3

Duration: 1 hr

Technical terms used in formulating a frequency distribution
a) Class limits: The class limits are the smallest and largest values in the class.
Ex: In the class 0 – 10, the lowest value is 0 and the highest value is 10. The two boundaries of the class are called the lower and upper limits of the class. Class limits are also called class boundaries.
b) Class interval: The difference between the upper and lower limits of a class is known as the class interval.
Ex: In the class 0 – 10, the class interval is (10 – 0) = 10.
The formula to find the class interval is given below.

$i = \frac{L - S}{K}$

L = Largest value
S = Smallest value
K = the no. of classes

Ex: If the marks of 60 students in a class vary between 40 and 100 and we want to form 6 classes, then L = 100, S = 40, K = 6 and the class interval would be

$i = \frac{L - S}{K} = \frac{100 - 40}{6} = \frac{60}{6} = 10$

Therefore, the class intervals would be 40 – 50, 50 – 60, 60 – 70, 70 – 80, 80 – 90 and 90 – 100.
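The class interval formula and the resulting classes can be sketched in code as follows. The function names `class_interval` and `class_limits` are hypothetical helpers introduced only for this illustration.

```python
def class_interval(largest, smallest, num_classes):
    """Class interval i = (L - S) / K."""
    return (largest - smallest) / num_classes


def class_limits(smallest, width, num_classes):
    """List the (lower, upper) limits of each class, starting at the smallest value."""
    return [
        (smallest + k * width, smallest + (k + 1) * width)
        for k in range(num_classes)
    ]


# Marks vary between 40 and 100 and 6 classes are wanted.
width = class_interval(100, 40, 6)
limits = class_limits(40, width, 6)

print(width)
print(limits)
```

Starting the first class at the smallest value is one convention; in practice the starting point is often rounded down to a convenient multiple of the interval.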

♦ Methods of forming class-interval
a) Exclusive method (overlapping)
In this method, the upper limit of one class interval is the lower limit of the next class. This method ensures continuity of the data.
Ex:

Marks      No. of students
20 – 30    5
30 – 40    15
40 – 50    25

A student whose mark is between 20 and 29.9 will be included in the 20 – 30 class. A better way of expressing this is

Marks                                                  No. of students
20 to less than 30 (20 or more but less than 30)       5
30 to less than 40                                     15
40 to less than 50                                     25
Total Students                                         50

b) Inclusive method (non-overlapping)
Ex:

Marks      No. of students
20 – 29    5
30 – 39    15
40 – 49    25

A student whose mark is 29 is included in the 20 – 29 class interval and a student whose mark is 39 is included in the 30 – 39 class interval.

♦ Class Frequency
The number of observations falling within a class interval is called its class frequency.
Ex: If the class frequency of 90 – 100 is 5, then 5 students scored between 90 and 100. Adding the frequencies of all the individual classes gives the total frequency, which represents the total number of items studied.

♦ Magnitude of class interval
The magnitude of the class interval depends on the range and the number of classes. The range is the difference between the highest and smallest values in the data series. A class interval is generally in multiples of 5, 10, 15 and 20. Sturges’ formula to find the number of classes is given below.

$K = 1 + 3.322 \log N$

K = No. of classes
log N = Logarithm (base 10) of the total no. of observations

Ex: If the total number of observations is 100, then the number of classes could be
K = 1 + 3.322 log 100 = 1 + 3.322 × 2 = 1 + 6.644 = 7.644 ≈ 8 (rounded off)
NOTE: Under this formula the number of classes cannot be less than 4 or greater than 20.

♦ Class mid point or class mark
The mid value or central value of the class interval is called the mid point.

$\text{Mid point of a class} = \frac{\text{lower limit of class} + \text{upper limit of class}}{2}$

♦ Sturges’ formula to find the size of class interval

$h = \frac{\text{Range}}{1 + 3.322 \log N}$

Ex: In a group of 50 workers, the highest wage is Rs. 250 and the lowest wage is Rs. 100 per day. Find the size of the interval.

$h = \frac{\text{Range}}{1 + 3.322 \log N} = \frac{250 - 100}{1 + 3.322 \log 50} = \frac{150}{6.644} = 22.58 \approx 23$
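Both of Sturges' formulas can be evaluated directly. The sketch below uses base-10 logarithms via `math.log10`; the function names are hypothetical and the inputs (100 observations; 50 workers with wages from Rs. 100 to Rs. 250) are taken from the examples in the text.

```python
import math


def sturges_classes(n):
    """Number of classes K = 1 + 3.322 log10(N), before rounding."""
    return 1 + 3.322 * math.log10(n)


def sturges_class_size(value_range, n):
    """Size of class interval h = Range / (1 + 3.322 log10(N))."""
    return value_range / sturges_classes(n)


# 100 observations.
k = sturges_classes(100)
num_classes = round(k)

# 50 workers, highest wage Rs. 250, lowest wage Rs. 100.
h = sturges_class_size(250 - 100, 50)

print(round(k, 3), num_classes)
print(round(h, 2))
```

With these inputs the class size comes out at roughly 22.6, so an interval of about 23 (or a rounder figure such as 25) would be a practical choice.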

Constructing a frequency distribution
The following guidelines may be considered for the construction of a frequency distribution.
a) The classes should be clearly defined and each observation must belong to one and only one class interval. Interval classes must be inclusive and non-overlapping.
b) The number of classes should be neither too large nor too small. Too few classes result in greater interval width with loss of accuracy; too many class intervals result in complexity.
c) All intervals should be of the same width. This is preferred for easy computations.

$\text{Width of interval} = \frac{\text{Range}}{\text{Number of classes}}$

d) Open-end classes should be avoided since they create difficulty in analysis and interpretation.
e) Intervals should be continuous throughout the distribution. This is important for a continuous distribution.
f) The lower limits of the class intervals should be simple multiples of the interval.

Ex: The weights of a sample of 30 students of a particular class are as follows. Construct a frequency distribution for the given data.

62  58  58  52  48  53  54  63  69  63
57  56  46  48  53  56  57  59  58  53
52  56  57  52  52  53  54  58  61  63

♦ Steps of construction
Step 1: Find the range of the data.
(H) Highest value = 69
(L) Lowest value = 46
Range = H – L = 69 – 46 = 23

Step 2: Find the number of class intervals using Sturges’ formula.
K = 1 + 3.322 log N = 1 + 3.322 log 30 = 5.91, say K = 6
∴ No. of classes = 6

Step 3: Find the width of the class interval.

$\text{Width of class interval} = \frac{\text{Range}}{\text{Number of classes}} = \frac{23}{6} = 3.83 \approx 4$

Step 4: Count the observations belonging to each class interval and assign this frequency to the corresponding class interval as follows.

Class interval   Tally bars   Frequency
46 – 50          |||          3
50 – 54          |||| |||     8
54 – 58          |||| |||     8
58 – 62          |||| |       6
62 – 66          ||||         4
66 – 70          |            1
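The four steps can be strung together as a short script. This is a sketch rather than a definitive procedure: it assumes classes start at the lowest observation, that each class includes its lower limit and excludes its upper limit, and that the width is rounded up to a whole number; all names are illustrative.

```python
import math

# Weights of the 30 students, from the example above.
weights = [62, 58, 58, 52, 48, 53, 54, 63, 69, 63,
           57, 56, 46, 48, 53, 56, 57, 59, 58, 53,
           52, 56, 57, 52, 52, 53, 54, 58, 61, 63]

# Step 1: range of the data.
value_range = max(weights) - min(weights)

# Step 2: number of classes by Sturges' formula, rounded off.
num_classes = round(1 + 3.322 * math.log10(len(weights)))

# Step 3: class width, rounded up to a whole number.
width = math.ceil(value_range / num_classes)

# Step 4: count the observations in each class (lower limit included,
# upper limit excluded).
table = {}
for k in range(num_classes):
    lower = min(weights) + k * width
    upper = lower + width
    table[(lower, upper)] = sum(1 for w in weights if lower <= w < upper)

for (lower, upper), freq in table.items():
    print(f"{lower} - {upper}  {'|' * freq}  {freq}")
```

The printed tally uses one stroke per observation instead of the bundled groups of five used in hand tallies.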

Cumulative frequency distribution
A cumulative frequency distribution indicates directly the number of units that lie above or below specified values of the class intervals. When the investigator is interested in the number of cases below a specified value, that value is taken as the upper limit of the class interval, and the result is known as a ‘less than’ cumulative frequency distribution. When the interest lies in the number of cases above a specified value, that value is taken as the lower limit of the class interval, and the result is known as a ‘more than’ cumulative frequency distribution. The cumulative frequency is obtained simply by summing up consecutive frequencies.
Ex:

Marks      No. of students   ‘Less than’ cumulative frequency
0 – 10     5                 5
10 – 20    3                 8
20 – 30    10                18
30 – 40    20                38
40 – 50    12                50

In the above ‘less than’ cumulative frequency distribution, 5 students scored less than 10, 8 less than 20, 18 less than 30 and so on. Similarly, the following table shows the ‘more than’ cumulative frequency distribution.
Ex:

Marks      No. of students   ‘More than’ cumulative frequency
0 – 10     5                 50
10 – 20    3                 45
20 – 30    10                42
30 – 40    20                32
40 – 50    12                12

In the above ‘more than’ cumulative frequency distribution, 50 students scored more than 0, 45 more than 10, 42 more than 20 and so on.

Diagrammatic and Graphic Representation
The data collected can be presented graphically or pictorially for easy understanding and quick interpretation. Diagrams and graphs give visual indications of magnitudes, groupings, trends and patterns in the data, so these features can be presented more simply in graphical form. Diagrams and graphs also help in comparing variables.

Diagrammatic presentation
A diagram is a visual form for the presentation of statistical data. The term covers various devices such as bars, circles, maps, pictorials and cartograms.

Importance of Diagrams
1. They are simple, attractive and easily understandable.
2. They give quick information.
3. They help to compare variables.
4. They are more suitable for illustrating discrete data.
5. They leave a more lasting impression in the reader’s mind.

Limitations of diagrams
1. Diagrams show only approximate values.
2. Diagrams are not suitable for further analysis.
3. Some diagrams (multidimensional ones) can be understood only by experts.
4. Details cannot be provided fully.
5. They are useful only for comparison.

General Rules for drawing the diagrams
i) Each diagram should have a suitable title, placed at the top or bottom, indicating the theme the diagram is intended to convey.
ii) The size of the diagram should emphasize the important characteristics of the data.
iii) An appropriate proportion should be maintained between the length and breadth of the diagram.
iv) A proper / suitable scale is to be adopted for the diagram.
v) Selection of an appropriate diagram is important; a wrong selection may mislead the reader.
vi) The source of data should be mentioned at the bottom.
vii) The diagram should be simple and attractive.
viii) The diagram should be effective rather than complex.

Some important types of diagrams
a) One-dimensional diagrams (line and bar)
b) Two-dimensional diagrams (rectangle, square, circle)
c) Three-dimensional diagrams (cube, sphere, cylinder etc.)
d) Pictogram
e) Cartogram

a) One-dimensional diagrams (line and bar)
In one-dimensional diagrams, only the length of the bars or lines is taken into account; the width of the bars is not considered. These diagrams are classified mainly as follows.
i) Line diagram
ii) Bar diagram
   - Vertical bar diagram
   - Horizontal bar diagram
   - Multiple (compound) bar diagram
   - Sub-divided (component) bar diagram
   - Percentage sub-divided bar diagram

i) Line diagram
This is the simplest type of one-dimensional diagram. The heights of the lines are drawn in proportion to the size of the figures, and the distance between lines is kept uniform. The limitations of this diagram are that it is not attractive and cannot convey more than one piece of information.
Ex: Draw the line diagram for the following data.

Year                                                      2001  2002  2003  2004  2005  2006
No. of students passed in first class with distinction       5     7    12     5    13    15

[Figure: line diagram. x-axis: Year (2001 to 2006); y-axis: No. of students passed in FCD; line heights 5, 7, 12, 5, 13, 15]

Interpretation of the diagram: The highest number of FCDs is in 2006 and the lowest in 2001 and 2004.

Simple bar diagram
A simple bar diagram can be drawn using horizontal or vertical bars. It is a very common diagram in business and economics.

Vertical bar diagram
Ex: The annual expenses of maintaining cars of various makes are given below. Draw the vertical bar diagram. The annual maintenance expenses include fuel, maintenance, repair, assistance and insurance.

| Type of car | Expense in Rs. / year |
|---|---|
| Maruthi Udyog | 47533 |
| Hyundai | 59230 |
| Tata Motors | 63270 |

Source: 2005 TNS TCS Study. Published at: Vijaya Karnataka, dated 03.08.2006

[Figure: vertical bar diagram of annual expense in Rs. for Maruthi Udyog (47533), Hyundai (59230) and Tata Motors (63270). Source: 2005 TNS TCS Study.]

Interpretation of the diagram
a) The annual expenses of the Maruthi Udyog brand are lower than those of the other brands depicted.
b) The Tata Motors brand has the highest annual expenses.

♦ Horizontal bar diagram
Ex: Production data for the world's top 10 steel makers are given below. Draw the horizontal bar diagram.

| Steel maker | Prodn. in million tonnes |
|---|---|
| Arcelor Mittal | 110 |
| Nippon | 32 |
| POSCO | 31 |
| JFE | 30 |
| BAO Steel | 24 |
| US Steel | 20 |
| NUCOR | 18 |
| RIVA | 18 |
| Thyssen-krupp | 17 |
| Tangshan | 16 |

Source: ISSB. Published by India Today

[Figure: horizontal bar diagram, "Top - 10 Steel Makers". X-axis: Production of Steel (Million Tonnes), 0 to 120; one bar per steel maker.]

♦ Compound bar diagram (Multiple bar diagram)
Multiple bar diagrams are used to convey more information than a simple bar diagram. A multiple bar diagram represents more than one phenomenon and is highly useful for direct comparison. The bars are drawn side by side, and different colours, shades or hatchings can be used to indicate each variable.

Ex: Draw the bar diagram for the following data. Resale value of the cars (Rs. '000) is as follows.

| Year (Model) | Santro | Zen | Wagonr |
|---|---|---|---|
| 2003 | 208 | 252 | 248 |
| 2004 | 240 | 278 | 274 |
| 2005 | 261 | 296 | 302 |

[Figure: multiple bar diagram of resale value (Rs. '000) by model of car (Santro, Zen, Wagonr) and model year.]

Source: True Value used car purchase data. Published by: Vijaya Karnataka, dated 03.08.2006

Ex: Represent the following in a suitable diagram.

| Class | A | B | C |
|---|---|---|---|
| Male | 1000 | 1500 | 1500 |
| Female | 500 | 800 | 1000 |
| Total | 1500 | 2300 | 2500 |

[Figure: sub-divided bar diagram. X-axis: Class; Y-axis: Population (in Nos.); each bar is divided into Male and Female.]

Ex: Draw a suitable diagram for the following data.

| Mode of investment | Investment in 2004 (Rs.) | %age | Investment in 2005 (Rs.) | %age |
|---|---|---|---|---|
| NSC | 25000 | 43.10 | 30000 | 45.45 |
| MIS | 15000 | 25.86 | 10000 | 15.15 |
| Mutual Fund | 15000 | 25.86 | 25000 | 37.88 |
| LIC | 3000 | 5.17 | 1000 | 1.52 |
| Total | 58000 | 100 | 66000 | 100 |

[Figure: percentage sub-divided bar diagram. X-axis: Year (2004, 2005); Y-axis: % of Investment; each bar is divided into NSC, MIS, Mutual Fund and LIC.]

Two-dimensional diagram
In a two-dimensional diagram both the breadth and the length are considered, because the area of the diagram represents the data. The important two-dimensional diagrams are
a) Rectangular diagram
b) Square diagram

Rectangular diagram
Rectangular diagrams are used to depict two or more variables. This diagram helps in direct comparison. The areas of the rectangles are kept in proportion to the values. It may be of two types.
i) Percentage sub-divided rectangular diagram
ii) Sub-divided rectangular diagram

In the former case, the widths of the rectangles are proportional to the values, the various components of the values are converted into percentages, and the rectangles are divided according to them. The latter is used to show related phenomena such as cost per unit, quality of production etc.

Ex: Draw the rectangular diagram for the following data.

Expenditure in Rs.

| Item of expenditure | Family A | Family B |
|---|---|---|
| Provisional stores | 1000 | 2000 |
| Education | 250 | 500 |
| Electricity | 300 | 700 |
| House rent | 1500 | 2800 |
| Vehicle fuel | 450 | 1000 |
| Total | 3500 | 7000 |

The total expenditure of each family is taken as 100 and the expenditure on individual items is expressed as a percentage of it. The widths of the two rectangles are in proportion to the total expenses of the two families, i.e. 3500 : 7000 or 1 : 2. The heights of the segments within each rectangle follow the percentages of expenses.

Monthly expenditure

| Item of expenditure | Family A (Rs. 3500): Rs. | Family A: %age | Family B (Rs. 7000): Rs. | Family B: %age |
|---|---|---|---|---|
| Provisional stores | 1000 | 28.57 | 2000 | 28.57 |
| Education | 250 | 7.14 | 500 | 7.14 |
| Electricity | 300 | 8.57 | 700 | 10.00 |
| House rent | 1500 | 42.86 | 2800 | 40.00 |
| Vehicle fuel | 450 | 12.86 | 1000 | 14.29 |
| Total | 3500 | 100 | 7000 | 100 |

[Figure: percentage sub-divided rectangular diagram. X-axis: Family (A, B), with widths in the ratio 1 : 2; Y-axis: % of Expenditure (0 to 100); segments: Provisional Stores, Education, Electricity, House Rent, Vehicle Fuel.]

Square diagram
To draw square diagrams, the square root of the value of each item is taken, and the side of each square is made proportional to that square root. A suitable scale may be used, and the ratios between the sides must be maintained.

Ex: Draw the square diagram for the following data: 4900, 2500, 1600.

Solution: The square roots of the items are 70, 50 and 40. Dividing each by 10 gives sides of 7, 5 and 4 units.

[Figure: square diagram with squares of side 7, 5 and 4 units representing 4900, 2500 and 1600.]

Pie diagram
A pie diagram shows the partitioning of a total into its component parts. It is used to show classes or groups of data in proportion to the whole data set. The entire pie represents all the data, while each slice represents a different class or group within the whole. The following illustration shows the construction of a pie diagram.

Ex: Revenue collections by the government from petroleum products for the year 2005-2006, in Rs. crore, are as follows. Draw the pie diagram.

| Source | Revenue (Rs. crore) |
|---|---|
| Customs | 9600 |
| Excise | 49300 |
| Corporate tax and dividend | 18900 |
| States' taking | 48800 |
| Total | 126600 |

Solution:

| Item / Source | Value in crores | Angle of circle | %age |
|---|---|---|---|
| Customs | 9600 | (9600 / 126600) x 360 = 27.3° | 7.58 |
| Excise | 49300 | (49300 / 126600) x 360 = 140.2° | 39.00 |
| Corporate tax and dividend | 18900 | (18900 / 126600) x 360 = 53.7° | 14.92 |
| States' taking | 48800 | (48800 / 126600) x 360 = 138.8° | 38.50 |
| Total | 126600 | 360° | 100 |

[Figure: pie diagram with slices for Customs (7.58%), Excise (39%), Corporate Tax and Dividend (14.92%) and States' taking (38.5%). Source: India Today, 19 June 2006]
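The angle calculation used for the pie diagram can be sketched in a few lines of Python. This is only an illustration of the rule "angle = (value / total) x 360"; the function name `pie_slices` is an assumption made for this sketch and does not come from the notes.

```python
def pie_slices(values):
    """Return {label: (angle_in_degrees, percentage)} for a dict of values."""
    total = sum(values.values())
    return {label: (v * 360 / total, v * 100 / total)
            for label, v in values.items()}

# Revenue from petroleum products, Rs. crore (data from the example above).
revenue = {
    "Customs": 9600,
    "Excise": 49300,
    "Corporate tax and dividend": 18900,
    "States' taking": 48800,
}

slices = pie_slices(revenue)
for label, (angle, pct) in slices.items():
    print(f"{label}: {angle:.1f} degrees, {pct:.2f}%")
```

The angles always add up to 360 degrees, which is a quick check on the hand calculation.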

Choice or selection of diagram
There are many methods of depicting statistical data through diagrams. No single diagram is suited to all purposes. Choosing a diagram to suit a given set of data requires skill, knowledge and experience. Primarily, the choice depends upon the nature of the data and the purpose of the presentation. The nature of the data helps in deciding between a one-dimensional, two-dimensional or three-dimensional diagram. It is also necessary to know the audience for whom the diagram is intended. The following points should be kept in mind when choosing a diagram.
1. For the common man, who has little knowledge of statistics, cartograms and pictograms are suitable.
2. To present the components in addition to the magnitude of values, a sub-divided bar diagram can be used.
3. When a large number of components are to be shown, a pie diagram is suitable.

Graphic presentation
A graphic presentation is a visual form of presentation. Graphs are drawn on a special type of paper known as graph paper. Common graphic representations are
a) Histogram
b) Frequency polygon
c) Cumulative frequency curve (ogive)

Advantages of graphic presentation
1. It provides an attractive and impressive view.
2. It simplifies complex data.
3. It helps in direct comparison.
4. It helps in further statistical analysis.
5. It is the simplest method of presenting data.
6. It shows the trend and pattern of the data.

Difference between graph and diagram

| Diagram | Graph |
|---|---|
| 1. Ordinary paper can be used | 1. Graph paper is required |
| 2. It is attractive and easily understandable | 2. Needs some effort to understand |
| 3. It is appropriate and effective for measuring more variables | 3. It creates problems when measuring more variables |
| 4. It can’t be used for further analysis | 4. Can be used for further analysis |
| 5. It gives comparison | 5. It shows the relationship between variables |
| 6. Data are represented by bars and rectangles | 6. Points and lines are used to represent data |

Frequency Histogram
In this type of representation the given data are plotted as a series of rectangles. Class intervals are marked along the x-axis and frequencies along the y-axis, according to a suitable scale. Unlike the bar chart, which is one-dimensional, a histogram is two-dimensional: both the length and the width are important. A histogram is constructed from a frequency distribution of grouped data, where the height of each rectangle is proportional to the respective frequency and the width represents the class interval. The rectangles adjoin one another; a blank space between rectangles means that the class is empty, i.e. there are no values in that class interval.

Ex: Construct a histogram for the following data. For convenience, the frequency distribution is presented along with the mid-point of each class interval, where the mid-point is the average of the lower and upper boundaries of the class interval.

| Marks obtained (x) | No. of students (f) | Mid point |
|---|---|---|
| 15 – 25 | 5 | 20 |
| 25 – 35 | 3 | 30 |
| 35 – 45 | 7 | 40 |
| 45 – 55 | 5 | 50 |
| 55 – 65 | 3 | 60 |
| 65 – 75 | 7 | 70 |
| Total | 30 | |

[Figure: histogram. X-axis: Class Interval (Marks), 15 to 75; Y-axis: Frequency (No. of students), 0 to 7.]

Frequency polygon
A frequency polygon is a line chart of a frequency distribution in which either the values of a discrete variable or the mid-points of class intervals are plotted against the frequencies, and the plotted points are joined by straight lines. Since the frequencies do not start or end at zero, the diagram as such would not touch the horizontal axis. However, the area under the entire curve must be the same as that of the histogram (100%), so the curve must be ‘enclosed’: the first mid-point is joined to a ‘fictitious’ preceding mid-point whose frequency is zero, so that the curve begins on the horizontal axis, and the last mid-point is joined to a ‘fictitious’ succeeding mid-point whose frequency is also zero, so that the curve ends on the horizontal axis. This enclosed diagram is known as a ‘frequency polygon’.

Ex: Construct the frequency polygon for the following data.

| Marks (CI) | No. of frequencies (f) | Mid-point |
|---|---|---|
| 15 – 25 | 5 | 20 |
| 25 – 35 | 3 | 30 |
| 35 – 45 | 7 | 40 |
| 45 – 55 | 5 | 50 |
| 55 – 65 | 3 | 60 |
| 65 – 75 | 7 | 70 |

[Figure: frequency polygon. X-axis: Mid point (x), 0 to 100; Y-axis: Frequency, 0 to 10.]

Cumulative frequency curve (ogive)
Ogives are the graphic representations of a cumulative frequency distribution. They are classified as ‘less than’ and ‘more than’ ogives. In a ‘less than’ ogive, cumulative frequencies are plotted against the upper boundaries of their respective class intervals. In a ‘more than’ ogive, cumulative frequencies are plotted against the lower boundaries of their respective class intervals. Ogives are used for comparison purposes: several ogives can be drawn on the same grid in different colours for easier visualisation and differentiation.

Ex:

| Marks (CI) | No. of frequencies (f) | Mid-point | Cum. Freq. Less than | Cum. Freq. More than |
|---|---|---|---|---|
| 15 – 25 | 5 | 20 | 5 | 30 |
| 25 – 35 | 3 | 30 | 8 | 25 |
| 35 – 45 | 7 | 40 | 15 | 22 |
| 45 – 55 | 5 | 50 | 20 | 15 |
| 55 – 65 | 3 | 60 | 23 | 10 |
| 65 – 75 | 7 | 70 | 30 | 7 |

[Figure: ‘Less than’ ogive. X-axis: Upper Boundary (CI); Y-axis: Less than Cumulative Frequency.]

[Figure: ‘More than’ ogive. X-axis: Lower Boundary (CI); Y-axis: More than Cumulative Frequency.]
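The two cumulative frequency columns of the ogive example can be reproduced with a short Python sketch. The variable names are chosen for this illustration only; the class intervals and frequencies are those of the marks example.

```python
from itertools import accumulate

# Class intervals (lower, upper) and frequencies from the marks example.
classes = [(15, 25), (25, 35), (35, 45), (45, 55), (55, 65), (65, 75)]
freqs = [5, 3, 7, 5, 3, 7]

total = sum(freqs)

# 'Less than' cumulative frequency: running total up to each upper boundary.
less_than = list(accumulate(freqs))

# 'More than' cumulative frequency: items at or above each lower boundary.
more_than = [total - c for c in [0] + less_than[:-1]]

# Points to plot: 'less than' against upper boundaries,
# 'more than' against lower boundaries.
less_than_points = [(hi, cf) for (_, hi), cf in zip(classes, less_than)]
more_than_points = [(lo, cf) for (lo, _), cf in zip(classes, more_than)]

print(less_than_points)
print(more_than_points)
```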

Session – 4 Measures of Central Tendency

A classified set of statistical data may be described as distributed around some value, called the central value or average in some sense, which is the most representative value of the entire data. Thus, one of the most important objectives of statistical analysis is to determine a single value that represents the characteristics of the entire raw data. This single value is called the ‘central value’ or ‘average’. It is the point around which all other values of the data cluster; therefore it is known as a measure of location, and since it is located at the central point nearest to the other values of the data, it is also called a measure of central tendency. Different methods give different central values, and these are referred to as the measures of central tendency. The common measures of central tendency are a) Mean b) Median c) Mode. These values are very useful not only in presenting an overall picture of the entire data, but also for making comparisons among two or more sets of data.

Average
Definition
Average is a value which is typical or representative of a set of data. - Murray R. Spiegel
Average is an attempt to find one single figure to describe whole of figures. - Clark & Sekkade
From the above definitions it is clear that an average is a typical value of the entire data and is a measure of central tendency.

Functions of an average
• It represents complex or large data.
• It facilitates comparative study of two variables.
• It helps to study a population from sample data.
• It helps in decision making.
• It represents a series of data by a single value.
• It helps to establish mathematical relationships.

Characteristics of a typical average
• It should be rigidly defined and easily understandable.
• It should be simple to compute and expressible as a mathematical formula.
• It should be based on all the items in the data.
• It should not be unduly influenced by any single item.
• It should be capable of further mathematical treatment.
• It should have sampling stability.

Types of average
Averages, or measures of central tendency, are of the following types.
1. Mathematical averages
   a. Arithmetic mean
      i. Simple mean
      ii. Weighted mean
   b. Geometric mean
   c. Harmonic mean
2. Positional averages
   a. Median
   b. Mode

Arithmetic mean
The arithmetic mean, also called the arithmetic average, is the most commonly used measure of central tendency. The arithmetic average of a series is the value obtained by dividing the total of the values of the items by their number. Arithmetic averages are of two types.
a. Simple arithmetic average
b. Weighted arithmetic average

Simple arithmetic average (Mean)
The simple arithmetic mean is often referred to simply as the ‘mean’. Ex: mean income, mean expenses, mean marks etc. Unlike other averages, the mean has to be computed by considering each and every observation in the series. Hence, the mean cannot be found by inspection or observation of the items. The simple arithmetic mean is equal to the sum of the values of the variable divided by the number of observations in the sample.

Let $x_i$ be a variable which takes the values $x_1, x_2, x_3, \dots, x_n$ over n items. Then the arithmetic mean (or simply the mean) of x, denoted by a bar over the variable x, is given by

$$\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} = \frac{\sum x_i}{n}$$

where Σ (the Greek capital letter sigma) denotes the summation of all $x_i$ values. For individual observations, the arithmetic mean can be computed by the following two methods.
a. Direct method
b. Short-cut method
The direct method uses the above equation; the steps of the short-cut method are illustrated in the subsequent topic.

Ex: (For Direct Method)
1. Calculate the mean for the following data. Marks obtained by 6 students are given below: 20, 15, 23, 22, 25, 20.

$$\text{Mean marks } \bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{20 + 15 + 23 + 22 + 25 + 20}{6} = \frac{125}{6} = 20.83$$

2. The monthly income of a departmental store for six months is given below. Find the mean income of the store.

| Month | Jan | Feb | Mar | Apr | May | June |
|---|---|---|---|---|---|---|
| Income (Rs.) | 25000 | 30000 | 45000 | 20000 | 25000 | 20000 |

n = Total No. of items (observations) = 6
Total income = Σ$x_i$ = 25000 + 30000 + 45000 + 20000 + 25000 + 20000 = 165000

$$\text{Mean income} = \frac{\sum x_i}{n} = \frac{165000}{6} = \text{Rs. } 27500$$
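As a quick check of the direct method, the first worked example (marks of six students) can be computed in Python. The helper name `mean_direct` is an assumption made for this sketch.

```python
def mean_direct(values):
    """Arithmetic mean by the direct method: sum of values / number of values."""
    return sum(values) / len(values)

# Marks of the six students from the first worked example.
marks = [20, 15, 23, 22, 25, 20]
mean_marks = mean_direct(marks)
print(round(mean_marks, 2))  # 20.83
```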

The above example shows that when the data set is large, or the figures in it are large, the computation required to obtain the mean is heavy. The short-cut method reduces this computation and is illustrated below.

Short-cut method
The steps of this method are given below.
Step 1: Assume any one value as the mean; this is called the arbitrary average (A).
Step 2: Find the deviation of each value from the arbitrary average: d = $x_i$ – A.
Step 3: Add all the deviations to get Σd.
Step 4: Use the following equation to compute the mean value.

$$\bar{x} = A + \frac{\sum d}{n}$$

where n = total number of observations, Σd = sum of the deviations, A = arbitrary mean.

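Example: Find the mean marks obtained by the students from the following data.

20, 25, 20, 22, 20, 21, 23, 25, 22, 18

Let A = 20 and n = 10.

| Marks ($x_i$) | d = ($x_i$ – 20) |
|---|---|
| 20 | 0 |
| 25 | 5 |
| 20 | 0 |
| 22 | 2 |
| 20 | 0 |
| 21 | 1 |
| 23 | 3 |
| 25 | 5 |
| 22 | 2 |
| 18 | -2 |
| Total | Σd = 16 |

$$\bar{x} = A + \frac{\sum d}{n} = 20 + \frac{16}{10} = 20 + 1.6 = 21.6$$

Mean marks = 21.6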
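The short-cut method can be sketched in Python as follows. The function name `mean_shortcut` is an assumption for this illustration; the data are the ten marks of the example. The sketch also shows that the result does not depend on which arbitrary average A is chosen.

```python
def mean_shortcut(values, assumed):
    """Mean by the short-cut method: A + (sum of deviations from A) / n."""
    deviations = [x - assumed for x in values]
    return assumed + sum(deviations) / len(values)

marks = [20, 25, 20, 22, 20, 21, 23, 25, 22, 18]

mean_a20 = mean_shortcut(marks, 20)   # A = 20, as in the worked example
mean_a25 = mean_shortcut(marks, 25)   # a different arbitrary average
print(mean_a20, mean_a25)
```

Choosing A close to the true mean only makes the deviations smaller and the hand arithmetic easier; the final answer is unchanged.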

1. Mathematical characteristics of mean
a. The algebraic sum of the deviations of all observations from their arithmetic mean is zero, i.e. $\sum (x_i - \bar{x}) = 0$.
b. The sum of squared deviations of the items from the mean is a minimum, that is, it is less than the sum of squared deviations of the items from any other value: $\sum (x_i - \bar{x})^2$ = minimum.
c. Since $\bar{x} = \frac{\sum x}{n}$, if any two of the three quantities $\bar{x}$, Σx and n are given, the third can be computed.
d. If all the items of a set are increased or decreased by a constant value, the arithmetic mean also increases or decreases by the same constant.
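The characteristics listed above can be checked numerically. The following Python sketch uses a small illustrative data set (ten marks) and a helper `sum_sq_dev` introduced only for this example; it demonstrates the properties for one data set and is not a proof.

```python
values = [20, 25, 20, 22, 20, 21, 23, 25, 22, 18]
n = len(values)
mean = sum(values) / n

def sum_sq_dev(data, c):
    """Sum of squared deviations of the data from a chosen value c."""
    return sum((x - c) ** 2 for x in data)

# a. Deviations from the mean sum to zero (up to rounding error).
sum_dev = sum(x - mean for x in values)

# b. Squared deviations are smallest when taken about the mean.
about_mean = sum_sq_dev(values, mean)
about_other = sum_sq_dev(values, mean + 1)

# c. Knowing the mean and n recovers the total.
total_from_mean = mean * n

# d. Adding a constant to every item shifts the mean by that constant.
shifted_mean = sum(x + 5 for x in values) / n

print(sum_dev, about_mean, about_other, total_from_mean, shifted_mean)
```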

2. Weighted arithmetic mean
The weighted mean is computed by considering the relative importance of each value. The simple arithmetic mean gives equal importance to all the items of a distribution, but in certain cases the relative importance of the items is not the same. Weights are then assigned to the values, each weight representing the relative importance of its item. Let $x_1, x_2, x_3, \dots, x_n$ be the values and $w_1, w_2, w_3, \dots, w_n$ the respective weights assigned. Then the weighted mean $\bar{x}_w$ is given by

$$\bar{x}_w = \frac{x_1 w_1 + x_2 w_2 + x_3 w_3 + \dots + x_n w_n}{w_1 + w_2 + w_3 + \dots + w_n} = \frac{\sum xw}{\sum w}$$

i.e., the weighted average is the ratio of the sum of the products of the values and their respective weights to the sum of the weights.

Ex: Compute the simple and weighted arithmetic means and comment on them.

| Designation | Monthly salary in Rs. (x) | Strength of cadre (w) | xw |
|---|---|---|---|
| General Manager | 25000 | 10 | 250000 |
| Managers | 19000 | 20 | 380000 |
| Supervisors | 14000 | 10 | 140000 |
| Office Assistant | 10000 | 50 | 500000 |
| Helpers | 8000 | 25 | 200000 |
| Total (N = 5) | Σx = 76000 | Σw = 115 | Σxw = 1470000 |

a. Simple arithmetic mean $= \frac{\sum x}{N} = \frac{76000}{5} = \text{Rs. } 15200$

b. Weighted arithmetic mean $= \frac{\sum xw}{\sum w} = \frac{1470000}{115} = \text{Rs. } 12782.6$

In this example, the simple arithmetic mean does not take account of the number of staff in each salary range; every designation is given equal importance. The salaries of the General Manager and the Managers have inflated the simple mean. The weighted mean gives importance to the number of persons in each salary range.

Ex: Comment on the performance of the students of the two universities given below.

| Course | Bombay: % of pass (x) | Bombay: No. of students in '000 (w) | Bombay: wx | Madras: % of pass (x) | Madras: No. of students in '000 (w) | Madras: wx |
|---|---|---|---|---|---|---|
| MBA | 71 | 3 | 213 | 81 | 5 | 405 |
| MCA | 83 | 2 | 166 | 76 | 3 | 228 |
| MA | 73 | 5 | 365 | 58 | 3 | 174 |
| M.Sc. | 75 | 2 | 150 | 76 | 1 | 76 |
| M.Com. | 70 | 2 | 140 | 81 | 2 | 162 |
| Total (Σ) | Σx = 372 | Σw = 14 | Σwx = 1034 | Σx = 372 | Σw = 14 | Σwx = 1045 |

a. Since Σx is the same, the simple arithmetic average is the same for both universities: $\frac{\sum x}{N} = \frac{372}{5} = 74.4$

b. Weighted mean for Bombay University $= \frac{\sum wx}{\sum w} = \frac{1034}{14} = 73.86$

c. Weighted mean for Madras University $= \frac{\sum wx}{\sum w} = \frac{1045}{14} = 74.64$

Comment: On the weighted mean, the performance of Madras University students is better than that of Bombay University students.
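The weighted mean calculations above can be sketched in Python. The function name `weighted_mean` is an assumption made for this illustration; the salary and university figures are those of the two worked examples.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(x * w) / sum(w)."""
    return sum(x * w for x, w in zip(values, weights)) / sum(weights)

# Salary example: monthly salary (x) weighted by cadre strength (w).
salaries = [25000, 19000, 14000, 10000, 8000]
strength = [10, 20, 10, 50, 25]
salary_simple = sum(salaries) / len(salaries)
salary_weighted = weighted_mean(salaries, strength)

# University example: pass percentage (x) weighted by students in '000 (w).
bombay = weighted_mean([71, 83, 73, 75, 70], [3, 2, 5, 2, 2])
madras = weighted_mean([81, 76, 58, 76, 81], [5, 3, 3, 1, 2])

print(salary_simple, round(salary_weighted, 1))
print(round(bombay, 2), round(madras, 2))
```

With equal weights the function returns the simple mean, which is one way to see that the simple mean is a special case of the weighted mean.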

Discrete Series
In a discrete series, each value is multiplied by its frequency and the products are added to give the total Σfx; this total is divided by the total frequency to obtain the arithmetic mean. This can be done by either the direct method or the short-cut method.

Ex: Calculate the mean for the following data.

| Value (x) | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Frequency (f) | 10 | 15 | 10 | 9 | 5 |

Steps:
1. Multiply each value by its frequency and add the products to get Σfx.
2. Add all the frequencies (Σf = N).
3. Use the formula $\bar{x} = \frac{\sum fx}{\sum f} = \frac{\sum fx}{N}$ to get the mean value.

Solution: By direct method

| Value (x) | Frequency (f) | fx |
|---|---|---|
| 1 | 10 | 10 |
| 2 | 15 | 30 |
| 3 | 10 | 30 |
| 4 | 9 | 36 |
| 5 | 5 | 25 |
| Total | Σf = 49 | Σfx = 131 |

$$\bar{x} = \frac{\sum fx}{N} = \frac{131}{49} = 2.67$$

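By short-cut method
Let A = 3 (assumed mean = 3).

| Value (x) | Frequency (f) | d = (x – A) | fd |
|---|---|---|---|
| 1 | 10 | -2 | -20 |
| 2 | 15 | -1 | -15 |
| 3 | 10 | 0 | 0 |
| 4 | 9 | 1 | 9 |
| 5 | 5 | 2 | 10 |
| Total | Σf = 49 | | Σfd = -16 |

$$\bar{x} = A + \frac{\sum fd}{N} = 3 + \left(\frac{-16}{49}\right) = 2.67$$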
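Both methods for a discrete series can be sketched in Python. The function names are assumptions for this illustration; the values and frequencies are those of the worked example.

```python
def mean_discrete(values, freqs):
    """Direct method for a discrete series: sum(f * x) / sum(f)."""
    return sum(f * x for x, f in zip(values, freqs)) / sum(freqs)

def mean_discrete_shortcut(values, freqs, assumed):
    """Short-cut method: A + sum(f * (x - A)) / sum(f)."""
    n = sum(freqs)
    return assumed + sum(f * (x - assumed) for x, f in zip(values, freqs)) / n

values = [1, 2, 3, 4, 5]
freqs = [10, 15, 10, 9, 5]

direct = mean_discrete(values, freqs)
shortcut = mean_discrete_shortcut(values, freqs, assumed=3)
print(round(direct, 2), round(shortcut, 2))
```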

Continuous series
In a continuous frequency distribution, the individual value of each item is not known, so the mid-point of each class interval is taken to represent the class. In a continuous series the mean can be calculated by any of the following methods.
a. Direct method
b. Short-cut method
c. Step deviation method

a. Direct method
The steps of this method are as follows.
1. Find the mid value of each class. Ex: For the class interval 20-30, the mid value is $\frac{20 + 30}{2} = \frac{50}{2} = 25$. The mid value is denoted by ‘m’.
2. Multiply the mid value ‘m’ by the frequency ‘f’ of each class and sum up to get Σfm.
3. Use the formula $\bar{x} = \frac{\sum fm}{N}$, where N = Σf, to get the mean value.

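Ex: Compute the mean for the following data.

| Age group (CI) | No. of persons (f) | Mid point (m) | fm |
|---|---|---|---|
| 0 – 10 | 5 | 5 | 25 |
| 10 – 20 | 15 | 15 | 225 |
| 20 – 30 | 25 | 25 | 625 |
| 30 – 40 | 8 | 35 | 280 |
| 40 – 50 | 7 | 45 | 315 |
| Total | Σf = 60 = N | | Σfm = 1470 |

$$\text{Mean age } \bar{x} = \frac{\sum fm}{\sum f} = \frac{\sum fm}{N} = \frac{1470}{60} = 24.5$$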

b. Short-cut method
The steps of this method are described below.
1. Find the mid value ‘m’ of each class.
2. Assume any one of the mid values as the arbitrary average (A).
3. Find the deviation d = m – A of each mid value from A, and multiply each deviation ‘d’ by the frequency ‘f’ to get fd.
4. Using the formula $\bar{x} = A + \frac{\sum fd}{N}$, find the mean value.

Ex: Find the mean age of the patients visiting a hospital on a particular day using the following data.

| Age group (CI) | No. of patients (f) | Mid value (m) | d = (m – 25) | fd |
|---|---|---|---|---|
| 0 – 10 | 5 | 5 | -20 | -100 |
| 10 – 20 | 15 | 15 | -10 | -150 |
| 20 – 30 | 25 | 25 | 0 | 0 |
| 30 – 40 | 8 | 35 | 10 | 80 |
| 40 – 50 | 7 | 45 | 20 | 140 |
| Total | Σf = 60 = N | | | Σfd = –30 |

Let the arbitrary average A = 25.

$$\text{Mean age } \bar{x} = A + \frac{\sum fd}{N} = 25 + \left(\frac{-30}{60}\right) = 25 - \frac{1}{2} = 24.5$$

c. Step deviation method
In this method, after the deviations from the arbitrary mean have been found, each deviation is divided by a common factor. Scaling down the deviations by a ‘step’ reduces the calculation to a minimum. The procedure is described below.

Steps of step deviation method
1. Find the mid value ‘m’ of each class.
2. Select the arbitrary mean ‘A’.
3. Find the deviation d = m – A of each mid value from ‘A’.
4. Divide each deviation ‘d’ by a common factor C to get d' = d / C.
5. Multiply d' of each class by the frequency ‘f’ to get fd', and sum over all classes to get Σfd'.
6. Using the formula $\bar{x} = A + \frac{\sum fd'}{N} \times C$ (where C is the common factor), calculate the mean value.

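Ex: Find the mean age for the following data.

| Age (CI) | No. of persons (f) | Mid value (m) | d = m – A (A = 25) | d' = d / 10 | fd' |
|---|---|---|---|---|---|
| 0 – 10 | 5 | 5 | -20 | -2 | -10 |
| 10 – 20 | 15 | 15 | -10 | -1 | -15 |
| 20 – 30 | 25 | 25 | 0 | 0 | 0 |
| 30 – 40 | 8 | 35 | 10 | 1 | 8 |
| 40 – 50 | 7 | 45 | 20 | 2 | 14 |
| Total | Σf = 60 = N | | | | Σfd' = -3 |

Let A = 25 and C = 10.

$$\bar{x} = A + \frac{\sum fd'}{N} \times C = 25 + \frac{(-3)}{60} \times 10 = 25 - \frac{1}{2} = 24.5$$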
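The three methods for a continuous series give the same mean, which the following Python sketch illustrates for the age-group data. The function names are assumptions introduced for this example.

```python
def midpoints(classes):
    """Mid value of each class interval given as (lower, upper)."""
    return [(lo + hi) / 2 for lo, hi in classes]

def mean_grouped_direct(classes, freqs):
    """Direct method: sum(f * m) / N."""
    mids = midpoints(classes)
    return sum(f * m for f, m in zip(freqs, mids)) / sum(freqs)

def mean_step_deviation(classes, freqs, assumed, step):
    """Step deviation method: A + (sum(f * d') / N) * C, with d' = (m - A) / C."""
    mids = midpoints(classes)
    d_prime = [(m - assumed) / step for m in mids]
    n = sum(freqs)
    return assumed + sum(f * d for f, d in zip(freqs, d_prime)) / n * step

classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
freqs = [5, 15, 25, 8, 7]

direct = mean_grouped_direct(classes, freqs)
step_dev = mean_step_deviation(classes, freqs, assumed=25, step=10)
print(direct, step_dev)
```

Setting `step=1` in `mean_step_deviation` reduces it to the short-cut method.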

Session – 5 Measures of Central Tendency

Combined Mean
The combined arithmetic mean can be computed if we know the mean and the number of items in each group of the data. Let $\bar{x}_1$ and $\bar{x}_2$ be the means of the first and second groups of data, containing $N_1$ and $N_2$ items respectively. Then the combined mean is

$$\bar{x}_{12} = \frac{N_1 \bar{x}_1 + N_2 \bar{x}_2}{N_1 + N_2}$$

If there are 3 groups, then

$$\bar{x}_{123} = \frac{N_1 \bar{x}_1 + N_2 \bar{x}_2 + N_3 \bar{x}_3}{N_1 + N_2 + N_3}$$
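The combined mean formula extends to any number of groups. A minimal Python sketch follows; the function name `combined_mean` and the representation of each group as a (size, mean) pair are assumptions made for this illustration.

```python
def combined_mean(groups):
    """Combined mean of several groups, each given as (size, mean)."""
    total_items = sum(n for n, _ in groups)
    return sum(n * m for n, m in groups) / total_items

# Two groups of workers: 1000 with mean wage 75 and 1500 with mean wage 60.
wages = combined_mean([(1000, 75), (1500, 60)])

# Three medical examinations: (number examined, mean weight in pounds).
weights = combined_mean([(50, 113), (60, 120), (90, 115)])

print(wages, weights)
```

The combined mean is simply a weighted mean of the group means, with the group sizes as weights.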

Ex - 1: Find the mean wage for the entire group of workers from the following data.

| | Group – 1 | Group – 2 |
|---|---|---|
| Mean wages (Rs.) | 75 | 60 |
| No. of workers | 1000 | 1500 |

Given data: $N_1$ = 1000, $N_2$ = 1500, $\bar{x}_1$ = 75 and $\bar{x}_2$ = 60.

$$\text{Group mean } \bar{x}_{12} = \frac{N_1 \bar{x}_1 + N_2 \bar{x}_2}{N_1 + N_2} = \frac{1000 \times 75 + 1500 \times 60}{1000 + 1500} = \frac{165000}{2500} = \text{Rs. } 66$$

Ex - 2: Compute the mean weight for the entire group.

| Medical examination | No. examined | Mean weight (pounds) |
|---|---|---|
| A | 50 | 113 |
| B | 60 | 120 |
| C | 90 | 115 |

$$\text{Combined mean } \bar{x}_{123} = \frac{N_1 \bar{x}_1 + N_2 \bar{x}_2 + N_3 \bar{x}_3}{N_1 + N_2 + N_3} = \frac{50 \times 113 + 60 \times 120 + 90 \times 115}{50 + 60 + 90} = \frac{23200}{200} = 116 \text{ pounds}$$

Merits of Arithmetic Mean
1. It is simple and easy to compute.
2. It is rigidly defined.
3. It can be used for further calculation.
4. It is based on all observations in the series.
5. It helps in direct comparison.
6. It is a more stable measure of central tendency (ideal average).

Limitations / Demerits of Mean
1. It is unduly affected by extreme items.
2. It is sometimes unrealistic.
3. It may lead to confusion.
4. It is suitable only for quantitative data (for variables).
5. It cannot be located by the graphical method or by observation.

Geometric Mean (GM)
The GM is the nth root of the product of the n quantities of the series. It is obtained by multiplying the values of the items together and extracting the root of the product corresponding to the number of items. Thus, the square root of the product of two items and the cube root of the product of three items are their geometric means. The geometric mean is never larger than the arithmetic mean. If there are zeros or negative numbers in the series, the geometric mean cannot be used. Logarithms can be used to find the geometric mean, to avoid handling large numbers and to save time.

In the field of business management, problems often arise relating to the average percentage rate of change over a period of time. In such cases the arithmetic mean is not an appropriate average, and the geometric mean is used instead. GMs are also highly useful in the construction of index numbers.

$$\text{GM} = \sqrt[n]{x_1 \times x_2 \times \dots \times x_n}$$

When the number of items in the series is larger than 3, computing the GM directly is difficult. To overcome this, the logarithm of each value is obtained. The logs of all the values are added up and divided by the number of items. The antilog of the quotient obtained is the required GM.

$$\text{GM} = \text{Antilog}\left(\frac{\log x_1 + \log x_2 + \dots + \log x_n}{n}\right)$$

Merits of GM
a. It is based on all the observations in the series.
b. It is rigidly defined.
c. It is best suited for averaging ratios and percentages.
d. It is less affected by extreme values.
e. It is useful for studying social and economic data.

Demerits of GM
a. It is not simple to understand.
b. It requires computational skill.
c. GM cannot be computed if any item is zero or negative.
d. It has restricted application.

Ex - 1:
a. Find the GM of the data 2, 4, 8.
$x_1$ = 2, $x_2$ = 4, $x_3$ = 8, n = 3

$$\text{GM} = \sqrt[3]{x_1 \times x_2 \times x_3} = \sqrt[3]{2 \times 4 \times 8} = \sqrt[3]{64} = 4$$

b. Find the GM of the data 2, 4, 8 using logarithms.
Data: $x_1$ = 2, $x_2$ = 4, $x_3$ = 8, N = 3

| x | log x |
|---|---|
| 2 | 0.301 |
| 4 | 0.602 |
| 8 | 0.903 |
| Total | Σ log x = 1.806 |

$$\text{GM} = \text{Antilog}\left(\frac{\sum \log x}{N}\right) = \text{Antilog}\left(\frac{1.806}{3}\right) = \text{Antilog}(0.6020) = 3.9997 \cong 4$$

Ex - 2: Compared with the previous year, the overhead (OH) expenses went up by 32% in 2003, by a further 40% in the next year and by 50% in the following year. Calculate the average increase in overhead expenses.
Let the OH expenses of the base year be 100%.

| Year | OH Expenses (x) | log x |
|---|---|---|
| 2002 | Base year | |
| 2003 | 132 | 2.121 |
| 2004 | 140 | 2.146 |
| 2005 | 150 | 2.176 |
| Total | | Σ log x = 6.443 |

$$\text{GM} = \text{Antilog}\left(\frac{\sum \log x}{N}\right) = \text{Antilog}\left(\frac{6.443}{3}\right) = \text{Antilog}(2.1477) \approx 140.5$$

The average increase in overhead expenses is therefore about 40.5% per year.

GM for discrete series
With the usual notation, the GM for a discrete series is given by:

$$\text{GM} = \text{Antilog}\left(\frac{\sum_{i=1}^{n} \log x_i}{N}\right)$$

Ex - 3: Consider the following time series of monthly sales of ABC company for 4 months. Find the average rate of change in monthly sales.

| Month | Sales (Rs) |
|---|---|
| I | 10000 |
| II | 8000 |
| III | 12000 |
| IV | 15000 |

Solution: Let the sales of each preceding month be 100%.

| Month | Sales (Rs) | Increase / decrease %ge | Conversion (x) | log x |
|---|---|---|---|---|
| I | 10000 | Base (100%) | | |
| II | 8000 | – 20% | 80 | 1.903 |
| III | 12000 | + 50% | 150 | 2.176 |
| IV | 15000 | + 25% | 125 | 2.097 |
| Total | | | | Σ log x = 6.176 |

$$\text{GM} = \text{Antilog}\left(\frac{6.176}{3}\right) = \text{Antilog}(2.0587) = 114.46$$

Average rate of change in sales = 114.46 – 100 = 14.46% increase per month.

Ex - 4: Find the GM for the following data.

| Marks (x) | No. of students (f) | log x | f log x |
|---|---|---|---|
| 130 | 3 | 2.113 | 6.339 |
| 135 | 4 | 2.130 | 8.520 |
| 140 | 6 | 2.146 | 12.876 |
| 145 | 6 | 2.161 | 12.966 |
| 150 | 3 | 2.176 | 6.528 |
| Total | Σf = N = 22 | | Σ f log x = 47.229 |

$$\text{GM} = \text{Antilog}\left(\frac{\sum f \log x}{N}\right) = \text{Antilog}\left(\frac{47.229}{22}\right) = \text{Antilog}(2.14677) \approx 140.21$$

Geometric Mean for continuous series
Steps:
1. Find the mid value m of each class and take the log of each mid value.
2. Multiply log m by the frequency ‘f’ of each class to get f log m, and sum up to obtain Σ f log m.
3. Divide Σ f log m by N and take the antilog to get the GM.

Ex: Find the GM for the data given below.

| Yield of wheat in MT (CI) | No. of farms (f) | Mid value (m) | log m | f log m |
|---|---|---|---|---|
| 1 – 10 | 3 | 5.5 | 0.740 | 2.220 |
| 11 – 20 | 16 | 15.5 | 1.190 | 19.040 |
| 21 – 30 | 26 | 25.5 | 1.406 | 36.556 |
| 31 – 40 | 31 | 35.5 | 1.550 | 48.050 |
| 41 – 50 | 16 | 45.5 | 1.658 | 26.528 |
| 51 – 60 | 8 | 55.5 | 1.744 | 13.954 |
| Total | Σf = N = 100 | | | Σ f log m = 146.348 |

$$\text{GM} = \text{Antilog}\left(\frac{\sum f \log m}{N}\right) = \text{Antilog}\left(\frac{146.348}{100}\right) = 29.07$$
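The logarithmic method for the geometric mean, with or without frequencies, can be sketched in Python as follows. The function name `geometric_mean` and its optional `freqs` argument are assumptions made for this illustration. Because the sketch uses full-precision logarithms rather than three-figure log tables, its results can differ from hand-computed values in the second decimal place.

```python
import math

def geometric_mean(values, freqs=None):
    """GM = antilog( sum(f * log10(x)) / sum(f) ); f defaults to 1 per value."""
    if freqs is None:
        freqs = [1] * len(values)
    if any(v <= 0 for v in values):
        raise ValueError("GM cannot be computed if any item is zero or negative")
    n = sum(freqs)
    mean_log = sum(f * math.log10(v) for v, f in zip(values, freqs)) / n
    return 10 ** mean_log

# Individual series: 2, 4, 8.
gm_simple = geometric_mean([2, 4, 8])

# Continuous series: mid values of the wheat-yield classes and their frequencies.
mids = [5.5, 15.5, 25.5, 35.5, 45.5, 55.5]
farms = [3, 16, 26, 31, 16, 8]
gm_grouped = geometric_mean(mids, farms)

print(round(gm_simple, 4), round(gm_grouped, 2))
```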

Harmonic Mean
It is the total number of items divided by the sum of the reciprocals of the values of the variable. It is a specialised average used to solve problems involving variables expressed as ‘time rates’, i.e. rates that vary with time. Ex: speed in km/hr, min/day, price/unit. The Harmonic Mean (HM) is suitable only when the time factor is variable and the act being performed remains constant.

$$\text{HM} = \frac{N}{\sum \frac{1}{x}}$$

Merits of Harmonic Mean
1. It is based on all observations.
2. It is rigidly defined.
3. It is suitable in the case of series having wide dispersion.
4. It is suitable for further mathematical treatment.

Demerits of Harmonic Mean
1. It is not easy to compute.
2. It cannot be used when one of the items is zero.
3. It cannot represent a distribution.

Ex:
1. The daily incomes of 5 families in a rural village are given below. Compute the HM.

| Family | Income (x) | Reciprocal (1/x) |
|---|---|---|
| 1 | 85 | 0.01176 |
| 2 | 90 | 0.01111 |
| 3 | 70 | 0.01429 |
| 4 | 50 | 0.02000 |
| 5 | 60 | 0.01667 |
| Total | | Σ(1/x) = 0.07383 |

$$\text{HM} = \frac{N}{\sum \frac{1}{x}} = \frac{5}{0.07383} = 67.72$$

2. A man travels by car for 3 days and covers 480 km each day. On the first day he drives for 10 hrs at 48 KMPH, on the second day for 12 hrs at 40 KMPH, and on the third day for 15 hrs at 32 KMPH. Compute the HM and the weighted mean of the speeds and compare them.

Data: 10 hrs @ 48 KMPH, 12 hrs @ 40 KMPH, 15 hrs @ 32 KMPH

Harmonic Mean

| x | 1/x |
|---|---|
| 48 | 0.02083 |
| 40 | 0.02500 |
| 32 | 0.03125 |
| Total | Σ(1/x) = 0.07708 |

$$\text{HM} = \frac{N}{\sum \frac{1}{x}} = \frac{3}{0.07708} = 38.92$$

Weighted Mean (weights = hours driven)

| w | x | wx |
|---|---|---|
| 10 | 48 | 480 |
| 12 | 40 | 480 |
| 15 | 32 | 480 |
| Σw = 37 | | Σwx = 1440 |

$$\text{Weighted Mean } \bar{x} = \frac{\sum wx}{\sum w} = \frac{1440}{37} = 38.92$$

The HM and the weighted mean are the same.

3. Find the HM for the following data.

| Class (CI) | Frequency (f) | Mid point (m) | Reciprocal (1/m) | f (1/m) |
|---|---|---|---|---|
| 0 – 10 | 5 | 5 | 0.2 | 1 |
| 10 – 20 | 15 | 15 | 0.0666 | 0.999 |
| 20 – 30 | 25 | 25 | 0.04 | 1 |
| 30 – 40 | 8 | 35 | 0.0285 | 0.228 |
| 40 – 50 | 7 | 45 | 0.0222 | 0.1554 |
| Total | Σf = 60 | | | Σ f (1/m) = 3.3824 |

$$\text{HM} = \frac{N}{\sum f \left(\frac{1}{m}\right)} = \frac{60}{3.3824} \approx 17.73$$
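The three harmonic mean examples can be reproduced with one short Python function. The name `harmonic_mean` and its optional `freqs` argument are assumptions made for this sketch.

```python
def harmonic_mean(values, freqs=None):
    """HM = sum(f) / sum(f / x); f defaults to 1 for each value."""
    if freqs is None:
        freqs = [1] * len(values)
    if any(v == 0 for v in values):
        raise ValueError("HM cannot be used when one of the items is zero")
    return sum(freqs) / sum(f / v for v, f in zip(values, freqs))

# 1. Daily incomes of five families.
hm_income = harmonic_mean([85, 90, 70, 50, 60])

# 2. Speeds on three days, each day covering the same distance (480 km).
hm_speed = harmonic_mean([48, 40, 32])
average_speed = (3 * 480) / (10 + 12 + 15)   # total distance / total time

# 3. Grouped data: class mid points and frequencies.
hm_grouped = harmonic_mean([5, 15, 25, 35, 45], [5, 15, 25, 8, 7])

print(round(hm_income, 2), round(hm_speed, 2), round(hm_grouped, 2))
```

In the second example the HM equals total distance divided by total time, which is why it agrees with the mean weighted by hours driven.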

Relationship between Mean, Geometric Mean and Harmonic Mean
1. If all the items of a variable are the same, the arithmetic mean, geometric mean and harmonic mean are equal, i.e. $\bar{x} = \text{GM} = \text{HM}$.
2. If the items vary in size, the mean is greater than the GM and the GM is greater than the HM. This is because the geometric mean gives larger weight to smaller items and the HM gives the largest weight to the smallest items. Hence, $\bar{x} > \text{GM} > \text{HM}$.
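The ordering of the three averages can be checked with Python's standard `statistics` module (its `geometric_mean` function requires Python 3.8 or later). The data sets below are illustrative; the check demonstrates the relationship for positive values and is not a proof.

```python
import math
import statistics

varied = [2, 4, 8]
am = statistics.mean(varied)             # arithmetic mean
gm = statistics.geometric_mean(varied)   # geometric mean
hm = statistics.harmonic_mean(varied)    # harmonic mean
print(am, gm, hm)

# When every item is the same, the three averages coincide.
same = [5, 5, 5]
am_same = statistics.mean(same)
gm_same = statistics.geometric_mean(same)
hm_same = statistics.harmonic_mean(same)
print(am_same, gm_same, hm_same)
```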

Median
The median is the value of that item in a series which divides the array into two equal parts, one consisting of all the values less than it and the other consisting of all the values more than it. The median is a positional average: the number of items below it is equal to the number of items above it, so it occupies the central position. Thus, the median is defined as the mid value of the variates. If the values are arranged in ascending or descending order of magnitude, the median is the middle value if the number of variates is odd, and the average of the two middle values if the number of variates is even.

Ex: If 9 students stand in order of their heights, the 5th student from either side is the one whose height is the median height of the group. Thus, the median of a group is given by the following equation.

$\text{Median} = \text{size of the } \left(\dfrac{N+1}{2}\right)^{th} \text{ item}$

Ex 1. Find the median for the following data.

22, 20, 25, 31, 26, 24, 23

Arrange the given data in array form (either ascending or descending order).

20, 22, 23, 24, 25, 26, 31

The median is given by the $\left(\dfrac{N+1}{2}\right)^{th}$ item $= \left(\dfrac{7+1}{2}\right)^{th}$ item = 4th item.

Median = 24

2. Find the median for the following data.

20, 21, 22, 24, 28, 32

The median is given by the $\left(\dfrac{N+1}{2}\right)^{th}$ item $= \left(\dfrac{6+1}{2}\right)^{th}$ item = 3.5th item.

This item lies between the 3rd and 4th items, whose values are 22 and 24. The median is the mean of these two values.

$\text{Median} = \dfrac{22 + 24}{2} = 23$

Discrete Series – Median
In a discrete series, the values are already in the form of an array and the frequencies are recorded against each value. However, to determine the size of the median $\left(\dfrac{N+1}{2}\right)^{th}$ item, a separate column is prepared for cumulative frequencies. The median size is first located with reference to the cumulative frequency which covers that size first. The value against that cumulative frequency is the median.

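Ex: Find the median of the marks obtained by students in statistics.

Marks (x)    No. of students (f)    Cumulative frequency (cf)
10           5                      5
20           5                      10
30           3                      13
40           15                     28
50           30                     58
60           10                     68
             N = 68

Median = size of the $\left(\dfrac{N+1}{2}\right)^{th}$ item $= \dfrac{68 + 1}{2} = 34.5^{th}$ item.

The cumulative frequency just above 34.5 is 58. Against cf 58 the value is 50, which is the median.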

Ex: In a class of 15 students, 5 students failed a test. The marks of the 10 students who passed were 9, 6, 7, 8, 9, 6, 5, 4, 7, 8. Find the median marks of the 15 students.

Marks             No. of students (f)    cf
0 – 3 (failed)    5                      5
4                 1                      6
5                 1                      7
6                 2                      9
7                 2                      11
8                 2                      13
9                 2                      15
                  Σf = 15

Median = Me = size of the $\left(\dfrac{N+1}{2}\right)^{th}$ item $= \dfrac{15 + 1}{2} = 8^{th}$ item.

The 8th item is covered by cf 9. The marks against cf 9 are 6, and hence Median = 6. The exact marks of the failed students are not needed, because all of them fall below the median position.
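The cumulative-frequency lookup for a discrete series can be sketched as below. It follows the rule stated in the text (take the first value whose cumulative frequency covers the (N + 1)/2-th item); the function name `discrete_median` is an assumption made for illustration.

```python
def discrete_median(values, freqs):
    """Median of a discrete series located via cumulative frequencies."""
    n = sum(freqs)
    position = (n + 1) / 2              # size of the median item
    cumulative = 0
    for value, f in sorted(zip(values, freqs)):
        cumulative += f
        if cumulative >= position:      # first cf that covers the position
            return value
    raise ValueError("frequencies must not be empty")


marks = [10, 20, 30, 40, 50, 60]
students = [5, 5, 3, 15, 30, 10]
print(discrete_median(marks, students))
```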

Continuous Series
The procedure for finding the median is different in a continuous series. The class intervals are already in the form of an array and the frequencies are recorded against each class interval. To determine the size, we take the $\left(\dfrac{N}{2}\right)^{th}$ item, and the median class is located with reference to the cumulative frequency which covers that size first. Once the median class is located, the median value is interpolated using the formula given below.

$\text{Median} = l + \dfrac{h}{f}\left(\dfrac{N}{2} - C\right)$

where $l = \dfrac{l_0 + l_1}{2}$, $l_0$ is the left end point of the N/2 (median) class and $l_1$ is the right end point of the previous class,
h = class width,
f = frequency of the median class,
C = cumulative frequency of the class preceding the median class.

Ex: Find the median for the following data. The marks obtained by 50 students are as follows.

CI         Frequency (f)    Cum. frequency (cf)
10 – 15    6                6
15 – 20    18               24
20 – 25    9                33  ← N/2 class
25 – 30    10               43
30 – 35    4                47
35 – 40    3                50
           Σf = N = 50

$\dfrac{N}{2} = \dfrac{50}{2} = 25$

The cumulative frequency just above 25 is 33, and hence 20 – 25 is the median class.

$l = \dfrac{l_0 + l_1}{2} = \dfrac{20 + 20}{2} = 20$, h = 25 – 20 = 5, f = 9, C = 24

$\text{Median} = l + \dfrac{h}{f}\left(\dfrac{N}{2} - C\right) = 20 + \dfrac{5}{9}(25 - 24) = 20 + \dfrac{5}{9}$

Median = 20.56

Ex: Find the median for the following data.

Mid values (m)     115   125   135   145   155   165   175   185   195
Frequencies (f)    6     25    48    72    116   60    38    22    3

The interval between successive mid-values, and hence the width of each class interval, is 10. So half of 10 deducted from and added to each mid-value gives the lower and upper limits of the class: 115 – 5 = 110 (lower limit), 115 + 5 = 120 (upper limit). Similarly, we can get the class interval for every mid-value.

CI           Frequency (f)    Cum. frequency (cf)
110 – 120    6                6
120 – 130    25               31
130 – 140    48               79
140 – 150    72               151
150 – 160    116              267  ← N/2 class
160 – 170    60               327
170 – 180    38               365
180 – 190    22               387
190 – 200    3                390
             Σf = N = 390

$\dfrac{N}{2} = \dfrac{390}{2} = 195$

The cumulative frequency just above 195 is 267, so the median class is 150 – 160.

$l = \dfrac{150 + 150}{2} = 150$, f = 116, N/2 = 195, C = 151, h = 10

$\text{Median} = l + \dfrac{h}{f}\left(\dfrac{N}{2} - C\right) = 150 + \dfrac{10}{116}(195 - 151)$

Median = 153.8

Merits of Median
a. It is simple, easy to compute and understand.
b. Its value is not affected by extreme values.
c. It can be determined by inspection for arrayed data.
d. It can be found graphically also.
e. It indicates the value of the middle item.

Demerits of Median
a. It may not be a representative value as it ignores extreme values.
b. It cannot be determined precisely when its size falls between two values.
c. It is not useful in cases where large weights are to be given to extreme values.
d. It is not capable of further algebraic treatment.
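The interpolation formula for the median of a continuous series can be sketched in Python. The function below is illustrative (its name and argument layout are assumptions); it assumes equal-width classes given by their lower limits.

```python
def grouped_median(lower_limits, width, freqs):
    """Median = l + (h / f) * (N/2 - C) for equal-width classes."""
    target = sum(freqs) / 2             # N/2
    cumulative = 0                      # C: cf of the classes before the current one
    for lower, f in zip(lower_limits, freqs):
        if cumulative + f >= target:    # this is the median class
            return lower + (width / f) * (target - cumulative)
        cumulative += f
    raise ValueError("frequencies must not be empty")


m1 = grouped_median([10, 15, 20, 25, 30, 35], 5, [6, 18, 9, 10, 4, 3])
m2 = grouped_median(list(range(110, 200, 10)), 10,
                    [6, 25, 48, 72, 116, 60, 38, 22, 3])
print(round(m1, 2), round(m2, 2))
```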

Session – 6 Measures of Central Tendency

Mode
The mode is the value which occurs with the maximum frequency. It is the most typical or common value, the one that receives the highest frequency. It represents fashion and is often used in business. Thus, it corresponds to the value of the variable which occurs most frequently. The modal class of a frequency distribution is the class with the highest frequency. The mode is denoted by ‘z’. It is the value of the variable which is repeated the greatest number of times in the series. It is the usual, and not casual, size of item in the series. It lies at the position of greatest density.

Ex: If we say the modal marks obtained by students in a class test are 42, it means that the largest number of students have secured 42 marks.

If each observation occurs the same number of times, we say that there is ‘no mode’. If two observations share the highest frequency, the series is ‘bi-modal’. If three or more observations share the highest frequency, it is a ‘multi-modal’ case. When a single observation occurs the most number of times, it is a ‘uni-modal’ case.

For grouped data the mode can be computed by the following equations, with the usual notation.

$\text{Mode} = l + \dfrac{h(f_m - f_1)}{2f_m - f_1 - f_2}$

where f_m = maximum frequency (modal class frequency), f_1 = frequency of the class preceding the modal class, f_2 = frequency of the class succeeding the modal class, h = class width, and l = lower boundary of the modal class.

or

$\text{Mode} = l + \dfrac{h f_2}{f_1 + f_2}$

Ex: 1. Find the mode for the following data.

Marks (CI)    No. of students (f)
1 – 10        3
11 – 20       16
21 – 30       26
31 – 40       31  ← Max. frequency
41 – 50       16
51 – 60       8
              Σf = N = 100

The modal class is the class of maximum frequency, i.e. 31 – 40, where f_m = 31, f_1 = 26, f_2 = 16, h = 10.

$l = \dfrac{30 + 31}{2} = 30.5$

$\text{Mode } (z) = l + \dfrac{h(f_m - f_1)}{2f_m - f_1 - f_2} = 30.5 + \dfrac{10(31 - 26)}{2 \times 31 - 26 - 16}$

Mode = 33

Or

$\text{Mode} = l + \dfrac{h f_2}{f_1 + f_2} = 30.5 + \dfrac{10 \times 16}{26 + 16}$

Mode = 34.31

It can be noted that the second method gives a slightly different mode value.
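Both grouped-data formulas for the mode can be sketched in Python and compared on the example above. The function names are illustrative, and the sketch assumes a single modal class with equal class widths.

```python
def modal_class_parts(freqs):
    """Return (index, fm, f1, f2) for the class with the highest frequency."""
    i = max(range(len(freqs)), key=freqs.__getitem__)
    f1 = freqs[i - 1] if i > 0 else 0                # preceding class
    f2 = freqs[i + 1] if i < len(freqs) - 1 else 0   # succeeding class
    return i, freqs[i], f1, f2


def mode_method_1(lower_boundaries, width, freqs):
    """Mode = l + h (fm - f1) / (2 fm - f1 - f2)."""
    i, fm, f1, f2 = modal_class_parts(freqs)
    return lower_boundaries[i] + width * (fm - f1) / (2 * fm - f1 - f2)


def mode_method_2(lower_boundaries, width, freqs):
    """Mode = l + h f2 / (f1 + f2)."""
    i, _, f1, f2 = modal_class_parts(freqs)
    return lower_boundaries[i] + width * f2 / (f1 + f2)


boundaries = [0.5, 10.5, 20.5, 30.5, 40.5, 50.5]
f = [3, 16, 26, 31, 16, 8]
z1 = mode_method_1(boundaries, 10, f)
z2 = mode_method_2(boundaries, 10, f)
print(round(z1, 2), round(z2, 2))
```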

Partition values
The median divides a series into two equal parts. There are other values which also divide the series into equal parts; these are called partition values (PV). Just as one point divides a series into two equal parts (halves), 3 points divide it into four parts (quartiles), 9 points divide it into 10 parts (deciles) and 99 points divide it into 100 parts (percentiles). The partition values are useful to know the exact composition of a series.

Quartiles
A measure which divides an array into four equal parts is known as a quartile. Each portion contains an equal number of items. The first, second and third points are termed the first quartile (Q1), second quartile (Q2) and third quartile (Q3). The first quartile is also known as the lower quartile, as 25% of the observations of the distribution lie below it. The third quartile is known as the upper quartile, as 75% of the observations lie below it and 25% above it.

Calculation of quartiles

$Q_1 = \text{size of the } \dfrac{(N+1)}{4}^{th} \text{ item}$

$Q_3 = \text{size of the } \dfrac{3(N+1)}{4}^{th} \text{ item}$

$Q_2 = \text{median} = l + \dfrac{h}{f}\left(\dfrac{N}{2} - C\right)$

Measures of quartiles
The quartile values are located on a principle similar to that used for locating the median value. The following table shows the procedure for locating quartiles.

Measure    Individual and discrete series    Continuous series
Q1         (N + 1)/4 th item                 N/4 th item
Q2         2(N + 1)/4 th item                2N/4 th item
Q3         3(N + 1)/4 th item                3N/4 th item

Ex - 1: From the following marks find Q1, Median and Q3.

23, 48, 34, 68, 15, 36, 24, 54, 65, 75, 92, 10, 70, 61, 20, 47, 83, 19, 77

Let us arrange the data in array form.

Sl. No.    x
1.         10
2.         15
3.         19
4.         20
5.         23  ← Q1
6.         24
7.         34
8.         36
9.         47
10.        48  ← Q2
11.        54
12.        61
13.        65
14.        68
15.        70  ← Q3
16.        75
17.        77
18.        83
19.        92

Here, n = 19 items.

a. $Q_1 = \dfrac{1}{4}(n + 1)^{th}$ item $= \dfrac{1}{4}(19 + 1) = \dfrac{1}{4} \times 20$ = 5th item
∴ Q1 = 23

b. $Q_2 = \dfrac{2}{4}(n + 1)^{th}$ item $= \dfrac{2 \times 20}{4}$ = 10th item
∴ Q2 = 48

c. $Q_3 = \dfrac{3}{4}(n + 1)^{th}$ item $= \dfrac{3 \times 20}{4}$ = 15th item

∴ Q3 = 70

Ex - 2: Locate the median and quartiles from the following data.

Size of shoes    4    4.5   5    5.5   6    6.5   7    7.5   8
Frequencies      20   36    44   50    80   30    30   16    14

X      f      cf
4      20     20
4.5    36     56
5      44     100  ← Q1
5.5    50     150
6      80     230  ← Q2
6.5    30     260  ← Q3
7      30     290
7.5    16     306
8      14     320
       N = Σf = 320

$Q_1 = \dfrac{1}{4}(n + 1)^{th}$ item $= \dfrac{1}{4} \times 321$ = 80.25th item
Just above 80.25, the cf is 100. Against cf 100, the value is 5.
∴ Q1 = 5

$Q_2 = \dfrac{1}{2}(n + 1)^{th}$ item $= \dfrac{1}{2} \times 321$ = 160.5th item
Just above 160.5, the cf is 230. Against cf 230, the value is 6.
∴ Q2 = 6

$Q_3 = \dfrac{3}{4}(n + 1)^{th}$ item $= \dfrac{3 \times 321}{4}$ = 240.75th item

Just above 240.75, the cf is 260. Against cf 260, the value is 6.5.
∴ Q3 = 6.5

Ex - 3: Compute the quartiles from the following data.

CI               0-10   10-20   20-30   30-40   40-50   50-60   60-70   70-80
Frequency (f)    5      8       7       12      28      20      10      10

First quartile $Q_1 = l + \dfrac{h}{f}\left(\dfrac{1}{4}N - C\right)$, $Q_2 = \text{Median} = l + \dfrac{h}{f}\left(\dfrac{N}{2} - C\right)$ and $Q_3 = l + \dfrac{h}{f}\left(\dfrac{3}{4}N - C\right)$

CI       f      cf
0-10     5      5
10-20    8      13
20-30    7      20
30-40    12     32   ← Q1
40-50    28     60   ← Q2
50-60    20     80   ← Q3
60-70    10     90
70-80    10     100
         N = Σf = 100

a. First locate Q1 at ¼ N = 25. The Q1 class is 30-40.
$l = \dfrac{30 + 30}{2} = 30$, h = 10, f = 12, C = 20
$Q_1 = l + \dfrac{h}{f}\left(\dfrac{1}{4}N - C\right) = 30 + \dfrac{10}{12}(25 - 20)$
Q1 = 34.17

b. Locate Q2 (Median). Q2 corresponds to N/2 = 50. The Q2 class is 40-50.
$l = \dfrac{40 + 40}{2} = 40$, h = 10, f = 28, C = 32
$Q_2 = l + \dfrac{h}{f}\left(\dfrac{N}{2} - C\right) = 40 + \dfrac{10}{28}(50 - 32)$
Q2 = 46.43

c. Q3 corresponds to ¾ N = 75. The Q3 class is 50-60.
$l = \dfrac{50 + 50}{2} = 50$, h = 10, f = 20, C = 60
$Q_3 = l + \dfrac{h}{f}\left(\dfrac{3}{4}N - C\right) = 50 + \dfrac{10}{20}(75 - 60)$
Q3 = 57.5

Deciles
The deciles divide the arrayed set of variates into ten portions of equal frequency, and they are sometimes used to characterize the data for some specific purpose. In this process, we get nine decile values. The fifth decile is nothing but the median. We can calculate the other deciles by following the procedure used in computing the quartiles.

Formula to compute deciles:

$D_1 = l + \dfrac{h}{f}\left(\dfrac{1}{10}N - C\right), \quad D_2 = l + \dfrac{h}{f}\left(\dfrac{2}{10}N - C\right)$ and so on.

Percentiles
Percentile values divide the distribution into 100 parts of equal frequency. In this process, we get ninety-nine percentile values. The 25th, 50th and 75th percentiles are nothing but the first quartile, the median and the third quartile respectively.

Formula to compute percentiles:

$P_{25} = l + \dfrac{h}{f}\left(\dfrac{25}{100}N - C\right), \quad P_{26} = l + \dfrac{h}{f}\left(\dfrac{26}{100}N - C\right)$ and so on.

Ex: Find the 7th decile and the 60th percentile for the given data of patients who visited a hospital on a particular day.

CI       f      cf
10-20    1      1
20-30    3      4
30-40    11     15
40-50    21     36
50-60    43     79   ← P60
60-70    32     111  ← D7
70-80    9      120
         Σf = N = 120

a. $D_7 = l + \dfrac{h}{f}\left(\dfrac{7}{10}N - C\right)$
$\dfrac{7}{10}N = 84$, so the D7 class is 60-70.
$l = \dfrac{60 + 60}{2} = 60$, h = 10, f = 32, C = 79
$D_7 = 60 + \dfrac{10}{32}(84 - 79)$
7th Decile = D7 = 61.56

b. 60th percentile: $P_{60} = l + \dfrac{h}{f}\left(\dfrac{60}{100}N - C\right)$
$\dfrac{60}{100}N = 72$, so the P60 class is 50-60.
$l = \dfrac{50 + 50}{2} = 50$, h = 10, f = 43, C = 36
$P_{60} = 50 + \dfrac{10}{43}(72 - 36)$
P60 = 58.37
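Quartiles, deciles and percentiles of a continuous series all use the same interpolation, differing only in the fraction of N that is located. A minimal Python sketch of this common rule follows; the function name `partition_value` and its arguments are assumptions made for illustration, and equal class widths are assumed.

```python
def partition_value(lower_limits, width, freqs, k, parts):
    """k-th of `parts` partition values: l + (h / f) * (k * N / parts - C).

    parts = 4 gives quartiles, 10 deciles, 100 percentiles.
    """
    target = k * sum(freqs) / parts
    cumulative = 0                       # C: cf before the current class
    for lower, f in zip(lower_limits, freqs):
        if cumulative + f >= target:     # class containing the target item
            return lower + (width / f) * (target - cumulative)
        cumulative += f
    raise ValueError("k must satisfy 0 < k <= parts")


limits = [10, 20, 30, 40, 50, 60, 70]
patients = [1, 3, 11, 21, 43, 32, 9]
d7 = partition_value(limits, 10, patients, 7, 10)
p60 = partition_value(limits, 10, patients, 60, 100)
print(round(d7, 2), round(p60, 2))
```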

SOME NUMERICAL EXAMPLES

1. Show that the following distribution is symmetrical about the average. Also show that the median is mid-way between the lower and upper quartiles.

X            2    3    4    5    6    7    8    9    10
Frequency    2    9    29   57   80   57   29   9    2

• To show that the given distribution is symmetrical, the Mean, Median and Mode must be the same.
• To show that the median is mid-way between the lower and upper quartiles, we need Q2 – Q1 = Q3 – Q2.

Mid-point x    Class interval CI    f     d = (x – 6)    fd     Cum. freq. cf
2              1.5 – 2.5            2     -4             -8     2
3              2.5 – 3.5            9     -3             -27    11
4              3.5 – 4.5            29    -2             -58    40
5              4.5 – 5.5            57    -1             -57    97   ← Q1 class
6              5.5 – 6.5            80    0              0      177  ← Q2 class
7              6.5 – 7.5            57    1              57     234  ← Q3 class
8              7.5 – 8.5            29    2              58     263
9              8.5 – 9.5            9     3              27     272
10             9.5 – 10.5           2     4              8      274
                                    N = 274              Σfd = 0

Mean
Let A = 6.
$\text{Mean} = A + \dfrac{h \sum fd}{N} = 6 + \dfrac{1 \times 0}{274} = 6$
Mean = 6.

Median
$Q_2 = l + \dfrac{h}{f}\left(\dfrac{N}{2} - C\right)$, with $\dfrac{N}{2} = \dfrac{274}{2} = 137$, l = 5.5, f = 80, C = 97
$Q_2 = 5.5 + \dfrac{1}{80}(137 - 97) = 5.5 + 0.5$
Median = Q2 = 6.

Mode
The modal class is 5.5 – 6.5.
$\text{Mode} = l + \dfrac{h(f_m - f_1)}{2f_m - f_1 - f_2} = 5.5 + \dfrac{1(80 - 57)}{2 \times 80 - 57 - 57}$
Mode = 6.

Since Mean = Mode = Median, the given distribution is symmetrical.

Q1 calculation
$Q_1 = l + \dfrac{h}{f}\left(\dfrac{N}{4} - C\right)$, with N/4 = 68.5, l = 4.5, f = 57, C = 40
$Q_1 = 4.5 + \dfrac{1}{57}(68.5 - 40) = 4.5 + 0.5$
∴ Q1 = 5.

Q3 calculation
$Q_3 = l + \dfrac{h}{f}\left(\dfrac{3N}{4} - C\right)$, with 3N/4 = 205.5, l = 6.5, f = 57, C = 177
$Q_3 = 6.5 + \dfrac{1}{57}(205.5 - 177) = 6.5 + 0.5$
∴ Q3 = 7.

Now, Q2 – Q1 = Q3 – Q2, i.e. 6 – 5 = 7 – 6, or 1 = 1. Hence the median is mid-way between the lower and upper quartiles.

2. Find the mean for the set of observations given below.

6, 7, 5, 8, 4

$\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{N} = \dfrac{6 + 7 + 5 + 8 + 4}{5} = \dfrac{30}{5} = 6$

3. Find the mean for the following data.

CI       f      xi      fx
1-10     3      5.5     16.5
11-20    16     15.5    248
21-30    26     25.5    663
31-40    31     35.5    1100.5
41-50    16     45.5    728
51-60    8      55.5    444
         N = Σf = 100   Σfx = 3200

$\bar{x} = \dfrac{\sum fx}{N} = \dfrac{3200}{100}$

$\bar{x} = 32$

4. Find the mean profit of the organisation for the data given below.

Profit CI    f      xi     fx
100-200      10     150    1500
200-300      18     250    4500
300-400      20     350    7000
400-500      26     450    11700
500-600      30     550    16500
600-700      28     650    18200
700-800      18     750    13500
             N = Σf = 150  Σfx = 72900

The mid-point of the first class is $x_1 = \dfrac{100 + 200}{2} = \dfrac{300}{2} = 150$, and similarly for the other classes.

$\bar{x} = \dfrac{\sum fx}{N} = \dfrac{72900}{150}$

$\bar{x} = 486$
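The direct method for the mean of grouped data (mid-points weighted by frequencies) can be sketched in Python as follows. The function name `grouped_mean` is illustrative.

```python
def grouped_mean(class_limits, freqs):
    """Mean = sum(f * x) / N, where x is the mid-point of each class."""
    midpoints = [(low + high) / 2 for low, high in class_limits]
    total = sum(f * x for f, x in zip(freqs, midpoints))
    return total / sum(freqs)


profit_classes = [(100, 200), (200, 300), (300, 400), (400, 500),
                  (500, 600), (600, 700), (700, 800)]
frequencies = [10, 18, 20, 26, 30, 28, 18]
mean_profit = grouped_mean(profit_classes, frequencies)
print(mean_profit)
```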

Step Deviation Method

$x = a + hd \;\Rightarrow\; d = \dfrac{x - a}{h}$

$\bar{x} = a + h\dfrac{\sum fd}{N}$

a = arbitrary constant (assumed mean), h = class width

Profit CI    f      xi     d      fd
100-200      10     150    -3     -30
200-300      18     250    -2     -36
300-400      20     350    -1     -20
400-500      26     450    0      0
500-600      30     550    +1     30
600-700      28     650    +2     56
700-800      18     750    +3     54
             N = Σf = 150         Σfd = 54

$\bar{x} = a + h\dfrac{\sum fd}{N} = 450 + 100\left(\dfrac{54}{150}\right)$

$\bar{x} = 486$

5. In an office there are 84 employees and their salaries are given below.

Salary (Rs.)    2430   2590   2870   3390   4720   5160
Employees       4      28     31     16     3      2

1. Find the mean salary of the employees.
2. What is the total salary of the employees?

$\bar{x} = \dfrac{\sum fx}{N} = \dfrac{2430 \times 4 + 2590 \times 28 + 2870 \times 31 + 3390 \times 16 + 4720 \times 3 + 5160 \times 2}{84} = \dfrac{249930}{84}$

1. Mean salary $\bar{x}$ = Rs. 2975.36

2. Total salary = Rs. 2,49,930

6. The average marks secured by 36 students was 52, but it was discovered that one item, 64, was misread as 46. Find the correct mean of the marks.

$\bar{x} = \dfrac{\sum fx}{N} \;\Rightarrow\; 52 = \dfrac{\sum fx}{36}$

Σfx = 52 x 36 = 1872
Correct Σfx = Σfx – incorrect + correct = 1872 – 46 + 64 = 1890

$\bar{x} = \dfrac{\text{correct } \sum fx}{N} = \dfrac{1890}{36}$

$\bar{x} = 52.5$

7. The mean of 100 items is 46. Later it was discovered that an item 16 was misread as 61, another item 43 was misread as 34, and the total number of items is 90, not 100. Find the correct mean value.

$\bar{x} = \dfrac{\sum fx}{N} \;\Rightarrow\; 46 = \dfrac{\sum fx}{100}$

Σfx = 4600
Correct Σfx = Σfx – incorrect + correct = 4600 – 61 – 34 + 16 + 43 = 4564

$\bar{x} = \dfrac{\text{correct } \sum fx}{N} = \dfrac{4564}{90} = 50.71$
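The correction procedure used in the last two examples (rebuild the total from the reported mean, swap the misread items, divide by the true count) can be sketched in Python. The function name `corrected_mean` and its parameters are illustrative assumptions.

```python
def corrected_mean(reported_mean, reported_n, wrong_items, right_items, true_n=None):
    """Recompute the mean after replacing misread items (and, optionally, fixing N)."""
    total = reported_mean * reported_n            # total as originally computed
    total = total - sum(wrong_items) + sum(right_items)
    n = reported_n if true_n is None else true_n
    return total / n


ex6 = corrected_mean(52, 36, wrong_items=[46], right_items=[64])
ex7 = corrected_mean(46, 100, wrong_items=[61, 34], right_items=[16, 43], true_n=90)
print(ex6, round(ex7, 2))
```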

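8. Calculate the mean for the following data.

Value    Frequency
< 10     4
< 20     10
< 30     15
< 40     25
< 50     30

Treating each listed frequency as the frequency of the class ending at that value (not as a cumulative frequency), the classes are:

CI       f      Mid point (m)    fm
0-10     4      5                20
10-20    10     15               150
20-30    15     25               375
30-40    25     35               875
40-50    30     45               1350
         Σf = 84                 Σfm = 2770

$\bar{x} = \dfrac{\sum fm}{N} = \dfrac{2770}{84}$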

$\bar{x} = 32.98$

9. For the given frequency table, find the missing frequencies. The average number of accidents is 1.46 and N = 200.

No. of accidents (x)    Frequency (f)    fx
0                       46               0
1                       ? (f1)           f1
2                       ? (f2)           2f2
3                       25               75
4                       10               40
5                       5                25
                        N = 200          Σfx = 140 + f1 + 2f2

$1.46 = \dfrac{140 + f_1 + 2f_2}{200}$

292 = 140 + f1 + 2f2
∴ f1 + 2f2 = 152        ----(1)

We know that N = Σf:
200 = 86 + f1 + f2
f1 + f2 = 114           ----(2)

Subtracting (2) from (1):
f1 + 2f2 = 152          ----(1)
f1 + f2 = 114           ----(2)
--------------------------------
f2 = 38
--------------------------------
From (2): f1 = 114 – 38
f1 = 76
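The two simultaneous equations of this example (one from N = Σf, one from the mean) can also be solved programmatically. The sketch below is illustrative; the function name and argument names are assumptions, not part of the notes.

```python
def two_missing_frequencies(known, x1, x2, n, mean):
    """Solve f1 + f2 = A and x1*f1 + x2*f2 = B for the two unknown frequencies.

    known: dict mapping each value x to its known frequency f.
    """
    a = n - sum(known.values())                            # f1 + f2
    b = mean * n - sum(x * f for x, f in known.items())    # x1*f1 + x2*f2
    f2 = (b - x1 * a) / (x2 - x1)
    f1 = a - f2
    return f1, f2


f1, f2 = two_missing_frequencies({0: 46, 3: 25, 4: 10, 5: 5},
                                 x1=1, x2=2, n=200, mean=1.46)
print(round(f1), round(f2))
```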

10. Find the missing value of the variate for the following data, given that the mean is 31.87.

xi     f      fx
12     8      96
20     16     320
27     48     1296
33     90     2970
x      30     30x
54     8      432
       N = 200    Σfx = 5114 + 30x

$\bar{x} = \dfrac{\sum fx}{N} \;\Rightarrow\; 31.87 = \dfrac{\sum fx}{200}$

Σfx = 6374              ----(1)
Σfx = 5114 + 30x        ----(2)

Equating (1) and (2):
6374 = 5114 + 30x
6374 – 5114 = 30x
∴ 30x = 1260
x = 42.

11. The average rainfall of a city from Monday to Saturday is 0.3 inches. Due to heavy rainfall on Sunday, the average rainfall for the week increased to 0.5 inches. What was the rainfall on Sunday?

Given: Mean for Mon – Sat (6 days) = 0.3", mean for Mon – Sun (7 days) = 0.5"

$0.3 = \dfrac{\sum fx_1}{6} \;\Rightarrow\; \sum fx_1 = 1.8$

$0.5 = \dfrac{\sum fx_2}{7} \;\Rightarrow\; \sum fx_2 = 3.5$

Rainfall on Sunday = Σfx2 – Σfx1 = 3.5 – 1.8 = 1.7"

12. The average salary of male employees in a firm was Rs. 520 and that of females Rs. 420. The mean salary of all the employees as a whole is Rs. 500. Find the percentage of male and female employees.

Given: $\bar{x}_1 = 520$, $\bar{x}_2 = 420$, $\bar{x} = 500$
n1 = number of male employees, n2 = number of female employees.

$\bar{x} = \dfrac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2}$

$500 = \dfrac{520 n_1 + 420 n_2}{n_1 + n_2}$

500n1 + 500n2 = 520n1 + 420n2
80n2 = 20n1
n1 = 4n2

Let n1 + n2 = 100:
4n2 + n2 = 100
5n2 = 100
n2 = 20%  → Female
n1 = 80%  → Male

80% of the employees in the firm are male and 20% are female.

13. The AM of two observations is 25 and their GM is 15. Find the HM.

Given: AM = 25, GM = 15, HM = ?

$AM = \dfrac{a + b}{2} = 25 \;\Rightarrow\; a + b = 50$

$GM = \sqrt{ab} = 15 \;\Rightarrow\; ab = (15)^2 = 225$

$HM = \dfrac{2}{\frac{1}{a} + \frac{1}{b}} = \dfrac{2ab}{a + b} = \dfrac{2 \times 225}{50}$

HM = 9

14. The GM is 60 and the HM is 28.24. Find the AM for two observations.

$GM = \sqrt{ab} = 60 \;\Rightarrow\; ab = 60^2 = 3600$

$HM = \dfrac{2ab}{a + b} = 28.24 \;\Rightarrow\; a + b = \dfrac{2ab}{28.24} = \dfrac{2 \times 3600}{28.24} = 254.96$

$AM = \dfrac{a + b}{2} = \dfrac{254.96}{2} = 127.48$
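For two positive observations a and b, the steps of the last two examples collapse into a single identity, which offers a quick check on both answers. The short derivation below uses only the definitions of AM, GM and HM given earlier.

```latex
\[
AM \times HM = \frac{a+b}{2}\cdot\frac{2ab}{a+b} = ab = GM^{2}
\quad\Longrightarrow\quad
HM = \frac{GM^{2}}{AM}, \qquad AM = \frac{GM^{2}}{HM}
\]
\[
\text{Example 13: } HM = \frac{15^{2}}{25} = 9,
\qquad
\text{Example 14: } AM = \frac{60^{2}}{28.24} \approx 127.48
\]
```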

15. Calculate the missing frequency from the data if the median is 50.

CI       f        cf
10-20    2        2
20-30    8        10
30-40    6        16
40-50    ? (f1)   16 + f1
50-60    15       31 + f1  ← median class
60-70    10       41 + f1
         N = Σf = 41 + f1

$\text{Median} = l + \dfrac{h}{f}\left(\dfrac{N}{2} - C\right)$

$50 = 50 + \dfrac{10}{15}\left(\dfrac{N}{2} - (16 + f_1)\right)$

$0 = \dfrac{10}{15}\left(\dfrac{N}{2} - (16 + f_1)\right)$

$0 = \dfrac{N}{2} - (16 + f_1) \;\Rightarrow\; 16 + f_1 = \dfrac{N}{2}$

16 + f1 = ½ (41 + f1)
2 (16 + f1) = 41 + f1
32 + 2f1 = 41 + f1
f1 = 9

Session – 7 Measures of Dispersions

The measures of central tendency alone will not exhibit the various characteristics of frequency distributions having the same total frequency. Two distributions can have the same mean but differ significantly. We need to know the extent of variation or deviation of the values in comparison with the central value or average referred to as the measure of central tendency. Measures of dispersion are ‘averages of the second order’. They are based on the average of the deviations of the values from a measure of central tendency: $\bar{x}$, Me or z. Variability is a basic feature of the values of variables. Such variation or dispersion refers to the ‘lack of uniformity’.

Definition: A measure of dispersion may be defined as a statistic signifying the extent of the scatteredness of items around a measure of central tendency.

Absolute and Relative Measures of Dispersion: A measure of dispersion may be expressed in an absolute form or in a relative form. It is said to be in absolute form when it states the actual amount by which the value of an item on average deviates from a measure of central tendency. Absolute measures are expressed in concrete units, i.e., the units in which the data have been expressed, e.g. Rupees, Centimetres, Kilograms etc., and are used to describe a frequency distribution. A relative measure of dispersion is a quotient obtained by dividing the absolute measure by the quantity with respect to which the absolute deviation has been computed. It is as such a pure number and is usually expressed in percentage form. Relative measures are used for making comparisons between two or more distributions.

Thus, absolute measures are expressed in terms of the original units and are not suitable for comparative studies. Relative measures are expressed as ratios or percentages and are suitable for comparative studies.

Measures of Dispersion Types
Following are the common measures of dispersion.
a. The Range
b. The Quartile Deviation (QD)
c. The Mean Deviation (MD)
d. The Standard Deviation (SD)

Range
‘Range’ represents the difference between the values of the extremes. The range of any series is the difference between the highest and the lowest values in the series. The values in between the two extremes are not taken into consideration at all. The range is a simple indicator of the variability of a set of observations. It is denoted by ‘R’. In a frequency distribution, the range is taken to be the difference between the lower limit of the class at the lower extreme of the distribution and the upper limit of the class at the upper extreme. The range can be computed using the following equations.

Range = Large value – Small value

$\text{Coefficient of Range} = \dfrac{\text{Large value} - \text{Small value}}{\text{Large value} + \text{Small value}}$

Problems

1. Compute the range and the coefficient of range of the given series, and state which one is more dispersed and which is more uniform.

Series – I: 9, 10, 15, 19, 21
Series – II: 1, 15, 24, 28, 29

Series – I:
R = LV – SV = 21 – 9 = 12
$CR = \dfrac{R}{LV + SV} = \dfrac{12}{21 + 9} = \dfrac{12}{30} = 0.4$

Series – II:
R = LV – SV = 29 – 1 = 28
$CR = \dfrac{R}{LV + SV} = \dfrac{28}{30} = 0.933$

Series I is less dispersed and more uniform.
Series II is more dispersed and less uniform.

Evaluating Criteria
i. The smaller the CR, the less the dispersion.
ii. The larger the CR, the less uniform the series.

Range Merits
i. It is the simplest measure to compute.
ii. It is rigidly defined.
iii. It is very useful in Statistical Quality Control (SQC).
iv. It is useful in studying variation in the prices of shares and stocks.

Limitations
i. It is not a stable measure of dispersion, as it is affected by extreme values.
ii. It does not consider class intervals and is not suitable for class-interval (C.I.) problems.
iii. It considers only the extreme values.

2. Find the range and coefficient of range from the following data.

A:   10    11    12    13    14
B:   40    41    42    43    44
C:   100   101   102   103   104

Series – I (A):
R = LV – SV = 14 – 10 = 4
$CR = \dfrac{R}{LV + SV} = \dfrac{4}{24} = 0.166$

Series – II (B):
R = 44 – 40 = 4
$CR = \dfrac{R}{LV + SV} = \dfrac{4}{84} = 0.0476$

Series – III (C):
R = 104 – 100 = 4
$CR = \dfrac{R}{LV + SV} = \dfrac{4}{204} = 0.0196$

Series III is less dispersed and more uniform.
Series I is more dispersed and less uniform.

3. Compute the range and coefficient of range for the following data.

x:   6    12    18   24   30   36   42
f:   20   130   16   14   20   15   40

Range = LV – SV = 42 – 6 = 36

$CR = \dfrac{R}{LV + SV} = \dfrac{36}{48} = 0.75$
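The range and its coefficient need only the two extreme values, as the short Python sketch below shows. The function name `range_and_coefficient` is illustrative.

```python
def range_and_coefficient(values):
    """Return (range, coefficient of range) = (LV - SV, (LV - SV) / (LV + SV))."""
    largest, smallest = max(values), min(values)
    return largest - smallest, (largest - smallest) / (largest + smallest)


r1, cr1 = range_and_coefficient([9, 10, 15, 19, 21])    # Series I of Problem 1
r2, cr2 = range_and_coefficient([1, 15, 24, 28, 29])    # Series II of Problem 1
print(r1, round(cr1, 3), r2, round(cr2, 3))
```

For a frequency table such as Problem 3, the frequencies play no part: only the smallest and largest x are passed in.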

Quartile Deviation
Quartiles divide the total frequency into four equal parts. The lower quartile Q1 refers to the value of the variate corresponding to the cumulative frequency N/4. The upper quartile Q3 refers to the value of the variate corresponding to the cumulative frequency ¾ N. The middle quartile Q2 is the median, as it corresponds to the value of the variate with cumulative frequency c.f. = N/2.

Quartile deviation is defined as

a) $QD = \dfrac{1}{2}(Q_3 - Q_1)$

b) Relative measure of dispersion: coefficient of QD $= \dfrac{Q_3 - Q_1}{Q_3 + Q_1}$

Problems

1. Find the quartile deviation and coefficient of quartile deviation for the given grouped data. Also compute the middle quartile.

Class      f      cf
1 – 10     3      3
11 – 20    16     19
21 – 30    26     45   ← Q1 class
31 – 40    31     76   ← Q2 & Q3 class
41 – 50    16     92
51 – 60    8      100
           Σf = N = 100

Q1 corresponds to $\dfrac{N}{4} = \dfrac{100}{4} = 25$.

$Q_1 = l + \dfrac{h}{f}\left(\dfrac{N}{4} - C\right) = 20.5 + \dfrac{10}{26}(25 - 19)$
Q1 = 22.81

$Q_2 = l + \dfrac{h}{f}\left(\dfrac{N}{2} - C\right) = 30.5 + \dfrac{10}{31}(50 - 45)$
Q2 = 32.11

$Q_3 = l + \dfrac{h}{f}\left(\dfrac{3}{4}N - C\right) = 30.5 + \dfrac{10}{31}(75 - 45)$
Q3 = 40.18

$QD = \dfrac{1}{2}(Q_3 - Q_1) = \dfrac{1}{2}(40.18 - 22.81)$
QD = 8.685

$\text{Coef. QD} = \dfrac{Q_3 - Q_1}{Q_3 + Q_1} = \dfrac{40.18 - 22.81}{40.18 + 22.81} = \dfrac{17.37}{62.99}$
= 0.276

2. Find the quartile deviation and the coefficient of quartile deviation from the following marks of 12 students.

Sl. No.    Marks
1.         25
2.         30
3.         37
4.         43
5.         48
6.         54
7.         61
8.         67
9.         72
10.        80
11.        84
12.        89

Q1 = size of the (12 + 1)/4 = 3.25th item = 3rd item + 0.25 (4th item – 3rd item) = 37 + 0.25 (43 – 37)
Q1 = 38.5

Q3 = size of the 3(12 + 1)/4 = 9.75th item = 9th item + 0.75 (10th item – 9th item) = 72 + 0.75 (80 – 72)
Q3 = 78

$QD = \dfrac{1}{2}(Q_3 - Q_1) = \dfrac{1}{2}(78 - 38.5)$
QD = 19.75

$\text{Coef. QD} = \dfrac{Q_3 - Q_1}{Q_3 + Q_1} = \dfrac{78 - 38.5}{78 + 38.5}$
= 0.339
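The fractional-position interpolation used in this problem (for instance, the 3.25th item) can be sketched in Python. The helper names `item_at` and `quartile_deviation` are illustrative, and the sketch assumes enough observations for both quartile positions to fall inside the data.

```python
def item_at(sorted_values, position):
    """Value of the 1-based, possibly fractional, position-th item."""
    whole = int(position)
    fraction = position - whole
    value = sorted_values[whole - 1]
    if fraction:
        # Move the fractional part of the way towards the next item.
        value += fraction * (sorted_values[whole] - value)
    return value


def quartile_deviation(values):
    """Return (Q1, Q3, QD, coefficient of QD) for individual observations."""
    data = sorted(values)
    n = len(data)
    q1 = item_at(data, (n + 1) / 4)
    q3 = item_at(data, 3 * (n + 1) / 4)
    return q1, q3, (q3 - q1) / 2, (q3 - q1) / (q3 + q1)


marks = [25, 30, 37, 43, 48, 54, 61, 67, 72, 80, 84, 89]
q1, q3, qd, coef = quartile_deviation(marks)
print(q1, q3, qd, round(coef, 3))
```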

3. Compute the quartile deviation and its coefficient for the data given below.

x      f      cf
58     15     15
59     20     35
60     32     67   ← Q1
61     35     102
62     33     135
63     22     157  ← Q3
64     20     177
65     10     187
66     8      195
       N = 195

$Q_1 = \text{size of the } \dfrac{(n + 1)}{4}^{th} \text{ item} = \dfrac{195 + 1}{4}$ = 49th item

The 49th item is covered by cf 67, which gives Q1 = 60.

$Q_3 = \text{size of the } \dfrac{3(n + 1)}{4}^{th} \text{ item} = \dfrac{3 \times 196}{4}$ = 147th item

It lies in cf 157. Against cf 157, Q3 = 63.

$QD = \dfrac{1}{2}(Q_3 - Q_1) = \dfrac{1}{2}(63 - 60)$
QD = 1.5

$\text{Coef. QD} = \dfrac{Q_3 - Q_1}{Q_3 + Q_1} = \dfrac{63 - 60}{63 + 60} = \dfrac{3}{123}$
= 0.024

Merits of Quartile Deviation
• It is very easy to compute.
• It is not affected by extreme values of the variable.
• It is not at all affected by open-end class intervals.

Demerits of Quartile Deviation
• It completely ignores the portions below the lower quartile and above the upper quartile.
• It is not capable of further mathematical treatment.
• It is greatly affected by fluctuations in sampling.
• It is only a positional average, not a mathematical average.

Session – 8 Measures of Dispersions

Mean Deviation
Mean deviation is the average of the differences of the items in a series from the mean, median or mode of that series. It is concerned with the extent to which the values are dispersed about the mean, the median or the mode. It is found by averaging all the deviations from the chosen measure of central tendency. These deviations are taken into the computation without regard to sign, i.e. as absolute values. Theoretically, the deviations are preferably taken from the median rather than from the mean or mode.

Merits of Mean Deviation
• It is rigidly defined and easy to compute.
• It takes all items into consideration and gives weight to deviations according to their size.
• It is less affected by extreme values.
• It removes all irregularities by obtaining deviations and provides a correct measure.

Demerits of Mean Deviation
• It is not suitable for algebraic treatment.
• It is always positive because the signs of the deviations are ignored, which is not justified mathematically.
• It is not a satisfactory measure when the deviations are taken from the mode.
• It is not suitable when the class intervals are open-ended.

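Formula to compute Mean Deviation
If xi is a variate that takes the values x1, x2, x3, …, xn with average A (mean, median or mode), then the mean deviation from the average A is defined by

$MD = \dfrac{\sum |x_i - A|}{N}$

For grouped data

$MD = \dfrac{\sum f\,|x_i - A|}{N}$

$\text{Coefficient of MD} = \dfrac{MD}{A}$

where A is the average (mean, median or mode) from which the deviations were taken.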

1. Compute the MD and the coefficient of MD (CMD) from the mean for the data given below.

x      |d| = |xi – x̄|
21     26.55
32     15.55
38     9.55
41     6.55
49     1.45
54     6.45
59     11.45
66     18.45
68     20.45
Σx = 428    Σ|xi – x̄| = Σ|d| = 116.45

$\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{N} = \dfrac{428}{9} = 47.55$

$MD = \dfrac{\sum |x_i - \bar{x}|}{N} = \dfrac{116.45}{9}$

MD = δ = 12.94

$\text{Coefficient of MD} = \dfrac{MD}{\text{Avg}} = \dfrac{12.94}{47.55} = 0.272$

2. Following are the wages of workers. Find the mean deviation from the median and its coefficient.

Wages (x)    Wages in ascending order    |xi – Me| = |xi – 47|
59           17                          30
32           22                          25
67           25                          22
43           32                          15
22           43                          4
17           47  ← Me                    0
64           55                          8
55           59                          12
47           64                          17
80           67                          20
25           80                          33
                                         Σ|xi – Me| = 186

$\text{Median} = \left(\dfrac{N + 1}{2}\right)^{th} \text{ item} = \left(\dfrac{11 + 1}{2}\right)^{th}$ item = 6th item
Me = 47

$MD = \dfrac{\sum |x_i - Me|}{N} = \dfrac{186}{11} = 16.91$

$\text{Coefficient of MD} = \dfrac{MD}{\text{Median}} = \dfrac{16.91}{47} = 0.36$
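The mean deviation about a chosen average is simply the mean of the absolute deviations, which the Python sketch below reproduces for the wages example. The function name `mean_deviation` is illustrative.

```python
def mean_deviation(values, center):
    """Mean of the absolute deviations of the values from `center`."""
    return sum(abs(v - center) for v in values) / len(values)


wages = [59, 32, 67, 43, 22, 17, 64, 55, 47, 80, 25]
median_wage = sorted(wages)[len(wages) // 2]   # 6th of 11 items (N is odd)
md = mean_deviation(wages, median_wage)
coefficient = md / median_wage
print(median_wage, round(md, 2), round(coefficient, 2))
```

Passing the mean or the mode as `center` gives the mean deviation about those averages instead.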

3. Compute the MD about the mode and its coefficient.

x             f                    |d| = |xi – Mode|    f|d|
20            6                    100                  600
40            19                   80                   1520
60            40                   60                   2400
80            23                   40                   920
100           65                   20                   1300
120 ← Mode    83 ← Modal class     0                    0
140           55                   20                   1100
160           20                   40                   800
180           9                    60                   540
              Σf = 320                                  Σf|xi – Mode| = 9180

The highest frequency is 83 and hence Z = 120.

$MD = \dfrac{\sum f\,|x_i - \text{Mode}|}{N} = \dfrac{9180}{320} = 28.69$

$\text{Coefficient of MD} = \dfrac{MD}{\text{Mode}} = \dfrac{28.69}{120} = 0.239$

4. Find the mean deviation about the median from the data given below.

Salaries            40    50    50-100    100-200    200-400
No. of Employees    22    18    10        8          2

Salaries    No. of Employees (f)    Mid value x    cf     |d| = |xi – Me|    f|d|
40          22                      40             22     10                 220
50          18                      50             40     0                  0
50-100      10                      75             50     25                 250
100-200     8                       150            58     100                800
200-400     2                       300            60     250                500
            Σf = 60                                                          Σf|xi – Me| = 1770

$\text{Median} = \left(\dfrac{N + 1}{2}\right)^{th} \text{ item} = \dfrac{60 + 1}{2} = \dfrac{61}{2} = 30.5^{th}$ item

It lies in cf 40, and against cf 40 the value is 50. Hence Me = 50.

$MD = \dfrac{\sum f\,|x_i - \text{Median}|}{N} = \dfrac{1770}{60}$
MD = 29.5

$\text{Coefficient of MD} = \dfrac{MD}{\text{Median}} = \dfrac{29.5}{50} = 0.59$

Session – 9 Measures of Dispersions

Standard Deviation
Standard deviation is the square root of the sum of the squares of the deviations divided by their number. It is also called the mean square error deviation or root mean square deviation. It is a second moment of dispersion. Since the sum of squares of the deviations from the mean is a minimum, the deviations are taken only from the mean (not from the median or mode). The standard deviation is the Root Mean Square (RMS) average of all the deviations from the mean. It is denoted by sigma (σ).

Characteristics of standard deviation
1. Standard deviation and the coefficient of variation possess all the properties which a good measure of dispersion should possess.
2. The process of squaring the deviations eliminates negative signs and makes mathematical computation easy.

Merits
1. It is based on all observations.
2. It can be smoothly handled algebraically.
3. It is a well defined and definite measure of dispersion.
4. It is of great importance when we are comparing the variability of two series.

Demerits
1. It is difficult to calculate and understand.
2. It gives more weightage to extreme values as the deviations are squared.
3. It is not useful in economic studies.

Standard deviation
If the variate xi takes the values x1, x2, …, xn, the standard deviation, denoted by σ, is defined by

$\sigma = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{N}}$

The quantity σ² is called the variance.

Alternate Expressions
For raw data

$\sigma^2 = \dfrac{\sum x^2}{n} - (\bar{x})^2$

For grouped data

$\sigma^2 = \dfrac{\sum f x^2}{N} - (\bar{x})^2$

For grouped data with the step deviation method, where d = (x – a)/h,

$\sigma = h\sqrt{\dfrac{\sum f d^2}{N} - \left(\dfrac{\sum f d}{N}\right)^2}$

Coefficient of variation
It is defined as the ratio of the standard deviation to the mean. In percentage form, the CV is given by

$CV = \dfrac{\sigma}{\bar{x}} \times 100$

123
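The definitions above can be checked numerically. The following is a minimal Python sketch, assuming population (divide-by-N) formulas as used in these notes; the function names are illustrative only, and the marks are the ten values used in Problem 1 of this session.

```python
import math

def std_dev(values):
    """Population standard deviation: RMS deviation from the mean."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / n)

def variance_shortcut(values):
    """Alternate expression for raw data: sum(x^2)/n - mean^2."""
    n = len(values)
    mean = sum(values) / n
    return sum(x * x for x in values) / n - mean ** 2

def coeff_of_variation(values):
    """Coefficient of variation in percent: (sigma / mean) * 100."""
    mean = sum(values) / len(values)
    return std_dev(values) / mean * 100

marks = [5, 10, 20, 25, 40, 42, 45, 48, 70, 80]
sigma = std_dev(marks)
cv = coeff_of_variation(marks)
print(round(sigma, 3), round(cv, 1))
```

Both expressions for the variance give the same value, which is why the shortcut form is convenient when only Σx and Σx² are known.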

Problems

1. Ten students of a class have obtained the following marks in a particular subject out of 100. Calculate SD and CV for the given data.

Sl. No.   Marks (x)   d = (xi − x̄), x̄ = 38.5   d² = (xi − x̄)²
1         5           -33.5                     1122.25
2         10          -28.5                     812.25
3         20          -18.5                     342.25
4         25          -13.5                     182.25
5         40          1.5                       2.25
6         42          3.5                       12.25
7         45          6.5                       42.25
8         48          9.5                       90.25
9         70          31.5                      992.25
10        80          41.5                      1722.25
Total     Σx = 385                              Σd² = Σ(xi − x̄)² = 5320.50

x̄ = Σx/N = 385/10 = 38.5

σ = $\sqrt{\frac{\sum (x_i - \bar{x})^2}{N}}$ = √(5320.5/10) = 23.066

CV = (σ/x̄) × 100 = (23.066/38.5) × 100 = 59.9%

2. Compute the standard deviation and coefficient of variation for the following data of 100 students' marks.

Class     f     Class boundaries   Mid point (x)   d     fd     fd²
1 – 10    3     0.5 – 10.5         5.5             -2    -6     12
11 – 20   16    10.5 – 20.5        15.5            -1    -16    16
21 – 30   26    20.5 – 30.5        25.5            0     0      0
31 – 40   31    30.5 – 40.5        35.5            1     31     31
41 – 50   16    40.5 – 50.5        45.5            2     32     64
51 – 60   8     50.5 – 60.5        55.5            3     24     72
Total     N = Σf = 100                                   Σfd = 65   Σfd² = 195

Here a = 25.5, h = 10 and d = (x − a)/h = (x − 25.5)/10. For example, for x = 15.5, d = (15.5 − 25.5)/10 = −10/10 = −1.

x̄ = a + h (Σfd/N) = 25.5 + 10 (65/100) = 25.5 + 6.5 = 32

σ = $h \sqrt{\frac{\sum f d^2}{N} - \left(\frac{\sum f d}{N}\right)^2}$ = 10 √(195/100 − (65/100)²) = 12.359

CV = (σ/x̄) × 100 = (12.359/32) × 100 = 38.62%
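The step deviation method used in Problem 2 can be sketched in Python as follows. This is an illustrative helper (the name `step_deviation_stats` is an assumption, not from the notes) that takes the class mid points, frequencies, assumed mean a and class width h.

```python
import math

def step_deviation_stats(midpoints, freqs, a, h):
    """Mean and SD of grouped data via d = (x - a) / h."""
    n = sum(freqs)
    d = [(x - a) / h for x in midpoints]
    sum_fd = sum(f * di for f, di in zip(freqs, d))
    sum_fd2 = sum(f * di ** 2 for f, di in zip(freqs, d))
    mean = a + h * sum_fd / n
    sigma = h * math.sqrt(sum_fd2 / n - (sum_fd / n) ** 2)
    return mean, sigma

# Data of Problem 2: mid points and frequencies of the six classes
mid = [5.5, 15.5, 25.5, 35.5, 45.5, 55.5]
f = [3, 16, 26, 31, 16, 8]
mean, sigma = step_deviation_stats(mid, f, a=25.5, h=10)
cv = sigma / mean * 100
print(round(mean, 2), round(sigma, 3), round(cv, 2))
```

The choice of a does not affect the result; picking the mid point of a central class simply keeps the values of d small.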

3. The AM and SD of a set of nine items are 43 and 5 respectively. If an item of value 63 is added, find the mean and SD of the ten items.

x̄ = Σx/N, so Σx = x̄ × N = 43 × 9 = 387 for 9 items.
Σx = 387 + 63 = 450 for 10 items.
Modified mean x̄ = Σx/N = 450/10 = 45

For 9 items, σ = 5 and x̄ = 43:
σ² = Σx²/N − (x̄)²
25 = Σx²/9 − (43)² = Σx²/9 − 1849
Σx²/9 = 25 + 1849 = 1874
Σx² = 1874 × 9 = 16866 for 9 items.

If 63 is added, Σx² = 16866 + (63)² = 20835 for 10 items.
Modified σ² = Σx²/N − (x̄)² = 20835/10 − (45)² = 2083.5 − 2025 = 58.5
Modified σ = √58.5 = 7.65 is the modified SD.

4. The mean of 5 observations is 4.4 and the variance is 8.24. If three of the five observations are 1, 2 and 6, find the values of the other two observations.

We know that x̄ = Σx/N, so 4.4 = Σx/5 and Σx = 22.

σ² = Σx²/N − (x̄)²
8.24 = Σx²/5 − (4.4)² = Σx²/5 − 19.36
Σx²/5 = 8.24 + 19.36 = 27.6
Σx² = 138

Σx² = 1² + 2² + 6² + x1² + x2²
138 = 1 + 4 + 36 + x1² + x2²
x1² + x2² = 97 ---- (1)

Σx = 1 + 2 + 6 + x1 + x2
22 = 9 + x1 + x2
x1 + x2 = 13 ---- (2)

From (2), x2 = 13 − x1. Substituting in (1):
x1² + (13 − x1)² = 97
x1² + 169 + x1² − 26x1 = 97
2x1² − 26x1 + 72 = 0
x1² − 13x1 + 36 = 0

x1 = [−b ± √(b² − 4ac)] / 2a = [13 ± √(169 − 4 × 36)] / 2 = (13 ± 5)/2 = 6.5 ± 2.5
x1 = 9 or x1 = 4

Taking x1 = 9 gives x2 = 4, so the other two observations are 9 and 4.

5. The mean and S.D. of the frequency distribution of a continuous random variable x are 40.604 and 7.92 respectively. The distribution after a change of origin and scale is given below. Determine the actual class intervals.

d    -3   -2   -1   0    1    2    3    4
f    3    15   45   57   50   36   25   9

d       f     fd     fd²    Mid value (MV)   Class interval (CI)
-3      3     -9     27     22.5             20-25
-2      15    -30    60     27.5             25-30
-1      45    -45    45     32.5             30-35
0       57    0      0      37.5             35-40
1       50    50     50     42.5             40-45
2       36    72     144    47.5             45-50
3       25    75     225    52.5             50-55
4       9     36     144    57.5             55-60
Total   N = 240   Σfd = 149   Σfd² = 695

x̄ = a + h (Σfd/N)
40.604 = a + h (149/240)
40.604 = a + 0.62h ----- (1)

σ = $h \sqrt{\frac{\sum f d^2}{N} - \left(\frac{\sum f d}{N}\right)^2}$
7.92 = h √(695/240 − (149/240)²) = h √(2.895 − 0.385)
7.92 = h × 1.584
h = 4.998, that is, h = 5

Putting h = 5 in equation (1): 40.604 = a + 0.62 × 5, so a = 37.5.

The mid values are therefore x = a + hd = 37.5 + 5d and each class has width 5, which gives the MV and CI columns of the table above.
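As a cross-check on Problem 5, the scale h and origin a can be recovered numerically from the coded values. The sketch below is illustrative only; it assumes the relations x̄ = a + h·mean(d) and σ = h·SD(d) used in the solution, with variable names chosen here for convenience.

```python
import math

d = [-3, -2, -1, 0, 1, 2, 3, 4]        # coded values after change of origin and scale
f = [3, 15, 45, 57, 50, 36, 25, 9]     # frequencies
mean_x, sigma_x = 40.604, 7.92         # given mean and SD of the original variable

n = sum(f)
mean_d = sum(fi * di for fi, di in zip(f, d)) / n
sd_d = math.sqrt(sum(fi * di ** 2 for fi, di in zip(f, d)) / n - mean_d ** 2)

h = sigma_x / sd_d            # sigma_x = h * sd_d
a = mean_x - h * mean_d       # mean_x = a + h * mean_d

# Mid values and class intervals implied by the rounded h and a
h_r, a_r = round(h), round(a, 1)
mid_values = [a_r + h_r * di for di in d]
intervals = [(m - h_r / 2, m + h_r / 2) for m in mid_values]
print(h_r, a_r, intervals[0], intervals[-1])
```

Rounding h to 5 and a to 37.5 reproduces classes of width 5 running from 20-25 up to 55-60.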

Combined Standard Deviation
Suppose we have different samples of sizes n1, n2, n3, ... having means x̄1, x̄2, x̄3, ... and standard deviations σ1, σ2, σ3, .... The combined standard deviation can be computed by the following formula (written here for two samples):

$\sigma^2 (n_1 + n_2) = n_1(\sigma_1^2 + d_1^2) + n_2(\sigma_2^2 + d_2^2)$, where $d_1 = \bar{x}_1 - \bar{x}$, $d_2 = \bar{x}_2 - \bar{x}$ and $\bar{x}$ is the combined mean.

1. The means of two samples of sizes 50 and 100 are 54.1 and 50.3 respectively and their standard deviations are 8 and 7 respectively. Obtain the SD for the combined group.

n1 = 50, n2 = 100, x̄1 = 54.1, x̄2 = 50.3, σ1 = 8, σ2 = 7

x̄ = (n1 x̄1 + n2 x̄2)/(n1 + n2) = (50 × 54.1 + 100 × 50.3)/(50 + 100) = 7735/150 = 51.57

d1 = x̄1 − x̄ = 54.1 − 51.57 = 2.53, d1² = 6.40
d2 = x̄2 − x̄ = 50.3 − 51.57 = −1.27, d2² = 1.61

150 σ² = 50 (8² + 6.40) + 100 (7² + 1.61)
3σ² = (64 + 6.40) + 2 (49 + 1.61) = 70.40 + 2 × 50.61 = 171.62
σ² = 57.21, σ = 7.56

2. The mean wage is Rs. 75 per day and the SD of wages is Rs. 5 per day for a group of 1000 workers; the same are Rs. 60 and Rs. 4.5 for another group of 1500 workers. Find the mean and standard deviation for the entire group.

We have by data, x̄1 = 75, σ1 = 5, n1 = 1000; x̄2 = 60, σ2 = 4.5, n2 = 1500.
Let x̄ and σ be the mean and SD of the entire group.

x̄ = (n1 x̄1 + n2 x̄2)/(n1 + n2) = (1000 × 75 + 1500 × 60)/(1000 + 1500) = 165000/2500 = 66

Also we have (n1 + n2) σ² = n1 (σ1² + d1²) + n2 (σ2² + d2²), where d1 = x̄1 − x̄ = 75 − 66 = 9 and d2 = x̄2 − x̄ = 60 − 66 = −6.

∴ (1000 + 1500) σ² = 1000 (5² + 9²) + 1500 (4.5² + (−6)²) = 106000 + 84375 = 190375
∴ σ² = 76.15 or σ = 8.73
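The combined mean and combined standard deviation for two groups can be sketched in Python as below. The function name `combined_mean_sd` is hypothetical; the code simply applies the two-sample formula stated in this section to the data of the two problems.

```python
import math

def combined_mean_sd(n1, mean1, sd1, n2, mean2, sd2):
    """Combined mean and SD of two groups from their sizes, means and SDs."""
    n = n1 + n2
    mean = (n1 * mean1 + n2 * mean2) / n
    d1, d2 = mean1 - mean, mean2 - mean
    variance = (n1 * (sd1 ** 2 + d1 ** 2) + n2 * (sd2 ** 2 + d2 ** 2)) / n
    return mean, math.sqrt(variance)

# Problem 1: two samples of sizes 50 and 100
m_samples, sd_samples = combined_mean_sd(50, 54.1, 8, 100, 50.3, 7)

# Problem 2: wages of 1000 and 1500 workers
m_wages, sd_wages = combined_mean_sd(1000, 75, 5, 1500, 60, 4.5)

print(round(m_samples, 2), round(sd_samples, 2))
print(round(m_wages, 2), round(sd_wages, 2))
```

Note that the combined variance is larger than a simple weighted average of the two variances whenever the group means differ, because of the d1² and d2² terms.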

3. The arithmetic means of the runs scored by three batsmen A, B and C are 50, 48 and 12 respectively. The SDs of their runs are 15, 12 and 2 respectively. Who is the most consistent of the three batsmen? If one of these three is to be selected, who is to be selected?

          A     B     C
AM (x̄)    50    48    12
SD (σ)    15    12    2

CVA = (σA/x̄A) × 100 = (15/50) × 100 = 30%
CVB = (σB/x̄B) × 100 = (12/48) × 100 = 25%
CVC = (σC/x̄C) × 100 = (2/12) × 100 = 16.67%

Evaluation Criteria
1. A lower CV indicates a more consistent player, hence the most consistent player is Player C.
2. The highest run scorer is Player A, with x̄A = 50.

4. The coefficients of variation of two series are 75% and 90% with SDs 15 and 18 respectively. Compute their means.

CVA = 75%, CVB = 90%, σA = 15, σB = 18

CV = (σ/x̄) × 100

75 = (15/x̄A) × 100, so x̄A = 20
90 = (18/x̄B) × 100, so x̄B = 20

5. Goals scored by two teams A and B in a football season are as shown below. By calculating the CV of each, find which team may be considered more consistent.

No. of goals (x)   Matches, Team A (f)   Matches, Team B (f)   fx (A)   fx (B)   fx² (A)   fx² (B)
0                  27                    17                    0        0        0         0
1                  9                     9                     9        9        9         9
2                  8                     6                     16       12       32        24
3                  5                     5                     15       15       45        45
4                  4                     3                     16       12       64        48
Total              N = 53                N = 40                56       48       150       126

x̄A = Σfx/N = 56/53 = 1.0566
x̄B = Σfx/N = 48/40 = 1.2

σA² = Σfx²/N − (x̄A)² = 150/53 − (1.0566)² = 2.8302 − 1.1164 = 1.7138, so σA = 1.3091
σB² = Σfx²/N − (x̄B)² = 126/40 − (1.2)² = 3.15 − 1.44 = 1.71, so σB = 1.3077

CVA = (σA/x̄A) × 100 = (1.3091/1.0566) × 100 = 123.9%
CVB = (σB/x̄B) × 100 = (1.3077/1.2) × 100 = 109%

Since CVB < CVA, team B is the more consistent team.

6. The prices of shares A and B are given below. State which share is more stable in value.

Price A (x)   (xi − x̄), x̄ = 53   (xi − x̄)²   Price B (x)   (xi − x̄), x̄ = 105   (xi − x̄)²
55            2                   4           108           3                    9
54            1                   1           107           2                    4
52            -1                  1           105           0                    0
53            0                   0           105           0                    0
56            3                   9           106           1                    1
58            5                   25          107           2                    4
52            -1                  1           104           -1                   1
50            -3                  9           103           -2                   4
51            -2                  4           104           -1                   1
49            -4                  16          101           -4                   16
Σx = 530                          Σ(xi − x̄)² = 70   Σx = 1050                    Σ(xi − x̄)² = 40

x̄A = Σx/N = 530/10 = 53
x̄B = Σx/N = 1050/10 = 105

σA = √(70/10) = 2.646
σB = √(40/10) = 2

CVA = (σA/x̄A) × 100 = (2.646/53) × 100 = 4.99%
CVB = (σB/x̄B) × 100 = (2/105) × 100 = 1.90%

Since CVB is less than CVA, share B is more stable.

7. A student, while computing the coefficient of variation, obtained the mean and SD of 100 observations as 40 and 5.1 respectively. It was later discovered that he had wrongly copied an observation as 50 instead of 40. Calculate the correct coefficient of variation.

>> x̄ = Σx/n, i.e. 40 = Σx/100

∴ Σx (incorrect) = 4000
Now correct Σx = 4000 − 50 + 40 = 3990
∴ correct x̄ = 3990/100 = 39.9

Let us consider σ² = Σx²/n − (x̄)²
(5.1)² = Σx²/100 − (40)²
i.e. (40)² + (5.1)² = Σx²/100, or Σx²/100 = 1626.01

∴ Σx² (incorrect) = 100 × 1626.01 = 162601
Now correct Σx² = 162601 − (50)² + (40)² = 161701
∴ correct σ² = (correct Σx²)/n − (correct x̄)²
i.e. correct σ² = 161701/100 − (39.9)² = 1617.01 − 1592.01 = 25, so correct σ = 5

Now correct coefficient of variation = (σ/x̄) × 100 = (5/39.9) × 100 = 12.53%
Hence correct C.V. = 12.53%
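The correction procedure of Problem 7 (recover Σx and Σx² from the reported mean and SD, swap the wrong value for the right one, then recompute) can be sketched in Python. The helper name `corrected_stats` is an assumption made for illustration.

```python
import math

def corrected_stats(n, mean, sd, wrong, right):
    """Mean and SD after replacing one miscopied observation."""
    sum_x = n * mean - wrong + right
    sum_x2 = n * (sd ** 2 + mean ** 2) - wrong ** 2 + right ** 2
    new_mean = sum_x / n
    new_var = sum_x2 / n - new_mean ** 2
    return new_mean, math.sqrt(new_var)

# Problem 7: 100 observations, reported mean 40 and SD 5.1, 50 copied instead of 40
new_mean, new_sd = corrected_stats(100, 40, 5.1, wrong=50, right=40)
new_cv = new_sd / new_mean * 100
print(round(new_mean, 2), round(new_sd, 2), round(new_cv, 2))
```

The same idea handles an omitted observation: subtract the value from Σx and its square from Σx², and reduce n by one.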

8. The mean and SD of 21 observations are 30 and 5 respectively. It was subsequently noted that one of the observations, 10, was incorrect. Omit it and determine the mean and SD of the rest.

>> x̄ = Σx/n, i.e. 30 = Σx/21, or Σx = 630

∴ incorrect Σx = 630
Now omitting the incorrect value 10, new Σx = 630 − 10 = 620 and new n = 21 − 1 = 20.
New x̄ = 620/20 = 31

Next consider σ² = Σx²/n − (x̄)²
(5)² = Σx²/21 − (30)²
i.e. 900 + 25 = Σx²/21

∴ incorrect Σx² = 925 × 21 = 19425
Again omitting the incorrect value 10, new Σx² = 19425 − (10)² = 19325, n = 20.

Hence new σ² = (new Σx²)/20 − (new x̄)² = 19325/20 − (31)² = 966.25 − 961 = 5.25
∴ New σ = √5.25 = 2.29

9. The mean of 200 items was 50. Later on it was discovered that two items were misread as 92 and 8 instead of 192 and 88. Find out the correct mean.

>> x̄ = Σx/n, i.e. 50 = Σx/200, or Σx = 10000

∴ incorrect Σx = 10000
Correct Σx = 10000 − 92 − 8 + 192 + 88 = 10180
∴ Correct mean = 10180/200 = 50.9

10. Find the missing frequencies in the following data given that the median is 137.2.

Class       100-110   110-120   120-130   130-140   140-150   150-160   160-170   170-180
Frequency   15        44        133       f1        125       f2        35        16

N = 600

>> We prepare the table with the column of cumulative frequencies and use the formula for median.

Class     Frequency   cf
100-110   15          15
110-120   44          59
120-130   133         192
130-140   f1          192 + f1          (median class)
140-150   125         317 + f1
150-160   f2          317 + f1 + f2
160-170   35          352 + f1 + f2
170-180   16          368 + f1 + f2
          N = 600

Median = $l + \frac{h}{f}\left(\frac{N}{2} - c\right)$

We can take the median class as 130-140 since the median is given to be 137.2. Then l = 130, h = 10, f = f1, c = 192.

∴ 137.2 = 130 + (10/f1)(300 − 192)
i.e. 137.2 − 130 = 1080/f1, i.e. 7.2 f1 = 1080, or f1 = 150

But the last cumulative frequency must be equal to N = 600,
i.e. 368 + f1 + f2 = 600
368 + 150 + f2 = 600
∴ f2 = 82

Thus f1 = 150, f2 = 82

Relationship between various measures of dispersion
For a normal (or nearly normal) distribution we have the following relationships among the various measures of dispersion:
1. Mean ± QD covers 50% of the observations of the distribution
2. Mean ± MD covers 57.5% of the observations
3. Mean ± 1σ includes 68.27% of the observations
4. Mean ± 2σ includes 95.45% of the observations
5. Mean ± 3σ includes 99.73% of the observations
6. QD = 0.6745 σ ≈ (2/3) σ
7. MD = √(2/π) σ ≈ (4/5) σ
8. QD = (5/6) MD
9. Combining the results we get 3 QD = 2 SD and 5 MD = 4 SD, which is also equal to 6 QD.
10. Range = 6 times SD.

SOURCES AND REFERENCES
8. Statistics for Management, Richard I Levin, PHI / 2000.
9. Statistics, RSN Pillai and Bagavathi, S. Chands, Delhi.
10. An Introduction to Statistical Method, C.B. Gupta, & Vijaya Gupta, Vikasa Publications, 23e/2006.
11. Business Statistics, C.M. Chikkodi and Salya Prasad, Himalaya Publications, 2000.
12. Statistics, D.C. Sancheti and Kappor, Sultan Chand and Sons, New Delhi, 2004.
13. Fundamentals of Statistics, D.N. Elhance and Veena and Aggarwal, KITAB Publications, Kolkata, 2003.
14. Business Statistics, Dr. J.S. Chandan, Prof. Jagit Singh and Kanna, Vikas Publications, 2006.

CORRELATION ANALYSIS

Concept and Importance of Correlation
We may come across certain series wherein there may be more than one variable. A distribution in which each unit assumes two values, one for each of two variables, is called a Bivariate Distribution. If we measure more than two variables on each unit of a distribution, it is called a Multivariate Distribution. In a bivariate distribution, we may be interested to find if there is any relationship between the two variables under study. Correlation is a statistical tool which studies the relationship between two variables, and correlation analysis involves various methods and techniques used for studying and measuring the extent of the relationship between the two variables. Correlation analysis is used as a statistical tool to ascertain the association between two variables.

“When the relationship is of a quantitative nature, the appropriate statistical tool for discovering & measuring the relationship and expressing it in a brief formula is known as correlation.” - Croxton & Cowden

“Correlation is an analysis of the covariation between two or more variables.” - A. M. Tuttle

“Correlation Analysis contributes to the understanding of economic behaviour, aids in locating the critically important variables on which others depend, may reveal to the economist the connections by which disturbances spread and suggest to him the paths through which stabilizing forces may become effective.” - W. A. Neiswanger

“The effect of correlation is to reduce the range of uncertainty of our prediction.” - Tippett

The problem in analyzing the association between two variables can be broken down into three steps.
o We try to know whether the two variables are related or independent of each other.
o If we find that there is a relationship between the two variables, we try to know its nature and strength. This means whether these variables have a positive or a negative relationship and how close that relationship is.
o We may like to know if there is a causal relationship between them. This means that the variation in one variable causes variation in another.

When data regarding two or more variables are available, we may study the related variation of these variables. For example, in data regarding heights (x) and weights (y) of students of a college, we find that those students who have greater height would have greater weight. Also, students who have lesser height would have lesser weight. This type of related variation among variables is called correlation.

Correlation may be (i) Simple correlation (ii) Multiple correlation (iii) Partial correlation. Simple correlation is concerned with related variation among two variables. Multiple correlation and partial correlation are concerned with related variation among three or more variables.

Two variables are said to be correlated when they vary such that
a. The higher values of one variable correspond to the higher values of the other and the lower values of the variable correspond to the lower values of the other, or
b. The higher values of one variable correspond to the lower values of the other.

Generally, it can be seen that those who are tall will have greater weight, and those who are short will have lesser weight. Thus height (x) and weight (y) of persons show related variation, and so they are correlated. On the other hand, production (x) and price (y) of vegetables show variation in opposite directions. Here the higher the production, the lower would be the price.

In both the above examples, the variables x and y show related variation. And so they are correlated. TYPES OF CORRELATION Correlation is positive (direct) if the variables vary in the same directions, that is, if they increase and decrease together. Height (x) and weight (y) of persons are positively correlated. Correlation is negative (inverse) if the variables vary in the opposite directions, that is, if one variable increases the other variable decreases. Production (x) and price (y) of vegetables are negatively correlated. If variables do not show related variation, they are said to be non – correlated. If variables show exact linear relationship, they are said to be perfectly correlated. Perfect correlation may be positive or negative.

Correlation and Causation o The correlation may be due to chance particularly when the data pertain to a small sample. o It is possible that both the variables are influenced by one or more other variables. o There may be another situation where both the variables may be influencing each other so that we cannot say which is the cause and which is the effect.

Types of Correlation
o Positive and Negative: If the values of the two variables deviate in the same direction, i.e., if an increase in the values of one variable results, on an average, in a corresponding increase in the values of the other variable, or if a decrease in the values of one variable results, on an average, in a corresponding decrease in the values of the other variable, correlation is said to be positive or direct. For example: Price & Supply of the commodity. On the other hand, correlation is said to be negative or inverse if the variables deviate in opposite directions, i.e., if the increase (decrease) in the values of one variable results, on the average, in a corresponding decrease (increase) in the values of the other variable. For example: Temperature and Sale of Woolen Garments.
o Linear and Non-Linear: The correlation between two variables is said to be linear if, corresponding to a unit change in one variable, there is a constant change in the other variable over the entire range of the values. For example: y = ax + b. The relationship between two variables is said to be non-linear or curvilinear if, corresponding to a unit change in one variable, the other variable does not change at a constant rate but at a fluctuating rate. When such a relationship is plotted on a graph it will not be a straight line.
o Simple, Partial and Multiple: The distinction amongst these three types of correlation depends upon the number of variables involved in a study. If only two variables are involved in a study, then the correlation is said to be simple correlation. When three or more variables are involved in a study, then it is a problem of either partial or multiple correlation. In multiple correlation, three or more variables are studied simultaneously. But in partial correlation we consider only two variables influencing each other while the effect of the other variable is held constant. For example, suppose that we have three variables: number of hours studied (x), IQ (y) and marks obtained (z). In a multiple correlation we will study the correlation of z with the two variables x & y. In contrast, when we study the relationship between x & z, keeping an average IQ as constant, it is said to be a study involving partial correlation.

Methods of Correlation
Graphic method: Scatter Diagram
Algebraic methods: Covariance Method, Rank Correlation, Concurrent Deviation Method

Process of Calculating Coefficient of Correlation

o Calculate the means of the two series, X and Y.
o Take deviations in the two series from their respective means, indicated as x and y. The deviation should be taken in each case as the value of the individual item minus the arithmetic mean.
o Square the deviations in both the series and obtain the sums of the squared-deviation columns. This gives ∑x² and ∑y².
o Take the products of the deviations and sum them to get ∑xy. That is, each deviation is multiplied by the corresponding deviation in the other series and the products are then added.
o The values obtained in the preceding steps, ∑xy, ∑x² and ∑y², are used in the formula for correlation.

SCATTER DIAGRAM METHOD
A scatter diagram is a graphic presentation of bivariate data. Here, bivariate data with n pairs of values is represented by n points on the xy – plane. The two variables are taken along the two axes, and every pair of values in the data is represented by a point on the graph. The pattern of distribution of points on the graph can be used for a rough estimation of the degree of correlation between the variables. In the scatter diagram –
a. If the points form a line with positive slope (a line moving upwards), the variables are positively and perfectly correlated.
b. If the points form a line with negative slope (a line moving downwards), the variables are negatively and perfectly correlated.
c. If the points cluster around a line with positive slope, the variables are positively correlated.
d. If the points cluster around a line with negative slope, the variables are negatively correlated.
e. If the points are spread all over the graph, the variables are non correlated.
f. Any other curve-like form of spread of points indicates a curvilinear relation between the variables.

The scatter diagram is one of the simplest ways of diagrammatic representation of a bivariate distribution and provides us one of the simplest tools for ascertaining the correlation between two variables. Suppose we are given n pairs of values of two variables X and Y. For example, if the variables X and Y denote height and weight respectively, then the pairs may represent the heights and weights (in pairs) of n individuals. These n points may be plotted as dots (.) in the xy – plane. (It is customary to take the dependent variable along the y – axis and the independent variable along the x – axis.) The diagram of dots so obtained is known as a scatter diagram. From the scatter diagram we can form a fairly good, though rough, idea about the relationship between the two variables. The following points may be borne in mind in interpreting the scatter diagram regarding the correlation between the two variables:
1. If the points are very dense, i.e. very close to each other, a fairly good amount of correlation may be expected between the two variables. On the other hand, if the points are widely scattered, a poor correlation may be expected between them.
2. If the points on the scatter diagram reveal any trend (either upward or downward), the variables are said to be correlated and if no trend is revealed, the variables are uncorrelated.
3. If there is an upward trend rising from the lower left hand corner and going upward to the upper right hand corner, the correlation is positive since this reveals that the values of the two variables move in the same direction. If, on the other hand, the points depict a downward trend from the upper left hand corner, the correlation is negative since in this case the values of the two variables move in opposite directions.
4. In particular, if all the points lie on a straight line starting from the left bottom and going up towards the right top, the correlation is perfect and positive, and if all the points lie on a straight line starting from the left top and coming down to the right bottom, the correlation is perfect and negative.
5. The method of scatter diagram is readily comprehensible and enables us to form a rough idea of the nature of the relationship between the two variables merely by inspection of the graph. Moreover, this method is not affected by extreme observations, whereas all mathematical formulae for ascertaining correlation between two variables are affected by extreme observations. However, this method is not suitable if the number of observations is fairly large.
6. The method of scatter diagram tells us about the nature of the relationship, whether it is positive or negative and whether it is high or low. It does not provide us an exact measure of the extent of the relationship between the two variables.
7. The scatter diagram enables us to obtain an approximate estimating line or line of best fit by the free hand method. The method generally consists in stretching a piece of thread through the plotted points to locate the best possible line.

KARL PEARSON’S COEFFICIENT OF CORRELATION (COVARIANCE METHOD; PRODUCT MOMENT)
This is a measure of the linear relationship between the two variables. It indicates the degree of correlation between the two variables. It is denoted by ‘r’.

INTERPRETATION OF COEFFICIENT OF CORRELATION
a. A positive value of r indicates positive correlation.
b. A negative value of r indicates negative correlation.
c. r = +1 means correlation is perfect positive.
d. r = -1 means correlation is perfect negative.
e. r = 0 (or low) means the variables are non – correlated.

Karl Pearson’s measure, known as the Pearsonian correlation coefficient between two variables (series) X and Y, usually denoted by r, is a numerical measure of the linear relationship between them. It is defined as the ratio of the covariance between X and Y, written as Cov (x, y), to the product of the standard deviations of X and Y, that is, r = Cov (x, y) / [SD (x) * SD (y)].

Assumptions of the Karl Pearson’s Correlation
o The two variables X and Y are linearly related.
o The two variables are affected by several causes, which are independent, so as to form a normal distribution.

Coefficient of Determination
The strength of r is judged by the coefficient of determination, r². For r = 0.9, r² = 0.81. Multiplying by 100 gives 81 per cent. This means that when r is 0.9, 81 per cent of the total variation in the Y series can be attributed to the relationship with X.

Rank Correlation Limitations of Spearman’s Method of Correlation o Spearman’s r is a distribution-free or non parametric measure of correlation. o As such, the result may not be as dependable as in the case of ordinary correlation where the distribution is known. o Another limitation of rank correlation is that it cannot be applied to a grouped frequency distribution. o When the number of observations is quite large and one has to assign ranks to the observations in the two series, then such an exercise becomes rather tedious and time-consuming. This becomes a major limitation of rank correlation.

Some Limitations of Correlation Analysis o Correlation analysis cannot determine cause-and-effect relationship. o Another mistake that occurs frequently is on account of misinterpretation of the coefficient of correlation and the coefficient of determination. o Another mistake in the interpretation of the coefficient of correlation occurs when one concludes a positive or negative relationship even though the two variables are actually unrelated.


Properties of Correlation Coefficient

Property 1 - Limits for Correlation Coefficient
The Pearsonian correlation coefficient cannot exceed 1 numerically. In other words it lies between -1 and 1. Symbolically: – 1 ≤ r ≤ 1. r = + 1 implies perfect positive correlation between the variables.

Property 2 - Correlation Coefficient is independent of the change of origin and scale.
Mathematically, if X and Y are given and they are transformed to the new variables U and V by the change of origin and scale, viz., u = (x – A)/h and v = (y – B)/k, where A, B, h > 0 and k > 0 are constants, then the correlation coefficient between x and y is the same as the correlation coefficient between u and v, i.e., r (x, y) = r (u, v), or rxy = ruv.

Property 3 - Two independent variables are uncorrelated but the converse is not true.
Remarks: One should not confuse uncorrelation with independence. rxy = 0, i.e., uncorrelation between the variables x and y, simply implies the absence of any linear (straight line) relationship between them. They may, however, be related in some other form (other than a straight line), e.g., quadratic, logarithmic or trigonometric.

Property 4 - If the variables x and y are connected by a linear equation ax + by + c = 0, the correlation coefficient between x and y is (+1) if the signs of a and b are different and (-1) if the signs of a and b are alike.

Interpretation of r
The following general points may be borne in mind while interpreting an observed value of the correlation coefficient r:
If r = -1 there is perfect negative correlation between the variables. In this case the scatter diagram will again be a straight line.

If r = 0, the variables are uncorrelated; in other words, there is no linear (straight line) relationship between the variables. However, r = 0 does not imply that the variables are independent. For other values of r lying between + 1 and – 1 there are no set guidelines for its interpretation. The most we can conclude is that the nearer the numerical value of r is to 1, the closer is the relationship between the variables, and the nearer the value of r is to 0, the less close is the relationship between them. One should be very careful in interpreting the value of r as it is often misinterpreted. The reliability or the significance of the value of the correlation depends on a number of factors. One of the ways of testing the significance of r is finding its probable error, which in addition to the value of r takes into account the size of the sample also. Another more useful measure for interpreting the value of r is the coefficient of determination, which shows that the closeness of the relationship between two variables is not proportional to r.

In total the Properties are:
o Limits for Correlation Coefficient.
o Independent of the change of origin & scale.
o Two independent variables are uncorrelated but the converse is not true.
o If variables x & y are connected by a linear equation ax + by + c = 0, the correlation coefficient between x & y is (+1) if the signs of a, b are different & (-1) if the signs of a, b are alike.

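Important Formulas:

$r = \frac{n \sum d_x d_y - \sum d_x \sum d_y}{\sqrt{n \sum d_x^2 - (\sum d_x)^2} \, \sqrt{n \sum d_y^2 - (\sum d_y)^2}}$

$r = \frac{\sum xy}{\sqrt{\sum x^2 \cdot \sum y^2}}$

r = [Cov (x,y)] / [SD (x)*SD (y)]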
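The three forms of the formula give the same value of r for the same data. The Python sketch below illustrates this for the ten (X, Y) pairs used in Problems 1, 2 and 3 of this section; the function names and the assumed origins A = 70 and B = 60 are choices made for the illustration.

```python
import math

X = [39, 65, 62, 90, 82, 75, 25, 98, 36, 78]
Y = [47, 53, 58, 86, 62, 68, 60, 91, 51, 84]

def r_from_means(X, Y):
    """r = sum(xy) / sqrt(sum(x^2) * sum(y^2)), deviations from actual means."""
    mx, my = sum(X) / len(X), sum(Y) / len(Y)
    x = [xi - mx for xi in X]
    y = [yi - my for yi in Y]
    sxy = sum(a * b for a, b in zip(x, y))
    return sxy / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))

def r_from_assumed(X, Y, A, B):
    """Same coefficient using deviations dx = X - A, dy = Y - B from assumed means."""
    n = len(X)
    dx = [xi - A for xi in X]
    dy = [yi - B for yi in Y]
    num = n * sum(a * b for a, b in zip(dx, dy)) - sum(dx) * sum(dy)
    den = math.sqrt(n * sum(a * a for a in dx) - sum(dx) ** 2) * \
          math.sqrt(n * sum(b * b for b in dy) - sum(dy) ** 2)
    return num / den

r1 = r_from_means(X, Y)
r2 = r_from_assumed(X, Y, A=70, B=60)
r3 = r_from_assumed(X, Y, A=0, B=0)   # raw-value form: A = B = 0
print(round(r1, 4), round(r2, 4), round(r3, 4))
```

Setting A = B = 0 turns the assumed-mean formula into the raw-value formula, which is a direct illustration of the property that r is independent of the change of origin.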

The application of the formulas depends on the situation. Following are some problems which are solved using different formulas. We can notice that irrespective of the formula used the answer remains the same. Problems 1, 2 and 3 are solved with different formulas for the same data.

Problem 1: deviations taken from the actual means (X̄ = 650/10 = 65, Ȳ = 660/10 = 66)

X      Y      x = X − X̄   y = Y − Ȳ   x²      y²      xy
39     47     -26         -19         676     361     494
65     53     0           -13         0       169     0
62     58     -3          -8          9       64      24
90     86     25          20          625     400     500
82     62     17          -4          289     16      -68
75     68     10          2           100     4       20
25     60     -40         -6          1600    36      240
98     91     33          25          1089    625     825
36     51     -29         -15         841     225     435
78     84     13          18          169     324     234
650    660    0           0           5398    2224    2704

r = Σxy / √[Σx² · Σy²] = 2704 / √[5398 * 2224]
r = 0.7804

Problem 2: deviations taken from assumed means (dx = X − A with A = 70, dy = Y − B with B = 60)

X      Y      dx     dy     dx²     dy²     dxdy
39     47     -31    -13    961     169     403
65     53     -5     -7     25      49      35
62     58     -8     -2     64      4       16
90     86     20     26     400     676     520
82     62     12     2      144     4       24
75     68     5      8      25      64      40
25     60     -45    0      2025    0       0
98     91     28     31     784     961     868
36     51     -34    -9     1156    81      306
78     84     8      24     64      576     192
650    660    -50    60     5648    2584    2404

r = [nΣdxdy − Σdx · Σdy] / [√(nΣdx² − (Σdx)²) · √(nΣdy² − (Σdy)²)]
r = [10*2404 − (−50)(60)] / √{[10*5648 − (−50)²] · [10*2584 − (60)²]}
r = 27040 / √[53980 * 22240]
r = 0.7804

Problem 3: raw values

X      Y      X²       Y²       XY
39     47     1521     2209     1833
65     53     4225     2809     3445
62     58     3844     3364     3596
90     86     8100     7396     7740
82     62     6724     3844     5084
75     68     5625     4624     5100
25     60     625      3600     1500
98     91     9604     8281     8918
36     51     1296     2601     1836
78     84     6084     7056     6552
650    660    47648    45784    45604

r = [nΣXY − ΣX · ΣY] / [√(nΣX² − (ΣX)²) · √(nΣY² − (ΣY)²)]
r = [10*45604 − 650*660] / √{[10*47648 − (650)²] · [10*45784 − (660)²]}
r = 0.7804

Problem 4: From the following data calculate “n”: correlation coefficient = 0.8; summation of product of deviations = 60; SD of y = 2.5; summation of x² = 90. Here x & y are the deviations from their arithmetic means.

Answer:
r = [Cov (x,y)] / [SD (x)*SD (y)]
0.8 = [(1/n)(60)] / [√(90/n) * (2.5)]
Squaring both sides:
0.8*0.8 = [(1/n)*(1/n)*60*60] / [(90/n)*2.5*2.5]
0.64 = 3600 / (562.5 n) = 6.4/n
n = 10

Problem 5: A computer, while calculating the correlation coefficient between x & y from 25 pairs of observations, obtained the following values: summation of X is 125, summation of X² is 650; summation of Y is 100, summation of Y² is 460; summation of XY is 508. Later it was observed that two pairs of observations were taken as (6, 14) and (8, 6) instead of (8, 12) and (6, 8). Prove that the correct correlation coefficient is 0.67.

Answer: First we find all the terms needed in the formula. From each term we deduct the contribution of the wrong values [(6, 14) and (8, 6)] and add that of the correct values [(8, 12) and (6, 8)]. Applying the corrected terms in the formula, we get the answer as 2/3, i.e. 0.67.

Problem 6: If the relation between two random variables x & y is 2x + 3y = 4, then the correlation coefficient is:

Answer: -1 (by the property, since the coefficients of x and y have the same sign)

Problem 7: In two sets of variables X & Y with 50 observations each, the following data was observed: AM of X is 10; SD of X is 3; AM of Y is 6; SD of Y is 2; coefficient of correlation is 0.3. However, after subsequent verification one pair (10, 6) was weeded out. What is the change in the correlation coefficient with the remaining 49 pairs of values?

Answer: As in the previous problem, first we find all the terms in the formula, then deduct the contribution of the rejected pair (10, 6) from those terms and apply the new terms in the formula again. Because the rejected pair coincides with the means (10, 6), its deviations from the means are zero, so the means, Σxy, Σx² and Σy² (in deviation form) are all unchanged. Hence r = Σxy/√[Σx² · Σy²] remains 0.3, i.e. there is no change in the correlation coefficient.
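The answer to Problem 5 can be verified with a short calculation. The following Python sketch is only an illustration of the correction procedure described in the answer (remove the contribution of the wrong pairs, add that of the correct pairs, then apply the raw-value formula); the variable names are chosen here for convenience.

```python
import math

n = 25
sum_x, sum_x2 = 125, 650
sum_y, sum_y2 = 100, 460
sum_xy = 508

wrong_pairs = [(6, 14), (8, 6)]
right_pairs = [(8, 12), (6, 8)]

# Remove the contribution of the wrongly recorded pairs
for x, y in wrong_pairs:
    sum_x -= x
    sum_y -= y
    sum_x2 -= x * x
    sum_y2 -= y * y
    sum_xy -= x * y

# Add the contribution of the correct pairs
for x, y in right_pairs:
    sum_x += x
    sum_y += y
    sum_x2 += x * x
    sum_y2 += y * y
    sum_xy += x * y

num = n * sum_xy - sum_x * sum_y
den = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = num / den
print(sum_y2, sum_xy, round(r, 2))
```

Here ΣX, ΣX² and ΣY are unchanged by the correction, while ΣY² becomes 436 and ΣXY becomes 520, giving r = 500/750 = 2/3.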

PROBABLE ERROR After computing the value of the correlation coefficient, the next step is to find the extent to which it is dependable. The probable error of the correlation coefficient, usually denoted by P.E.(r), is an old measure for testing the reliability of an observed value of the correlation coefficient in so far as it depends upon the conditions of random sampling. If r is the observed correlation coefficient in a sample of n pairs of observations, then its standard error, usually denoted by S.E.(r), is given by

$$SE(r) = \frac{1 - r^2}{\sqrt{n}}$$

$$PE(r) = 0.6745 \times SE(r)$$

The reason for taking the factor 0.6745 is that in a normal distribution 50% of the observations lie in the range μ ± 0.6745σ, where μ is the mean and σ is the s.d.

According to Secrist, “The probable error of the correlation coefficient is an amount which, if added to and subtracted from the mean correlation coefficient, produces amounts within which the chances are even that a coefficient of correlation from a series selected at random will fall.”

Uses of probable error The probable error of the correlation coefficient may be used to determine the limits within which the population correlation coefficient may be expected to lie.

1. Limits for the population correlation coefficient are r ± P.E.(r). This implies that if we take another random sample of the same size n from the same population from which the first sample was taken, then the observed value of the correlation coefficient, say r1, in the second sample can be expected to lie within the limits given.

2. P.E.(r) may be used to test whether an observed value of the sample correlation coefficient is significant of any correlation in the population. The following guidelines may be used:
a. If r < P.E.(r), i.e., if the observed value of r is less than its P.E., then the correlation is not at all significant.
b. If r > 6 P.E.(r), i.e., if the observed value of r is greater than 6 times its P.E., then r is definitely significant.
c. In other situations nothing can be concluded with certainty.

Important Remarks 1: Sometimes P.E. may lead to fallacious conclusions, particularly when n, the number of pairs of observations, is small. In order to use P.E. effectively, n should be fairly large. However, a rigorous test of the significance of an observed sample correlation coefficient is provided by Student’s t test.

Important Remarks 2: P.E. can be used only under the following conditions:
a. The data must have been drawn from a normal population.

b. The conditions of random sampling should prevail in selecting the sampled observations.

In summary: if r < PE(r), r is not at all significant; if r > 6 PE(r), r is significant; in other cases nothing can be concluded with certainty.

Problem 1: Comment whether the correlation coefficient is significant or not.

 X     Y    dx=(X-60)/5   dy=(Y-65)/5    dx²    dy²   dxdy
45    35        -3            -6           9     36     18
70    90         2             5           4     25     10
65    70         1             1           1      1      1
30    40        -6            -5          36     25     30
90    95         6             6          36     36     36
40    40        -4            -5          16     25     20
50    60        -2            -1           4      1      2
75    80         3             3           9      9      9
85    80         5             3          25      9     15
60    50         0            -3           0      9      0
Total            2            -2         140    176    141

$$SE(r) = \frac{1 - r^2}{\sqrt{n}} = \frac{1 - (0.9)^2}{\sqrt{10}} = 0.0600$$

PE(r) = SE(r) × 0.6745 = 0.06 × 0.6745 = 0.0405

Since 0.9 > 6 PE(r) [i.e., 0.2432], r is highly significant.

Working Note:

$$r = \frac{n\sum d_x d_y - \sum d_x \sum d_y}{\sqrt{[n\sum d_x^2 - (\sum d_x)^2]\,[n\sum d_y^2 - (\sum d_y)^2]}} = \frac{10 \times 141 - (2)(-2)}{\sqrt{[10 \times 140 - (2)^2]\,[10 \times 176 - (-2)^2]}} = 0.90$$
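The probable-error test can be sketched in Python as follows. This is an illustration under stated assumptions: the function names are hypothetical, and the comparison uses the absolute value of r so that negative coefficients are handled as well, which the notes do not state explicitly.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient from paired observations."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))


def probable_error(r, n):
    """PE(r) = 0.6745 * SE(r), with SE(r) = (1 - r^2) / sqrt(n)."""
    return 0.6745 * (1 - r ** 2) / sqrt(n)


def significance(r, n):
    """Apply the guideline: r < PE, r > 6 PE, or neither."""
    pe = probable_error(r, n)
    if abs(r) < pe:          # assumption: compare |r| so negative r also works
        return "not significant"
    if abs(r) > 6 * pe:
        return "significant"
    return "inconclusive"


# Data of Problem 1 above.
X = [45, 70, 65, 30, 90, 40, 50, 75, 85, 60]
Y = [35, 90, 70, 40, 95, 40, 60, 80, 80, 50]
r = pearson_r(X, Y)
print(round(r, 2), significance(r, len(X)))  # 0.9 significant
```

Because r is unchanged by a change of origin and scale, computing it from the raw X and Y values gives the same coefficient as the step deviations dx and dy used in the table.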

CORRELATION IN BIVARIATE FREQUENCY TABLE If in a bivariate distribution the data are fairly large, they may be summarized in the form of a two-way table. Here, for each variable, the values are grouped into various classes (not necessarily the same for both variables), keeping in view the same considerations as in the case of a univariate distribution. For example, if there are m classes for the X-variable series and n classes for the Y-variable series, then there will be m × n cells in the two-way table. By going through the different pairs of values (x, y) and using tally marks, we can find the frequency for each cell and thus obtain the so-called bivariate frequency table. With N denoting the total frequency, the correlation coefficient is

$$r = \frac{N\sum fxy - (\sum fx)(\sum fy)}{\sqrt{[N\sum fx^2 - (\sum fx)^2]\,[N\sum fy^2 - (\sum fy)^2]}}$$

or, in terms of the step deviations u and v of the mid-values of x and y,

$$r = \frac{N\sum fuv - (\sum fu)(\sum fv)}{\sqrt{[N\sum fu^2 - (\sum fu)^2]\,[N\sum fv^2 - (\sum fv)^2]}}$$

Food Expenditure              Family Income (Rs.)
(in %)            200-300  300-400  400-500  500-600  600-700
10-15                -        -        -        3        7
15-20                -        4        9        4        3
20-25                7        6       12        5        -
25-30                3       10       19        8        -

Taking the mid-values x of the income classes and y of the expenditure classes, with u = (x - 450)/100 and v = (y - 17.5)/5:

C-I (x)                200-300  300-400  400-500  500-600  600-700
            x            250      350      450      550      650
C-I (y)   y      v   u    -2       -1        0        1        2       f     fv    fv²    fuv
10-15    12.5   -1         -        -        -        3        7      10    -10     10    -17
15-20    17.5    0         -        4        9        4        3      20      0      0      0
20-25    22.5    1         7        6       12        5        -      30     30     30    -15
25-30    27.5    2         3       10       19        8        -      40     80    160    -16
            f             10       20       40       20       10     100    100    200    -48
            fu           -20      -20        0       20       20       0
            fu²           40       20        0       20       40     120
            fuv          -26      -26        0       18      -14     -48

$$r = \frac{100(-48) - (0)(100)}{\sqrt{100 \times 120 - (0)^2}\,\sqrt{100 \times 200 - (100)^2}} = -0.4382$$

Problem 2:

                 Age in Years
Marks       18    19    20    21    22
22.5         3     2     -     -     -
17.5         -     5     4     -     -
12.5         -     -     7    10     -
7.5          -     -     -     3     2
2.5          -     -     -     3     1

Taking u = x - 20 for age and v = (y - 12.5)/5 for marks:

            x      18    19    20    21    22
  y     v   u      -2    -1     0     1     2      f     fv    fv²    fuv
22.5    2           3     2     -     -     -      5     10     20    -16
17.5    1           -     5     4     -     -      9      9      9     -5
12.5    0           -     -     7    10     -     17      0      0      0
7.5    -1           -     -     -     3     2      5     -5      5     -7
2.5    -2           -     -     -     3     1      4     -8     16    -10
        f           3     7    11    16     3     40      6     50    -38
        fu         -6    -7     0    16     6      9
        fu²        12     7     0    16    12     47
        fuv       -12    -9     0    -9    -8    -38

$$r = \frac{40(-38) - (9)(6)}{\sqrt{40 \times 47 - (9)^2}\,\sqrt{40 \times 50 - (6)^2}} = -0.8374$$
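The step-deviation formula for a bivariate frequency table can be sketched in Python as below. The function name `bivariate_r` and the layout of the frequency grid (one list per row, empty cells entered as 0) are assumptions made for this illustration.

```python
from math import sqrt


def bivariate_r(freq, u, v):
    """Correlation from a two-way frequency table.

    freq[i][j] is the frequency of row i (step deviation v[i])
    and column j (step deviation u[j]).
    """
    n = sfu = sfv = sfu2 = sfv2 = sfuv = 0
    for i, row in enumerate(freq):
        for j, f in enumerate(row):
            n += f
            sfu += f * u[j]
            sfv += f * v[i]
            sfu2 += f * u[j] ** 2
            sfv2 += f * v[i] ** 2
            sfuv += f * u[j] * v[i]
    return (n * sfuv - sfu * sfv) / sqrt((n * sfu2 - sfu ** 2) * (n * sfv2 - sfv ** 2))


# Food expenditure (rows) against family income (columns).
food = [[0, 0, 0, 3, 7],
        [0, 4, 9, 4, 3],
        [7, 6, 12, 5, 0],
        [3, 10, 19, 8, 0]]
r_food = bivariate_r(food, u=[-2, -1, 0, 1, 2], v=[-1, 0, 1, 2])

# Marks (rows) against age in years (columns).
marks = [[3, 2, 0, 0, 0],
         [0, 5, 4, 0, 0],
         [0, 0, 7, 10, 0],
         [0, 0, 0, 3, 2],
         [0, 0, 0, 3, 1]]
r_marks = bivariate_r(marks, u=[-2, -1, 0, 1, 2], v=[2, 1, 0, -1, -2])

print(round(r_food, 4), round(r_marks, 4))  # -0.4382 -0.8374
```

A single pass over the cells accumulates all six sums, so the marginal totals of the worked tables are not needed as separate inputs.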

RANK CORRELATION METHOD Sometimes we come across statistical series in which the variables under consideration are not capable of quantitative measurement but can be arranged in serial order. This happens when we are dealing with qualitative characteristics (attributes) such as honesty, beauty, character, morality, etc., which cannot be measured quantitatively but can be arranged serially. In such situations Karl Pearson’s coefficient of correlation cannot be used as such. Charles Edward Spearman, a British psychologist, developed a formula in 1904 which consists in obtaining the correlation coefficient between the ranks of n individuals in the two attributes under study. The Pearson correlation coefficient between the ranks X and Y is called the rank correlation coefficient between the characteristics A and B for that group of individuals.

For example, suppose students are assigned ranks in Statistics according to their marks in Statistics, and ranks in Mathematics according to their marks in Mathematics. Then the correlation between these two sets of ranks is called rank correlation. In general, in bivariate data, if the values of the variables are ranked in decreasing (or increasing) order, the correlation between these ranks is rank correlation. The coefficient of correlation computed for these ranks is called Spearman’s coefficient of rank correlation. It is denoted by ρ (rho). If R1 and R2 are the ranks in the two characteristics and d = R1 - R2 is the difference between the ranks, the coefficient of rank correlation is

$$\rho = 1 - \frac{6\sum d^2}{n^3 - n}$$

Since ρ is the product moment coefficient of correlation between the ranks, it takes a value between -1 and +1. Karl Pearson’s coefficient of correlation can be calculated only if the characteristics under study are quantitative (they should be numerically measurable), but Spearman’s coefficient of rank correlation can be calculated even if the characteristics under study are qualitative. If it is possible to assign ranks to the units with regard to the two characteristics, the coefficient of rank correlation can be calculated.

REPEATED RANKS In case of attributes, if there is a tie, i.e., if any two or more individuals are placed together in any classification w.r.t. an attribute, or if in case of variable data there is more than one item with the same value in either or both the series, then Spearman’s formula for calculating the rank correlation coefficient breaks down. In this case the variables X (the ranks of individuals in characteristic A, the 1st series) and Y (the ranks of individuals in characteristic B, the 2nd series) do not take the values from 1 to n, and consequently x̄ ≠ ȳ, while the derivation of the formula assumes that x̄ = ȳ.

For the computation of the coefficient of rank correlation, two or more values may be equal while ranking, and so a situation of ties may arise. In such a case, all the values which are equal are assigned the same average rank, and then the coefficient of rank correlation is found. Corresponding to every such repeated rank (which repeats m times), a factor (m³ - m)/12 is added to Σd².

The common ranks assigned to the repeated items are the arithmetic mean of the ranks which these items would have got if they were different from each other, and the next item gets the rank next to the ranks used in computing the common rank. For example, suppose an item is repeated twice at rank 4. Then the common rank to be assigned to each item is (4 + 5)/2, i.e., 4.5, which is the average of 4 and 5, the ranks which these observations would have assumed if they were different. The next item will be assigned the rank 6. If an item is repeated thrice at rank 7, then the common rank to be assigned to each value will be (7 + 8 + 9)/3, i.e., 8, which is the arithmetic mean of 7, 8 and 9, viz., the ranks these observations would have got if they were different from each other. The next rank to be assigned will be 10.

If only a small proportion of the ranks are tied, this technique may be applied together with the formula. If a large proportion of ranks are tied, it is advisable to apply an adjustment or a correction factor as follows: in the formula, add the factor m(m² - 1)/12 to Σd², where m is the number of times an item is repeated. This correction factor is to be added for each repeated value in both the series.

REMARKS ON SPEARMAN’S RANK CORRELATION COEFFICIENT

1. Since Spearman’s rank correlation coefficient ρ is nothing but Pearson’s correlation coefficient between the ranks, it can be interpreted in the same way as Karl Pearson’s correlation coefficient.

2. Karl Pearson’s correlation coefficient assumes that the parent population from which the sample observations are drawn is normal. If this assumption is violated, then we need a measure which is distribution-free (or non-parametric). A distribution-free measure is one which does not make any assumptions about the form of the population. Spearman’s ρ is such a measure, since no strict assumptions are made about the form of the population from which the sample observations are drawn.

3. Spearman’s formula is easy to understand and apply as compared with Karl Pearson’s formula. The values obtained by the two formulae, viz. Pearsonian r and Spearman’s ρ, are generally different. The difference arises because, when ranking is used instead of the full set of observations, there is always some loss of information. Unless many ties exist, the coefficient of rank correlation should be slightly lower than the Pearsonian coefficient.

4. Spearman’s formula is the only formula to be used for finding the correlation coefficient if we are dealing with qualitative characteristics which cannot be measured quantitatively but can be arranged serially. It can also be used where actual data are given. In case of extreme observations, Spearman’s formula is preferred to Pearson’s formula.

5. Spearman’s formula has its limitations also. It is not practicable in the case of a bivariate frequency distribution. For n > 30, this formula should not be used unless the ranks are given, since otherwise the calculations are quite time-consuming.

When ranks are not repeated:

$$\rho = 1 - \frac{6\sum D^2}{n(n^2 - 1)}$$

When ranks are repeated:

$$\rho = 1 - \frac{6\left[\sum D^2 + \sum \frac{m(m^2 - 1)}{12}\right]}{n(n^2 - 1)}$$

Problem 1:

Rank in A (x)   Rank in B (y)   D = x - y    D²
      1              10            -9        81
      2               7            -5        25
      3               2             1         1
      4               6            -2         4
      5               4             1         1
      6               8            -2         4
      7               3             4        16
      8               1             7        49
      9               9             0         0
     10               5             5        25
                                  Total     206

$$\rho = 1 - \frac{6 \times 206}{10(100 - 1)} = -0.25$$

Problem 2:

Cost   Sales   Rank of Cost (X)   Rank of Sales (Y)    D     D²
 39     47            8                  10            -2     4
 65     53            6                   8            -2     4
 62     58            7                   7             0     0
 90     86            2                   2             0     0
 82     62            3                   5            -2     4
 75     68            5                   4             1     1
 25     60           10                   6             4    16
 98     91            1                   1             0     0
 36     51            9                   9             0     0
 78     84            4                   3             1     1
                                                    Total    30

$$\rho = 1 - \frac{6 \times 30}{10(100 - 1)} = 0.82$$

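Problem 3:

 X     Y     R1     R2      D      D²
48    13      3    5.5   -2.5    6.25
33    13      5    5.5   -0.5    0.25
40    24      4      1      3       9
 9     6     10    8.5    1.5    2.25
16    15      8      4      4      16
16     4      8     10     -2       4
65    20      1      2     -1       1
24     9      6      7     -1       1
16     6      8    8.5   -0.5    0.25
57    19      2      3     -1       1
                         Total     41

In the X series the value 16 occurs three times (m = 3, correction 2); in the Y series the values 13 and 6 each occur twice (m = 2, correction 0.5 each).

$$\rho = 1 - \frac{6\left(\sum D^2 + \sum \frac{m(m^2 - 1)}{12}\right)}{n(n^2 - 1)} = 1 - \frac{6(41 + 2 + 0.5 + 0.5)}{10(10^2 - 1)} = 0.7333$$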
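The ranking with average ranks for ties and the correction factor m(m² - 1)/12 can be sketched in Python as follows. The helper names are hypothetical, and ranking is done in decreasing order (largest value gets rank 1), as in the worked problems.

```python
from collections import Counter


def average_ranks(values):
    """Rank in decreasing order (largest = 1); tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = ((i + 1) + (j + 1)) / 2
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def tie_correction(values):
    """Sum of m(m^2 - 1)/12 over every value repeated m > 1 times."""
    return sum(m * (m * m - 1) / 12 for m in Counter(values).values() if m > 1)


def spearman_rho(xs, ys):
    n = len(xs)
    rx, ry = average_ranks(xs), average_ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    d2 += tie_correction(xs) + tie_correction(ys)
    return 1 - 6 * d2 / (n * (n * n - 1))


# Data of Problem 3 (repeated ranks) and Problem 2 (no ties).
rho3 = spearman_rho([48, 33, 40, 9, 16, 16, 65, 24, 16, 57],
                    [13, 13, 24, 6, 15, 4, 20, 9, 6, 19])
rho2 = spearman_rho([39, 65, 62, 90, 82, 75, 25, 98, 36, 78],
                    [47, 53, 58, 86, 62, 68, 60, 91, 51, 84])
print(round(rho3, 4), round(rho2, 4))  # 0.7333 0.8182
```

When there are no ties the correction term is zero, so the same function covers both formulas.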

ALGEBRAIC METHOD (CONCURRENT DEVIATIONS) This is a very casual method of determining the correlation between two series when we are not very serious about its precision. It is based on the signs of the deviations (i.e., the direction of the change) of the values of the variable from its preceding value and does not take into account the exact magnitude of the values of the variable. Thus, we put a plus (+) sign, minus (-) sign or equality (=) sign for the deviation if the value of the variable is greater than, less than or equal to the preceding value respectively. The deviations in the values of the two variables are said to be concurrent if they have the same sign, i.e., either both deviations are positive, both are negative or both are equal. The formula used for computing the correlation coefficient r by this method is

$$r = \pm\sqrt{\pm\frac{2c - n}{n}}$$

where c is the number of pairs of concurrent deviations and n is the number of pairs of deviations. The plus/minus sign to be taken inside and outside the square root is of fundamental importance. Since -1 ≤ r ≤ 1, the quantity inside the square root, viz. ±(2c - n)/n, must be positive, otherwise r will be imaginary, which is not possible. Thus, if (2c - n) is positive, we take the positive sign inside and outside the square root, and if (2c - n) is negative, we take the negative sign inside and outside the square root.

Remarks 1: It should be clearly noted that here n is not the number of pairs of observations but the number of pairs of deviations, and as such it is one less than the number of pairs of observations.

Remarks 2: r computed by this formula is also known as the coefficient of concurrent deviations.

Remarks 3: The coefficient of concurrent deviations is primarily based on the following principle: “If the short time fluctuations of the time series are positively correlated or, in other words, if their deviations are concurrent, their curves would move in the same direction and would indicate positive correlation between them”.

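Year    Supply    x    Price    y    xy
1993     160            292
1994     164      +     280     -    -
1995     172      +     260     -    -
1996     182      +     234     -    -
1997     166      -     266     +    -
1998     170      +     254     -    -
1999     178      +     230     -    -
2000     192      +     190     -    -
2001     186      -     200     +    -

Here x and y are the signs of the change in supply and price from the preceding year, and xy is the sign of their product. No pair of deviations is concurrent, so c = 0, and n = 8.

$$r = \pm\sqrt{\pm\frac{2c - n}{n}} = \pm\sqrt{\pm\frac{0 - 8}{8}} = -\sqrt{-\left(\frac{-8}{8}\right)} = -1$$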
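A minimal Python sketch of the concurrent-deviations method, using the supply and price series above. The function names are illustrative; a deviation of zero (equal consecutive values) is treated as concurrent only with another zero deviation, following the definition given in the text.

```python
from math import sqrt


def sign(a):
    """Return +1, -1 or 0 according to the sign of a."""
    return (a > 0) - (a < 0)


def concurrent_deviation_r(xs, ys):
    # Signs of the change of each value from its preceding value.
    dx = [sign(b - a) for a, b in zip(xs, xs[1:])]
    dy = [sign(b - a) for a, b in zip(ys, ys[1:])]
    n = len(dx)                                   # pairs of deviations
    c = sum(1 for p, q in zip(dx, dy) if p == q)  # concurrent pairs
    ratio = (2 * c - n) / n
    # Take the same sign inside and outside the square root.
    return sqrt(ratio) if ratio >= 0 else -sqrt(-ratio)


supply = [160, 164, 172, 182, 166, 170, 178, 192, 186]
price = [292, 280, 260, 234, 266, 254, 230, 190, 200]
print(concurrent_deviation_r(supply, price))  # -1.0
```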


REGRESSION Literally the word regression means ‘return to the origin’. In statistics, the word is used in a different sense. If two variables are correlated, the unknown value of one of the variables can be estimated by using the known value of the other variable. The value so estimated may not be equal to the actually observed value, but it will be close to it. Regression Analysis, in the general sense, means the estimation or prediction of the unknown value of one variable from the known value of the other variable. Regression Analysis confined to the study of only two variables at a time is termed Simple Regression. But quite often the values of a particular phenomenon may be affected by a multiplicity of causes. Regression Analysis for studying more than two variables at a time is known as Multiple Regression. In Regression Analysis there are two types of variables. The variable whose value is influenced or is to be predicted is called the dependent variable. The variable which influences the values or is used for prediction is called the independent variable. In Regression Analysis the independent variable is also known as the regressor, predictor or explanator, while the dependent variable is also known as the regressed or explained variable.

LINEAR & NON-LINEAR REGRESSION If the given bivariate data are plotted on a graph, the points so obtained will more or less concentrate around a curve, called the “Curve of Regression”. The mathematical equation of the regression curve is called the Regression Equation. If the regression curve is a straight line, we say that there is linear regression between the variables under study. If the curve of regression is not a straight line, the regression is termed curved or non-linear regression. The tendency of the actual value to lie close to the estimated value is called regression. In a wider usage, regression is the theory of estimation of the unknown value of a variable with the help of known values of other variables. The regression theory was first introduced and developed by Sir Francis Galton in the field of Genetics.

Here, firstly, a mathematical relation between the two variables is framed. This relation, which is called the regression equation, is obtained by the method of least squares. It may be linear or non-linear. For bivariate data on x and y, the regression equation obtained with the assumption that x is dependent on y is called the regression of x on y. The regression of x on y is: (x – AM of x) = bxy (y – AM of y). The regression equation obtained with the assumption that y is dependent on x is called the regression of y on x. The regression of y on x is: (y – AM of y) = byx (x – AM of x). The regression coefficients bxy and byx are given by the following formulas:

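$$b_{xy} = r\,\frac{\sigma_x}{\sigma_y} = \frac{\mathrm{Cov}(x, y)}{\sigma_y^2} = \frac{n\sum xy - \sum x \sum y}{n\sum y^2 - (\sum y)^2} = \frac{\sum d_x d_y}{\sum d_y^2}$$

$$b_{yx} = r\,\frac{\sigma_y}{\sigma_x} = \frac{\mathrm{Cov}(x, y)}{\sigma_x^2} = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} = \frac{\sum d_x d_y}{\sum d_x^2}$$

Here dx and dy denote the deviations of x and y from their respective arithmetic means.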

The regression of x on y is used for the estimation of x values and the regression of y on x is used for the estimation of y values. The graphs of the regression equations are the regression lines.

PROPERTIES OF REGRESSION Regression coefficients are the coefficients of the independent variables in the regression equations.
1. The regression coefficient bxy is the change occurring in x for a unit change in y. The regression coefficient byx is the change occurring in y for a unit change in x.
2. The regression coefficients are independent of the origin of measurement of the variables, but they are dependent on the scale.
3. The geometric mean of the regression coefficients is equal to the coefficient of correlation (numerically).
4. The regression coefficients cannot be of opposite signs. If r is positive, both the regression coefficients will be positive. If r is negative, both the regression coefficients will be negative. If r is zero, both the regression coefficients will be zero.
5. Since the coefficient of correlation cannot numerically be greater than 1, the product of the regression coefficients cannot be greater than 1.

PROPERTIES OF REGRESSION LINES There are two regression lines.
1. The regression lines intersect at (x̄, ȳ), the point of the means.
2. The regression lines have positive slope if the variables are positively correlated. They have negative slope if the variables are negatively correlated.
3. If there is perfect correlation, the regression lines coincide (there will be only one regression line).

LINES OF REGRESSION A line of regression is the line which gives the best estimate of one variable for any given value of the other variable. In case of two variables, say x and y, we shall have two regression equations: one of x on y and the other of y on x. The line of regression of y on x is the line which gives the best estimate of the value of y for any specified value of x. The line of regression of x on y is the line which gives the best estimate of the value of x for any specified value of y.

LINE OF REGRESSION OF y on x

$$y - \bar{y} = r\,\frac{\sigma_y}{\sigma_x}\,(x - \bar{x})$$

LINE OF REGRESSION OF x on y

$$x - \bar{x} = r\,\frac{\sigma_x}{\sigma_y}\,(y - \bar{y})$$

REMEMBER
a. When r = 0, i.e., when x and y are uncorrelated, the lines of regression of y on x and x on y are given by y – ȳ = 0 and x – x̄ = 0. The lines are perpendicular to each other.
b. When r = ±1, the two lines coincide.
c. If the value of r is significant, we can use the lines of regression for estimation and prediction.
d. If r is not significant, then the linear model is not a good fit and hence the line of regression should not be used for prediction.

COEFFICIENTS OF REGRESSION
a. bxy is the coefficient of regression of x on y.
b. byx is the coefficient of regression of y on x.

THEOREMS ON REGRESSION COEFFICIENTS
a. The correlation coefficient is the geometric mean between the regression coefficients, i.e., r² = bxy × byx.
b. The sign to be taken before the square root is the same as that of the regression coefficients.
c. If one of the regression coefficients is greater than one, then the other must be less than one.
d. The AM of the modulus values of the regression coefficients is greater than or equal to the modulus value of the correlation coefficient, i.e., (|bxy| + |byx|)/2 ≥ |r|.

e. Regression coefficients are independent of change of origin but not of scale.

Problem 1:

  X     Y    dx = X - X̄   dy = Y - Ȳ     dx²     dy²    dxdy
 91    71         1            1            1       1       1
 97    75         7            5           49      25      35
108    69        18           -1          324       1     -18
121    97        31           27          961     729     837
 67    70       -23            0          529       0       0
124    91        34           21         1156     441     714
 51    39       -39          -31         1521     961    1209
 73    61       -17           -9          289      81     153
111    80        21           10          441     100     210
 57    47       -33          -23         1089     529     759
900   700         0            0         6360    2868    3900

X̄ = 900/10 = 90 and Ȳ = 700/10 = 70.

$$b_{xy} = \frac{\sum d_x d_y}{\sum d_y^2} = \frac{3900}{2868} = 1.360 \qquad b_{yx} = \frac{\sum d_x d_y}{\sum d_x^2} = \frac{3900}{6360} = 0.6132$$

Regression of x on y: (x - x̄) = bxy (y - ȳ)
(x - 90) = 1.360 (y - 70)
x = 1.360y - 5.20

Regression of y on x: (y - ȳ) = byx (x - x̄)
(y - 70) = 0.6132 (x - 90)
y = 0.6132x + 14.812
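The two regression lines of Problem 1 can be reproduced with a short Python sketch. The function name `regression_lines` is a hypothetical choice, and the third X value is taken as 108, which is the value consistent with dx = 18 and ΣX = 900 in the working.

```python
def regression_lines(xs, ys):
    """Return byx, bxy and the (intercept, slope) of both regression lines."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))  # Σdx.dy
    sxx = sum((x - mx) ** 2 for x in xs)                    # Σdx²
    syy = sum((y - my) ** 2 for y in ys)                    # Σdy²
    byx = sxy / sxx
    bxy = sxy / syy
    return {
        "byx": byx,
        "bxy": bxy,
        "y_on_x": (my - byx * mx, byx),  # y = intercept + byx * x
        "x_on_y": (mx - bxy * my, bxy),  # x = intercept + bxy * y
    }


X = [91, 97, 108, 121, 67, 124, 51, 73, 111, 57]
Y = [71, 75, 69, 97, 70, 91, 39, 61, 80, 47]
lines = regression_lines(X, Y)
print(round(lines["byx"], 4), round(lines["bxy"], 4))  # 0.6132 1.3598
```

Both lines pass through the point of means (90, 70), which is why each intercept is obtained from the means and the corresponding slope.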

Problem 2: The data about the sales and advertisement expenditure of a firm are given below:

                       Sales (x)   Advertisement Expenditure (y)
Means                     40                  6
Standard Deviations       10                  1.5

Coefficient of Correlation is 0.9

o Estimate the likely sales for a proposed advertisement expenditure of Rs. 10 crores.
o What should be the advertisement expenditure if the firm proposes a sales target of 60 crores of rupees?

Answer:

Regression of x on y: (x - x̄) = bxy (y - ȳ), where bxy = r σx / σy
(x - 40) = (0.9 × 10/1.5) (y - 6)
x = 6y + 4
For y = 10: x = 6 × 10 + 4 = 64

Regression of y on x: (y - ȳ) = byx (x - x̄), where byx = r σy / σx
(y - 6) = (0.9 × 1.50/10) (x - 40)
y = 0.135x + 0.6
For x = 60: y = 0.135 × 60 + 0.6 = 8.7

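Problem 3: Point out the inconsistency, if any, in the following statement: “The Regression Equation of y on x is 2y + 3x = 4 and the correlation coefficient between x and y is 0.8.” Answer: Refer to the properties of regression coefficients. The equation gives y = 2 - 1.5x, so byx = -1.5 is negative, whereas r = 0.8 is positive. Since the regression coefficients and r must have the same sign, the statement is inconsistent.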

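Problem 4: By using the following data, find out the two lines of regression and from them compute the Karl Pearson’s coefficient of correlation: ΣX = 250; ΣY = 300; ΣXY = 7900; ΣX² = 6500; ΣY² = 10000; n = 10

Answer:

$$b_{xy} = \frac{n\sum xy - \sum x \sum y}{n\sum y^2 - (\sum y)^2} = \frac{10 \times 7900 - 250 \times 300}{10 \times 10000 - (300)^2} = 0.4$$

$$b_{yx} = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} = \frac{10 \times 7900 - 250 \times 300}{10 \times 6500 - (250)^2} = 1.6$$

With X̄ = 250/10 = 25 and Ȳ = 300/10 = 30, the lines of regression are:
x on y: (x - 25) = 0.4 (y - 30), i.e., x = 0.4y + 13
y on x: (y - 30) = 1.6 (x - 25), i.e., y = 1.6x - 10

r² = bxy × byx = 0.4 × 1.6 = 0.64
r = 0.8 (positive, since both regression coefficients are positive)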

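Problem 5: Find the two regression coefficients and hence r. n = 5; X̄ = 10; Ȳ = 20; Σ(X - 4)² = 100; Σ(Y - 10)² = 160; Σ(X - 4)(Y - 10) = 80

Answer: Let U = X - 4 and V = Y - 10. Then Ū = X̄ - 4 = 6, so ΣU = nŪ = 30. Similarly V̄ = 10 and ΣV = 50.

$$b_{yx} = \frac{n\sum UV - \sum U \sum V}{n\sum U^2 - (\sum U)^2} = \frac{5 \times 80 - 30 \times 50}{5 \times 100 - (30)^2} = \frac{11}{4}$$

$$b_{xy} = \frac{n\sum UV - \sum U \sum V}{n\sum V^2 - (\sum V)^2} = \frac{5 \times 80 - 30 \times 50}{5 \times 160 - (50)^2} = \frac{11}{17}$$

r = √[(11/4)(11/17)] = 1.33, which is impossible since r cannot exceed 1. The given data are therefore inconsistent.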
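The consistency checks used in Problems 3 to 5 (same sign for both regression coefficients, product not exceeding 1) can be sketched in Python. The function names and the choice of raising `ValueError` for inconsistent data are assumptions of this illustration.

```python
from math import sqrt


def regression_coefficients(n, sx, sy, sxx, syy, sxy):
    """Return (byx, bxy) from n, Σx, Σy, Σx², Σy², Σxy."""
    numerator = n * sxy - sx * sy
    byx = numerator / (n * sxx - sx ** 2)
    bxy = numerator / (n * syy - sy ** 2)
    return byx, bxy


def correlation_from_coefficients(byx, bxy):
    """r = ±sqrt(byx * bxy), with the sign of the regression coefficients."""
    product = byx * bxy
    if product < 0:
        raise ValueError("regression coefficients have opposite signs")
    if product > 1:
        raise ValueError("product of regression coefficients exceeds 1")
    r = sqrt(product)
    return r if byx >= 0 else -r


# Problem 4: consistent data, r = 0.8.
byx4, bxy4 = regression_coefficients(10, 250, 300, 6500, 10000, 7900)
r4 = correlation_from_coefficients(byx4, bxy4)

# Problem 5 (sums of U = X - 4 and V = Y - 10): inconsistent data.
byx5, bxy5 = regression_coefficients(5, 30, 50, 100, 160, 80)
try:
    correlation_from_coefficients(byx5, bxy5)
    problem5_consistent = True
except ValueError:
    problem5_consistent = False

print(byx4, bxy4, round(r4, 2), problem5_consistent)  # 1.6 0.4 0.8 False
```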

Time Series Generally, planning of economic and business activities is based on predictions of production, demand, sales, etc. The future can be predicted by a detailed study of past variations. Thus, future demand can be predicted by studying the variations in demand over the last few years. A time series may be defined as a collection of readings, belonging to different time periods, of some economic variable or composite of variables. A series of observations of a phenomenon recorded at successive points of time is called a Time Series. It is a chronological arrangement of statistical data regarding the phenomenon. Generally, time series are those of production, demand, sales, price, imports, exports, bank rate, value of money, etc. Usually in a time series equidistant points of time are considered. There may be weekly, monthly, yearly, etc. recordings. A graphical presentation of a time series is called a Historigram.

COMPONENTS OF A TIME SERIES In a time series, the observations vary with time. The variation occurring in any period is the result of many factors. The effects of these factors may be summed up as four components. They are:
a. Trend (Secular Trend, Long Term Movement)
b. Seasonal Variation
c. Cyclical Variation (Business Cycle)
d. Irregular Variation (Random Fluctuation, Erratic Variation)

An analytical study of the different components of a time series, the effects of these components, etc. is called analysis of time series. The utility of such analysis is:
a. Understanding the past behaviour of the variable
b. Knowing the existing nature of variation
c. Predicting the future trend
d. Comparison with other similar variables.

Trend (Secular Trend) Trend is the overall change taking place in the time series over a long period of time. It is the change taking place over a period of many years. Most time series show a general tendency to increase, decrease or remain constant over a long period of time. Such an overall change is the trend. Examples
a. The steady increase in the population of India over the past many years is an upward trend.
b. The steady increase in the price of gold over the last many years is an upward trend.
c. Due to the availability of greater medical facilities, the death rate is decreasing. Thus, the death rate shows a downward trend.
d. Atmospheric temperature at a place, though it shows short-time variation, does not show a significant upward or downward trend.
The root causes of trend are technological advancement, growth of population, change in tastes, etc. Trend is measured mainly by the method of moving averages and by the method of least squares.

Seasonal Variation The regular and periodic variation in a time series is called seasonal variation. Generally, the period of seasonal variation is within one year. The factors causing seasonal variation are (1) weather conditions and (2) customs, traditions and habits of people. Seasonal variation is predictable. Examples
a. An increase in the sales of woollen clothes during winter.
b. An increase in the sales of note-books during the months of June, July and August.
c. An increase in atmospheric temperature during summer.

Cyclical Variation (Business Cycle) Cyclical variation is an oscillatory variation which occurs in four stages, viz. prosperity, recession, depression and recovery. Generally, such variation occurs in economic and business activities. The cycles are spaced more than one year apart. One cycle consisting of the four stages occurs in a period of a few years. The period is not definite; generally, it is 5 to 10 years. Many economists have explained the causes of cyclical variation. Each of them is significant.

Irregular Variation (Random Fluctuation) Apart from the regular variations, most time series show variations which are totally unexpected. Irregular variations occur as a result of unexpected happenings such as wars, famines, strikes, floods, etc. They are unpredictable. Generally, the effect of such variation lasts for a short period. Examples
a. An increase in the price of vegetables due to a strike by railway employees.
b. A decrease in the number of passengers in city buses as a result of a strike by public sector employees.
c. An increase in the number of deaths due to earthquakes.

Measurement Of Trend
o Graphic (or Free-hand Curve Fitting) Method
o Method of Semi-Averages
o Method of Curve Fitting by the Principle of Least Squares
o Method of Moving Averages

METHOD OF SEMI-AVERAGES Problem 1: Estimate the value for 2000. If the actual sales figure is 35000 units, how do you account for the difference between the figures obtained?

Years   1993   1994   1995   1996   1997   1998
Sales     20     24     22     30     28     32

Answer:

Year   Sales (‘000s)   3 yearly Semi-Total   Semi-average
1993        20
1994        24                 66                  22
1995        22
1996        30
1997        28                 90                  30
1998        32

Difference between the semi-averages: (30 - 22) = 8, spread over 3 years, so the annual increment is 8/3 = 2.667.

Year   Trend Values (‘000s)
1993   22 - 2.667 = 19.333
1994   22
1995   22 + 2.667 = 24.667
1996   30 - 2.667 = 27.333
1997   30
1998   30 + 2.667 = 32.667
1999   32.667 + 2.667 = 35.333
2000   35.333 + 2.667 = 38

The difference is because of the assumption that there is a linear relationship between the given time series values. Moreover, the effects of seasonal, cyclical and irregular variations have been completely neglected.
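The semi-average calculation can be sketched in Python as follows. The function name `semi_average_trend` is hypothetical; for an odd number of years the middle year is left out of both halves, which is an assumption of this sketch in line with the usual convention.

```python
def semi_average_trend(years, values):
    """Return a function giving the semi-average trend value for any year."""
    n = len(values)
    half = n // 2                      # middle item is skipped when n is odd
    first_years, first_vals = years[:half], values[:half]
    second_years, second_vals = years[n - half:], values[n - half:]
    avg1 = sum(first_vals) / half
    avg2 = sum(second_vals) / half
    centre1 = sum(first_years) / half   # each average is plotted at the
    centre2 = sum(second_years) / half  # centre of its half-period
    slope = (avg2 - avg1) / (centre2 - centre1)
    return lambda year: avg1 + slope * (year - centre1)


years = [1993, 1994, 1995, 1996, 1997, 1998]
sales = [20, 24, 22, 30, 28, 32]
trend = semi_average_trend(years, sales)
print(round(trend(1993), 3), round(trend(2000), 3))  # 19.333 38.0
```

The returned function extrapolates the straight line through the two semi-averages, so estimates for years outside the data rest on the same linearity assumption noted in the answer above.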


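Problem 2: From the following series find the trend by the semi-average method. Estimate the value for the year 1999.

Year    1990   1991   1992   1993   1994   1995   1996   1997   1998
Value    170    231    261    267    278    302    299    298    340

Answer: Since the number of years is odd, the middle year (1994) is left out and the remaining years are divided into two halves of four years each.

Year   Values   4 yearly Semi-Totals   Semi-Average
1990    170
1991    231
                       929                  232
1992    261
1993    267
1994    278
1995    302
1996    299
                      1239                  310
1997    298
1998    340

Difference between the semi-averages: (310 - 232) = 78, spread over 5 years, so the annual increment is 78/5.

Estimate for the year 1999: 310 + (5/2) × (78/5) = 349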


METHOD OF CURVE FITTING: PRINCIPLE OF LEAST SQUARES

Fitting of Linear Trend: y = a + bx
To find a and b: (i) ∑y = na + b∑x; (ii) ∑xy = a∑x + b∑x²

Fitting of a Second Degree (Parabolic) Trend: y = a + bx + cx²
To find a, b and c: (i) ∑y = na + b∑x + c∑x²; (ii) ∑xy = a∑x + b∑x² + c∑x³; (iii) ∑x²y = a∑x² + b∑x³ + c∑x⁴

Problem 3: Fit a linear trend from the following data. Estimate the production for the year 1999. Verify ∑(y - ye) = 0, where ye is the corresponding trend value of y.

Year         1990   1992   1994   1996   1998
Production     18     21     23     27     16

Answer: Let us take the year 1994 as the origin (it is convenient to take this as the mid-point as there is an odd number of years), so that x = Year - 1994.

Year    Production (y)     x     x²     xy    Trend Values (ye)   (y - ye)
1990         18           -4     16    -72         20.6             -2.6
1992         21           -2      4    -42         20.8              0.2
1994         23            0      0      0         21                2
1996         27            2      4     54         21.2              5.8
1998         16            4     16     64         21.4             -5.4
Total       105            0     40      4                           0

Fitting of Linear Trend: y = a + bx. To find a and b:

∑y = na + b∑x:      105 = a × 5 + b × 0, so a = 21
∑xy = a∑x + b∑x²:   4 = a × 0 + b × 40, so b = 0.1

Therefore the equation is: y = 21 + 0.1x

Estimated production for 1999 (x = 5): y = 21 + 0.1 × 5 = 21.5 thousands of units.
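The two normal equations for the linear trend can be solved directly, as in this Python sketch. The function name `fit_linear_trend` is illustrative; the closed-form expressions for a and b are the standard solution of the pair of normal equations.

```python
def fit_linear_trend(xs, ys):
    """Solve Σy = na + bΣx and Σxy = aΣx + bΣx² for (a, b)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    a = (sy - b * sx) / n
    return a, b


years = [1990, 1992, 1994, 1996, 1998]
production = [18, 21, 23, 27, 16]
xs = [year - 1994 for year in years]          # origin at the middle year
a, b = fit_linear_trend(xs, production)

trend_values = [a + b * x for x in xs]
residual_sum = sum(y - ye for y, ye in zip(production, trend_values))
estimate_1999 = a + b * (1999 - 1994)
print(a, b, estimate_1999)  # 21.0 0.1 21.5
```

Choosing the middle year as origin makes Σx = 0, so the equations decouple into a = Σy/n and b = Σxy/Σx², exactly as in the hand calculation.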

Problem 4: Calculate the quarterly trend values by the method of least squares for the following quarterly data for the last 5 years:

Year   I Quarter   II Quarter   III Quarter   IV Quarter
1994       60          80           72            68
1995       68         104          100            88
1996       80         116          108            96
1997      108         152          136           124
1998      160         184          172           164

Answer:

Year    Total   Average (y)    U    U²     Uy    Trend Values
1994     280        70        -2     4   -140        64
1995     360        90        -1     1    -90        88
1996     400       100         0     0      0       112
1997     520       130         1     1    130       136
1998     680       170         2     4    340       160
Total              560         0    10    240

Fitting of Linear Trend: y = a + bU. To find a and b:

∑y = na + b∑U:      560 = a × 5 + b × 0, so a = 112
∑Uy = a∑U + b∑U²:   240 = a × 0 + b × 10, so b = 24

Therefore the equation is: y = 112 + 24U

The yearly increment is 24, therefore the quarterly increment is (24/4) = 6. Each yearly trend value is centred at the middle of the year, i.e., midway between the second and third quarters. Therefore the values for the second and third quarters of 1994 are 64 - (6/2) = 61 and 64 + (6/2) = 67 respectively, and the remaining quarters follow by adding or subtracting 6.

Year   I Quarter   II Quarter   III Quarter   IV Quarter
1994       55          61           67            73
1995       79          85           91            97
1996      103         109          115           121
1997      127         133          139           145
1998      151         157          163           169
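A Python sketch of the quarterly-trend calculation follows: fit a line to the yearly averages with the middle year as origin, then spread each yearly trend value over its four quarters. The variable names and the dictionary layout are choices made for this illustration.

```python
# Quarterly data: year -> [I, II, III, IV quarter]
data = {
    1994: [60, 80, 72, 68],
    1995: [68, 104, 100, 88],
    1996: [80, 116, 108, 96],
    1997: [108, 152, 136, 124],
    1998: [160, 184, 172, 164],
}

years = sorted(data)
averages = [sum(data[y]) / 4 for y in years]   # quarterly average per year
origin = years[len(years) // 2]                # middle year, so ΣU = 0
u = [y - origin for y in years]

# With ΣU = 0 the normal equations reduce to these two expressions.
a = sum(averages) / len(years)
b = sum(ui * yi for ui, yi in zip(u, averages)) / sum(ui * ui for ui in u)

quarterly_increment = b / 4
trend = {}
for year, ui in zip(years, u):
    mid_year = a + b * ui                      # centred between quarters II and III
    trend[year] = [mid_year + quarterly_increment * k
                   for k in (-1.5, -0.5, 0.5, 1.5)]

print(a, b, trend[1994])  # 112.0 24.0 [55.0, 61.0, 67.0, 73.0]
```

The offsets -1.5, -0.5, 0.5 and 1.5 place the four quarters symmetrically around the mid-year point, which reproduces the subtraction and addition of half the quarterly increment used in the answer.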


Problem 1: Fit an equation of the form y = a + b x + c x2 to the data given below. X

1

2

3

4

5

Y

25

28

33

39

46

Answer: Take the origin at X = 3, so that x = X - 3.

X        Y      x     x^2    x^3    x^4    xY     x^2Y    Trend Values
1        25     -2    4      -8     16     -50    100     24.88
2        28     -1    1      -1     1      -28    28      28.26
3        33     0     0      0      0      0      0       32.92
4        39     1     1      1      1      39     39      38.86
5        46     2     4      8      16     92     184     46.08
Total    171    0     10     0      34     53     351
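As a cross-check on the parabolic fit for this data, here is a minimal Python sketch. The function name fit_parabola_centered is an illustrative assumption, and the shortcut it uses is valid only when x is centred so that ∑x = ∑x^3 = 0, as in the table above.

```python
def fit_parabola_centered(x, y):
    """Fit y = a + b*x + c*x**2 when sum(x) == 0 and sum(x**3) == 0."""
    if sum(x) != 0 or sum(v ** 3 for v in x) != 0:
        raise ValueError("x must be centred so that sum(x) = sum(x^3) = 0")
    n = len(x)
    s2 = sum(v ** 2 for v in x)
    s4 = sum(v ** 4 for v in x)
    sy = sum(y)
    sxy = sum(v * w for v, w in zip(x, y))
    sx2y = sum(v * v * w for v, w in zip(x, y))
    b = sxy / s2                                    # from the second normal equation
    c = (n * sx2y - s2 * sy) / (n * s4 - s2 * s2)   # eliminate a from the other two
    a = (sy - c * s2) / n
    return a, b, c


a, b, c = fit_parabola_centered([-2, -1, 0, 1, 2], [25, 28, 33, 39, 46])
```

The unrounded solution has c = 9/14, so a differs slightly from the hand calculation, which rounds c to 0.64 before computing a.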

Fitting of a Second Degree (Parabolic) Trend: y = a + bx + cx^2
∑y = na + b∑x + c∑x^2:              171 = 5a + 0b + 10c …..(i)
∑xy = a∑x + b∑x^2 + c∑x^3:          53 = 0a + 10b + 0c …..(ii)
∑x^2y = a∑x^2 + b∑x^3 + c∑x^4:      351 = 10a + 0b + 34c …..(iii)

By (ii), b = 5.3. Solving (i) and (iii) [multiply (i) by 2 and deduct that from (iii)] we get 14c = 9, so c = 0.64, and 5a = 171 - 10*0.64, so a = 32.92.
Therefore the equation is: y = 32.92 + 5.3x + 0.64x^2, where x = X - 3.

Problem 2: Fit an equation of the form y = A B^x to the data given below.

x    1      2      3       4       5
y    1.6    4.5    13.8    40.2    125

Answer:

x        y       Y = log y    xY         x^2    Trend Values
1        1.6     0.2041       0.2041     1      1.6
2        4.5     0.6532       1.3064     4      4.6
3        13.8    1.1399       3.4197     9      13.8
4        40.2    1.6042       6.4168     16     41.1
5        125     2.0969       10.4845    25     122.3
Total    15      5.6983       21.8315    55
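The exponential fit for this problem can be reproduced with a small Python sketch: take base-10 logarithms of y, fit a straight line to (x, log y), and take antilogs. The function name fit_exponential is an illustrative assumption.

```python
import math


def fit_exponential(x, y):
    """Fit y = A * B**x by least squares on Y = log10(y) = a + b*x."""
    logs = [math.log10(v) for v in y]
    n = len(x)
    sx, s_log = sum(x), sum(logs)
    sxx = sum(v * v for v in x)
    sx_log = sum(v * w for v, w in zip(x, logs))
    b = (n * sx_log - sx * s_log) / (n * sxx - sx * sx)
    a = (s_log - b * sx) / n
    return 10 ** a, 10 ** b          # A = antilog(a), B = antilog(b)


xs = [1, 2, 3, 4, 5]
ys = [1.6, 4.5, 13.8, 40.2, 125]
A, B = fit_exponential(xs, ys)
trend = [A * B ** v for v in xs]
```

Small differences from the tabulated figures are expected, because the table works with four-figure logarithms.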

Fitting of an Exponential Curve: y = A B^x …..(i)
Taking logarithms we get: log y = log A + x log B, that is, Y = a + bx …..(ii)
where Y = log y; a = log A; b = log B …..(iii)
The normal equations for (ii) are:
∑Y = na + b∑x:            5.6983 = 5a + 15b …..(iv)
∑xY = a∑x + b∑x^2:        21.8315 = 15a + 55b …..(v)

By solving (iv) and (v) we get b = 0.4737 and a = -0.2814. Taking antilogs we get A = 0.5231 and B = 2.977.
Therefore the trend equation is: y = 0.5231 * (2.977)^x

METHOD OF MOVING AVERAGES
This is a simple and flexible method of measuring trend. The moving average is an averaging process that smooths out the fluctuations (ups and downs) in the given data. The moving average of period 'm' is a series of successive averages of m overlapping values at a time, starting with the 1st, 2nd, 3rd value and so on.

Problem 3: Calculate the 5 yearly Moving Average from the data given below: 10; 14; 18; 22; 26; 30; 34; 38; 42; 46; 50

Answer:

Year    Values    5 yearly Moving Total    5 yearly Moving Average
1       10
2       14
3       18        90                       18
4       22        110                      22
5       26        130                      26
6       30        150                      30
7       34        170                      34
8       38        190                      38
9       42        210                      42
10      46
11      50
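The moving totals and moving averages in the table above can be generated with a short Python sketch. The function name moving_average is an illustrative assumption; it uses the eleven values shown in the table.

```python
def moving_average(values, m):
    """Return (moving totals, moving averages) for windows of m successive values."""
    totals = [sum(values[i:i + m]) for i in range(len(values) - m + 1)]
    averages = [total / m for total in totals]
    return totals, averages


values = [10, 14, 18, 22, 26, 30, 34, 38, 42, 46, 50]
totals, averages = moving_average(values, 5)
```

For an odd period such as 5, the first average belongs to the 3rd year, the second to the 4th year, and so on, which is why the first two and last two years have no entry.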

Problem 4: Calculate the 4 yearly Moving Average from the following data: 37.4; 31.1; 38.7; 39.5; 47.9; 42.6

Answer: Since the period is even, each 4 yearly total falls between two years, so the moving averages are centred by averaging them in pairs.

Year    Production    4 yearly Moving Total    4 yearly Moving Average    2 Period Moving Total    Centered Average
1991    37.4
1992    31.1
                      146.7                    36.675
1993    38.7                                                              75.975                   37.99
                      157.2                    39.300
1994    39.5                                                              81.475                   40.74
                      168.7                    42.175
1995    47.9
1996    42.6
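The centring step for an even period can be sketched in Python as below. The function name centered_moving_average is an illustrative assumption.

```python
def centered_moving_average(values, m):
    """Centered moving average for an even period m.

    Each m-period average falls between two time points, so adjacent
    averages are averaged in pairs to line them up with the data.
    """
    averages = [sum(values[i:i + m]) / m for i in range(len(values) - m + 1)]
    return [(averages[i] + averages[i + 1]) / 2 for i in range(len(averages) - 1)]


production = [37.4, 31.1, 38.7, 39.5, 47.9, 42.6]
centered = centered_moving_average(production, 4)
```

The two results correspond to 1993 and 1994, the only years with a full 4 yearly window on both sides.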

SEASONAL VARIATIONS
Seasonal variations are the variations due to forces which operate in a regular, periodic manner with a period of less than one year. The objectives of studying them are as follows:
o To isolate seasonal variations: To determine the effect of seasonal swings on the values of a given phenomenon.
o To eliminate them: To determine the value of the phenomenon if there were no seasonal ups and downs.

Methods:
o Method of "Simple Averages"
o "Ratio to Trend" Method
o "Ratio to Moving Averages" Method
o "Link Relative" Method

SIMPLE AVERAGES
This is the simplest method of measuring the seasonal variations in a time series and involves the following steps:
o Arrange the data by years and months (or quarters)
o Compute the average for each month (or quarter)
o Compute the overall average
o Obtain the seasonal indices for the different months (or quarters) by expressing each average as a percentage of the overall average

Problem 5: Compute the seasonal index from the data given:

Quarter    1990    1991    1992    1993    1994    1995
I          3.5     3.5     3.5     4.0     4.1     4.2
II         3.9     4.1     3.9     4.6     4.4     4.6
III        3.4     3.7     3.7     3.8     4.2     4.3
IV         3.6     4.8     4.0     4.5     4.5     4.7

Answer:

Year              I Qtr.    II Qtr.    III Qtr.    IV Qtr.
1990              3.5       3.9        3.4         3.6
1991              3.5       4.1        3.7         4.8
1992              3.5       3.9        3.7         4.0
1993              4.0       4.6        3.8         4.5
1994              4.1       4.4        4.2         4.5
1995              4.2       4.6        4.3         4.7
TOTAL             22.8      25.5       23.1        26.1
A.M.              3.8       4.25       3.85        4.35
Seasonal Index    93.6      104.7      94.8        107.1

Overall average: X = (3.8 + 4.25 + 3.85 + 4.35)/4 = 4.06
Seasonal Index = (A.M. of the quarter / X) * 100:
o I Qtr.: (3.8/4.06)*100 = 93.6
o II Qtr.: (4.25/4.06)*100 = 104.7
o III Qtr.: (3.85/4.06)*100 = 94.8
o IV Qtr.: (4.35/4.06)*100 = 107.1
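The method of simple averages can be sketched in Python as follows. The function name seasonal_index_simple_averages is an illustrative assumption, and the data are entered one row per year. The sketch does not round the overall average to 4.06, so its indices differ slightly from the hand-computed ones.

```python
def seasonal_index_simple_averages(table):
    """table holds one row per year, each row listing the seasonal values."""
    n_years = len(table)
    season_means = [sum(column) / n_years for column in zip(*table)]
    overall = sum(season_means) / len(season_means)
    return [100 * mean / overall for mean in season_means]


data = [
    [3.5, 3.9, 3.4, 3.6],   # 1990
    [3.5, 4.1, 3.7, 4.8],   # 1991
    [3.5, 3.9, 3.7, 4.0],   # 1992
    [4.0, 4.6, 3.8, 4.5],   # 1993
    [4.1, 4.4, 4.2, 4.5],   # 1994
    [4.2, 4.6, 4.3, 4.7],   # 1995
]
indices = seasonal_index_simple_averages(data)
```

Because each index is a percentage of the overall average, the four quarterly indices add up to 400.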

RATIO TO TREND
This method is an improvement over the previous method. It rests on the assumption that the seasonal fluctuation for any season is a constant factor of the trend. It involves the following steps:
o Compute the trend values by the appropriate method.
o Assuming a multiplicative model, eliminate the trend by expressing each original value as a percentage of its trend value.
o Arrange these values according to the years, months or quarters, and average them for each month or quarter.
o Adjust these seasonal indices to a total of 1200 for monthly data or 400 for quarterly data.

Problem 6: Using the "Ratio to Trend" method, determine the seasonal index.

Year    I Quarter    II Quarter    III Quarter    IV Quarter
1       68           60            61             63
2       70           58            56             60
3       68           63            68             67
4       65           59            56             62
5       60           55            51             58

Answer:

Year     Total    Average (y)    x     x^2    xy       Trend values
1        252      63.0           -2    4      -126     64.3
2        244      61.0           -1    1      -61      62.85
3        266      66.5           0     0      0        61.4
4        242      60.5           1     1      60.5     59.95
5        224      56.0           2     4      112      58.5
Total             307            0     10     -14.5

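Fitting of Linear Trend: y = a + bx
To find a and b, solve the normal equations:
∑y = na + b∑x:          307 = a*5 + b*0, so a = 61.4
∑xy = a∑x + b∑x^2:      -14.5 = a*0 + b*10, so b = -1.45

Therefore the equation will be given by: y = 61.4 - 1.45x
The yearly increment is -1.45, so the quarterly increment is -1.45/4 = -0.36, and half a quarter corresponds to -0.36/2 = -0.18. Each yearly trend value is centred between the II and III quarters of that year, so the II quarter trend value is the yearly value + 0.18 and the III quarter trend value is the yearly value - 0.18.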

Trend Values

Year    I Quarter    II Quarter    III Quarter    IV Quarter
1       64.84        64.48         64.12          63.76
2       63.39        63.03         62.67          62.31
3       61.94        61.58         61.22          60.86
4       60.50        60.14         59.78          59.42
5       59.06        58.70         58.34          57.98

Trend Eliminated Values
Trend Eliminated Value = (Given value for that quarter / Trend value for that quarter) * 100

Year                         I Quarter    II Quarter    III Quarter    IV Quarter
1                            104.9        93.05         95.13          98.81
2                            110.4        92.02         89.36          96.29
3                            109.8        102.3         111.1          110.1
4                            107.4        98.10         93.68          104.3
5                            101.6        93.70         87.42          100.03
Total                        534.1        479.2         476.7          509.6
Average                      106.8        95.84         95.33          101.9
Adjusted Seasonal Indices    106.9        95.9          95.4           101.9

Sum of the averages: 106.8 + 95.84 + 95.33 + 101.9 = 399.9 (approximately)
Therefore the Correction Factor is: 400/399.9. Each average is multiplied by this factor to give the adjusted seasonal index.

RATIO TO MOVING AVERAGES
This method is an improvement over the previous method and is widely used. It involves the following steps:
o Obtain the centered 12-month (or 4-quarter) moving average values.
o Express the original values as a percentage of the centered moving average.
o Arrange these percentages according to the years and months (or quarters), and average them for each month (or quarter).
o Adjust these indices so that they total 1200 for monthly data or 400 for quarterly data.

Problem 7: Calculate the seasonal indices.

Year    Quarter        Value    Centered Moving Average
1991    I Quarter      68
        II Quarter     62
        III Quarter    61       63.125
        IV Quarter     63       62.250
1992    I Quarter      65       62.375
        II Quarter     58       62.750
        III Quarter    66       62.875
        IV Quarter     61       63.875
1993    I Quarter      68       64.125
        II Quarter     63       64.500
        III Quarter    63
        IV Quarter     67

Answer: Ratio to Moving Averages: (61/63.125)*100 = 96.63; (63/62.250)*100 = 101.20; ….. and so on.

Trend Eliminated Values

Year                         I Quarter    II Quarter    III Quarter    IV Quarter
1991                                                    96.63          101.20
1992                         104.21       92.43         104.97         95.50
1993                         106.04       97.67
Total                        210.25       190.1         201.6          196.7
Averages                     105.13       95.05         100.80         98.35
Adjusted Seasonal Indices    105.31       95.21         100.97         98.52

Sum of the averages: 105.13 + 95.05 + 100.80 + 98.35 = 399.33, so each average is multiplied by 400/399.33 to give the adjusted seasonal index.
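The whole ratio to moving averages calculation can be sketched in Python as below. The function name ratio_to_moving_average is an illustrative assumption; the sketch also assumes that the series starts at the first season of a year and is long enough for every season to receive at least one ratio.

```python
def ratio_to_moving_average(values, period=4):
    """Seasonal indices by the ratio-to-moving-average method (even period)."""
    totals = [sum(values[i:i + period]) for i in range(len(values) - period + 1)]
    half = period // 2
    sums = [0.0] * period
    counts = [0] * period
    for i in range(len(totals) - 1):
        # Centered moving average: mean of two adjacent moving averages.
        centered = (totals[i] + totals[i + 1]) / (2 * period)
        pos = i + half                       # data point the average is centred on
        sums[pos % period] += 100 * values[pos] / centered
        counts[pos % period] += 1
    means = [s / c for s, c in zip(sums, counts)]
    factor = 100 * period / sum(means)       # adjust the indices to total 400
    return [mean * factor for mean in means]


quarterly = [68, 62, 61, 63,    # 1991
             65, 58, 66, 61,    # 1992
             68, 63, 63, 67]    # 1993
indices = ratio_to_moving_average(quarterly)
```

The returned list is ordered I, II, III, IV quarter and agrees with the adjusted seasonal indices in the table up to rounding.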

LINK RELATIVES
A link relative is the value of the given phenomenon in any season expressed as a percentage of its value in the preceding season. The method involves the following steps:
o Convert the original data into link relatives.
o Average these link relatives for each month (or quarter).
o Convert the link relatives into chain relatives.
o Obtain the new chain relative for the first month (or quarter).
o Obtain the corrected chain relatives.

Problem 8: Wheat Prices (10 Kgs.)

Quarter                    1990    1991    1992    1993
I Qtr. (Jan - Mar)         75      86      90      100
II Qtr. (Apr - June)       60      65      72      78
III Qtr. (Jul - Sept.)     54      63      66      72
IV Qtr. (Oct. - Dec.)      59      80      85      93

Answer:
Note:
Link Relative for any season = (Current season's value / Previous season's value) * 100
Chain Relative for any season = (Link Relative of that season * Chain Relative of the preceding season) / 100
New CR for the first quarter = (LR of I Qtr. * CR of last Qtr.) / 100 = (125.303 * 89.81) / 100 = 112.54
d = 1/4 (New CR of first Qtr. - 100) = 1/4 (112.54 - 100) = 3.135
Adjusted CR: 78.395 - 3.135 = 75.26; 72.69 - 6.27 = 66.42; 89.81 - 9.405 = 80.41
Seasonal Index = (Adjusted CR / 322.09) * 400

Year                I Quarter    II Quarter    III Quarter    IV Quarter    Total
1990                             80            90             109.26
1991                145.76       75.58         96.92          126.98
1992                112.5        80            91.67          128.79
1993                117.65       78            92.31          129.17
Total               375.91       313.58        370.90         494.20
Average             125.3        78.395        92.725         123.55
Chain Relative      100          78.395        72.69          89.81
Adjusted CR         100          75.26         66.42          80.41         322.09
Seasonal Indices    124.2        93.47         82.49          99.87         400
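The link relative calculation above can be sketched in Python as follows. The function name link_relative_indices is an illustrative assumption; the table is entered one row per year, and the very first value has no link relative because nothing precedes it.

```python
def link_relative_indices(table):
    """Seasonal indices by the link-relative method; table[year][season]."""
    k = len(table[0])
    flat = [value for row in table for value in row]
    links = [[] for _ in range(k)]
    for i in range(1, len(flat)):
        links[i % k].append(100 * flat[i] / flat[i - 1])
    average = [sum(values) / len(values) for values in links]
    chain = [100.0]                          # chain relative of the first season
    for s in range(1, k):
        chain.append(average[s] * chain[-1] / 100)
    new_first = average[0] * chain[-1] / 100
    d = (new_first - 100) / k                # correction per season
    adjusted = [chain[s] - s * d for s in range(k)]
    total = sum(adjusted)
    return [100 * k * value / total for value in adjusted]


wheat = [
    [75, 60, 54, 59],     # 1990
    [86, 65, 63, 80],     # 1991
    [90, 72, 66, 85],     # 1992
    [100, 78, 72, 93],    # 1993
]
indices = link_relative_indices(wheat)
```

The results agree with the tabulated seasonal indices up to the rounding used in the hand calculation.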

CYCLICAL VARIATIONS
An approximate or crude method of measuring cyclical variations consists of estimating the trend and seasonal components and then eliminating their effect from the given time series.

RANDOM VARIATIONS
These cannot be estimated accurately; we cannot obtain an estimate of the variance of the random components.

INDEX NUMBERS
An index number is an indicator of the level of a phenomenon at a specific point of time in comparison with its level at some other specific point of time. Index numbers may relate to price, production, growth rate, imports, exports, cost of living, etc. Generally, index numbers of various economic activities are found useful. For economists, index numbers are of use at every stage of planning, policy making, decision making, etc., and so index numbers may very well be called 'Economic Barometers'. Just as barometers measure atmospheric pressure, index numbers measure changes occurring in the economic field.

An index number is a statistical device designed to measure the relative level of a group of related variables over a period of time and space. In other words, it is a number which expresses the overall level of a group of related variables at a given time, called the 'Current Period', as compared to the level at some other time, called the 'Base Period'. Generally, index numbers are expressed as percentages. Thus, if the index number of wholesale prices of food articles in 1995 as compared to 1990 is 150, the implication is that the overall level of wholesale prices of food articles in 1995 is 150% of the level in 1990. Here, 1995 is the current year and 1990 is the base year.

An index number can very well be calculated for an individual variable. For instance, if the price of a commodity is Rs. 5 in 1992 and Rs. 8 in 1995, the index number of price for the year 1995 with respect to the base 1992 is P = (8/5)*100 = 160. That is, the price of the commodity in 1995 is 160% of its price in 1992. Here, since only a single variable is considered, the index number is called a 'Relative'. In this particular case, it is the 'Price Relative'. The price relative is the price in the current year expressed as a percentage of the price in the base year. If p0 and p1 are the prices of a commodity in the base year and the current year respectively, the price relative is P = (p1/p0)*100.

An index number is thus an indicator which reflects the relative change in the level of a certain phenomenon in any given period (or over a specified period of time), called the current period, with respect to its value in some fixed period, called the base period, selected for comparison.

DEFINITION
"Index Numbers are statistical devices designed to measure the relative change in the level of a phenomenon (variable or group of variables) with respect to time, geographical location or other characteristics such as income, profession etc."

Generally index numbers are of three types.
1. Price index number
2. Quantity index number
3. Value index number

Various price index numbers which are in use are the wholesale price index number, the consumer price index number, etc. A price index number may be for different groups of commodities: food articles, laboratory equipment, etc. Price Index Numbers indicate the general level of prices of articles in the current period as compared to that of the base period. Quantity Index Numbers are index numbers of the quantity of goods imported or exported, the quantity of agricultural produce, etc. Value Index Numbers are index numbers of the total money value of the transactions taking place.

Note 1: A price index of 125 means that the price level in the current year is 125% of the price level in the base year.
Note 2: If the average price level in 1990 is double the average price level in 1980, the index number of price for 1990 with base 1980 is 200.
Note 3: An index number of 325 for 1995 with base 1970 means that the average price level has increased by 225% from 1970 to 1995.

PROBLEMS IN CONSTRUCTION
o The Purpose of Index Numbers
o Selection of Commodities or Items
o Data for Index Numbers
o Selection of Base Period
o Type of Average to be used
o System of Weighting
o Choice of formula

IMPORTANT NOTATIONS
o p0: Price of the commodity in the base period
o p1: Price of the commodity in the current period
o q0: Quantity of a commodity consumed or purchased during the base period
o q1: Quantity of a commodity consumed or purchased in the current period
o w: Weight assigned to a commodity according to its relative importance in the group
o I: Simple Index Number or Price Relative, obtained by expressing the current year price as a percentage of the base year price: I = Price Relative = (p1/p0)*100
o P01: Price Index Number for the current year w.r.t. the base year
o P10: Price Index Number for the base year w.r.t. the current year
o Q01: Quantity Index Number for the current year w.r.t. the base year
o Q10: Quantity Index Number for the base year w.r.t. the current year
o V01: Value Index Number for the current year w.r.t. the base year
o p0j: Price of the jth commodity in the base year, j = 1, 2, 3 … n
o p1j: Price of the jth commodity in the current year

USES OF INDEX NUMBER
1. Index numbers are useful to governments in formulating policies regarding economic activities such as taxation, imports and exports, grant of licences to new firms, and the bank rate.
2. Index numbers are useful in comparing variations in production, price, etc.
3. Index numbers help industrialists and businessmen in planning their activities, such as the production of goods, their stock, etc.
4. The consumer price index number is used for the fixation of salaries and the grant of allowances to employees.
5. Consumer price index numbers are used for the evaluation of the purchasing power of money.

LIMITATIONS OF INDEX NUMBERS
1. While constructing index numbers, only some representative items are made use of. The index number so obtained may not indicate the changes in the concerned field accurately.
2. As customs and habits change from time to time, the use of commodities also varies. And so, it is not possible to assign proper weights to the various items.
3. Many formulae are used for the construction of index numbers. These formulae give different values for the index.
4. There is ample scope for bias in the construction of index numbers. By altering the price quotations or by improper selection of items, index numbers can be manipulated.

STEPS IN THE CONSTRUCTION OF INDEX NUMBERS
The various steps in the construction of index numbers are:
o Defining (stating) the purpose of the index number.
o Selecting the base period.
o Selecting the items.
o Obtaining price quotations.
o Selecting the appropriate system of weights.
o Selecting the appropriate formula.

1. Defining (Stating) the purpose of the index number.

At the very outset, the purpose of the index number should be decided. As different index numbers are useful for different purposes, the purpose on hand may need a particular index number. A clear definition of purpose will help in the selection of the right index number. While constructing the index number, the selection of items, base period, weights, etc. depends mainly on the purpose. Absence of a clear definition of purpose often leads to the construction of an unsuitable index number.

2. Selecting the base period.
While constructing an index number, an appropriate base period should be selected. The base period should be economically stable. There should not be abnormal variations. The period should be free of wars, floods, famines, etc. It should not be too distant from the current period. Again, the consumption pattern during the two periods should not differ much. Depending on the situation, a fixed base index number or a chain base index number may be preferred.

3. Selecting the items.
Selection of items is mainly based on the purpose of the index number. Items differ with the purpose. For example, a wholesale price index number requires items which are transacted in the wholesale market. A consumer price index number requires items which are consumed by the particular group of people. However, in a consumer price index number, items differ with habits, customs and standard of living. Generally, there are many items that could be included in the index number, but the list can be reduced by selecting representative items only.

4. Obtaining price quotations.
After selecting the items for constructing an index number, price quotations for these items should be obtained. Since price is likely to vary from place to place, it is better to obtain price quotations from different places. Also, it is advisable to obtain price quotations from different agencies. Then, the prices should be averaged. Again, prices are likely to vary during the span of the base period and also during the span of the current period. Hence, it is better to collect price quotations at regular intervals. These quotations should be averaged and the average should be used in the construction.

5. Selecting the appropriate system of weights.
Since the items considered in constructing index numbers often have varied importance, weights are attached to the items. Mostly, these weights are the quantities in the base period, those in the current period or those in any other period. Sometimes, a combination of quantities in different periods may be considered as weights.

6. Selecting the appropriate formula.
The selection of formula is based mainly on the availability of data regarding quantities. Accordingly, Laspeyre's, Paasche's, Fisher's or any other index number is calculated. While selecting the formula, care should be taken to see that maximum use is made of the available data.

PRICE INDEX NUMBER
The various price index numbers in common use are:
o Laspeyre's index number
o Paasche's index number
o Marshall-Edgeworth index number
o Fisher's ideal index number.

QUANTITY INDEX NUMBERS
Generally, quantity index numbers are calculated by adopting prices as weights. Some of the quantity index numbers are:
o Laspeyre's Quantity index number
o Paasche's Quantity index number
o Marshall-Edgeworth Quantity index number
o Fisher's ideal Quantity index number.

Tests for an Index Number
A good index number should satisfy the following tests.
1. Time reversal test
2. Factor reversal test.

Time reversal test
This test was proposed by Irving Fisher. According to him, an index number (formula) should be such that when the base year and current year are interchanged (reversed), the resulting index number is the reciprocal of the earlier one. The time reversal test requires that the index number computed backwards should be the reciprocal of the index number computed forwards, except for the constant of proportionality. Let P01 be the index number (based on a certain formula) for the period '1' with respect to the base period '0'. Let P10 be the index number (based on the same formula) for the period '0' with respect to the base period '1'. Then, the particular index number (formula) satisfies the time reversal test if P01 x P10 = 1. Here, P01 and P10 are mere ratios; they should not be expressed as percentages. The time reversal test is not satisfied by Laspeyre's and Paasche's index numbers. But it is satisfied by Marshall-Edgeworth and Fisher's ideal index numbers.

Factor Reversal Test
This test was also proposed by Irving Fisher. Here, the argument is that the index number (formula) should be such that the price index and the quantity index computed according to the formula are both equally effective in indicating changes. The factor reversal test requires that the product of the index number of price (with quantities as weights) and the index number of quantity (with prices as weights) should indicate the net change in value taking place between the two periods.

Thus, the factor reversal test is satisfied if P01 x Q01 = ∑p1q1 / ∑p0q0. Here, P01 and Q01 are mere ratios; they should not be expressed as percentages. Fisher's index number satisfies the factor reversal test. But Laspeyre's, Paasche's and Marshall-Edgeworth index numbers do not satisfy this test.

BIAS IN AN INDEX NUMBER
Generally, if the price of a commodity shows a significantly high increase, its use will decrease. The consumers lessen the use of such commodities. Thus, if base year quantities are used as weights, the greater variation of price will get greater weightage than needed. Therefore such an index number will be an overestimate of the actual situation. Thus, Laspeyre's index number, which uses the base year quantities as weights, is generally an overestimate. It shows upward bias. On the other hand, if current year quantities are used as weights, the greater variations will be given lesser importance than needed. This leads to a downward bias. Thus, Paasche's index number, which uses current year quantities as weights, is generally an underestimate. It shows downward bias. However, Fisher's and Marshall-Edgeworth index numbers make use of base as well as current year quantities, and so they are free of bias.

FISHER'S INDEX NUMBER IS 'IDEAL'
Fisher's index number is called the 'Ideal Index Number' because of the following reasons.
o It is a geometric mean, which is considered the appropriate average for averaging ratios.
o It takes into account the base year quantities as well as the current year quantities.
o It is free of bias.
o It satisfies both the time reversal and factor reversal tests.

CONSUMER PRICE INDEX NUMBER
The Consumer Price Index Number is an index number of the cost met by a specified class of consumers in buying a 'basket of goods and services'. Here, 'basket of goods and services' means the goods and services needed in the day to day life of the specified class of consumers. The pattern of consumption of goods is different in different classes. And so, the general index numbers fail to indicate the changes in costs with regard to various classes of consumers. Here, 'class of consumers' means a group of consumers having an almost identical pattern of consumption. Generally, the classes are those of workers of a factory, people belonging to a particular community, government employees, etc.

USES OF CONSUMER PRICE INDEX NUMBERS
1. Consumer Price Index Numbers indicate the changes in consumer prices. And so, they help governments in formulating policies regarding control of prices, taxation, imports and exports of commodities, etc.
2. They are used in granting allowances and other facilities to employees.
3. They are used for the evaluation of the purchasing power of money. They are used for deflating money.
4. They are used for comparing changes in the cost of living of different classes of people.

STEPS IN THE CONSTRUCTION OF CONSUMER PRICE INDEX NUMBER
The steps in the construction of a consumer price index number are:

1. Defining Scope and Coverage
At the very outset, it is necessary to decide the class of consumers for which the index number is required. The class may be that of bank employees, government employees, merchants, farmers, etc. In any case, the geographical coverage should also be decided. That is, the locality, city or town where the class dwells should be mentioned. In any event, the consumers in the class should have almost the same pattern of consumption.

2. Conducting family budget enquiry and selecting the weights.
Having decided on the scope and coverage, the next step is to conduct a sample survey of consumer families regarding their budget on various items. The survey should cover a reasonably good number of representative families. It should be conducted during

a period of economic stability. In the survey, information regarding the commodities consumed by the families, their quality, and the respective budget is collected. The items included in the index number are generally classified under the heads (1) Food, (2) Clothing, (3) Fuel and lighting, (4) Miscellaneous. A sufficiently large number of representative items is included under each head.

3. Obtaining price quotations
The quotations of retail prices of the different commodities are collected from the local market. The quotations are collected from different agencies and from different places. Then, they are averaged and the averages are made use of. The price quotations of the current period and those of the base period should be collected.

4. Computing the index number.
There are two methods of computation of the consumer price index number. They are:
a. Aggregative expenditure method.
b. Family budget method.

Aggregative expenditure method: Here the quantities used in the base year are taken as weights. Thus, the consumer price index number by this method is:
P01 = (Total expenditure in the current year / Total expenditure in the base year) x 100

Family budget method: The consumer price index number by this method is the weighted arithmetic mean of the price relatives. The weights assigned are the expenditures in a normal period. Thus, the consumer price index number is:
P01 = (∑WI / ∑W), where W = p0q0 and I = (p1/p0) x 100

METHODS (along with formulas)
o Simple (Unweighted) Aggregate Method:

P01 = (∑p1 / ∑p0) * 100 and Q01 = (∑q1 / ∑q0) * 100
o Weighted Aggregate Method: P01 = (∑wp1 / ∑wp0) * 100
o Laspeyre's Price Index or Base Year Method: P01(La) = (∑p1q0 / ∑p0q0) * 100
o Paasche's Price Index: P01(Pa) = (∑p1q1 / ∑p0q1) * 100
o Fisher's Price Index: P01(F) = [P01(La) * P01(Pa)]^(1/2)
o Marshall-Edgeworth Price Index Number: P01(Ma) = [(∑p1q1 + ∑p1q0) / (∑p0q1 + ∑p0q0)] * 100

Problem 1: From the following compute Price Index Numbers using all four methods.

Commodities    1970 Price    1970 Quantity    1980 Price    1980 Quantity
A              20            8                40            6
B              50            10               60            5
C              40            15               50            15
D              20            20               20            25

Answer:

Commodities    1970 p0    1970 q0    1980 p1    1980 q1    p0q0    p0q1    p1q0    p1q1
A              20         8          40         6          160     120     320     240
B              50         10         60         5          500     250     600     300
C              40         15         50         15         600     600     750     750
D              20         20         20         25         400     500     400     500
Total                                                      1660    1470    2070    1790

Laspeyre's Index Number: (∑p1q0 / ∑p0q0) * 100 = (2070 / 1660) * 100 = 124.699
Paasche's Index Number: (∑p1q1 / ∑p0q1) * 100 = (1790 / 1470) * 100 = 121.77
Fisher's Ideal Index Number: [P01(La) * P01(Pa)]^(1/2) = [124.699 * 121.77]^(1/2) = 123.23
Marshall-Edgeworth Index Number: [(∑p1q1 + ∑p1q0) / (∑p0q1 + ∑p0q0)] * 100 = [(1790 + 2070) / (1470 + 1660)] * 100 = 123.32
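The four price index numbers for this problem can be computed with a short Python sketch. The names aggregate and price_indices are illustrative assumptions; the formulas are the ones listed under METHODS.

```python
from math import sqrt


def aggregate(p, q):
    """Sum of price * quantity over all commodities."""
    return sum(price * qty for price, qty in zip(p, q))


def price_indices(p0, q0, p1, q1):
    """Laspeyre's, Paasche's, Fisher's and Marshall-Edgeworth price indices."""
    p0q0, p0q1 = aggregate(p0, q0), aggregate(p0, q1)
    p1q0, p1q1 = aggregate(p1, q0), aggregate(p1, q1)
    laspeyres = 100 * p1q0 / p0q0
    paasche = 100 * p1q1 / p0q1
    return {
        "laspeyres": laspeyres,
        "paasche": paasche,
        "fisher": sqrt(laspeyres * paasche),
        "marshall_edgeworth": 100 * (p1q0 + p1q1) / (p0q0 + p0q1),
    }


result = price_indices(
    p0=[20, 50, 40, 20], q0=[8, 10, 15, 20],    # 1970 prices and quantities
    p1=[40, 60, 50, 20], q1=[6, 5, 15, 25],     # 1980 prices and quantities
)
```

Fisher's index always lies between Laspeyre's and Paasche's, since it is their geometric mean.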

Problem 2: From the following, construct the index number of the group of four commodities by using Fisher's Ideal method.

Commodities    Base Year Price    Base Year Expenditure    Current Year Price    Current Year Expenditure
A              2                  40                       5                     75
B              4                  16                       8                     40
C              1                  10                       2                     24
D              5                  25                       10                    60

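Answer: The expenditures are p0q0 (base year) and p1q1 (current year), so the quantities are q0 = p0q0/p0 and q1 = p1q1/p1.

Commodities    p0    p0q0    q0    p1    p1q1    q1    p1q0    p0q1
A              2     40      20    5     75      15    100     30
B              4     16      4     8     40      5     32      20
C              1     10      10    2     24      12    20      12
D              5     25      5     10    60      6     50      30
Total                91                  199           202     92

Fisher's Ideal Price Index:
P01(F) = [(∑p1q0 / ∑p0q0) * (∑p1q1 / ∑p0q1)]^(1/2) * 100 = [(202 * 199) / (91 * 92)]^(1/2) * 100 = 219.12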

TEST OF CONSISTENCY
o Unit Test: This test requires that the index number formula should be independent of the units in which the prices or the quantities of the various commodities are quoted. All the formulas discussed earlier, other than the Simple Aggregate of Prices (Quantities), satisfy this test.
o Time Reversal Test: P01 * P10 = 1. Other than Laspeyre's and Paasche's Index Numbers, all others satisfy this test.
o Factor Reversal Test: P01 * Q01 = ∑p1q1 / ∑p0q0
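The time reversal and factor reversal tests can be checked numerically with a Python sketch such as the one below. All function names are illustrative assumptions, the indices are kept as ratios (not percentages), and the data reused here are the 1970 and 1980 prices and quantities of Problem 1 above.

```python
from math import isclose, sqrt


def aggregate(p, q):
    return sum(a * b for a, b in zip(p, q))


def laspeyres(p0, q0, p1, q1):
    return aggregate(p1, q0) / aggregate(p0, q0)


def paasche(p0, q0, p1, q1):
    return aggregate(p1, q1) / aggregate(p0, q1)


def fisher(p0, q0, p1, q1):
    return sqrt(laspeyres(p0, q0, p1, q1) * paasche(p0, q0, p1, q1))


def time_reversal(index, p0, q0, p1, q1):
    """True if P01 * P10 = 1 for the given index formula."""
    return isclose(index(p0, q0, p1, q1) * index(p1, q1, p0, q0), 1.0)


def factor_reversal(index, p0, q0, p1, q1):
    """True if P01 * Q01 equals the value ratio sum(p1q1) / sum(p0q0)."""
    price = index(p0, q0, p1, q1)
    quantity = index(q0, p0, q1, p1)     # same formula with p and q interchanged
    return isclose(price * quantity, aggregate(p1, q1) / aggregate(p0, q0))


data = ([20, 50, 40, 20], [8, 10, 15, 20], [40, 60, 50, 20], [6, 5, 15, 25])
checks = {f.__name__: (time_reversal(f, *data), factor_reversal(f, *data))
          for f in (laspeyres, paasche, fisher)}
```

On this data only Fisher's formula passes both checks, in line with the statements above.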

Problem 3: From the following, check whether (i) Laspeyre's (ii) Paasche's (iii) Fisher's Index Numbers satisfy the Time and Factor Reversal Tests.

Commodities    Base Year Price    Base Year Quantity    Current Year Price    Current Year Quantity
A              6.5                500                   10.8                  560
B              2.8                124                   2.9                   148
C              4.7                69                    8.2                   78
D              10.9               38                    13.4                  24
E              8.6                49                    10.8                  27

Answer:

Commodities    p0q0      p1q0      p0q1      p1q1
A              3250      5400      3640      6048
B              347.2     359.6     414.4     429.2
C              324.3     565.8     366.6     639.6
D              414.2     509.2     261.6     321.6
E              421.4     529.2     232.2     291.6
Total          4757.1    7363.8    4914.8    7730

Laspeyre's Price Index Number: (7363.8 / 4757.1) * 100 = 154.80
Paasche's Price Index Number: (7730 / 4914.8) * 100 = 157.28
Laspeyre's Quantity Index Number: (4914.8 / 4757.1) * 100 = 103.32
Paasche's Quantity Index Number: (7730 / 7363.8) * 100 = 104.97
Fisher's Ideal Price Index Number: 156.03
Fisher's Ideal Quantity Index Number: 104.14

By trial we can find that Fisher's Index Number satisfies both the tests.

Problem 4: From the following calculate the Cost of Living Index Number.

Commodities    Base Year Price    Current Year Price    Weights
A              30                 47                    4
B              8                  12                    1
C              14                 18                    3
D              22                 15                    2
E              25                 30                    1

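Answer: P is the price relative, P = (p1/p0) * 100, and W is the weight.

Commodities    P         W     WP
A              156.67    4     626.67
B              150       1     150
C              128.57    3     385.71
D              68.18     2     136.36
E              120       1     120
Total                    11    1418.74

Cost of Living Index Number = ∑WP / ∑W = 1418.74/11 = 128.98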
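The weighted average of price relatives used in this answer can be sketched in Python as follows. The function name cost_of_living_index is an illustrative assumption.

```python
def cost_of_living_index(base_prices, current_prices, weights):
    """Weighted arithmetic mean of the price relatives (p1/p0) * 100."""
    relatives = [100 * p1 / p0 for p0, p1 in zip(base_prices, current_prices)]
    weighted_sum = sum(w * r for w, r in zip(weights, relatives))
    return weighted_sum / sum(weights)


index = cost_of_living_index(
    base_prices=[30, 8, 14, 22, 25],
    current_prices=[47, 12, 18, 15, 30],
    weights=[4, 1, 3, 2, 1],
)
```

An index of about 129 indicates that the cost of living of this group has risen by roughly 29% over the base year.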

Notes by: Prof.Sudheer Pai, RNSIT, Bangalore

PROBABILITY

INTRODUCTION
Suppose a coin is tossed. The toss may result in the occurrence of 'Head' or in the occurrence of 'Tail'. Here, the chances of head and tail are equal. In other words, the probability of occurrence of head is ½ and the probability of occurrence of tail is ½. Thus, probability is a numerical measure which indicates the chance of occurrence.

There are three systematic approaches to the study of probability. They are
1. The classical approach
2. The empirical approach
3. The axiomatic approach
Each of these approaches has its own merits and demerits.

Chance has a part to play in almost all activities. In every such activity, there is indefiniteness. For example,
1. A new-born child may be male or female.
2. A stone aimed at a mango on a tree may hit it or it may not.
3. A student who takes the P.U.E. examination may score any mark between 0 and 100.

In the midst of such indefiniteness, predictions are made. This necessitates a systematic study of probabilistic happenings.

RANDOM EXPERIMENT (Stochastic experiment, Trial)
There are two types of experiments. They are (i) Deterministic experiment and (ii) Random experiment. A deterministic experiment, when repeated under the same conditions, results in the same outcome. It has a unique outcome. A random experiment is an experiment which may not result in the same outcome when repeated under the same conditions. It is an experiment which does not have a unique outcome. For example,
1. The experiment of 'Toss of a coin' is a random experiment. It is so because when a coin is tossed the result may be 'Head' or it may be 'Tail'.
2. The experiment of 'Drawing a card randomly from a pack of playing cards' is a random experiment. Here, the result of the draw may be any one of the 52 cards.

SAMPLE SPACE The set of all possible outcomes of a random experiment is the Sample space. The sample space is denoted by S. The outcomes of the random experiment (elements of the sample space) are called sample points or outcomes or cases. A sample space with finite number of outcomes is a finite sample space. A sample space with infinite number of outcomes is an infinite sample space. Ex1. While throwing a die, the sample space is S = {1, 2, 3, 4, 5, 6}. This is a finite sample space. Ex2. While tossing two coins simultaneously, the sample space is S = {HH, HT, TH, TT}. This is a finite sample space. Ex3. Consider the toss of a coin successively until a head is obtained. Let the number of tosses be noted. Here, the sample space is S= {1, 2, 3,4....}.This is an infinite sample space.

EVENT
An event is a subset of the sample space. Events are denoted by A, B, C etc. An event which does not contain any outcome is a null event (impossible event). It is denoted by Φ. An event which has only one outcome is an elementary event or simple event. An event which has more than one outcome is a compound event. An event which contains all the outcomes is equal to the sample space and is called the sure event or certain event.
Ex.1. While throwing a die, A = {2,4,6} is an event. It is the event that the throw results in an even number. Here, A is a compound event.
Ex.2. While tossing two coins, A = {TT} is an event. It is the event that the toss results in two tails. Here, A is a simple event.
The outcomes which belong to an event are said to be favourable to that event. The event happens whenever the experiment results in a favourable outcome. Otherwise, the event does not happen. While throwing a die, the event A = {2,4,6} has three favourable outcomes, namely, 2, 4 and 6. When the throw results in 2, 4 or 6, event A occurs.

COMPLEMENT OF AN EVENT


Let A be an event. Then, the complement of A is the event of non-occurrence of A. It is the event constituted by the outcomes which are not favourable to A. The complement of A is denoted by A′ or Ā or Ac. While throwing a die, if A = {2,4,6}, its complement is A′ = {1,3,5}. Here, A is the event that the throw results in an even number. A′ is the event that the throw does not result in an even number. That is, A′ is the event that the throw results in an odd number.

SUB-EVENT. Let A and B be two events such that event A occurs whenever event B occurs. Then, event B is sub-event of event A. While throwing a die, let A = {2,4,6} and B = {2}. Here, B is a sub-event of event A. That is, B ⊂ A.

UNION OF EVENTS. Definition: Union of two or more events is the event of occurrence of at least one of these events. Thus, union of two events A and B is the event of occurrence of at least one of them. The union of A and B is denoted by A∪B or A+B or A or B. Ex1. While tossing two coins simultaneously, let A = {HH} and B = {TT} be two events. Then, their union is A∪B = {HH, TT}. Here, A is the event of occurrence of two heads and B is the event of occurrence of two tails. Their union A∪B is the event of occurrence of two heads or two tails. Ex2. While throwing a die, let A = {2,4,6}, B = {3,6} and C = {4,5,6} be three events. Then, their union is A∪B∪C = {2,3,4,5,6}.

INTERSECTION OF EVENTS Intersection of two or more events is the event of simultaneous occurrence of all these events. Thus, intersection of two events A and B is the event of occurrence of both of them. The intersection of A and B is denoted by A∩B or AB or A and B. Ex1. While tossing two coins, let A = {HH, TT} and B = {HH, HT, TH} be two events. Then, their intersection is A∩B = {HH}. Ex2. While throwing a die, let A = {2,4,6}, B = {3,6} and C = {4,5,6} be three events. Then, their intersection is A∩B∩C = {6}.
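The complement, sub-event, union and intersection described above map directly onto set operations. The following is a minimal sketch in Python, reusing the die events from the examples; the variable names are chosen here for illustration only.

```python
# Events for the throw of a die, represented as Python sets.
S = {1, 2, 3, 4, 5, 6}         # sample space
A = {2, 4, 6}                  # throw results in an even number
B = {3, 6}                     # throw results in a multiple of 3
C = {4, 5, 6}

complement_A = S - A           # A' : outcomes not favourable to A
union_ABC = A | B | C          # at least one of A, B, C occurs
intersection_ABC = A & B & C   # all of A, B, C occur together
is_sub_event = {2} <= A        # {2} is a sub-event of A

print(complement_A)            # {1, 3, 5}
print(union_ABC)               # {2, 3, 4, 5, 6}
print(intersection_ABC)        # {6}
print(is_sub_event)            # True
```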


EQUALLY LIKELY EVENTS (Equiprobable events) Two or more events are equally likely if they have equal chance of occurrence. That is, equally likely events are such that none of them has greater chance of occurrence than the others. Ex. 1. While tossing a fair coin, the outcomes 'Head' and 'Tail' are equally likely. Ex.2. While throwing a fair die, the events A={2,4,6}, B = {1,3, 5}&C={ 1,2, 3} are equally likely. A sample space is called an equiprobable space if the outcomes are equally likely. For instance, the sample space S = {1, 2, 3, 4, 5, 6} of throw of a fair die is equiprobable space because the six outcomes are equally likely.

MUTUALLY EXCLUSIVE EVENTS (Disjoint events) Two or more events are mutually exclusive if only one of them can occur at a time. That is, the occurrence of any of these events totally excludes the occurrence of the other events. Mutually exclusive events cannot occur together. Ex. 1. While tossing a coin, the outcomes 'Head' and 'Tail' are mutually exclusive because when the coin is tossed once, the result cannot be Head as well as Tail. Ex.2. While throwing a die, the events A = {2, 4, 6}, B = {3, 5} and C = {1} are mutually exclusive. If A is an event, A and A' are mutually exclusive. It should be noted that the intersection of mutually exclusive events is a null event.

EXHAUSTIVE EVENTS (Exhaustive set of events) A set of events is exhaustive if one or the other of the events in the set occurs whenever the experiment is conducted. That is, the set of events exhausts all the outcomes of the experiment. The union of exhaustive events is equal to the sample space. Ex.1. While throwing a die, the six outcomes together are exhaustive. But here, if any one of these outcomes is left out, the remaining five outcomes are not exhaustive. Ex.2. While throwing a die, events A = {2, 4, 6}, B = {3, 6} and C = {1, 5, 6} together are exhaustive.
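The two definitions above can be checked mechanically. The sketch below is only an illustration: the helper names mutually_exclusive and exhaustive are hypothetical and do not come from the text. It tests the die examples given above.

```python
from itertools import combinations

def mutually_exclusive(events):
    """True if no two of the events share an outcome."""
    return all(not (a & b) for a, b in combinations(events, 2))

def exhaustive(events, sample_space):
    """True if the union of the events equals the sample space."""
    return set().union(*events) == set(sample_space)

S = {1, 2, 3, 4, 5, 6}

# Ex.2 of mutually exclusive events: A = {2,4,6}, B = {3,5}, C = {1}
print(mutually_exclusive([{2, 4, 6}, {3, 5}, {1}]))        # True

# Ex.2 of exhaustive events: A = {2,4,6}, B = {3,6}, C = {1,5,6}
print(exhaustive([{2, 4, 6}, {3, 6}, {1, 5, 6}], S))       # True

# The same three events overlap (all contain 6), so they are not mutually exclusive.
print(mutually_exclusive([{2, 4, 6}, {3, 6}, {1, 5, 6}]))  # False
```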


THE CLASSICAL APPROACH CLASSICAL (MATHEMATICAL, A PRIORI) DEFINITION Let a random experiment have n equally likely, mutually exclusive and exhaustive outcomes. Let m of these outcomes be favourable to an event A. Then, probability of A is

P(A) = m/n = (Number of favourable outcomes) / (Total number of outcomes)

Limitations of classical definition: This definition is applicable only when (i) The outcomes are equally likely, mutually exclusive and exhaustive. (ii) The number of outcomes n is finite.
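As a small illustration of the classical definition, the ratio m/n can be computed by counting. This is a sketch that assumes a finite, equiprobable sample space; the function name classical_probability is chosen here and is not part of the text.

```python
from fractions import Fraction

def classical_probability(event, sample_space):
    """P(A) = m/n, assuming a finite sample space of equally likely outcomes."""
    sample_space = set(sample_space)
    favourable = set(event) & sample_space   # the m favourable outcomes
    return Fraction(len(favourable), len(sample_space))

die = range(1, 7)
print(classical_probability({2, 4, 6}, die))   # 1/2
print(classical_probability({3, 6}, die))      # 1/3
print(classical_probability({1}, die))         # 1/6
```

Using Fraction keeps the answers exact, in the same form as the worked exercises.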

RESULT 1 P(A) is a value between 0 and 1. That is, 0 ≤ P(A) ≤ 1. Proof: Let a random experiment have n equally likely, mutually exclusive and exhaustive outcomes. Let m of these outcomes be favourable to event A. Then, P(A) = m/n. Here, the least possible value of m is 0. Also, the highest possible value of m is n. And so, 0 ≤ m ≤ n. Dividing throughout by n, 0/n ≤ m/n ≤ n/n ⇒ 0 ≤ P(A) ≤ 1. Thus, P(A) is a value between 0 and 1.

RESULT 2 P(A') = 1 - P(A). That is, P(A) = 1 - P(A'). Proof: In a random experiment with n equally likely, mutually exclusive and exhaustive outcomes, if m outcomes are favourable to event A, the remaining (n-m) outcomes are favourable to the complementary event A'. Therefore, P(A') = (n - m)/n = 1 - m/n = 1 - P(A). Thus, P(A') = 1 - P(A). That is, P(A) = 1 - P(A').

Exercise 1: a. Find the probability of head in the toss of a fair coin. Solution: The sample space is S = {H, T}. There are n = 2 equally likely, mutually exclusive and exhaustive outcomes. One outcome, namely H, is favourable to the event 'A : toss results in head'. Thus, m = 1.

∴ P[head] = P(A) = m/n = ½

b. Find the probability that a throw of an unbiased die results in (i) an ace (number 1) (ii) an even number (iii) a multiple of 3. Solution: The sample space is S = {1,2,3,4,5,6}. There are n = 6 equally likely, mutually exclusive and exhaustive outcomes. Let events A, B and C be: A : throw results in an ace (number 1) B : throw results in an even number C : throw results in a multiple of 3 (i) Event A has one favourable outcome. ∴ P[ace] = P(A) = m/n = 1/6 (ii) Event B has 3 favourable outcomes, namely, 2, 4 and 6. ∴ P[even number] = P(B) = m/n = 3/6 = ½ (iii) Event C has 2 favourable outcomes, namely, 3 and 6. ∴ P[multiple of 3] = P(C) = m/n = 2/6 = 1/3

c. A bag contains 3 white, 4 red and 2 green balls. One ball is selected at random from the bag. Find the probability that the selected ball is (i) white (ii) non-white (iii) white or green. Solution: The bag has 9 balls in total. Since the ball drawn can be any one of them, there are 9 equally likely, mutually exclusive and exhaustive outcomes. Let events A, B and C be: A : selected ball is white B : selected ball is non-white C : selected ball is white or green (i) There are 3 white balls in the bag. Therefore, out of the 9 outcomes, 3 are favourable to event A.

∴ P[white ball] = P(A) = 3/9 = 1/3

(ii) Event B is the complement of event A. Therefore,

∴ P[non-white ball] = P(B) = 1 - P(A) = 1 - 1/3 = 2/3

(iii) There are 3 white and 2 green balls in the bag. Therefore, out of 9 outcomes, 5 are either white or green. ∴ P[white or green ball] = P(C) = 5/9

d. One card is drawn from a well-shuffled pack of playing cards. Find the probability that the card drawn (i) is a Heart (ii) is a King (iii) belongs to a red suit (iv) is a King or a Queen (v) is a King or a Heart. Solution: A pack of playing cards has 52 cards. There are four suits, namely, Spade, Club, Heart and Diamond (Dice). In each suit, there are thirteen denominations: Ace (1), 2, 3, ..., 10, Jack (Knave), Queen and King. A card selected at random may be any one of the 52 cards. Therefore, there are 52 equally likely, mutually exclusive and exhaustive outcomes. Let events A, B, C, D and E be: A : selected card is a Heart B : selected card is a King C : selected card belongs to a red suit D : selected card is a King or a Queen E : selected card is a King or a Heart (i) There are 13 Hearts in a pack. Therefore, 13 outcomes are favourable to event A. ∴ P[Heart] = P(A) = 13/52 = ¼ (ii) There are 4 Kings in a pack. Therefore, 4 outcomes are favourable to event B.

∴P[King] = P(B)=4/52 =1/13 (iii)There are 13 Hearts and 13 Diamonds in a pack. Therefore, 26 outcomes are favourable to event C.

∴ P [Red card] = P(C) =26/52 = ½ (iv) There are 4 Kings and 4 Queens in a pack. Therefore, 8 outcomes are favourable to event D. ∴ P[King or Queen] = P(D) = 8/52 = 2/13 (v) There are 4 Kings and 13 Hearts in a pack. Among these, one card is Heart-King. Therefore, (4+13-1) = 16 outcomes are favourable to event E. ∴ P[King or Heart] = P(E) =16/52 = 4/13
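The card probabilities above can be cross-checked by listing all 52 cards and counting the favourable ones. This is only a sketch; the representation of a card as a (rank, suit) pair and the helper name prob are assumptions made for the illustration.

```python
from fractions import Fraction
from itertools import product

suits = ["Spade", "Club", "Heart", "Diamond"]
ranks = ["Ace", "2", "3", "4", "5", "6", "7", "8", "9", "10",
         "Jack", "Queen", "King"]
deck = list(product(ranks, suits))           # 52 (rank, suit) pairs

def prob(predicate):
    """Classical probability: favourable cards / total cards."""
    favourable = [card for card in deck if predicate(card)]
    return Fraction(len(favourable), len(deck))

p_heart = prob(lambda c: c[1] == "Heart")
p_king = prob(lambda c: c[0] == "King")
p_red = prob(lambda c: c[1] in ("Heart", "Diamond"))
p_king_or_queen = prob(lambda c: c[0] in ("King", "Queen"))
p_king_or_heart = prob(lambda c: c[0] == "King" or c[1] == "Heart")

print(p_heart, p_king, p_red, p_king_or_queen, p_king_or_heart)
# 1/4 1/13 1/2 2/13 4/13
```

Note that the King of Hearts is counted only once in the last event, which matches the (4 + 13 - 1) = 16 favourable outcomes in the solution.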


e. A bag contains 8 tickets which are marked with the numbers 1, 2, 3, ..., 8. Find the probability that a ticket drawn at random from the bag is marked with (i) an even number (ii) a multiple of 3. Solution: The selection can be any one of the eight numbers. Therefore, there are 8 equally likely, mutually exclusive and exhaustive outcomes. Let events A and B be: A : selected number is even. B : selected number is a multiple of 3. (i) Four of the selections, namely, 2, 4, 6 and 8 are favourable to event A. ∴ P[even number] = P(A) = 4/8 = ½ (ii) Two of the selections, namely, 3 and 6 are favourable to event B. ∴ P[multiple of 3] = P(B) = 2/8 = ¼.

Exercise 2:
a. A fair coin is tossed twice. Find the probability that the tosses result in (i) two heads (ii) at least one head.
b. Two fair dice are rolled. Find the probability that (i) both the dice show number 6 (ii) the sum of numbers obtained is 7 or 10 (iii) the sum of the numbers obtained is less than 11 (iv) the sum is divisible by 3.
c. A box has 5 white, 4 red and 3 green balls. Two balls are drawn at random from the box. Find the probability that they are (i) of the same colour (ii) of different colours.
d. Two cards are drawn at random from a pack of cards. Find the probability that (i) both are Spades (ii) both are Kings (iii) one is a Spade and the other is a Heart (iv) the cards belong to the same suit (v) the cards belong to different suits.
e. A bag has 9 tickets marked with numbers 1, 2, 3, ..., 9. Two tickets are drawn at random from the bag. Find the probability that both the numbers drawn are (i) even (ii) odd.

Solution:


a. The sample space is S = {HH, HT, TH, TT}. There are four equally likely, mutually exclusive and exhaustive outcomes. Let events A and B be: A : the tosses result in 2 heads B : the tosses result in at least one head. (i) One outcome, HH, is favourable to event A. ∴ P[two heads] = P(A) = ¼ (ii) 3 outcomes, HH, HT and TH, are favourable to event B. ∴ P[at least one head] = P(B) = ¾

b. The sample space is
S = {(1,1), (1,2), (1,3), ..., (1,6),
(2,1), (2,2), (2,3), ..., (2,6),
...
(6,1), (6,2), (6,3), ..., (6,6)}
There are 6 x 6 = 36 equally likely, mutually exclusive and exhaustive outcomes. Let events A, B, C and D be: A : both the dice show number 6 B : sum of the numbers obtained is 7 or 10 C : sum of the numbers obtained is less than 11 D : sum of the numbers obtained is divisible by 3. (i) One outcome, namely, (6,6) is favourable to event A. ∴ P[6 on both the dice] = P(A) = 1/36 (ii) Nine outcomes, namely, (6,1), (5,2), (4,3), (3,4), (2,5), (1,6), (6,4), (5,5) and (4,6) are favourable to event B.

∴ P[sum is 7 or 10] = P(B) = 9/36 = ¼ (iii) The complement of event C is C' : sum is 11 or 12. Event C' has three favourable outcomes, namely, (6,5), (5,6) and (6,6). ∴ P[sum is less than 11] = 1 - P[sum is 11 or 12] = 1 - 3/36 = 1 - 1/12 = 11/12 (iv) The sum is divisible by 3 if it is 3, 6, 9 or 12. Therefore, the outcomes favourable to event D are (2,1), (1,2), (5,1), (4,2), (3,3), (2,4), (1,5), (6,3), (5,4), (4,5), (3,6) and (6,6). Thus, 12 outcomes are favourable. ∴ P[sum is divisible by 3] = P(D) = 12/36 = 1/3.
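The four answers for the two-dice problem can be verified by listing all 36 outcomes and counting. The sketch below assumes fair dice, as in the exercise; the helper name prob is chosen here for illustration.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes (first die, second die).
outcomes = list(product(range(1, 7), repeat=2))

def prob(predicate):
    """Favourable outcomes divided by total outcomes."""
    m = sum(1 for o in outcomes if predicate(o))
    return Fraction(m, len(outcomes))

p_both_six = prob(lambda o: o == (6, 6))
p_sum_7_or_10 = prob(lambda o: sum(o) in (7, 10))
p_sum_below_11 = prob(lambda o: sum(o) < 11)
p_sum_div_by_3 = prob(lambda o: sum(o) % 3 == 0)

print(p_both_six, p_sum_7_or_10, p_sum_below_11, p_sum_div_by_3)
# 1/36 1/4 11/12 1/3
```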


c. The box has 12 balls in total. A random draw of two balls has 12C2 equally likely, mutually exclusive and exhaustive outcomes. Let events A and B be: A : the balls drawn are of the same colour B : the balls drawn are of different colours. (i) Event A happens when the drawn balls are both white or both red or both green. Out of 12C2 selections, 5C2 selections are both white, 4C2 selections are both red and 3C2 selections are both green. Thus, 5C2 + 4C2 + 3C2 outcomes are favourable to event A.
P[balls of same colour] = P(A) = (5C2 + 4C2 + 3C2) / 12C2 = (10 + 6 + 3)/66 = 19/66 = 0.2879
(ii) Event B is the complement of event A. Therefore, P[balls of different colours] = 1 - P[same colour] = 1 - P(A) = 1 - 19/66 = 47/66

d. A random draw of 2 cards from a pack of 52 cards has 52C2 equally likely, mutually exclusive and exhaustive outcomes. Let events A, B, C, D and E be: A : both the cards drawn are Spades. B : both the cards drawn are Kings. C : the cards drawn are one Spade and one Heart. D : the cards belong to the same suit. E : the cards belong to different suits.
(i) Since there are 13 Spades in a pack, event A has 13C2 favourable outcomes. Therefore,
P[both Spades] = 13C2 / 52C2 = (13 x 6)/(26 x 51) = 1/17
(ii) Since there are 4 Kings in a pack, event B has 4C2 favourable outcomes. Therefore,
P[both Kings] = 4C2 / 52C2 = 6/(26 x 51) = 1/221
(iii) Here, one card should be a Spade and the other should be a Heart. From 13 Spades, one Spade can be had in 13C1 ways. From 13 Hearts, one Heart can be had in 13C1 ways. Thus, 13C1 x 13C1 outcomes are favourable to event C. Therefore,
P[a Spade and a Heart] = (13C1 x 13C1) / 52C2 = (13 x 13)/(26 x 51) = 13/102
(iv) Here, the cards should be 2 Spades or 2 Clubs or 2 Hearts or 2 Diamonds. There are 13 cards of each suit. In each case, a selection of two cards can be made in 13C2 ways. Thus, the total number of favourable cases is 13C2 + 13C2 + 13C2 + 13C2.
P[cards of same suit] = (13C2 + 13C2 + 13C2 + 13C2) / 52C2 = (4 x 78)/(26 x 51) = 4/17
(v) Event E is the complement of event D. Therefore, P[cards of different suits] = 1 - P[cards of same suit] = 1 - 4/17 = 13/17

(e) There are 9C2 equally likely, mutually exclusive and exhaustive outcomes. Let events A and B be: A : both the selected numbers are even. B : both the selected numbers are odd.
(i) Out of 9 numbers, 4 numbers, namely, 2, 4, 6 and 8 are even. Therefore, 4C2 selections will have two even numbers. Therefore,
P[both even] = P(A) = 4C2 / 9C2 = 6/36 = 1/6
(ii) Out of 9 numbers, 5 numbers, namely, 1, 3, 5, 7 and 9 are odd. Therefore, 5C2 selections will have two odd numbers. Therefore,
P[both odd] = P(B) = 5C2 / 9C2 = 10/36 = 5/18

Exercise 3: A bag contains 3 red, 4 green and 3 yellow marbles. Three marbles are randomly drawn from the bag. What is the probability that they are of (i) the same colour (ii) different colours (one of each colour)? Solution: There are 10C3 equally likely, mutually exclusive and exhaustive outcomes. Let events A and B be: A : selected marbles are of the same colour. B : selected marbles are of different colours.
(i) The marbles drawn should be 3 red or 3 green or 3 yellow. Therefore, 3C3 + 4C3 + 3C3 outcomes are favourable to event A. Therefore,
P[marbles of the same colour] = (3C3 + 4C3 + 3C3) / 10C3 = (1 + 4 + 1)/120 = 1/20
(ii) The marbles should be one of each colour. Therefore, 3C1 x 4C1 x 3C1 outcomes are favourable to event B. Therefore,
P[marbles of different colours] = (3C1 x 4C1 x 3C1) / 10C3 = 36/120 = 3/10
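The selection problems above all reduce to ratios of combination counts nCr. As a hedged cross-check, the sketch below recomputes several of them with Python's math.comb; the variable names are chosen here for illustration.

```python
from fractions import Fraction
from math import comb   # comb(n, r) is nCr

# Exercise 2c: 5 white, 4 red, 3 green balls; two drawn.
p_balls_same_colour = Fraction(comb(5, 2) + comb(4, 2) + comb(3, 2), comb(12, 2))

# Exercise 2d: two cards from 52.
p_both_spades = Fraction(comb(13, 2), comb(52, 2))
p_both_kings = Fraction(comb(4, 2), comb(52, 2))
p_spade_and_heart = Fraction(comb(13, 1) * comb(13, 1), comb(52, 2))
p_same_suit = Fraction(4 * comb(13, 2), comb(52, 2))

# Exercise 3: 3 red, 4 green, 3 yellow marbles; three drawn.
p_marbles_same = Fraction(comb(3, 3) + comb(4, 3) + comb(3, 3), comb(10, 3))
p_marbles_one_each = Fraction(comb(3, 1) * comb(4, 1) * comb(3, 1), comb(10, 3))

print(p_balls_same_colour)                 # 19/66
print(p_both_spades, p_both_kings)         # 1/17 1/221
print(p_spade_and_heart, p_same_suit)      # 13/102 4/17
print(p_marbles_same, p_marbles_one_each)  # 1/20 3/10
```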


THE AXIOMATIC APPROACH Consider a random experiment with sample space S. Associated with this random experiment, many events can be defined. For every event A, let a real number P(A) be assigned. Then, P(A) is the probability of event A if the following axioms are satisfied. Axiom 1 : P(A) ≥ 0 Axiom 2 : P(S) = 1, S being the sure event. Axiom 3 : For two mutually exclusive events A and B, P(A ∪ B) = P(A) + P(B) Note that the third axiom can be generalised for any number of mutually exclusive events.

Exercise 12: (i) Show that P(A) = 1 - P(A') (ii) Show that probability is a value between 0 and 1. (iii) Show that P(Φ) = 0 where Φ is the null event.

Solution: (i) If A and A' are complementary events, A ∪ A' = S. By axiom 2, P(S) = 1. And so, P(A ∪ A') = 1 .... Result 1
But A and A' are mutually exclusive events. Therefore, by axiom 3, P(A ∪ A') = P(A) + P(A') .... Result 2
By the results 1 and 2, P(A) + P(A') = 1. That is, P(A) = 1 - P(A').

(ii) Let A be an event. Then, by axiom 1, P(A) ≥ 0 .... Result 1
If A' is the complementary event of A, P(A') = 1 - P(A). But, by axiom 1, P(A') ≥ 0. Therefore, 1 - P(A) ≥ 0. And so, P(A) ≤ 1 .... Result 2
By the results 1 and 2, 0 ≤ P(A) ≤ 1. That is, probability is a value between 0 and 1.

(iii) If A is an event and if Φ is a null event, A ∪ Φ = A. ∴ P(A ∪ Φ) = P(A) .... Result 1
But A and Φ are mutually exclusive. Therefore, by axiom 3, P(A ∪ Φ) = P(A) + P(Φ) .... Result 2
By the results 1 and 2, P(A) + P(Φ) = P(A). That is, P(Φ) = P(A) - P(A) = 0.

ADDITION THEOREM OF PROBABILITY For two events A and B, show that P(A ∪ B) = P(A) + P(B) - P(A ∩ B)

Solution : For events A and B, A ∪ B = A ∪ (A'∩B).

Here, A and A'∩B are mutually exclusive. Therefore, by axiom 3, P(A ∪ B) = P(A) + P(A'∩B) .... Result 1

Also, B = (A∩B) ∪ (A'∩B).

Here, A∩B and A'∩B are mutually exclusive. Therefore, by axiom 3, P(B) = P(A∩B) + P(A'∩B). That is, P(A'∩B) = P(B) - P(A∩B) .... Result 2
By the results 1 and 2, P(A ∪ B) = P(A) + P(B) - P(A ∩ B).

Exercise: Show that (i) P(A ∪ B) ≤ P(A) + P(B) (ii) P(A ∩ B) = P(A) + P(B) - P(A ∪ B) Solution : (i) The addition theorem is P(A ∪ B) = P(A) + P(B) - P(A ∩ B). Here, P(A ∩ B) ≥ 0. Therefore, P(A ∪ B) ≤ P(A) + P(B). (ii) The addition theorem is P(A ∪ B) = P(A) + P(B) - P(A ∩ B)

⇒ P(A ∩ B) = P(A) + P(B) - P(A ∪ B) Also, note that P(A ∪ B) + P(A ∩ B) = P(A) + P(B)
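The addition theorem can be checked numerically on a small equiprobable sample space. The sketch below uses the die events A = {2,4,6} and B = {3,6} from earlier in the lesson; it is a spot check on one example, not a proof, and the function name P is chosen here for illustration.

```python
from fractions import Fraction

S = set(range(1, 7))      # equiprobable sample space of a fair die

def P(event):
    """Classical probability of an event within S."""
    return Fraction(len(event & S), len(S))

A = {2, 4, 6}
B = {3, 6}

lhs = P(A | B)                  # P(A ∪ B)
rhs = P(A) + P(B) - P(A & B)    # P(A) + P(B) - P(A ∩ B)

print(lhs, rhs)                 # 2/3 2/3
print(lhs <= P(A) + P(B))       # True
```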

SOLVED PROBLEMS Exercise: Write down the sample space for each of the following random experiments.
(i) A coin is tossed three times and the result of each throw is noted.
(ii) A coin is tossed three times and the number of heads obtained is noted.
(iii) A couple goes on producing children until a male child is born. The number of female children born is noted.
(iv) In case (iii) above, instead of noting the 'Number of female children', the 'Number of children born' is noted.
(v) A tetrahedron (a solid with four triangular surfaces) whose sides are painted red, red, blue and green is thrown. The colour of the side which touches the ground is noted.
(vi) Blood of husband and wife are tested and the blood group (whether O, A, B or AB) in each case is identified.
(vii) A person is randomly selected and his religion is noted.

Solution:
(i) S = {HHH, HHT, HTH, THH, HTT, THT, TTH, TTT}
(ii) S = {0, 1, 2, 3}
(iii) S = {0, 1, 2, 3, ....}
(iv) S = {1, 2, 3, 4, ....}
(v) S = {red, blue, green}
(vi) S = {(O,O), (O,A), (O,B), (O,AB), (A,O), (A,A), (A,B), (A,AB), (B,O), (B,A), (B,B), (B,AB), (AB,O), (AB,A), (AB,B), (AB,AB)}
(vii) S = {Hindu, Christian, Muslim, Jain, Jew, ....}

Exercise: (i) Given the equiprobable sample space S = {1, 2, 3, 4, 5, 6} and the event A = {1, 3, 5}, find P(A). (ii) Given the sample space S = {1, 2, 3, 4, 5, 6} and the events A = {1, 3, 5} and B = {2, 4, 6}. If P(A) = 1/3, find P(B). (iii) If S = {E1, E2} is the sample space and if P(E1) = 0.3, find P(E2).
Solution:
(i) Since the sample space is equiprobable, the mathematical definition can be used for finding probability.
∴ P(A) = (Number of favourable outcomes) / (Total number of outcomes) = 3/6 = 1/2
(ii) Here, events A and B are complementary.
∴ P(B) = 1 - P(A) = 1 - 1/3 = 2/3
(iii) Here, E1 and E2 are complementary events.
∴ P(E2) = 1 - P(E1) = 1 - 0.3 = 0.7

Exercise: (i)

If P(A) = 1/3, find P(A').
(ii) If P(A) = 1/2, P(B) = 3/4 and P(A ∩ B) = 1/4, find P(A ∪ B).
(iii) If P(A) = 1/8, P(B) = 1/6 and P(A ∪ B) = 1/4, find P(A ∩ B).
(iv) If P(A) = 1/2 and P(A ∩ B) = 1/4, find P(B|A).

Solution:
(i) P(A') = 1 - P(A) = 1 - 1/3 = 2/3
(ii) P(A ∪ B) = P(A) + P(B) - P(A ∩ B) = 1/2 + 3/4 - 1/4 = 1
(iii) By the addition theorem, P(A ∪ B) = P(A) + P(B) - P(A ∩ B) ⇒ P(A ∩ B) = P(A) + P(B) - P(A ∪ B) = 1/8 + 1/6 - 1/4 = 1/24
(iv) P(B|A) = P(A ∩ B)/P(A) = (1/4)/(1/2) = 1/2

Exercise : If P(A) = 0.8, P(B) = 0.5 and P(A ∪ B) = 0.9, find P(A|B). Are A and B independent events?
Solution: By the addition theorem, P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
∴ P(A ∩ B) = P(A) + P(B) - P(A ∪ B) = 0.8 + 0.5 - 0.9 = 0.4
And so, P(A|B) = P(A ∩ B)/P(B) = 0.4/0.5 = 0.8

Thus, P(A|B) = 0.8. Here, P(A|B) = P(A). Therefore, events A and B are independent.


Exercise : Three unbiased dice are thrown once. Find the probability that all the three dice show the number 6. Solution : When 3 dice are thrown, there are 6 x 6 x 6 = 216 equally likely, mutually exclusive and exhaustive outcomes. Of these 216 outcomes, 1 outcome, namely, (6, 6, 6) is favourable. Therefore, the probability of all the three dice showing the number 6 is P[all the three result in the number 6] = 1/216

Exercise : A fair coin is tossed five times. Find the probability of obtaining (i) head in all the tosses, (ii) head in at least one of the tosses. Solution: There are 2^5 = 32 equally likely, mutually exclusive and exhaustive outcomes. Out of them, one outcome is HHHHH and another outcome is TTTTT. Therefore, (i) P[head in all tosses] = 1/32 (ii) P[at least one head] = 1 - P[tail in all tosses] = 1 - 1/32 = 31/32 Note : Whenever the probability of the event "at least one" has to be found, it is easier to find it by using the probability of the complementary event as follows. P[at least one] = 1 - P[none]

Exercise : There are 20 persons. 5 of them are graduates. 3 persons are randomly selected from these 20 persons. Find the probability that at least one of the selected persons is a graduate. Solution: From 20 persons, 3 persons can be selected in 20C3 ways. Thus, there are 20C3 equally likely, mutually exclusive and exhaustive outcomes. Since there are 15 persons who are not graduates,
P[at least one is graduate] = 1 - P[none is graduate] = 1 - 15C3/20C3 = 1 - 91/228 = 137/228 = 0.6

Exercise : In a college, there are five lecturers. Among them, three are doctorates. If a committee consisting of three lecturers is formed, what is the probability that at least two of them are doctorates? Solution: From the five lecturers, three lecturers can be selected in 5C3 ways. Thus, there are 5C3 equally likely, mutually exclusive and exhaustive outcomes. Let events A and B be: A : Two of the selected lecturers are doctorates. B : All the three selected lecturers are doctorates. Then, event A has 3C2 x 2C1 favourable outcomes. And, event B has 3C3 favourable outcomes. Here, events A and B are mutually exclusive.
∴ P[at least two doctorates] = P[two or three doctorates] = P(A ∪ B) = P(A) + P(B) = (3C2 x 2C1)/5C3 + 3C3/5C3 = (3 x 2)/10 + 1/10 = 7/10 = 0.7
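The "at least one" and "at least two" exercises above can be recomputed with exact fractions. This is a sketch under the same equiprobable-selection assumption as the solutions; the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

# Five tosses of a fair coin: P[at least one head] = 1 - P[no head].
p_at_least_one_head = 1 - Fraction(1, 2 ** 5)

# 20 persons, 5 graduates, 3 selected: complement of "none is a graduate".
p_at_least_one_graduate = 1 - Fraction(comb(15, 3), comb(20, 3))

# 5 lecturers, 3 doctorates, 3 selected: exactly two or exactly three doctorates.
p_at_least_two_doctorates = Fraction(
    comb(3, 2) * comb(2, 1) + comb(3, 3), comb(5, 3)
)

print(p_at_least_one_head)          # 31/32
print(p_at_least_one_graduate)      # 137/228
print(p_at_least_two_doctorates)    # 7/10
```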

PROBLEMS:3 What is the probability that there will be 53 Sundays in a randomly selected (i) leap year (ii) non-leap year? Solution: (i) A leap year has 366 days, Out of them, 7*52 = 364 days make 52 complete weeks. The remaining two days may occur in any of the following pattern --(Sunday, Monday), (Monday, Tuesday), (Tuesday, Wednesday), (Wednesday, Thursday), (Thursday, Friday), (Friday, Saturday) and (Saturday, Sunday). Out of these 7 cases which are equally likely, mutually exclusive and exhaustive, 2 cases namely (Sunday, Monday) and (Saturday, Sunday) have Sunday. Therefore, P[leap year has 53 Sundays]=2/7 (ii) A non- leap year has 365 days. Out of them, 364 days make 52 complete weeks. The remaining one day may be Sunday, Monday, ---- Saturday. Out of these 7 possibilities, only one is Sunday. Therefore, P[non-leap year has 53 Sundays]=1/7


CONDITIONAL PROBABILITY Let A and B be two events. Then, the conditional probability of B given A is the probability of happening of B when it is known that A has already happened. On the other hand, the probability of happening of B when nothing is known about happening of A is called the unconditional probability of B. The conditional probability of B given A is denoted by P(B|A). The unconditional probability is P(B). Let P(A) > 0. Then, the conditional probability of event B given A is defined as

P(B|A) = P(A ∩ B) / P(A)

If P(A) = 0, the conditional probability P(B|A) is not defined. If A and B are independent events, occurrence of B will be independent of occurrence of A. Therefore, the conditional and unconditional probabilities are equal. That is, P(B|A) = P(B), so that P(B) = P(A ∩ B)/P(A). That is, P(A ∩ B) = P(A).P(B)

INDEPENDENT EVENTS Two events A and B are independent if and only if P(A ∩ B) = P(A).P(B) If two events are independent, the occurrence or non-occurrence of one does not depend on the occurrence or non-occurrence of the other.

MULTIPLICATION THEOREM Let A and B be two events with respective probabilities P(A) and P(B). Let P(B|A) be the conditional probability of event B given that event A has happened. Then, the probability of simultaneous occurrence of A and B is

P(A ∩ B) = P(A).P(B|A)

If the events are independent, the statement reduces to

P(A ∩ B) = P(A).P(B)

Proof: By the definition of conditional probability, for P(A) > 0, P(B|A) = P(A ∩ B)/P(A). Therefore, P(A ∩ B) = P(A).P(B|A).

If A and B are independent, by the definition of independence, P(B|A) = P(B). Therefore, P(A ∩ B) = P(A).P(B).
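The definition of conditional probability and the multiplication theorem can be illustrated by enumeration. The sketch below uses two fair dice; the events and the helper names P and conditional are chosen here for illustration and are not taken from the text.

```python
from fractions import Fraction
from itertools import product

outcomes = set(product(range(1, 7), repeat=2))   # 36 equally likely pairs

def P(event):
    return Fraction(len(event), len(outcomes))

def conditional(B, A):
    """P(B|A) = P(A ∩ B) / P(A); undefined when P(A) = 0."""
    if P(A) == 0:
        raise ValueError("P(B|A) is not defined when P(A) = 0")
    return P(A & B) / P(A)

A = {o for o in outcomes if sum(o) == 4}                        # sum is 4
B = {o for o in outcomes if o[0] % 2 == 0 and o[1] % 2 == 0}    # both even

print(conditional(B, A))                        # 1/3
print(P(A & B) == P(A) * conditional(B, A))     # True (multiplication theorem)

# Independence check: "first die even" and "second die even".
E1 = {o for o in outcomes if o[0] % 2 == 0}
E2 = {o for o in outcomes if o[1] % 2 == 0}
print(P(E1 & E2) == P(E1) * P(E2))              # True
```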

Exercise. a. A card is drawn at random from a pack of cards. (i) What is the probability that it is a heart ? (ii) If it is known that the card drawn is red, what is the probability that it is a heart? b. A fair coin is tossed thrice. What is the probability that all the three tosses result in heads ? Solution: a. There are 52 equally likely, mutually exclusive and exhaustive outcomes. Let events A and B be — A : card drawn is red. B : card drawn is heart. There are 26 red cards and 13 hearts in a pack of cards. Therefore, event A has 26 favourable outcomes and event B has 13 favourable outcomes. Event A ∩ B has 13 favourable outcomes because when any of the 13 hearts is drawn A ∩ B happens. Therefore, P(A) = 26/52. P(B) = 13/52 and P(A ∩ B)=13/52 (i)The unconditional probability of drawing a heart is --P(B) = 13/52 = ¼ (ii) The conditional probability of drawing a heart given that it is red card is-----

P(B|A) = P(A ∩ B)/P(A) = (13/52)/(26/52) = 1/2

b. Let events A, B and C be: A : the first toss results in head B : the second toss results in head C : the third toss results in head. Then, P(A) = P(B) = P(C) = ½. Since A, B and C are results of three different tosses, they are independent. Therefore, the probability that all the three tosses result in head is P[3 heads] = P(A ∩ B ∩ C) = P(A).P(B).P(C) = 1/2 x 1/2 x 1/2 = 1/8

Exercise: Two fair dice are rolled. If the sum of the numbers obtained is 4, find the probability that the numbers obtained on both the dice are even. Solution: Let events A and B be: A : the sum of the numbers is 4 B : the numbers on both the dice are even. Here, we have to find P(B|A) = P(A ∩ B)/P(A). Event A has 3 favourable outcomes, namely, (1,3), (2,2) and (3,1).
∴ P[Sum 4] = P(A) = 3/36
Event (A ∩ B) has 1 favourable outcome, namely, (2,2). ∴ P[Sum 4 and numbers even] = P(A ∩ B) = 1/36
Thus, P[Numbers even given Sum 4] = P(B|A) = (1/36)/(3/36) = 1/3

Exercise: A box has 1 red and 3 white balls. Balls are drawn one after another from the box. Find the probability that the two balls drawn would be red if a. the ball drawn first is returned to the box before the second draw is made. (Draw with replacement). b. the ball drawn first is not returned before the second draw is made. (Draw without replacement).


Solution: Let A : the first ball drawn is red B : the second ball drawn is red.

Draw with replacement: Here, P(A) = 1/4. Also, since the first ball is returned before the second draw is made, P(B|A) = 1/4
∴ P[Two balls are red] = P(A ∩ B) = P(A).P(B|A) = 1/4 x 1/4 = 1/16

Draw without replacement: Here, since the first ball drawn is not returned before the second draw is made, only the 3 white balls remain in the box, and so P(B|A) = 0/3 = 0
∴ P[Two balls are red] = P(A ∩ B) = P(A).P(B|A) = 1/4 x 0 = 0

PROBLEMS:1 The probability that a contractor will get a plumbing contract is 2/3 and the probability that he will not get an electrical contract is 5/9. If the probability of getting at least one of these contracts is 4/5, what is the probability that he will get both? Solution: Let A : contractor gets plumbing contract B : contractor gets electrical contract. Then, P(A) = 2/3, P(B') = 5/9 and P(A ∪ B) = 4/5. Therefore, P(B) = 1 - P(B') = 4/9. By the addition theorem we have P(A ∪ B) = P(A) + P(B) - P(A ∩ B). That is, P(A ∩ B) = P(A) + P(B) - P(A ∪ B). Therefore, P[he gets both plumbing and electrical contracts] = P(A ∩ B) = P(A) + P(B) - P(A ∪ B) = 2/3 + 4/9 - 4/5 = 14/45

PROBLEMS:3 A can solve 90 percent of the problems given in a book and B can solve 70 percent. What is the probability that at least one of them will solve a problem selected at random? Solution: event A : student A solves the problem event B : student B solves the problem. Assuming that A and B solve problems independently, P(at least one solves the problem) = 1 - P(none solves the problem) = 1 - P(A' ∩ B') = 1 - P(A').P(B') = 1 - (0.10)(0.30) = 0.97

PROBLEMS:4 The probability that a trainee will remain with a company is 0.6. The probability that an employee earns more than Rs.10,000 per year is 0.5. The probability that an employee is a trainee who remained with the company or who earns more than Rs.10,000 per year is 0.7. What is the probability that an employee earns more than Rs.10,000 per year given that he is a trainee who stayed with the company? Solution: event A : a trainee will remain with the company event B : a trainee earns more than Rs.10,000. Given P(A) = 0.6, P(B) = 0.5 and P(A ∪ B) = 0.7, we need to find
P(B|A) = P(A ∩ B)/P(A) = [P(A) + P(B) - P(A ∪ B)]/P(A) = 0.4/0.6 = 0.67

PROBLEMS:5 Suppose that one of three men, a politician, a bureaucrat and an educationist, will be appointed as VC of the university. The probabilities of their appointment are respectively 0.3, 0.25 and 0.45. The probabilities that these people will promote research activities if they are appointed are 0.4, 0.7 and 0.8 respectively. What is the probability that research will be promoted by the new VC? Solution: event A : politician appointed as VC event B : bureaucrat appointed as VC event C : educationist appointed as VC event D : promotion of research activities
∴ P(D) = P(A ∩ D) + P(B ∩ D) + P(C ∩ D) = P(D|A).P(A) + P(D|B).P(B) + P(D|C).P(C) = (0.3)(0.4) + (0.25)(0.7) + (0.45)(0.8) = 0.655


PROBLEMS:6 A box contains 4 green and 6 white balls. Another box contains 7 green and 8 white balls. Two balls are transferred from box 1 to box 2 and then a ball is drawn from box 2. What is the probability that it is white? Solution: event A : both transferred balls are green event B : both transferred balls are white event C : among the transferred balls, one is green and one is white event D : selection of a white ball from box 2.
∴ P(D) = P(A ∩ D) + P(B ∩ D) + P(C ∩ D) = P(D|A).P(A) + P(D|B).P(B) + P(D|C).P(C)
= (4C2/10C2)(8/17) + (6C2/10C2)(10/17) + ((4C1 x 6C1)/10C2)(9/17)
= 0.5412

PROBLEMS:7 Probabilities of a husband's and wife's selection to a post are 1/5 and 1/7 respectively. What is the probability that
• both of them will be selected
• exactly one of them will be selected
• none of them will be selected
Solution: event A : selection of husband, P(A) = 1/5 event B : selection of wife, P(B) = 1/7. The two selections are taken to be independent.
(i) P(both of them will be selected) = P(A ∩ B) = P(A).P(B) = 1/5 x 1/7 = 1/35
(ii) P(exactly one of them will be selected) = P(A' ∩ B) + P(A ∩ B') = P(A').P(B) + P(A).P(B') = 4/5 x 1/7 + 1/5 x 6/7 = 10/35
(iii) P(none of them will be selected) = P(A' ∩ B') = P(A').P(B') = 4/5 x 6/7 = 24/35
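Problems 5 and 6 above both add up P(D|cause) x P(cause) over mutually exclusive and exhaustive causes, and Problem 7 multiplies probabilities of independent events. The sketch below recomputes them with exact fractions; the helper name total_probability is hypothetical, chosen for this illustration.

```python
from fractions import Fraction
from math import comb

def total_probability(priors, likelihoods):
    """Sum of P(cause) * P(D | cause) over exclusive, exhaustive causes."""
    return sum(p * l for p, l in zip(priors, likelihoods))

# Problem 5: politician, bureaucrat, educationist as VC.
p_research = total_probability(
    [Fraction(30, 100), Fraction(25, 100), Fraction(45, 100)],
    [Fraction(4, 10), Fraction(7, 10), Fraction(8, 10)],
)

# Problem 6: two balls moved from box 1 (4 green, 6 white) to box 2 (7 green, 8 white).
n = comb(10, 2)
p_white = total_probability(
    [Fraction(comb(4, 2), n), Fraction(comb(6, 2), n), Fraction(4 * 6, n)],
    [Fraction(8, 17), Fraction(10, 17), Fraction(9, 17)],
)

# Problem 7: independent selections with P(A) = 1/5 and P(B) = 1/7.
pA, pB = Fraction(1, 5), Fraction(1, 7)
p_both = pA * pB
p_exactly_one = (1 - pA) * pB + pA * (1 - pB)
p_none = (1 - pA) * (1 - pB)

print(float(p_research))                # 0.655
print(round(float(p_white), 4))         # 0.5412
print(p_both, p_exactly_one, p_none)    # 1/35 2/7 24/35
```

The value 2/7 printed for "exactly one" is the reduced form of 10/35.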


Random variable INTRODUCTION Suppose two fair coins are tossed. Here, the sample space is S = {TT, TH, HT, HH}. Suppose to each of the four sample points in this sample space, a number is assigned as follows.

Sample point : TT   TH   HT   HH
Number       : 0    1    1    2

Here, the assigned numbers indicate the number of heads obtained in each case. Let 'the number of heads' be denoted by X. Then, X is a function on the sample space. It takes the values 0, 1 and 2 with probabilities: P[X=0] = P[no head] = ¼, P[X=1] = P[one head] = ½, P[X=2] = P[two heads] = ¼. Here, X is called a Random variable or Variate.

RANDOM VARIABLE A random variable is a function which assigns a real number to every sample point in the sample space. The set of such real values is the range of the random variable. There are two types of random variable, namely, Discrete random variable and Continuous random variable. A variable X which takes values x1, x2, …, xn with probabilities p1, p2, …, pn is a Discrete random variable. Here, the values x1, x2, …, xn form the range of the random variable. A random variable whose range is uncountably infinite is a Continuous random variable. Ex. 1. Let X denote the number of heads obtained while tossing two fair coins. Then, X is a random variable which takes the values 0, 1 and 2 with respective probabilities ¼, ½ and ¼. Here, X is a discrete random variable. Ex. 2. Let X denote the number obtained while throwing a fair die. Then, X is a discrete random variable taking values 1, 2, 3, 4, 5 and 6 with probability 1/6 each.


Ex. 3. Let X denote the weight of apples. Then, X is a continuous random variable.

Generally, random variables are denoted by X, Y, Z, etc. If X is a random variable, the values taken by X are denoted by x (small letter).

PROBABILITY MASS FUNCTION

Let X be a discrete random variable, and let p(x) be a function such that p(x) = P[X=x]. Then, p(x) is the probability mass function of X. Here,
(i) p(x) ≥ 0 for all x
(ii) ∑p(x) = 1
A similar function is defined for a continuous random variable X. It is called the probability density function (p.d.f.) and is denoted by f(x).

PROBABILITY DISTRIBUTION

A systematic presentation of the values taken by a random variable and the corresponding probabilities is called the probability distribution of the random variable.
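As an illustration of these definitions, the probability distribution of the number of heads in two tosses of a fair coin can be built directly from the sample space. This is a minimal Python sketch; the function and variable names are illustrative and do not come from the text.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space for two fair coins: TT, TH, HT, HH.
sample_space = ["".join(outcome) for outcome in product("TH", repeat=2)]

def number_of_heads(point):
    """The random variable X: maps a sample point to its number of heads."""
    return point.count("H")

# Every sample point is equally likely, so p(x) is the share of points with X = x.
counts = Counter(number_of_heads(point) for point in sample_space)
pmf = {x: Fraction(n, len(sample_space)) for x, n in sorted(counts.items())}

for x, p in pmf.items():
    print(x, p)
```

The resulting dictionary plays the role of the probability mass function: each value is non-negative and the values sum to 1.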

Session 4

MATHEMATICAL EXPECTATION

Mathematical expectation of a random variable

Let X be a discrete random variable with probability mass function p(x). Then, the mathematical expectation of X is
E(X) = ∑ x·p(x)

Mathematical expectation of a function h(X) of X

Let X be a discrete random variable with probability mass function p(x). Then, the mathematical expectation of any function h(X) of X is
E[h(X)] = ∑ h(x)·p(x)

Exercise 1: Two fair coins are tossed once. Find the mathematical expectation of the number of heads obtained.

Solution: Let X denote the number of heads obtained. Then, X is a random variable which takes the values 0, 1 and 2 with respective probabilities ¼, ½ and ¼. That is,

x      0   1   2
p(x)   ¼   ½   ¼


The mathematical expectation of the number of heads is

E(X) = ∑ x·p(x) = 0 × ¼ + 1 × ½ + 2 × ¼ = 1

RESULTS

1. For a random variable X, the Arithmetic Mean is E(X).
2. For a random variable X, the Variance is Var(X) = E[(X − E(X))²] = E(X²) − [E(X)]². The Standard Deviation is the square root of the variance.

Exercise: A bag has 3 white and 4 red balls. Two balls are randomly drawn from the bag. Find the expected number of white balls in the draw.

Solution: Let X denote the number of white balls obtained in the draw. Then, X is a random variable which takes the values 0, 1 and 2 with respective probabilities

P(0) = P[both red] = 4C2 / 7C2 = 2/7
P(1) = P[one white & one red] = (3C1 × 4C1) / 7C2 = 4/7
P(2) = P[both white] = 3C2 / 7C2 = 1/7

The probability distribution of X is

x      0     1     2
p(x)   2/7   4/7   1/7

E(X) = ∑ x·p(x) = 0 × 2/7 + 1 × 4/7 + 2 × 1/7 = 6/7, which is approximately 1.

Thus, about one white ball is expected in the draw.
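The ball-drawing exercise can be reproduced numerically, and the same helper also gives the variance through E(X²) − [E(X)]². This is a minimal Python sketch assuming Python 3.8 or later (for math.comb); the helper name expectation and the constant names are illustrative and do not come from the text.

```python
from fractions import Fraction
from math import comb

WHITE, RED, DRAWN = 3, 4, 2
total_ways = comb(WHITE + RED, DRAWN)  # 7C2 = 21

# p(x) = (3Cx * 4C(2-x)) / 7C2 for x = 0, 1, 2 white balls.
pmf = {
    x: Fraction(comb(WHITE, x) * comb(RED, DRAWN - x), total_ways)
    for x in range(DRAWN + 1)
}

def expectation(pmf, h=lambda x: x):
    """E[h(X)] = sum of h(x) * p(x) over the values taken by X."""
    return sum(h(x) * p for x, p in pmf.items())

mean = expectation(pmf)
variance = expectation(pmf, lambda x: x**2) - mean**2

print(mean, variance)  # 6/7 20/49
```

Passing h(x) = x² to the same function is what turns the definition of E[h(X)] into the variance formula, so no separate routine is needed.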

THEORETICAL PROBABILITY DISTRIBUTIONS

In day to day life, we come across many random variables such as
1. Number of male children in a family having three children.
2. Number of passengers getting into a bus at the bus stand.
3. I.Q. of children.
4. Number of stones thrown successively at a mango on the tree until the mango is hit.
5. Marks scored by a candidate in the P.U.E. examination.

For a quick analysis of the distributions of such random variables, we consider their theoretical equivalents. These equivalent distributions are derived under certain theoretical assumptions and restrictions. Such theoretically designed distributions are called theoretical distributions.

There are many types (families) of theoretical distributions. Some of them are
(i) Bernoulli distribution
(ii) Binomial distribution
(iii) Poisson distribution
(iv) Hypergeometric distribution
(v) Normal distribution.

The Bernoulli distribution and the Binomial distribution were discovered by James Bernoulli during the first decade of the eighteenth century. These works were published posthumously in 1713. The Poisson distribution was introduced by S.D. Poisson in 1837. The Normal distribution was introduced by De Moivre in 1733. This distribution is also called the Gaussian distribution.

BERNOULLI EXPERIMENT

A random variable X which assumes values 1 and 0 with respective probabilities p and q = 1 − p is called a Bernoulli variable. The Bernoulli distribution is

x      1   0
p(x)   p   q
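The Bernoulli distribution can be written down for a concrete parameter and its mean and variance computed from the definitions of expectation given earlier. This is a minimal Python sketch; the function name bernoulli_pmf and the value p = 3/10 are illustrative assumptions, and the results mean = p and variance = pq are standard facts that this section does not itself state.

```python
from fractions import Fraction

def bernoulli_pmf(p):
    """Return the Bernoulli distribution {1: p, 0: q} with q = 1 - p."""
    if not 0 <= p <= 1:
        raise ValueError("p must lie between 0 and 1")
    return {1: p, 0: 1 - p}

p = Fraction(3, 10)  # illustrative value, not taken from the text
pmf = bernoulli_pmf(p)

# E(X) = sum of x * p(x); Var(X) = E(X^2) - [E(X)]^2.
mean = sum(x * prob for x, prob in pmf.items())
variance = sum(x**2 * prob for x, prob in pmf.items()) - mean**2

print(pmf[1], pmf[0], mean, variance)  # 3/10 7/10 3/10 21/100
```

Here the variance 21/100 equals p × q = 3/10 × 7/10, as expected for a Bernoulli variable.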

Note 1: The Bernoulli distribution has one constant, namely, p. This constant p is called the parameter of the Bernoulli distribution. Different values of p (where 0