Measures of Central Tendency
Quantitative Analysis
Chapter 3 Measures of Central Tendency
Submitted to: Prof. Mamta Bhrambhatt Date of Submission: 24/12/2009
Index
1. Measures of Central Tendency: Ungrouped Data
• Mode
• Median
• Mean
• Percentile
• Quartiles
2. Measures of Variability: Ungrouped Data
• Range
• Interquartile Range
• Mean Absolute Deviation
• Variance
• Standard Deviation
• Empirical Rule
• Chebyshev’s Theorem
• z Scores
• Coefficient of Variation
3. Measures of Central Tendency and Variability: Grouped Data
• Mean
• Mode
4. Measures of Shape
• Skewness
• Kurtosis
• Box and Whisker plots
5. Measures of Association
• Correlation
1. Measures of Central Tendency: Ungrouped Data
We can use single numbers called “summary statistics” to describe characteristics of a data set. Two of these characteristics are particularly important to decision makers:
1. Central tendency
2. Dispersion
Central Tendency: Central tendency is the middle point of a distribution. Measures of central tendency are also known as measures of location. They yield information about the center, or middle part, of a group of numbers. They do not focus on the span of the data set or on how far values are from the middle numbers.
Dispersion: Dispersion is the spread of the data in a distribution, that is, the extent to which the observations are scattered.
Objectives:
To use summary statistics to describe a collection of data.
To use the mean, median and mode to describe how data “bunch up”.
To use the range, variance and standard deviation to describe how data “spread out”.
MEASURES OF CENTRAL TENDENCY: UNGROUPED DATA
Mode: The mode is a measure of central tendency. It is the most common value in a distribution. E.g. the mode of 3, 4, 4, 5, 5, 5, 8 is 5, because 5 occurs most often.
• Bimodal -- Data sets that have two modes
• Multimodal -- Data sets that contain more than two modes
When to use: Use the mode when the data is non-numeric or when asked to choose the most popular item.
Advantages:
• Extreme values (outliers) do not affect the mode.
Disadvantages:
• Not as popular as mean and median.
• Not necessarily unique: there may be more than one mode.
• When no values repeat in the data set, every value is a mode and the measure is useless.
• When there is more than one mode, it is difficult to interpret and/or compare.
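The definition of the mode can be sketched in a few lines of Python. This is only an illustration: the function name `modes` is an assumption, and the second and third data sets are made up to show the bimodal and non-numeric cases.

```python
from collections import Counter

def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    # More than one value can share the highest count (bimodal, multimodal).
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([3, 4, 4, 5, 5, 5, 8]))   # [5]
print(modes([1, 1, 2, 2, 3]))         # [1, 2], a bimodal data set
print(modes(["red", "blue", "red"]))  # ['red'], non-numeric data
```

Returning a list rather than a single value keeps the "not necessarily unique" case visible to the caller.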
Median: The data must first be ranked (sorted in ascending order). The median is the number in the middle. Several formulas can be used to find the depth (position) of the median; the one we will use is:
Depth of median = 0.5 * (n + 1)
When n is even, the depth ends in .5 and the median is the average of the two values on either side of that position.
• Applicable for ordinal, interval, and ratio data
• Not applicable for nominal data
When to use: Use the median to describe the middle of a set of data that does have an outlier.
Advantages:
• Extreme values (outliers) do not affect the median as strongly as they do the mean.
• Useful when comparing sets of data.
• It is unique - there is only one answer.
Disadvantages:
• Not as popular as mean.
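The depth formula for the median can be sketched as follows. The function name `median_by_depth` is an assumption, and averaging the two neighbouring values for an even n is the usual convention rather than something the formula itself states.

```python
def median_by_depth(values):
    """Median via the depth formula: depth = 0.5 * (n + 1), counted from 1."""
    if not values:
        raise ValueError("the median of an empty data set is undefined")
    ordered = sorted(values)             # the data must be ranked first
    depth = 0.5 * (len(ordered) + 1)
    lower = int(depth)                   # whole-number part of the depth
    if depth == lower:                   # odd n: the depth points at one value
        return ordered[lower - 1]
    # even n: the depth ends in .5, so average the two neighbouring values
    return (ordered[lower - 1] + ordered[lower]) / 2

print(median_by_depth([3, 4, 4, 5, 5, 5, 8]))            # 5
print(median_by_depth([14, 12, 19, 23, 5, 13, 28, 17]))  # 15.5
```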
Mean: The mean is the average of a group of numbers. It is computed by summing all the numbers and dividing by the number of numbers. The population mean is represented by the Greek letter µ. The sample mean is represented by x̄. The formulas for computing the population mean and the sample mean are given below.
• Population mean:
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \dots + x_N}{N}$$
• Sample mean:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \dots + x_n}{n}$$
When to use: Use the mean to describe the middle of a set of data that does not have an outlier.
Advantages:
• Most popular measure in fields such as business, engineering and computer science.
• It is unique - there is only one answer.
• Useful when comparing sets of data.
Disadvantages:
• Affected by extreme values (outliers)
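A short sketch of why the mean is recommended only when there is no outlier. The two data sets are hypothetical, chosen only to make the effect visible; `arithmetic_mean` is an illustrative name.

```python
from statistics import median

def arithmetic_mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)

typical = [10, 12, 13, 15]        # hypothetical values, no outlier
with_outlier = typical + [100]    # the same values plus one extreme value

print(arithmetic_mean(typical), median(typical))            # 12.5 12.5
print(arithmetic_mean(with_outlier), median(with_outlier))  # 30.0 13
```

One extreme value moves the mean from 12.5 to 30.0, while the median only moves from 12.5 to 13.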
Percentiles: They are measures of central tendency that divide a group of data into 100 parts.
• At least n% of the data lie below the nth percentile, and at most (100 - n)% of the data lie above the nth percentile
• Example: 90th percentile indicates that at least 90% of the data lie below it, and at most 10% of the data lie above it
• The median and the 50th percentile have the same value.
• Applicable for ordinal, interval, and ratio data
• Not applicable for nominal data
For Calculation:
• Organize the data into an ascending ordered array.
• Calculate the percentile location, where P is the percentile of interest and n is the number of values in the data set:
$$i = \frac{P}{100}(n)$$
• Determine the percentile’s location and its value.
• If i is a whole number, the percentile is the average of the values at the i and (i+1) positions.
• If i is not a whole number, the percentile is at the position given by the whole-number part of (i+1) in the ordered array.
FOR EXAMPLE
• Raw Data: 14, 12, 19, 23, 5, 13, 28, 17
• Ordered Array: 5, 12, 13, 14, 17, 19, 23, 28
• Location of 30th percentile:
$$i = \frac{30}{100}(8) = 2.4$$
• The location index, i, is not a whole number; i+1 = 2.4+1 = 3.4; the whole number portion is 3; the 30th percentile is at the 3rd location of the array; the 30th percentile is 13.
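The calculation steps above can be sketched as a small function. The name `percentile` and the restriction to 0 < P < 100 are assumptions made for this illustration; statistical packages often use other interpolation rules and may return slightly different values.

```python
def percentile(values, p):
    """Value of the p-th percentile (0 < p < 100) by the location-index method."""
    if not values or not 0 < p < 100:
        raise ValueError("need a non-empty data set and 0 < p < 100")
    ordered = sorted(values)          # step 1: ascending ordered array
    n = len(ordered)
    i = p * n / 100                   # step 2: location index i = (P / 100) * n
    if i.is_integer():
        k = int(i)
        # whole number: average the values at positions i and i + 1 (1-based)
        return (ordered[k - 1] + ordered[k]) / 2
    position = int(i + 1)             # not whole: whole-number part of i + 1
    return ordered[position - 1]

raw = [14, 12, 19, 23, 5, 13, 28, 17]
print(percentile(raw, 30))  # 13
print(percentile(raw, 50))  # 15.5
```

The 50th percentile of the example data, 15.5, matches the median of the same eight values.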
Quartiles
• Measures of central tendency that divide a group of data into four subgroups
• Q1: 25% of the data set is below the first quartile
• Q2: 50% of the data set is below the second quartile
• Q3: 75% of the data set is below the third quartile
• Q1 is equal to the 25th percentile
• Q2 is located at 50th percentile and equals the median
• Q3 is equal to the 75th percentile
• Quartile values are not necessarily members of the data set
E.g.
• Ordered array: 106, 109, 114, 116, 121, 122, 125, 129
• Q1:
$$i = \frac{25}{100}(8) = 2, \qquad Q_1 = \frac{109 + 114}{2} = 111.5$$
• Q2:
$$i = \frac{50}{100}(8) = 4, \qquad Q_2 = \frac{116 + 121}{2} = 118.5$$
• Q3:
$$i = \frac{75}{100}(8) = 6, \qquad Q_3 = \frac{122 + 125}{2} = 123.5$$
2. Measures of Variability: Ungrouped Data
• Measures of variability describe the spread or the dispersion of a set of data.
3.1 RANGE: The range is the difference between the highest and lowest observed values.
RANGE = value of highest observation – value of lowest observation
Advantages of range:
• It is easy to understand and to find.
• It is used in quality assurance, where the range is used to construct control charts.
Disadvantages of range:
• Its usefulness as a measure of dispersion is limited.
• It considers only the highest and lowest values of a distribution.
• It is heavily affected by extreme values.
• It cannot be used for open-ended series.
Example: The ungrouped data are as follows: 10, 2, 5, 6, 7, 3, 4
The range is: 10 - 2 = 8
3.1 INTERQUARTILE RANGE:
The interquartile range (IQR) is the difference between the values of the third and first quartiles. It is the range of the middle 50% of the scores in a distribution, so it is less affected by extreme values than the range. It is computed as follows:
IQR = 75th percentile - 25th percentile
IQR = Q3 – Q1
E.g. if the 75th percentile is 8 and the 25th percentile is 6, the interquartile range is 8 - 6 = 2.
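As a rough check of the range and the interquartile range, the sketch below applies both to data sets that appear in this chapter. The helper names are assumptions, and `percentile` repeats the location-index method so that the block runs on its own.

```python
def percentile(values, p):
    """p-th percentile (0 < p < 100) by the location-index method."""
    ordered = sorted(values)
    i = p * len(ordered) / 100
    if i.is_integer():
        k = int(i)
        return (ordered[k - 1] + ordered[k]) / 2
    return ordered[int(i + 1) - 1]

def value_range(values):
    """Highest observation minus lowest observation."""
    return max(values) - min(values)

def interquartile_range(values):
    """Q3 minus Q1, the spread of the middle 50% of the data."""
    return percentile(values, 75) - percentile(values, 25)

print(value_range([10, 2, 5, 6, 7, 3, 4]))   # 8

data = [106, 109, 114, 116, 121, 122, 125, 129]
print(value_range(data))                     # 23
print(interquartile_range(data))             # 12.0
```

For the eight-value array, Q3 = 123.5 and Q1 = 111.5, so the IQR is 12.0 while the full range is 23.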
3.2 VARIANCE:
Variance in population:
Variability can also be defined in terms of how close the scores in the distribution are to the middle of the distribution. Using the mean as the measure of the middle of the distribution, the variance is defined as the average squared difference of the scores from the mean:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
where σ² is the variance, μ is the mean, and N is the number of numbers.
Example: 16, 45, 32, 12, 34, 65, 46, 76
For these eight values, μ = 326 / 8 = 40.75, the squared differences sum to 3437.5, and σ² = 3437.5 / 8 ≈ 429.69.
Variance in sample:
If the variance in a sample is used to estimate the variance in a population, then the previous formula underestimates the variance and the following formula should be used:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
where s² is the estimate of the population variance. Since, in practice, the variance is usually computed in a sample, this formula is most often used.
Standard deviation:
Population standard deviation:
• It is the square root of the population variance
$$\sigma = \sqrt{\sigma^2}$$
Sample standard deviation:
• It is the square root of the sample variance
$$s = \sqrt{s^2}$$
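The variance and standard deviation formulas can be checked on the eight example values given for the population variance. This is a minimal sketch with illustrative variable names; the cross-check uses Python's standard `statistics` module.

```python
import math
import statistics

data = [16, 45, 32, 12, 34, 65, 46, 76]
n = len(data)
mu = sum(data) / n
squared_differences = sum((x - mu) ** 2 for x in data)

population_variance = squared_differences / n        # divide by N
sample_variance = squared_differences / (n - 1)      # divide by n - 1
population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)

print(mu, population_variance)                            # 40.75 429.6875
print(round(sample_variance, 2))                          # 491.07
print(round(population_sd, 2), round(sample_sd, 2))       # 20.73 22.16

# Cross-check against the standard library implementations.
assert math.isclose(population_variance, statistics.pvariance(data))
assert math.isclose(sample_variance, statistics.variance(data))
```

Dividing by n - 1 instead of N always gives the larger value, which is how the sample formula compensates for the underestimate.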
Uses of Standard Deviation:
• To determine, with a great deal of accuracy.
• Useful in describing how far individual items in a distribution depart from the mean of the distribution.
• Indicator of financial risk
• Quality control: construction of quality control charts and process capability studies
• Comparing populations, e.g. household incomes in two cities or employee absenteeism at two plants
EMPIRICAL RULE:
• It is an important rule of thumb that is used to state the approximate percentage of values that lie within a given number of standard deviations from the mean of a set of data if the data are normally distributed.
• It is also known as the 68-95-99.7 rule or the three-sigma rule. It states that for a normal distribution, nearly all values lie within 3 standard deviations of the mean.
• About 68% of the values lie within 1 standard deviation of the mean (between the mean minus 1 times the standard deviation and the mean plus 1 times the standard deviation). In statistical notation, this is represented as: μ ± σ.
• About 95% of the values lie within 2 standard deviations of the mean (between the mean minus 2 times the standard deviation and the mean plus 2 times the standard deviation). The statistical notation for this is: μ ± 2σ.
• Nearly all (99.7%) of the values lie within 3 standard deviations of the mean (between the mean minus 3 times the standard deviation and the mean plus 3 times the standard deviation). Statisticians use the following notation to represent this: μ ± 3σ.
This rule is often used to get a quick, rough estimate of the probability of a value, given the standard deviation, if the population is assumed normal. It therefore also serves as a simple test for outliers (if the population is assumed normal) and as a normality test (if the population is potentially not normal).
Range      Population in range
μ ± 1σ     68%
μ ± 2σ     95%
μ ± 3σ     99.7%
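The percentages of the empirical rule can be reproduced from the normal distribution itself. The sketch below relies on the standard identity that the share of a normal population within k standard deviations of the mean equals erf(k / √2); the function name is an assumption.

```python
import math

def normal_share_within(k):
    """Share of a normal population lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"mu ± {k} sigma: {normal_share_within(k) * 100:.2f}%")
# mu ± 1 sigma: 68.27%
# mu ± 2 sigma: 95.45%
# mu ± 3 sigma: 99.73%
```

The rule's 68, 95 and 99.7 are these values rounded for easy recall.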
CHEBYSHEV’S THEOREM
• Applies to any distribution, regardless of shape
• Places lower limits on the percentages of observations within a given number of standard deviations from the mean
• At least (1 - 1/k²) of the elements of any distribution lie within k standard deviations of the mean, for k > 1

Number of Standard Deviations   Distance from the Mean   Minimum Proportion of Values Falling Within Distance
k = 2                           μ ± 2σ                   1 - 1/2² = 0.75
k = 3                           μ ± 3σ                   1 - 1/3² = 0.89
k = 4                           μ ± 4σ                   1 - 1/4² = 0.94
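The table values follow directly from the bound 1 - 1/k². A minimal sketch, with an illustrative function name and a guard for k ≤ 1, where the bound says nothing useful:

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of any distribution within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2

for k in (2, 3, 4):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.4f}")
# k = 2: at least 0.7500
# k = 3: at least 0.8889
# k = 4: at least 0.9375
```

Compared with the empirical rule (95% within two standard deviations for normal data), the Chebyshev bound of 75% is weaker because it has to hold for every distribution shape.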
6. z Scores:
A z score represents the number of standard deviations a value (x) is above or below the mean of a set of numbers when the data are normally distributed. Using z scores allows translation of a value’s raw distance from the mean into units of standard deviations:
$$z = \frac{x - \mu}{\sigma}$$
If the z score is negative, the raw value (x) is below the mean. If the z score is positive, the raw value (x) is above the mean. For example, for a data set that is normally distributed with a mean of 50 and a standard deviation of 10, suppose a statistician wants to determine the z score for a value of 70. The value is 20 units above the mean, so the z value is
$$z = \frac{70 - 50}{10} = 2.00$$
This z score means that the raw value of 70 is two standard deviations above the mean. The empirical rule states that 95% of all values are within two standard deviations of the mean if the data are approximately normally distributed.
7. Coefficient of Variation:
The coefficient of variation is a statistic that is the ratio of the standard deviation to the mean, expressed as a percentage:
$$CV = \frac{\sigma}{\mu}(100)$$
The coefficient of variation essentially is a relative comparison of a standard deviation to its mean. It can be useful in comparing standard deviations that have been computed from data with different means. Suppose five weeks of average prices for stock A are 57, 68, 64, 71 and 62. To compute the coefficient of variation for these prices, first determine the mean and standard deviation: µ = 64.40 and σ = 4.84. The coefficient of variation is:
$$CV = \frac{4.84}{64.40}(100) = 7.5\%$$
The standard deviation is 7.5% of the mean.
Sometimes financial investors use the coefficient of variation or the standard deviation or both as measures of risk. Imagine a stock with a price that never changes. An investor bears no risk of losing money from the price going down because no variability occurs in price. Suppose, in contrast, that the price of the stock fluctuates widely. An investor who buys at a low price and sells for a high price can make a nice profit. However, if the price drops below what the investor paid for it, the stock owner is subject to a potential loss. The greater the variability, the greater the potential for loss. Hence, investors use measures of variability such as the standard deviation or the coefficient of variation to determine the risk of a stock.
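Both worked examples, the z score for a value of 70 and the coefficient of variation for stock A, can be verified with a short sketch. The function names are illustrative; `statistics.pstdev` is the population standard deviation from Python's standard library.

```python
import statistics

def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

def coefficient_of_variation(mean, sd):
    """Standard deviation as a percentage of the mean."""
    return sd / mean * 100

print(z_score(70, 50, 10))    # 2.0
print(z_score(35, 50, 10))    # -1.5, a value below the mean

prices = [57, 68, 64, 71, 62]            # five weeks of prices for stock A
mu = statistics.mean(prices)
sigma = statistics.pstdev(prices)        # population standard deviation
cv = coefficient_of_variation(mu, sigma)
print(round(mu, 2), round(sigma, 2), round(cv, 1))   # 64.4 4.84 7.5
```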
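3. Measures of Central Tendency and Variability: Grouped Data
Mean of Grouped Data
• Weighted average of class midpoints
• Class frequencies are the weights
$$\mu = \frac{\sum fM}{\sum f} = \frac{\sum fM}{N} = \frac{f_1 M_1 + f_2 M_2 + f_3 M_3 + \dots + f_i M_i}{f_1 + f_2 + f_3 + \dots + f_i}$$
where f is the class frequency, M is the class midpoint, and N is the total number of observations.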
Mode of Grouped Data
• Midpoint of the modal class
• Modal class has the greatest frequency
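The grouped-data mean and mode can be sketched as below. The class intervals and frequencies are hypothetical, invented only for illustration, and the variable names are assumptions.

```python
# Hypothetical frequency table: ((lower limit, upper limit), frequency)
classes = [((0, 10), 4), ((10, 20), 6), ((20, 30), 12), ((30, 40), 8)]

def midpoint(interval):
    """Class midpoint M: halfway between the class limits."""
    low, high = interval
    return (low + high) / 2

total_frequency = sum(f for _, f in classes)                      # N = sum of f
grouped_mean = sum(f * midpoint(c) for c, f in classes) / total_frequency

modal_class, _ = max(classes, key=lambda item: item[1])           # greatest frequency
grouped_mode = midpoint(modal_class)

print(total_frequency)   # 30
print(grouped_mean)      # 23.0
print(grouped_mode)      # 25.0
```

Because every observation in a class is represented by the class midpoint, both results are approximations of the values the raw data would give.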
4. Measures of Shape
Skewness
When they are displayed graphically, some distributions of data have many more observations on one side of the graph than the other. Distributions with most of their observations on the left (toward lower values) are said to be skewed right; distributions with most of their observations on the right (toward higher values) are said to be skewed left.
[Figure: one distribution skewed left and one distribution skewed right]
1. Arithmetic Mean:
• The arithmetic mean of a set of data is the sum of the data values divided by the number of observations.
• If the data set is from a sample, then the sample mean, x̄, is:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
where n = sample size and Σ means "to add".
• If the data set is from a population, then the population mean, µ, is:
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \dots + x_N}{N}$$
where N = population size. x̄ is a statistic and μ is a parameter.
Advantages of Arithmetic Mean:
• Easy to understand.
• Simple to compute.
• Based on all the observations.
• Uniquely defined.
Disadvantages of Arithmetic Mean:
• Affected by extreme values.
• Unable to compute mean for open-ended classes.
• Tedious to compute
• The Weighted Mean: When the observations do not all have the same importance, we use the weighted mean. The weighted mean can be defined as
$$\bar{x}_w = \frac{\sum w_i x_i}{\sum w_i}$$
where x̄w represents the weighted mean and w_i is the weight assigned to observation x_i.
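A minimal sketch of the weighted mean, assuming the usual definition (sum of weight times value, divided by the sum of the weights). The scores and weights below are hypothetical.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("each value needs exactly one weight")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("the weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight

scores = [80, 90, 70]     # hypothetical observations
weights = [2, 3, 5]       # hypothetical importance of each observation

print(weighted_mean(scores, weights))     # 78.0
print(weighted_mean(scores, [1, 1, 1]))   # 80.0, equal weights give the ordinary mean
```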