Descriptive and Inferential Statistics
Paul A. Jargowsky and Rebecca Yang
University of Texas at Dallas

I. Descriptive Statistics
II. Inferential Statistics
III. Cautions and Conclusions
GLOSSARY
bootstrapping – a re-sampling procedure to empirically estimate standard errors.
central tendency – the dominant quantitative or qualitative trend of a given variable (commonly measured by the mean, the median, the mode, and related measures).
confidence interval – a numeric range, based on a statistic and its sampling distribution, that contains the population parameter of interest with a specified probability.
data – a plural noun referring to a collection of information in the form of variables and observations.
descriptive statistics – any of numerous calculations which attempt to provide a concise summary of the information content of data (for example, measures of central tendency, measures of dispersion, etc.).
sample – a specific, finite, realized set of observations of the unit of analysis.
sampling distribution – a theoretical construct describing the behavior of a statistic in repeated samples.
statistic – a descriptive measure calculated from sample data to serve as an estimate of an unknown population parameter.
unit of analysis – the type of thing being measured in the data, such as persons, families, households, states, nations, etc.
There are two fundamental purposes to analyzing data: the first is to describe a large number of data points in a concise way by means of one or more summary statistics; the second is to draw inferences about the characteristics of a population based on the characteristics of a sample. Descriptive statistics characterize the distribution of a set of observations on a specific variable or variables. By conveying the essential properties of the aggregation of many different observations, these summary measures make it possible to understand the phenomenon under study. Inferential statistics, in contrast, use a sample to draw conclusions about a much larger population, such as all potential voters.
I. Descriptive Statistics.
A number of terms have specialized meaning in the domain of statistics. First, there is the distinction between populations and samples. Populations can be finite or infinite. An example of the former is the population of the United States on April 1, 2000, the date of the Census. An example of the latter is the flipping of a coin, which can be repeated in theory ad infinitum. Populations have parameters, which are fixed but usually unknown. Samples are used to produce statistics, which are estimates of population parameters. This section discusses both parameters and statistics, and the following section discusses the validity of using statistics as estimates of population parameters. There are also a number of important distinctions to be made about the nature of the data to be summarized. The characteristics of the data to be analyzed limit the types of measures that can be meaningfully employed. The subsections below address some of the most important of these distinctions.
The collection of all measurements for one realization of the unit of analysis is typically called an observation. If there are n observations and k variables, then the data set can be thought of as a grid with n x k total items of information, although more complex structures are possible. It is absolutely central to conducting and interpreting data analysis to be clear about the unit of analysis. For example, if one observes that 20 percent of all crimes are violent in nature, it does not imply that 20 percent of all criminals are violent criminals. Crimes, though obviously related, are simply a different unit of analysis than criminals; a few really violent criminals could be committing all the violent crimes, so that far fewer than 20 percent of criminals are violent in nature.
1. Levels of Measurement.
A variable is something that varies between observations, at least potentially, but not all variables vary in the same way. The specific values that a variable can take on, also known as attributes, convey information about the differences between observations on the dimension being measured. The major levels of measurement are as follows:
* Quantitative variables are measured on a continuous scale. In theory, there are an infinite number of potential attributes for a quantitative variable. Quantitative variables come in two varieties:
* Ratio variables are quantitative variables which have a true zero. The existence of a true zero makes the ratio of two measures meaningful. For example, we can say that your income is twice my income because $200,000/$100,000 = 2.
* Interval variables are quantitative as well, but lack a true zero. Temperature is a common example: the zero point is arbitrary and in fact differs between the Centigrade and Fahrenheit systems. It doesn't make sense to say that 80° F is twice as hot as 40° F; converted to Centigrade, the same two temperatures are about 26.7° and 4.4°, a ratio of roughly 6 to 1. Neither ratio is meaningful.
* Qualitative (or categorical) variables are discrete; that is, a measurement consists of assigning an observation to one or more categories. The attributes of the variable consist of the finite set of potential categories. The set of categories must be mutually exclusive and collectively exhaustive; that is, each observation can be assigned to one and only one category.
* In ordinal variables, the categories have an intrinsic order. Sears and Roebuck, for example, classifies its tools in three categories: good, better, best. The categories have no numerical relation; we don't know if better is twice as good as good. But they clearly can be ordered.
* In nominal variables, the categories have no intrinsic order. A variable for religion, for example, could consist of a set of denominational categories; the values serve only to distinguish one
category from another. A special type of categorical variable is the dummy variable, which indicates the presence or absence of a specified characteristic. A dummy variable is therefore a nominal variable containing only two categories: one for "yes" and one for "no." Female is one example; pregnant is an even better example. Typically such variables are coded as 1 if the person has the characteristic described by the name of the variable and 0 otherwise; such coding simplifies the interpretation of analyses that may later be performed on the variable. Of course, other means of coding are possible, since the actual values used to indicate categories are arbitrary.

The level of measurement tells us how much information is conveyed about the differences between observations, with the highest level conveying the greatest amount of information. Data gathered at a higher level can be expressed at any lower level; however, the reverse is not true (Vogt 1993: 127).
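As a minimal sketch of this one-way street, consider the following Python fragment; the variable, cutoffs, and values are invented for illustration. A ratio-level variable can be collapsed into an ordinal or dummy variable, but neither recoding can be expanded back into exact dollar amounts.

```python
# Invented example: income in dollars is ratio level (true zero, ratios meaningful).
incomes = [12_000, 48_500, 75_000, 230_000]

def bracket(income):
    """Ordinal recoding: ordered categories, but distances are not meaningful."""
    if income < 25_000:
        return "low"
    elif income < 100_000:
        return "middle"
    else:
        return "high"

ordinal = [bracket(x) for x in incomes]             # ['low', 'middle', 'middle', 'high']
dummy = [1 if x >= 50_000 else 0 for x in incomes]  # 1 = at or above $50,000, 0 = below

# The original dollar amounts can reproduce both recodings, but the
# recodings cannot reproduce the dollar amounts.
print(ordinal, dummy)
```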
2. Time Structure.
* Trend data: data on one object, repeated at two or more points in time. Example: a data set consisting of the inflation, unemployment, and poverty rates for the United States for each year from 1960 to 2003.
* Cohort data: data consisting of observations from two or more points in time, but with no necessary connection between the individual observations across time. Example: a data set consisting of the incomes and ages of a sample of persons in 1970, a different sample of persons in 1980, and so on. With such a data set, we could compare the incomes of 20-29 year olds in one decade to the same age group in other decades. We could also compare the incomes of 20-29 year olds in 1970 to 30-39 year olds in 1980, 40-49 year olds in 1990, and so on, if we wanted to see how the incomes of a single group changed over time.
* Panel data: measurements conducted at multiple points in time on two or more different realizations of the unit of analysis. Example: we observe the income and marital status of a set of persons every year from 1968 to 2003. "Panel" is a metaphor for the grid-like structure of the data: we have n persons observed at t different points in time, forming a panel with n x t cells, and each of these cells consists of k variables. Panel data is very powerful, because it enables a researcher to observe the temporal order of events for individual persons, strengthening any causal inferences the researcher may wish to draw. A sketch of this structure appears below.
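One simple way to represent the n x t panel structure is to key each cell by person and year; the records below are invented for illustration.

```python
# Invented panel: n = 2 persons, t = 3 years, k = 2 variables per cell.
panel = {
    ("person_1", 2001): {"income": 30_000, "married": 0},
    ("person_1", 2002): {"income": 32_000, "married": 1},
    ("person_1", 2003): {"income": 35_000, "married": 1},
    ("person_2", 2001): {"income": 45_000, "married": 1},
    ("person_2", 2002): {"income": 46_000, "married": 1},
    ("person_2", 2003): {"income": 44_000, "married": 0},
}

# Because the same person is observed repeatedly, we can see the temporal
# order of events for each individual, e.g., whether a change in marital
# status preceded a change in income.
for (person, year), row in sorted(panel.items()):
    print(person, year, row["income"], row["married"])
```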
With these basic concepts in hand, we can explore the most common descriptive statistics.
B. Measures of Central Tendency.

Measures of central tendency summarize the dominant trend of a variable. There are three main measures of central tendency:
* The mode is simply the most frequently occurring single value. For example, the modal racial group in the United States is white, because there are more whites than any other racial group.
* The median is the value of the middle observation, when the observations are sorted from least to greatest in terms of their value on the variable in question. The median is not at all sensitive to extreme values. One could multiply all values above the median by a factor of 10, and the median of the distribution would not be affected, as the sketch following this list illustrates. For this reason, the median is often used to summarize data for variables with extreme outliers; newspapers, for example, typically report the median value of home sales.
* The arithmetic mean, indicated by $\mu$ for a population and by $\bar{x}$ for a sample, is the most common measure of central tendency. It is the sum of the variable across all N observations in a population, or across all n observations in a sample, divided by the number of observations:

$$\mu = \frac{1}{N}\sum_{i=1}^{N} X_i \qquad\qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Because each observation's value is included in the calculation, the mean is sensitive to and reflects the presence of extreme values in the data. The formulas for the population mean and the sample mean are identical in form; only the notation differs.
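A quick check of this contrast between the mean and the median, using Python's standard library; the home prices are invented.

```python
import statistics

prices = [120_000, 150_000, 180_000, 220_000, 900_000]  # invented home prices

print(statistics.median(prices))  # 180000
print(statistics.mean(prices))    # 314000: pulled upward by the outlier

# Multiply every value above the median by 10: the median is unchanged,
# but the mean balloons.
inflated = [p * 10 if p > 180_000 else p for p in prices]
print(statistics.median(inflated))  # still 180000
print(statistics.mean(inflated))    # 2330000
```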
Note that the actual level of measurement is the limiting factor on which statistics are meaningful, not the coding system. If a variable for the day of the week a crime is committed is coded as 0 = Sunday, 1 = Monday, and so on, it is still a nominal variable. Any computer will happily calculate that the mean of the variable is around 3.5; however, this number is meaningless. The "average" crime is not committed on a Wednesday and a half. While this example is clear, it is fairly common that researchers calculate means of ordinal variables, such as Likert scales, which are coded 1 for "strongly disagree," 2 for "disagree," and so on up to 5 for "strongly agree." Strictly speaking, this is an invalid calculation. One interesting exception is that one can calculate the mean for dummy variables. Because they are coded as 1 if the characteristic is present and 0 otherwise, the sum of the variable – the numerator for the calculation of the mean – is simply the count of the observations that have the characteristic. Dividing by the total number of observations results in the proportion of observations sharing the trait. Thus, the mean of a dummy variable is a proportion. There is not one right answer to the question of which measure of central tendency is best; the appropriate choice depends on the level of measurement of the variable and on the question being asked.
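The dummy-variable exception mentioned above is easy to verify; a minimal sketch with an invented sample:

```python
import statistics

# 1 = female, 0 = male (invented sample of eight persons)
female = [1, 0, 1, 1, 0, 0, 1, 0]

# The sum counts the observations with the trait; dividing by n
# yields the proportion, so the mean of a dummy is a proportion.
print(sum(female))              # 4
print(statistics.mean(female))  # 0.5, i.e., 50 percent female
```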
There are a variety of variations on the concept of mean that may be useful in specialized situations. For example, the geometric mean is the nth root of the product of all observations, and the harmonic mean is the reciprocal of the arithmetic mean of the reciprocals of the values on the variable. There are numerous variations of the arithmetic mean as well, including means for grouped data, weighted means, trimmed means, and so on. Consult a statistics textbook for further information on these alternative calculations and their applications.
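Python's standard library implements all three means, which makes the definitions easy to check; the values here are arbitrary.

```python
import statistics

x = [2.0, 8.0]

print(statistics.mean(x))            # arithmetic: (2 + 8) / 2 = 5.0
print(statistics.geometric_mean(x))  # nth root of the product: sqrt(2 * 8) = 4.0
print(statistics.harmonic_mean(x))   # reciprocal of mean reciprocal: 2 / (1/2 + 1/8) = 3.2
```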
C. Measures of Variability.

Variety may be the spice of life, but variation is the very meat of science. Variation
between observations opens the door to analysis of causation, and ultimately to understanding. To say that variation is important for data analysis is understatement of the highest order; without variation, there would be no mysteries, and no hope of unraveling them. The point of data analysis is to understand the world; the point of entry for understanding the world is the differences between observations.
One group of measures is based on the frequency of occurrence of different attributes of a variable. Positional measures of variation are based on percentiles of a distribution; the xth percentile of a distribution is defined as the value that is higher than x percent of all the observations. The 25th percentile is also known as the first quartile, and the 75th percentile is referred to as the third quartile. Percentile-based measures of variability are typically paired with the median as a measure of central tendency; after all, the median is the 50th percentile. Deviation-based measures, in contrast, focus on a summary of some function of the quantitative distance of each observation from a measure of central tendency. Such measures are typically paired with a mean of one sort or another as a measure of central tendency. As in the case of central tendency, it is impossible to state whether position-based or deviation-based measures provide the best measure of variation; the answer will always depend on the nature of the data and the nature of the question being asked.
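A short sketch of position-based measures using the standard library's quantile function; the ages are invented.

```python
import statistics

ages = [19, 21, 22, 23, 24, 25, 27, 30, 34, 41, 58]  # invented data

# statistics.quantiles with n=4 returns the three quartile cut points.
q1, median, q3 = statistics.quantiles(ages, n=4)
iqr = q3 - q1  # interquartile range: the spread of the middle 50 percent

print(q1, median, q3, iqr)
```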
2. Measures of variability based on deviations.
The sum of the deviations from the arithmetic mean is always zero:

$$\sum_{i=1}^{n}(x_i - \bar{x}) = 0$$
Because the positive and negative deviations cancel out, measures of variability must dispense with the signs of the deviations; after all, a large negative deviation from the mean is as much an indication of variability as a large positive deviation. In practice, there are two methods to eradicate the negative signs: taking the absolute value of the deviations or squaring the deviations. The mean absolute deviation is one measure of deviation, but it is seldom used. The primary measure of variability is, in effect, the mean squared deviation. For a population, the variance parameter of the variable X, denoted by $\sigma_X^2$, is

$$\sigma_X^2 = \frac{1}{N}\sum_{i=1}^{N}(X_i - \mu)^2$$

and the population standard deviation, $\sigma_X$, is the positive square root of the variance.
Unlike the formulas for the population and sample means, there is an important computational difference between the formulas for the variance and standard deviation depending on whether we are dealing with a population or a sample. The formulas for the sample variance and sample standard deviation are:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 \qquad\qquad s = \sqrt{s^2}$$
Note that the divisor in these calculations is the sample size, n, minus 1. The reduction is necessary because the calculation of the sample mean used up some of the information that was contained in the data. Each time an estimate is calculated from a fixed number of observations, one degree of freedom is used up. For example, from a sample of 1 person, we could get an estimate of the population mean, but no estimate at all of the variability around it: with the one degree of freedom used up by the sample mean, n - 1 = 0 degrees of freedom remain for estimating the variance.
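The need for the n - 1 divisor can be checked by simulation: dividing the squared deviations by n systematically understates the population variance, while dividing by n - 1 does not. A rough sketch, with an invented population and sample size:

```python
import random

random.seed(1)
population = [random.gauss(50, 10) for _ in range(100_000)]  # invented population
mu = sum(population) / len(population)
sigma2 = sum((x - mu) ** 2 for x in population) / len(population)

n, trials = 5, 20_000
biased, unbiased = 0.0, 0.0
for _ in range(trials):
    sample = random.sample(population, n)
    xbar = sum(sample) / n
    ss = sum((x - xbar) ** 2 for x in sample)
    biased += ss / n          # divides by n: biased downward
    unbiased += ss / (n - 1)  # divides by n - 1: approximately unbiased

# The n divisor averages near (n-1)/n * sigma2 = 0.8 * sigma2;
# the n - 1 divisor averages near sigma2 itself.
print(sigma2, biased / trials, unbiased / trials)
```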
A standard deviation of a given size means very different things for variables measured on different scales, for example, tests whose maximum score is 100 or 800. For this reason, it is often useful to consider the coefficient of relative variation, usually indicated by CV, which is equal to the standard deviation divided by the mean. The CV facilitates comparisons among standard deviations of heterogeneous groups by normalizing each by the appropriate mean.
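The CV is a one-line calculation; a sketch with invented scores on two differently scaled tests:

```python
import statistics

test_100 = [55, 60, 70, 75, 90]       # invented scores out of 100
test_800 = [440, 480, 560, 600, 720]  # the same scores rescaled to 800

for scores in (test_100, test_800):
    cv = statistics.stdev(scores) / statistics.mean(scores)
    print(round(cv, 3))  # identical CVs: the relative spread is the same
```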
3. Measures of variability based on frequency.
The calculation of interquartile ranges and standard deviations requires quantitative data. For categorical data, a different approach is needed. For such variables, there are measures of variability based on the frequency of occurrence of different attributes (values) of a variable. The Index of Diversity (D) is one such measure. It is based on the proportions of the observations in each category of the qualitative variable. It is calculated as follows:

$$D = 1 - \sum_{k=1}^{K} p_k^2$$

where $p_k$ is the proportion of observations in category k, and K is the number of categories. If all observations fall into a single category, D equals zero, indicating no diversity; D grows as the observations are spread more evenly across the categories.
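A minimal implementation of the index as defined above, with invented category data:

```python
from collections import Counter

def index_of_diversity(values):
    """D = 1 minus the sum of squared category proportions."""
    counts = Counter(values)
    n = len(values)
    return 1 - sum((c / n) ** 2 for c in counts.values())

print(index_of_diversity(["a"] * 10))                # 0.0: no diversity
print(index_of_diversity(["a", "b", "c", "d"] * 5))  # 0.75: spread evenly over 4 categories
```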
D. Skewness and Kurtosis.

When skewness is zero, the distribution is said to be symmetric. A distribution with a long "tail" on one side is said to be skewed in that direction. When the distribution is skewed to the right, the skewness measure is positive; when it is skewed to the left, skewness is negative. Kurtosis, based on the 4th power of the deviations from the mean, gauges the thickness of the tails of the distribution relative to the normal distribution. For more information on these measures, see Downing and Clark (1989) and Vogt (1993).
E. Association Between Variables.

All of the measures described above are univariate, in that they describe one variable at a time. However, there is a class of descriptive measures that describes the degree of association, or co-variation, between two or more variables. One very important measure is the correlation coefficient, sometimes called Pearson's r. The correlation coefficient measures the degree of linear association between two variables. For the population parameter:

$$\rho_{XY} = \frac{\sum_{i=1}^{N}(X_i - \mu_X)(Y_i - \mu_Y)}{N\,\sigma_X\,\sigma_Y}$$

The numerator is based on the sum, across all observations, of the
product of the deviations from the respective means. These products are positive either when both deviations are positive or when both are negative. The product will be negative when the deviations have the opposite sign, which will occur when one variable is above its mean and the other is below its mean. When there is a positive linear relationship between the variables, their deviations from their respective means will tend to have the same sign, and the correlation will be positive. In contrast, when there is a negative linear relationship between the two variables, their deviations from their respective means will tend to have opposite signs and the correlation coefficient will be negative. If there is no linear relationship between the variables, the products of deviations with positive signs will tend to cancel out the products of deviations with negative signs, and the correlation coefficient will tend towards zero. Another feature of the correlation coefficient is that it is bounded by negative one and positive one. For example, if X = Y, then they have an exact positive linear relationship. Substituting X = Y into the formula yields ρ = 1. Similarly, if X = -Y, they have an exact negative linear relationship, and the correlation coefficient reduces to -1. Thus, the correlation coefficient always falls between -1 and +1, with the endpoints indicating perfect linear relationships.
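The deviation-products logic translates directly into Python; the data here are invented, and the standard library's own correlation function (Python 3.10+) gives the same answer.

```python
import statistics

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.5, 5.5, 8.0, 10.0]  # invented, roughly linear in x

mx, my = statistics.mean(x), statistics.mean(y)

# Sum of products of deviations: positive products dominate when the
# variables move together, negative products when they move oppositely.
num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
den = (sum((xi - mx) ** 2 for xi in x) * sum((yi - my) ** 2 for yi in y)) ** 0.5
r = num / den
print(round(r, 4))  # close to +1 for this nearly linear relationship

print(round(statistics.correlation(x, y), 4))  # same value (Python 3.10+)
```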
II. Inferential Statistics.

The New York Times regularly reports what Americans as a whole think about the issues of the day – in effect, a claim to know what some 290 million people say about anything without conducting a full-scale census of the population. One such claim, concerning public opinion about an impending war, was based on a telephone survey of only 668 adults nationwide, meaning that the Times did not know what the remaining 289,999,332 Americans actually had to say about the war. To claim to know what is on the mind of the country as a whole from such a small sample seems like utter madness or hubris run amok, even if the Times admits its poll has "a margin of sampling error of plus or minus four percentage points." In fact, the pollsters have a very sound basis for their claims, and under the right conditions they can indeed make valid inferences about the population as a whole from their sample. Much of what we think we know about our country's people and their attitudes – the poverty rate, the unemployment rate, the percent who believe in God, the percent who want to privatize Social Security, etc. – is information based on surveys of tiny fragments of the total population. This section briefly describes the essential logic of sampling theory, and subsequent sections illustrate some of the most important applications.
Remarkably, the precision of an estimate based on a properly drawn sample depends far more on the size of the sample than on the population size.

[Figure 1 about here.]

The key question, as illustrated in Figure 1, is how we can make a valid inference about the population parameters from the sample statistics. We don't usually care at all about the sample itself. Who cares what 668 individual people think about the prospect of war? Only the population values are of interest. The first step is to understand the process for drawing the sample. If the sample is drawn in a biased fashion, we will not be able to draw any inference about the population parameters. If our sample is drawn from West Point, or from a Quaker meeting house, or is the result of a self-selection process where people vote on a web site, it will not be possible to generalize to the population at large. At best, such a sample would allow you to generalize about cadets, or Quakers, or people who participate in Internet surveys.

We need a sample that is representative of the population of interest. The easiest way to obtain one is to draw a truly random sample, in which each member of the population of interest has an equal chance of being selected. To see the consequences of random sampling, imagine that we wish to know the average age of the students in a statistics class.
The mean age of these students is the parameter we wish to estimate. Each student writes their age on a 3" by 5" card and places the card into a box. The box is shaken and three of the cards are drawn at random. We get a value for the sample mean. Clearly, we don't know if the sample statistic is too high or too low, but it seems doubtful that we would be so lucky that it is exactly right. Even if it were, we wouldn't have any way of knowing it. We could draw a second sample, and a third. Each time, we would get a different sample mean, depending on which cards we happened to pick. Typically, if you actually try this, the values will be quite different. At this point, one might be tempted to decide that sampling is utterly useless. One might also be tempted to take the average of the three sample means, and that instinct leads to our solution.

The key is to see that the sample mean is a random variable: its value depends on the outcome of a random process. Any random variable has a probability distribution, with a mean (or expectation) and a variance. In the case of our age example, if there are 55 students in the class, there are exactly 26,235 different combinations of three cards that can be drawn from the box. Figure 2 shows the sample space, containing all possible outcomes. The mean of all these possible sample means – the expected value of the sample mean – is exactly equal to the population mean: $E(\bar{x}) = \mu$.
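The classroom experiment can be carried out exhaustively in a few lines of Python: enumerate every possible sample of 3, and the average of all the sample means equals the population mean exactly. A class of 10 invented ages is used here rather than 55, but the logic is identical.

```python
import statistics
from itertools import combinations

ages = [18, 19, 19, 20, 21, 22, 23, 25, 30, 43]  # invented class of 10
mu = statistics.mean(ages)

# Every possible sample of size 3: the full sample space.
sample_means = [statistics.mean(s) for s in combinations(ages, 3)]

print(len(sample_means))              # C(10, 3) = 120 possible samples
print(mu)                             # population mean
print(statistics.mean(sample_means))  # identical: E(xbar) = mu
```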
On the one hand, this is very good news. It says that any given sample mean is not biased. The expectation of the sample mean is equal to the population parameter we are trying to estimate. On the other hand, this information is not very helpful by itself. The expectation of the sample mean just tells us what we would expect in the long run if we could keep drawing samples repeatedly. In most cases, however, we only get to draw one sample. Our one sample could still be high or low, and so it seems we have not made any progress. But we have. The sample mean, as a random variable, also has a variance and a standard deviation. The Central Limit Theorem also tells us that:

$$\sigma_{\bar{x}}^2 = \frac{\sigma^2}{n} \qquad\qquad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$

This formula tells us that the degree of variation in sample means is determined by only two factors: the underlying variability in the population, and the sample size. If there is little variation in the population, the sample means will cluster tightly around the population mean; and the larger the sample, the smaller the variance of the sample mean.
The Central Limit Theorem provides one additional piece of information about the distribution of the sample mean. If the population values of the variable X are normally distributed, then the distribution of the sample mean of X will also be normally distributed. More importantly, even if X is not at all normally distributed, the distribution of the sample mean of X will approach normality as the sample size approaches infinity. As a matter of fact, even with relatively small samples of 30 observations, the distribution of the sample mean will approximate normality, regardless of the underlying distribution of the variable itself. We now have a complete description of the probability distribution of the sample mean, also known as the sampling distribution of the mean. The sampling distribution is a highly abstract concept, yet it is the key to understanding how we can draw a valid inference about a population of 290 million from a sample of only 668 persons. The central point to understand is that the unit of analysis in the sampling distribution is the sample mean. In contrast, the unit of analysis in the population and the sample is, in this case, people. The sample mean, assuming a normally distributed population or a large enough sample, is a normally distributed random variable.
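A simulation sketch of these claims: draws from a decidedly non-normal (exponential) population still yield sample means centered on the population mean, with a standard deviation close to $\sigma/\sqrt{n}$.

```python
import math
import random
import statistics

random.seed(0)
n, trials = 30, 10_000

# Exponential population: skewed, not remotely normal. For
# expovariate(1.0), the mean and standard deviation both equal 1.
means = [statistics.mean(random.expovariate(1.0) for _ in range(n))
         for _ in range(trials)]

print(statistics.mean(means))   # close to the population mean, 1.0
print(statistics.stdev(means))  # close to sigma / sqrt(n)
print(1 / math.sqrt(n))         # about 0.1826
```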
Concretely, the normality of the sampling distribution implies that the probability is 0.95 that a randomly chosen sample will have a sample mean that is within 1.96 standard errors of the true population mean, assuming the conditions for normality of the sampling distribution are met. And therefore it follows, as night follows day, that the probability must also be 0.95 that the true population mean is within 1.96 standard errors of whatever sample mean we obtain in a given sample. Mathematically,

$$P(\mu - 1.96\,\sigma_{\bar{x}} \le \bar{x} \le \mu + 1.96\,\sigma_{\bar{x}}) = 0.95$$

implies that

$$P(\bar{x} - 1.96\,\sigma_{\bar{x}} \le \mu \le \bar{x} + 1.96\,\sigma_{\bar{x}}) = 0.95.$$

One can derive the second formula from the first mathematically, but the logic is clear: if New York is 100 miles from Philadelphia, then Philadelphia is 100 miles from New York. If there is a 95 percent chance that the sample mean lies within 1.96 standard errors of the population mean, then there is a 95 percent chance that the population mean lies within 1.96 standard errors of the sample mean.
The conventional way to report such an interval is to provide the point estimate (PE) plus or minus the margin of error (ME), as follows:

$$\text{CI} = \text{PE} \pm \text{ME}$$
The point estimate is simply an unbiased sample statistic, and is our best guess about the true value of the population parameter. The margin of error is based on the distribution of the sample statistic, which in this example is normal. Thus, the distributional parameter of 1.96, based on the normal distribution, and the standard error of the estimator together generate the margin of error.

One practical issue needs to be resolved. If the mean of the variable X is unknown, then it is quite likely that the standard deviation of X is also unknown. Thus, we will need to use the sample standard deviation, s, in place of the population standard deviation, σ, in calculating the confidence interval. Of course, s is also a random variable and introduces more uncertainty into our estimation, so our confidence interval will have to be wider. In place of the normal distribution threshold of 1.96, we will have to use the corresponding threshold from the t distribution with the same degrees of freedom as were used to calculate s – that is, the sample size minus 1.
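Putting the pieces together, a sketch of a 95 percent confidence interval for a mean, using the t threshold in place of 1.96 because s is estimated from the data. The sample values are invented, and SciPy is assumed to be available for the t quantile.

```python
import math
import statistics
from scipy.stats import t  # assumed installed; any t-table would serve

sample = [10.2, 9.8, 11.5, 10.9, 9.4, 10.1, 11.0, 10.6]  # invented data
n = len(sample)

pe = statistics.mean(sample)     # point estimate
s = statistics.stdev(sample)     # sample standard deviation (n - 1 divisor)
se = s / math.sqrt(n)            # estimated standard error of the mean

t_crit = t.ppf(0.975, df=n - 1)  # replaces 1.96; wider, reflecting uncertainty in s
me = t_crit * se                 # margin of error

print(f"{pe:.2f} +/- {me:.2f}")  # PE plus or minus ME
print((pe - me, pe + me))        # the 95 percent confidence interval
```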
It is the sampling distribution of the estimator, not the sample itself, that enables us to make an inference back to the population. For any estimator of any population parameter, the sample by itself would not support inferences about the population. Only by reference to the sampling distribution of the estimator can valid inferences be drawn. Thus, we need to understand the properties of the sampling distribution of the statistic of interest. The following section describes some of the most commonly encountered statistics and their sampling distributions.
B. Common Sampling Distributions.

All statistics are estimators of population parameters that are calculated from samples,
and therefore all statistics are random variables. Table 1 below describes the essential features of the sampling distributions of the most common estimators. [Table 1 about here.] The expected value, variance, and shape of some estimators are not known. For example, there is no known analytic solution for the sampling distribution of a difference of two medians.
In such cases, the standard error can be estimated empirically by bootstrapping: many re-samples are drawn, with replacement, from the observed sample, and the measured variability of the statistic of interest across these re-samples is used as the estimate of the statistic's standard error.
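A sketch of the bootstrap for exactly this case, a difference of two medians; the two samples are invented.

```python
import random
import statistics

random.seed(42)
group_a = [12, 15, 9, 22, 17, 14, 30, 11]  # invented samples
group_b = [10, 13, 8, 16, 12, 9, 14, 11]

def diff_of_medians(a, b):
    return statistics.median(a) - statistics.median(b)

# Re-sample each group with replacement many times; the spread of the
# statistic across re-samples estimates its standard error.
reps = [diff_of_medians(random.choices(group_a, k=len(group_a)),
                        random.choices(group_b, k=len(group_b)))
        for _ in range(5_000)]

print(diff_of_medians(group_a, group_b))  # point estimate
print(statistics.stdev(reps))             # bootstrap standard error
```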
III. Cautions and Conclusions.
Statistics, as a discipline, is a bit schizophrenic. On the one hand, given a specific set of observations, there are precise formulas for calculating the mean, variance, skewness, kurtosis, and a hundred other descriptive statistics. On the other hand, the best that inferential statistics can tell you is that the right answer is probably between two numbers, but then again, maybe not. Even after the best statistical analysis, we do not know the truth with complete precision. We remain in a state of ignorance, but with an important difference: we have sharply reduced the degree of our ignorance. Rather than not knowing a figure at all, such as the poverty rate for the United States, we can say that we are 95 percent confident that in 2002 the poverty rate was between 10.1 and 10.5 percent.

A number of caveats are in order. The arguments above assume random, or at least representative, samples.
No statistical procedure can, by itself, deliver certain knowledge of the outside world. With statistics, we can compile evidence that is sufficient to convince us of a conclusion about reality with a reasonable degree of confidence. Statistics is a tool, and an extremely valuable one at that. But it is neither the beginning nor the end of scientific inquiry.
FURTHER READING.
Blalock, H. M., Jr. (1972). Social Statistics. New York: McGraw-Hill.
Everitt, Brian S. (1998). The Cambridge Dictionary of Statistics. Cambridge, U.K.: Cambridge University Press.
Freund, John E., and Ronald Walpole (1992). Mathematical Statistics, 5th ed. Englewood Cliffs, New Jersey: Prentice-Hall.
Hacking, Ian (1975). The Emergence of Probability: A Philosophical Study of Early Ideas about Probability, Induction, and Statistical Inference. London: Cambridge University Press.
Weisberg, Herbert F. (1992). Central Tendency and Variability. Newbury Park, Calif.: Sage Publications.
Figure 1: The Inference Problem.

Figure 2: Samples of 3 Drawn from a Class of 55.

Figure 3: Sampling Theory.