
STATISTICS


Delphina Gomes, Kriti Kotnala, Majida Shaheen, Prerna Gupta, Sonal Kulshrestha (M.Sc Previous, Food and Nutrition)

DEFINITIONS OF STATISTICS
Statistics is a set of concepts, rules and procedures that help us to:
● organise numerical information in the form of tables, graphs and charts;
● understand the statistical techniques underlying decisions that affect our lives and well-being;
● make informed decisions.

According to Croxton and Cowden, statistics may be defined as the science which deals with the collection, presentation, analysis and interpretation of numerical data. Statistics is the process of making generalizations on the basis of information; it is concerned with three areas: the collection and classification of data, describing and presenting data, and interpreting and drawing conclusions from data. Statistics has logic as its basis, mathematics as its tool and application as its goal. Statistics is the study of a population and its variation.

BROAD CATEGORIES OF STATISTICS
Statistics can broadly be split into two categories:
Descriptive Statistics: Descriptive statistics deals with the meaningful presentation of data such that its characteristics can be effectively observed. It encompasses the tabular, graphical or pictorial display of data, condensation of large data into tables, preparation of summary measures to give a concise description of complex information, and the exhibition of patterns that may be found in data sets.
Inferential Statistics: Inferential statistics deals with drawing inferences and taking decisions by studying a subset or sample from the population.

DATA: facts, observations and information that come from investigations.
Measurement data, sometimes called quantitative data: the result of using some instrument to measure something (e.g. test score, weight).
Categorical data, also referred to as frequency or qualitative data: things are grouped according to some common property and the number of members of each group is recorded (e.g. males/females, vehicle type).

DATA CAN BE CLASSIFIED BASED ON THEIR LEVEL OF MEASUREMENT
NOMINAL-LEVEL measurement (also known as attribute or categorical measurement): data whose values describe attributes that do not imply magnitude or order, e.g. gender, religion, disease status.
ORDINAL-LEVEL measurement: data whose values are ranked or ordered from lowest to highest or highest to lowest, but the values do not imply magnitude of the differences, e.g. education level, socio-economic status.
INTERVAL-LEVEL measurement: data whose values are ordered and have measurable differences.
RATIO-LEVEL measurement: data whose values can be ordered, have measurable differences and have a zero starting point; the zero point implies an absence of the variable. Temperature measured in Celsius is not ratio level but is interval level. Interval-level and ratio-level measurements are together termed interval-ratio measures. Examples of interval-ratio measures: age, height, weight.

VARIABLE: a characteristic of objects, events or individuals that has or can be assigned a numerical value; alternatively, a variable is a property of an object or event that can take on different values. For example, college major is a variable that takes on values like mathematics, computer science, English, psychology, etc.

Discrete variable: a variable with a limited number of values, e.g. gender (male/female), college class (freshman/sophomore/junior/senior).
Continuous variable: a variable that can take on many different values; in theory, any value between the lowest and highest points on the measurement scale.
Independent variable: a variable that is manipulated, measured or selected by the researcher as an antecedent condition to an observed behaviour. In a hypothesized cause-and-effect relationship, the independent variable is the cause and the dependent variable is the effect.
Dependent variable: a variable that is not under the experimenter's control (the data). It is the variable that is observed and measured in response to the independent variable.
Qualitative variable: a variable based on categorical data.
Quantitative variable: a variable based on quantitative data.

ROLE OF STATISTICS IN PUBLIC HEALTH AND COMMUNITY MEDICINE
Statistics finds extensive use in public health and community medicine. Statistical methods are foundations for public health administrators to understand what is happening to the population under their care, at the community level as well as the individual level. If reliable information regarding the disease is available, the public health administrator is in a position to:
● assess community needs
● understand the socio-economic determinants of health
● plan experiments in health research
● analyse their results
● study the diagnosis and prognosis of the disease for taking effective action
● scientifically test the efficacy of new medicines and methods of treatment.
Statistics in public health is critical for calling attention to problems, identifying risk factors, suggesting solutions and, ultimately, taking credit for our successes. The most important application of statistics in sociology is in the field of demography. Statistics helps in developing sound methods of collecting data so as to draw valid inferences regarding the hypothesis. It helps us present data in numerical form after simplifying complex data by way of classification, tabulation and graphical presentation. Statistics can be used for comparison as well as to study the relationship between two or more factors, and such relationships further help to predict one factor from the other. Statistics helps the researcher come to valid conclusions.

STUDY EXERCISE

1. An 85-year-old man is rushed to the emergency department by ambulance during an episode of chest pain. The preliminary assessment of the condition of the man is performed by a nurse, who reports that the patient's pain seems to be 'severe'. The characterization of pain as 'severe' is (a) Dichotomous (b) Nominal (c) Quantitative (d) Qualitative. Ans. (d)

2. If we ask a patient attending the OPD to evaluate his pain on a scale of 0 (no pain) to 5 (the worst pain), then this commonly applied scale is a (a) Dichotomous (b) Ratio scale (c) Continuous (d) Nominal. Ans. (b)

3. For each of the following variables, indicate whether it is quantitative or qualitative and specify the measurement scale: (a) Blood pressure (mmHg) (b) Cholesterol (mmol/l) (c) Diabetes (yes/no) (d) Body mass index (kg/m2) (e) Age (years) (f) Sex (female/male) (g) Employment (paid work/retired/housewife) (h) Smoking status (smoker/non-smoker/ex-smoker) (i) Exercise (hours per week) (j) Alcohol intake (units per week) (k) Level of pain (mild/moderate/severe)
Ans. (3) (a) Quantitative continuous (b) Quantitative continuous (c) Qualitative dichotomous (d) Quantitative continuous (e) Quantitative continuous (f) Qualitative dichotomous (g) Qualitative nominal (h) Qualitative nominal (i) Quantitative discrete (j) Quantitative discrete (k) Qualitative ordinal

CORRELATION AND REGRESSION
The anthropologist Sir Francis Galton (1822-1911) used the term regression in explaining the relationship between the heights of fathers and their sons. Often in medical research it is desirable to analyse the relationship or association between two quantitative (i.e. continuous, discrete or numerical ordinal) variables. The term 'correlation' indicates a relationship between two variables in which, with changes in the values of one variable, the values of the other variable also change. The nature and strength of the relationship is examined by regression and correlation analysis. When the objective is to determine the association or the strength of the relationship between two such variables, we use the correlation coefficient (r). If the objective is to quantify and describe the existing relationship with a view to prediction, we use regression analysis. Correlation can be either positive or negative. When the values of the two variables move in the same direction, so that an increase in one variable is associated with an increase in the other (though not necessarily of the same magnitude), or a decrease in one is associated with a decrease in the other, the correlation is called 'positive'. If, on the other hand, the values of the two variables move in opposite directions, so that with an increase in the value of one variable the other decreases, and with a decrease in one the other increases, the correlation is called 'negative'.

Fig. - 1 shows the scatter diagram where the relationship indicates perfect positive correlation (correlation coefficient = +1): the values of the two variables move in the same direction and in exact magnitude, i.e. if one variable increases by 5 units, the other variable also increases by 5 units.

Fig. - 2 depicts perfect negative correlation (correlation coefficient = -1): if one variable increases by 5 units, the other variable decreases by the same number of units.

Fig. - 3 shows the scatter diagram where the relationship indicates positive correlation (r lying in the range 0 to 1), as the values of the two variables move in the same direction.

Fig. - 4 shows the scatter diagram where the relationship indicates negative correlation (r lying between -1 and 0), as the values of the two variables move in opposite directions.

Fig. - 5 shows that there is no correlation between the two variables (correlation coefficient = 0).

Correlation Coefficient (r)
The correlation coefficient is calculated to study the extent or degree of correlation between two variables. It ranges between -1 and +1 and is denoted by 'r'. A correlation of zero indicates no association, whereas a correlation of 1 indicates perfect association. The sign of the correlation coefficient provides information on the type of association between the variables: if it is positive, high values of one variable will be associated with high values of the other variable; if it is negative, low values of one variable are associated with high values of the other.

METHODS:
1. Scatter Diagram: A scatter plot is a visual description of the relationship between two continuous variables. It consists of a graph with a horizontal axis (x-axis) and a vertical axis (y-axis) representing the two variables. Each observation of x and y is represented by a dot on the graph at the point (x, y).
2. Karl Pearson Correlation Coefficient: This correlation coefficient applies when both variables are continuous and their joint distribution follows a normal distribution. Karl Pearson's correlation coefficient (r), also called the 'product moment correlation', is given by the following formula:

r = Σ(x - x̄)(y - ȳ) / √[Σ(x - x̄)² Σ(y - ȳ)²]

The numerator is the sum of cross products about the mean, whereas the denominator is the square root of the product of the sums of squares of deviations about the mean for each of the two variables.
3. Spearman Correlation Coefficient: Sometimes the variables are not normally distributed but can be ranked in order; the appropriate measure is then the Spearman rank correlation coefficient. It also ranges from -1 to +1 and is interpreted in the same way as the Pearson correlation coefficient. The Spearman correlation coefficient (r) is given as follows:

r = 1 - 6Σd² / [n(n² - 1)]

where Σd² is the total of the squares of the differences of corresponding ranks and n is the number of pairs of observations.

Coefficient of determination: The coefficient of determination is defined as the square of the correlation coefficient. It is the amount of variation in the dependent variable accounted for by the independent variable. For example, if the coefficient of correlation (r) between age and blood pressure is 0.8, then the coefficient of determination is r² = 0.64. This is interpreted as: 64% of the variability in blood pressure is accounted for by age, whereas the remaining 36% is not; other factors such as weight, diet and exercise may account for that 36%.
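As a quick illustration, the sketch below computes Pearson's r, Spearman's rank correlation and the coefficient of determination in Python. It assumes scipy is available; the paired age and blood pressure values are invented for the example.

```python
# A minimal sketch (hypothetical data): Pearson's r, Spearman's rank
# correlation and the coefficient of determination with scipy.
from scipy import stats

# Invented paired observations: age (years) and systolic BP (mmHg).
age = [25, 30, 35, 40, 45, 50, 55, 60]
bp = [118, 121, 124, 130, 128, 135, 140, 142]

r, p = stats.pearsonr(age, bp)         # product moment correlation
rho, p_rho = stats.spearmanr(age, bp)  # rank correlation

print(f"Pearson r = {r:.2f} (p = {p:.4f})")
print(f"Spearman rho = {rho:.2f} (p = {p_rho:.4f})")
print(f"Coefficient of determination r^2 = {r ** 2:.2f}")
```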

Regression Analysis
If the objective is to quantify and describe the existing relationship with a view to prediction, we use regression analysis. In regression we generally take y as the dependent variable and x as the independent variable. Regression analysis determines the form of the relationship by the line which best fits the data, called the 'regression equation'. This regression line, which is unique, is determined by the 'least square method': the line is chosen so that the sum of squares of the deviations of the observed points about the line is minimum, where each deviation is measured as the distance from an observed point to the line.

The equation of the straight line is given by

y = a + bx

where 'y' is the dependent variable, 'x' is the independent variable, 'a' is the intercept and 'b' is the slope of the line, which measures the amount of change in y for a unit change in x. The estimates of a and b are found using the normal equations:

Σy = an + bΣx and Σxy = aΣx + bΣx²

The regression coefficients can be obtained as follows.

If the values of x and y are given:

b(y on x) = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²
b(x on y) = Σ(x - x̄)(y - ȳ) / Σ(y - ȳ)²

If the standard deviations are given:

b(y on x) = r(σy/σx)
b(x on y) = r(σx/σy)
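Under the same caveat (invented data; Python with numpy assumed), the following sketch fits the least-squares line directly from the formulas above and checks the result against numpy's built-in fit.

```python
# A minimal sketch (hypothetical data): least-squares line y = a + b*x.
import numpy as np

x = np.array([25, 30, 35, 40, 45, 50, 55, 60], dtype=float)
y = np.array([118, 121, 124, 130, 128, 135, 140, 142], dtype=float)

# b(y on x) = sum of cross products about the mean / sum of squares of x.
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()  # the fitted line passes through (x-bar, y-bar)

print(f"Regression equation: y = {a:.2f} + {b:.2f}x")
print(f"Predicted y at x = 42: {a + b * 42:.1f}")

# The same estimates from numpy's degree-1 polynomial fit.
b_check, a_check = np.polyfit(x, y, 1)
```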

In simple linear regression, we consider only two variables: one dependent and one independent. From the simple regression equation, we predict the value of the dependent variable on the basis of the independent variable. Many times the researcher may come across situations where he/she is dealing with more than one independent variable. For example, the researcher may be interested in finding out how much change is likely to occur in serum cholesterol level (the outcome or dependent variable) following changes in age, body weight, alcohol intake, calories consumed per day, and calories spent per day in physical exercise. Thus we have a set of 5 independent variables, all of which are likely to affect the outcome variable by themselves. Situations where more than one independent variable and one dependent variable are available, and the aim is to predict the dependent variable on the basis of many independent variables, call for the technique known as 'multivariate regression analysis'.

Multiple Linear Regression Model: Multiple regression is the extension of simple linear regression, except that in this regression model we have more than one independent or explanatory variable, and the outcome (dependent) variable is on a continuous or discrete numerical scale. This model is used when there is one outcome (dependent) variable measured on a numerical continuous or numerical discrete scale, and there is more than one predictor (independent) variable. In the example of serum cholesterol, there was one outcome variable measured on a continuous numerical scale (mg/dl) and there were 5 independent variables, with X1 denoting body weight in kg, X2 denoting age in years, and so on up to the 5th variable, exercise energy, denoted by X5. Such a situation is known as 'multiple linear regression analysis'.
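To make the idea concrete, here is a small sketch of a multiple linear regression fitted by least squares with numpy. The variable names mirror the serum cholesterol example, but the numbers are invented and only two of the five predictors are shown.

```python
# A minimal sketch (hypothetical data): regression of serum cholesterol
# (mg/dl) on body weight (kg) and age (years) via least squares.
import numpy as np

weight = np.array([60, 72, 80, 65, 90, 75, 68, 85], dtype=float)
age = np.array([30, 45, 50, 35, 55, 40, 32, 48], dtype=float)
chol = np.array([180, 210, 230, 190, 250, 215, 185, 240], dtype=float)

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones_like(weight), weight, age])
beta, *_ = np.linalg.lstsq(X, chol, rcond=None)

intercept, b_weight, b_age = beta
print(f"cholesterol = {intercept:.1f} + {b_weight:.2f}*weight + {b_age:.2f}*age")
# b_weight reads as the average mg/dl change in cholesterol per kg of
# body weight, holding age constant.
```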

The advantage of multiple linear regression is that the result gives β coefficients indicating the change in the dependent variable for a change in each independent variable. For example, if after carrying out multiple linear regression the beta coefficient for weight is 2.1, it indicates that for every kilogram increase in weight, the serum cholesterol is likely to increase, on average, by 2.1 mg/dl. Along with the coefficients, the analysis also gives the 95% confidence interval, the significance value for each variable and the regression equation.

Univariate analysis is the simplest form of quantitative (statistical) analysis. The analysis is carried out by describing a single variable and its attributes for the applicable unit of analysis. For example, if the variable age were the subject of the analysis, the researcher would look at how many subjects fall into given age attribute categories.

Bivariate analysis is one of the simplest forms of quantitative (statistical) analysis. It involves the analysis of two variables (often denoted X and Y), for the purpose of determining the empirical relationship between them. To see if the variables are related to one another, it is common to measure how the two variables simultaneously change together (see also covariance). Bivariate analysis can be helpful in testing simple hypotheses of association and causality: checking to what extent it becomes easier to know and predict a value for the dependent variable if we know a case's value on the independent variable (see also correlation).

Multivariate analysis (MVA) is based on the statistical principle of multivariate statistics, which involves observation and analysis of more than one statistical variable at a time. In design and analysis, the technique is used to perform trade studies across multiple dimensions while taking into account the effects of all variables on the responses of interest.

A confounding variable is an extraneous variable in a statistical model that correlates (positively or negatively) with both the dependent variable and the independent variable. For example, body mass index (BMI) and LDL cholesterol are both established heart disease risk factors, and it is reasonable to hypothesize that BMI is a causal determinant of LDL; however, age, ethnicity, smoking, and alcohol use may confound this association.

PARAMETRIC AND NON-PARAMETRIC TESTS
Statisticians talk about statistics in relation to parameters. A parameter is a numeric quantity, usually unknown, that describes a certain population characteristic. For example, the population mean is a parameter that is often used to indicate the average value of a quantity. Parameters are often estimated since their value is generally unknown, especially when the population is large enough that it is impossible or impractical to obtain measurements for everyone. For example, it would be impossible to line up all adult human males on the planet and obtain their heights with perfect measurement; therefore, the true mean height of adult human males can only be estimated, not known.

Difference between parametric and non-parametric tests:
Parametric test: The hypothesis testing procedure, whether z test, unpaired t test or paired t test, requires the population distribution to be normal or approximately normal.
Non-parametric test: In research, many times we are not sure about the underlying distribution of the population. In such cases we apply non-parametric tests, which are not based on the assumption of normality of the population, especially when the sample size is small (n < 30).

Hypotheses about the population mean take the form H0: μ = μ0 against Ha: μ ≠ μ0 (two-sided), Ha: μ > μ0 (one-sided) or Ha: μ < μ0 (one-sided).

Test Statistic
For the population mean, when the data are N(μ, σ), the test statistic is

t = (x̄ - μ0) / (s/√n)

Notes:
● t follows Student's t-distribution with df = n - 1.
● It measures the compatibility between the null hypothesis and the data.
● If n is large (≥ 30), the CLT guarantees an approximately normal sampling distribution, and the t can be replaced with z, where z follows a standard normal distribution.

Fig.: Top: two-sided (both tails), Ha: μ ≠ μ0. Middle: one-sided (right tail), Ha: μ > μ0. Bottom: one-sided (left tail), Ha: μ < μ0.

Decision rule
● Reject H0 when the P-value is smaller than the significance level α; otherwise, fail to reject H0.
● This is a very generic rule and holds for most (if not all) statistical tests; it is valid in other settings, too.
● The P-value is the smallest level α at which the data are significant.
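As a small worked example of this decision rule (invented sample values; scipy assumed), a one-sample t test of H0: μ = μ0 can be run and the P-value compared with α:

```python
# A minimal sketch (hypothetical data): one-sample t test of H0: mu = mu0,
# applying the rule "reject H0 when the P-value is smaller than alpha".
from scipy import stats

mu0 = 120.0  # hypothesized population mean
sample = [118, 125, 130, 122, 128, 135, 121, 127, 133, 126]

t_stat, p_value = stats.ttest_1samp(sample, mu0)  # two-sided by default
alpha = 0.05

print(f"t = {t_stat:.2f}, df = {len(sample) - 1}, p = {p_value:.4f}")
if p_value < alpha:
    print("Reject H0 at the 5% significance level")
else:
    print("Fail to reject H0")
```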

The sampling distribution of the mean and z-scores
When you first encountered z-scores, you were undoubtedly using them in the context of a raw score distribution. In that case, you calculated the z-score corresponding to some value of X as follows:

z = (X - μ) / σ

For the sampling distribution of the mean, the analogous statistic is

z = (x̄ - μ) / (σ/√n)

And, if the sampling distribution of x̄ is normal, or at least approximately normal, we may then refer this value of z to the standard normal distribution, just as we did when we were using raw scores. (This is where the CLT comes in, because it tells us the conditions under which the sampling distribution of x̄ is approximately normal.)

Standard scores or Z-scores
Z scores are scores converted into the number of standard deviations that the scores are from the mean of their distribution. A Z score of +2.0 therefore means that the original score was 2 standard deviations above the mean; a Z of -3.5 means that the original score was three and a half standard deviations below the mean. The mean of a set of Z scores is zero. As we will need to use means of several statistics in later sections, we will symbolize means by putting a bar over the symbol representing the statistic.

The standard normal distribution
The standard normal distribution is the normal distribution with a mean of zero, a standard deviation of one, and a total area under its curve of 1.0. The meaning of the area under the curve will become clear in the examples; it is simply the proportion of cases. Tables in textbooks give one or more of the following values:
(a) the proportion of cases falling in the area between the mean and a given Z score, and/or
(b) the proportion of cases falling beyond a given Z score (called the proportion in the smaller area), and/or
(c) the proportion in the larger area cut off by a given Z score.
Detailed instructions and examples are given below for the use of each of these types of tables. Tables 1, 2 and 3 give selected values from each of the different types. In all of these tables the left-hand column is a Z score. Before each table there is a diagram indicating the areas involved.
A. An example of the first sort of table is given below in Table 1. This table gives the area between the mean and a given Z score.

Figure shows the area of the curve lying between the mean and a Z of –1.0.

TABLE 1: PROPORTION OF CASES LYING BETWEEN THE MEAN AND A GIVEN STANDARD SCORE

x/σ = Z    Proportion of area of curve between M and Z
0.00       .0000
0.10       .0398
0.20       .0793
0.50       .1915
1.00       .3413
2.00       .4772
3.00       .49865

A number of problems can be solved using this table.
(i) To find the proportion of cases scoring above a given point.
(a) If Z is positive, the proportion scoring above a given point is given by .5000 minus the proportion lying between the mean and the value of Z. Example: If Z is +1.00, the proportion of cases lying between Z and the mean is .3413; therefore the proportion above this point is .5000 - .3413 = .1587.
(b) If Z is negative, the proportion scoring above a point is given by .5000 plus the proportion lying between the mean and the Z value. Example: If Z is -1.00, the proportion of cases lying between this value and the mean will be the same as that lying between the mean and a Z score of +1.00, because the normal curve is perfectly symmetrical. Thus the required proportion will be .3413 + .5000 = .8413.

(ii) To find the proportion of cases falling below a certain point.
(a) If Z is positive, the proportion of cases falling below a given point will be equal to .5000 plus the proportion of cases between the mean and that Z score. Example: The proportion of cases falling below a Z score of +2.00 is equal to .5000 + .4772 = .9772.
(b) If Z is negative, the proportion of cases falling below a point will equal .5000 minus the area between the mean and that Z score. Example: The proportion of cases falling below a Z score of -2.00 equals .5000 - .4772 = .0228.
(iii) To find the proportion of cases falling between two given points.
(a) If both Z scores have the same sign, i.e. both positive or both negative, the proportion falling between the two points will be the proportion lying between the mean and the higher Z minus the proportion lying between the mean and the lower Z. Example: What proportion of cases lie between a Z score of +1.00 and a Z score of +2.00? The proportion between the mean and +2.00 is .4772, while the proportion between the mean and +1.00 is .3413; thus the proportion lying between the two is .4772 - .3413 = .1359.

(b) If the Z scores have unlike signs, i.e. one is positive and the other is negative, then the proportion lying between them is the sum of the proportions lying between the mean and each Z score. Example: What proportion of cases lie in the range between +1.00 and -.50? The proportion between the mean and +1.00 is .3413, while that between the mean and -.50 is .1915. Therefore, the proportion of cases falling in the range between +1.00 and -.50 = .3413 + .1915 = .5328.
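The same table values and worked examples can be reproduced from the standard normal CDF rather than a printed table; a brief sketch (Python with scipy assumed):

```python
# A minimal sketch: Table 1 values and the worked examples above,
# computed from the standard normal CDF instead of a printed table.
from scipy.stats import norm

# Table 1 entry: proportion between the mean and Z = 1.00.
print(f"{norm.cdf(1.00) - 0.5:.4f}")              # .3413

# Rule (i)(a): proportion above Z = +1.00.
print(f"{1 - norm.cdf(1.00):.4f}")                # .1587

# Rule (iii)(b): proportion between Z = +1.00 and Z = -0.50.
print(f"{norm.cdf(1.00) - norm.cdf(-0.50):.4f}")  # .5328
```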

B. The second type of table described above gives the proportion of cases lying further away from the mean than a given Z score. Some values from such a table are given in Table 2. Again a visual aid is provided in the Figure, which shows the proportion of cases falling in the area beyond a Z of +1.0.

Proportion of cases in the smaller portion

TABLE 2: PROPORTIONS IN THE SMALLER PORTION OF THE CURVE FOR DIFFERENT VALUES OF Z

x/σ = Z    Proportion falling in area further away from the mean than the specified Z score
0.00       .5000
0.10       .4602
0.20       .4207
0.50       .3085
1.00       .1587
2.00       .0228
3.00       .00135

Just as in Table 1, the Z score values start at .00, which is the mean, but the values in the body of this table can be seen to be .5000 minus the corresponding value in Table 1. The rules for using Table 2 are therefore different.
(i) To find the proportion of cases scoring above a given point.
(a) If Z is positive, the value in the table opposite that Z will be the proportion scoring above that point. Example: If Z is +1.00, what proportion of cases will score above that value? Opposite a Z of 1.00 is the proportion .1587, which is the proportion of cases scoring above that value.

(b) If Z is negative, the proportion of cases scoring above the value will be 1.0000 minus the proportion opposite Z in the table. Example: If Z is -1.00, what proportion of cases will obtain a higher score? The proportion opposite 1.00 in the table is .1587; therefore the proportion scoring above a Z score of -1.00 will be 1.0000 - .1587 = .8413.
(ii) To find the proportion of cases scoring below a given point.
(a) If Z is positive, the proportion will be 1.0000 minus the proportion opposite the value of Z in the table. Example: The proportion of cases falling below a Z of +2.00 is equal to 1.0000 - .0228 = .9772.
(b) If Z is negative, the proportion will be equal to the value in the table. Example: The proportion of cases falling below a Z score of -2.00 equals .0228.
(iii) To find the proportion of cases falling between two specified points.
(a) If both Z's have the same sign, the proportion lying between them will be the difference between the proportions in the table corresponding to the Z's. Example: What proportion of cases lies in the range between a Z score of +1.00 and a Z score of +2.00? Reference to Table 2 shows that a proportion of .1587 obtains higher scores than a Z of 1 and .0228 obtains higher scores than a Z of 2. By the rule, the proportion lying in the range Z1 - Z2 = .1587 - .0228 = .1359.
(b) If the Z's have unlike signs, the proportion lying between them will equal .5000 minus the proportion corresponding to Z1, plus .5000 minus the proportion corresponding to Z2. Example: What proportion of cases lies in the range between a Z of +1.00 and a Z of -0.50? The proportion corresponding to a Z of +1.00 is .1587 and the proportion corresponding to a Z of -0.50 is .3085. Subtracting each of these from .5000 gives .3413 and .1915 respectively. Adding these gives .5328.
C. The third sort of table gives the proportion of cases lying in the larger area. An example of this type of table is given in Table 3. The Figure shows the proportion of cases in the larger area when Z is +1.0.

Proportion of cases in larger area when Z = +1.0

TABLE 3: AREAS IN THE LARGER PORTION OF THE NORMAL CURVE FOR DIFFERENT VALUES OF Z

x/σ = Z    Area in the larger portion
0.00       .5000
0.10       .5393
0.20       .5793
0.50       .6915
1.00       .8413
2.00       .9772
3.00       .99865

Problems
1. Write the rule for finding the number of cases falling above a given point.
2. Write the rule for finding the number of cases falling below a given point.
3. Write the rule for finding the number of cases falling between two points.

Answers
1. To find the proportion of cases falling above a given point:
(a) If Z is positive, subtract the value in the table corresponding to Z from 1. Example: If Z is +1.00, what proportion of cases will score above that value? The proportion corresponding to a Z of +1.00 is .8413; this proportion subtracted from 1.0000 leaves .1587.
(b) If Z is negative, the proportion opposite Z in the table gives the proportion of cases above that point. Example: If Z is -1.00, what proportion of cases will score above that point? The answer directly from the table is .8413.
2. To find the proportion of cases falling below a point:
(a) If Z is positive, this can be read directly from the table. Example: The proportion of cases falling below a Z of +2.00 is .9772.
(b) If Z is negative, the proportion falling below that point will be 1.0000 minus the proportion in the table corresponding to Z. Example: What proportion of cases falls below a Z of -2.00? The answer will be 1.0000 - .9772, which equals .0228.
3. To find the proportion of cases falling between two given points:
(a) If the Z's have the same sign, the answer is obtained by finding the difference between the proportions corresponding to the two Z's. Example: The proportion of cases lying in the range between a Z of +1.00 and a Z of +2.00 will equal .9772 - .8413 = .1359.
(b) If the Z's have unlike signs, the proportion will be the proportion corresponding to Z1 minus .5000, plus the proportion corresponding to Z2 minus .5000.

Example: The proportion of cases falling between a Z of +1.00 and a Z of -.50 will be (.8413 - .5000) + (.6915 - .5000) = .3413 + .1915 = .5328.

CHI SQUARE TEST
Medical research many times deals with situations that involve comparing two groups with respect to the presence or absence of various diseases. Here the qualitative or categorical variables are measured in terms of 'counts'. In other words, the researcher is interested in finding out the association or relationship between two qualitative variables. The statistical tests used for such variables, which do not assume normality of the variable, are called non-parametric (or distribution-free) tests. These tests are weaker than parametric tests and require fewer assumptions. For categorical data, the test used is the chi-square (χ2) test. Since the information is qualitative, it is collected as counts, which are compiled and presented in a table called a contingency table. When both qualitative variables are dichotomous, the tabular presentation takes the form of a 2 x 2 contingency table (2 rows and 2 columns). In general we can have an r x c contingency table, where r is the number of rows and c is the number of columns. Under the null hypothesis, the test statistic follows a chi-square distribution with (r-1) x (c-1) degrees of freedom. The number of subjects falling in each category of our collected sample is called the observed frequency (Oij), and the number of subjects in our sample that we would expect to observe if the null hypothesis were true is called the expected frequency (Eij).

Chi-square tests have three applications:
1. χ2 test for independence, to test whether two or more characteristics are associated with each other or independent.
2. χ2 test for goodness of fit, to study whether two or more independent populations are similar with respect to some characteristic.
3. χ2 test for homogeneity, to study whether two study groups independently drawn from two populations are homogeneous with respect to some criterion of classification.

In all three situations, the test statistic takes the same formula:

χ2 = ΣΣ (Oij - Eij)² / Eij

where Oij and Eij are the observed and expected frequencies for the ith row and jth column. The expected frequencies are calculated as follows:

Eij = (ith row total × jth column total) / grand total

The χ2 test statistic indicates how close the observed and expected frequencies lie. When the null hypothesis is true, the test statistic follows a χ2 distribution with (r-1) x (c-1) degrees of freedom. The calculated χ2 is compared with the χ2 table value, and the decision to reject H0 is taken if the calculated test statistic value is greater than the table value for the specified value of α.
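As a concrete sketch (the 2 x 2 counts are invented; scipy assumed), the whole procedure, expected frequencies, test statistic and P-value, can be run as follows. Note that scipy applies Yates' continuity correction to 2 x 2 tables by default.

```python
# A minimal sketch (hypothetical counts): chi-square test of association
# between exposure and disease from a 2 x 2 contingency table.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: exposed / not exposed; columns: diseased / not diseased.
observed = np.array([[30, 70],
                     [20, 80]])

# Yates' continuity correction is applied by default for 2 x 2 tables.
chi2, p_value, dof, expected = chi2_contingency(observed)

print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p_value:.4f}")
print("Expected frequencies (row total x column total / grand total):")
print(expected)
```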

ANOVA
The world is complex and multivariate in nature, and instances when a single variable completely explains a phenomenon are rare. For example, when trying to explore how to grow a bigger tomato, we would need to consider factors that have to do with the plant's genetic makeup, soil conditions, lighting, temperature, etc. Thus, in a typical experiment, many factors are taken into account. One important reason for using ANOVA methods rather than multiple two-group studies analyzed via t tests is that the former method is more efficient: with fewer observations we can gain more information.

Many times the researcher is dealing with situations where there are more than 2 groups of interest. In such cases the usual z test and t test fail. The correct technique to compare means in three or more groups is analysis of variance, or ANOVA. When one qualitative variable defines the groups, a one-way ANOVA is used, whereas when the groups are defined by two qualitative variables, a two-way ANOVA is used. ANOVA is a much more flexible and powerful technique that can be applied to much more complex research issues. It should be noted that the analysis of variance is not intended for testing the significance of sample variances; its purpose is to test for the significance of differences among sample means.

When dealing with three or more groups, variability creeps in from two sources: one source is the groups themselves, i.e. between-group variability, and the second is within-group variability. One of the basic assumptions of ANOVA is that the variances in the groups being compared are all equal, i.e. homogeneous. Also, the variable of interest in the groups is measured on either a continuous or a discrete scale. If the variable is measured on an ordinal scale, we use a non-parametric test, the Kruskal-Wallis test, in which we compare the medians in the groups rather than the means. The assumption regarding homogeneity of variances is tested using Bartlett's test.
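A short sketch tying the pieces together (invented measurements; scipy assumed): Bartlett's test for the homogeneity assumption, a one-way ANOVA across three groups, and the Kruskal-Wallis test as the non-parametric alternative.

```python
# A minimal sketch (hypothetical data): one-way ANOVA across three groups.
from scipy import stats

group_a = [5.1, 4.9, 5.6, 5.0, 5.4]
group_b = [5.8, 6.1, 5.9, 6.3, 6.0]
group_c = [5.2, 5.5, 5.3, 5.7, 5.6]

# Check the equal-variance assumption with Bartlett's test.
bart_stat, bart_p = stats.bartlett(group_a, group_b, group_c)
print(f"Bartlett: p = {bart_p:.4f}")

# Compare the three group means.
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"One-way ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")

# Ordinal-scale data: compare medians with Kruskal-Wallis instead.
h_stat, kw_p = stats.kruskal(group_a, group_b, group_c)
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {kw_p:.4f}")
```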
