How to Construct and Design a Scale: Techniques and Guidelines
7/10/2015
Guidelines in Scale Development Psychological and Psychometric Testing
Scale Construction
Session 6 Prof. Swati Dhir
Step 1 • Determine clearly what it is you want to measure
Step 2 • Generate an item pool
Step 3 • Determine the format for measurement
Step 4 • Have the initial item pool reviewed by experts
Step 5 • Consider inclusion of validation items
Step 6 • Administer items to a development sample
Step 7 • Evaluate the items
Step 8 • Optimize scale length
Constructs and Measurement
Purpose: to design a questionnaire that provides a quantitative measurement of an abstract theoretical variable, i.e., figuring out how to measure what you want to measure
• Should the scale be based in theory, or should you strike out in a new intellectual direction?
• Should some aspects of the phenomenon be emphasized more than others?
• Not all surveys are scales; decide whether yours really is a scale
• Good scales possess both validity and reliability
Theory as an aid to clarity
Specificity as an aid to clarity
Construct development: a construct is a hypothetical variable composed of different elements that are thought to be related (e.g., 5 questions tapping job satisfaction)
Boundaries of the phenomenon must be recognized so that the content of the scale does not drift into unintended domains
Locus of control (LOC) is a widely used concept that concerns who or what influences important outcomes in our lives
Creating Items
• Writing good items for a scale is definitely an art rather than a science
• Think creatively about the construct you seek to measure
• Make the questions simple, specific and straightforward
• Avoid biased language (emotional words, emphasized text)
Multidimensional LOC
• LOC dimensions: oneself, powerful others, and chance or fate
• Depends largely on what level of locus the questions relate to
• Avoid double-barreled questions, e.g., "Do you think that the technical service department is prompt and helpful?" (prompt and helpful are two different attributes)
• Avoid nonmonotonic questions, e.g., "Only people in the military should be allowed to personally own assault rifles." (respondents at opposite ends of the attitude continuum can give the same answer for different reasons)
What to Include in a Measure
• Items that cross over into a related construct can be problematic
Creating Items
• Redundancy: reliability is a function of the number of items; e.g., "I will do almost anything to ensure my child's success" and "No sacrifice is too great if it helps my child achieve success" tap the same content in different words
• Generate about twice as many items as the final scale will need (a 2:1 ratio)
• Avoid exceptionally lengthy items, and watch the reading difficulty level
• Reverse-code a number of your items: reversed score = (highest value + lowest value) − selected response (see the sketch below)
• Items should share a common structure, be self-contained, and have no dependency between items
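A minimal sketch of the reverse-coding formula, assuming a 1-to-7 response format; the function name is illustrative:

```python
# Reverse coding per the slide's formula:
# reversed = highest value + lowest value - selected response

def reverse_code(response: int, low: int = 1, high: int = 7) -> int:
    """Reverse-code a single Likert response."""
    return high + low - response

# A respondent who answered 2 on a reverse-worded item scores 6 after recoding.
assert reverse_code(2) == 6
assert reverse_code(7) == 1
```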
Three Components of Attitudes
• Cognitive component: how a person thinks about an attitude object (product, issue, candidate, idea)
• Affective component: how a person feels about an attitude object
• Behavioral component: a person's behavioral predisposition to respond to an attitude object in a certain way
On the Importance of Attitudes
• Example item: "I believe both candidates bring strengths to the table."
Measurement
• The term questionnaire item denotes a single question on a survey, corresponding to a single column in a dataset
• Scales typically denote sets of questions that are combined mathematically from survey items
Measurement/Scaling Properties
• Assignment: you can assign objects to categories
• Order (magnitude): you can order objects in terms of having more or less of some quality
• Distance (equal intervals): the distance between adjacent points on the scale is identical
• Origin (absolute zero point): zero "means something" (the absence of a given quality)
Types of Scales
• Nominal: has assignment only (What is your gender?)
• Ordinal: has assignment and order (e.g., education; What is your income? 5-10k; 11-15k; 16-20k; 21-25k; 26-30k)
• Interval: has assignment, order and equal intervals (temperature)
• Hybrid ordinally-interval: like an ordinal scale, but the researcher "pretends" it is an interval scale (e.g., assumes a 1-to-7 scale is an interval scale); commonly used in questionnaires
• Ratio: has assignment, order, equal intervals and an absolute zero (number of cars, weight)
Formats for Measurement
• Thurstone scaling: different intensities of the attribute, spaced to represent equal intervals; could be formatted with an agree-disagree response option
• Guttman scaling: a series of items tapping progressively higher levels of an attribute (Do you smoke? Do you smoke more than 10 cigarettes a day? Do you smoke more than a pack?)
• Semantic differential: a list of adjective pairs, either unipolar or bipolar (e.g., friendly/not friendly; friendly/hostile)
• Likert scale: the item is prepared as a declarative sentence, followed by response options indicating varying degrees of agreement; widely used in measuring opinions, beliefs and attitudes
Issues in Designing Verbal Rating Scales
• Many measures taken by researchers are verbal ratings
• What do we need to consider when we develop verbal rating scales? Number of categories; forced vs. unforced scale; balanced vs. unbalanced scale; extent of verbal description; whether response categories should be numbered; comparative vs. noncomparative scale; scale direction
Number of Response Categories?
• Example: To what extent are you satisfied with your current laptop?
• Most researchers suggest between 5 and 7 categories; for example:
  1 Extremely Dissatisfied / 2 Dissatisfied / 3 Somewhat Dissatisfied / 4 Neither / 5 Somewhat Satisfied / 6 Satisfied / 7 Extremely Satisfied
• Too few categories do not give you enough information; too many and it will be hard for people to discriminate between the options (e.g., a 100-point scale)
Forced vs. Unforced Scale?
• Example: How likely would you be to buy a car manufactured in Brazil?
• Forced scale (an even number of options forces the respondent to lean one way or the other):
  1 Very Unlikely / 2 Unlikely / 3 Somewhat Unlikely / 4 Somewhat Likely / 5 Likely / 6 Very Likely
• Unforced scale gives people a neutral option:
  1 Very Unlikely / 2 Unlikely / 3 Somewhat Unlikely / 4 Neither / 5 Somewhat Likely / 6 Likely / 7 Very Likely
Balanced vs. Unbalanced Scale?
• Example: How satisfied are you with your current hair stylist?
• Balanced scale (same number of positive and negative options):
  1 Extremely Dissatisfied / 2 Dissatisfied / 3 Somewhat Dissatisfied / 4 Neither / 5 Somewhat Satisfied / 6 Satisfied / 7 Extremely Satisfied
• Unbalanced scale (here all options are positive):
  1 Somewhat Satisfied / 2 / 3 / 4 / 5 / 6 / 7 Very Satisfied
• An unbalanced scale can give biased results; unless the distribution is naturally skewed to one side of the scale, a balanced scale should be used
Extent of Verbal Description?
• Example: India should invest in infrastructure.
• Label endpoints only, or label all options?
  Endpoints only: 1 Strongly Disagree / 2 / 3 / 4 / 5 / 6 / 7 Strongly Agree
  All labeled: 1 Strongly Disagree / 2 Moderately Disagree / 3 Slightly Disagree / 4 Neither Agree nor Disagree / 5 Slightly Agree / 6 Moderately Agree / 7 Strongly Agree
• Labeling all options can aid in interpretation
Should Categories be Numbered?
• Example: Toyota is an environment-friendly company.
  Strongly Disagree / Moderately Disagree / Slightly Disagree / Neither Agree nor Disagree / Slightly Agree / Moderately Agree / Strongly Agree
  numbered either 1 / 2 / 3 / 4 / 5 / 6 / 7 or -3 / -2 / -1 / 0 / 1 / 2 / 3
• Should we have numbers here? Numbers can help respondents understand the scale
• A 1-to-7 scale is quite common, but -3 to +3 can help interpretation of the scale (disagree is negative, agree is positive); however, it may overemphasize negativity
Comparative vs. Noncomparative?
• Noncomparative question: How would you evaluate Pepsodent toothpaste?
• Comparative question: Compared to your current brand, how would you evaluate Pepsodent toothpaste?
• Comparative questions establish the referent and can be useful if you need to know how your product compares to a specific competitor or the customer's current brand
• Noncomparative questions have the advantage of allowing respondents to create their own referent, which can potentially improve accuracy
• It is a judgment call; pretesting both scales can help identify problems
Direction of Scale?
• Typical direction (lower values, with the negative connotation, on the left):
  Strongly Disagree 1 / Moderately Disagree 2 / Slightly Disagree 3 / Neither Agree nor Disagree 4 / Slightly Agree 5 / Moderately Agree 6 / Strongly Agree 7
• Some scales are not valenced, so you must be careful about positioning. For a semantic differential scale with amusing positioning:
  Unpleasant  -2  -1  0  1  2  Pleasant
  Flimsy      -2  -1  0  1  2  Sturdy
  Male        -2  -1  0  1  2  Female
• This arrangement suggests that males are to be evaluated negatively; scales must be designed carefully so as not to bias results
Are Single Items Adequate for Measurement?
• Suppose an instructor gave single-question exams?
• Suppose the CAT (or GMAT) had only 5 possible scores (similar to A, B, C, D, F grades)?
Composite, or Multiple-Item Scales
• Capture the sensitivity to the continuous nature of many subtle differences among consumers
• Simultaneously address concerns of accuracy and consistency
Formative and Reflective Items
• Formative items can be combined to measure the multiple aspects of a construct; it is not necessary that respondents answer each item similarly
• Reflective items measure a single trait, and respondents should answer each item similarly
• Both relate to the larger issue of measurement error; items within a scale are typically interchangeable for reflective items but not for formative items
Formative Scale Items: Satisfaction (with airline "JA")
• Timeliness: "My last flight on JA departed on-time." "An airline could always be on-time if they made that their priority."
• Pricing: "JA has competitive fares." "It upsets me to know others on the same flight have paid a lower price for their seat."
• Staff: "JA ticketing personnel are polite." "JA has friendly reservation operators."
• Service: "I know it's not the airline's fault when a flight is cancelled." "The two-item restriction on carry-on luggage is insensitive to the needs of today's passengers." "JA did not lose my luggage on my last trip." "I have not been 'bumped' from a JA flight in the last two years."
• Travelling Comfort: "JA has ample leg-room for me in coach seating."
Reflective Items: Materialism
• "I admire people who own expensive homes, cars, and clothes."
• "I don't place much emphasis on the amount of material objects people own as a sign of success."*
• "Some of the most important achievements in life include acquiring material possessions."
• "The things I own say a lot about how well I'm doing in life."
• "I don't pay much attention to the material objects other people own."*
• * Reverse coded
Reviewed by Experts
• Ask a panel of experts to rate how relevant they think each item is to what you intend to measure
• Provide the experts with the working definition of the construct
• Experts can evaluate the items' clarity and conciseness (e.g., by rating relevance as high, moderate or low)
• Experts can also point out ways of tapping the phenomenon that you have failed to include
Consider Inclusion of Validation Items
• Social desirability: consider including a social desirability scale (Strahan and Gerbasi, 1972)
• Undesirable response tendencies and response biases can also be detected with the validity scales of the MMPI (Minnesota Multiphasic Personality Inventory)
Validity and Reliability
• Internal validity (no confounds); external validity (generalizes to your target population)
• Content-related evidence: face validity
• Criterion-related evidence: predictive validity, concurrent validity
• Construct-related evidence: convergent validity, discriminant validity
• Reliability: test-retest method, alternate-forms method and split-halves method (a split-halves sketch follows)
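As a rough illustration of the split-halves method, the sketch below splits a hypothetical item-score matrix into odd and even halves, correlates the half totals, and applies the Spearman-Brown correction; the data are invented:

```python
import numpy as np

# Hypothetical respondents x items matrix (5 respondents, 6 items)
items = np.array([
    [5, 4, 5, 4, 4, 5],
    [2, 3, 2, 2, 3, 2],
    [4, 4, 3, 4, 4, 4],
    [1, 2, 1, 2, 1, 1],
    [3, 3, 4, 3, 3, 4],
])

odd_half = items[:, 0::2].sum(axis=1)   # totals over items 1, 3, 5
even_half = items[:, 1::2].sum(axis=1)  # totals over items 2, 4, 6
r_halves = np.corrcoef(odd_half, even_half)[0, 1]

# Spearman-Brown: reliability of the full-length scale from the half-scale correlation
split_half_reliability = 2 * r_halves / (1 + r_halves)
print(round(split_half_reliability, 3))
```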
Administer Items to a Development Sample
• Administer the validation items along with the pool of new items to a sample of subjects
• The subject sample should be large enough to eliminate subject variance as a significant concern
• If a single scale is to be extracted from a pool of about 20 items, fewer than 300 subjects may suffice
• Entering the data: use computer software such as www.surveymonkey.com or http://www.qualtrics.com
Evaluate the Items
Why a large sample?
• In a small sample, patterns of covariation among the items may not be stable
• The development sample may not represent the population for which the scale is intended: consider the level of the attribute present in the sample vs. the intended population, and whether the sample is qualitatively rather than quantitatively different from the target population (the relationships among items or constructs may differ from those in the population)
• An item should have a high correlation with the true score of the latent variable; inspect the correlation matrix: the higher the correlations among items, the higher the individual item reliabilities
• Apply reverse scoring before computing correlations
• Item-scale correlation: although an uncorrected item-total correlation makes good conceptual sense, in reality the item's inclusion in the scale inflates the correlation coefficient
• Item variance: a valuable attribute for a scale item is relatively high variance
• Item means: a mean close to the center of the range of possible scores is also desirable; otherwise the item might fail to detect certain values of the construct
• Coefficient alpha: an indication of the proportion of variance in the scale scores that is attributable to the true score; a non-central mean, poor variability, negative correlations among items, low item-scale correlations and weak inter-item correlations will all tend to reduce alpha (a computational sketch follows)
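A minimal sketch of coefficient (Cronbach's) alpha computed from an invented respondents-by-items score matrix, using the standard formula based on item and total variances:

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Coefficient alpha from a respondents x items score matrix."""
    k = scores.shape[1]                         # number of items
    item_vars = scores.var(axis=0, ddof=1)      # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of the scale total
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

scores = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 3, 4],
    [1, 2, 1, 2],
])
print(round(cronbach_alpha(scores), 3))
```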
Optimize Scale Length
• Effect of scale length on reliability: alpha depends on the covariation among the items and the number of items; if a scale's reliability is too low, brevity is of no value
• Effect of dropping bad items: if an item has a sufficiently lower-than-average correlation with the other items, dropping it will raise alpha
• Tinkering with scale length: the item whose omission has the least negative (or most positive) effect on alpha is the best one to drop first (see the sketch below)
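A sketch of the tinkering step, using an invented score matrix in which the fourth item deliberately correlates poorly with the rest; each item is dropped in turn and alpha is recomputed:

```python
import numpy as np

def cronbach_alpha(s: np.ndarray) -> float:
    k = s.shape[1]
    return (k / (k - 1)) * (1 - s.var(axis=0, ddof=1).sum() / s.sum(axis=1).var(ddof=1))

scores = np.array([
    [4, 5, 4, 1],
    [2, 2, 3, 5],
    [5, 4, 5, 2],
    [3, 3, 3, 3],
    [1, 2, 1, 4],
])  # the fourth item runs against the other three

# Alpha recomputed with each item deleted in turn; the item whose removal
# yields the highest alpha is the best candidate to drop first.
alphas = [cronbach_alpha(np.delete(scores, i, axis=1)) for i in range(scores.shape[1])]
print([round(a, 3) for a in alphas], "drop item index:", int(np.argmax(alphas)))
```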
• Split samples: if the development sample is sufficiently large, split it into two subsamples; one can serve as the primary development sample and the other can be used to cross-validate the findings. Splitting provides valuable information about scale stability
Psychological and Psychometric Testing
Session 8: Item Analysis Prof. Swati Dhir
[email protected]
Item Analysis - Outline
"In constructing a new test (or shortening or lengthening an existing one), the final set of items is usually identified through a process known as item analysis."
—Linda Crocker
Both the validity and the reliability of any test depend ultimately on the characteristics of its items.
1. Types of test items: selected response items; constructed response items
2. Parts of test items
3. Guidelines for writing test items
4. Item analysis: distracter measures; item difficulty measures; item discrimination measures
1. Types of test items
Selected response • Multiple choice • Likert scale • Q-sort
Constructed response • Free response • Fill-in-the-blank • Essay tests • Portfolios • In-basket technique
A. Selected response • Task is to choose between set answers • Multiple choice or forced choice • Advantage: Ease of scoring & scoring requires little skill • Disadvantage: may test memory rather than comprehension • Correct response must be distinct • Distracters should not be obvious or ambiguous
Likert format
• Test-taker chooses a point on a scale that expresses their attitude or belief
• Data lend themselves to factor analysis
Q-sort
• A large set of cards, each with a statement referring to a "target"
• Test-taker sorts the cards into piles in terms of how accurate the statements are as descriptions of the target
• Generally 9 piles
B. Constructed response items
Free response
• Test-taker responds without constraint; describes what is important to him/her
Fill-in-the-blank
• Used to test for knowledge or to find out about beliefs and attitudes
Essay tests
• Preferred when you want to assess the test-taker's ability to think analytically, integrate ideas, and express himself/herself
Portfolios
• Not really a test; collections of things the person being evaluated has produced
In-basket technique
• Used in business; a job candidate gets a set of "everyday" problems and says how he or she would deal with them; requires expert raters to grade responses
Strengths
• Assess higher-order skills • More useful feedback to the test-taker • Positive influence on study habits • Easier to create items
Weaknesses
• Time-consuming to use • Possible subjectivity in scoring
2. Parts of test items
• Stimulus or item stem: what the subject responds to
• Response format or method: typically multiple choice, Likert or constructed response
• Conditions governing the response: time limits; allowing probes for ambiguous responses; how the response is recorded
• Procedures for scoring the response: particularly important for constructed response items
3. Writing test items: guidelines
A. Define clearly
B. Generate a pool of potential items
C. Monitor reading level
D. Use unitary items
E. Avoid long items
F. Break any response "set"
4. Item analysis
• Multiple choice distracter analysis
• Item difficulty measure P
• Discrimination index D
• Item-total correlation
A. Multiple choice: distracter measures
• How many people choose each distracter?
• Distracters should be equally attractive
• The correct choice should be based on knowledge; where knowledge is lacking, choice should be random
B. Item Difficulty Measure Pi
The item difficulty for item i, pi, is defined as the proportion of examinees who get that item correct. Though this proportion traditionally has been called the item difficulty, it logically should be called item easiness, because the proportion increases as the item becomes easier.
Estimation methods
• Method for dichotomously scored items (difficulty factor)
• Method for polytomously scored items
• Grouping method
Method for Dichotomously Scored Items: Difficulty Factor
P = R / N
where P is the difficulty of a certain item, R is the number of examinees who get that item correct, and N is the total number of examinees.
• Range 0 to 1; the optimal level is .5
• The HIGHER the difficulty factor, the easier the question; a value of 1 means all the students got the question correct, and the item may be too easy
• If you want the subjects to master the topic area, high difficulty values should be expected
Example 1: 80 high school students take a science achievement test; 61 students pass item 1 and 32 students pass item 10. Calculate the difficulty of items 1 and 10 separately. Answer: P1 = 61/80 ≈ 0.76; P10 = 32/80 = 0.40
Guided Practice: What is P for items 1-3 in the response table below?
Student  Raw score  Item 1  Item 2  Item 3  Item 4  Item 5
A        8          a       b       a       d       e
B        6          c       b       e       c       e
C        6          a       c       e       c       b
D        4          a       b       e       a       c
E        2          c       a       b       d       c
F        8          a       b       c       c       e
G        10         a       b       a       c       e
H        6          a       b       c       d       e
I        8          a       c       a       c       e
J        4          a       c       a       d       b
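A sketch of P = R/N applied to this table; the answer key (a, b, a, c, e) is not shown on the slide and is inferred here from the P values reported in the next section:

```python
# Each string holds one student's answers to items 1-5, copied from the table.
responses = {
    "A": "abade", "B": "cbece", "C": "acecb", "D": "abeac", "E": "cabdc",
    "F": "abcce", "G": "abace", "H": "abcde", "I": "acace", "J": "acadb",
}
key = "abace"  # assumed answer key, consistent with the stated P values

n = len(responses)
for i, correct in enumerate(key):
    r = sum(1 for answer in responses.values() if answer[i] == correct)
    print(f"Item {i + 1}: P = {r}/{n} = {r / n:.1f}")
```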
Difficulty Factor: What does it mean?
• Item 1: P = .8, may be too easy
• Item 2: P = .6, good
• Item 3: P = .4, may be slightly difficult
• Item 4: P = .5, optimum
• Item 5: P = .6, good
Method for Polytomously Scored Items
P = X̄ / Xmax
where X̄ is the mean of all examinees' scores on the item and Xmax is the perfect (maximum) score on that item.
Example 2: The perfect score on an open-ended item is 20 points, and the average score of all examinees on this item is 11 points. What is the item difficulty? P = 11/20 = .55
Grouping Method (Use of Extreme Groups) (T. L. Kelley, 1939)
Upper (U) and lower (L) criterion groups are selected from the extremes of the distribution of test scores or job ratings. Then
P = (PU + PL) / 2
where PU is the proportion of examinees in the upper group who get the item correct and PL is the proportion in the lower group who get the item correct.
Example 3: 371 examinees take a language test. Of the 27% upper extreme group, 64 examinees pass item 5; of the 27% lower extreme group, 33 examinees pass the same item. Compute the difficulty of item 5. Answer: each extreme group contains about 100 examinees (371 × 0.27 ≈ 100), so P = (64/100 + 33/100) / 2 ≈ 0.49
Correcting for Chance Effects on Item Difficulty for Multiple-Choice Items
The difficulty of one five-choice item is .50 and the difficulty of another four-choice item is .53. Which item is more difficult?
CP = (KP − 1) / (K − 1)
where CP is the corrected item difficulty, P is the (uncorrected) item difficulty, and K is the number of choices for the item.
Answer:
CP1 = (5 × 0.50 − 1) / (5 − 1) ≈ 0.38
CP2 = (4 × 0.53 − 1) / (4 − 1) ≈ 0.37
So the four-choice item is more difficult.
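A minimal sketch of the chance correction, reproducing the worked example above:

```python
def corrected_difficulty(p: float, k: int) -> float:
    """Corrected item difficulty CP = (K*P - 1) / (K - 1) for a k-choice item."""
    return (k * p - 1) / (k - 1)

print(round(corrected_difficulty(0.50, 5), 2))  # five-choice item -> 0.38
print(round(corrected_difficulty(0.53, 4), 2))  # four-choice item -> 0.37
```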
C. Item Discrimination Measures
Item discrimination refers to the degree to which an item differentiates correctly among test takers in the behavior that the test is designed to measure. Two common measures are the discrimination index D and the item-total correlation.
Discrimination Index D (used for dichotomously scored items)
• Extreme groups method:
  U = number getting the item correct in the 'top' group
  L = number getting the item correct in the 'bottom' group
  nU = number in the top group
  nL = number in the bottom group
D = U/nU − L/nL
To be able to discriminate between different levels of achievement, the difficulty factor should be between .3 and .7
Example: 141 students take a world history test. (1) If we use the 27% ratio to determine the upper and lower groups, how many examinees are in each group? (2) If 18 examinees in the upper group answer item 5 correctly and 6 examinees in the lower group answer it correctly, calculate the discrimination index for item 5. Answer: 141 × 0.27 ≈ 38 in each group; D = 18/38 − 6/38 = 12/38 ≈ 0.32
Values of D may range from -1.00 to 1.00.
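A sketch reproducing the example with the D formula:

```python
def discrimination_index(upper_correct: int, lower_correct: int,
                         n_upper: int, n_lower: int) -> float:
    """D = U/nU - L/nL for dichotomously scored items."""
    return upper_correct / n_upper - lower_correct / n_lower

n_group = round(141 * 0.27)  # 27% extreme groups of 141 students -> 38
print(n_group)
print(round(discrimination_index(18, 6, n_group, n_group), 3))  # ~0.316
```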
Guidelines for Interpretation of D Values
• D ≥ .40: the item is functioning quite satisfactorily
• .30 ≤ D ≤ .39: little or no revision is required
• .20 ≤ D ≤ .29: the item is marginal and needs revision
• D ≤ .19: the item should be eliminated or completely revised
Item-Total Correlation
• Good item: high correlation; people who get the item correct have a high total score on the test, and people who get it wrong have a low total score
• Poor item: low correlation; look at the wording, since the item may be testing reading skill rather than the target construct
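A sketch of the corrected item-total correlation (each item correlated with the sum of the remaining items, so the item does not inflate its own correlation), using an invented 0/1 score matrix:

```python
import numpy as np

# Hypothetical respondents x items matrix of 0/1 item scores
scores = np.array([
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
])

for i in range(scores.shape[1]):
    rest_total = np.delete(scores, i, axis=1).sum(axis=1)  # total without item i
    r = np.corrcoef(scores[:, i], rest_total)[0, 1]
    print(f"Item {i + 1}: corrected item-total r = {r:.2f}")
```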
Choice Analysis
• Whether the examinees who choose the correct option outnumber those who choose the wrong options
• Whether many examinees choose the wrong options
• Whether more examinees in the upper group than in the lower group choose the correct option
• Whether more examinees in the upper group than in the lower group choose a wrong option
• Whether there is any item on which quite a number of examinees make no choice
Psychological and Psychometric Testing
Session 8 & 9 Prof. Swati Dhir
Literature Review (Homework)
Excel Add-ins
• Use the Analysis ToolPak to perform complex data analysis
• If the Data Analysis command is not available: File > Options > Add-Ins > Manage > select Analysis ToolPak (check the box and click OK)
Research Methodology
• Item generation
• Content validation
• Adding some criterion-related constructs
• Context of the study
• Inter-item analysis
• Exploratory factor analysis
• Construct validity (convergent and divergent)
• External validity
• Sampling adequacy
• Reliability
• Criterion validity (predictive and concurrent)
Content Validity
• Rating by experts
• 80% consensus
• Drop an item if the ratings are not consistent
• Items may be reworded
• Command: Analyze_Descriptive Statistics_Crosstabs; select rater 1 as row and rater 2 as column; click Statistics, select Kappa, continue
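As an alternative to the SPSS route above, a sketch of Cohen's kappa with scikit-learn; the two raters' relevance ratings are invented:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical relevance ratings of eight items by two expert raters
rater1 = ["high", "high", "moderate", "low", "high", "moderate", "low", "high"]
rater2 = ["high", "moderate", "moderate", "low", "high", "moderate", "low", "high"]

print(round(cohen_kappa_score(rater1, rater2), 2))
```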
Content Validity: Interpreting Kappa
Kappa might be interpreted as follows (Landis & Koch, 1977):
  Kappa        Interpretation
  < 0.00       Poor
  0.00-0.20    Slight
  0.21-0.40    Fair
  0.41-0.60    Moderate
  0.61-0.80    Substantial
  0.81-1.00    Almost perfect
Data Entry
• Files export
• Variable view
• Missing values (Analyze_Missing Value)
• Data cleaning
• Descriptive statistics (DS): Frequency (Analyze_DS_Frequency)
Exploratory Factor Analysis
• Factor loading > 0.5
• The square of a factor loading is the percentage of variation in the criterion we can know from the test scores
• Command: Analyze_Dimension Reduction_Factor
Eigenvalues ≥ 1 (Kaiser Rule)
• For any matrix of correlations, it is possible to compute a set of numerical values called eigenvalues
• They reflect the variance accounted for by the principal components, with the first value reflecting the variance explained by the strongest component, the second value the variance explained by the second-strongest component, and so on
• The most widely used of all factor-number rules
Scree Test
• Involves constructing a graph in which the eigenvalues from the matrix are plotted in descending order
• The graph is then examined to determine the number of eigenvalues that precedes the last major drop
• [Figure: example of a scree plot]
• Limitations: there is no clear definition of what constitutes a major drop, and sometimes the data produce a gradually decreasing slope with no major break points; still, the scree test has been found to function reasonably well in cases where strong principal components are present
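A sketch of the eigenvalues-greater-than-one count on simulated item data (the data generation is an assumption for illustration); sorting the eigenvalues in descending order gives the values a scree plot would display:

```python
import numpy as np

rng = np.random.default_rng(0)
factor = rng.normal(size=(200, 1))
data = factor + 0.5 * rng.normal(size=(200, 6))  # six items loading on one factor

corr = np.corrcoef(data, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]  # descending, as on a scree plot

print(np.round(eigenvalues, 2))
print("components with eigenvalue >= 1:", int((eigenvalues >= 1).sum()))
```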
External Validity
• Means and medians should not be very different
• Skewness: a measure of symmetry, or more precisely the lack of symmetry
Sampling Adequacy
• Kaiser-Meyer-Olkin (KMO) measure: checks the case-to-variable ratio for the analysis; range 0 to 1, acceptance limit > 0.6
• Bartlett's test of sphericity: shows the validity and suitability of the responses collected for the study; should be significant at 0.05 (with a 95% confidence limit)
• Command: Analyze_Dimension Reduction_Factor (a Python sketch follows)
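A sketch of KMO and Bartlett's test using the third-party factor_analyzer package (an assumption: it must be installed separately, e.g., via pip); the simulated data are illustrative:

```python
import numpy as np
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

rng = np.random.default_rng(1)
data = rng.normal(size=(200, 1)) + 0.5 * rng.normal(size=(200, 6))  # hypothetical item data

chi_square, p_value = calculate_bartlett_sphericity(data)
kmo_per_item, kmo_total = calculate_kmo(data)

print("KMO:", round(kmo_total, 2), "(acceptance limit > 0.6)")
print("Bartlett chi-square:", round(chi_square, 1), "p =", p_value)
```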
Internal Consistency: Reliability
• Command: Analyze_Scale_Reliability Analysis_Alpha
Criterion-Related Evidence
• Predictive validity: examine R square, the beta value and the significance level
• Intercorrelations among all the factors
• Command: Analyze_Regression_Linear_DV and IV (a sketch follows)
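A sketch of a predictive-validity check with statsmodels, regressing an invented criterion on invented scale scores to obtain R square, the beta value and the significance level:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
scale_score = rng.normal(size=100)                    # IV: scores on the new scale
criterion = 0.6 * scale_score + rng.normal(size=100)  # DV: outcome the scale should predict

model = sm.OLS(criterion, sm.add_constant(scale_score)).fit()
print(round(model.rsquared, 3))   # R square
print(round(model.params[1], 3))  # beta (slope) for the scale score
print(model.pvalues[1])           # significance level
```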