Psychological Assessment
Notes from the review center that I gathered. Unfinished tho...
PSYCHOLOGICAL ASSESSMENT
Conceptual Paradigm for Measurement and Evaluation

Measurement – samples of behavior
- Mental Abilities
- Personality
- Scales (IRON): Interval, Ratio, Ordinal, Nominal
- Test – a single measure
- Battery – a series of tests

Assessment – various techniques (DITO)
- Documents
- Interview
- Test
- Observation

Evaluation (RAP)
- Recommendation
- Action Plan
- Program Development
- Diagnosis – classification, severity
- Prognosis – predicting the development of the disorder

Areas measured:
- Psychopathology
- Personality – traits, states, types (e.g., MBTI)
- Mental Abilities – general intelligence (g): IQ; specific intelligence (s): non-verbal IQ; multiple intelligence; aptitude; interest; values
Measurement Scales (IRON)
- Interval: temperature, time, IQ – has no absolute zero
- Ratio: weight, height – has an absolute zero
- Ordinal: rank, positions, Likert scale, birth order
- Nominal: sex, civil status – classifying

*Absolute zero: weight has an absolute zero – there can be 0 (no) weight; temperature has no absolute zero – 0 degrees does not mean "no temperature".

Parametric statistics: normal distribution of scores (e.g., Pearson's r)
Non-parametric statistics: abnormal distribution of scores (e.g., Spearman's rho; chi-square for nominal data)

Normal distribution of scores – the mean, median, and mode are all the same (measures of central tendency)
Abnormal distribution of scores – skewed
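A minimal sketch, using made-up score data, of the parametric vs. non-parametric statistics mentioned above: Pearson's r for interval/ratio scores and Spearman's rho for ordinal (ranked) data. The variables and values are hypothetical.

```python
# Pearson's r (parametric) vs. Spearman's rho (non-parametric) on hypothetical data.
from scipy import stats

iq_scores   = [98, 105, 110, 92, 120, 101, 88, 115]   # interval scale (no absolute zero)
exam_scores = [75, 82, 88, 70, 95, 80, 65, 90]        # interval scale
class_rank  = [5, 3, 2, 7, 1, 4, 8, 6]                # ordinal scale

r, p_r = stats.pearsonr(iq_scores, exam_scores)        # parametric correlation
rho, p_rho = stats.spearmanr(class_rank, exam_scores)  # rank-based correlation

print(f"Pearson r = {r:.2f} (p = {p_r:.3f})")
print(f"Spearman rho = {rho:.2f} (p = {p_rho:.3f})")
```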
Psychological Tests

Objective Tests
- Standardized test administration, scoring, and interpretation of test scores
- Limited number of responses – multiple choice, true or false
- Group tests
- Norms: norm-referenced tests (NRT) and criterion-referenced tests (CRT)

Projective Tests (WIDU)
- Wishes
- Intrapsychic conflict – conflict between desires and morals
- Desires
- Unconscious motives
- Subjectivity in test interpretation / clinical judgment
- Self-administered / individual tests
- Unlimited number of responses

Norms – where we base the scores of the test takers; transform scores into a meaningful scale
> NRT – e.g., age norms
> CRT – e.g., how would we know if a basketball player is skillful? → sharpshooter; there is a certain criterion to be met
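The note above says norms transform raw scores into a meaningful scale. As one common illustration (not taken from these notes), here is a minimal sketch converting raw scores from a hypothetical norm group into z-scores and T-scores; the norm data are made up.

```python
# Norm-referenced score transformation: raw score -> z-score -> T-score (mean 50, SD 10).
import statistics

norm_sample = [34, 41, 38, 45, 29, 50, 47, 36, 42, 39]   # raw scores from a hypothetical norm group
mean = statistics.mean(norm_sample)
sd = statistics.stdev(norm_sample)

def to_z(raw):
    """How many SDs a raw score sits above or below the norm-group mean."""
    return (raw - mean) / sd

def to_t(raw):
    """z rescaled to a T-score: mean 50, SD 10 (a common norm-referenced scale)."""
    return 50 + 10 * to_z(raw)

examinee_raw = 44
print(f"z = {to_z(examinee_raw):.2f}, T = {to_t(examinee_raw):.1f}")
```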
Medium of Psychological Tests
- Paper and pencil
- Objects: wooden blocks, puzzles
- Machine: e.g., galvanic skin response, EEG, CT scan
- Computer

Battery of Tests – sets of tests
Psychological Tests

Ability Tests
- Intelligence Tests
  - Verbal intelligence and non-verbal intelligence
  - Ex. Wechsler Adult Intelligence Scale, Stanford-Binet Intelligence Scale, Culture Fair Intelligence Test
- Achievement Tests (what has been learned?)
  - Measure the extent of one's knowledge in various academic subjects
  - Ex. Stanford Achievement Test in reading
- Aptitude Tests (predicting)
  - Various skills / competencies; results are integrated into a single score interpretation
  - Ex. Differential Aptitude Test

Personality Tests
- Objective
- Traits / domains or factors
- Usually, no right or wrong answers
- Ex. Myers-Briggs Type Indicator (MBTI)
Assessment Techniques (DITO)
- Documents – records, protocols, collateral reports
- Interview – interview responses; initial assessment > verification; forms: written, verbal, visual
- Tests – screening
- Observation – behavioral observation; observation checklist

Evaluation – Recommendation, Action Plan, Program Development
- Summarizing the results of assessment

Tests are used to SSCCRREEN:
- Screen applicants
- Self-understanding
- Classify people
- Counsel individuals
- Retain, dismiss, or promote employees
- Research for programs, test construction
- Evaluate performance for decision-making
- Examine and gauge abilities
- Need for diagnosis and intervention
VALIDITY – measures what it purports to measure
Content Validity
- Degree to which the test represents the essence, the topics, and the areas that the test is designed to measure (the appropriate domain)
- Primary concern of test developers, because it is the content of the items that reflects the "whatness" of the property intended to be measured
- Ex. achievement, aptitude, personality tests
- Table of Specification (TOS, the "blueprint") – generate items → checked/validated by at least 3 experts, a.k.a. "raters"
  - Example domains for a Depression test: Suicidal Ideation, Self-harm

Procedures to achieve a high degree of content validity
1. Pre-survey or Review of Related Literature
   - Focus on the theoretical constructs related to the test you are planning to make: tests already used, the purpose of those tests, areas covered, format, scaling techniques, etc.
   - This may start the development phase of the instrument you are to construct.
   - Item analysis – focuses on the items themselves; used for ability and aptitude tests (tests that have right and wrong answers)
   - Factor analysis – focuses on the domains (whether a factor really is a factor); used for personality tests; uses Cronbach's alpha
   - Empirical research
2. Development of the Table of Specification (TOS)
   - Determining the areas or concepts that will represent the nature of the variable being measured, and the relative emphasis of each area, is essentially judgmental
   - A detailed TOS includes areas/concepts, objectives, and the number of items in each area
3. Consultation with Experts (raters)
   - After making your own judgments, consult your thesis adviser or someone who has the expertise to judge the representativeness/relevance of the entries in your TOS
4. Item Writing
   - At this stage, you should know what type of items you are supposed to construct: the type of instrument, format, scaling, and scoring techniques
   - Every test item is based on the creative talent of the item writer and on his or her background in the test content

Construct Validity
- Theoretical domains, factors/components
- Personality
1. Convergent validity – direct (positive) correlation between variables measuring the same construct (X↑ Y↑), e.g., a new Optimism scale (X) correlating with an established Optimism measure (Y)
   - The measure correlates well with other tests believed to measure the same construct
2. Divergent (discriminant) validity – demonstrates that a test measures something different from what other available tests measure, e.g., Optimism (X) versus Pessimism (Y): X↑ Y↓
   - The test should show low correlations with measures of different constructs – evidence for what the test does not measure
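A minimal sketch, with made-up scale scores, of how convergent and divergent (discriminant) validity checks can be run as simple correlations; the variable names and values are hypothetical.

```python
# Convergent vs. divergent validity as correlations on hypothetical data.
from scipy import stats

optimism_x  = [12, 18, 15, 20, 9, 14, 17, 11]   # new optimism scale (hypothetical)
optimism_y  = [13, 19, 14, 21, 10, 15, 16, 12]  # established optimism measure
pessimism_y = [18, 10, 14, 8, 22, 15, 11, 19]   # pessimism measure

conv, _ = stats.pearsonr(optimism_x, optimism_y)   # expect a high positive r (convergent)
disc, _ = stats.pearsonr(optimism_x, pessimism_y)  # expect a low/negative r (divergent)

print(f"convergent r = {conv:.2f}, divergent r = {disc:.2f}")
```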
Criterion-related Validity is estimated by correlating a subject's score on a test with his or her behavior on an independent, real-life criterion. If the criterion you need to assess and correlate is occurring now, you are assessing concurrent validity. If the criterion is to occur in the future, you are assessing predictive validity.

Construct Validity (a.k.a. true validity) is the extent to which there is evidence that a test measures a particular hypothetical construct. For example, are we really measuring intelligence with an IQ test when there are so many competing theories about what intelligence actually is?

Coefficient value – estimate value
Variability – margin of error (because we are human beings)
Unsystematic error can result from varied assessment implementation, e.g., scoring via raters.

RELIABILITY – consistency
Reliability theory suggests that the scores you gather on psychological tests are not in fact true or real scores; rather, those scores represent a combination of many factors.

Observed Test Score = True Score + Measurement Error
X = T + e

In theory, the reliability coefficient (rxx) gives us an index of the influence of true scores and error scores on any given test. It is the ratio of the true-score variance to the total variance of the test. In practice, rxx is computed very much like a correlation (r); the two identical subscripts tell us that this r represents a test correlated with itself.
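A minimal simulation, with assumed true-score and error spreads, illustrating that rxx works out to the ratio of true-score variance to observed-score variance under X = T + e.

```python
# Simulate X = T + e and check r_xx = var(T) / var(X); spreads are assumed values.
import random

random.seed(0)
true_scores = [random.gauss(100, 15) for _ in range(10_000)]   # T
errors      = [random.gauss(0, 5) for _ in range(10_000)]      # e
observed    = [t + e for t, e in zip(true_scores, errors)]     # X = T + e

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

r_xx = variance(true_scores) / variance(observed)
print(f"r_xx ≈ {r_xx:.2f}")   # ≈ 15**2 / (15**2 + 5**2) = 0.90
```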
Models / Types of Reliability (the type depends on what test you are going to measure)

1. Test-Retest Reliability – Pearson's r
   - Give the same test to the same group of test takers on 2 different occasions
   - Scores on the 1st administration are compared to scores on the 2nd administration using r
   - Interval of 15 days to a month: too early – familiarity (carryover effect); too long – maturation
   - Often considered a better measure of temporal stability (consistency of test scores over time)
   - Assumption: people do not change between the 2 administrations
   - PROBLEM: practice or carryover effects (beneficial to the test takers)

2. Alternate Forms Reliability – r
   - To eliminate practice effects and other problems with the test-retest method (i.e., reactivity), test developers often give 2 highly similar forms of the test to the same people at different times.
   - Reliability, in this case, is again assessed by correlating the scores obtained at the different times.
   - The alternate form must be equivalent in content, response format, and statistical characteristics (equivalent; same difficulty).
   - PROBLEM: the difficulty of developing another form of the test

3. Split-Half Reliability – Spearman-Brown prophecy formula

   rxx = k*r / (1 + (k - 1)*r)

   where:
   rxx – reliability coefficient of the full-length test
   r – correlation between the two halves (coefficient)
   k – factor by which the test length is changed (k = 2 when projecting a half-test to full length)

   - Measures the internal consistency of the test
   - Eliminates/reduces the following problems:
     1. The need for 2 administrations of a test
     2. The difficulty of developing another form
     3. Carryover or reactivity effects
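A minimal sketch, using made-up dichotomous item responses, of an odd-even split-half correlation corrected with the Spearman-Brown formula above.

```python
# Split-half reliability with the Spearman-Brown correction on hypothetical responses.
from scipy import stats

# rows = examinees, columns = 6 dichotomously scored items (made up)
responses = [
    [1, 1, 0, 1, 1, 0],
    [1, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0],
    [1, 1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1, 1],
]

odd_half  = [sum(row[0::2]) for row in responses]   # items 1, 3, 5
even_half = [sum(row[1::2]) for row in responses]   # items 2, 4, 6

r_halves, _ = stats.pearsonr(odd_half, even_half)   # correlation between the two halves
k = 2                                               # half-test projected to full length
r_xx = (k * r_halves) / (1 + (k - 1) * r_halves)    # Spearman-Brown prophecy formula

print(f"half-test r = {r_halves:.2f}, corrected r_xx = {r_xx:.2f}")
```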
Internal consistency estimates:
1. KR-20 (Kuder & Richardson, 1937) – for tests whose items can be scored either 0 or 1 (binary/dichotomous)
2. Coefficient alpha (Cronbach, 1951) – for rating scales that have 2 or more possible answers; conceptually, every item is compared with every other item

Problem: whether the test being split is homogeneous (i.e., measuring one characteristic) or heterogeneous (i.e., measuring many characteristics).

Split-half reliability is closely related to internal consistency: the two halves of the test are correlated with each other.
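A minimal sketch, with made-up Likert responses, of the usual coefficient alpha computation; for 0/1 items the same formula reduces to KR-20. The data are hypothetical.

```python
# Cronbach's alpha: k/(k-1) * (1 - sum(item variances) / variance(total scores)).
import statistics

# rows = respondents, columns = 4 Likert items (made up)
data = [
    [4, 3, 4, 5],
    [2, 2, 3, 2],
    [5, 4, 4, 5],
    [3, 3, 2, 3],
    [4, 5, 4, 4],
]

k = len(data[0])
items = list(zip(*data))                               # one tuple of responses per item
item_vars = [statistics.variance(col) for col in items]
total_scores = [sum(row) for row in data]
alpha = (k / (k - 1)) * (1 - sum(item_vars) / statistics.variance(total_scores))

print(f"Cronbach's alpha ≈ {alpha:.2f}")
```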
4. Scorer Reliability (inter-rater reliability) – judgments or ratings made by different scorers are compared, often using correlation, to see how much they agree.

If tests are being used to make important final decisions about people, then the reliability of the test should be high (around 0.95). Lower reliability levels may be acceptable when making preliminary decisions, sorting people into groups, conducting research, etc.
Standard Error of Measurement (SEM)
- An index of measurement inconsistency, or the amount of expected error in an individual score (i.e., how much the score is likely to differ from the true score)
- Computed from the standard deviation of the test scores and the test's reliability

Standard deviation (of scores): high – heterogeneous (more spread); low – homogeneous (less spread)
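A small worked example with assumed values; SEM is commonly computed from the standard deviation and the reliability as SD * sqrt(1 - rxx).

```python
# SEM = SD * sqrt(1 - r_xx), with illustrative (assumed) values.
import math

sd = 15      # standard deviation of the test scores (IQ-style scale)
r_xx = 0.91  # reliability coefficient (assumed)

sem = sd * math.sqrt(1 - r_xx)
print(f"SEM = {sem:.1f}")   # ≈ 4.5; roughly ±1 SEM gives a 68% band around an observed score
```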
Factors that can affect reliability
1. Errors that can increase or decrease an individual score: the test itself, the test administrator, the test scoring, the test taker
2. Test length – as a rule, adding more homogeneous items will increase the reliability of the test
3. Method used to estimate reliability – split-half methods yield higher reliability estimates than test-retest or alternate-forms methods

Psychometric properties:
- reliability (consistency)
- validity (measures what it intends to measure)
- norming
- standardization

The goal is to increase the probability of getting the true score and to minimize the standard error of measurement. A test score is composed of the observed score (actual score), the true score (a reflection of what you really know), and the error score (the difference between the true score and the actual score).
Observed Score = True Score + Error Score

Trait score – sources of error that reside within the individual taking the test (excuses: hungry, headache, unprepared, etc.)
Method score – sources of error that reside in the testing situation (lousy instructions, too warm/cold room, missing pages, etc.)

Reliability = True Score / (True Score + Error Score)

Interrater reliability = Number of agreements / (Number of agreements + Number of disagreements)

The smaller the error, the higher the reliability (error ↓, reliability ↑).
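A minimal sketch, with made-up ratings from two hypothetical raters, of the interrater agreement formula above.

```python
# Percent agreement between two raters: agreements / (agreements + disagreements).
rater_a = ["yes", "no", "yes", "yes", "no", "yes", "no", "yes"]
rater_b = ["yes", "no", "no",  "yes", "no", "yes", "yes", "yes"]

agreements = sum(a == b for a, b in zip(rater_a, rater_b))
disagreements = len(rater_a) - agreements
interrater = agreements / (agreements + disagreements)

print(f"interrater reliability = {interrater:.2f}")   # 6/8 = 0.75
```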
Stability – the same results are obtained over repeated administrations of the instrument
- test-retest reliability
- parallel, equivalent, or alternative forms

Homogeneity – internal consistency (unidimensional)
- item-total correlations; split-half reliability; Kuder-Richardson coefficient; Cronbach's alpha
- Item-total correlations – each item on an instrument is correlated with the total score; an item with a low correlation may be deleted. The highest and lowest correlations are usually reported. Only important if homogeneity of items is desired.
- Kuder-Richardson coefficient – when items have a dichotomous response, e.g., yes/no (binary)
- Cronbach's alpha – Likert scale or linear graphic response format; compares the consistency of responses across all items on the scale (may need to be computed for each sample)

Equivalence – consistency of agreement of observers using the same measure, or among alternative forms of a tool
- parallel or alternate forms (described under stability)
- interrater reliability

TEST CONSTRUCTION (has rudiments, process)

Test Planning
Decision to develop a standardized test: (1) no test exists for a particular purpose, or (2) the tests existing for a certain purpose are not adequate for one reason or another.

Wechsler's idea for the WAIS originated from the Army Alpha (literate soldiers) and Army Beta (illiterate soldiers), which is why there are verbal and performance subtests.

Wechsler – covers both fluid and crystallized intelligence
Culture Fair Intelligence Test – looks into specific intelligence
(The difference between the two lies in how they define intelligence.)
Subject Matter Experts – the test developer must seek the help of experts in evaluating the test items and even the identified constructs or components of the test.

Writing Items – depends on whether the scale is to assess an attitude, content knowledge, ability, or personality traits; stick to one pattern (e.g., don't shift from declarative to interrogative statements).

Guidelines
1. Deal with only one central thought; an item with more than one is called double-barreled.
   Poor item: My instructor grades fairly and quickly.
   Better item: My instructor grades fairly.
2. Be precise.
   Poor item: I received good customer service from Y Company.
   Better item: A member of the sales staff at Y Company asked me if he could assist me within a minute of entering the store.
3. Be brief.
4. Avoid awkward wording or dangling constructs.
   Poor item: Being clear is the overall guiding principle in writing items.
   Better item: The overall guiding principle in writing items is to be clear.
   * Active voice is preferred over passive voice.
5. Avoid irrelevant information.
6. Present items in positive language. If using "not" is unavoidable, italicize or CAPITALIZE it.
7. Avoid double negatives.
8. Avoid terms like "all" and "none".
   Poor item: Which of the following never occurs ...
   Better item: Which of the following is extremely unlikely to occur?
9. Avoid indeterminate terms like "frequently" or "sometimes".
10. Have someone else review your items.
Table of Specifications (blueprint)
- Cognitive Domain – factual knowledge, ideas, and intellectual abilities
- Affective Domain – deals with the values of a learner, including his interests, appreciation, and attitudes
- Psychomotor Domain – readiness for a particular action that may be mental, physical, or emotional

Item Analysis
- A way of measuring the quality of questions – seeing how appropriate they were for the respondents and how well they measured their ability/trait
- A way of measuring items over and over again in different tests with prior knowledge of how they are going to perform, creating a population of questions with known properties (e.g., a test bank)
- Items are tried out at least 3 or 4 times

Two families of item analysis: Classical Test Theory (CTT) and Latent Trait Models (LTM). LTM includes Item Response Theory (IRT) models (1P, 2P, 3P, 4P) and the Rasch model (similar to the 1P model).

CTT – the "true score model" (X = T + e)
- The easiest and most widely used form of analysis
- Performed on the test as a whole rather than on the items; although item statistics can be generated, they apply only to that group of students on that collection of items
- A set of psychometric procedures used to test items and scales: reliability, difficulty, discrimination, etc.
- Assumes that every person has a true score on an item or a scale
Level of Difficulty – the proportion or percent of examinees that answered the item correctly. To determine the difficulty level, tally the number of examinees with the correct answer on the item and then apply the formula:

P = (Nu / N) x 100

where:
P = % of students who answered the item correctly
Nu = number of examinees who answered the item correctly
N = total number of examinees in the 2 groups

Table of % levels of difficulty:
91% and above  –  Very easy       –  Unacceptable
79% - 90%      –  Easy            –  Acceptable
26% - 78%      –  Moderate        –  Optimum difficulty / Highly acceptable
11% - 25%      –  Difficult       –  Acceptable
10% and below  –  Very difficult  –  Unacceptable
Level of Difficulty Using Upper and Lower Groups
1. Score the papers after checking.
2. Arrange the papers from highest to lowest score.
3. Determine the upper and lower groups by multiplying the total number of examinees by 27% (0.27).
4. The top 27% of the examinees is the upper group, while the bottom 27% of the total examinees comprises the lower group.
5. Get the frequencies of examinees who answered the item correctly in each of the 2 groups.
6. Determine the difficulty level and the discriminating power.

Discriminating Power – determines the difference between examinees who did well and those who did poorly on a particular item. To determine the discrimination level, perform the steps for the difficulty level, then take the difference between the 2 groups and divide that difference by half of the total number of examinees. (unfinished in the source notes)

Discriminability
- Item/total correlation – every item is correlated with the total score; the point-biserial method is best used
- Point-Biserial Method – for dichotomously scored items / items with a correct answer; one dichotomous variable (correct/incorrect) is correlated with one continuous variable (total score); correlate the proportion of people getting each item correct with the total test score
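A minimal sketch, with made-up scores, that puts together the difficulty formula, the upper/lower 27% discrimination index, and the point-biserial item-total correlation described above; the data and the group-size rounding are illustrative.

```python
# Item difficulty, a simple discrimination index, and the point-biserial correlation.
from scipy import stats

total_scores = [38, 35, 33, 30, 28, 27, 25, 24, 22, 20, 18, 15]  # sorted high -> low
item_correct = [ 1,  1,  1,  1,  1,  0,  1,  0,  0,  1,  0,  0]  # 1 = answered this item correctly

n = len(total_scores)
g = round(0.27 * n)                      # size of each extreme group (27%)
upper, lower = item_correct[:g], item_correct[-g:]

Nu = sum(upper) + sum(lower)             # correct answers in the 2 groups
N = 2 * g                                # total examinees in the 2 groups
difficulty = (Nu / N) * 100              # P = (Nu / N) x 100

discrimination = (sum(upper) - sum(lower)) / g   # group difference / half of the 2-group total

r_pb, _ = stats.pointbiserialr(item_correct, total_scores)  # item-total correlation

print(f"P = {difficulty:.0f}%, D = {discrimination:.2f}, r_pb = {r_pb:.2f}")
```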
Classical Test Theory (CTT) vs. Latent Trait Models (LTM)
- CTT gauges the test performance itself but not the trait that drives it; LTM aims to look beyond that, at the underlying traits producing the test performance.
- CTT has the test as its basis; LTM is measured at the item level and provides sample-free measurement.
- CTT statistics are often generalized to similar students taking a similar test, but strictly they apply only to the students who took that test.
Latent Trait Models (LTM) – developed in the 1940s but widely used only since the 1960s; practically unfeasible to use without specialized software.

Item Response Theory (IRT) – a family of latent trait models used to establish the psychometric properties of items and scales
– sometimes referred to as modern psychometrics because … has completely replaced CTT
– can predict whether an examinee has guessed an item

3 Basic Components
1. Item Response Function (IRF) – a mathematical function that relates the latent trait (e.g., individual differences on a construct) to the probability of endorsing an item
2. Item Information Function – an indication of item quality; an item's ability to differentiate among respondents
3. Invariance – item characteristics do not depend on the particular sample of respondents

Item Response Theory (IRT) describes the relationship between examinee trait level, item properties, and the probability of endorsing the item.
– It can be converted into Item Characteristic Curves (ICCs), graphical functions that represent the respondent's ability.

Item Parameters
- Location (b) – the amount of the latent trait needed to have a 0.5 probability of endorsing the item
- Discrimination (a) – indicates the steepness of the IRF at the item's location; how strongly related the item is to the latent trait, like loadings in a factor analysis
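A minimal sketch of a two-parameter (2P) item response function showing how the location (b) and discrimination (a) parameters shape the probability of endorsement; the parameter values are illustrative.

```python
# 2P item response function: P(theta) = 1 / (1 + exp(-a * (theta - b))).
import math

def irf_2pl(theta, a, b):
    """Probability of endorsing the item at trait level theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

a, b = 1.5, 0.0   # fairly discriminating item located at the average trait level
for theta in (-2, -1, 0, 1, 2):
    print(f"theta = {theta:+d}: P = {irf_2pl(theta, a, b):.2f}")
# At theta == b the probability is exactly 0.5, which is how the location parameter is defined.
```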