PRINCIPLES OF LANGUAGE ASSESSMENT – TIPS FOR TESTING
Corina Ceban, English Teacher, DD I,
“V.Alecsandri” Lyceum, Bălți
How do you know if a test is effective, appropriate, useful, or, in down-to-earth terms, a "good" test? For the most part, that question can be answered by responding to such questions as: Can it be given within appropriate administrative constraints? Is it dependable? Does it accurately measure what you want it to measure? Is the language in the test representative of real-world language use? Does the test provide information that is useful for the learner? These questions help to identify five cardinal criteria for "testing a test": practicality, reliability, validity, authenticity, and washback. We will look at each one here; however, because all five principles are context-dependent, no priority order is implied by the order of presentation.
Practicality refers to the logistical, down-to-earth, administrative issues involved in making, giving, and scoring an assessment instrument. These include "costs, the amount of time it takes to construct and to administer, ease of scoring, and ease of interpreting/reporting the results" (McMillan, 2007, p. 51). A test that fails to meet such criteria is impractical. Consider the following attributes of practicality:
A PRACTICAL TEST . . .
• stays within budgetary limits
• can be completed by the test-taker within appropriate time constraints
• has clear directions for administration
• appropriately utilizes available human resources
• does not exceed available material resources
• considers the time and effort involved for both design and scoring
A reliable test is consistent and dependable. If you give the same test to the same student or matched students on two different occasions, the test should yield similar results. We might capsulate the principle of reliability in the following:
A RELIABLE TEST . . .
• is consistent in its conditions across two or more administrations
• gives clear directions for scoring/evaluation
• has uniform rubrics for scoring/evaluation
• lends itself to consistent application of those rubrics by the scorer
• contains items/tasks that are unambiguous to the test-taker
The most common learner-related issue in reliability is caused by temporary illness, fatigue, a "bad day," anxiety, and other physical or psychological factors, which may make an observed score deviate from one's "true" score. Also included in this category are such factors as a test-taker's testwiseness, or strategies for efficient test-taking (McMillan, 2007, p. 80). For the classroom teacher, student-related unreliability may at first blush seem to be a factor beyond control. We're accustomed to simply expecting some students to be anxious or overly nervous to the point that they "choke" in a test administration context. But the experience of many teachers suggests otherwise; some tips offered later in this article may help minimize student-related unreliability.
Human error, subjectivity, and bias may enter into the scoring process. Inter-rater reliability is achieved when two or more scorers yield consistent scores on the same test. Failure to achieve inter-rater reliability could stem from lack of adherence to scoring criteria, inexperience, inattention, or even preconceived biases. Popham (2006) provided some helpful hints on how to ensure inter-rater reliability. Rater-reliability issues are not limited to contexts in which two or more scorers are involved. Intra-rater reliability is an internal factor, a common concern for classroom teachers. Violation of such reliability
can occur in cases of unclear scoring criteria, fatigue, bias toward particular "good" and "bad" students, or simple carelessness.
By far the most complex criterion of an effective test, and arguably the most important principle, is validity, "the extent to which inferences made from assessment results are appropriate, meaningful, and useful in terms of the purpose of the assessment" (Black & Wiliam, 1998, p. 26). In somewhat more technical terms, McMillan (2007), who is widely recognized as an expert on validity, defined validity as "an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment" (p. 11). We might infer from these definitions the following attributes of validity:
A VALID TEST . . .
• measures exactly what it proposes to measure
• does not measure irrelevant or "contaminating" variables
• relies as much as possible on empirical evidence (performance)
• involves performance that samples the test's criterion (objective)
• offers useful, meaningful information about a test-taker's ability
• is supported by a theoretical rationale or argument
A valid test of reading ability actually measures reading ability, not 20/20 vision, previous knowledge of a subject, or some other variable of questionable relevance. To measure writing ability, one might ask students to write as many words as they can in 15 minutes and then simply count the words for the final score. Such a test would be easy to administer (practical), and the scoring quite dependable (reliable), but it would not constitute a valid test of writing ability without some consideration of comprehensibility, rhetorical discourse elements, and the organization of ideas, among other factors.
In some cases, it may be appropriate to examine the extent to which a test calls for performance that matches that of the course or unit being tested. In other cases, we may be concerned with how well a test determines whether students have reached an established set of goals or level of competence. Statistical correlation with other related but independent measures is another widely accepted form of evidence. Other concerns about a test's validity may focus on the consequences of a test, beyond measuring the criteria themselves, or even on the test-taker's perception of validity. These concerns point to several distinct types of validity evidence.
At the same time, La Marca (2006) and other assessment experts "grudgingly" agree that test appearance does indeed have an effect that neither test-takers nor test designers can ignore. Students may for a variety of reasons feel that a test isn't testing what it's supposed to test, and this might affect their performance and, consequently, create the student-related unreliability referred to previously. So student perception of a test's fairness is significant to classroom-based assessment because it can affect student performance/reliability.
Teachers can increase students' perception of fair tests by using
• a well-constructed, expected format with familiar tasks
• tasks that can be accomplished within an allotted time limit
• items that are clear and uncomplicated
• directions that are crystal clear
• tasks that have been rehearsed in their previous course work
• tasks that relate to their course work (content validity)
• a difficulty level that presents a reasonable challenge
Finally, the issue of face validity reminds us that the psychological state of the learner (confidence, anxiety, etc.) is an important ingredient in peak performance. Students can be distracted and their anxiety increased if you "throw a curve" at them on a test. They need to have rehearsed test tasks before the fact and to feel comfortable with them. A classroom test is not the time to introduce new tasks, because you won't know whether student difficulty is a factor of the task itself or of the objectives you are testing.
A fourth major principle of language testing is authenticity, a concept that is difficult to define, especially within the art and science of evaluating and designing tests. Stiggins (2006) defined authenticity as "the degree of correspondence of the characteristics of a given language test task to the
features of a target language task" (p. 14) and then suggested an agenda for identifying those target language tasks and for transforming them into valid test items. As mentioned, authenticity is not a concept that easily lends itself to empirical definition or measurement (La Marca [2006] discussed the difficulties of operationalizing authenticity in language assessment). After all, who can certify whether a task or language sample is "real-world" or not? Often such judgments are subjective, and yet authenticity is a concept that language-testing experts have paid a great deal of attention to (Black & Wiliam, 1998).
Essentially, when you make a claim for authenticity in a test task, you are saying that this task is likely to be enacted in the real world. Yet, as McMillan (2007) noted, many test item types fail to simulate real-world tasks. They may be contrived or artificial in their attempt to target a grammatical form or a lexical item. A sequence of items that bear no relationship to one another lacks authenticity. One does not have to look very long to find reading comprehension passages in proficiency tests that do not reflect a real-world passage. In a test, authenticity may be present in the following ways:
AN AUTHENTIC TEST . . .
• contains language that is as natural as possible
• has items that are contextualized rather than isolated
• includes meaningful, relevant, interesting topics
• provides some thematic organization to items, such as through a story line or episode
• offers tasks that replicate real-world tasks
The authenticity of test tasks has increased noticeably in recent years. Two or three decades ago, unconnected, boring, contrived items were accepted as a necessary component of testing. Things have changed. It was once assumed that large-scale testing could not include performance of the productive skills and stay within budgetary constraints, but now many such tests offer speaking and writing components. Reading passages are selected from real-world sources that test-takers are likely to have encountered or will encounter. Listening comprehension sections feature natural language with hesitations, white noise, and interruptions. More tests offer items that are "episodic" in that they are sequenced to form meaningful units, paragraphs, or stories. We invite you to take up the challenge of authenticity in your classroom tests.
A facet of consequential validity discussed above is "the effect of testing on teaching and learning" (Popham, 2003, p. 1), otherwise known in the language assessment field as washback. To distinguish the impact of an assessment, discussed above, from washback, think of the latter as referring almost always to classroom-based issues such as the extent to which assessment affects a student's future language development. McMillan (2007, p. 241) reminded us that the washback effect may refer to both the promotion and the inhibition of learning, thus emphasizing what may be referred to as beneficial versus harmful (negative) washback.
The following factors comprise the concept of washback:
A TEST THAT PROVIDES BENEFICIAL WASHBACK . . .
• positively influences what and how teachers teach
• positively influences what and how learners learn
• offers learners a chance to adequately prepare
• gives learners feedback that enhances their language development
• is more formative in nature than summative
• provides conditions for peak performance by the learner
In classroom-based assessment, washback can have a number of positive manifestations, ranging from the benefit of preparing and reviewing for a test to the learning that accrues from feedback on one's performance. Teachers can provide information that "washes back" to students in the form of useful diagnoses of strengths and weaknesses. Washback also includes the effects of an assessment on teaching and learning prior to the assessment itself, that is, on preparation for the assessment. Informal performance assessment is by nature more likely to have built-in washback effects because the teacher is
usually providing interactive feedback. Formal tests can also have positive washback, but they provide no beneficial washback if the students receive only a simple letter grade or a single overall numerical score.
The challenge to teachers is to create classroom tests that serve as learning devices through which washback is achieved. Students' incorrect responses can become windows of insight into further work. Their correct responses need to be praised, especially when they represent accomplishments in a student's developing language competence. Teachers can suggest strategies for success as part of their "coaching" role. Washback enhances a number of basic principles of language acquisition: intrinsic motivation, autonomy, self-confidence, language ego, interlanguage, and strategic investment, among others.
One way to enhance washback is to comment generously and specifically on test performance. Many overworked (and underpaid) teachers return tests to students with a single letter grade or numerical score and consider their job done. In reality, letter grades and numerical scores give absolutely no information of intrinsic interest to the student. Grades and scores alone, without comments and other feedback, reduce the linguistic and cognitive performance data available to the student to almost nothing. At best, they give a relative indication of a formulaic judgment of performance as compared to others in the class, which fosters competitive, not cooperative, learning.
The five principles of practicality, reliability, validity, authenticity, and washback go a long way toward providing useful guidelines for both evaluating an existing assessment procedure and designing one of your own. Quizzes, tests, final exams, and standardized proficiency tests can all be scrutinized through these five lenses. Are there other principles that should be invoked in evaluating and designing assessments? The answer, of course, is yes. Language assessment is an extraordinarily broad discipline with many branches, interest areas, and issues. The process of designing effective assessment instruments is far too complex to be reduced to five principles. Good test construction, for example, is governed by research-based rules of test preparation, sampling of tasks, item design and construction, scoring responses, ethical standards, and so on. But the five principles cited here serve as an excellent foundation on which to evaluate existing instruments and to build your own.
References
1. Black, P., & Wiliam, D. (1998). Assessment and classroom learning. Assessment in Education: Principles, Policy, and Practice, 5(1), 7–73.
2. La Marca, P. (2006, June). Assessment literacy: Building capacity for improving student learning. Paper presented at the National Conference on Large-Scale Assessment, Council of Chief State School Officers, San Francisco, CA.
3. McMillan, J. H. (Ed.). (2007). Formative classroom assessment: Theory into practice. New York: Teachers College Press.
4. Popham, W. J. (2006). Mastering assessment: A self-service system for educators. New York: Routledge.
5. Stiggins, R. J. (2006). Assessment for learning: A key to student motivation and learning. Phi Delta Kappa Edge, 2(2), 1–19.