Development of a Procedure for Establishing Occupational Examination Cut Scores: A NOCTI Example
Richard A. Walter
The Pennsylvania State University
Jerome T. Kapes
Texas A&M University
High-stakes testing of teachers is not a new phenomenon in Pennsylvania. Candidates for vocational teacher certification have been required by the Pennsylvania Department of Education to demonstrate their content mastery via an occupational competency examination since the 1920s. "Certificates to teach the vocational subjects require in addition to the minimum professional requirements satisfactory evidence of practical experience" (Department of Public Instruction, 1928, pp. 18-19). As a result, the three primary institutions engaged in the preparation of vocational teachers (The Pennsylvania State College, The University of Pittsburgh, and The University of Pennsylvania) all began individual efforts to develop and administer occupational competency examinations. That process remained virtually unchanged until 1939 when the Department of Public Instruction issued Bulletin 101, Administration of Vocational Education Programs in Pennsylvania, which included the specification, "Preparation of trade tests for the selection of prospective vocational teachers should be made jointly and administered by all three teacher-training institutions" (p. 114).
With passage of the Vocational Act of 1963, vocational schools and programs began to multiply rapidly, placing enormous strain upon Pennsylvania's loosely organized system of occupational competency assessment that grew out of Bulletin 101's requirement for joint administration. As a result, the position of State Coordinator of Occupational Competency Assessment (OCA) was created in 1970 to bring more centralization and standardization to the process. In addition, the Pennsylvania OCA Consortium was formed to assist the coordinator with administration. Its membership was comprised of representatives of the Bureau of Vocational Education, the Bureau of Teacher Certification, the vocational administrators organization, and each of the designated teacher-preparation institutions (Walter, 1984).
As in Pennsylvania, education department staff members in other states were also experiencing increasing demands for vocational teachers. Responding to that need, the U.S. Office of Education funded a research project at Rutgers University to investigate the need for and feasibility of a national occupational competency testing organization. Panitz and Olivo (1970) stated, "Two one-day institutes at Rutgers University, attended by representatives of twenty-three (23) states, concluded that the development and implementation of an occupational competency examination program on a nationwide basis would be a more efficient use of personnel and would provide higher quality examinations" (p. 1). As a result of this initial grant and a subsequent grant to develop and pilot five examinations, the National Occupational Competency Testing Institute (NOCTI) and the NOCTI Consortium of States emerged.
The opinion of those who advocated for NOCTI, that nationwide interest in occupational competency testing would continue to grow, received a boost when Pennsylvania joined the consortium in 1975. Not only was Pennsylvania the largest consumer of tests in the consortium, its membership also greatly expanded the potential sources of revenue by providing NOCTI with access to the extensive test bank that had been developed (Walter, 1984).
Traditional Approaches for Establishing Cut Scores
As one of the first items of business, the members of Pennsylvania's OCA Consortium standardized procedures for developing and validating new occupational competency tests, as well as for establishing the cut scores at which candidate pass/fail decisions were determined. As specified within the Pennsylvania Policy Manual for Administration of The Occupational Competency Assessment Program (Bureau of Vocational Education, 1977, p. 19), "The draft test will be duplicated (50 copies) with excess items and administered to 10 occupational instructors and/or occupational incumbents and to as many as 50 graduating secondary students who prepared for that occupation." The data provided from this piloting process were then used to conduct an item analysis to guide revisions and to establish internal reliability for the instrument. Further, Consortium policies specified, "Initially test norms will be based upon the results of testing 10 occupational teachers/occupational incumbents, but will be updated as data becomes available through actual use with candidates" (p. 19). These procedures remained in place until the examinations used in Pennsylvania had been fully integrated into the NOCTI test bank. From that point forward, the responsibility of piloting new and revised examinations was assumed by NOCTI.
The change in responsibility for the piloting of examinations also dictated a change in the procedures for establishing the cut scores for the pass/fail decisions required of the OCA Coordinators at Indiana University of Pennsylvania, The Pennsylvania State University, and Temple University. As recommended by Kapes and Welch (1985), the Consortium members replaced the previous system of establishing cut scores with one based upon the national mean scores for each examination, secured from NOCTI, as the theoretical cut score. Kapes and Welch further recommended, "Due to the inherent error involved in testing, the 95% confidence interval should be used in implementing any cutoff decisions using occupational competency exams" (p. 8). Therefore, the procedure adopted by the Pennsylvania OCA Consortium specified that the cut score for each written and performance test would be established by subtracting two Standard Errors of Measurement (SEM) from the national mean score and rounding the result to the nearest whole number. The revised system of piloting instruments and establishing cut scores worked well until a new issue emerged.
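The adjustment rule described above (national mean minus two SEMs, rounded to the nearest whole number) can be sketched as follows. This is a minimal illustration only: the function name and the input values are hypothetical, and rounding halves upward is an assumption, since the policy does not say how ties are handled.

```python
import math

def provisional_cut_score(national_mean: float, sem: float) -> int:
    """National mean minus two SEMs, rounded to the nearest whole number."""
    adjusted = national_mean - 2 * sem
    # Assumption: round halves up. Python's built-in round() rounds halves to even.
    return math.floor(adjusted + 0.5)

# Hypothetical values, not taken from any NOCTI examination.
print(provisional_cut_score(national_mean=72.4, sem=3.2))
```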
By relinquishing control to NOCTI of developing, revising, and piloting to establish normative data for the examinations used to certify vocational teachers, members of Pennsylvania's OCA consortium no longer made the decisions about prioritizing the schedule under which those activities took place. Examinations that remained critical elements within Pennsylvania's teacher certification process were frequently appearing at the bottom of the schedule. The situation was exacerbated by a burgeoning market for student tests that consumed NOCTI resources originally devoted to teacher testing. Although ongoing discussions produced changes in the schedule of examination development and revision, the piloting of new and revised examinations to establish normative data from which cut scores could be calculated remained a major problem. The NOCTI staff members were encountering major difficulties in conducting the traditional processes for establishing normative data. The result was extreme delays (in some cases, three to four years) in making new and revised examinations available for use. In Pennsylvania, that has dictated a return to the use of oral examinations conducted by a panel of incumbent workers rather than the preferable written and performance exams, since the process of certifying new teachers cannot be postponed. Therefore, the members of Pennsylvania's OCA Consortium recently decided to investigate alternative procedures for establishing cut scores for NOCTI examinations.
Methodologies for Establishing Cut Scores
A review of the literature revealed three major approaches to setting cut scores for credentialing or competency tests, although over 30 different methods have been described (Behuniak, Archamboult, & Gamble, 1982; Hambleton, 1998). The three overall approaches can be described as judgments based upon (a) holistic impressions of the entire test, (b) the content of individual test items, and (c) examinees' test performance (Crocker & Algina, 1986). All three approaches share the requirements for selecting competent judges, training the judges, collecting the judgments, and combining the judgments to derive a cut score (Livingston & Zieky, 1982; Hambleton, 1998).
Holistic Impressions
This methodology is based upon the assembly of a panel of subject matter experts or judges who are charged with the responsibility of individually reviewing the entire test. Jaeger (1991) recommends a relatively large number of judges (15 to 20) who receive clear instructions and training on the process. Instructions to the judges may also direct them to consider factors such as item content and difficulty, the nature of the test (e.g., multiple choice, constructed response, performance), and the consequences of false positive or false negative decisions when setting a realistic passing score. Subsequent to their review, each judge provides an estimate of what proportion of items should be answered correctly by an individual who is minimally competent in the subject matter. The cut score is then determined by calculating the mean of all of the judges' estimated proportions of correct responses.
Item Content
Hambleton (1998) identified three primary item content methods, those of Nedelsky (1954), Angoff (1971), and Ebel (1972). Each of the three methods has been implemented with numerous variations since its creation.
The Nedelsky (1954) approach is designed to be applied to tests that use multiple-choice items for judging minimal competency. A panel of subject matter experts is assembled and each is instructed to (a) cross out, for each item, the responses that a minimally competent person should be able to eliminate from consideration in selecting the correct response; (b) record the reciprocal of the number of choices remaining (e.g., if two choices within a five-choice item were eliminated, the reciprocal of the three remaining choices would be one third); and (c) sum the reciprocals over all of the items within the test to arrive at the probable score of a minimally qualified candidate. The average of all of the judges' scores is then calculated to produce a cut score.
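The Nedelsky steps above can be sketched in a few lines. The panel data below are hypothetical, and the function names are illustrative rather than drawn from any published implementation.

```python
from statistics import mean

def nedelsky_judge_score(remaining_choices: list[int]) -> float:
    """Sum of reciprocals of the choices a judge left standing on each item."""
    return sum(1 / n for n in remaining_choices)

def nedelsky_cut_score(panel: list[list[int]]) -> float:
    """Average the judges' expected scores to obtain the cut score."""
    return mean(nedelsky_judge_score(judge) for judge in panel)

# Hypothetical panel: 3 judges, 4 items. Each entry is the number of
# choices remaining after the judge crossed out the implausible ones.
panel = [
    [1, 2, 4, 2],
    [2, 2, 4, 1],
    [1, 1, 2, 2],
]
print(nedelsky_cut_score(panel))
```

The result is expressed in raw score points (expected number of items correct); dividing by the number of items converts it to a proportion.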
The Angoff (1971) approach is a variation of Nedelsky (1954). Within this approach, each judge is asked to estimate the proportion of minimally competent individuals who would answer each item correctly. The resulting probabilities (p-values) are then summed over all judges for each item, and then over all items to calculate the correct proportion (p-value) required to pass the examination.
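A minimal sketch of the Angoff calculation follows. The ratings are hypothetical, and averaging (rather than simply summing) across judges is an assumption made here so that the result stays on the proportion-correct scale.

```python
def angoff_cut_score(ratings: list[list[float]]) -> tuple[float, float]:
    """ratings[j][i] is judge j's estimated proportion correct on item i."""
    n_judges = len(ratings)
    n_items = len(ratings[0])
    # Mean estimate for each item across judges.
    item_means = [sum(judge[i] for judge in ratings) / n_judges
                  for i in range(n_items)]
    raw_cut = sum(item_means)           # expected number of items correct
    return raw_cut, raw_cut / n_items   # (raw score, proportion correct)

# Hypothetical ratings: 2 judges, 4 items.
ratings = [
    [0.5, 0.75, 1.0, 0.25],
    [0.5, 0.25, 1.0, 0.75],
]
raw, proportion = angoff_cut_score(ratings)
print(raw, proportion)
```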
The Ebel (1972) approach combines elements of both the Nedelsky (1954) and Angoff (1971) approaches. The essence of this approach is the inclusion of a two-dimensional grid within which each member of the panel of judges divides the items into essential, important, acceptable, and questionable categories on the content relevance dimension, and easy, medium, and difficult on the item difficulty dimension. After all test items are placed within one of the 12 grid categories, each judge decides what proportion of items within each category a minimally competent individual would correctly answer. The average proportion (p-value) over all items and all judges produces the cut score.
In order for any of the methods that rely on judges or subject matter experts to work well, there are two important considerations. The number of judges should be relatively large (i.e., 10 or more), and the judges must be well trained (i.e., understand and practice using the method).
Performance of Examinees
This category responds to the risk that the subjective judgments of experts may result in unrealistic expectations (usually too high) by securing actual performance data instead (Livingston & Zieky, 1982; Hambleton, 1998). Within this approach, subject matter experts select individuals for inclusion in a group representing a known level of competency. One method is to construct contrasting groups: one composed of individuals who are qualified, the other composed of individuals who are not qualified. Subsequent to the completion and scoring of the examination, the individual scores are plotted on a continuum. The cut score would be established at a point that produces the fewest misclassifications.
The borderline group method is a variation that involves selection of a single group of individuals who are judged to be borderline on the competencies being assessed. Subsequent to their completion of the examination, the cut score is established as the median for the group.
Other Methods
Jaeger (1982) proposed an approach that combines all three of the previous categories. Within this approach, each member of the panel of subject matter experts is asked to rate each test item "yes" (1) or "no" (0) on the basis of whether or not a competent individual would select the correct response. By summing the 1's for all judges, a p-value is established for each item. Following administration of the test to a heterogeneous group (composed of competent and not competent individuals) and the calculation of the actual p-values, the judges are asked whether or not they wish to change their yes/no rating. The cut score is then established by calculating the median p-value for all judges over all items.
Comparing the Methodologies
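The contrasting groups idea of choosing the point with the fewest misclassifications can be sketched as below. The scores are hypothetical, and two conventions are assumptions of this sketch: an examinee passes when the score is at or above the cut, and ties between equally good cuts go to the lowest candidate.

```python
def contrasting_groups_cut(qualified: list[float], unqualified: list[float]) -> float:
    """Return the observed score that misclassifies the fewest examinees."""
    candidates = sorted(set(qualified) | set(unqualified))

    def errors(cut: float) -> int:
        false_neg = sum(s < cut for s in qualified)      # qualified but failed
        false_pos = sum(s >= cut for s in unqualified)   # unqualified but passed
        return false_neg + false_pos

    # min() keeps the first (lowest) candidate among equally good cuts.
    return min(candidates, key=errors)

# Hypothetical percent-correct scores for the two groups.
qualified = [70, 75, 80, 85, 90]
unqualified = [50, 55, 60, 65, 72]
print(contrasting_groups_cut(qualified, unqualified))
```

With these data, cuts of 70 and 75 each misclassify one examinee; which of the two to prefer depends on whether a false positive or a false negative is the more costly error.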
Presented in Table 1 are the summarized advantages and disadvantages of each of the six methods reviewed. Based upon those characteristics, it was decided to use a combination of approaches: use the Nedelsky (1954) method to establish the provisional cut scores for the written exam, and use a variation of the Angoff method to establish the provisional cut scores for the performance test.
Using the Methodology: A NOCTI Example
Selection of the Examination
A recently adopted change in requirements that opened the subbaccalaureate certification route for childcare instructors in Pennsylvania created an immediate need for an examination. The NOCTI staff already had developed the experienced worker examination entitled, "Early Childhood Care and Education", but had not been able to secure sites for the traditional piloting procedures prior to its release. Therefore, that was the examination selected as the focus of this study.
Table 1
Advantages and Disadvantages of Methods for Setting Cutoff Scores
| Method | Advantages | Disadvantages |
|---|---|---|
| Holistic | Requires less work by judges; Less time consuming; Used with performance tests | Least sophisticated psychometrically; Least reliable |
| Item Content: Nedelsky | Item level judgments; Good reliability | Works with M-C items; Time consuming |
| Item Content: Angoff | Item level judgments; Used with M-C or performance tests; Good reliability | Time consuming |
| Item Content: Ebel | Includes evaluation of item importance; Used with M-C or performance tests; Good reliability | Most time consuming |
| Contrasting Groups | Based upon actual performance; More certain of results in practice; Used with M-C or performance tests; Good reliability | Requires a subject pool; Requires administration; Time consuming and expensive |
| Borderline Group | Based upon performance; More certain of results; Used with M-C or performance tests; Good reliability | Subject pool may be difficult to define and obtain; Requires administration; Time consuming and expensive |
Selection of the Panel
To facilitate the intended use of this study as a pilot for defining the procedures by which subsequent efforts to establish cut scores for additional examinations would be standardized, the minimum number of participants for the panel of subject matter experts was established at 15. Selection of individuals as members of the panel was conducted to ensure that each had relevant work experience and that the group represented a range of child care providers in terms of size of staff, age range of clients, and location of the facility. Potential members of the panel were initially contacted by phone to provide an overview of the task and to verify their eligibility to be selected as a member of the panel of experts. Follow-up letters detailing the logistics and goals of the session were mailed to those who were selected as participants.
Training the Panel
The initial plan for convening the panel of experts was to conduct an introductory session on Friday evening, followed by the analysis of the written and performance tests on Saturday. However, as a result of the phone conversations with potential members of the panel, it became clear that the pick-up times for the children served by the facilities ruled out the evening session. Therefore, the meeting of the panel of experts was convened in June 2002. Following a brief introduction, the session continued with an overview of the process through which vocational teachers secure their certificates and the role that occupational competency examinations play within the process. Next, the protocols for reviewing the tests and the intended outcomes of the session were reviewed.
A multiple-choice format pretest had been prepared in order to provide the panel members with an opportunity to practice the Nedelsky method as it would be applied to the "Experienced Worker Early Childhood Care and Education" written test. The pretest contained eight items drawn, with permission, from the on-line practice test for the written portion of the driver's license examination prepared by the Pennsylvania Department of Transportation (2002). The panel members were instructed to draw a diagonal slash through the alternatives that a minimally competent driver should be able to eliminate as a distracter in their process of selecting the correct answer. Since the pretest contained four items with five choices, three with four choices, and one with two choices, it facilitated a demonstration of the reciprocal calculations, as well as application of the minimally competent criterion. A brief group discussion to double-check the panel members' understanding of the process was conducted subsequent to their independent completion of the evaluation of the pretest. Based upon that discussion, two conclusions were drawn: the process of eliminating alternative responses based upon the concept of a minimally competent worker was well-understood by the members of the panel, and requiring each panel member to calculate the item reciprocal values was not useful and was therefore eliminated. Instead, the researchers performed the calculations after all data were collected.
Written Test
The members of the panel of experts were provided with copies of the "Experienced Worker Early Childhood Care and Education" written test without any indication of the correct responses. They were instructed to draw a diagonal slash through the letter of the alternate responses for each item that could be eliminated as a distracter by a minimally competent worker. They were further instructed that upon completion of their analysis, they should move to an adjoining room to meet with one of the researchers. As expected, the panel members completed the task at varying rates, with the first moving to the adjoining room after 70 minutes and the last after 105 minutes.
In the adjoining room, one of the researchers met with panel members to request that they now mark the correct answer for each item by circling the letter of that response. In addition, they were instructed to place a check mark next to any item about which they wished to comment. After completing the task of selecting the correct answers, panel members were asked to write their specific comments about any items they had previously checked as problem items.
Performance Test
The next session was devoted to the "Experienced Worker Early Childhood Care and Education" performance test. Each panel member was provided with a copy of the examiner's guide and the performance test. The researchers chose to use the first item of the performance test as the pretest and conducted it as both an individual and a group activity. The panel members were asked to locate the process and product scoring criteria in the examiner's guide and estimate the level on the five-point scale at which a minimally competent worker would be able to perform the required task. Following the determination that all panel members had completed the task, a discussion of their selections followed to reestablish the concept of minimal competency and establish consensus on its application to the performance test. The researchers also explained that the individual scores for each item would be summed over all items and over all judges to arrive at the expected performance level for a minimally competent worker. Panel members then proceeded through the remainder of the test items individually.
Analysis
Subsequent to the conclusion of the work performed by the panel of judges, one of the researchers calculated the reciprocal scores for each item within the written examination. The examination consisted of 196 multiple-choice items, each with four alternatives. Therefore, the reciprocals were calculated on the basis of (a) no alternative eliminated, p = .25; (b) one alternative eliminated, p = .33; (c) two alternatives eliminated, p = .50; and (d) three alternatives eliminated, p = 1.00. The performance examiner's guide details 62 criteria, each based upon a five-point scale, with A on the high end and E on the low end. An A would result in a p-value of 1.00, a B in a p-value of .8, etc. For each panel member, the researcher recorded the point value of the expected level of competence assigned to each examination item. The reciprocals and levels of performance were then entered into separate Excel spreadsheets to facilitate calculation of the mean for each item over all judges, the mean of all items for each judge, and the mean of all items over all judges.
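The spreadsheet calculations described above can be sketched as follows for the written examination. The two-judge, two-item grid is hypothetical; the reciprocal values are those listed in the text, rounded to two decimals.

```python
from statistics import mean

# Alternatives eliminated on a four-choice item -> reciprocal of those remaining.
RECIPROCAL = {0: 0.25, 1: 0.33, 2: 0.50, 3: 1.00}

def summarize(eliminated: list[list[int]]):
    """eliminated[j][i] is the number of alternatives judge j crossed out on item i."""
    values = [[RECIPROCAL[k] for k in judge] for judge in eliminated]
    item_means = [mean(col) for col in zip(*values)]   # each item over all judges
    judge_means = [mean(row) for row in values]        # all items for each judge
    overall = mean(judge_means)                        # mean of means
    return item_means, judge_means, overall

# Hypothetical judgments: 2 judges, 2 items.
item_means, judge_means, overall = summarize([[2, 3], [3, 1]])
print(item_means, judge_means, overall)
```

Because every judge rates every item, the mean of the judge means equals the mean of the item means, so either can serve as the theoretical cut score.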
The two means calculated for all items over all judges established the theoretical cut scores for the written and performance exams. As discussed earlier, the members of Pennsylvania's OCA Consortium adopted a policy in 1985 to establish the actual cut scores based upon the following formula: theoretical cut score (mean) minus two times the SEM (Kapes & Welch, 1985). The calculation of the SEM value requires a measure of reliability that is not available for examinations utilizing this methodology. Therefore, it was decided to estimate the SEM by averaging the SEMs for the 34 NOCTI examinations currently administered through the three Pennsylvania OCA test centers, as reported by Kapes (2001).
Table 2 presents the reciprocal values for the written examination based upon each panel member's decision as to how many alternatives could be eliminated as distracters by a minimally competent worker. The item numbers are displayed in the first column, the item-by-item decision of each judge in the next columns, and the mean for each item across all judges in the last column. The bottom row of the table presents the mean for each judge across all items and the mean of the means. The synthetic item difficulty (p-values) assigned by each member of the panel ranged from .25 (difficult) through 1.00 (easy). The means over all 196 items for each member of the panel are displayed in the bottom row and range between .49 and .89. The item means over all judges displayed in the last column range between .39 and 1.0 for the 196 items. The overall synthetic mean difficulty of the written examination (the theoretical cut score) is reflected in the mean of means displayed in the lower-right-hand corner (.7349).
Table 2
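For context, the sketch below shows why a reliability estimate is needed and how an averaged SEM might be formed. The formula is the standard classical test theory definition rather than anything quoted from the article, and the standard deviations and reliabilities are hypothetical stand-ins, not the values reported by Kapes (2001).

```python
import math
from statistics import mean

def sem(sd: float, reliability: float) -> float:
    """Classical standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

# Hypothetical (standard deviation, reliability) pairs for three examinations.
exams = [(10.0, 0.91), (8.0, 0.84), (12.0, 0.75)]

# Average the per-examination SEMs to estimate an SEM for a new examination.
average_sem = mean(sem(sd, r) for sd, r in exams)
print(round(average_sem, 2))
```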
Written Examination Item Difficulties
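| Item | J 1 | J 2 | J 3 | J 4 | J 5 | ... | J 15 | Mean |
|---|---|---|---|---|---|---|---|---|
| 1 | .50 | .33 | 1.00 | 1.00 | 1.00 | ... | .50 | .6433 |
| 2 | .50 | .50 | .50 | 1.00 | .50 | ... | .33 | .5940 |
| 3 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | ... | .50 | .8167 |
| 4 | .50 | 1.00 | .33 | .50 | .50 | ... | .33 | .5607 |
| 5 | .50 | 1.00 | .50 | .50 | .50 | ... | .50 | .5160 |
| 6 | .33 | 1.00 | 1.00 | .33 | .50 | ... | .50 | .6100 |
| 7 | .25 | .50 | .50 | 1.00 | .50 | ... | .50 | .5273 |
| 8 | 1.00 | .50 | 1.00 | 1.00 | .50 | ... | .50 | .6553 |
| 9 | 1.00 | 1.00 | .30 | 1.00 | .33 | ... | 1.00 | .7433 |
| 10 | 1.00 | .50 | .50 | 1.00 | .50 | ... | .50 | .6107 |
| 11 | .50 | .33 | 1.00 | .50 | .50 | ... | 1.00 | .5993 |
| 12 | .33 | 1.00 | 1.00 | 1.00 | .33 | ... | .50 | .5993 |
| 13 | .50 | 1.00 | .50 | 1.00 | .50 | ... | 1.00 | .5660 |
| 14 | .33 | .50 | .50 | .50 | .30 | ... | .50 | .4153 |
| 15 | 1.00 | .50 | .50 | 1.00 | 1.00 | ... | 1.00 | .7887 |
| 16 | 1.00 | 1.00 | 1.00 | .33 | .33 | ... | 1.00 | .7500 |
| 17 | .33 | 1.00 | .50 | 1.00 | 1.00 | ... | .50 | .6773 |
| 18 | 1.00 | 1.00 | .50 | 1.00 | .30 | ... | 1.00 | .8553 |
| 19 | 1.00 | 1.00 | 1.00 | 1.00 | .50 | ... | 1.00 | .7887 |
| 20 | .50 | 1.00 | .30 | .50 | .50 | ... | .50 | .6213 |
| 21 | .50 | .50 | 1.00 | 1.00 | .30 | ... | .33 | .5660 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 196 | 1.00 | 1.00 | .50 | 1.00 | .50 | ... | 1.00 | .6887 |
| Total | 146 | 170 | 133 | 163 | 114 | ... | 142 | 144 |
| Mean | .74 | .87 | .68 | .83 | .59 | ... | .73 | .7349 |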
The expected levels of competency on the five-point scale predicted to be achieved by a minimally competent worker on the 62 process and product items contained within the performance examination are noted in Table 3. The item numbers appear in the first column, followed by the predicted performance levels over the 62 items for each member of the panel, and the mean for each item over all judges in the last column. The bottom row displays the means over all items for each judge and the mean of means over all items and all judges. The predicted levels of performance range between 3 and 5. The means over all 62 items for each member of the panel, displayed in the bottom row, range between 4.16 and 4.63. The item means over all judges displayed in the last column range between 3.73 and 5.0 for the 62 items. The overall predicted level of performance achieved by a minimally competent worker on the performance examination is reflected in the mean of means (4.3763). Since the maximum possible score of 5 is equal to 100% and 4 is equal to 80%, the theoretical cut score for the performance examination is therefore .875, or 87.5% of the total possible points.
Table 3
Performance Examination Expected Levels of Competence
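| Item | J 1 | J 2 | J 3 | J 4 | J 5 | ... | J 15 | Mean |
|---|---|---|---|---|---|---|---|---|
| 1 | 5 | 5 | 5 | 5 | 5 | ... | 5 | 5.0000 |
| 2 | 4 | 5 | 5 | 5 | 5 | ... | 5 | 4.8000 |
| 3 | 4 | 5 | 5 | 4 | 4 | ... | 5 | 4.7333 |
| 4 | 4 | 4 | 4 | 5 | 5 | ... | 4 | 4.4667 |
| 5 | 4 | 3 | 4 | 4 | 4 | ... | 4 | 4.0000 |
| 6 | 4 | 4 | 5 | 5 | 5 | ... | 5 | 4.5333 |
| 7 | 4 | 4 | 5 | 5 | 4 | ... | 4 | 4.6000 |
| 8 | 4 | 4 | 5 | 5 | 5 | ... | 4 | 4.4000 |
| 9 | 4 | 4 | 5 | 5 | 5 | ... | 4 | 4.4000 |
| 10 | 5 | 4 | 4 | 5 | 4 | ... | 5 | 4.4000 |
| 11 | 5 | 5 | 5 | 5 | 4 | ... | 5 | 4.6667 |
| 12 | 4 | 5 | 5 | 5 | 5 | ... | 4 | 4.4667 |
| 13 | 4 | 5 | 4 | 4 | 4 | ... | 4 | 4.3333 |
| 14 | 5 | 3 | 4 | 4 | 4 | ... | 5 | 4.2667 |
| 15 | 4 | 4 | 4 | 4 | 4 | ... | 5 | 4.2667 |
| 16 | 4 | 4 | 4 | 4 | 4 | ... | 5 | 4.1333 |
| 17 | 4 | 4 | 5 | 4 | 4 | ... | 4 | 4.0000 |
| 18 | 5 | 4 | 4 | 5 | 5 | ... | 5 | 4.5333 |
| 19 | 5 | 5 | 5 | 4 | 4 | ... | 5 | 4.4667 |
| 20 | 4 | 4 | 5 | 5 | 4 | ... | 5 | 4.5333 |
| 21 | 4 | 4 | 5 | 4 | 4 | ... | 5 | 4.4000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 62 | 4 | 3 | 4 | 3 | 3 | ... | 4 | 3.4000 |
| Total | 274 | 259 | 271 | 275 | 264 | ... | 287 | 271.3 |
| Mean | 4.42 | 4.18 | 4.37 | 4.44 | 4.26 | ... | 4.63 | 4.3763 |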
As discussed previously, the Pennsylvania OCA Consortium requires that the theoretical cut scores for the written and performance examinations be adjusted by subtracting two times the SEM. Therefore, to use these cut scores with prospective teachers of child care in Pennsylvania, the averaged SEM for the written examination (3.19) and the performance examination (4.13) were doubled and subtracted from the respective theoretical cut scores. The results of this process were cut scores of 67.62% for the written examination and 79.24% for the performance examination.
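As a worked illustration of the adjustment, the sketch below uses the performance-examination figures reported above. It assumes the averaged SEM is expressed in percentage points, which the text implies but does not state; the function names are illustrative.

```python
def level_to_percent(mean_level: float, scale_max: int = 5) -> float:
    """Convert a mean level on the five-point scale to percent of possible points."""
    return mean_level / scale_max * 100

def adjusted_cut_percent(theoretical_percent: float, sem_percent: float) -> float:
    """Theoretical cut score minus two SEMs, both in percentage points."""
    return theoretical_percent - 2 * sem_percent

theoretical = round(level_to_percent(4.3763), 1)   # mean of means from Table 3
performance_cut = adjusted_cut_percent(theoretical, 4.13)
print(theoretical, round(performance_cut, 2))      # 87.5 79.24
```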
There has been considerable research on the methods described here since they were developed. In general, the research appears to indicate that different outcomes are often achieved using different methods. Also, different outcomes have been observed using the same method but different judges. The most consistent results have been achieved when the number of judges is relatively large and they have been trained by working with each other in a group setting. When judges don't all agree and the number of judges is relatively large, it is possible to drop the extreme ratings and average the rest. However, this does not help if there are two distinct groups of judges that don't agree. Some literature suggests that convergence on a cutoff score can be achieved by having judges with relatively high or low estimated p-values for an item describe to the group why they rated the item as they did, and then ask all judges to consider changing their rating based on the most convincing arguments.
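Dropping the extreme ratings and averaging the rest amounts to a trimmed mean, sketched below. The choice to drop one rating from each end is an assumption for illustration; the literature summarized above does not fix how many to drop.

```python
def trimmed_mean(ratings: list[float], drop: int = 1) -> float:
    """Mean of the ratings after removing the `drop` lowest and `drop` highest."""
    if len(ratings) <= 2 * drop:
        raise ValueError("not enough ratings to trim")
    kept = sorted(ratings)[drop:len(ratings) - drop]
    return sum(kept) / len(kept)

# Hypothetical p-value estimates from five judges for one item.
print(trimmed_mean([0.25, 0.5, 0.5, 0.5, 1.0]))
```

As the paragraph notes, trimming does not help when the judges split into two distinct camps, because the disagreement is then in the middle of the distribution rather than at its ends.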
One additional comment about judges may be useful, and this concerns their source. Teachers of occupational programs, both high school and postsecondary, and business and industry personnel who are currently engaged in the occupation are the most likely candidates. A mixed panel representing both groups may be the most workable for reasons of both validity and credibility. However, one shortcoming of using teachers is that they will then know the exact test content and may find it hard to resist passing this knowledge on to prospective applicants.
It should also be mentioned here that many of these methods work best when the test is of a written and objective nature (e.g., multiple choice). However, the basic notion that expert judges can make a decision about what a competent individual should be able to do as well as know is viable for all forms of exams; but it may require more time, thought, and effort.
The purpose of the research described within this article was to answer the question posed by the members of the Pennsylvania Occupational Competency Assessment Consortium, "Is there a viable alternative to the traditional methodology used to establish cut scores for NOCTI examinations?" Based upon the results of this pilot project, the authors have concluded that the answer is "yes."
The Nedelsky (1954) and Angoff (1971) methods selected for use within this study are grounded upon the two underlying conditions required for the successful use of judges or subject matter experts: (a) the number of judges should be relatively large (10-15 members) and (b) the judges must be well trained. The results of this pilot study reinforce both of these requirements.
Since both methods rely upon judges with different opinions, it is crucial to convene a panel that contains a sufficient number of individuals to foster those differences. In this study, the researchers also selected panel members to ensure that the overall composition would reflect a range of childcare providers in terms of size of staff, age of clients, and location of the facility. By averaging the p-values across all judges, a more appropriate mid-range value of .7349 was determined and used as the theoretical cut score.
Similar results are apparent in a review of the bottom row of Table 3, where the performance examination means for each judge are displayed, with judge 15 at the easy end of the range (4.63) and judge 12 at the difficult end of the range (4.16). Again, by averaging the predicted performance levels across all judges, a more appropriate mid-range value of 4.3763 was determined and used as the theoretical cut score. It should also be noted that the range for the performance examination means is much narrower than the range for the written examination, with all values falling within the 80% level of the five-point scale.
The necessity of providing specific training in the processes to be implemented and familiarizing panel members with the overall purpose of the activity was also confirmed in this pilot study. Questions posed by panel members during the orientation clearly communicated their desire to understand the "why" behind the use of occupational competency examinations in Pennsylvania. Further, the discussion provided an opportunity to assist each panel member in developing an understanding of the balance necessary between providing a fair evaluation of a teacher candidate's skills and knowledge, and ensuring that students are instructed by a well-qualified representative of the occupational area.
The value of pretesting was also clearly communicated as panel members progressed from displaying puzzled facial expressions and hesitation at starting the process to displaying confident facial expressions and eager readiness to apply the techniques to the actual examinations. The training for the written examination also revealed that requiring each panel member to calculate the item reciprocals would serve as a time-consuming distraction, rather than as an added value to the process.
Whatever method is used to establish a cut score point, it is possible to adjust the obtained cutoff score to take into consideration other factors. For example, as was the case in this study, the reliability of the test can be considered. This can be done by computing the SEM and either subtracting or adding it to the cutoff, depending on the desired effect. The direction in which to make an adjustment would depend upon the relative importance of making a false positive or false negative decision. If it were less desirable to pass an unqualified individual than fail a qualified individual (e.g., as in the nuclear industry), the cutoff would be increased to minimize false positives. If it is relatively easy and inexpensive for an individual to retake the exam, setting the standard relatively high does not have major negative consequences. On the other hand, if failing the exam has expensive negative consequence for the examinee and/or society (e.g., denying a prospective CTE teacher the opportunity to teach), the SEM may be used to lower the passing score to give the benefit of test error to the examinee.
While SEM adjustments can be used with both written (e.g., multiple choice) and performance exams, it makes most sense to utilize interrater reliability rather than internal consistency coefficients in the calculation of the SEM for performance exams. This, of course, requires more effort, since more than one judge or evaluator is needed to obtain interrater estimates.
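The adjustment described above can be illustrated with a minimal sketch. It assumes the classical test theory definition of the SEM (the score standard deviation multiplied by the square root of one minus the reliability coefficient); the function names and all numeric values are hypothetical and are not taken from this study.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Classical SEM: SD * sqrt(1 - reliability).

    `reliability` may be an internal consistency coefficient (written exams)
    or an interrater coefficient (performance exams).
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def adjusted_cutoff(cutoff, sd, reliability, direction, n_sem=1.0):
    """Shift a cut score by n_sem standard errors of measurement.

    direction="raise" guards against false positives (passing the unqualified);
    direction="lower" gives the benefit of test error to the examinee.
    """
    sem = standard_error_of_measurement(sd, reliability)
    if direction == "raise":
        return cutoff + n_sem * sem
    if direction == "lower":
        return cutoff - n_sem * sem
    raise ValueError("direction must be 'raise' or 'lower'")


# Hypothetical values: panel cut score of 70, score SD of 10, reliability of .84
sem = standard_error_of_measurement(10.0, 0.84)
print(f"SEM = {sem:.2f}")
print(f"Lowered cutoff = {adjusted_cutoff(70.0, 10.0, 0.84, 'lower'):.2f}")
print(f"Raised cutoff  = {adjusted_cutoff(70.0, 10.0, 0.84, 'raise'):.2f}")
```

The `n_sem` parameter is included only because some agencies adjust by a multiple of the SEM; whether that is appropriate is a policy decision rather than a statistical one.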
Based upon the results of this study, several recommendations appear appropriate. First of all, additional studies of this nature would be useful, with some refinements and extensions to the research design. One possible extension is to have the panel employ the Nedelsky (1954) method as described here and then ask the panel members to select the correct answer for each item. This process would yield two sets of data: the estimated item difficulties and overall cut score, and the actual measures of item difficulties and obtained scores. These data could be compared using means, standard deviations, and correlations to examine the extent to which judges' estimations mirror their actual performance. Such a comparison may also shed light on some post-analysis options, such as the benefit of dropping extreme scores in the calculation of passing or cut scores.
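The comparison proposed above might be sketched as follows. The sketch assumes the usual Nedelsky calculation, in which each item contributes the reciprocal of the number of options remaining after a judge eliminates those a minimally competent candidate would recognize as wrong, and the reciprocals are summed to give that judge's cut score. The panel data, the judge labels, and the four-option item format are hypothetical.

```python
import math
from statistics import mean, stdev


def nedelsky_cut_score(eliminated_counts, n_options=4):
    """Sum of item reciprocals for one judge.

    eliminated_counts[i] is the number of wrong options the judge believes a
    minimally competent candidate could rule out on item i.
    """
    total = 0.0
    for eliminated in eliminated_counts:
        remaining = n_options - eliminated
        if eliminated < 0 or remaining < 1:
            raise ValueError("eliminations must leave at least one option")
        total += 1.0 / remaining  # chance of guessing among remaining options
    return total


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical panel: eliminations per item, then the judge's own answers
# scored 1 (correct) or 0 (incorrect) on the same five items.
panel = {
    "judge_a": {"eliminated": [3, 2, 1, 0, 2], "correct": [1, 1, 0, 1, 1]},
    "judge_b": {"eliminated": [2, 2, 2, 1, 3], "correct": [1, 1, 1, 1, 1]},
    "judge_c": {"eliminated": [1, 0, 2, 1, 1], "correct": [1, 0, 1, 0, 1]},
}

estimated = [nedelsky_cut_score(j["eliminated"]) for j in panel.values()]
obtained = [sum(j["correct"]) for j in panel.values()]

print(f"Estimated cut scores: mean={mean(estimated):.2f}, SD={stdev(estimated):.2f}")
print(f"Obtained scores:      mean={mean(obtained):.2f}, SD={stdev(obtained):.2f}")
print(f"Correlation (estimated vs. obtained): {pearson_r(estimated, obtained):.2f}")
```

The same functions could be applied item by item rather than judge by judge, for example by correlating the mean item reciprocal across judges with the proportion of judges answering each item correctly.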
A second area that needs further study is the use of this method to set passing scores for performance measures such as the performance sections of the NOCTI exams. In order to employ the method used in this study, it was necessary to adapt it to the five-point scale that NOCTI exams currently use. The NOCTI scales were written to be behavioral descriptions, and the observations of panel members, as well as the researchers, were that they often did not yield an equal-interval continuum. Therefore, asking judges to choose a point on the scale that would be expected of a minimally competent worker was difficult, and a large majority of judgments were rendered in the four- to five-point range. The panel members also observed that some of the scales were awkward because it was difficult to write five behavioral descriptions to fit the tasks required. Although the attempt to construct behavioral scales is to be appreciated as a mechanism to make judging less subjective, behavioral scales may not work well for the type of activities utilized in the NOCTI performance exams. Rather, it may be better to simply set 4 on a 5-point scale as the passing competency level for each component of the exam to be scored and let the judges (evaluators) decide whether the observed performance is at level 4, above it (level 5), or below it (level 3, 2, or 1).
Using this approach to scoring performance exams would allow judges more flexibility, and it would be consistent with a criterion- or performance-based standard that would reside in the judgment of competent evaluators rather than the exam writers. With this approach, the overall score for an examinee could still be obtained by averaging over all components to be scored, with an average of 4 needed to pass the exam; or the exam could be scored in major topical areas, with "passing" set at an average of 4 in each area.
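The two scoring rules just described can give different pass/fail decisions for the same examinee, as the following minimal sketch suggests. The passing level of 4 comes from the text; the topical area names and ratings are hypothetical, and the functions are illustrative rather than a description of NOCTI's actual scoring procedure.

```python
from statistics import mean

PASSING_LEVEL = 4.0  # level 4 on the 5-point scale, as proposed in the text


def passes_overall(ratings_by_area, passing=PASSING_LEVEL):
    """Pass if the average over all scored components reaches the level."""
    all_ratings = [r for ratings in ratings_by_area.values() for r in ratings]
    return mean(all_ratings) >= passing


def passes_by_area(ratings_by_area, passing=PASSING_LEVEL):
    """Pass only if the average within every major topical area reaches it."""
    return all(mean(ratings) >= passing for ratings in ratings_by_area.values())


# Hypothetical examinee: evaluator ratings (1-5) for components in three areas
examinee = {
    "safety": [5, 4, 4],
    "fabrication": [3, 4, 4],
    "finishing": [5, 5, 4],
}

print("Overall-average rule:", passes_overall(examinee))
print("Per-area rule:       ", passes_by_area(examinee))
```

In this hypothetical case the overall average exceeds 4 while one area falls below it, so the examinee would pass under the first rule and fail under the second; the per-area rule is the stricter of the two by construction.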
Which of the possible approaches to judging and scoring performance exams will work best is a question that could benefit from much further study. With some creativity on the part of the researcher, it may be possible to compare both the behavioral and nominal five-point scales using the same judges and examinees to see what differences result in terms of passing decisions.
Angoff, W. H. (1971). Norms, scales, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, DC: American Council on Education.
Behuniak, P., Jr., Archambault, F. X., & Gable, R. K. (1982). Angoff and Nedelsky standard setting procedures: Implications for the validity of proficiency test score interpretation. Educational and Psychological Measurement, 10, 95-105.
Bureau of Vocational Education. (1977). Pennsylvania policy manual for administration of the occupational competency assessment program for vocational instructional certification candidates and vocational intern candidates. Harrisburg: Pennsylvania Department of Education.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New York: Holt, Rinehart and Winston.
Department of Public Instruction. (1928). Certification of teachers. Harrisburg: Commonwealth of Pennsylvania.
Department of Public Instruction. (1939). Administration of vocational education in Pennsylvania (Bulletin 101). Harrisburg: Commonwealth of Pennsylvania.
Ebel, R. L. (1972). Essentials of educational measurement (2nd ed.). Englewood Cliffs, NJ: Prentice-Hall.
Hambleton, R. K. (1998, January). Setting performance standards on achievement tests: Meeting the requirements of Title I. A commissioned paper for the Council of Chief State School Officers, Washington, DC.
Jaeger, R. M. (1982). An iterative structured judgment process for establishing standards on competency tests: Theory and application. Educational Evaluation and Policy Analysis, 4, 461-475.
Jaeger, R. M. (1991). Selection of judges for standard setting. Educational Measurement: Issues and Practice, 10, 3-6, 10.
Kapes, J. T. (1998). Setting cutoff scores and equating alternate measures for occupational competency assessment. Issue paper prepared for the Pennsylvania Department of Education, Bureau of Vocational Technical Education.
Kapes, J. T. (2001). Pennsylvania pass/fail cutoff scores for NOCTI written and performance exams based on 2001 national norm data. Report prepared for the Bureau of Career and Technical Education, Pennsylvania Department of Education.
Kapes, J. T., & Welch, F. G. (1985). Final report: Review of the scoring procedures for the occupational competency assessment program in Pennsylvania. University Park: Division of Occupational and Vocational Studies, The Pennsylvania State University.
Livingston, S. A., & Zieky, M. J. (1982). Passing scores: A manual for setting standards of performance on educational and occupational tests. Princeton, NJ: Educational Testing Service.
Nedelsky, L. (1954). Absolute grading standards for objective tests. Educational and Psychological Measurement, 14, 3-19.
Panitz, A., & Olivo, C. T. (1970). National occupational competency testing project: The state of the art of occupational competency testing. New Brunswick: Department of Vocational-Technical Education, Rutgers University.
Pennsylvania Department of Transportation. (2002). Crossroads: Stories about teen driving. Retrieved February 20, 2002, from http://www.dmv.state.pa.us/crossroads/quizzes/quizhome.html
Walter, R. A. (1984). An analysis of selected occupational competency assessment candidate characteristics and successful teaching. Unpublished doctoral dissertation, The Pennsylvania State University, University Park.
Walter is Associate Professor in the Department of Workforce Education and Development at The Pennsylvania State University in University Park, Pennsylvania and can be reached at firstname.lastname@example.org. Kapes is Professor Emeritus of Educational Psychology at Texas A&M University in College Station, Texas and can be reached at email@example.com.