Post Hoc Analysis of Test Items Written by Technology Education Teachers
W. J. Haynie, III
Technology education teachers frequently author their
own tests. The effectiveness of tests depends upon many
factors, however, it is clear that the quality of each
individual item is of great importance. This study sought
to determine the quality of teacher-authored test items in
terms of nine rating factors.
BACKGROUND
Most testing in schools employs teacher-made tests
(Haynie, 1983, 1990, 1991; Herman & Dorr-Bremme, 1982;
Mehrens & Lehmann, 1987; Newman & Stallings, 1982). Despite
this dependance upon teacher-made tests, Stiggins, Conklin,
and Bridgeford (1986) point out that "nearly all major
studies of testing in the schools have focused on the role
of standardized tests" (p. 5).
Research concerning teacher-constructed tests has found
that teachers lack understanding of measurement (Fleming &
Chambers, 1983; Gullickson & Ellwein, 1985; Mehrens &
Lehmann, 1987; Stiggins & Bridgeford, 1985). Research has
shown that teachers lack sufficient training in test
development, fail to analyze tests, do not establish
reliability or validity, do not use a test blueprint, weight
all content equally, rarely test above the basic knowledge
level, and use tests with grammatical and spelling errors
(Burdin, 1982; Carter, 1984; Gullickson, 1982; Gullickson
& Ellwein, 1985; Hills, 1991). Technically their tests are
simplistic and depend upon short answer, true-false, and
other easily prepared items. Their multiple-choice items
often have serious flaws--especially in distractors (Haynie,
1990; Mehrens & Lehmann, 1984, 1987; Newman & Stallings,
1982).
A few investigations have studied the value of tests as
aids to learning subject content (Haynie, 1987, 1990, 1991;
Nungester & Duchastel, 1982). Time on-task has been shown
to be very important in many studies (Jackson, 1987; Salmon,
1982; Seifert & Beck, 1984). Taking a test is a time
on-task learning activity. Works which studied testing
versus similar on-task time spent in structured review of
the material covered in class have had mixed results, but
testing appears to be at least as effective as reviews in
promotion of learning (Haynie, 1990; Nungester & Duchastel,
1982). Research is lacking on the quality of tests and test
items written by technology education teachers.
PURPOSE
The purpose of this investigation was to study the
quality of technology education test items written by
teachers. Face validity, clarity, accuracy in identifying
taxonometric level, and rates of spelling and punctuation
errors were some of the determinants of quality assessed.
Additionally, data were collected concerning teachers'
experience levels, highest degree held, and sources of
training in test construction. The following research
questions were addressed in this study:
1. What types of errors are common in test items?
2. Do the error rate or types of errors in teacher
constructed test items vary with demographic factors?
3. Do teachers understand how to match test items to
curriculum content and taxonometric level?
METHODOLOGY
SOURCE OF DATA
Between April 23, 1988 and January 8, 1990, a team of
15 technology education teachers worked to develop test
items for a computerized test item bank for the North
Carolina State Department of Public Instruction (SDPI). The
work was completed under two projects funded by SDPI and
directed by DeLuca and Haynie (1989, 1990) at North Carolina
State University. The data for this study came from the
items developed in those projects.
TEST ITEM AUTHORS
The teachers were selected on recommendation of
supervisors, SDPI consultants, or teacher educators. All
were recognized as leaders among their peers and most had
been nominated for teacher of the year or program of the
year commendation. They were all active in the North
Carolina Technology Education Association and supported the
transition to the new curriculum. Table 1 displays
demographic data concerning the test item authors.
TABLE 1
PROFILE OF AUTHORS' DEMOGRAPHIC FACTORS
---------------------------------------------------
Graduate
Years of Undergraduate Test &
Teaching Highest Test & Measure Measure
Author Experience Degree Courses Courses
---------------------------------------------------
1 9 B.S. 0 0
2 5 B.S. 1 0
3 23 B.S. 0 0
4 4 B.S. 0 1
5 5 B.S. 0 1
6 23 M.Ed. 0 1
7 19 M.Ed. 0 1
8 17 M.Ed. + 2 yrs. 0 2
9 25 M.Ed. 0 0
10 5 M.Ed. 0 0
11 7 M.Ed. 0 0
12 7 B.S. 0 0
13 7 M.Ed. 0 0
14 15 B.S. 1 0
15 5 B.S. 1 1
---------------------------------------------------
TRAINING OF AUTHORS
Teachers came to the university campus for a workshop
on April 23, 1988. Project directors oriented teachers to
the computerized test bank, reviewed the revised technology
education curriculum, and explained how to develop good test
items. A 13 page instructional packet was also given to
each author. It should be noted that the training session
and instructional packet may confound attempts to generalize
these findings.
The authors were required to develop and properly code
six items which were submitted for approval and corrective
feedback before they were allowed to proceed. The teachers
who authored the items were paid an honorarium for their
services.
EDITING AND CODING OF ITEMS
Each item was prepared on a separate sheet of paper
with a coding sheet attached and completed by the teacher.
The coding sheet identified the author, the specific
objective tested, the taxonometric level, and information
for the computerized system. The project directors edited
the items with contrasting colored felt tip pens on the
teachers' original forms.
DESIGN OF THIS STUDY
The data for this investigation were the editing
markings on the original test items submitted by the
teachers. Scores for 9 scales of information were recorded
for analysis. Each of the scales was established so that a
low score would be optimal. The scales were Spelling Errors
(SE), Punctuation Errors (PE), Distractors (D), Key (K),
Usability (U), Validity (V), Stem Clarity (SC), Taxonomy
(TX), and an overall Quality (Q) rating. After all of the
ratings were completed, the General Linear Models (GLM)
procedure was used for F testing and the LSD procedure was
used when t-tests were appropriate.
FINDINGS
SPELLING ERRORS (SE)
The frequency and percentage of scores for the 993
items on the nine ratings, and mean scores of each factor,
are shown in Table 2. An item's SE rating indicates how
many words were misspelled in the item. There were 98 items
(10%) which had one or more spelling errors. Spelling
errors are detrimental to good teaching and testing. However
the literature shows that this problem is common to other
disciplines.
TABLE 2
RATINGS OF TEST ITEM QUALITY
-----------------------------------------------------------
Frequency of % of Mean
Items With Items/ Item
Rating Category Score Each Score Score Score SD
-----------------------------------------------------------
Spelling Errors (SE) 0 895 90.1
1 76 7.7
2 11 1.1
3 6 0.6
4 3 0.3
5 1 0.1
6 1 0.1
SE Totals --- 993 100% 0.14 0.52
-----------------------------------------------------------
Punctuation Errors(PE) 0 735 74.0
1 220 22.2
2 25 2.5
3 4 0.4
4 1 0.1
5 8 0.8
PE Totals --- 993 100% 0.38 0.68
-----------------------------------------------------------
Distractors (D) 0 447 45.0
1 398 40.1
2 95 9.6
3 30 3.0
4 9 0.9
5 14 1.4
D Totals --- 993 100% 0.79 0.96
-----------------------------------------------------------
Key (K) 0 889 89.5
2 104 10.5
K Totals --- 993 100% 0.21 0.61
-----------------------------------------------------------
Usability (U) 0 249 25.1
1 265 26.7
2 159 16.0
3 131 13.2
4 74 7.5
5 50 5.0
6 21 2.1
7 11 1.1
8 16 1.6
9 17 1.7
U Totals --- 993 100% 2.02 2.04
-----------------------------------------------------------
Stem Clarity (SC) 0 602 60.6
1 352 35.4
2 39 3.9
SC Totals --- 993 100% 0.43 0.57
-----------------------------------------------------------
Taxonomy (TX) 0 835 84.1
1 124 12.5
2 34 3.4
TX Totals --- 993 100% 0.19 0.47
-----------------------------------------------------------
Quality (Q) 0 208 20.9
1 235 23.7
2 200 20.1
3 129 13.0
4 74 7.5
5 58 5.8
6 42 4.2
7 17 1.7
8 10 1.0
9 12 1.2
10 2 0.2
11 3 0.3
12 1 0.1
13 1 0.1
14 1 0.1
15 0 ---
16 0 ---
17 1 0.1
Q Totals ---- 993 100% 2.28 2.20
----------------------------------------------------------
NOTE. There were 993 items.
The authors were compared on each of the scales to
determine whether they differed significantly and to see if
similar or dissimilar errors were made by different authors.
On the spelling errors factor authors were found to differ
significantly: F(14, 978) = 11.99, p<.0001. Follow-up
analysis with the LSD procedure showed that 5 authors had
significantly fewer spelling errors and 3 authors had more
than the average number of errors in spelling (Table 3).
Two of the authors with numerous spelling errors also had
other defects and were rated significantly worse in the
overall Quality (Q) rating (authors 1 and 9). However, only
1 of the authors with a significantly low rate of spelling
errors was rated favorably in the Quality rating, so
spelling accuracy alone is insufficient to identify good
test item writing ability.
TABLE 3
MEANS OF EACH AUTHOR ON THE 9 RATING CATEGORIES
-----------------------------------------------------------
N Per Item Means
Author Items SE PE D K U V SC TX Q
-----------------------------------------------------------
1 92 0.29 0.37 1.29 0.68 2.95 0.09 0.53 0.24 3.51
** ** ** ** * **
2 102 0.01 0.17 0.59 0.12 1.34 0.16 0.38 0.11 1.54
* *
3 32 0.21 0.41 1.16 0.44 2.88 0.28 0.47 0.59 3.56
** ** ** ** **
4 103 0.17 0.39 1.28 0.33 2.76 0.16 0.49 0.17 2.98
** **
5 100 0.17 0.39 0.94 0.24 3.01 0.22 0.67 0.29 0.92
** ** **
6 56 0.11 0.38 1.14 0.32 2.25 0.32 0.43 0.34 3.04
** ** **
7 62 0.26 0.24 0.55 0.26 1.77 0.13 0.39 0.35 2.18
** **
8 104 0.07 0.22 0.71 0.17 1.70 0.38 0.38 0.19 2.13
* **
9 42 0.43 0.83 1.21 0.09 3.21 0.26 0.79 0.29 3.90
** ** ** ** ** ** ** **
10 50 0.04 0.98 0.16 0.00 1.46 0.00 0.28 0.04 1.50
* ** * * * * *
11 46 0.00 0.28 0.74 0.00 1.85 0.13 0.35 0.09 1.59
* * *
12 28 0.21 0.07 0.39 0.00 1.04 0.11 0.29 0.18 1.25
* * * *
13 82 0.06 0.01 0.14 0.02 0.71 0.30 0.23 0.07 0.85
* * * * * ** * * *
14 48 0.13 0.31 0.29 0.00 1.19 0.04 0.42 0.08 1.27
* * * * * *
15 46 0.09 0.17 0.87 0.04 1.54 0.00 0.26 0.00 1.43
* * * * *
Grand ___ ____ ____ ____ ____ ____ ____ ____ ____ ____
Means --- 0.14 0.33 0.79 0.21 2.02 0.19 0.43 0.19 2.28
---------------------------------------------
NOTE. There were 993 items.
* Significantly low (better), p<.05.
** Significantly high (worse), p<.05.
Years of teaching experience and other demographic data
were presented in Table 1. Teachers were divided into two
groups of experience level: fewer than 8 years experience
(8 teachers who authored 557 items), and more than 8 years
experience (7 authors, 436 items). On the Spelling Errors
factor these groups were compared and there was a
significant finding of F(1, 991) = 10.48, p<.0012. Follow-up
analysis by the LSD procedure showed that the less
experienced teachers had significantly fewer spelling
errors. None of the other demographic variables were found
to differ significantly on the rate of spelling errors.
PUNCTUATION ERRORS (PE)
The PE rating (Table 2) was the total number of
punctuation errors. The most frequent errors were omission
of punctuation at the end of the stem or use of the wrong
punctuation there. Frequently statements were ended with
question marks or stems which should have ended with a colon
were left with no punctuation. This score may be inflated
spuriously by those unique errors which may not have been
made in normal prose writing by the same teachers. Among
the 15 authors, a significant difference was found in the PE
category: F(14, 978) = 8.12, p<.0001 (Table 3). No
significant differences were found among any demographic
variables on the rate of punctuation errors.
DISTRACTORS (D)
Errors in distractors other than spelling or
punctuation were summed in the Distractors (D) category
(Table 2). Frequently these errors either eliminated
distractors or targeted the correct answer due to
incompatibility between the stem and the alternatives
because of lack of agreement in: singular, plural,
introductory article, tense, or in one case even gender.
A significant finding of F(14, 978) = 13.37, p<.0001
was attained and follow-up by LSD showed that 3 authors (10,
13, and 14) had significantly lower error rates. Two of
those 3 authors who had superior distractors were also among
the best in the overall Quality rating. All three of the
authors who rated poorest in the overall Quality rating,
also rated significantly worse in this Distractors category.
Apparently this is one aspect of test writing which needs to
be stressed to teachers.
All 4 of the demographic variables studied were found
to be significantly related to errors in distractors: Years
of experience, F(1, 991) = 10.55, p<.0012, the less
experienced teachers authored superior distractors; Highest
degree held, F(1, 991) = 23.21, p<.0001, those with graduate
degrees wrote better distractors; Undergraduate courses,
F(1, 991) = 11.46, p<.0007, those who had taken an
undergraduate testing and measurement course prepared better
distractors; and Graduate courses, F(1, 991) = 13.23,
p<.0003, graduate courses also appeared beneficial.
KEY (K)
The Key (K) rating simply indicates whether the answer
marked in the teacher's original version of the item was
indeed correct. Since incorrect keying was considered a
more damaging error than a misspelled word or other common
error, a rating of 2 was given for incorrectly keyed items.
This resulted in greater increase of the summation
categories (Usability and Quality) due to incorrect keying
than for other types of errors. Regrettably, 10.5% of the
items were keyed incorrectly (Table 2).
The authors differed significantly in the Key rating:
F(14, 978) = 8.01, p<.0001. Table 3 shows the teachers'
means and the results of LSD comparisons. Six authors keyed
their items more accurately than others and one teacher was
very inaccurate in keying. Teachers with less than eight
years of experience keyed more accurately than more
experienced teachers, F(1, 991) = 19.82, p<.0001; and
teachers with graduate degrees also more accurately keyed
their items, F(1, 991) = 12.90, p<.0003.
USABILITY (U)
The Usability (U) rating was found by counting all
proofreading and editing marks of all types on the teachers'
original forms--thus it included the sum of all the above
categories plus other errors and defects not included in
them. An example of an error which would not be counted in
the first four ratings but would be included here is an item
which begins with a blank. Such an item would have a U
rating which equalled the sum of all SE, PE, D, and K
ratings plus 1.
The teachers did differ significantly when compared on
the Usability of their items: F(14, 978) = 11.99, p<.0001.
Comparisons via LSD found that three teachers developed
items with superior usability and five teachers authored
significantly less usable items (Table 3). The teachers
with fewer than eight years of experience developed more
usable test items according to this rating: F(1, 991) =
7.47, p<.0064. Teachers with graduate degrees wrote more
useful items, F(1, 991) = 16.42, p<.0001, and both
undergraduate and graduate testing and measurement courses
appeared to be effective in helping teachers develop usable
items: Undergraduate courses, F(1, 991) = 26.68, p<.0001;
and Graduate courses, F(1, 991) = 12.05, p<.0005.
VALIDITY (V)
Items were carefully read and compared to the
objectives they were intended to test. A Validity (V) rating
of 0 indicated the item clearly possessed face validity. An
item which was obviously off the subject was rated 2 and
items which tested information immediately adjacent to the
intended information were rated 1 to indicate that validity
was questionable.
The authors differed significantly in how valid their
items appeared to be: F(14, 978) = 3.99, p<.0001. It is
noteworthy that the Validity rating did not necessarily
correspond to others in the study. One of the authors
(number 1) who rated significantly better in terms of
validity was one of the worst rated authors in five other
categories. Likewise, one other author (number 13) who rated
superior in eight other categories (including Q) was
significantly worse in the Validity category.
The findings related to the demographic variables were:
Less experienced teachers wrote more valid items, F(1, 991)
= 4.32, p<.038; teachers with only Bachelor's degrees wrote
more valid items than those with graduate degrees, F(1, 991)
= 11.47, p<0007; teachers who had experienced undergraduate
test and measurement courses submitted more valid items,
F(1, 991) = 9.29, p<.0024; and graduate courses also helped
teachers write more valid items, F(1, 991) = 10.01. p<.0018.
STEM CLARITY (SC)
Stem Clarity (SC) was a subjective rating indicating
how clearly understandable the stem appeared. If the item's
stem seemed clear enough to lead knowledgeable students to
the correct response, regardless of other types of errors
(SE, PE, D, K, U, or V ratings), then that item was rated 0
in the SC category. Items which were confusing to read with
no clear purpose set forth in the stem were rated 2. Items
which would likely work but had some element of confusion
were rated 1. Table 2 shows that most items were judged to
be reasonably clear in intention.
The finding of F(14, 978) = 4.57, p<.0001 documents
that teachers did vary in their ability to write clear item
stems. It would seem reasonable to assume that authors who
made many spelling and punctuation errors would also have
difficulty wording their stems clearly. This, however, was
not true in these findings. Of the demographic factors
investigated, only highest degree held was related to the
ability to prepare clearly worded stems: F(1, 991) = 6.34,
p<.0120, teachers with graduate degrees developed superior
items in terms of stem clarity.
TAXONOMY (TX)
The Taxonomy (TX) rating indicates the extent to which
teachers accurately identified the taxonometric level of the
cognitive domain for each item. Teachers prepared items to
match specific objectives and then coded them. The codes
used were derived from the first three levels of Bloom's
Taxonomy: 1 indicated simple knowledge, 2 indicated
comprehension, and 3 indicated application or higher levels
of learning.
Of the 993 items prepared for the test item bank,
the authors indicated that they felt 559 (56%) operated at
level 1 (knowledge), 379 (38%) operated at level 2
(comprehension), and only 55 (5.5%) operated at level 3
(application or above). The rating in the TX category
assigned for this study indicates how well, in the
researcher's judgement, the item authors had accurately
identified the proper taxonometric level. This was done
after reading the objective to be tested by each item and
then carefully reading the item to see if it operated at the
level indicated by the teacher. A rating of 0 in the TX
category indicates that the item appeared to be accurately
coded by the teacher. A rating of 2 indicated that there
was a clear mismatch between the level at which the teacher
desired the item to function and the level at which the
researcher judged the item would actually operate. Ratings
of 1 in the TX category indicate that the researcher felt
the author's coding was questionable.
Table 2 shows that 84% (835) of the items had been
correctly coded for taxonometric level. Teachers did vary
significantly in their ability to code items according to
taxonomy: F(14, 978) = 5.20, p<.0001. All teachers who
rated poor in this rating also had poor ratings in at least
one other category, most rated poor in at least two others.
Teachers who rated superior in the TX rating also rated
superior in at least two other ratings. Teachers with less
than 8 years of experience were significantly more accurate
in coding by taxonomy than the more experienced teachers,
F(1, 991) = 21.08, p<.0001. Undergraduate test and
measurement courses, F(1, 991) = 9.29, p<.0024, appeared to
be helpful in enabling teachers to identify the correct
taxonometric level of test items, however, graduate courses
were not found to be a significant factor here, F(1, 991) =
2.65, p<.0711.
QUALITY (Q)
The overall Quality of the test items was summarized in
the Q rating. The Q rating was found by summing all of the
other ratings except Usability (U), which was already a
partial summation. The Q ratings (Table 2) range from 0 (an
item judged to need no editing of any sort and believed to
operate exactly as the submitting author had intended) to a
high value of 17.
A finding of F(14, 978) = 14.79, p<.0001, shows that
teachers differed in Q ratings (see Table 3). All of the
teachers who differed significantly in the Q rating had also
differed in several other categories. Experienced teachers
prepared items with poorer overall quality than
inexperienced teachers: F(1. 991) = 20.67, p<.0001.
Teachers with graduate degrees produced items identified to
have better quality: F(1, 991) = 13.44, p<.0003.
Undergraduate test and measurement courses helped teachers
develop higher quality items, F(1, 991) = 35.45, p<.0001,
and so did graduate courses, F(1, 991) = 11.14, p<.0009.
DISCUSSION
Though the sample included only 15 teachers, the
findings presented in this study suggest that technology
education teachers have some of the same difficulties in
developing useful test items that teachers in other
disciplines face. Despite the fact that these carefully
selected teachers were given special training to improve
their items, less than 21% of the items they prepared were
flawless. Earlier works identified spelling, punctuation,
grammar, clarity, validity, reliability, taxonometric level,
problems in distractors, and other mechanical factors to be
problem areas in teacher-made tests. Six of these problems
were investigated in this study. Additionally, errors in
keying items, a general overall quality assessment, and
preparation of technology education teachers to write test
items were factors considered by this study.
It was demonstrated that teachers differed
significantly in their ability to prepare good test items,
and that undergraduate and graduate courses in testing and
measurement, though they appear to be helpful in many ways,
are not taken by all teachers. These courses improved
teachers' ability in developing distractors, and preparing
valid and useful items. Undergraduate courses were also
shown to help teachers identify the proper taxonometric
level of their items.
Teachers with graduate degrees developed items which
were superior in 5 of the ratings in this study:
distractors, keying of items, usability, stem clarity, and
overall quality. However, teachers who had only Bachelor's
degrees were significantly better in developing items judged
to have good face validity.
Teachers with fewer than 8 years of experience
developed items with better overall quality (Q rating) than
those who had more experience. The less experienced
teachers significantly outperformed their more experienced
peers on 7 of the quality factors studied: spelling,
distractors, key accuracy, usability, validity, taxonomy,
and overall quality. These findings were unanticipated and
could possibly be explained by any of several competing
theories. Perhaps teachers who have been in the profession
longer than 8 years have begun to burn out and have less
time or patience to devote to extra assignments such as the
test item development projects in which they participated.
Alternatively, it could simply be true that teachers who
earned their degrees in recent years had received better
preparation to develop test items. Still another
possibility is that this could be a spurious finding due to
the small sample size (15 teachers) or some other unknown
error in sampling.
This investigation did not examine the validity of
teachers' total tests. It was limited to study of
individual items. Often, when an item was judged to lack
face validity, another item for an adjacent objective was
better suited and the pair of items together was valid to
test the two objectives. This informal finding would be
difficult to quantify and demonstrate. However, since 85%
of the items were judged to have good face validity and only
4% were judged to be invalid, if any sizeable portion of the
remaining 10% (judged marginally valid) were in fact
usefully valid or could become valid when switched with
neighboring items on the same test, then it would be safe to
conclude that these technology teachers can develop
reasonably valid tests.
Previous research has shown tests to be time on-task
activities which promote learning of the subject matter
tested. One criticism of teacher-made tests has been that
they waste time. If the tests are good ones then much of
the time devoted to them may be well spent. However, poorly
developed tests would still be a waste of time for learning
and evaluation purposes. This study identified several
weaknesses in test items developed by teachers. Other
factors, such as selection of different types of items for
differing objectives, total test validity, problems in
scoring and grading, instructions to students about tests,
and others could not be addressed in this particular
study--but they remain as important research problems.
These questions need to be answered before meaningful
conclusions can be drawn about the learning value of time
students spend taking teacher-made tests.
It is concluded that technology teachers could be
better prepared to develop tests if more of them were
required to take a testing and measurements course. It is
also concluded that the teachers in this sample are
generally capable of developing valid test items, but that
the items teachers prepare vary in the 9 aspects of overall
quality as predicted by previous research.
REFERENCES
Burdin, J.L. (1982). Teacher certification. In H.E. Mitzel
(Ed.), Encyclopedia of education research (5th ed.). New
York: Free Press.
Carter, K. (1984). Do teachers understand the principles for
writing tests? Journal of Teacher Education, 35(6),
57-60.
DeLuca, V.W. & Haynie, W.J. (1990). Updating,
computerization, and field validation of
competency-based test-item banks for selected
construction and communications technology
courses (Contract No. RFP 90-A-07). Raleigh, NC: North
Carolina State Department of Public Instruction.
DeLuca, V.W. & Haynie, W.J. (1989). Updating,
computerization , and field validation of
competency-based test-item banks for selected
manufacturing technology education courses (Contract No.
RFP 88-R-03). Raleigh, NC: North Carolina State
Department of Public Instruction.
Fleming, M. & Chambers, B. (1983). Teacher-made tests:
Windows on the classroom. In W. E. Hathaway (Ed.),
Testing in the schools: New directions for testing and
measurement, NO. 19 (pp.29-38). San Francisco:
Jossey-Bass.
Gullickson, A.R. (1982). Survey data collected in survey of
South Dakota teachers' attitudes and opinions toward
testing. Vermillion: University of South Dakota.
Gullickson, A.R. & Ellwein, M.C. (1985). Post hoc analysis
of teacher-made tests: The goodness-of-fit between
prescription and practice. Educational Measurement:
Issues and Practice, 4(1), 15-18.
Haynie, W.J. (1983). Student evaluation: The teachers' most
difficult job. Monograph Series of the Virginia
Industrial Arts Teacher Education Council, Monograph
Number 11.
Haynie, W.J. (1987). Anticipation of tests as a learning
variable. Unpublished manuscript, North Carolina State
University, Raleigh, NC.
Haynie, W.J. (1990). Effects of tests and anticipation of
tests on learning via videotaped materials. Journal of
Industrial Teacher Education, 27(4), 18-30.
Haynie, W.J. (1991). Effects of take-home and in-class tests
on delayed retention learning acquired via
individualized, self-paced instructional texts.
Manuscript submitted for publication.
Herman, J. & Dorr-Bremme, D.W. (1982). Assessing
students: Teachers' routine practices and reasoning.
Paper presented at the annual meeting of the American
Educational Research Association, New York.
Hills, J.R. (1991). Apathy concerning grading and testing.
Phi Delta Kappan, 72(7), 540-545.
Jackson, S.D. (1987). The relationship between time and
achievement in selected automobile mechanics classes.
(Doctoral dissertation, Texas A&M University).
Mehrens, W.A. & Lehmann, I.J. (1984). Measurement and
Evaluation in Education and Psychology. 3rd ed. New
York: Holt, Rinehart, and Winston.
Mehrens, W.A. & Lehmann, I.J. (1987). Using teacher-made
measurement devices. NASSP Bulletin, 71(496), 36-44.
Newman, D.C. & Stallings, W.M. (1982, March). Teacher
competency in classroom testing, measurement
preparation, and classroom testing. Paper
presented at the Annual Meeting of the National Council
on measurement in Education. (In Mehrens & Lehmann,
1987)
Nungester, R.J. & Duchastel, P.C. (1982). Testing versus
review: Effects on retention. Journal of Educational
Psychology, 74(1), 18-22.
Salmon, P.B. (Ed.). (1982). Time on task: Using
instructional time more effectively. Arlington, VA:
American Association of School Administrators.
Seifert, E.H. & Beck, J.J. (1984). Relationships between
task time and learning gains in secondary schools.
Journal of Educational Research, 78(1), 5-10.
Stiggins, R.J. & Bridgeford, N.J. (1985). The ecology of
classroom assessment. Journal of Educational
Measurement, 22(4), 271-286.
Stiggins, R.J., Conklin, N.F. & Bridgeford, N.J. (1986).
Classroom assessment: A key to effective education.
Educational Measurement: Issues and Practice, 5(2),
5-17.
----------------
W.J. Haynie, III is Associate Professor, Department of
Occupational Education, North Carolina State University,
Raleigh, NC.
Permission is given to copy any
article or graphic provided credit is given and
the copies are not intended for sale.
Journal of Technology Education Volume 4, Number 1 Fall 1992