Effects of Multiple-Choice and Matching Tests On Delayed Retention Learning In Postsecondary Metals Technology
W. J. Haynie, III
North Carolina State University
The importance of testing in education and the many value-charged issues surrounding it make testing an important research topic. Research on testing has historically concerned standardized tests, while a large amount of evaluation in the schools is accomplished via teacher-made tests (Haynie, 1983, 1990a; Herman & Dorr-Bremme, 1982; Mehrens, 1987; Mehrens & Lehmann, 1987; Moore, 2001; Newman & Stallings, 1982; Stiggins, Conklin, & Bridgeford, 1986). The issues of teacher-made tests that should be investigated include frequency of use, quality, benefits for student learning, optimal types to employ, and usefulness in evaluation. Previous findings cast some doubt on the ability of teachers to develop effective tests (Carter, 1984; Fleming & Chambers, 1983; Gullickson & Ellwein, 1985; Haynie, 1992, 1997a; Hoepfl, 1994; Moore; Stiggins & Bridgeford, 1985). Even so, Mehrens and Lehmann pointed out the importance of teacher-made tests in the classroom and their recognized ability to be tailored to specific instructional objectives. Teacher-made tests continue to be important in technology education and a crucial area for research (Haynie, 1990b; Mehrens & Lehmann).
One type of teacher-made test that warrants study is the matching test. Many similarities exist between matching and multiple-choice tests. A series of related multiple-choice test items may be converted to the matching test format, but care must be taken not to invalidate some items due to clues found in other parts of the test (Moore, 2001). There is also the problem that occurs when one assembles unrelated bits of information into confusing matching items. Some teacher-made matching tests suffer from these problems (Haynie, 1983, 1992; Moore, 2001). Since technology education teachers do make frequent use of matching items, it is timely to conduct research about their effectiveness for evaluation and their effects on student learning.
The effectiveness of tests in promoting delayed retention has been the focus of several studies in various settings (Haynie, 1990a, 1990b, 1991, 1994, 1995, 1997b; Nungester & Duchastel, 1982). In general, these studies have shown test-taking to enhance delayed retention learning. However, reviewers of some of the earlier works in a technology education setting by Haynie (1990a, 1990b, 1991, 1994) criticized four aspects of the protocol. First, students in the control groups did not expect to be tested and may not have studied the information in earnest. Second, the design of the studies made it difficult to distinguish learning gains due to the act of taking the test from those made during increased study time prior to the test. Third, students in all of the experimental and control groups did not expect the results of any of the tests to count in the determination of their course grades, so they may not have taken a serious approach to the entire unit of instruction. Finally, the only attempt to ensure equal ability among the groups in those studies was randomization of treatment assignment; no pretesting was done. In the present study, care was taken to avoid all of these problems and to ensure equal entering ability via comparison of scores on a pretest.
Purpose of the Study
The purpose of this study was to investigate the value of multiple-choice and matching tests as aids to retention learning within a technology education context. Retention learning (Duchastel, 1981) as used here refers to learning that lasts beyond the initial testing and is assessed with tests administered two or more weeks after the information has been taught and tested. A delay period of three weeks was used in this study. Initial testing (Duchastel) refers to the commonly employed evaluation via tests that occurs at the time of instruction or immediately thereafter. Delayed retention tests are research instruments administered two or more weeks after instruction and initial testing to measure retained knowledge (Duchastel; Haynie, 1990a, 1990b, 1991, 1994, 1995, 1997b; Nungester & Duchastel, 1982). The delayed retention test results were the only experimental data analyzed in this investigation.
In addition to studying the relative learning benefits of matching and multiple-choice tests, this study attempted to determine if it was exposure to the initial test or the benefits of time spent studying that enhanced delayed retention. The following research questions were addressed by this study.
- If delayed retention learning is the objective of instruction, does initial testing of the information aid retention learning?
- Does initial testing by matching tests aid retention learning as effectively as initial testing by multiple-choice tests?
- Will information that is not reflected on the initial tests be learned equally well by students assessed via matching and multiple-choice tests?
Population and Sample
Undergraduate students in nine intact postsecondary metals technology classes were provided a booklet on new materials developed for space exploration. There were 148 students divided into three groups: Group A (multiple-choice test, n = 52), Group B (matching test, n = 49), and Group C (no test, n = 47). All groups were drawn from the technology education metals technology classes at North Carolina State University. Students were freshmen and sophomores in technology education, design, or various engineering curricula. Students majoring in aerospace engineering were excluded from the final sample because they had previously studied much of the material, which was novel to the other students.
Assignment of class sections to instructors was not randomized, due to scheduling constraints; however, all sections were taught either by the researcher or by graduate assistants, each of whom taught some control and some experimental sections. The instructors gave no instruction or review to any of the groups beyond the booklets, and all announcements and directions were provided via scripted standard statements. Three intact class sections were combined to form each experimental or control group. Class sections had between 18 and 22 students. Random assignment of treatments to sections, deletion of students majoring in aerospace engineering, and absences on testing dates resulted in unequal final group sizes. All groups used the same laboratory complex during instructional and testing periods, which helped to control extraneous variables due to environment.
Design of the Study
At the beginning of the course it was announced that students would be asked to participate in an experimental investigation about different types of tests while they studied subject matter about high-tech materials reflected in the newly revised course outline. All other instructional units in the course were studied by students working in self-paced groups and taking subtests on the units as they learned them. Three examination dates were used for administration of these regular subtests. One regular subtest that was taken by all students on the first examination date, termed Common Test A, concerned metallurgy, precision measurement, and general metal processing. This subtest was used as a pretest to establish equal entering ability among the groups. The experimental study began the class meeting following the first examination date, so students could see that none of the eight regular subtests covered the information in this unit of study. After the regular subtests that had been taken were discussed and results given to students, the instructions for participation in the study were read.
All students were given a copy of a 34-page study packet prepared by the researcher. The packet, entitled High Technology Materials (Haynie, 1991), discussed composite materials, heat-shielding materials, and nontraditional metals developed for the space exploration program. Uses of these materials in consumer products were also illustrated. The packet was in booklet form. It included the following resources typically found in textbooks: (a) a table of contents, (b) text (written by the researcher), (c) halftone photographs, (d) quotations from other sources, (e) diagrams and graphs, (f) numbered pages, (g) excerpts from other sources, and (h) an index with 119 entries correctly keyed to the page numbers inside. Approximately one third of the information in the text booklet was actually reflected in the tests. The remainder of the material appeared to be equally relevant, but served as a complex distracting field to prevent mere memorization of facts. Students were instructed to use the booklet as if it were a textbook and study as they normally would.
Students in all three groups were instructed to study the materials in preparation for an in-class objective test to be given in two weeks. On the announced test date, all booklets were collected and initial testing was conducted according to treatment group. Group A was given the multiple-choice form of the initial test; Group B took the matching form of the test; and Group C was not tested initially. Group C had been told to prepare for the in-class test just as the other groups had been directed, and it is assumed that they studied in much the same manner and depth. However, when the test date arrived, they were told that the test simply was not ready and they would not have to take the test. There was concern that students in the control group might feel cheated due to the lack of this test. Therefore, these students were told that their highest regular subtest would be counted double in determining their final grades. The offer was made that if this adjustment did not please any students, they could see their instructor for other possible arrangements. This adjustment met with the approval of all students in the control group classes.
Three weeks later, all groups were asked to take an unannounced delayed retention test on the same material. The students were told at this time that the true objective of the experimental study was to see which type of test (or no test) promoted delayed retention best. It was further explained that their earlier test scores, if any, were not a part of the study data in any way. Students were asked to do their best and were assured that the scores on this surprise test did not affect their grades. Though informed that participation was fully voluntary, all students who had been present for the earlier sessions did participate cooperatively by taking the delayed retention test.
The initial tests were parallel forms of a single 20-item test. The multiple-choice version used by Group A was carefully converted to matching form for use by Group B. When developing the matching form, care was taken to observe the precautions presented by Haynie (1983, 1992) and Moore (2001). Multiple-choice items had five response alternatives, and matching items appeared in clusters with related stems and alternatives developed directly from the multiple-choice form of the test. The multiple-choice form had been used in numerous previous studies employing the protocol of this investigation. Both tests reflected the same information and operated primarily at the first three levels of the cognitive domain: (a) knowledge, (b) comprehension, and (c) application.
The delayed retention test was a 30-item, multiple-choice test; 20 of the items were alternative forms of the same items used in the initial multiple-choice test. These served as a subtest of previously tested information. The remaining 10 items were similar in nature and difficulty, but they had not appeared in any form on either of the initial tests. These were interspersed throughout the test, and they served as a subtest of new information. The subtest on new information was used to determine if students' learning gains resulted from exposure to a scenario involving studying for and then taking a test (of either form), or from exposure to a specific form of the test. Did the fact that Group A initially took a multiple-choice test and Group B took a matching test affect their retention differentially?
The delayed retention test was developed and used in a previous study (Haynie, 1990a). It had been refined from an initial bank of 76 paired items and examined carefully for content validity. Cronbach's Coefficient Alpha procedure was used to establish a reliability of r = .74 for the delayed retention test. Thorndike and Hagen (1977) asserted that tests with reliability approaching r = .70 are within the range of usefulness for research studies.
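The reliability figure above can be illustrated with a short computation of Cronbach's coefficient alpha. The score matrix here is synthetic and purely illustrative (the study's item-level data are not reproduced), and the function name is my own:

```python
# Illustrative computation of Cronbach's coefficient alpha for a
# 30-item test. Rows are examinees, columns are items scored 0/1.
# The data are simulated, not the study's actual responses.
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))"""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(0)
# hypothetical examinees: a latent ability drives correlated item responses
ability = rng.normal(size=(200, 1))
responses = (ability + rng.normal(size=(200, 30)) > 0).astype(float)
print(round(cronbach_alpha(responses), 2))
```

Because the simulated items share a common latent factor, the computed alpha is fairly high; a value near .74, as reported for the delayed retention test, would indicate moderate internal consistency suitable for research use.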
Students were given initial instructions concerning the learning booklets and were directed to return them on the announced test date. On that date, Groups A and B were tested with the multiple-choice and matching forms of the test, respectively, Group C was informed that they would not be tested, and all learning booklets were returned. The unannounced delayed retention test was administered three weeks later.
The data were analyzed with Statistical Analysis System (SAS) software from the SAS Institute, Inc. The answer forms were scanned, and the data were stored on a floppy disk. The General Linear Models (GLM) procedure in SAS was chosen for omnibus testing rather than analysis of variance (ANOVA) because it is less affected by unequal group sizes. A simple one-way GLM analysis was chosen because the only data consisted of the delayed retention test means of the three groups. The means of the two subtest sections of the retention test were then similarly analyzed by one-way GLM procedure to detect differences in retention of previously tested and novel information. Follow-up comparisons were conducted via Least Significant Difference t-test (LSD) as implemented in SAS. Alpha was set at the p < .05 level for all tests of significance.
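The analysis pipeline described above can be sketched in Python as a rough stand-in for the SAS procedures: scipy's one-way ANOVA replaces the GLM omnibus test, and the LSD follow-up is implemented directly as pairwise t-tests on the pooled error term. The group scores below are simulated; this illustrates the method, not a reproduction of the study's analysis.

```python
# One-way omnibus test on three unequal-sized groups, followed by
# pairwise Fisher LSD t-tests. Group scores are synthetic stand-ins.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
groups = {
    "A (multiple-choice)": rng.normal(17.0, 4.8, 52),
    "B (matching)":        rng.normal(18.9, 3.9, 49),
    "C (no test)":         rng.normal(15.0, 4.3, 47),
}

# Omnibus F-test (equivalent to a one-way GLM with a single factor)
F, p = stats.f_oneway(*groups.values())
print(f"F = {F:.2f}, p = {p:.4f}")

# Fisher's LSD: pairwise t-tests using the pooled MS-error from the ANOVA
samples = list(groups.values())
n_total = sum(len(g) for g in samples)
df_error = n_total - len(samples)          # 148 - 3 = 145
ms_error = sum((len(g) - 1) * g.var(ddof=1) for g in samples) / df_error

names = list(groups)
for i in range(len(samples)):
    for j in range(i + 1, len(samples)):
        se = np.sqrt(ms_error * (1 / len(samples[i]) + 1 / len(samples[j])))
        t = (samples[i].mean() - samples[j].mean()) / se
        p_pair = 2 * stats.t.sf(abs(t), df_error)
        print(f"{names[i]} vs {names[j]}: t({df_error}) = {t:.2f}, p = {p_pair:.4f}")
```

LSD contrasts are interpreted only after a significant omnibus F, mirroring the sequence used in the study.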
Findings

The means and standard deviations of the three groups on the pretest (Common Test A) are shown in Table 1. Since this test was taken on the class day immediately before the study materials were distributed and explained, the finding of F(2, 145) = 0.49, p = .615, confirmed that the groups were of generally equal ability at the beginning of the study.

Table 1
Means, Standard Deviations, and Sample Sizes for the Common Test A Scores
Metals Pretest (Common Test A)

Group             n     M       SD
Group A           52    21.54   5.94
Group B           49    22.49   5.24
Group C           47    21.66   4.24

Note. No significant differences at the p < .05 level.
The means, standard deviations, and final sizes of the three groups on the delayed retention test (including the two subtests and the total scores) are presented in Table 2. The overall difficulty of the test battery and each subtest can be estimated by examining the grand means and the range of scores. The grand mean of all participants was M = 16.96, with a range of 4 to 27 on the total 30-item test. The overall mean on the 20-item subtest of previously tested material was M = 12.61, with a range of 3 to 19; and the overall mean on the 10-item subtest of new information was M = 4.38, with a range of 0 to 9. No student scored 100% on any test, and the overall means were close to 50% on each test, so the tests were difficult. The overall means, however, were not used in any other analysis of the data.

Table 2
Means, Standard Deviations, and Sample Sizes for Delayed Retention Test Scores
                                    Total Test     Subscale A           Subscale B
                                                   (Previously Tested)  (Novel Information)
Treatment                           M      SD      M       SD           M      SD
Group A: Mult-Choice Test (n=52)    16.98  4.8     12.79   3.3          4.19   2.0
Group B: Matching Test (n=49)       18.89  3.9     13.63   2.9          5.27   1.7
Group C: No Test, Control (n=47)    15.02  4.3     11.36   2.8          3.65   1.9
Overall (n=148)                     16.96  4.4     12.61   3.0          4.38   1.9
The GLM procedure was used to compare the three treatment groups (Group A, multiple-choice test; Group B, matching test; and Group C, no test) on the means of the total delayed retention test scores. A significant difference was found among the total test means: F(2, 145) = 9.28, p = .0002 (see Table 3). Following this significant finding, the GLM procedure was again employed to examine the means of each subtest. Significant differences were found among the means of the subtest of previously tested information, F(2, 145) = 6.92, p = .0013 (Table 4), and among the means of the subtest of new information, F(2, 145) = 8.87, p = .0002 (Table 5).

Table 3
Comparison of Group Means on the Total Test
Source        df     Sum of Squares    Mean Square    F       p-value
Treatments    2      360.54            180.27         9.28    0.0002*
Error         145    2816.45           19.42
Total         147    3176.99

Note. * significant difference at the p < .05 level.
Table 4
Comparison of Group Means on the Subtest of Previously Tested Information
Source        df     Sum of Squares    Mean Square    F       p-value
Treatments    2      126.14            63.07          6.92    0.0013*
Error         145    1320.91           9.11
Total         147    1447.05
Note. * significant difference at the p < .05 level.

Table 5
Comparison of Group Means on the Subtest of New Information
Source        df     Sum of Squares    Mean Square    F       p-value
Treatments    2      64.63             32.32          8.87    0.0002*
Error         145    528.18            3.64
Total         147    592.81
Note. * significant difference at the p < .05 level.
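As a sanity check on Tables 3-5, each mean square is the sum of squares divided by its degrees of freedom, and each F ratio is MS(treatments)/MS(error). The reported values can be reproduced from the sums of squares alone:

```python
# Arithmetic check of the ANOVA tables: MS = SS/df, F = MS_treat/MS_error.
# The sums of squares below are those reported in Tables 3-5.
tables = {
    "Total test":        (360.54, 2816.45),
    "Previously tested": (126.14, 1320.91),
    "New information":   (64.63,  528.18),
}
df_treat, df_error = 2, 145
for name, (ss_treat, ss_error) in tables.items():
    ms_treat = ss_treat / df_treat
    ms_error = ss_error / df_error
    F = ms_treat / ms_error
    print(f"{name}: MS_treat = {ms_treat:.2f}, MS_error = {ms_error:.2f}, F = {F:.2f}")
```

Running this reproduces the tabled mean squares and F ratios (9.28, 6.92, and 8.87) to two decimal places, confirming the internal consistency of the reported statistics.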
Follow-up comparisons were conducted via t-test (LSD) procedures in SAS. The critical value used was t(145) = 1.97, p < .05. The rank-ordered means on the total delayed retention test were as follows: Group C (M = 15.02), Group A (M = 16.98), and Group B (M = 18.89). Both of the tested groups, Group A and Group B, significantly outscored the control group (Group C) on the total test. Additionally, Group B (which took the matching test in the initial testing) significantly outscored Group A (which had taken the multiple-choice form of the initial test).
The contrasts of the rank-ordered means on the subtest of previously tested information (delayed retention test) were as follows: Group C (M = 11.36), Group A (M = 12.79), and Group B (M = 13.63). Again, both of the previously tested groups (A and B) outscored the control group (Group C). However, in this subtest, Group B did not significantly outscore Group A, as it did on the total test score.
The contrasts of scores on the subtest of new information were as follows: Group C (M = 3.65), Group A (M = 4.19), and Group B (M = 5.27). On this subtest, only Group B (matching test) significantly outscored the control group (Group C). Group A, which had taken the initial multiple-choice test, did not significantly outscore the control group and scored significantly lower than Group B (matching test).
Conclusions

Three research questions were addressed by this study.
1. If delayed retention learning is the objective of instruction, does initial testing of the information aid retention learning?
Within the constraints and limitations of this study, it appears that students do retain more learned information if they are tested. This finding is in harmony with findings in several previous studies (Haynie, 1990a, 1990b, 1991, 1994, 1995, 1997b, 2002; Nungester & Duchastel, 1982). In many of the earlier studies, it could not be determined whether the retention gains were due to increased study in anticipation of a test or due to the act of taking the test, because (in those studies) the control group had been told in advance that they would not be tested and that the unit of instruction would not count in the determination of course grades. In this study, however, all groups (including the control group) were initially told that the unit did count in their course grades and that they would be tested with an objective test upon completion of the two-week study period. Thus, the finding that both tested groups outscored the untested control group on the delayed retention test demonstrates that it was the act of taking the test, rather than increased study in anticipation of an upcoming test, that resulted in the increased retention learning.
2. Does initial testing by matching tests aid retention learning as effectively as initial testing by multiple-choice tests?
Examination of the contrasts of results on the subtest of previously tested information revealed that there was no significant difference between the means of Group A (multiple-choice test) and Group B (matching test). Both of these means were significantly higher than those of the untested (control) group. Thus, it appears that initial testing via a matching test does aid retention as effectively as testing via a multiple-choice test.
3. Will information that is not reflected on the initial tests be learned equally well by students assessed via matching and multiple-choice tests?
The subtest of new information (10 items interspersed within the delayed retention test) was used to examine this question. The key finding is that the matching test group (Group B) significantly outscored the control group (Group C) on the subtest of new information, while the multiple-choice test group (Group A) did not. However, a conclusion that matching tests are superior in their ability to enhance retention learning of information not reflected on the test would be unjustified because (a) the differences are small (and possibly unimportant), even though they attained statistical significance; (b) the means of the matching test group (Group B) showed a nonsignificant trend to be slightly higher than those of the multiple-choice group (Group A) on both the pretest and all comparisons of means on the delayed retention test; and (c) there is no logical reason to expect the matching test to promote this type of retention better than the multiple-choice test. Given this stance, two logical alternatives remain: either simply conclude that matching tests are as effective as multiple-choice tests for promoting retention of material not reflected on the test, or speculate about other possible explanations. The "as effective" viewpoint is adopted here, and a plausible alternative explanation, involving closer examination of another finding within the study, is offered next.
Since both of the tested groups outscored the control group on the subtest of previously tested information, it could be possible that Group A (multiple-choice test) was actually the lowest-ability group in the study and that a part of the reason that they outscored the control group on the subtest of previously tested information was due to familiarity with the form of the test in combination with the increased exposure to the material resulting from the time devoted to actually taking any form of the test. Thus, the final conclusion posed here concerning this research question is that matching tests appear to be as effective as multiple-choice tests in promoting retention of information not reflected on the test, and that further study is needed on this issue.
Since testing is time consuming and value charged, it is important to learn as much as possible about testing and the effects of tests on learning. Much research has been conducted concerning standardized tests and the effectiveness of tests for evaluation, but little has been done to examine questions related to the effects of teacher-made tests on learning and retention.
This study was limited to one setting within technology education. It used learning materials and tests designed to teach and evaluate a limited number of specified objectives concerning one body of subject matter. The sample used in this study may have been unique for unknown reasons. Therefore, studies of a similar design that use different materials and are conducted with different populations are needed. However, on the basis of this one study, it is recommended that (a) when useful for evaluative purposes, classroom testing should continue to be employed due to its positive effects on retention learning; (b) either matching or multiple-choice tests may be used to promote retention learning, so the choice of which form to use depends upon evaluation issues; and (c) it appears that matching tests are as effective as multiple-choice tests in promoting retention of information that is not actually reflected on the test itself.
More study is needed to further refine the conclusions made here and to answer additional questions about testing in the technology classroom. The time devoted to testing and reviewing for tests is well spent only if it is also effective in promoting learning and retention. Since teacher-made tests have been shown to be laden with technical problems that hinder their usefulness for evaluation, it is important that their role in the promotion of retention be well understood and maximized. Further research is needed to remove some of the mystery and opinion surrounding this important topic.
References

Carter, K. (1984). Do teachers understand the principles for writing tests? Journal of Teacher Education, 35(6), 57-60.
Duchastel, P. (1981). Retention of prose following testing with different types of test. Contemporary Educational Psychology, 6, 217-226.
Fleming, M., & Chambers, B. (1983). Teacher-made tests: Windows on the classroom. In W. E. Hathaway (Ed.), Testing in the schools: New directions for testing and measurement, No. 19 (pp. 29-38). San Francisco: Jossey-Bass.
Gullickson, A. R., & Ellwein, M. C. (1985). Post hoc analysis of teacher-made tests: The goodness-of-fit between prescription and practice. Educational Measurement: Issues and Practice, 4(1), 15-18.
Haynie, W. J. (1983). Student evaluation: The teacher's most difficult job. Monograph Series of the Virginia Industrial Arts Teacher Education Council, Monograph Number 11.
Haynie, W. J. (1990a). Effects of tests and anticipation of tests on learning via videotaped materials. Journal of Industrial Teacher Education, 27(4), 18-30.
Haynie, W. J. (1990b). Anticipation of tests and open space laboratories as learning variables in technology education. In J. M. Smink (Ed.), Proceedings of the 1990 North Carolina Council on Technology Teacher Education Winter Conference. Camp Caraway, NC: NCCTTE.
Haynie, W. J. (1991). Effects of take-home and in-class tests on delayed retention learning acquired via individualized, self-paced instructional texts. Journal of Industrial Teacher Education, 28(4), 52-63.
Haynie, W. J. (1995). In-class tests and posttest reviews: Effects on delayed-retention learning. North Carolina Journal of Teacher Education, 8(1), 78-93.
Haynie, W. J. (1997a). An analysis of tests authored by technology education teachers. Journal of the North Carolina Council of Technology Teacher Education, 2(1), 1-15.
Herman, J., & Dorr-Bremme, D. W. (1982). Assessing students: Teachers' routine practices and reasoning. Paper presented at the annual meeting of the American Educational Research Association, New York.
Hoepfl, M. C. (1994). Developing and evaluating multiple-choice tests. The Technology Teacher, 53(7), 25-26.
Mehrens, W. A. (1987). Educational tests: Blessing or curse? Unpublished manuscript.
Mehrens, W. A., & Lehmann, I. J. (1987). Using teacher-made measurement devices. NASSP Bulletin, 71(496), 36-44.
Moore, K. D. (2001). Classroom teaching skills (5th ed). New York: McGraw-Hill.
Newman, D. C., & Stallings, W. M. (1982, March). Teacher competency in classroom testing, measurement preparation, and classroom testing practices. Paper presented at the annual meeting of the National Council on Measurement in Education. (Cited in Mehrens & Lehmann, 1987).
Nungester, R. J., & Duchastel, P. C. (1982). Testing versus review: Effects on retention. Journal of Educational Psychology, 74(1), 18-22.
Stiggins, R. J., & Bridgeford, N. J. (1985). The ecology of classroom assessment. Journal of Educational Measurement, 22(4), 271-286.
Stiggins, R. J., Conklin, N. F., & Bridgeford, N. J. (1986). Classroom assessment: A key to effective education. Educational Measurement: Issues and Practice, 5(2), 5-17.
Thorndike, R. L., & Hagen, E. P. (1977). Measurement and evaluation in psychology and education. New York: Wiley.
Haynie is Associate Professor in the Department of Mathematics, Science, and Technology Education at North Carolina State University in Raleigh, North Carolina. Haynie can be reached at firstname.lastname@example.org.