JVER v26n2 - American Vocational Education Research Association Members' Perceptions of Statistical Significance Tests and Other Statistical Controversies

Volume 26, Number 2

2001

American Vocational Education Research Association Members' Perceptions of Statistical Significance Tests and Other Statistical Controversies

Howard R. D. Gordon
Marshall University

Abstract

The purpose of this study was to identify AVERA members' perceptions of statistical significance tests. A simple random sample was used to select 113 AVERA members for participation. The Psychometrics Group Instrument was used to collect data. Two-thirds of the respondents were males, 93% had earned a doctoral degree, 67% had more than 15 years of experience in educational research and 82.5% were employed at the university level. There was general disagreement among respondents concerning the proposition that statistical significance tests should be banned. Respondents in the study were less likely to realize that stepwise methods do not identify the best predictor set of a given size. Respondents also agreed that studies with non-significant results can still be very important.

Historically, vocational education (career and technical/workforce education) at the secondary and postsecondary levels has suffered from a "second-class citizen" image. This image has carried over into higher education. Departments of vocational teacher education at the university level have not always been held in the highest esteem. Whether merited or not, this stigma has been attached to research in vocational education. Research conducted in vocational education at the university often has been viewed as less than first-rate. According to Moore ( 1992 ), "We place too much emphasis on statistical significance and not enough emphasis on practical or applied significance of the research. We need to pay more attention to selecting problems for study" ( p. 11 ).

Educational research is an ongoing process, which starts at the determination of the problem, followed by execution of research procedures ( Gay, 1996 ). The subsequent stages of the process, including statistical analysis, are logically influenced by the nature of the research problem and the methodological strategy of a study.

During the past two decades, there has been an increase in vocational education research. The growth in vocational education research has been accompanied by an increase in the use of statistical techniques, with both positive and negative results. In a 1981 study by Oliver, some of the positive effects are described as: (a) more complex problems are being investigated, (b) the information produced is becoming more meaningful, and (c) the efficiency of the research is increasing. The negative effects primarily are that some problems and issues have arisen. Oliver ( 1981 ) noted that "statistical techniques are being used in cases where the assumptions are not being met and there is generally a failure to distinguish between statistical significance and practical importance" ( p. 9 ).

Conceptual Framework and Related Literature

The empirical-analytic paradigm of research in vocational education heavily relies on the use of statistics ( Smith, 1984 ). The impact of statistical methods on vocational education research was recognized by many researchers in the field ( Cheek, 1988 ; Oliver, 1981 ; Warmbrod, 1986 ; Zhang, 1993 ).

The use of statistics in educational research can be traced back as early as 1901 when Edward L. Thorndike published his Notes on Child Study ( Walker, 1956 ). However, it was around 1949 that "the era of empirical generalization" finally arrived in educational research ( West & Robinson, 1980 ). In spite of frequent calls from many researchers in vocational education to broaden paradigms for inquiry ( Zhang, 1993 ), quantitative research still prevailed in the field during the 1980s ( Hillison, 1989 ; Lynch, 1983 ). Several studies concurred that ANOVA, correlations, t-tests, regression, chi-square tests, and multivariate techniques were among the most frequently used techniques in educational research ( Zhang, 1993 ). The use of variations on statistical significance tests was popularized in the social sciences by Sir Ronald Fisher, Jerzy Neyman, and Egon Pearson ( Huberty, 1987 ). Today, most researchers implicitly employ some hybrid of the logics suggested by these three figures ( Thompson, 1996 ).

The etiology of the propensity to conduct statistical significance tests can be traced to two dynamics. The first involves an unrecognized error in logic when consciously trying to be scientific, whereas the second dynamic occurs as a frankly irrational process. These two dynamics undergirding continued emphasis on statistical tests must be understood if reform efforts are to be effective ( Thompson, 1996 ).

Statistical significance testing has existed in some form for approximately 300 years ( Daniel, 1998 ) and has served an important purpose in the advancement of inquiry in the social sciences. The controversy about the use or misuse of statistical significance testing that has been evident in the literature for the past 10 years has become the major methodological issue of our generation ( Kaufman, 1998 ).

Bracey ( 1988 ) reminded us that "statistical significance has nothing to do with meaningfulness" ( p. 257 ). Kupfersmid ( 1988 ) observed that a "problem related to the meaningfulness of 'statistically significant' findings is that what is 'significant' in a meaningful sense may be contradictory" ( p. 636 ). Tests of statistical significance are overused and misused in an attempt to make a poor or mediocre study appear good ( Moore, 1992 ).

Why do educational researchers place such emphasis on statistical significance? Soltis ( 1984 ) provided a clue.

Much of the social and behavioral sciences have developed their present forms by consciously seeking to imitate the methods and forms of the natural sciences, many educational researchers have tried to travel the same royal road to knowledge, legitimacy and status. ( p. 6 )

Shaver ( 1992 ) maintained that educational researchers insist on tests of statistical significance because they "provide a façade of scientism in research. For many in educational research, being quantitative is equated with being scientific…despite the fact that some scientists and many psychologists…have managed very well without inferential statistics" ( p. 2 ).

Few researchers understand what statistical significance testing is, and what it is not, and consequently their results are misinterpreted. Even more commonly, researchers understand elements of statistical significance testing, but the concept is not integrated into their research ( Thompson, 1994a ). For example, the influence of sample size on statistical significance may be acknowledged by a researcher, but this insight is not conveyed when interpreting results in a study with several thousand subjects. Because statistical significance tests have been so frequently misapplied, some reflective researchers ( Carver, 1978 ; Meehl, 1978 ; Schmidt, 1996 ; Shulman, 1970 ) have recommended that statistical significance tests be completely abandoned as a method for evaluating statistical results.
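The influence of sample size on statistical significance can be made concrete with a small sketch. It holds a hypothetical standardized mean difference fixed and shows how the approximate two-sided p value changes as the per-group sample size grows. The effect size of 0.10, the sample sizes, and the large-sample normal approximation are assumptions chosen for illustration; they are not values from this study.

```python
import math

def two_sided_p_from_d(d, n_per_group):
    """Approximate two-sided p value for a two-group mean comparison.

    Assumes equal group sizes and a large-sample normal approximation,
    so the test statistic is z = d * sqrt(n / 2).
    """
    z = d * math.sqrt(n_per_group / 2.0)
    # Two-sided tail area of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2.0))

d = 0.10  # hypothetical, fixed standardized mean difference
for n in (20, 200, 2000, 20000):
    # The effect size never changes; only the sample size does.
    print(n, round(two_sided_p_from_d(d, n), 4))
```

Under these assumptions the same small effect is far from statistically significant with 20 subjects per group and clearly significant with several thousand, which is why a p value alone says little about the magnitude of an effect.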

Biskin ( 1998 ) argues that practical or clinical significance can be noteworthy even when results are not statistically significant. Conversely, he argues that even when results are or would be statistically significant, at least in some such cases "the researcher's prime consideration should be effect size." Vogt ( 1999 ) provides the following definitions of effect size:

(a) Broadly, any of several measures of association or of the strength of a relation, such as Pearson's r or eta. Effect size often is thought of as a measure of practical significance. (b) A statistic, often abbreviated D or delta, indicating the difference in outcome for the average subject who received a treatment from the average subject who did not (or who received a different level of the treatment). This statistic is often used in meta-analysis. It is calculated by taking the difference between the control and experimental groups' means and dividing that difference by the standard deviation of the control group's scores-or by the standard deviation of the scores of both groups combined. (c) In statistical power analysis, effect size is the degree to which the null hypothesis is false. ( p. 94 )

By contrast, tests of the null hypothesis only allow you to conclude that a relationship is significantly larger than zero, but they do not tell you by how much. Effect size measures do. Thus, the effect size is an estimate of the degree to which a phenomenon is present in a population ( Vogt, 1993 ).

Reporting effect sizes has three important benefits. First, reporting effects facilitates subsequent meta-analyses incorporating a given report. Second, effect size reporting creates a literature in which subsequent researchers can more easily formulate more specific study expectations by integrating the effects reported in related prior studies. Third, and perhaps most importantly, interpreting the effect sizes in a given study facilitates the evaluation of how a study's results fit into existing literature, supports the explicit assessment of how similar or dissimilar results are across related studies, and potentially informs judgment regarding what study features contributed to similarities or differences in effects ( Vacha-Haase, Nilsson, Reetz, Lance, & Thompson, 2000 ).

Biskin ( 1998 ) reported that as a research area matures, effect size should be deemed more important than statistical significance. Recent empirical studies of articles published since 1994 in psychology, counseling, special education, and general education suggest that merely "encouraging" effect size reporting ( APA, 1994 ) has not appreciably affected actual reporting practices ( Vacha-Haase & Thompson, 1998 ). Kotrlik ( 2000 ) proposed that authors should report effect sizes in the manuscript and tables when reporting statistical significance in the Journal of Agricultural Education (the only career and technical/workforce education journal with this requirement).

Numerous effect sizes can be computed. Useful reviews of various choices are provided by Kirk ( 1996 ), Olejnik and Algina ( 2000 ), Rosenthal ( 1994 ), and Snyder and Lawson ( 1993 ). Although there is a class of effect sizes that Kirk ( 1996 ) labeled "miscellaneous" (e.g., the odds ratios that are so important in loglinear analyses), there are two major classes of effect sizes for parametric analyses.

The first class of effect sizes involves standardized mean differences. Effect sizes in this class include indices such as Glass' Δ, Hedges' g, and Cohen's d. For example, Glass' Δ is computed as the difference in the two means (i.e., experimental group mean minus control group mean) divided by the control group standard deviation, where the SD computation uses n-1 as the denominator. When the study involves matched or repeated measures designs, the standardized difference is computed taking into account the correlation between measures ( Dunlap, Cortina, Vaslow & Burke, 1996 ).
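The standardized mean difference described above can be sketched as follows. The first function follows the Glass definition given in the text (control group SD with n-1 in the denominator). The second uses the pooled standard deviation of both groups, which is the usual form of Cohen's d but is an assumption here because the text does not spell it out. The scores are hypothetical and are not data from this study.

```python
import math
from statistics import mean, stdev, variance

def glass_delta(experimental, control):
    """Mean difference divided by the control group SD (n-1 denominator)."""
    return (mean(experimental) - mean(control)) / stdev(control)

def cohens_d(experimental, control):
    """Mean difference divided by the pooled SD of both groups (assumed form)."""
    n1, n2 = len(experimental), len(control)
    pooled_var = ((n1 - 1) * variance(experimental)
                  + (n2 - 1) * variance(control)) / (n1 + n2 - 2)
    return (mean(experimental) - mean(control)) / math.sqrt(pooled_var)

# Hypothetical scores for illustration only.
experimental = [12, 15, 17, 19, 22]
control = [10, 12, 14, 16, 18]

print(round(glass_delta(experimental, control), 3))
print(round(cohens_d(experimental, control), 3))
```

The two indices differ whenever the groups have unequal variances, so a report should state which standardizer was used.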

However, not all studies involve experiments or only a comparison of group means. Since all parametric analyses are part of one linear model family, and are correlational, variance-accounted-for effect sizes can be computed in all studies, including both experimental and non-experimental studies. Effect sizes in this second class include indices such as r², R², and η². For example, for regression, R² can be computed as the sum-of-squares explained divided by the sum-of-squares total. Or, for a one-way ANOVA, η² is computed as the sum-of-squares explained divided by the sum-of-squares total ( Vacha-Haase, et al., 2000 ).
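A variance-accounted-for effect size of this kind can be sketched for a one-way ANOVA as the between-groups (explained) sum of squares divided by the total sum of squares. The function name and the three small groups of scores are hypothetical and serve only to illustrate the arithmetic.

```python
from statistics import mean

def eta_squared(groups):
    """Sum-of-squares explained divided by sum-of-squares total."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    # Explained variation: each group mean's squared distance from the
    # grand mean, weighted by group size.
    ss_explained = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    return ss_explained / ss_total

# Hypothetical scores for three groups.
groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(eta_squared(groups))
```

For these made-up scores the explained sum of squares is 54 and the total is 60, so 90% of the score variance is associated with group membership.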

Cohen ( 1988 ) provided rules of thumb for characterizing what effect sizes are small, medium, or large. He emphasized that the interpretation of effects requires the researcher to think more narrowly in terms of a specific area of inquiry. He emphasized that the evaluation of effect sizes inherently requires the researcher's explicit personal value judgment regarding the practical or clinical importance of the effects. According to Wiersma ( 2000 ), assigning descriptions of magnitude to effect sizes is somewhat subjective. Effect sizes from .05 to .20 are quite small. An effect size approaching 1.0, say .75 to .80, indicates a powerful effect. Effect sizes from .25 to .70 are considered moderate to substantial. Seldom do effect sizes exceed 1.0, though such effect sizes are possible ( Wiersma, 2000 ).

Tyron ( 1998 ) reported the following:

The fact that statistical experts and investigators publishing in the best journals cannot consistently interpret the results of these analyses is extremely disturbing. Seventy-two years of education have resulted in minuscule, if any, progress toward correcting this situation. It is difficult to estimate the handicap that widespread, incorrect, and intractable use of a primary data analytic method has on a scientific discipline, but the deleterious effects are doubtless substantial. ( p. 796 )

Several empirical studies have shown that many researchers do not fully understand the statistical tests that they employ (Mittag & Thompson, 2000 ; Nelson, Rosenthal, & Rosnow, 1986 ; Oakes, 1986 ; Rosenthal & Gaito, 1963 ; Zuckerman, Hodgins, Zuckerman, & Rosenthal, 1993 ). In their AERA study on statistical significance tests, Mittag and Thompson ( 2000 ) recommended that other national research associations conduct similar studies to resolve conflicting views related to the use of statistical tests.

At present, there is a dearth of information in the literature about the perceptions of career and technical/workforce education researchers toward statistical significance tests. This study is intended to serve as a framework for promoting further discussion of controversial statistical issues among career and technical/workforce education researchers.

The primary purpose of this study was to establish baseline information regarding AVERA members' perceptions of statistical significance tests. The following objectives guided the study:

1. To explore current perceptions of AVERA members regarding statistical significance tests.

2. To determine perceptions of AVERA members regarding selected statistical issues, such as score reliability and stepwise methods.

Method

Population and Sample

The population consisted of current AVERA members (N = 160) during the 2000-2001 school year. Due to the lack of available resources and other restrictions, the decision was made to use a probability sample instead of the total population. A simple random sample is also the single best way to obtain a representative sample ( Gay & Airasian, 2000 ). The AVERA membership directory was used to identify the population. Using a formula suggested by Krejcie and Morgan ( 1970 ), a sample size of 113 AVERA members was needed, based upon a 5% degree of accuracy and a 95% confidence level. A simple random sample was selected from the population using the random number generator in Microsoft Excel.
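The sample size reported above can be reproduced with the formula published by Krejcie and Morgan (1970). The sketch below assumes their usual defaults: a chi-square value of 3.841 (1 degree of freedom, 95% confidence), a population proportion of .50, and a 5% degree of accuracy. The function and parameter names are illustrative.

```python
def krejcie_morgan_sample_size(population, chi_square=3.841, p=0.5, d=0.05):
    """Required sample size for a finite population (Krejcie & Morgan, 1970).

    s = X^2 * N * P * (1 - P) / (d^2 * (N - 1) + X^2 * P * (1 - P))
    """
    numerator = chi_square * population * p * (1 - p)
    denominator = d ** 2 * (population - 1) + chi_square * p * (1 - p)
    return round(numerator / denominator)

# Population of 160 members, as reported for this study.
print(krejcie_morgan_sample_size(160))  # 113
```

With N = 160 the formula gives approximately 113.2, which rounds to the 113 members sampled.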

Instrumentation

The Psychometrics Group Instrument ( Mittag, 1999 ) was used to determine participants' perceptions of statistical significance tests and other statistical issues. The core of the instrument (part II) contains 29 Likert-type items with a 1-5 response scale (1 = disagree, 2 = somewhat disagree, 3 = neutral, 4 = somewhat agree, and 5 = agree). The instrument has a reliability coefficient of .90 ( Mittag & Thompson, 2000 ). Content and face validity for the adapted instrument were established by a panel of five faculty members in adult and technical education at Marshall University. The Likert-type scale items were pilot-tested for reliability with a group of 12 AVERA members not included in the sample. The reported reliability coefficient of the pilot study was .89. The completed study had a reported reliability coefficient of .83. Appropriateness and permission for the use of this instrument were discussed with the author. Some items were reverse-worded so as to minimize response set influences. Mittag and Thompson ( 2000 ) recommend the recoding of reverse-worded items, so that higher scores have a consistent meaning.
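Recoding a reverse-worded item on a 1-5 scale amounts to reflecting each response around the scale midpoint, so that 1 becomes 5, 2 becomes 4, and so on. The sketch below shows one way to do this. The item numbers and responses are hypothetical, and the set of reverse-worded items is an assumption made only for the example.

```python
def reverse_code(response, scale_min=1, scale_max=5):
    """Reflect a Likert response so that the scale direction is reversed."""
    return scale_min + scale_max - response

# Hypothetical responses from one participant, keyed by item number.
responses = {9: 2, 11: 4, 14: 5, 24: 3}
# Hypothetical set of reverse-worded items to recode.
reverse_worded = {11, 14}

recoded = {
    item: reverse_code(value) if item in reverse_worded else value
    for item, value in responses.items()
}
print(recoded)
```

After recoding, the table notes need to state the direction of the scale for each item, because the numeric anchors of recoded items are flipped.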

Data Collection

Elements of Dillman's ( 2000 ) mail and internet surveys were utilized to achieve optimal return rate. Data collection began in October and was concluded in December, 2000.

To control nonresponse error and maintain validity, early and late respondents were compared statistically ( Ary, Jacobs, & Razavieh, 1996 ). Research shows that nonrespondents are often similar to late respondents ( Miller & Smith, 1983 ). A late respondent was classified as one who returned his or her questionnaire during December. Statistical tests revealed no differences between early and late respondents. Respondents' data were compiled, yielding a total response rate of 35%.

According to Kerlinger ( 1986, p. 380 ), survey mail response rates are often about 30%. The critical question when such response rates are realized is whether the respondents are still representative of the population to which the researcher wishes to generalize. Mittag and Thompson ( 2000 ) reported that "response profiles should be analyzed to provide at least some insight regarding the issue(s)" ( p. 15 ). Although the results of this study may not be generalized to the entire population of American Vocational Education Research Association members, the results can still provide valuable information for career and technical/workforce education researchers.

Data Analysis

Data were analyzed using the Statistical Package for the Social Sciences (SPSS Version 9.0 for Windows). Descriptive statistics were used to organize and summarize the data.

Findings

Demographic Characteristics

Sixty-seven percent of the respondents were males. A majority of the respondents (93%) had earned a doctoral degree. Sixty percent of the respondents revealed that they had over 15 years of experience in educational research. The respondents' work settings were as follows: university (82.5%), school district (7.5%), business (5.0%), and other (5.0%).

Perception Clusters

The 29 items evaluated nine clusters of perceptions. Table 1 presents responses to the first five items, which measured general perceptions and the ongoing significance controversy.

Respondents were in general agreement (M = 4.47, SD = .60) that this controversy is likely to continue for many years in the future. The respondents also agreed (M = 4.25, SD = .87) that researchers should use the phrase "statistically significant," rather than "significant," to describe their results. There was general disagreement (M = 1.70, SD =.88) among respondents concerning the proposition that statistical significance tests should be banned.

Table 1

AVERA Members' General Views Regarding Statistical Testing

No. | Perception Statement/Item | M^a | SD
1. | Controversies regarding the use of significance tests have existed for many years in the past, and will doubtless continue for many years in the future. | 4.47 | 0.60
2. | It would be better if everyone used the phrase "statistically significant" rather than "significant," to describe the results when the null hypothesis is rejected. | 4.25 | 0.87
3. | Most studies are conducted with insufficient statistical power against Type II error. | 3.41 | 0.85
5. | All that significance means is that the researcher rejected the null hypothesis. | 3.02 | 1.44
4. | Science would progress more rapidly if tests of significance were banned from journal articles. | 1.70 | 0.88

^a Note: Response scale: 1 = disagree, 5 = agree.

Table 2

AVERA Members' Perceptions of the General Linear Model

No. | Perception Statement/Item | M | SD
26. | It is not possible to use regression to statistically test the null that means of different groups are equal. | 3.70 | 0.88
12. | All statistical analyses (e.g., t-tests, ANOVA, r, R) are correlational. | 2.37 | 1.17

Note: For item 26, after recoding, 1 = agree, 5 = disagree.
For item 12, 1 = disagree, 5 = agree.

Table 2 shows means and standard deviations of respondents' perceptions of the General Linear Model (GLM). Respondents slightly disagreed that regression could be used to test hypotheses about means. As reported in Table 2, respondents also slightly disagreed that all statistical analyses are correlational.

Participants were asked whether stepwise methods identify the best variable set, and whether the results can be used to infer variable importance. As reported in Table 3, these two views were perceived by respondents as neutral to slightly agreeable (M = 3.47 to 3.55).

Table 3

AVERA Members' Perceptions of Stepwise Methods

No. | Perception Statement/Item | M^a | SD
13. | In regression and other analyses, stepwise methods can reasonably be used to identify the best subset of predictors of a given subset size. | 3.55 | 0.95
20. | When researchers do stepwise analyses, the order of the entry of the variables (1st, 2nd, etc.) provides one useful indication of the importance of the variables. | 3.47 | 0.98

^a Note: Response scale: 1 = disagree, 5 = agree.

Table 4

AVERA Members' Perceptions of Score Reliability

No. | Perception Statement/Item | M | SD
23. | Poor reliability of data in a given study will tend to lower or attenuate the effect sizes that are detected. | 3.62 | 1.12
28. | Reliability does not directly affect the likelihood of obtaining significance in a given study. | 3.45 | 1.21
7. | On its face the statement, "the reliability of the test," asserts an untruth, since reliability is not a characteristic of a given test. | 2.85 | 1.18
19. | Testing the significance of a reliability or validity coefficient with null hypothesis that r² = 0 is not useful or productive. | 2.80 | 0.99

Note: For items 7, 19, and 23, 1 = disagree, 5 = agree. For item 28, after recoding, 1 = agree, 5 = disagree.

Table 4 shows respondents' perceptions of score reliability. Respondents were slightly in agreement with item 23 (M = 3.62, SD = 1.12). Item 23 addressed the influence of poor reliability of data on "effect sizes".

Views regarding Type I and Type II errors are reported in Table 5. Respondents reported a mean rating score of 2.27 for item 9 (a Type II error is impossible if the results are statistically significant).

Perceptions regarding the influence of sample sizes on statistical tests are reported in Table 6. Respondents disagreed (M = 2.37, SD = 1.31) that "statistically significant results are more noteworthy when sample sizes are small."

Table 5

AVERA Members' Perceptions of Type I and II Errors

No. | Perception Statement/Item | M | SD
22. | It is possible to make both Type I and Type II error in a given study. | 3.37 | 1.21
17. | Type I errors may be a concern when the null hypothesis is not rejected. | 2.72 | 1.19
29. | Type II errors are probably fairly common within published research. | 2.52 | 1.01
9. | A Type II error is impossible if the results are statistically significant. | 2.27 | 1.10

Note: For items 17, 22, 29, after recoding, 1 = agree, 5 = disagree.
For item 9, 1 = disagree, 5 = agree.

Table 6

AVERA Members' Perceptions of Sample Size Influences

No. | Perception Statement/Item | M^a | SD
16. | Every null hypothesis will eventually be rejected at some sample size. | 3.15 | 1.29
25. | Significance tests are partly a test of whether the researcher had a large sample. | 2.87 | 1.18
10. | Statistically significant results are more noteworthy when sample sizes are small. | 2.37 | 1.31

^a Note: Response scale: 1 = disagree, 5 = agree.

Table 7 shows respondents' perceptions of whether statistical probabilities are exclusively measures of effect size. A mean rating of 3.82 was reported for item 14 (failure to obtain statistical significance means that results were not noteworthy or important).

Table 7

AVERA Members' Perceptions of Effect Sizes

No. | Perception Statement/Item | M^a | SD
14. | If a dozen different researchers investigated the same phenomenon using the same null hypothesis, and none of the studies yielded statistically significant results, this means that the effects being investigated were not noteworthy or important. | 3.82 | 1.19
11. | Smaller p values provide direct evidence that study effects were larger. | 3.27 | 1.17
24. | The p values reported in different studies cannot be readily compared, because these values are confounded with different sample sizes across studies. | 3.15 | 1.23

Note: For items 11 and 14, after recoding, 1 = agree, 5 = disagree.
For item 24, 1 = disagree, 5 = agree.

Perceptions of p values are summarized in Table 8. Respondents agreed that "studies with non-significant results can still be very important" (M = 1.45, SD = 1.19).

Table 8

AVERA Members' Perceptions of p Values

No. | Perception Statement/Item | M^a | SD
27. | Unlikely results are generally more important or noteworthy. | 3.50 | 1.06
6. | Finding that p < .05 is one indication that the results are important. | 2.80 | 1.41
18. | Studies with non-significant results can still be very important. | 1.45 | 1.19

^a Note: After recoding, 1 = agree, 5 = disagree.

Finally, participants were asked about whether p values evaluate population parameters and result replicability. As revealed in Table 9, respondents' perceptions were slightly agreeable to neutral (M = 2.22 to 3.05).

Table 9

AVERA Members' Perceptions of p as Replicability Evidence

No. | Perception Statement/Item | M^a | SD
8. | Smaller and smaller values for the calculated p indicate that the results are more likely to be replicated in future research. | 3.05 | 1.21
15. | The p values that are calculated in a given study test the probability of the results occurring in the sample, and not the probability of results occurring in the population. | 2.82 | 1.33
21. | Significance tests evaluate the probability that the results for the sample are the same in the population. | 2.22 | 1.09

Note: For items 8 and 21, after recoding 1 = agree, 5 = disagree.
For item 15, 1 = disagree, 5 = agree.

Discussion, Conclusions, and Recommendations

It appears that AVERA members who were most comfortable with and interested in statistical issues (quantitative methods) may have been most likely to respond to the survey. AVERA members' general views regarding statistical testing appeared to be consistent with previous research ( Carver, 1993 ; Mittag & Thompson, 2000 ; Thompson, 1996 ).

Respondents were more likely to slightly disagree with the two views pertaining to the General Linear Model (GLM). These findings contradict a previous study reported by Mittag and Thompson ( 2000 ). In their study, respondents (a) were basically neutral on whether all statistical analyses (e.g., t-tests, ANOVA, r, R) are correlational, and (b) agreed that regression could be used to test a hypothesis about means. Statisticians have argued that parametric methods are part of a single family, and that all are correlational ( Cohen, 1968 ; Knapp, 1978 ; Mittag & Thompson, 2000 ; Thompson, 1991 ). One important implication of the GLM is that r² analogs can be reported as effect sizes in all analyses ( Mittag & Thompson, 2000 ).
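The argument that a test of mean differences is itself correlational can be illustrated with a two-group comparison. In the sketch below, a pooled-variance t statistic is computed directly, and then recovered from the Pearson correlation between the scores and a dummy-coded (0/1) group variable, which is equivalent to a simple regression of the scores on group membership. The scores are hypothetical and the function names are illustrative.

```python
import math
from statistics import mean, variance

def pooled_t(group_a, group_b):
    """Independent-samples t statistic with a pooled variance estimate."""
    n1, n2 = len(group_a), len(group_b)
    pooled_var = ((n1 - 1) * variance(group_a)
                  + (n2 - 1) * variance(group_b)) / (n1 + n2 - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

def pearson_r(x, y):
    """Pearson product-moment correlation."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical scores for two groups.
group_a = [12, 15, 17, 19, 22]
group_b = [10, 12, 14, 16, 18]

scores = group_a + group_b
dummy = [1] * len(group_a) + [0] * len(group_b)  # group membership as 0/1

t = pooled_t(group_a, group_b)
r = pearson_r(dummy, scores)
df = len(scores) - 2

# The same t statistic, recovered from the correlation alone.
t_from_r = r * math.sqrt(df) / math.sqrt(1 - r ** 2)
print(round(t, 4), round(t_from_r, 4), round(r ** 2, 4))
```

Because the two routes give the same t statistic, r² from the dummy-coded analysis can be reported as the effect size for the group comparison.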

The two views pertaining to stepwise methods were more likely to be perceived as acceptable for identifying the best variable set and importance. These findings suggest that some AVERA researchers are not aware that stepwise methods do not identify the best predictor set of a given size ( Cliff, 1987 ; Huberty, 1989 ; Thompson, 1995 ). In a recent study by Thomas ( 2000 ), over 70% of AVERA members indicated a need for adequate workshops on emerging statistical techniques and research methods. Future researchers in the field may consider additional preparation in statistics so as to comprehend some of the advanced techniques which are used in current research literature in career and technical/workforce education.

Stepwise methods are especially problematic when statistical significance tests are invoked to determine stopping positions, because the methods have several problems associated with conventional statistical significance applications ( Carver, 1987 ; Cohen, 1994 ; Thompson, 1993, 1994a, 1994b, 1994c ). As a general proposition, there are readily available software programs to assist with appropriate variable selection efforts. Thus, stepwise analyses should be eschewed in favor of programs such as those offered by McCabe ( 1975 ), the Morris program distributed within Huberty's ( 1994 ) book, or SAS procedure RSQR. Regarding interpretations involving the origins of explained variance (i.e., variable ordering), a useful alternative is simply to consult standardized weights (beta weights) and structure coefficients ( Thompson & Borello, 1985 ).
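The point that stepwise entry need not find the best predictor set of a given size can be shown with a small constructed example. The correlations below are hypothetical, chosen so that the single best predictor does not belong to the best pair. The sketch compares a forward selection of two predictors with an exhaustive search over all pairs, using the standard formula for R² with two predictors. It illustrates the logic only and is not a reimplementation of any of the programs cited above.

```python
from itertools import combinations

# Hypothetical correlations (illustrative only, not data from this study).
r_xy = [0.5, 0.4, 0.4]              # correlations of x1, x2, x3 with y
r_xx = [[1.0, 0.5, 0.5],            # correlations among the predictors
        [0.5, 1.0, -0.4],
        [0.5, -0.4, 1.0]]

def r2_single(i):
    """R-squared for one predictor."""
    return r_xy[i] ** 2

def r2_pair(i, j):
    """R-squared for two predictors, computed from correlations."""
    rij = r_xx[i][j]
    return (r_xy[i] ** 2 + r_xy[j] ** 2
            - 2 * r_xy[i] * r_xy[j] * rij) / (1 - rij ** 2)

def forward_two():
    """Forward selection: enter the best single predictor, then the best addition."""
    first = max(range(3), key=r2_single)
    remaining = [j for j in range(3) if j != first]
    second = max(remaining, key=lambda j: r2_pair(first, j))
    return (first, second), r2_pair(first, second)

def best_two():
    """Exhaustive search over every pair of predictors."""
    best = max(combinations(range(3), 2), key=lambda pair: r2_pair(*pair))
    return best, r2_pair(*best)

print(forward_two())
print(best_two())
```

In this constructed case forward selection enters x1 first and ends with an R² of about .28, while the pair x2 and x3 reaches about .53. The order of entry therefore also says little about the importance of the variables.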

Overall, views regarding score reliability appeared to be neutral. These findings are consistent with a similar study reported by Mittag and Thompson ( 2000 ) for the American Educational Research Association. It is important to remember that a test is not reliable or unreliable. Reliability is a property of the scores on a test for a particular population of examinees. Thus, authors should provide reliability coefficients of the data being analyzed. Interpreting the size of the observed effects requires an assessment of the reliability of the scores ( Wilkinson & The APA Task Force on Statistical Inference, 1999, p. 596 ).
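The link between score reliability and observed effect sizes can be sketched with the classical correction-for-attenuation relationship, in which the observed correlation equals the correlation between true scores multiplied by the square root of the product of the two score reliabilities. The sketch assumes classical test theory. The true correlation of .50 and the reliabilities are hypothetical values chosen for illustration.

```python
import math

def attenuated_r(true_r, reliability_x, reliability_y):
    """Expected observed correlation given the reliabilities of both sets of scores."""
    return true_r * math.sqrt(reliability_x * reliability_y)

# Hypothetical true-score correlation of .50 under three reliability levels.
for reliability in (1.00, 0.83, 0.60):
    print(reliability, round(attenuated_r(0.50, reliability, reliability), 3))
```

Under these assumptions a true correlation of .50 would be observed as roughly .42 when both sets of scores have reliabilities of .83, and as .30 when both reliabilities are .60.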

Views pertaining to Type I and Type II errors appeared to be neutral. Examination of these findings revealed a mixed perception of the definition of a Type I error. By definition, a Type I error can only occur if results are statistically significant ( Oliver, 1981 ).

Respondents were more likely to have a neutral perception regarding (a) whether "significance tests are partly a test of whether the researcher had a large sample," and (b) "every null hypothesis will eventually be rejected at some sample size." Mittag and Thompson ( 2000 ) reported similar findings. Several factors can influence the size of the sample used in a research study, but with the exception of cost, information about such factors is often incomplete and it becomes difficult to set an exact size ( Wiersma, 2000 ). Hinkle and Oliver ( 1983 ) discuss estimating necessary sample size based on certain characteristics.

Studies with non-significant results can still be very important. Tyler ( 1931 ) pointed out that "differences which are statistically significant are not always socially important. The corollary is also true: differences which are not shown to be statistically significant may nevertheless be socially significant" ( pp. 116-117 ). Meehl ( 1997 ) characterized the use of the term "significant" as being "cancerous" and "misleading" ( p. 421 ) and advocated that researchers interpret their results in terms of confidence intervals rather than p values. Moore ( 1992 ) noted,

We as vocational educators should be proud of our improving process as "research technicians". I am not advocating we do away with statistical testing. However, I am cautioning that we must not get caught up in the misguided belief that having statistically significant things makes our research significant. ( p. 5 )

These findings suggest that it is critical that research in career and technical education be meaningful and of value. "Progress has no greater enemy than habit" ( McCracken, 1991, p. 303 ). As a profession we must break out of the habit of simply describing relationships and differences between and among groups. The explanation of the phenomena must be our goal.

Issues raised in this study may be applicable to other disciplines. Joint efforts between career and technical education and other fields of education should be considered in offering statistics courses at all levels due to the similarity in the use of statistics techniques across the fields.

For further study, it is recommended that research be conducted to determine AVERA members' perceptions of qualitative research and its impact on career and technical education.

References

American Psychological Association (1994). Publication manual of the American Psychological Association (4th ed.). Washington, DC: Author.

Ary, D., Jacobs, L., & Razavieh, A. (1996). Introduction to research in education (5th ed.). Ft. Worth, TX: Holt, Rinehart, and Winston.

Biskin, B. H. (1998). Comment on significance testing. Measurement and Evaluation in Counseling and Development, 31, 58-62.

Bracey, G. W. (1988). Tips for readers of research. Phi Delta Kappan, 70, 257-258.

Carver, R. P. (1978). The case against statistical significance testing. Harvard Educational Review, 48, 378-399.

Carver, R. (1993). The case against statistical significance testing revisited. Journal of Experimental Education, 61, 287-292.

Cheek, J. G. (1988). Maintaining momentum in vocational education research. Journal of Vocational Education Research, 13, 1-17.

Cliff, N. (1987). Analyzing multivariate data. San Diego, CA: Harcourt Brace Jovanovich.

Cohen, J. (1968). Multiple regression as a general data-analytic system. Psychological Bulletin, 70, 426-443.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.

Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.

Daniel, L. G. (1998). Statistical significance testing: A historical overview of misuse and misinterpretations with implications for the editorial policies of educational journals. Research in the Schools, 5(2), 23-32.

Dillman, D. A. (2000). Mail and internet surveys: The tailored design method (2nd ed.). New York: John Wiley & Sons.

Dunlap, W. P., Cortina, J. M., Vaslow, J. B., & Burke, M. J. (1996). Meta-analysis of experiments with matched groups or repeated measures designs. Psychological Methods, 1, 170-177.

Gay, L. R. (1996). Educational research: Competencies for analysis and application (5th ed.). Upper Saddle River, NJ: Prentice-Hall.

Gay, L. R., & Airasian, P. (2000). Educational research: Competencies for analysis and application (6th ed.). Upper Saddle River, NJ: Prentice-Hall.

Hillison, J. (1989). Using all tools available to vocational education researchers. Journal of Vocational Education Research, 15(1), 1-8.

Hinkle, D. E., & Oliver, J. D. (1983). How large should the sample be? A question with no simple answer? Or… Educational and Psychological Measurement, 43, 1050-1051.

Huberty, C. J. (1987). On statistical testing. Educational Researcher, 16, 4-9.

Huberty, C. J. (1989). Problems with stepwise methods-better alternatives. In B. Thompson (Ed.), Advances in social science methodology (Vol. 1, pp. 43-70). Greenwich, CT: JAI Press.

Huberty, C. J. (1994). Applied discriminant analysis. New York: Wiley.

Kaufman, A. S. (1998). Introduction to the special issue on statistical significance testing. Research in the Schools, 5(2), 1.

Kerlinger, F. N. (1986). Foundations of behavioral research (3rd ed.). New York: Holt, Rinehart and Winston.

Kirk, R. (1996). Practical significance: A concept whose time has come. Educational and Psychological Measurement, 56, 746-759.

Knapp, T. R. (1978). Canonical correlation analysis: A general parametric significance testing system. Psychological Bulletin, 85, 410-416.

Kotrlik, J. W. (2000). Guidelines for authors. Journal of Agricultural Education, 41(1), inside cover.

Krejcie, R. V., & Morgan, D. W. (1970). Determining sample size for research activities. Educational and Psychological Measurement, 30, 607-608.

Kupfersmid, J. (1988). Improving what is published. American Psychologist, 43, 635-642.

Lynch, K. B. (1983). Qualitative and quantitative evaluation: Two terms in search of meaning. Educational Evaluation and Policy Analysis, 5(4), 461-464.

McCabe, G. P. (1975). Computations for variable selection in discriminant analysis. Technometrics, 17, 103-109.

McCracken, J. D. (1991, December). The use and misuse of correlational and regression analysis in agricultural education research. Paper presented as the invited address at the National Agricultural Education Research meeting, Los Angeles, CA.

Meehl, P. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806-834.

Meehl, P. (1997). The problem is epistemology, not statistics: Replace significance tests by confidence intervals and quantify accuracy of risky numerical predictions. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 393-426). Mahwah, NJ: Erlbaum.

Miller, L. E., & Smith, K. L. (1983). Handling non-response issues. Journal of Extension, 21, 45-50.

Mittag, K. C. (1999). The psychometrics group instrument: Attitudes about contemporary statistical controversies. Unpublished instrument. The University of Texas at San Antonio.

Mittag, K. C., & Thompson, B. (2000). A national survey of AERA members' perceptions of statistical significance tests and other statistical issues. Educational Researcher, 29(4), 14-20.

Moore, G. E. (1992). The significance of research in vocational education: The 1992 AVERA presidential address. Journal of Vocational Education Research, 17(4), 1-4.

Nelson, N., Rosenthal, R., & Rosnow, R. L. (1986). Interpretation of significance levels and effect sizes by psychological researchers. American Psychologist, 41, 1299-1301.

Oakes, M. (1986). Statistical inference: A commentary for the social and behavioral sciences. New York: Wiley.

Olejnik, S., & Algina, J. (2000). Measures of effect size for comparative studies: Applications, interpretations, and limitations. Contemporary Educational Psychology, 25, 241-286.

Oliver, J. D. (1981). Improving agricultural education research. The Journal of American Association of Teacher Educators in Agriculture, 22(1), 9-15.

Rosenthal, R. (1994). Parametric measures of effect size. In H. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 231-244). New York: Russell Sage Foundation.

Rosenthal, R., & Gaito, J. (1963). The interpretation of level of significance by psychological researchers. Journal of Psychology, 55, 33-38.

Schmidt, F. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for the training of researchers. Psychological Methods, 1, 115-129.

Shaver, J. (1992, April). What significance testing is, and what it isn't. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.

Shulman, L. S. (1970). Reconstruction of educational research. Review of Educational Research, 40, 371-393.

Smith, B. B. (1984). Empirical-analytic research paradigm research in vocational education. Journal of Vocational Education Research, 9(4), 20-35.

Snyder, P., & Lawson, S. (1993). Evaluating results using corrected and uncorrected effect size estimates. Journal of Experimental Education, 61, 334-349.

Soltis, J. F. (1984). On the nature of educational research. Educational Researcher, 13(10), 5-10.

Thomas, H. (2000). Keeping on track to the future: The 1999 AVERA presidential address. Journal of Vocational Education Research, 25, 4-20.

Thompson, B. (1991). A primer on the logic and use of canonical correlation analysis. Measurement and Evaluation in Counseling and Development, 24(2), 80-95.

Thompson, B. (Ed.). (1993). Special issue on statistical significance with comments from various journal editors. Journal of Experimental Education, 61(4), 285-328.

Thompson, B. (1994a). Guidelines for authors. Educational and Psychological Measurement, 54, 837-847.

Thompson, B. (1994b). The concept of statistical significance testing (An ERIC/AE Clearinghouse Digest EDO-TM-94-1). Measurement Update, 4(1), 5-6. (ERIC Document Reproduction Service No. ED 366 654)

Thompson, B. (1994c). The pivotal role of replication in psychological research: Empirically evaluating the replicability of sample results. Journal of Personality, 62(2), 157-176.

Thompson, B. (1995). Stepwise regression and stepwise discriminant analysis need not apply here: A guidelines editorial. Educational and Psychological Measurement, 55, 525-534.

Thompson, B. (1996). AERA editorial policies regarding statistical significance testing: Three suggested reforms. Educational Researcher, 25(2), 26-30.

Thompson, B., & Borello, G. M. (1985). The importance of structure coefficients in regression research. Educational and Psychological Measurement, 45, 203-209.

Tyler, R. W. (1931). What is statistical significance? Educational Research Bulletin, 10, 115-118, 142.

Tyron, W. W. (1998). The inscrutable null hypothesis. American Psychologist, 53, 796.

Vacha-Haase, T., Nilsson, J. E., Reetz, O. R., Lance, T. S., & Thompson, B. (2000). Reporting practices and APA editorial policies regarding statistical significance and effect size. Theory & Psychology, 10, 413-425.

Vacha-Haase, T., & Thompson, B. (1998, August). APA editorial policies regarding statistical significance and effect size: Glacial fields move inexorably (but glacially). Paper presented at the annual meeting of the American Psychological Association, San Francisco, CA.

Vogt, P. W. (1993). Dictionary of statistics and methodology: A non-technical guide for the social sciences. Thousand Oaks, CA: Sage Publications.

Vogt, P. W. (1999). Dictionary of statistics and methodology: A non-technical guide for the social sciences (2nd ed.). Thousand Oaks, CA: Sage Publications.

Walker, H. M. (1956). Methods of research. Review of Educational Research, 26(3), 323-344.

Warmbrod, J. R. (1986). Priorities for continuing progress in research in agricultural education. Paper presented at the 35th annual Southern Region Research Conference in Agricultural Education, Little Rock, AR.

West, C. K., & Robinson, D. G. (1980). Prestigious psycho-educational research published from 1910 to 1974: Types of explanations, focus, authorship, and other concerns. Journal of Educational Research, 73(5), 271-275.

Wiersma, W. (2000). Research methods in education: An introduction (7th ed.). Needham Heights, MA: Allyn and Bacon.

Wilkinson, L., & The APA Task Force on Statistical Inference (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594-604.

Zhang, C. (1993, April). The determination of statistical sophistication of research in vocational education. Paper presented at the annual meeting of the American Educational Research Association, Atlanta, GA.

Zuckerman, M., Hodgins, H. S., Zuckerman, A., & Rosenthal, R. (1993). Contemporary issues in the analysis of data: A survey of 531 psychologists. Psychological Science, 4, 49-53.

Author

HOWARD R. D. GORDON is Professor at Marshall University, Adult and Technical Education, Harris Hall, One John Marshall Drive, Huntington, WV 25705-2460 [E-mail: gordon@Marshall.edu]. Dr. Gordon's research interests are assessment of career and technical education, quantitative research, and history of career and technical education.

Appendix

Research Instrument

Contemporary Statistical Controversies

Part I. Background Information

Instructions: Circle your answers.

1. Have you completed a doctoral degree?
a. Yes
b. No

2. What is your primary work/study setting?
a. University
b. School district
c. Business
d. Other

3. Please indicate years of experience involved with educational research.
a. 1 - 5 years
b. 6 - 10 years
c. 11 - 15 years
d. 16 years and above

4. What is your gender?
a. Male
b. Female

Part II. Perceptions of Statistical Significance Tests/Statistical Controversies

Instructions: Circle the number that most closely indicates your degree of agreement / disagreement with each item. Circle "3" if you are unsure or have no opinion.

1. Controversies regarding the use of significance tests have existed for many years in the past, and will doubtless continue for many years in the future.
Disagree   1   2   3   4   5   Agree

2. It would be better if everyone used the phrase, "statistically significant," rather than "significant," to describe the results when the null hypothesis is rejected.
Disagree   1   2   3   4   5   Agree

3. Most studies are conducted with insufficient statistical power against Type II error.
Disagree   1   2   3   4   5   Agree

4. Science would progress more rapidly if tests of significance were banned from journal articles.
Disagree   1   2   3   4   5   Agree

5. All that significance means is that the researcher rejected the null hypothesis.
Disagree   1   2   3   4   5   Agree

6. Finding that p < .05 is one indication that the results are important.
Disagree   1   2   3   4   5   Agree

7. On its face, the statement, "the reliability of the test," asserts an untruth since reliability is not a characteristic of a given test.
Disagree   1   2   3   4   5   Agree

8. Smaller and smaller values for the calculated p indicate that the results are more and more likely to be replicated in future research.
Disagree   1   2   3   4   5   Agree

9. A Type II error is impossible if the results are statistically significant.
Disagree   1   2   3   4   5   Agree

10. Statistically significant results are more noteworthy when sample sizes are small.
Disagree   1   2   3   4   5   Agree

11. Smaller p values provide direct evidence that study effects were larger.
Disagree   1   2   3   4   5   Agree

12. All statistical analyses (e.g., t-tests, ANOVA, r, R) are correlational.
Disagree   1   2   3   4   5   Agree

13. In regression and other analyses, stepwise analyses can reasonably be used to identify the best subset of predictors of a given subset size.
Disagree   1   2   3   4   5   Agree

14. If a dozen different researchers investigated the same phenomenon using the same null hypothesis, and none of the studies yielded statistically significant results, this means that the effects being investigated were not noteworthy or important.
Disagree   1   2   3   4   5   Agree

15. The p values that are calculated in a given study test the probability of the results occurring in the sample, and not the probability of results occurring in the population.
Disagree   1   2   3   4   5   Agree

16. Every null hypothesis will eventually be rejected at some sample size.
Disagree   1   2   3   4   5   Agree

17. Type I errors may be a concern when the null hypothesis is not rejected.
Disagree   1   2   3   4   5   Agree

18. Studies with non-significant results can still be very important.
Disagree   1   2   3   4   5   Agree

19. Testing the significance of a reliability or a validity coefficient with a null hypothesis that r2 = 0 is not useful or productive.
Disagree   1   2   3   4   5   Agree

20. When researchers do stepwise analyses, the order of the entry of the variables (1st, 2nd, etc.) provides one useful indication of the importance of the variables.
Disagree   1   2   3   4   5   Agree

21. Significance tests evaluate the probability that the results for the sample are the same in the population.
Disagree   1   2   3   4   5   Agree

22. It is possible to make both a Type I and Type II error in a given study.
Disagree   1   2   3   4   5   Agree

23. Poor reliability of data in a given study will tend to lower or attenuate the effect sizes that are detected.
Disagree   1   2   3   4   5   Agree

24. The p values reported in different studies cannot be readily compared, because these values are confounded with the different sample sizes across studies.
Disagree   1   2   3   4   5   Agree

25. Significance tests are partly a test of whether the researcher had a large sample.
Disagree   1   2   3   4   5   Agree

26. It is not possible to use regression to statistically test the null that means of different groups are equal.
Disagree   1   2   3   4   5   Agree

27. Unlikely results are generally more important or noteworthy.
Disagree   1   2   3   4   5   Agree

28. Reliability does not directly affect the likelihood of obtaining significance in a given study.
Disagree   1   2   3   4   5   Agree

29. Type II errors are probably fairly common within published research.
Disagree   1   2   3   4   5   Agree

Table 1
AVERA Members' General Views Regarding Statistical Testing
No.	Perception Statement/Item	M ^a	SD

1.	Controversies regarding the use of significance tests have existed for many years in the past, and will doubtless continue for many years in the future.	4.47	0.60
2.	It would be better if everyone used the phrase "statistically significant" rather than "significant," to describe the results when the null hypothesis is rejected.	4.25	0.87
3.	Most studies are conducted with insufficient statistical power against Type II error.	3.41	0.85
5.	All that significance means is that the researcher rejected the null hypothesis.	3.02	1.44
4.	Science would progress more rapidly if tests of significance were banned from journal articles.	1.70	0.88
^a Note: Response scale: 1 = disagree, 5 = agree.

Table 2
AVERA Members' Perceptions of the General Linear Model
No.	Perception Statement/Item	M	SD

26.	It is not possible to use regression to statistically test the null that means of different groups are equal.	3.70	0.88
12.	All statistical analyses (e.g., t-tests, ANOVA, r, R) are correlational.	2.37	1.17
Note: For item 26, after recoding, 1 = agree, 5 = disagree. For item 12, 1 = disagree, 5 = agree.
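The recoding mentioned in the table notes can be sketched as a simple reversal of the five-point scale, so that 1 becomes 5, 2 becomes 4, and so on. The article does not state the exact procedure, so this is an assumption; the helper name `reverse_code` and the responses are hypothetical, not the study's data.

```python
import statistics


def reverse_code(responses, low=1, high=5):
    """Reverse-score Likert responses on a low..high scale (assumed recoding)."""
    return [low + high - r for r in responses]


if __name__ == "__main__":
    raw = [1, 2, 2, 4, 5]  # hypothetical responses to a negatively worded item
    recoded = reverse_code(raw)
    # Reversal moves the mean to (low + high) - mean and leaves the SD unchanged.
    print("raw mean:", statistics.mean(raw), "SD:", round(statistics.stdev(raw), 2))
    print("recoded mean:", statistics.mean(recoded), "SD:", round(statistics.stdev(recoded), 2))
```

Under this assumption, a recoded item mean can be read on the same footing as the unrecoded items in a table, while its standard deviation is unaffected.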

Table 3
AVERA Members' Perceptions of Stepwise Methods
No.	Perception Statement/Item	M ^a	SD

13.	In regression and other analyses, stepwise methods can reasonably be used to identify the best subset of predictors of a given subset size.	3.55	0.95
20.	When researchers do stepwise analyses, the order of the entry of the variables (1st, 2nd, etc.) provides one useful indication of the importance of the variables.	3.47	0.98
^a Note: Response scale: 1 = disagree, 5 = agree.

Table 4
AVERA Members' Perceptions of Score Reliability
No.	Perception Statement/Item	M	SD

23.	Poor reliability of data in a given study will tend to lower or attenuate the effect sizes that are detected.	3.62	1.12
28.	Reliability does not directly affect the likelihood of obtaining significance in a given study.	3.45	1.21
7.	On its face the statement, "the reliability of the test," asserts an untruth, since reliability is not a characteristic of a given test.	2.85	1.18
19.	Testing the significance of a reliability or a validity coefficient with a null hypothesis that r2 = 0 is not useful or productive.	2.80	0.99
Note: For items 7, 19, and 23, 1 = disagree, 5 = agree. For item 28, after recoding, 1 = agree, 5 = disagree.

Table 5
AVERA Members' Perceptions of Type I and II Errors
No.	Perception Statement/Item	M	SD

22.	It is possible to make both a Type I and Type II error in a given study.	3.37	1.21
17.	Type I errors may be a concern when the null hypothesis is not rejected.	2.72	1.19
29.	Type II errors are probably fairly common within published research.	2.52	1.01
9.	A Type II error is impossible if the results are statistically significant.	2.27	1.10
Note: For items 17, 22, and 29, after recoding, 1 = agree, 5 = disagree. For item 9, 1 = disagree, 5 = agree.

Table 6
AVERA Members' Perceptions of Sample Size Influences
No.	Perception Statement/Item	M ^a	SD

16.	Every null hypothesis will eventually be rejected at some sample size.	3.15	1.29
25.	Significance tests are partly a test of whether the researcher had a large sample.	2.87	1.18
10.	Statistically significant results are more noteworthy when sample sizes are small.	2.37	1.31
^a Note: Response scale: 1 = disagree, 5 = agree.

Table 7
AVERA Members' Perceptions of Effect Sizes
No.	Perception Statement/Item	M ^a	SD

14.	If a dozen different researchers investigated the same phenomenon using the same null hypothesis, and none of the studies yielded statistically significant results, this means that the effects being investigated were not noteworthy or important.	3.82	1.19
11.	Smaller p values provide direct evidence that study effects were larger.	3.27	1.17
24.	The p values reported in different studies cannot be readily compared, because these values are confounded with different sample sizes across studies.	3.15	1.23
^a Note: For items 11 and 14, after recoding, 1 = agree, 5 = disagree. For item 24, 1 = disagree, 5 = agree.

Table 8
AVERA Members' Perceptions of p Values
No.	Perception Statement/Item	M ^a	SD

27.	Unlikely results are generally more important or noteworthy.	3.50	1.06
6.	Finding that p < .05 is one indication that the results are important.	2.80	1.41
18.	Studies with non-significant results can still be very important.	1.45	1.19
^a Note: After recoding, 1 = agree, 5 = disagree.

Table 9
AVERA Members' Perceptions of p as Replicability Evidence
No.	Perception Statement/Item	M ^a	SD

8.	Smaller and smaller values for the calculated p indicate that the results are more likely to be replicated in future research.	3.05	1.21
15.	The p values that are calculated in a given study test the probability of the results occurring in the sample, and not the probability of results occurring in the population.	2.82	1.33
21.	Significance tests evaluate the probability that the results for the sample are the same in the population.	2.22	1.09
^a Note: For items 8 and 21, after recoding, 1 = agree, 5 = disagree. For item 15, 1 = disagree, 5 = agree.
