Assessing the Validity of Standard-Setting for an English Language Assessment With a Hybrid Expert and Empirical Performance Model
Author: Jin-Chang Hsieh (Research Center for Testing and Assessment, National Academy for Educational Research)
Vol. & No.: Vol. 68, No. 2
Date: June 2023
Pages: 1-35
DOI: https://doi.org/10.6209/JORIES.202306_68(2).0001
Abstract:
Background and Purpose
The Taiwan Assessment of Student Achievement: Longitudinal Study (TASAL) was implemented to evaluate the effect of the new 12-year basic education curriculum on student performance in Taiwan. TASAL is a standards-based, large-scale assessment that aims to track the literacy growth of Taiwanese students, explore relevant factors, and collect empirical evidence to assist in the development of future curriculum guidelines. This study assessed the validity of standard-setting with a hybrid model that combines expert judgment with students' empirical performance.
The hybrid model exhibits multidimensional, multisource, and long-term cumulative features. The multidimensional feature provides evidence for procedural, internal, and external validity and supports the setting of appropriate standards (Kane, 1994, 2001; Pant et al., 2009). The multisource feature indicates that validity evidence is derived from various sources, such as expert opinions and students' empirical performance. Finally, the long-term cumulative feature represents the process of accumulating evidence over an extended period. Presenting every type of evidence in a single study is challenging because of time and resource constraints, and the burden placed on researchers and students must also be considered.
Method
1. Sampling
In 2019, the evaluation of seventh-grade students was formally initiated in TASAL; in 2020, the same cohort of students, then in the eighth grade, was evaluated again. The sampling method was stratified two-stage cluster sampling. Initially, 256 junior high schools were selected to take part in the evaluation; ultimately, 246 schools with a total of 2,793 students were enrolled in the project. Regarding the TASAL English test, 2,793 seventh-grade students took the test in 2019, and 2,893 eighth-grade students took it in 2020; of the eighth-grade students, 2,554 took the English test in both years.
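To make the sampling design concrete, the following is a minimal sketch of stratified two-stage cluster sampling in base R. The strata, school counts, and roster sizes are illustrative placeholders, not the actual TASAL sampling frame.

```r
set.seed(1)

# Illustrative frame: 800 schools, each assigned to one stratum
# (e.g., an urbanization level); not the real TASAL frame.
frame <- data.frame(
  school  = sprintf("S%03d", 1:800),
  stratum = sample(c("urban", "town", "rural"), 800, replace = TRUE)
)

# Stage 1: sample schools (primary sampling units) within each stratum.
n_per_stratum <- c(urban = 40, town = 30, rural = 15)
schools <- do.call(rbind, lapply(names(n_per_stratum), function(s) {
  pool <- frame[frame$stratum == s, ]
  pool[sample(nrow(pool), n_per_stratum[[s]]), ]
}))

# Stage 2: within each selected school, sample a cluster of 12 students
# from an illustrative roster of 30.
students <- do.call(rbind, lapply(schools$school, function(sc) {
  data.frame(school = sc, student = sample(sprintf("%s-%02d", sc, 1:30), 12))
}))
nrow(students)  # total number of sampled students
```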
2. Materials
The TASAL English core competence assessment was developed through a standardized procedure comprising purpose clarification, theory construction, assessment guideline formulation, performance level descriptor development, test item design, test assembly, and data analysis. The assessment examines English reading comprehension in accordance with the corresponding content of the 12-year basic education curriculum. Drawing on Anderson et al.'s (2001) approach of mapping verb-noun statements onto cognitive processes and content knowledge, a dedicated set of assessment criteria and test items was developed for the TASAL English core competence assessment to evaluate reading comprehension.
In the TASAL English core competence assessment, six levels of performance descriptors were initially proposed (Hsieh, 2023). However, no test items corresponded to the sixth (highest) level because the standard-setting process still focused on the seventh-grade test items. This study therefore focused on the first five levels: acquiring linguistic fluency, locating explicitly stated information, literal comprehension, implicit comprehension, and evaluation and reflection beyond text comprehension. Based on a review of the literature, the text types used in the TASAL English core competence assessment follow the OECD (2019) text types, modified to include descriptive, instructive, transactional, expository, commentary, persuasive, narrative, and literary texts. The assessment for seventh-grade students contained 182 test items, and the assessment for eighth-grade students contained 196 test items; 84 common items were included in both assessments. Response consistency was good: the expected a posteriori (EAP) reliability estimates were 0.85 and 0.91 for the seventh-grade and eighth-grade assessments, respectively.
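As a rough illustration of how such reliability figures are obtained, the sketch below fits a Rasch model to simulated dichotomous responses with the TAM package (Robitzsch et al., 2020) and reads off the EAP reliability. The sample size, item count, and data are invented for the example and do not reproduce the study's item pools.

```r
library(TAM)
set.seed(1)

n <- 500; k <- 40                         # illustrative sample and test sizes
theta <- rnorm(n)                         # simulated latent abilities
b     <- rnorm(k)                         # simulated item difficulties
prob  <- plogis(outer(theta, b, "-"))     # Rasch probability of a correct response
resp  <- as.data.frame((matrix(runif(n * k), n, k) < prob) * 1)

mod <- tam.mml(resp = resp)               # fit a Rasch (1PL) model
mod$EAP.rel                               # EAP reliability (0.85 and 0.91 in the study)
```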
3. Standard-setting
This study employed the extended Angoff method (Hambleton & Plake, 1995) to establish the assessment standards. A total of 15 experts from various regions of Taiwan were trained and participated in the standard-setting meeting; 10 were women and 5 were men, with an average of 18.25 years of teaching experience.
The standard-setting meeting was conducted in three rounds, and student abilities and cutoff scores were estimated through weighted likelihood estimation (Warm, 1989). Statistical analyses were performed in R (R Core Team, 2022) with the TAM package (Robitzsch et al., 2020).
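The sketch below illustrates this scoring step with TAM: weighted likelihood estimates of ability, followed by one common way (among the approaches reviewed by Wyse, 2017) of translating a summed extended-Angoff rating into a theta-scale cut score through the test characteristic curve. The response data, item difficulties, and the rating value are simulated assumptions, not the study's figures.

```r
library(TAM)
set.seed(1)

# Simulated Rasch data, as in the previous sketch.
theta <- rnorm(500); b_true <- rnorm(40)
resp  <- as.data.frame(
  (matrix(runif(500 * 40), 500, 40) < plogis(outer(theta, b_true, "-"))) * 1
)
mod <- tam.mml(resp = resp)

wle <- tam.wle(mod)               # Warm's (1989) weighted likelihood estimates
head(wle[, c("theta", "error")])  # per-student ability estimate and standard error

# Extended Angoff: each expert states the expected item score of a minimally
# competent student; ratings are averaged over experts and summed over items.
b_hat      <- mod$xsi$xsi                      # estimated item difficulties
angoff_sum <- 26.5                             # illustrative summed expected score
tcc <- function(t) sum(plogis(t - b_hat))      # test characteristic curve
cut_theta <- uniroot(function(t) tcc(t) - angoff_sum, c(-6, 6))$root
cut_theta                                      # cut score on the theta scale
```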
Results and Conclusion
Feedback was collected using a questionnaire on the standard-setting process. Most experts rated the process and outcome of the standard-setting meeting as above or well above average, and they agreed or strongly agreed that the feedback provided and the performance level descriptor (PLD) procedures were helpful in establishing the standards. In summary, this study provides satisfactory evidence for the procedural validity of the standard-setting.
This study also provides evidence for the internal validity of the standard-setting. In the initial round, the standard errors of the cutoff scores across all experts and levels ranged from 2.03 to 11.58; these errors decreased in subsequent rounds. In general, most standard errors fell within the acceptable criterion of 0.33 relative to the test's measurement error of 34.64, which is consistent with the results of Kaftandjieva (2010, p. 104).
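A minimal sketch of this internal-validity check follows, using invented expert cut scores and one common definition of the cut-score standard error (the standard deviation of the experts' cuts divided by the square root of the number of experts); the study's own computation may differ in detail.

```r
# Fifteen illustrative expert cut scores (scale-score units), not study data.
cut_by_expert <- c(512, 498, 505, 509, 501, 496, 515, 503,
                   507, 499, 511, 504, 500, 508, 506)
sem_test <- 34.64                 # test measurement error reported in the study

se_cut <- sd(cut_by_expert) / sqrt(length(cut_by_expert))
ratio  <- se_cut / sem_test       # Kaftandjieva's (2010) criterion: ratio <= 0.33
round(c(se_cut = se_cut, ratio = ratio), 3)
```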
Using the English comprehension performance of the eighth-grade students as the external criterion, the cutoff scores set on the seventh-grade assessment effectively and significantly distinguished between achievement levels. A partial η² of .506 was obtained, indicating a large effect size according to Cohen (1988). In conclusion, this study provides evidence for the external validity of the standard-setting.
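The sketch below reproduces the form of this external-validity analysis on simulated data: a one-way ANOVA of eighth-grade scores across the five seventh-grade performance levels, summarized by partial η² (which equals η² in a one-way design). The level effect and noise level are arbitrary assumptions.

```r
set.seed(1)
n <- 2554                                                 # students tested in both years
level  <- factor(sample(1:5, n, replace = TRUE))          # assigned 7th-grade level
score8 <- 450 + 25 * as.numeric(level) + rnorm(n, 0, 35)  # simulated 8th-grade score

fit <- aov(score8 ~ level)
ss  <- summary(fit)[[1]][["Sum Sq"]]
partial_eta2 <- ss[1] / sum(ss)   # SS_effect / (SS_effect + SS_error)
partial_eta2                      # the study reports .506 on the real data
```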
In summary, several suggestions are offered on the basis of the results. For example, when changes in student performance are evaluated, regression toward the mean may be a crucial factor affecting standard-setting results during the vertical articulation of cutoff scores across grades, as illustrated in the sketch below. In addition, continuously collecting evidence to support the validity of standard-setting is crucial for responding to educational policies and curriculum guidelines. The results therefore underscore the importance of accumulating ongoing validity evidence in future research.
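A small simulation makes the regression-toward-the-mean caveat concrete: when the year-to-year correlation is below 1, students classified at an extreme level in grade 7 tend to score closer to the mean in grade 8 even without any real change in ability. The correlation value and group definition here are assumptions for illustration.

```r
set.seed(1)
r  <- 0.7                                    # assumed year-to-year correlation
y7 <- rnorm(2554)                            # standardized grade-7 scores
y8 <- r * y7 + sqrt(1 - r^2) * rnorm(2554)   # grade-8 scores, same marginal distribution

top <- y7 > quantile(y7, 0.9)                # top decile in grade 7
c(mean_grade7 = mean(y7[top]), mean_grade8 = mean(y8[top]))
# The extreme group's grade-8 mean shrinks toward 0 with no true change,
# so vertically articulated cut scores must account for this shrinkage.
```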
Keywords: English comprehension, hybrid of expert and student empirical performance models, Taiwan Assessment of Student Achievement: Longitudinal Study, standards-based large-scale assessment, standard setting
References:
- 任宗浩(2018)。十二年國民基本教育實施成效評估—臺灣學生成就長期追蹤評量計畫(第一期)(總計畫)(NAER-107-12-B-1-01-00-1-02)。國家教育研究院。【Jen, T.-H. (2018). Effectiveness of 12-year basic education program: A longitudinal study on Taiwan Assessment of Student Achievement (TASA-L) (I) (NAER-107-12-B-1-01-00-1-02). National Academy for Educational Research.】
- 吳正新(2019)。長期追蹤調查抽樣技術與權重校正(NAER-2019-113-A-1-1-E1-03)。國家教育研究院。【Wu, J.-S. (2019). Sampling design and weighting adjustment of large-scale surveys (NAER-2019-113-A-1-1-E1-03). National Academy for Educational Research.】
- 侯佩君、杜素豪、廖培珊、洪永泰、章英華(2008)。台灣鄉鎮市區類型之研究:「台灣社會變遷基本調查」第五期計畫之抽樣分層效果分析。調查研究─方法與應用,23,7-32。https://doi.org/10.7014/TCYCFFYYY.200804.0007 【Hou, P.-C., Tu, S.-H., Liao, P.-S., Hung, Y.-T., & Chang, Y.-H. (2008). The typology of townships in Taiwan: The analysis of sampling stratification of the 2005-2006 Taiwan social change survey. Survey Research: Method and Application, 23, 7-32. https://doi.org/10.7014/TCYCFFYYY.200804.0007】
- 國家教育研究院(2018)。十二年國民基本教育課程綱要:國民中小學暨普通型高級中等學校:語文領域─英語文。作者。【National Academy for Educational Research. (2018). 12-year basic education curriculum for elementary and high school: English. Author.】
- 國家教育研究院(無日期)。首頁。臺灣學生成就長期追蹤評量計畫。2022年3月30日,https://tasal.naer.edu.tw/【National Academy for Educational Research. (n.d.). Homepage. Taiwan Assessment of Student Achievement: Longitudinal Study. Retrieved March 30, 2022, from https://tasal.naer.edu.tw/】
- 國家教育研究院課程及教學研究中心核心素養工作圈(2015)。十二年國民基本教育領域課程綱要─核心素養發展手冊。國家教育研究院。【Task Force of Core Competence Development, Research Center for Curriculum and Instruction, National Academy for Educational Research. (2015). Curriculum guidelines of 12-year basic education: Core competence development manual. National Academy for Educational Research.】
- 張銘秋、黃瓅瑩、陳佳蓉、陳柏熹、曾芬蘭(2022)。國中教育會考數學科的回沖效應初探。教育科學研究期刊,67(1),227-254。https://doi.org/10.6209/JORIES.202203_67(1).0008 【Chang, M.-C., Huang, L.-Y., Chen, C.-J., Chen, P.-H., & Tseng, F.-L. (2022). Washback effect from changes to comprehensive assessment program math test for junior high school students. Journal of Research in Education Sciences, 67(1), 227-254. https://doi.org/10.6209/JORIES.202203_67(1).0008】
- 教育部統計處(2019)。各級學校地理資訊及地區別統計查詢。https://stats.moe.gov.tw/EduGis/ 【Department of Statistics, Ministry of Education. (2019). Statistical query of geographical and regional information for schools at all levels. https://stats.moe.gov.tw/EduGis/】
- 曾芬蘭、林奕宏、邱佳民(2017)。監控評分者效果的Yes/No Angoff標準設定法之效度檢核:以國中教育會考數學科為例。測驗學刊,64(4),403-432。【Tseng, F.-L., Lin, Y.-H., & Chiou, J.-M. (2017). Validation of the rater-effects-monitored yes/no Angoff standard-setting method: Using the Taiwan comprehensive assessment program for junior high school students math exam as an example. Psychological Testing, 64(4), 403-432.】
- 黃馨瑩、謝名娟、謝進昌(2013)。臺灣學生學習成就評量英語科標準設定之效度評估研究。教育與心理研究,36(2),87-112。https://doi.org/10.3966/102498852013063602004【Huang, H.-Y., Hsieh, M.-C., & Hsieh, J.-C. (2013). Validating the yes/no Angoff standard setting procedure on Taiwan assessment of student achievement English test. Journal of Education & Psychology, 36(2), 87-112. https://doi.org/10.3966/102498852013063602004】
- 謝進昌(2021)。以「補充性表現水平描述輔助自陳式測量構念」之延伸Angoff標準設定研究。教育心理學報,53(2),307-334。https://doi.org/10.6251/BEP.202112_53(2).0003 【Hsieh, J.-C. (2021). Extended Angoff method in setting standards for self-report measures with supplementary performance level descriptors. Bulletin of Educational Psychology, 53(2), 307-334. https://doi.org/10.6251/BEP.202112_53(2).0003】
- 謝進昌(2023)。建構英語文素養評量指引:TASAL標準本位大型評量。國家教育研究院。【Hsieh, J.-C. (2023). Constructing English core competence assessment guidelines: TASAL standard-based large-scale assessment. National Academy for Educational Research.】
- 謝進昌、謝名娟、林世華、林陳涌、陳清溪、謝佩蓉(2011)。大型資料庫國小四年級自然科學習成就評量標準設定結果之效度評估。教育科學研究期刊,56(1),1-32。https://doi.org/10.3966/2073753X2011035601001【Hsieh, J.-C., Hsieh, M.-C., Lin, S.-H., Lin, C.-Y., Chen, C.-H., & Hsieh, P.-J. (2011). Validation of the standard setting procedure for a large scale 4th grade science assessment. Journal of Research in Education Sciences, 56(1), 1-32. https://doi.org/10.3966/2073753X2011035601001】
- American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association.
- Anderson, L. W., Krathwohl, D. R., Airasian, P. W., Cruikshank, K. A., Mayer, R. E., Pintrich, P. R., Raths, J., & Wittrock, M. C. (Eds.). (2001). A taxonomy for learning, teaching, and assessing: A revision of Bloom’s taxonomy of educational objectives. Longman.
- Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (pp. 508-600). American Council on Education.
- Beaton, A. E., & Allen, N. L. (1992). Interpreting scales through scale anchoring. Journal of Educational Statistics, 17(2), 191-201. https://doi.org/10.3102/10769986017002191
- Cizek, G. J., & Bunch, M. B. (2007). Standard setting: A guide to establishing and evaluating performance standards on tests. Sage.
- Cizek, G. J., Bunch, M. B., & Koons, H. (2004). Setting performance standards: Contemporary methods. Educational Measurement: Issues and Practice, 23(4), 31-50. https://doi.org/10.1111/j.1745-3992.2004.tb00166.x
- Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Lawrence Erlbaum Associates.
- Efron, B. (1979). Bootstrap methods: Another look at the Jackknife. The Annals of Statistics, 7(1), 1-26. https://doi.org/10.1214/aos/1176344552
- Ferrara, S., Johnson, E., & Chen, W. H. (2005). Vertically articulated performance standards: Logic, procedures, and likely classification accuracy. Applied Measurement in Education, 18(1), 35-59. https://doi.org/10.1207/s15324818ame1801_3
- Gagne, E. D., Yekovich, C. W., & Yekovich, F. R. (1993). The cognitive psychology of school learning. Harper Collins.
- Groves, R. M., Fowler, F. J., Couper, M. P., Lepkowski, J. M., Singer, E., & Tourangeau, R. (2011). Survey methodology (2nd ed.). John Wiley & Sons.
- Hambleton, R. K. (2001). Setting performance standards on educational assessments and criteria for evaluating the process. In G. J. Cizek (Ed.), Standard setting: Concepts, methods, and perspectives (pp. 89-116). Erlbaum.
- Hambleton, R. K., & Plake, B. S. (1995). Using an extended Angoff procedure to set standards on complex performance assessments. Applied Measurement in Education, 8(1), 41-55. https://doi.org/10.1207/s15324818ame0801_4
- Hoover, W. A., & Gough, P. B. (1990). The simple view of reading. Reading and Writing: An Interdisciplinary Journal, 2, 127-160. https://doi.org/10.1007/BF00401799
- Ingels, S. J., Pratt, D. J., Rogers, J. E., Siegel, P. H., & Stutts, E. S. (2005). Education longitudinal study of 2002: Base-year to first follow-up data file documentation (NCES 2006-344). National Center for Education Statistics.
- Kaftandjieva, F. (2010). Methods for setting cut scores in criterion-referenced achievement tests: A comparative analysis of six recent methods with an application to tests of reading in EFL. CITO. https://www.ealta.eu.org/documents/resources/FK_second_doctorate.pdf
- Kane, M. T. (1994). Validating the performance standards associated with passing scores. Review of Educational Research, 64(3), 425-461. https://doi.org/10.2307/1170678
- Kane, M. T. (2001). So much remains the same: Conception and status of validation in setting standards. In G. J. Cizek (Ed.), Standard setting: Concepts, methods, and perspectives (pp. 53-88). Erlbaum.
- Kelly, D. L. (1999). Interpreting the Third International Mathematics and Science Study (TIMSS) achievement scales using scale anchoring [Unpublished doctoral dissertation]. Boston College.
- Kolen, M. J., & Brennan, R. L. (2004). Test equating, scaling, and linking: Methods and practices (2nd ed.). Springer.
- LaRoche, S., Joncas, M., & Foy, P. (2020). Sample design in TIMSS 2019. In M. O. Martin, M. von Davier, & I. V. S. Mullis (Eds.), Methods and procedures: TIMSS 2019 technical report (pp. 3.1-3.33). https://timssandpirls.bc.edu/timss2019/methods/chapter-3.html
- Linacre, J. M. (2005). A user’s guide to Winsteps/Ministeps Rasch model programs. MESA Press.
- Linn, R. L., & Herman, J. L. (1997). A policymaker’s guide to standards-led assessment. The Education Commission of the States.
- Masters, G. N., & Wright, B. D. (1997). The partial credit model. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 101-121). Springer.
- Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), 13-23. https://doi.org/10.3102/0013189X023002013
- Mullis, I. V. S., & Martin, M. O. (2015). PIRLS 2016 assessment framework (2nd ed.). TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College and International Association for the Evaluation of Educational Achievement.
- Mullis, I. V. S., & Prendergast, C. O. (2017). Using scale anchoring to interpret the PIRLS and ePIRLS 2016 achievement scales. In M. O. Martin, I. V. S. Mullis, & M. Hooper (Eds.), Methods and procedures in PIRLS 2016 (pp. 13.1-13.23). Boston College, TIMSS & PIRLS International Study Center.
- Mullis, I. V. S., Cotter, K. E., Centurino, V. A. S., Fishbein, B. G., & Liu, J. (2016). Using scale anchoring to interpret the TIMSS 2015 achievement scales. In M. O. Martin, I. V. S. Mullis, & M. Hooper (Eds.), Methods and procedures in TIMSS 2015 (pp. 14.1-14.47). Boston College, TIMSS & PIRLS International Study Center.
- Nassif, P. M. (1978, March 27-April 1). Standard setting for criterion referenced teacher licensing tests [Paper presentation]. The annual meeting of the National Council on Measurement in Education, Toronto, Canada.
- Organisation for Economic Co-operation and Development. (2019). PISA 2018 assessment and analytical framework. OECD Publishing. https://doi.org/10.1787/b25efab8-en
- Organisation for Economic Co-operation and Development. (2020). PISA 2018 technical report. OECD Publishing. https://www.oecd.org/pisa/data/pisa2018technicalreport/
- Pant, H. A., Rupp, A. A., Tiffin-Richards, S. P., & Köller, O. (2009). Validity issues in standard-setting studies. Studies in Educational Evaluation, 35(2-3), 95-101. https://doi.org/10.1016/j.stueduc.2009.10.008
- Plake, B. S., & Cizek, G. J. (2012). The modified Angoff, extended Angoff, and yes/no standard setting methods. In G. J. Cizek (Ed.), Setting performance standards. Foundations, methods, and innovations (pp. 181-253). Routledge.
- R Core Team. (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
- Robitzsch, A., Kiefer, T., & Wu, M. (2020). TAM: Test analysis modules. R package version 3.5-19. https://cran.r-project.org/web/packages/TAM/index.html
- Schafer, W. D. (2006). Growth scales as an alternative to vertical scales. Practical Assessment Research & Evaluation, 11(4). https://doi.org/10.7275/xjkz-7n67
- Sireci, S. G., Hauger, J. B., Wells, C. S., Shea, C., & Zenisky, A. L. (2009). Evaluation of the standard setting on the 2005 grade 12 national assessment of educational progress mathematics test. Applied Measurement in Education, 22(4), 339-358. https://doi.org/10.1080/08957340903221659
- Warm, T. A. (1989). Weighted likelihood estimation of ability in item response theory. Psychometrika, 54, 427-450. https://doi.org/10.1007/BF02294627
- Wixson, K. K., Valencia, S. W., Murphy, S., & Phillips, G. W. (2013). A study of NAEP reading and writing frameworks and assessments in relation to the common core state standards in English language art (ED545239). ERIC. https://files.eric.ed.gov/fulltext/ED545239.pdf
- Wyse, A. E. (2017). Five methods for estimating Angoff cut scores with IRT. Educational Measurement: Issues and Practice, 36(4), 16-27. https://doi.org/10.1111/emip.12161
- Wyse, A. E. (2018). Equating Angoff standard-setting ratings with the Rasch model. Measurement: Interdisciplinary Research and Perspectives, 16(3), 181-194. https://doi.org/10.1080/15366367.2018.1483170