Journal directory listing - Volume 68 (2023) - Journal of Research in Education Sciences, 68(2), June issue

Assessing the Validity of Standard-Setting for an English Language Assessment With a Hybrid Expert and Empirical Performance Model
Author: Jin-Chang Hsieh (Research Center for Testing and Assessment, National Academy for Educational Research)

Vol. & No.: Vol. 68, No. 2
Date: June 2023
Pages: 1-35
DOI: https://doi.org/10.6209/JORIES.202306_68(2).0001

Abstract:
To evaluate the effect of the 12-Year Basic Education Curriculum Guidelines on student performance, the Taiwan Assessment of Student Achievement: Longitudinal Study (TASAL) was launched to track the growth of Taiwanese students' literacy, investigate influencing factors, and provide feedback for the national curriculum guidelines. TASAL is a standards-based, large-scale educational assessment. Taking the perspective of the overall standard-development process, this study proposes a "hybrid expert and student empirical performance model" that accumulates multidimensional, multisource evidence of procedural, internal, external, and consequential (indirectly estimated) validity to address the validity of the standard-setting results for English at the fourth learning stage. While developing the assessment instruments through a standardized procedure, the researchers progressively incorporated the elements of standard setting, including theory construction, development of performance level descriptors, standard-setting materials and test items, and related large-scale assessment techniques. Evaluation questionnaires were collected from 15 expert panelists to examine the reasonableness of the standard-setting process and results; most panelists affirmed their appropriateness. Through feedback, discussion, and reflection among panelists, the study also found that item judgments became more consistent as rounds progressed, and classification errors mostly fell within a reasonable range. In addition, using the English comprehension performance of students advancing from seventh to eighth grade as an external criterion, the established cutoff scores effectively distinguished students at different levels on that criterion. Overall, at the standard-formation (standard-setting) stage, this study obtained satisfactory procedural, internal, external, and consequential (indirectly estimated) validity evidence. Finally, suggestions are provided for future reference.

Keywords: English comprehension, hybrid expert and student empirical performance validity evaluation, Taiwan Assessment of Student Achievement: Longitudinal Study, standards-based large-scale assessment, standard setting


Chinese APA Format: 謝進昌(2023)。「混合專家與學生實徵表現導向」大型教育評量標準設定之效度評估研究。教育科學研究期刊,68(2),1-35。https://doi.org/10.6209/JORIES.202306_68(2).0001
APA Format: Hsieh, J.-C. (2023). Assessing the validity of standard-setting for an English language assessment with a hybrid expert and empirical performance model. Journal of Research in Education Sciences, 68(2), 1-35. https://doi.org/10.6209/JORIES.202306_68(2).0001

Journal directory listing - Volume 68 (2023) - Journal of Research in Education Sciences, 68(2), June issue

Assessing the Validity of Standard-Setting for an English Language Assessment With a Hybrid Expert and Empirical Performance Model
Author: Jin-Chang Hsieh (Research Center for Testing and Assessment, National Academy for Educational Research)

Vol. & No.: Vol. 68, No. 2
Date: June 2023
Pages: 1-35
DOI: https://doi.org/10.6209/JORIES.202306_68(2).0001

Abstract:
Background and Purpose
The Taiwan Assessment of Student Achievement: Longitudinal Study (TASAL) was implemented to evaluate the effect of the new 12-year basic education curriculum on student performance in Taiwan. TASAL is a standards-based, large-scale assessment that aims to track the literacy growth of Taiwanese students, explore relevant factors, and collect empirical evidence to assist in the development of future curriculum guidelines. This study assessed the validity of standard-setting with a hybrid model combining expert and student empirical performance.
The hybrid model exhibits multidimensional, multisource, and long-term cumulative features. The multidimensional feature provides evidence for procedural, internal, and external validity and for setting appropriate standards (Kane, 1994, 2001; Pant et al., 2009). The multisource feature indicates that the evidence of validity is derived from various sources, such as expert opinions and students' empirical performance. Finally, the long-term cumulative feature represents the process of accumulating evidence over a long period. Because presenting every type of evidence in a single study is challenging given time and resource constraints, the burden placed on researchers and students must also be considered.
Method
1. Sampling
In 2019, TASAL formally began evaluating seventh-grade students; in 2020, the same cohort was evaluated again in the eighth grade. Stratified two-stage cluster sampling was used: 256 junior high schools were initially selected, and 246 schools with a total of 2,793 students were ultimately enrolled. For the TASAL English test, 2,793 seventh-grade students took the test in 2019 and 2,893 eighth-grade students took it in 2020; 2,554 of the eighth-grade students took the English test in both years.
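The stratified two-stage design described above can be sketched as follows. This is a minimal illustration only: the data layout, stratum labels, and function names are hypothetical, and the sketch omits TASAL's actual stratification variables and any probability-proportional-to-size selection.

```python
import random

def two_stage_cluster_sample(schools, n_schools_per_stratum, n_students_per_school, seed=0):
    """Stage 1: sample schools within each stratum; stage 2: sample students within each school."""
    rng = random.Random(seed)
    strata = {}
    for school in schools:                       # group schools by stratum
        strata.setdefault(school["stratum"], []).append(school)
    sampled = []
    for _, members in sorted(strata.items()):
        chosen = rng.sample(members, min(n_schools_per_stratum, len(members)))
        for school in chosen:                    # stage 2: students within sampled schools
            k = min(n_students_per_school, len(school["students"]))
            sampled.extend(rng.sample(school["students"], k))
    return sampled

# toy frame: 6 schools in 2 strata, 40 students each
schools = [
    {"stratum": "urban" if i < 3 else "rural",
     "students": [f"s{i}_{j}" for j in range(40)]}
    for i in range(6)
]
sample = two_stage_cluster_sample(schools, n_schools_per_stratum=2, n_students_per_school=10)
print(len(sample))  # 2 strata × 2 schools × 10 students = 40
```

Clustering by school keeps fieldwork practical, while stratification controls the school mix; in a real design, sampling weights would then correct for unequal selection probabilities.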
2. Materials
The TASAL English core competence assessment was developed through a standardized procedure, including purpose clarification, theory construction, assessment guidelines, performance level descriptor development, test item design, test assembly, and data analysis. The assessment examines English reading comprehension according to the corresponding content in the 12-year basic education curriculum. Based on Anderson et al.'s (2001) concept of transforming verb-noun usage into cognitive processes and content knowledge, a separate set of assessment criteria and test items was developed for the TASAL English core competence assessment to evaluate reading comprehension.
In the TASAL English core competence assessment, six levels of performance descriptors were initially proposed (Hsieh, 2023). However, no test items were available for the sixth (highest) level because the standard-setting process still focused on the seventh-grade test items. Therefore, this study focused on the first five levels: acquiring linguistic fluency, locating explicitly stated information, literal comprehension, implicit comprehension, and evaluation and reflection beyond text comprehension. Following a literature review, the text types used in the TASAL English core competence assessment were adapted from the OECD (2019) typology to include descriptive, introductive, transactional, expository, commentary, persuasive, narrative, and literary texts. The assessment for seventh-grade students contained 182 test items, and the assessment for eighth-grade students contained 196 test items; 84 common items were included in both assessments. Response consistency was good: the expected a posteriori (EAP) reliability estimates were 0.85 and 0.91 for the seventh-grade and eighth-grade assessments, respectively.
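Assuming the reported EAP values are reliability coefficients, a common way to compute them is the variance of the EAP ability estimates relative to that variance plus the mean posterior variance. A minimal sketch with hypothetical estimates (not the study's data):

```python
def eap_reliability(eap_estimates, posterior_sds):
    """EAP reliability: variance of EAP ability estimates divided by
    (EAP variance + mean posterior variance)."""
    n = len(eap_estimates)
    mean = sum(eap_estimates) / n
    var_eap = sum((x - mean) ** 2 for x in eap_estimates) / n
    mean_post_var = sum(sd ** 2 for sd in posterior_sds) / n
    return var_eap / (var_eap + mean_post_var)

# hypothetical EAP estimates and posterior SDs for five examinees
rel = eap_reliability([-1.2, -0.4, 0.1, 0.8, 1.5], [0.35] * 5)
print(round(rel, 3))
```

Intuitively, reliability approaches 1 when the spread of ability estimates dwarfs the average uncertainty of each individual estimate.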
3. Standard-setting
This study employed the extended Angoff method (Hambleton & Plake, 1995) to establish assessment standards. A total of 15 experts from various regions in Taiwan were trained and participated in the standard-setting meeting. Among these experts, 10 were women and 5 were men, with an average teaching experience of 18.25 years.
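In the Angoff tradition, each panelist judges, item by item, the expected score of a borderline examinee; a panelist's cutoff is the sum of those judgments, and the panel cutoff aggregates across panelists. A minimal sketch with hypothetical ratings (not the study's data, and simplified relative to the extended method's round structure):

```python
def angoff_cutoff(ratings):
    """ratings[e][i]: expert e's expected score on item i for a borderline examinee.
    Each expert's cutoff is the sum over items; the panel cutoff is the mean across experts."""
    expert_cutoffs = [sum(expert) for expert in ratings]
    panel_cutoff = sum(expert_cutoffs) / len(expert_cutoffs)
    return panel_cutoff, expert_cutoffs

# three hypothetical experts rating four items
ratings = [
    [0.8, 0.6, 0.4, 0.9],
    [0.7, 0.5, 0.5, 0.8],
    [0.9, 0.6, 0.3, 0.7],
]
cutoff, per_expert = angoff_cutoff(ratings)
print(round(cutoff, 2))  # 2.57
```

The spread of `per_expert` values across rounds is what the consistency analyses below summarize: as panelists receive feedback and discuss, their item judgments, and hence their individual cutoffs, converge.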
The standard-setting meeting comprised three rounds, and student abilities and cutoff scores were estimated by weighted likelihood estimation (Warm, 1989). Statistical analyses were performed in R (R Core Team, 2022) with the TAM package (Robitzsch et al., 2020).
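Warm's (1989) weighted likelihood estimator corrects the maximum-likelihood ability estimate for first-order bias by adding a correction term to the likelihood score. The sketch below is illustrative only (not TASAL's implementation, which used TAM in R) and assumes a simple Rasch model with known item difficulties:

```python
import math

def wle_theta(responses, difficulties, lo=-6.0, hi=6.0, tol=1e-8):
    """Warm's weighted likelihood estimate of ability under the Rasch model.

    responses: 0/1 item scores; difficulties: known Rasch item difficulties.
    """
    def weighted_score(theta):
        s, info, jterm = 0.0, 0.0, 0.0
        for x, b in zip(responses, difficulties):
            p = 1.0 / (1.0 + math.exp(-(theta - b)))   # Rasch success probability
            q = 1.0 - p
            s += x - p                                  # likelihood score
            info += p * q                               # Fisher information
            jterm += p * q * (1.0 - 2.0 * p)            # derivative of information
        return s + jterm / (2.0 * info)                 # Warm's bias correction J/(2I)

    # bisection on the weighted score function (positive at lo, negative at hi)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if weighted_score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# three of five items correct on a symmetric difficulty ladder
theta = wle_theta([1, 1, 0, 1, 0], [-1.0, -0.5, 0.0, 0.5, 1.0])
print(round(theta, 3))
```

Unlike the plain maximum-likelihood estimate, the WLE remains finite for perfect and zero scores; the TAM package exposes an analogous routine (`tam.wle`) for fitted models.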
Result and Conclusion
Feedback was collected using a questionnaire on standard-setting. Most of the experts rated the process and outcome of the standard-setting meeting as well above or above average. The experts agreed or strongly agreed that the feedback provided and the performance level descriptor (PLD) procedures helped them establish standards. In summary, this study provides satisfactory evidence for the procedural validity of standard-setting.
This study also provides evidence for the internal validity of standard-setting. During the initial round, the standard errors of the cutoff scores across all experts and levels ranged from 2.03 to 11.58, and they decreased during subsequent rounds. In general, the ratios of these standard errors to the measurement error of 34.64 mostly fell below the acceptable criterion of 0.33, consistent with the results of Kaftandjieva (2010, p. 104).
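This internal-validity check can be expressed as a simple ratio test. The function below is an illustrative sketch using the figures reported above (round-one SEs of 2.03-11.58 against a measurement error of 34.64), not the authors' code; note that the upper end of the round-one range slightly exceeds the 0.33 criterion, which is consistent with "mostly" rather than "all" being acceptable.

```python
def se_ratio_acceptable(cutoff_se, measurement_error, criterion=0.33):
    """Is a cutoff-score standard error small relative to the test's
    measurement error? (criterion of roughly one third, per Kaftandjieva, 2010)"""
    return cutoff_se / measurement_error <= criterion

for se in (2.03, 11.58):
    print(se, se_ratio_acceptable(se, 34.64))
# 2.03  → True  (ratio ≈ 0.059)
# 11.58 → False (ratio ≈ 0.334, just above 0.33)
```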
Using the English comprehension performance of eighth-grade students as the external criterion, the cutoff scores set from the seventh-grade assessment effectively and significantly distinguished between different levels of achievement. A partial η2 of .506 was obtained, indicating a large effect size (Cohen, 1988). In conclusion, this study provides evidence for the external validity of standard-setting.
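Partial η² is the ratio of the effect sum of squares to the effect-plus-error sum of squares. The sums of squares below are hypothetical values chosen only to reproduce the reported .506; by Cohen's (1988) benchmarks, anything above roughly .14 counts as a large effect.

```python
def partial_eta_squared(ss_effect, ss_error):
    """Partial eta squared: SS_effect / (SS_effect + SS_error)."""
    return ss_effect / (ss_effect + ss_error)

# hypothetical sums of squares reproducing the reported value
print(round(partial_eta_squared(506.0, 494.0), 3))  # 0.506
```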
In summary, valuable suggestions are provided based on the study results. For example, when evaluating changes in student performance, regression toward the mean may be a crucial factor affecting standard-setting results during the vertical articulation of cutoff scores across grades. Additionally, continuously collecting evidence to support the validity of standard-setting is crucial for responding to educational policies and curriculum guidelines. The results therefore underscore the importance of accumulating ongoing validity evidence in future research.

Keywords: English comprehension, hybrid of expert and student empirical performance models, Taiwan Assessment of Student Achievement: Longitudinal Study, standards-based large-scale assessment, standard setting