Developing and Validating a Constructed-Response Assessment of Scientific Abilities: A Case of the Optics Unit
Authors: Hsiao-Hui Lin (Graduate Institute of Science Education, National Taiwan Normal University), Sieh-Hwa Lin (Department of Educational Psychology & Counseling, National Taiwan Normal University), Hsin-Kai Wu (Graduate Institute of Science Education, National Taiwan Normal University)
Vol. & No.: Vol. 63, No. 1
Date: March 2018
Pages: 173-205
DOI: 10.6209/JORIES.2018.63(1).06
Abstract:
This study aimed to develop and validate a constructed-response assessment of scientific abilities (CRASA) and an accompanying rubric. The assessment comprised 32 open-ended items categorized into four subscales: remembering and understanding scientific knowledge, application and analysis of scientific procedures, argumentation and expression of scientific logic, and evaluation and innovation during problem solving. The analysis yielded the following results. First, the Cronbach’s α values were higher than .90, indicating high intrarater consistency. Second, Kendall’s coefficient of concordance was higher than .90 (p < .001), denoting a consistent scoring pattern across raters. In addition, many-facet Rasch measurement (MFRM) analysis revealed no significant difference in rater severity, whereas a comparison of the rating scale model (RSM) and partial credit model (PCM) indicated that each rater had a unique rating scale structure. The infit and outfit mean squares of the MFRM fell within 1 ± 0.5, which suggested that both severe and lenient raters could effectively distinguish high-ability from low-ability students. The deviance values estimated by the RSM and PCM were converted to Bayesian information criterion values, and the RSM was judged to fit the empirical data more appropriately than the PCM. Therefore, the severity thresholds of the raters were the same. Third, the Cronbach’s α coefficients of the four subscales and the full assessment were higher than .85, indicating that the CRASA had high internal-consistency reliability. Finally, confirmatory factor analysis revealed acceptable goodness of fit for the CRASA. These results suggest that the CRASA is a useful tool for accurately measuring scientific abilities.
Keywords: confirmatory factor analysis, constructed-response assessment, many-facet Rasch measurement, rater consistency
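The abstract names three statistics without giving their formulas: Cronbach’s α, Kendall’s coefficient of concordance, and a Bayesian information criterion (BIC) derived from model deviance. The sketch below shows the standard textbook definitions of each in plain Python. It is not the authors' analysis code, it does not attempt the MFRM estimation itself, and the function names and the small score matrix are hypothetical values made up purely for illustration. The Kendall's W function applies no correction for tied ranks.

```python
import math
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha. scores: one row per student, one column per item (or rater)."""
    k = len(scores[0])
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def average_ranks(values):
    """Rank values 1..n in ascending order; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def kendalls_w(ratings):
    """Kendall's W without tie correction. ratings: one row per rater, one column per student."""
    m, n = len(ratings), len(ratings[0])
    ranked = [average_ranks(row) for row in ratings]
    rank_sums = [sum(row[i] for row in ranked) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((rs - mean_sum) ** 2 for rs in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


def bic_from_deviance(deviance, n_params, n_obs):
    """BIC = deviance + p * ln(N), where deviance is -2 times the log-likelihood."""
    return deviance + n_params * math.log(n_obs)


if __name__ == "__main__":
    # Hypothetical scores: 3 raters (rows) scoring 5 students (columns).
    by_rater = [
        [2, 4, 6, 8, 10],
        [1, 3, 5, 7, 9],
        [3, 4, 7, 8, 12],
    ]
    by_student = [list(col) for col in zip(*by_rater)]
    print("alpha:", cronbach_alpha(by_student))
    print("W:", kendalls_w(by_rater))
    # Hypothetical deviances and parameter counts for two competing models.
    print("BIC A:", bic_from_deviance(1500.0, 10, 200))
    print("BIC B:", bic_from_deviance(1490.0, 25, 200))
```

Under this definition the model with the smaller BIC is preferred, so a model with more parameters needs a sufficiently large drop in deviance to offset the `n_params * ln(n_obs)` penalty.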
References:
- 李茂能(2006)。結構方程模式軟體Amos之簡介及其在測驗編製上之應用:Graphics & Basic。臺北市:心理。【Li, M.-N. (2006). An introduction to Amos and its uses in scale development: Graphics & Basic. Taipei, Taiwan: Psychological.】
- 林小慧、曾玉村(2017)。科學多重文本閱讀理解評量及規準之建構與信效度分析—以氣候變遷與三峽大壩之間的關係題本為例。教育心理學報,49(2),215-241。doi:10.6251/BEP.2017-49(2).0003 【Lin, H.-H., & Tzeng, Y.-T. (2017). Developing and validating a scientific multi-text reading comprehension assessment: Evidence from texts describing relationships between climate changes and the Three Gorges Dam. Bulletin of Educational Psychology, 49(2), 215-241. doi:10.6251/BEP.2017-49(2).0003】
- 林世華、盧雪梅、陳學志(2004)。國民中小學九年一貫課程學習成就評量指標與方法手冊。臺北市:教育部。【Lin, S.-H., Lu, S.-M., & Chen, H.-C. (2004). The learning achievement assessment indicators and methods manual of grade 1-9 curriculum. Taipei, Taiwan: Ministry of Education.】
- 張郁雯、林文瑛、王震武(2013)。科學表現的兩性差異縮小了嗎?-國際科學表現評量資料之探究。教育心理學報,44(s),459-476。doi:10.6251/BEP.20111028 【Chang, Y.-W., Lin, W.-Y., & Wang, J.-W. (2013). Is gender gap in science performance closer? Investigating data from international science study. Bulletin of Educational Psychology, 44(s), 459-476. doi:10.6251/BEP.20111028】
- Anderson, L. W. (1999). Rethinking Bloom’s taxonomy: Implications for testing and assessment. Retrieved from ERIC database. (ED435630)
- Bennett, R. E., & Ward, W. C. (1993). Construction versus choice in cognitive measurement: Issues in constructed response, performance testing, and portfolio assessment. Hillsdale, NJ: Lawrence Erlbaum Associates.
- Bloom, B. S. (1956). Taxonomy of educational objectives: The classification of educational goals. New York, NY: Longmans, Green.
- Carter, P. L., Ogle, P. K., & Royer, L. B. (1993). Learning logs: What are they and how do we use them? In N. L. Webb & A. F. Coxford (Eds.), Assessment in the mathematics classroom (pp. 87-96). Reston, VA: National Council of Teachers of Mathematics.
- Cizek, G. J. (2006). Standard setting. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test development (pp. 225-258). Mahwah, NJ: Lawrence Erlbaum Associates.
- Cohen, J. (2013). Statistical power analysis for the behavioral sciences (2nd ed.). Hoboken, NJ: Taylor and Francis.
- Eckes, T. (2009). Many-facet Rasch measurement. In S. Takala (Ed.), Reference supplement to the manual for relating language examinations to the Common European Framework of Reference for languages: Learning, teaching, assessment (Section H). Strasbourg, France: Council of Europe/Language Policy Division.
- Foltz, P. W., Laham, D., & Landauer, T. K. (1999, June). Automated essay scoring: Applications to educational technology. Paper presented at the World Conference on Educational Multimedia, Hypermedia and Telecommunications, Seattle, WA.
- Foster, G. (1984, March). Technical writing and science writing. Is there a difference and what does it matter? Paper presented at the annual meeting of the Conference on College Composition and Communication, New York, NY.
- Gronlund, N. E. (1985). Measurement and evaluation in teaching (5th ed.). New York, NY: Macmillan.
- Huang, Y.-C. (1999). A study of reformulation relations in scientific reports (Unpublished master’s thesis). National Tsing Hua University, Hsinchu, Taiwan.
- Kirk, R. E. (1995). Experimental design: Procedures for the behavioral sciences (3rd ed.). Pacific Grove, CA: Brooks/Cole.
- Kuo, C.-Y., Wu, H.-K., Jen, T.-H., & Hsu, Y.-S. (2015). Development and validation of a multimedia-based assessment of scientific inquiry abilities. International Journal of Science Education, 37(14), 2326-2357. doi:10.1080/09500693.2015.1078521
- Landy, F. J., & Farr, J. L. (1983). The measurement of work performance: Methods, theory, and applications. New York, NY: Academic Press.
- Linacre, J. M. (1989). Many-facet Rasch measurement. Chicago, IL: MESA Press.
- Lynn, R., & Mikk, J. (2009). Sex differences in reading achievement. Trames, 13(63/58), 3-13. doi:10.3176/tr.2009.1.01
- Miller, R. G., & Calfee, R. C. (2004). Building a better reading-writing assessment: Bridging cognitive theory, instruction, and assessment. English Leadership Quarterly, 26(3), 6-13.
- Park, T. (2004). An investigation of an ESL placement test of writing using many-facet Rasch measurement. Teachers College, Columbia University Working Papers in TESOL & Applied Linguistics, 4(1), 1-21.
- Roid, G. H. (1994). Patterns of writing skills derived from cluster analysis of direct-writing assessments. Applied Measurement in Education, 7(2), 159-170. doi:10.1207/s15324818ame0702_4
- Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6(2), 461-464. doi:10.1214/aos/1176344136
- Stepanek, J. S., & Jarrett, D. (1997). Assessment strategies to inform science and mathematics instruction: It’s just good teaching. Portland, OR: Northwest Regional Educational Laboratory.
- Temiz, B. K., Taşar, M. F., & Tan, M. (2006). Development and validation of a multiple format test of science process skills. International Education Journal, 7(7), 1007-1027.
- Toranj, S., & Ansari, D. N. (2012). Automated versus human essay scoring: A comparative study. Theory & Practice in Language Studies, 2(4), 719-725. doi:10.4304/tpls.2.4.719-725
- Torkzadeh, G., Koufteros, X., & Pflughoeft, K. (2003). Confirmatory analysis of computer self-efficacy. Structural Equation Modeling: A Multidisciplinary Journal, 10(2), 263-275. doi:10.1207/S15328007SEM1002_6
- Valenti, S., Cucchiarelli, A., & Panti, M. (2002). Computer based assessment systems evaluation via the ISO9126 quality model. Journal of Information Technology Education: Research, 1(3), 157-175. doi:10.28945/353
- Valenti, S., Neri, F., & Cucchiarelli, A. (2003). An overview of current research on automated essay grading. Journal of Information Technology Education: Research, 2(s), 319-330. doi:10.28945/331
- Witkin, S. L. (2000). Writing social work. Social Work, 45(5), 389-394. doi:10.1093/sw/45.5.389
- Wright, B. D., & Linacre, J. M. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8(3), 370.