List of Figures
List of Tables
List of Abbreviations
Chapter 1 Introduction
1.1 Context of the research
1.2 Research questions
1.2.1 Research question 1 (Study one)
1.2.2 Research question 2 (Study two)
1.2.3 Research question 3 (Study three)
1.3 Research design overview
1.3.1 Study one: traditional scoring
1.3.2 Study two: confidence scoring
1.3.3 Study three: traditional scoring and confidence scoring
1.4 Potential contribution
Chapter 2 Review of the Literature
2.1 Early development: linking scores to expert experience
2.1.1 Expert experience: the "native speaker" benchmark
2.1.2 Practice perspective: (I)ELTS (1986 & 1989)
2.2 Major contribution: linking scores to rater perception
2.2.1 Teacher/Rater interpretation: "scaling descriptors"
2.2.2 Rater judgment: "binary comparisons"
2.2.3 Practice perspective: IELTS revision (1998-2001)
2.3 Work in progress: linking scores to candidate performance
2.3.1 Identifying features from rater perception
2.3.2 Identifying features from documents/rating scales
2.3.3 Practice perspective: TOEFL iBT and IELTS (operational)
2.4 The L2 Chinese context and identifying L2 Chinese features
2.4.1 Pronunciation
2.4.2 Fluency
2.4.3 Vocabulary
2.4.4 Grammar
2.5 Traditional scoring and problems of "indistinction" and "overlap"
2.6 Summary
Chapter 3 Study One: Traditional Scoring
3.1 Introduction
3.1.1 Traditional scoring
3.1.2 Research question
3.2 Method
3.2.1 Instrument: an L2 Chinese speaking test
3.2.2 Participants
3.2.3 Coding
3.2.4 Statistical analysis
3.3 Results
3.3.1 Correlations
3.3.2 Standard multiple regression
3.4 Discussion
3.5 Summary
Chapter 4 Study Two: Confidence Scoring
4.1 Introduction
4.1.1 Confidence scoring
4.1.2 Research question
4.2 Confidence scoring design
4.2.1 Raw confidence scores of adjacent levels
4.2.2 Raw confidence scores from different scales
4.2.3 Raw confidence scores to a confidence score
4.2.4 Score interpretation and use
4.3 Pilot study
4.3.1 Candidates and instruments
4.3.2 Coding system
4.3.3 Confidence scores and traditional scores
4.4 Discussion
4.5 Summary
Chapter 5 Study Three: Traditional Scoring and Confidence Scoring
5.1 Introduction
5.1.1 Mixed methods: the convergent parallel design
5.1.2 Research question
5.2 Method
5.2.1 Quantitative score data
5.2.2 Qualitative interview data
5.3 Analysis
5.3.1 Quantitative data analysis
5.3.2 Qualitative data analysis
5.4 Results and findings
5.4.1 Quantitative results
5.4.2 Qualitative findings
5.5 Discussion
5.6 Summary
Chapter 6 General Discussion and Conclusion
6.1 Study one: traditional scoring
6.1.1 Constructing rating scales based on candidate performance
6.1.2 Establishing a potential alignment of L2 speaking tests
6.2 Study two: confidence scoring
6.2.1 Applying confidence scoring in other educational contexts
6.2.2 Developing computation package for confidence scoring
6.3 Study three: traditional scoring and confidence scoring
6.4 Limitations
6.5 Conclusion
6.6 Future agendas: where are we heading
6.6.1 Investigating more features representing the construct
6.6.2 Applying confidence scoring to different contexts
6.6.3 Combining automated scoring and raters' scoring
References
Appendices
Appendix 1 Holistic rating scale for traditional scoring
Appendix 2 The L2 Chinese speaking test
Appendix 3 Histograms
Appendix 4 Scatterplots
Appendix 5 Correlation matrix
Appendix 6 Histograms and scatterplots for the residuals (Study one)
Appendix 7 Center of gravity (COG) computation details
Appendix 8 Rating scales (used in Study two and Study three)
Appendix 9 Histograms and scatterplots for the residuals (Study three)
Appendix 10 Instructions for using the computation package for confidence scoring