Event

Doctoral Defence: Xueqi DANG

The Doctoral School in Science and Engineering is happy to invite you to Xueqi DANG’s defence entitled

Learning-based Test Input Prioritization for Machine Learning Systems

Supervisor: Prof Yves LE TRAON

Machine learning (ML) has achieved significant success across various fields. Ensuring the reliability of ML systems through testing is essential. However, ML testing faces a major challenge: labeling each test input to assess the model’s accuracy on the test set is expensive. This is mainly due to three reasons: 1) reliance on manual labeling, 2) the large scale of test datasets, and 3) the need for domain expertise during the labeling process. Test input prioritization has emerged as a promising strategy to mitigate labeling costs: it ranks test inputs so that those more likely to be misclassified are labeled first. Labeling such bug-revealing inputs earlier accelerates the debugging process, thereby enhancing the efficiency of ML testing. In the existing literature, various test prioritization methods have been introduced, which can generally be classified into coverage-based, confidence-based, and mutation-based approaches. While these methods have demonstrated effectiveness in certain scenarios, they exhibit notable limitations when applied to more specialized contexts. This dissertation focuses on three such scenarios: classical machine learning classification, long text classification, and graph neural network (GNN) classification. For each scenario, we introduce a novel test prioritization method, as detailed in Chapters 3 to 5. Beyond proposing these new methods, we also conduct an empirical study on GNN classification to explore the limitations of existing test selection approaches when applied to GNNs (cf. Chapter 6).
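For readers less familiar with the area, the sketch below illustrates the general flavor of confidence-based prioritization (in the spirit of approaches such as DeepGini): unlabeled test inputs are ranked by the impurity of the model’s predicted class probabilities, so the least confident predictions are labeled first. The generic `predict_proba`-style interface is an assumption made for illustration and is not tied to any method proposed in the dissertation.

```python
import numpy as np

def gini_impurity(probs: np.ndarray) -> np.ndarray:
    """Gini impurity of each row of predicted class probabilities.
    Higher impurity means lower confidence, hence higher suspicion."""
    return 1.0 - np.sum(probs ** 2, axis=1)

def prioritize_by_confidence(model, x_test: np.ndarray) -> np.ndarray:
    """Order test indices from most to least likely misclassified,
    using only the model's output probabilities (no labels needed)."""
    probs = model.predict_proba(x_test)   # shape: (n_tests, n_classes)
    scores = gini_impurity(probs)
    return np.argsort(-scores)            # most uncertain inputs first
```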

– MLPrior: A New Test Prioritization Approach for Classical Machine Learning Models. To tackle the challenges in classical ML testing, we propose a novel test prioritization method named MLPrior. MLPrior is specifically designed to leverage the distinctive characteristics of classical ML models for effective test prioritization (compared to DNNs, classical ML models are generally more interpretable, and their datasets often consist of carefully engineered feature attributes). MLPrior is built on two key principles: 1) tests that are more sensitive to mutations are more likely to be misclassified, and 2) tests that are closer to the model’s decision boundary are more likely to be misclassified. Experimental results show that MLPrior surpasses the compared prioritization methods, achieving an average improvement ranging from 14.74% to 67.73%.
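The sketch below shows how these two principles could be turned into a ranking score. The particular boundary approximation (top-2 probability margin), the equal-weight combination, and the `mutants` list of pre-mutated models are illustrative assumptions, not MLPrior’s actual design.

```python
import numpy as np

def mutation_sensitivity(model, mutants, x_test):
    """Principle 1: fraction of mutated models whose prediction disagrees
    with the original model on each test input."""
    original = model.predict(x_test)
    disagree = np.zeros(len(x_test))
    for mutant in mutants:
        disagree += (mutant.predict(x_test) != original)
    return disagree / len(mutants)

def boundary_closeness(model, x_test):
    """Principle 2: closeness to the decision boundary, approximated here
    by the margin between the top-2 predicted class probabilities."""
    probs = model.predict_proba(x_test)
    top2 = np.sort(probs, axis=1)[:, -2:]
    return 1.0 - (top2[:, 1] - top2[:, 0])   # smaller margin -> higher score

def mlprior_like_ranking(model, mutants, x_test, alpha=0.5):
    """Combine both signals (equal weighting is an arbitrary choice here)."""
    score = (alpha * mutation_sensitivity(model, mutants, x_test)
             + (1 - alpha) * boundary_closeness(model, x_test))
    return np.argsort(-score)   # most suspicious test inputs first
```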

– GraphPrior: A New Test Prioritization Approach for Graph Neural Networks. To enhance the efficiency of GNN testing, we propose GraphPrior, a novel test prioritization method specifically designed for GNNs. In particular, we introduce new mutation rules tailored to GNNs to generate mutated models and leverage the mutation results for effective test prioritization. The core principle is that test inputs that “kill” more mutated models are considered more likely to be misclassified. Experimental results demonstrate that GraphPrior outperforms all baseline methods, achieving an average performance improvement of 4.76% to 49.60% on natural datasets.
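A minimal sketch of the kill-count principle follows, assuming node classification and models that map a graph data object to per-node logits; GraphPrior’s GNN-specific mutation rules are not reproduced here, so the list of `mutated_models` is taken as given.

```python
import torch

@torch.no_grad()
def kill_counts(original_model, mutated_models, graph_data):
    """For each node, count how many mutated GNNs it 'kills', i.e. how many
    predict a different class than the original model on the same graph."""
    original_pred = original_model(graph_data).argmax(dim=-1)
    kills = torch.zeros_like(original_pred, dtype=torch.float)
    for mutant in mutated_models:
        kills += (mutant(graph_data).argmax(dim=-1) != original_pred).float()
    return kills

@torch.no_grad()
def graphprior_like_ranking(original_model, mutated_models, graph_data, test_mask):
    """Rank test nodes so that those killing more mutants are labeled first."""
    kills = kill_counts(original_model, mutated_models, graph_data)
    test_idx = test_mask.nonzero(as_tuple=True)[0]
    order = torch.argsort(kills[test_idx], descending=True)
    return test_idx[order]
```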

– LongTest: A New Test Prioritization Approach for Long Text Files. Long texts, such as legal documents and scientific papers, present unique challenges for test prioritization due to their substantial length, complex hierarchical structures, and diverse semantic content. To address these issues, we propose LongTest, a novel approach specifically tailored to long text data. LongTest is built on two key components: 1) a specialized embedding generation mechanism designed to extract crucial information from entire long documents, and 2) a contrastive learning framework that enhances prioritization by effectively distinguishing misclassified samples from correctly classified ones. Experimental evaluations demonstrate that LongTest outperforms baseline methods, with average improvements ranging from 14.28% to 70.86%.
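The sketch below conveys the overall pipeline under simplifying assumptions: documents are split into chunks, encoded, and mean-pooled into a single embedding (a stand-in for LongTest’s specialized embedding mechanism), and a separately trained scorer (standing in for the contrastive learning component) ranks the embeddings by estimated misclassification risk. The `encoder.encode` and `scorer.predict_proba` interfaces are hypothetical.

```python
import numpy as np

def embed_long_text(encoder, document: str, chunk_size: int = 512) -> np.ndarray:
    """Split a long document into fixed-size chunks, encode each chunk,
    and mean-pool into one document-level vector (illustrative only)."""
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    chunk_vecs = np.stack([encoder.encode(chunk) for chunk in chunks])
    return chunk_vecs.mean(axis=0)

def prioritize_long_texts(encoder, scorer, documents):
    """Rank long-text inputs by a learned misclassification score, where
    `scorer` is assumed to be trained to separate misclassified from
    correctly classified embeddings (e.g., via a contrastive objective)."""
    embeddings = np.stack([embed_long_text(encoder, doc) for doc in documents])
    scores = scorer.predict_proba(embeddings)[:, 1]   # P(misclassified)
    return np.argsort(-scores)   # most suspicious documents first
```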

– An Empirical Study Investigating the Limitations of Test Selection Approaches on GNNs. To investigate the limitations of existing DNN-oriented test selection methods in the context of GNNs, we carried out an empirical study involving 22 test selection techniques evaluated across seven graph datasets and eight GNN models. The study concentrated on three key objectives: 1) Misclassification Detection: identifying test inputs with a higher probability of being misclassified; 2) Accuracy Estimation: selecting a representative subset of tests to accurately estimate the overall accuracy of the full test set; and 3) Performance Improvement: selecting retraining samples to enhance the accuracy of GNN models. Our findings indicate that the effectiveness of these test selection methods on GNNs falls short of their performance on DNNs.

In summary, this dissertation introduces three novel test prioritization methods designed for specific machine learning scenarios and presents an empirical study that explores the limitations of existing test selection approaches when applied to GNNs.