Event

PhD defence: Test Flakiness Prediction Techniques for Evolving Software Systems

  • Speaker  Guillaume HABEN

  • Location

    Campus Kirchberg, CK building, room E01-A12

    Luxembourg

You are all cordially invited to attend the PhD Defence of Mr. Guillaume HABEN on Thursday, 29th of June 2023, at 10 am. The defence will take place in person in room E01-A12 of the CK building (Campus Kirchberg).

Members of the defence committee:

  • Prof. Dr. Michail PAPADAKIS, Chair, University of Luxembourg, Luxembourg
  • Dr. Maxime CORDY, Vice-chair, University of Luxembourg, Luxembourg
  • Prof. Dr. Yves LE TRAON, Supervisor, University of Luxembourg, Luxembourg
  • Prof. Dr. Arie VAN DEURSEN, Member & reviewer, Delft University of Technology, The Netherlands
  • Prof. Dr. Javier TUYA, Member & reviewer, University of Oviedo, Spain

Abstract:

Software testing plays a crucial role in guaranteeing a desired level of software quality. Its goal is to ensure that software products meet their specified requirements, function as intended and are free of errors. The scope of software testing is broad, ranging from functional to non-functional requirements, and testing is generally performed at different levels (e.g. unit testing, integration testing, system testing, acceptance testing). During continuous integration, development activities are typically halted when test failures occur, requiring further investigation and debugging.

In an ideal world, all tests are deterministic: developers and testers expect the same outcome (pass or fail) when a test is executed twice on the same version of the program. Unfortunately, some tests exhibit non-deterministic behaviour. Commonly called flaky tests, they send confusing signals to developers, who struggle to determine whether their software is defective, and they tend to erode trust in test suites. These occasional test failures can be difficult to reproduce and thus hard to debug. Left unaddressed, test flakiness can hinder the smooth and rapid integration of code changes. It also undermines many effective testing techniques, such as test case selection, test case prioritisation, automated program repair and fault localisation. While the phenomenon has been known to practitioners for decades, academic attention has only grown in recent years, and few studies have been carried out to better understand flakiness, its different causes and origins, and to propose techniques to prevent, detect and mitigate it.
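To make the notion concrete, here is a minimal, hypothetical Python sketch of a timing-dependent flaky test (an illustrative example, not taken from the dissertation): the test asserts on wall-clock time, so the same code can pass or fail depending on machine load and scheduling.

    import time
    import unittest

    def fetch_result(delay=0.01):
        # Stand-in for an asynchronous operation whose duration varies.
        time.sleep(delay)
        return "done"

    class FlakyTimingTest(unittest.TestCase):
        def test_completes_quickly(self):
            start = time.monotonic()
            result = fetch_result()
            elapsed = time.monotonic() - start
            self.assertEqual(result, "done")
            # Flaky assertion: depends on scheduling and load, not on behaviour.
            self.assertLess(elapsed, 0.02)

    if __name__ == "__main__":
        unittest.main()

Two consecutive runs on the same version of the program can yield different outcomes, which is precisely the non-determinism described above.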

In this context, the present dissertation aims to advance research on test flakiness prediction through five main contributions. The first two are exploratory studies: they aim to provide a better understanding of test flakiness and of existing prediction techniques. The next two contributions are constructive studies: they propose new approaches and focus on previously unaddressed problems. Finally, the last contribution is a case study carried out in a real-world context, offering new insights that are important for continuing test flakiness prediction research effectively.

By conducting a qualitative study, the first contribution seeks to understand practitioners’ perceptions of the sources, impact and mitigation strategies of flaky tests. The goal of this work is to grasp the current challenges surrounding flakiness in industry and to identify opportunities for future research. We carried out this study through a grey literature review and practitioner interviews. The findings revealed sources of flakiness that had so far been overlooked by previous research (such as the infrastructure, the environment or testing frameworks) and a strongly negative impact on testing practices. The second contribution aims at strengthening the usability of flaky test prediction techniques. Rerunning failing tests is still the main approach to dealing with flakiness, and it comes at a cost, both in time and in compute resources. If accurate, predicting flaky tests can be an alternative to reruns and can help better understand their characteristics. In this study, we replicate an existing approach that relies on code vocabulary to predict flaky tests, with three goals in mind: validating the approach in the continuous integration context, evaluating its generalisability to different programming languages, and extending it with an additional set of features.
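As a hedged illustration of the vocabulary-based idea (a minimal sketch with invented data; the replicated study’s actual pipeline, features and datasets differ), test bodies are tokenised into a bag of words and a classifier is trained on flaky/non-flaky labels:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import CountVectorizer

    # Hypothetical labelled data: raw test-case source and a flakiness
    # label (1 = flaky). Real datasets are mined from CI histories.
    test_bodies = [
        "Thread.sleep(500); assertNotNull(fetchRemote(url))",
        "assertEquals(2, add(1, 1))",
        "await response; assertTrue(job.isDone())",
        "assertEquals(3, list.size())",
    ]
    labels = [1, 0, 1, 0]

    # Identifiers and keywords form the code vocabulary.
    vectorizer = CountVectorizer(token_pattern=r"[A-Za-z_]+")
    X = vectorizer.fit_transform(test_bodies)

    clf = RandomForestClassifier(random_state=0).fit(X, labels)

    # Predict whether an unseen test is likely to be flaky.
    new_test = ["sleep(100); assertEquals(200, status)"]
    print(clf.predict(vectorizer.transform(new_test)))

The intuition is that tokens such as sleep, await or remote correlate with flakiness-prone behaviour, so the vocabulary alone carries predictive signal.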

Realising that predicting flaky tests is feasible, but also that challenges remain in understanding the cause of flakiness, the third contribution presents a new technique to predict the flakiness category of a given flaky test, in the hope of providing developers with better insights for debugging their tests. In the fourth contribution, we aim to identify the cause of flakiness in the critical case where it originates from within the program under test. To do so, we adapt spectrum-based fault localisation and leverage ensemble learning to rank classes by their likelihood of being responsible for test flakiness.
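To illustrate the spectrum-based intuition only (a simplified sketch with invented numbers and the classic Ochiai score; the dissertation’s actual technique additionally leverages ensemble learning), each class’s coverage across failing versus passing reruns of a flaky test can be turned into a suspiciousness score and the classes ranked accordingly:

    import math

    def ochiai(cov_failed, cov_passed, total_failed):
        # Ochiai suspiciousness: how strongly coverage of an element
        # correlates with failing executions.
        denom = math.sqrt(total_failed * (cov_failed + cov_passed))
        return cov_failed / denom if denom else 0.0

    # Hypothetical spectra from reruns of one flaky test (10 failed):
    # class -> (times covered in failing runs, times covered in passing runs)
    spectra = {
        "SchedulerImpl": (9, 3),
        "ConfigLoader": (2, 10),
        "ResultCache": (9, 9),
    }
    total_failed_runs = 10

    ranked = sorted(spectra.items(),
                    key=lambda kv: ochiai(*kv[1], total_failed_runs),
                    reverse=True)
    for cls, (f, p) in ranked:
        print(f"{cls}: {ochiai(f, p, total_failed_runs):.3f}")

Classes covered mostly in failing reruns rise to the top of the ranking, pointing developers towards the likely in-program source of flakiness.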

In the fifth and final contribution, we conduct an empirical analysis of Chromium’s continuous integration, where we found that flaky test signals should not be discarded, as they prove useful for finding faults caused by regressions. We therefore advocate predicting failures (as flaky or faulty) while taking into account the context of a test’s execution.

Overall, this thesis provides insights into how predictive models can be validated and leveraged to better handle test flakiness in real-world contexts.