Supervisor: Prof. Dr Conchita D’AMBROSIO
Title: Essays on the Economics of Wellbeing and Machine Learning
The defense will be held in person at Campus BELVAL, rooms 2.150 and 2.210.
Niccolo’s thesis abstract:
In Chapter 1, we apply Machine Learning (ML) methods to predict and interpret life satisfaction using data from the UK British Cohort Study. We first discuss Penalized Linear Models and then one non-linear method, Random Forests. We present two key model-agnostic interpretative tools for the latter method: Permutation Importance and Shapley Values. With a parsimonious set of explanatory variables, neither Penalized Linear Models nor Random Forests produce major improvements over the standard Non-penalized Linear Model. However, once we consider a richer set of controls, these methods do produce a non-negligible improvement in predictive accuracy. Although marital status and emotional health continue to be the most important predictors of life satisfaction, as in the existing literature, gender becomes insignificant in the non-linear analysis.
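As an illustration of the Permutation Importance idea mentioned above, the following is a minimal sketch and not the thesis's implementation. It assumes a generic `predict` function and a mean-squared-error score; the toy data and the names `toy_predict`, `X` and `y` are hypothetical. A feature's importance is taken as the average increase in error after its column is shuffled.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Average increase in MSE when each feature column is shuffled.

    predict: function mapping one row (list of floats) to a prediction.
    X: list of rows, y: list of targets.
    """
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle column j only, leaving all other columns intact.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted = mse(y, [predict(row) for row in X_perm])
            increases.append(permuted - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Hypothetical toy model: the prediction depends on feature 0 only.
def toy_predict(row):
    return 2.0 * row[0]


data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [2.0 * row[0] for row in X]

importances = permutation_importance(toy_predict, X, y)
```

In this toy setting, shuffling the unused second feature leaves the error unchanged, while shuffling the first feature increases it. In practice the score would be computed on held-out data with a fitted model such as a Random Forest.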
In Chapter 2, we further assess the potential of ML to help us better understand wellbeing. To do so, we analyze wellbeing data on over a million respondents from Germany, the UK, and the United States. In terms of predictive power, ML approaches do perform better than traditional models. Although the size of the improvement is small in absolute terms, it turns out to be substantial when compared with the contribution of key variables such as health. Moreover, we find that drastically expanding the set of explanatory variables doubles the predictive power of both OLS and the ML approaches on unseen data. The variables identified as important by our ML algorithms – i.e. material conditions, health, and meaningful social relations – are similar to those that have already been identified in the literature. In that sense, our data-driven ML results validate the findings from conventional approaches.
In Chapter 3, we predict and analyze the determinants of health. The target changes compared to the previous two chapters: we now focus on objective health outcomes. In particular, ML methods are applied to predict health outcomes in the German Socio-Economic Panel under two specifications: pooling data across multiple years, and applying the Mundlak transformation to the same pooled data. The dependent variable of interest is the number of doctor visits in the last three months. We discuss the application of ML regression and clustering techniques. After describing the different types of independent variables and the rationale behind the choice of ML algorithms, we present the findings using accuracy scores suited to comparing all models. The analysis of the distribution of the variables in the clusters created by the algorithm, along with novel model-agnostic interpretative tools (Shapley Values), allows us to better interpret the results. We find that ML algorithms – Random Forest in our case – lead to large improvements in predictive accuracy, especially within clusters. Self-rated measures of health, gender and disability status are the most important drivers of healthcare utilization, in line with the existing literature.
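The Mundlak transformation referred to above is commonly implemented by adding, for each individual, the within-person mean of every time-varying covariate as an extra regressor in the pooled data. The sketch below illustrates that step only, under stated assumptions: the function `add_mundlak_means`, the field names `pid`, `year` and `income`, and the toy panel are hypothetical and are not taken from the thesis or from the German Socio-Economic Panel.

```python
from collections import defaultdict


def add_mundlak_means(rows, id_key, covariates):
    """Return copies of rows with per-individual covariate means added.

    rows: list of dicts, one per person-year observation.
    id_key: name of the field identifying the individual.
    covariates: names of time-varying fields to average within individual.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for row in rows:
        counts[row[id_key]] += 1
        for name in covariates:
            sums[row[id_key]][name] += row[name]

    augmented = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows are left untouched
        for name in covariates:
            person = row[id_key]
            new_row[name + "_mean"] = sums[person][name] / counts[person]
        augmented.append(new_row)
    return augmented


# Hypothetical three-observation panel with two individuals.
panel = [
    {"pid": 1, "year": 2010, "income": 10.0},
    {"pid": 1, "year": 2011, "income": 14.0},
    {"pid": 2, "year": 2010, "income": 30.0},
]

augmented = add_mundlak_means(panel, "pid", ["income"])
```

The augmented rows can then be passed to any pooled estimator, linear or ML-based, with the added `_mean` columns absorbing the part of each covariate that is constant within an individual.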