Time: October 31, 2023, 10:00-11:30
We consider a nonparametric regression setting that is common in modern studies, consisting of a response Y and some easy-to-obtain covariates X as well as a group of expensive covariates Z. To establish predictive models for Y, a natural question is whether it is worthwhile to include Z as predictors, given the added cost of collecting data on Z for both training the models and predicting Y for future individuals. Therefore, prior to embarking on large-scale data collection for model development, we wish to conduct preliminary investigations to infer the importance of Z in predicting Y in the presence of X.
To achieve this goal, we propose a nonparametric variable importance measure for Z, defined as a population parameter that aggregates the maximum potential contributions of Z across multiple models for predicting Y. For this measure, we develop novel inferential approaches for two-phase data, which consist of a large number of observations of (Y, X) with Z measured only in a relatively small subsample. Many available samples of (Y, X, Z) have this structure by design, given the high cost of measuring Z. In such two-phase study settings, our approaches possess a surprising advantage: they draw valid and efficient inference for variable importance in a unified and seamless way, regardless of whether Z makes a zero or a positive contribution to predicting Y. We refer to this particularly desirable property as the “blessing of data incompleteness” since it is unattainable with complete data. To achieve it, we overcome substantial challenges arising from the missing covariate issue in two-phase data. Numerical results using both simulated and real data demonstrate the superior performance of our approaches.
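Guorong Dai (戴国榕) is a lecturer in the Department of Statistics and Data Science, School of Management, Fudan University. He received his Ph.D. in statistics from Texas A&M in 2019 and stayed on there as a postdoctoral researcher until joining Fudan University in 2021. Dr. Dai's research interests include high-dimensional statistics, missing data, semiparametric theory, semi-supervised inference, and applications of statistical methods in biomedicine.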