Background: Machine learning (ML) is a powerful tool for analyzing real-world data (RWD), in which missing values are a common problem. Including features after multiple imputation (MI), regardless of their percentage of missing values, has been reported to improve the performance of ML models. However, limited research exists on how different absolute non-missing thresholds for MI affect ML performance in RWD applications such as predicting overall survival (OS) in metastatic lung cancer (mLC).
Objectives: To evaluate how the absolute number of non-missing values of features used in MI affects ML performance in predicting OS for mLC.
Methods: ML algorithms were trained and validated to predict 90-day mortality in a cohort of adults, followed from the first recorded diagnosis of mLC in the US nationwide IQVIA Oncology EMR in 2019-2020. Baseline features included demographics, vital signs, stage, TNM, histology, biomarkers, chemotherapy, targeted therapy and immunotherapy, and functional lab tests. Implementation steps were: i) Cohort construction: a full cohort with all 58 variables regardless of degree of missingness (Cfull), a complete cohort with the 12 variables that had no missing values (Ccomplete), and 6 additional analytic cohorts (C300, C200, C175, C150, C125, and C100) that kept features with ≥300, ≥200, ≥175, ≥150, ≥125, and ≥100 non-missing observations, respectively; ii) Data cleaning, including removal of extreme outliers and clustering of chemotherapies and histology; iii) MI, including MissForest; iv) A 70/30 split of each cohort into training and testing sets; v) ML, including XGBoost; and vi) Performance evaluation of the ML models using AUC, accuracy, F1, sensitivity, and specificity.
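The cohort-construction step (i) can be illustrated with a minimal sketch, assuming the baseline features sit in a pandas DataFrame with one row per patient. The function name `build_threshold_cohorts`, the toy feature names (`age`, `ecog`, `pdl1`, `albumin`), and the toy thresholds are hypothetical and chosen only for illustration; they are not the study's data or code, and the imputation and modeling steps are not shown.

```python
import numpy as np
import pandas as pd


def build_threshold_cohorts(df, thresholds):
    """Return {cohort name: DataFrame}, keeping columns whose absolute
    number of non-missing values is >= each threshold."""
    non_missing = df.notna().sum()  # non-missing count per column
    cohorts = {"Cfull": df.copy()}  # all features, regardless of missingness
    for t in thresholds:
        keep = non_missing[non_missing >= t].index
        cohorts[f"C{t}"] = df[keep].copy()
    # Ccomplete: only columns with no missing values at all
    complete = non_missing[non_missing == len(df)].index
    cohorts["Ccomplete"] = df[complete].copy()
    return cohorts


# Toy data (hypothetical): 6 patients, 4 features with varying missingness
toy = pd.DataFrame({
    "age": [70, 65, 81, 59, 77, 68],
    "ecog": [1, np.nan, 2, 0, np.nan, 1],
    "pdl1": [np.nan, np.nan, 50, np.nan, 10, np.nan],
    "albumin": [3.9, 4.1, np.nan, 3.5, 4.0, 3.8],
})

cohorts = build_threshold_cohorts(toy, thresholds=[5, 4, 2])
for name, frame in cohorts.items():
    print(name, list(frame.columns))
```

In this sketch the thresholds are absolute counts rather than percentages, mirroring the ≥300 to ≥100 cut-offs described above; each resulting cohort would then be imputed, split, and modeled separately.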
Results: Among 3,155 adult mLC patients in the Cfull cohort, 32.5% were aged 75+ years (median=70, IQR=56-84 years) and 51.9% were male. Stages IIIB, IIIC, and IV were recorded in 12.6%, 2.5%, and 84.9% of patients, respectively. Overall, 8.1% of patients died within the 90-day follow-up. The number of variables dropped from 58 in the full cohort to 44, 48, 49, 51, 53, 53, and 12 for the C300, C200, C175, C150, C125, C100, and Ccomplete cohorts, respectively. There was an equivalent of 5.5% missing data in the C175 cohort. AUC from XGBoost decreased from 0.69 for the Cfull, C300, C200, and C175 cohorts to 0.63 for the C150 and C125 cohorts, and further to 0.62 for the C100 and Ccomplete cohorts. Similar AUC trends were found across different ML models and MI methods for the study cohorts.
Conclusions: MI approaches incorporating all features regardless of degree of missingness demonstrated the best performance in predicting mortality for mLC patients based on AUC. This is consistent with prior findings; however, including variables with non-missing values below a certain threshold (e.g., <175) did not improve ML performance. Further research is needed to validate these findings across diverse diseases and databases.