Prediction of Postoperative Lung Function in Lung Cancer Patients Using Machine Learning Models
Abstract
Background
Surgical resection is the standard treatment for early-stage lung cancer. Since postoperative lung function is related to mortality, predicted postoperative lung function is used to determine the treatment modality. The aim of this study was to evaluate the predictive performance of linear regression and machine learning models.
Methods
We extracted data from the Clinical Data Warehouse and developed three sets: set I, the linear regression model; set II, machine learning models omitting the missing data; and set III, machine learning models imputing the missing data. Six machine learning models were implemented: the least absolute shrinkage and selection operator (LASSO), Ridge regression, ElasticNet, Random Forest, eXtreme gradient boosting (XGBoost), and the light gradient boosting machine (LightGBM). The forced expiratory volume in 1 second measured 6 months after surgery was defined as the outcome. Five-fold cross-validation was performed for hyperparameter tuning of the machine learning models. The dataset was split into training and test datasets at a 70:30 ratio. In set III, imputation was done after dataset splitting. Predictive performance was evaluated by R2 and mean squared error (MSE) in the three sets.
Results
A total of 1,487 patients were included in sets I and III and 896 patients were included in set II. In set I, the R2 value was 0.27 and in set II, LightGBM was the best model with the highest R2 value of 0.5 and the lowest MSE of 154.95. In set III, LightGBM was the best model with the highest R2 value of 0.56 and the lowest MSE of 174.07.
Conclusion
The LightGBM model showed the best performance in predicting postoperative lung function.
Introduction
Lung cancer is the leading cause of cancer-related deaths worldwide and chronic obstructive pulmonary disease (COPD) is the most common comorbid disease in patients with lung cancer [1-3]. Since lung cancer risk increases with age, the incidence rates of lung cancer are higher in elderly people. The lung parenchymal structure changes with age, resulting in the loss of elastic recoil and senile hyperinflation [4]. These previous studies showed that patients with lung cancer tended to have decreased lung function.
Surgical resection is the standard treatment for early-stage lung cancer. Patients with higher perioperative risks such as COPD and older age are at higher risk for postoperative complications and mortality after resection [5]. The assessment of perioperative risk is essential because surgery is an invasive treatment, and it affects postoperative lung function [6,7]. Since postoperative lung function is related to the quality of life and mortality, treatment modalities are determined according to the predicted postoperative lung function. The predicted postoperative forced expiratory volume in 1 second (ppoFEV1) is widely used as a parameter to represent postoperative lung function [8,9]. Several methods are used to compute ppoFEV1, such as a formula based on the number of resected segments [10], quantitative computed tomography (CT), and perfusion scintigraphy [9].
In recent clinical practice, minimally invasive surgical procedures such as video-assisted thoracic surgery have become routine [11,12]. A previous study showed that the actual postoperative lung function differed from the ppoFEV1 depending upon the extent of the resection [13]. Patients with COPD experienced smaller decreases in FEV1 than the ppoFEV1 predicted [14]. These studies showed a discrepancy between the actual postoperative lung function and the ppoFEV1. Previous attempts have been made to predict postoperative lung function more accurately, but these were limited by the wide diversity of individual patient characteristics. Therefore, predicting postoperative lung function more precisely for each individual could make personalized cancer treatment available.
Recently, machine learning methods have begun to be used in many different clinical settings for predicting outcomes [15]. Previous studies showed that machine learning could outperform conventional statistical models, such as logistic regression [16,17]. The objective of this study was to compare the conventional statistical model of linear regression with machine learning models in predicting postoperative lung function in patients with lung cancer.
Materials and Methods
1. Study design and database
We retrospectively investigated the Clinical Data Warehouse (CDW) database from the Department of Medical Informatics at College of Medicine, The Catholic University of Korea. A total of 4,342 patients with lung cancer who underwent surgery with mediastinal lymph node dissection at seven hospitals in the Catholic University of Korea (Seoul St. Mary’s Hospital, Incheon St. Mary’s Hospital, Yeouido St. Mary’s Hospital, Eunpyeong St. Mary’s Hospital, Bucheon St. Mary’s Hospital, St. Vincent’s Hospital, and Uijeongbu St. Mary’s Hospital) from 1997 to 2019 were extracted from the database. Of them, 2,855 patients were excluded due to missing preoperative FEV1 values or 6-month postoperative FEV1 values, as shown in Figure 1. Finally, 1,487 patients were selected for analysis. Since many post-bronchodilator pulmonary function test (PFT) values were missing, we used pre-bronchodilator PFT values, expressed as predicted FEV1 (%) and predicted forced vital capacity (FVC) (%), throughout the study.

Flow chart. Six machine learning models were implemented in sets II and III: least absolute shrinkage and selection operator (LASSO), Ridge regression, ElasticNet, Random Forest, eXtreme gradient boosting (XGBoost), and light gradient boosting machine (LightGBM). Missing data were omitted in set II and imputed in set III. The training dataset and test dataset were split at a 70:30 ratio. In set III, imputation was done after data splitting. FEV1: forced expiratory volume in 1 second.
2. Variable selection and outcome definition
Demographic data including age, gender, body mass index, smoking history, type of surgery, histologic features, tumor stage according to the eighth tumor-node-metastasis (TNM) classification, cancer location, blood test results, comorbidities, usage of COPD medications, treatment modalities (neoadjuvant chemotherapy, adjuvant chemotherapy, palliative chemotherapy, neoadjuvant radiotherapy, adjuvant radiotherapy, and palliative therapy) and pre-bronchodilator PFT results were collected. The continuous variables are presented as the mean with standard deviation and the categorical variables are expressed as numbers with percentages. Regarding smoking history, patients were classified as never smokers if they had smoked fewer than 100 cigarettes or never smoked in their lifetime, and as ever smokers if they had smoked at least 100 cigarettes in their lifetime. The types of surgery were grouped into three groups: the sublobar resection group if patients had received sublobar resection (segmentectomy, wedge resection, etc.), the lobectomy group if patients had received lobectomy, and the others group. PFT was performed in accordance with the American Thoracic Society/European Respiratory Society standardization guidelines. To measure the effect of COPD medications and to determine whether the time at which a medication was started affected the outcome, the date when the patient started the medication was treated as a variable. V0 was defined as the time interval from 3 months before surgery to the surgery date and V1 was defined as 6±3 months after surgery. For example, roflumilast V1 indicated that the patient had taken roflumilast 6±3 months after surgery.
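The V0 and V1 windows defined above can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not the procedure used in the study: the helper names `add_months` and `visit_window` are hypothetical, months are counted as calendar months, and both window boundaries are treated as inclusive, none of which is specified in the text.

```python
import calendar
from datetime import date


def add_months(d: date, months: int) -> date:
    """Shift a date by whole calendar months, clamping the day to the month's end."""
    index = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(index, 12)
    last_day = calendar.monthrange(year, month0 + 1)[1]
    return date(year, month0 + 1, min(d.day, last_day))


def visit_window(surgery: date, event: date):
    """Return 'V0', 'V1', or None for an event date relative to the surgery date.

    V0: from 3 months before surgery up to the surgery date.
    V1: 6 +/- 3 months after surgery, i.e. 3 to 9 months after surgery.
    Inclusive boundaries are an assumption made for this sketch.
    """
    if add_months(surgery, -3) <= event <= surgery:
        return "V0"
    if add_months(surgery, 3) <= event <= add_months(surgery, 9):
        return "V1"
    return None


# Made-up dates for illustration only.
surgery_date = date(2018, 1, 15)
window_before = visit_window(surgery_date, date(2017, 12, 1))
window_after = visit_window(surgery_date, date(2018, 7, 20))
window_outside = visit_window(surgery_date, date(2018, 2, 1))
```

With these made-up dates, a medication started on 1 December 2017 falls in V0, one started on 20 July 2018 falls in V1, and one started on 1 February 2018 falls in neither window.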
The baseline PFT results, including FEV1, FVC, the FEV1/FVC ratio, diffusing capacity of the lung for carbon monoxide, the residual volume (RV)/total lung capacity (TLC) ratio, and laboratory results including complete blood count, albumin, C-reactive protein, lactate dehydrogenase (LDH), and creatinine levels were measured at V0. Comorbidities were considered if they were diagnosed within 5 years before or after surgery. Recurrence was defined based on radiological or histologic evidence of cancer 6 months after surgery. The details of the variables are shown in Table 1. To represent postoperative lung function, predicted FEV1 (%) measured at V1 was defined as the outcome. The performance of the linear regression model and machine learning models in predicting the outcome was evaluated.
3. Statistical analyses
We developed three different sets for the linear regression and machine learning models according to the method used to handle missing data. In set I, simple linear regression was performed to evaluate the individual effects of the variables on predicting postoperative lung function. All variables with a p-value of less than 0.07 in the simple linear regression analysis were included in the multiple linear regression analysis. Neither imputation of missing data nor dataset splitting was performed in set I. For the machine learning models, the dataset was split into training and test datasets at a 70:30 ratio. In set II, patients who did not have data for all of the selected variables were excluded, meaning that a patient with even one missing variable was excluded. In set III, we derived the missing value imputation rules using only the training dataset. Imputation was implemented with linear regression analysis of the continuous variables and logistic regression analysis of the categorical variables using the simpleimputer function in the Python package Autoimpute. These imputation rules were then applied to the test dataset. A total of 1,487 patients were in sets I and III and 896 patients were in set II. Since scaling the data showed poorer performance than using the raw data, raw data were used in all sets. The study flow diagram is presented in Figure 1.
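The order of operations in set III (split first, learn the imputation rule on the training data only, then apply the same rule to the test data) can be sketched in plain Python. This is a minimal illustration and not the Autoimpute workflow used in the study: the function names, the column names `fvc` and `ldh`, and the toy values are all hypothetical, only one continuous variable is imputed from one predictor by least squares, and the logistic regression step for categorical variables is omitted.

```python
import random


def split_70_30(rows, seed=0):
    """Shuffle a copy of the rows and split them 70:30 into training and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def fit_imputation_rule(train, predictor, target):
    """Fit target = intercept + slope * predictor on complete TRAINING rows only."""
    pairs = [(r[predictor], r[target]) for r in train if r[target] is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def apply_imputation_rule(rows, rule, predictor, target):
    """Fill missing target values with a rule that was learned on the training set."""
    intercept, slope = rule
    return [
        {**r, target: intercept + slope * r[predictor]} if r[target] is None else dict(r)
        for r in rows
    ]


# Toy data (made up): "ldh" is missing for every fourth patient.
rows = [
    {"fvc": 60.0 + i, "ldh": None if i % 4 == 0 else 300.0 + 2.0 * i}
    for i in range(40)
]
train, test = split_70_30(rows, seed=42)
rule = fit_imputation_rule(train, "fvc", "ldh")           # learned from training data only
train_imputed = apply_imputation_rule(train, rule, "fvc", "ldh")
test_imputed = apply_imputation_rule(test, rule, "fvc", "ldh")  # same rule, no refitting
```

Fitting the rule before the split, or refitting it on the test data, would let information from the test patients leak into the model inputs, which is why the rule is learned once on the training portion and reused unchanged.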
In sets II and III, we implemented six machine learning models, least absolute shrinkage and selection operator (LASSO), ridge regression, ElasticNet, Random Forest, eXtreme gradient boosting (XGBoost), and the light gradient boosting machine (LightGBM) to predict 6-month postoperative lung function in the developed dataset. Five-fold cross-validation was performed for hyperparameter tuning of the machine learning models. After hyperparameter optimization, we used the following parameters in each model.
Mean squared error (MSE) and R2 were used in the test dataset to assess the predictive quality of the models. Data analyses were performed using R version 4.0.2 (The R Foundation, Vienna, Austria) and Python 3.7 (Python Software Foundation, Wilmington, DE, USA). The specific Python package used was Autoimpute.
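The five-fold cross-validation used for hyperparameter tuning can be sketched as follows. This is a simplified illustration rather than the study's code: it tunes a single hypothetical hyperparameter (the penalty of a one-predictor ridge regression without intercept) over a made-up grid, uses contiguous folds without shuffling, and scores each candidate by mean validation MSE. The function names and the toy data are assumptions introduced here.

```python
def k_fold_indices(n, k=5):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def ridge_slope(xs, ys, penalty):
    """One-predictor ridge without intercept: b = sum(x*y) / (sum(x*x) + penalty)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + penalty)


def cv_mse(xs, ys, penalty, k=5):
    """Mean validation MSE over k folds for one candidate penalty."""
    fold_errors = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        fit_x = [x for i, x in enumerate(xs) if i not in held_out]
        fit_y = [y for i, y in enumerate(ys) if i not in held_out]
        b = ridge_slope(fit_x, fit_y, penalty)
        fold_errors.append(sum((ys[i] - b * xs[i]) ** 2 for i in fold) / len(fold))
    return sum(fold_errors) / k


def tune_penalty(xs, ys, grid, k=5):
    """Return the penalty in the grid with the lowest cross-validated MSE."""
    return min(grid, key=lambda penalty: cv_mse(xs, ys, penalty, k))


# Toy training data (made up, noise-free): y = 2x, so no shrinkage is optimal.
train_x = [float(i) for i in range(1, 21)]
train_y = [2.0 * x for x in train_x]
best_penalty = tune_penalty(train_x, train_y, grid=[0.0, 1.0, 10.0, 100.0])
```

In this sketch the tuning uses only the training portion, so the held-out test dataset is reserved for the final MSE and R2 estimates. Real data would normally be shuffled before being divided into folds.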
4. Ethics approval
This study was reviewed and approved by the Institutional Review Board (IRB) of the Eunpyeong St. Mary’s Hospital, College of Medicine, The Catholic University of Korea (IRB approval number: XC20WIDI0027P). Written informed consent by the patients was waived due to the retrospective nature of our study.
Results
1. Overall patient characteristics
A total of 1,487 patients with primary lung cancer who received surgery with mediastinal lymph node dissection and did not relapse for up to 6 months after surgery were included. The patient characteristics are presented in Table 1.
2. Linear regression model
The results of the simple linear regression model are provided in Table 2. Variables with a p-value of less than 0.07 were entered into the multiple linear regression model, as presented in Table 3. Male gender was negatively correlated with FEV1 V1 compared to female gender, and regarding the type of surgery, lobectomy and other methods were negatively correlated compared to sublobar resection. LDH levels and inhaler V0 were negatively correlated and FVC V0 was positively correlated.
3. Predictive performance
The scatter plots of actual postoperative FEV1 (%) and the predicted postoperative FEV1 (%) of prediction models in sets II and III are presented in Figures 2 and 3. The residual box plots of prediction models in sets II and III are presented in Figures 4 and 5. Residuals were calculated by the formula: residual=predicted postoperative FEV1 (%)–actual postoperative FEV1 (%). Linear regression coefficients were estimated by ordinary least squares regression. Both the range between the maximum and minimum residuals and the box were narrowest in the LightGBM model in both sets. The predictive performance evaluated by MSE and R2 for each model in sets II and III was computed based on the test dataset and is described in Table 4. In set I, the R2 value was 0.27 with p<0.001. In set II, the best model for predicting 6-month postoperative lung function was LightGBM with an MSE of 154.95 and an R2 value of 0.5. In set III, the best model was LightGBM with an MSE of 174.07 and an R2 value of 0.56. Since LightGBM in set III (imputing missing data) had the highest explanatory power, with the highest R2 value of 0.56 and the lowest MSE among the set III models, it was the best model for predicting 6-month postoperative lung function.
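The residual formula and the two performance metrics above can be written out in a few lines of Python. This is a minimal sketch with made-up values: the residual follows the definition in the text (predicted minus actual), while R2 is assumed here to be the usual coefficient of determination, 1 minus the ratio of the residual sum of squares to the total sum of squares, which the text does not state explicitly.

```python
def residuals(predicted, actual):
    """Residual as defined in the text: predicted FEV1 (%) minus actual FEV1 (%)."""
    return [p - a for p, a in zip(predicted, actual)]


def mean_squared_error(predicted, actual):
    """Average of the squared residuals."""
    res = residuals(predicted, actual)
    return sum(r * r for r in res) / len(res)


def r_squared(predicted, actual):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for p, a in zip(predicted, actual))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up test-set values of postoperative FEV1 (% predicted).
actual = [80.0, 90.0, 70.0, 100.0]
predicted = [82.0, 88.0, 73.0, 97.0]

res = residuals(predicted, actual)      # [2.0, -2.0, 3.0, -3.0]
residual_range = max(res) - min(res)    # spread between largest and smallest residual
mse = mean_squared_error(predicted, actual)
r2 = r_squared(predicted, actual)
```

Because the outcome is expressed in percent, the MSE is in squared percentage points. Taking the square root of the reported values, an MSE of 154.95 corresponds to a typical error of about 12.4 percentage points and an MSE of 174.07 to about 13.2.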

Scatter plot of the actual postoperative forced expiratory volume in 1 second (FEV1, %) and predicted FEV1 (%) pairwise of the models in set II. (A) Ordinary least squares (OLS), (B) least absolute shrinkage and selection operator (LASSO), (C) Ridge regression, (D) ElasticNet, (E) Random Forest, (F) eXtreme gradient boosting (XGBoost), (G) light gradient boosting machine (LightGBM), and (H) predicted postoperative forced expiratory volume in 1 second (ppoFEV1)=preFEV1×[1–(number of segments×0.0526)].

Scatter plot of the actual postoperative forced expiratory volume in 1 second (FEV1) and predicted FEV1 pairwise of the models in set III. (A) Ordinary least squares (OLS), (B) least absolute shrinkage and selection operator (LASSO), (C) Ridge regression, (D) ElasticNet, (E) Random Forest, (F) eXtreme gradient boosting (XGBoost), (G) light gradient boosting machine (LightGBM), and (H) predicted postoperative forced expiratory volume in 1 second (ppoFEV1)=preFEV1×[1–(number of segments×0.0526)].

Residual boxplot of each prediction method in set II. Residual: predicted forced expiratory volume in 1 second (FEV1)–actual postoperative FEV1; OLS: ordinary least squares; LASSO: least absolute shrinkage and selection operator; XGBoost: eXtreme gradient boosting; LightGBM: light gradient boosting machine; ppoFEV1: predicted postoperative forced expiratory volume in 1 second: ppoFEV1=preFEV1×[1–(number of segments×0.0526)].

Residual boxplot of each prediction method in set III. Residual: predicted forced expiratory volume in 1 second (FEV1)–actual postoperative FEV1; OLS: ordinary least squares; LASSO: least absolute shrinkage and selection operator; XGBoost: eXtreme gradient boosting; LightGBM: light gradient boosting machine; ppoFEV1: predicted postoperative forced expiratory volume in 1 second: ppoFEV1=preFEV1×[1–(number of segments×0.0526)].
4. Importance scores
The importance scores (β coefficients) for the variables in Random Forest, XGBoost, and LightGBM are presented in Figure 6. In Figure 6A, C, and E, missing data were omitted, and in Figure 6B, D, and F, missing data were imputed. The variable with the highest importance score was preFEV1 V0 in both the Random Forest model and the LightGBM model. In the Random Forest model with set III, preFVC V0 had the second-highest importance score (0.163). In the XGBoost model, inhaler V0 had the largest absolute value of the coefficient in set II and adjuvant radiotherapy had the largest value in set III. In LightGBM, age had the second-highest importance score and RV/TLC V0 had the third-highest.

Mean squared error (MSE) and R2 of the machine learning models. (A, B) Set II, III Random Forest, (C, D) set II, III eXtreme gradient boosting (XGBoost), (E, F) set II, III light gradient boosting machine (LightGBM). FEV1: forced expiratory volume in 1 second; V0: time interval from 3 months before surgery to the surgery date; FVC: forced vital capacity; LDH: lactate dehydrogenase; RV: residual volume; TLC: total lung capacity; CRP: C-reactive protein; DLCO: diffusing capacity of the lung for carbon monoxide; BMI: body mass index; CTx: chemotherapy; BD: bronchodilator; RTx: radiotherapy; WBC: white blood cell; ICS: inhaled corticosteroid; RML: right middle lobe; LUL: left upper lobe; HF: heart failure; CV: cardiovascular disease; SCLC: small cell lung cancer; TPL: theophylline; RUL: right upper lobe.
Discussion
In this study, we found that machine learning models outperformed the traditional linear model and the previous prediction method using the number of resected lung segments in predicting postoperative lung function. Moreover, in all machine learning models, imputing the missing data showed performance superior to omitting the missing data, as presented in Table 4. Overall, the LightGBM model had the highest explanatory power with an R2 value of 0.561 and the lowest MSE of 174.07. It outperformed the previous method of predicting postoperative lung function using the number of lung segments resected, which in set III had an R2 value of 0.368 and an MSE of 250.06, as shown in Table 4. Moreover, as presented in Figure 5, the previous method had large negative residuals and relatively longer whiskers than the LightGBM model. Since the linear regression model showed a relatively low R2 of only 0.27, the results of this study suggest that machine learning models could improve predictive accuracy given the same data.
In multiple linear regression analysis, the variables with statistical significance were sex, type of surgery, LDH level, inhaler V0, and FVC V0, as presented in Table 3. In regard to the type of surgery, as the extent of resection increased, the postoperative lung function decreased. Compared to sublobar resection, lobectomy had a parameter estimate of –6.44. This was consistent with previous studies that showed limited resection such as sublobar resection preserved postoperative lung function [18,19]. In another previous study, the ppoFEV1 was calculated using the formula: ppoFEV1=preFEV1×[1–(S×0.0526)], where S is the number of segments resected [13]. This formula correlated well when patients underwent lobectomy. However, in patients who underwent pneumonectomy, the actual postoperative FEV1 was an average of 250 mL higher than ppoFEV1. This indicates that an adjustment factor is required to increase the accuracy of predicting postoperative lung function. To achieve this, we considered a variety of factors as independent variables to serve as an adjustment factor and pre-bronchodilator postoperative FEV1 V1 as a dependent variable.
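The segment-counting formula quoted above is simple enough to work through in code. The sketch below is illustrative only: the function name `ppo_fev1` is hypothetical, and the upper bound of 19 segments is an assumption inferred from the coefficient, since 0.0526 is approximately 1/19.

```python
TOTAL_SEGMENTS = 19          # assumption: 0.0526 is approximately 1/19
FRACTION_PER_SEGMENT = 0.0526


def ppo_fev1(pre_fev1: float, segments_resected: int) -> float:
    """ppoFEV1 = preFEV1 x [1 - (S x 0.0526)], with S the number of resected segments."""
    if not 0 <= segments_resected <= TOTAL_SEGMENTS:
        raise ValueError("segments_resected must be between 0 and 19")
    return pre_fev1 * (1.0 - segments_resected * FRACTION_PER_SEGMENT)


# Made-up example: preoperative FEV1 of 80% predicted, 3 segments resected.
example = ppo_fev1(80.0, 3)   # 80 x (1 - 0.1578) = 67.376
```

With a preoperative FEV1 of 80% and three resected segments, the formula predicts about 67.4%. Every patient with the same preoperative value and the same number of resected segments receives the same prediction, which is the limitation the additional variables in this study are meant to address.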
In the machine learning models, the variables with the highest importance scores varied in each model. In the ridge regression model, preFVC V0 had the highest score, followed by preFEV1 V0 and age. In the LightGBM model, preFEV1 V0 had the highest score, followed by age and RV/TLC V0. In the XGBoost model of set III, adjuvant radiotherapy had the highest score, followed by neoadjuvant chemotherapy. Since COPD and lung cancer are known to be linked diseases [20], COPD is highly prevalent in patients with lung cancer. COPD is characterized by hyperinflated lungs, resulting in an increase in the RV/TLC ratio. The all-cause mortality was higher in COPD patients who had higher RV/TLC values [21]. Because surgery reduces this hyperinflation, patients who underwent surgery exhibited better preservation of FEV1 [22]. In our previous study, lung cancer patients with COPD had higher RV/TLC values than lung cancer patients without COPD. Lung cancer patients with COPD had preserved postoperative lung function compared to patients without COPD. Therefore, the RV/TLC value was positively correlated with postoperative lung function [23]. We found that the RV/TLC value was also influential in predicting postoperative lung function with machine learning models.
When handling the missing data, we compared the predictive performance of each model under two methods: one in which the missing data were omitted, and another in which the missing data were imputed using linear regression analysis for the continuous variables and logistic regression for the categorical variables. As shown in Table 4, imputing the missing data increased the R2 value in all models. In data analysis, handling missing data is important, and deleting missing values causes a massive loss of information, leading to a decrease in statistical power [24-26]. Therefore, the imputation of missing data should be implemented when handling big data.
This study had several limitations. The CDW had a relatively large amount of incomplete data, and this might have led to selection bias. Patients who did not have postoperative FEV1 data were excluded. Physicians usually measure FEV1 when patients experience symptoms such as shortness of breath. Therefore, patients without postoperative FEV1 data might have had relatively preserved FEV1. Imaging studies such as CT are also known to predict postoperative lung function, but these data were not available. Some of the laboratory data and other variables included might have been irrelevant, and these data might have impaired the explanatory power. Future studies are required to enhance the explanatory power by increasing the sample size, including appropriate variables, and excluding irrelevant variables.
In predicting postoperative lung function for lung cancer patients, machine learning models performed modestly better than the linear regression model. The highest explanatory power was achieved by the LightGBM model by imputing the missing data. The preoperative FEV1 and RV/TLC values had a large impact on postoperative lung function. These findings suggest that machine learning models can be used as a predictive tool. Future studies are needed to improve the predictive performance.
Notes
Authors’ Contributions
Conceptualization: Yeo CD. Methodology: Kwon OB, Han S, Lee HY, Lee SH. Formal analysis: Kim SK, Park CK, Kim JW. Data curation: Kang HS, Kim JS, Kim SJ. Software: Han S. Validation: Han J. Writing - original draft preparation: Kwon OB. Writing - review and editing: Han S, Yeo CD. Approval of final manuscript: all authors.
Conflicts of Interest
No potential conflict of interest relevant to this article was reported.
Funding
This study was supported by a 2020 grant from The Korean Academy of Tuberculosis and Respiratory Diseases.