Utility
Dataset
Attributes of spirometry
Prediction methodology
Performance

Utility

Dataset

The dataset contained spirometry investigation reports of 1314 patients from the Institute of Pulmocare and Research (IPCR), Kolkata, diagnosed with obstructive or non-obstructive diseases. The patients were divided into two groups, Group A and Group B, consisting of 1163 and 151 patients respectively. Reports of patients diagnosed with obstructive diseases were labelled positive and reports of patients with non-obstructive diseases were labelled negative. The reports in Group A were used for training and testing with cross validation (CV-dataset), and the reports in Group B were used as a blind dataset. A summary of the dataset is given in Table - 1.


Table - 1: Summary of patient groups in the dataset.

| | Group A | Group B |
| --- | --- | --- |
| Use | Training and testing with 5-fold cross validation | Blind dataset for validation |
| Patient count | 1163 | 151 |
| Total number of spirometry reports | 1172 | 154 |
| Number of obstructive spirometry reports | 1006 | 103 |
| Number of non-obstructive spirometry reports | 166 | 51 |

Attributes of spirometry

In spirometry, patients are asked to take a maximal inspiration and then expel the air as forcefully and quickly as possible into a mouthpiece. The test is repeated after a bronchodilator is administered. The pre- and post-bronchodilator values of the following three metrics were used as input:



For each of the above tests, there are 4 attributes. Thus, there are a total of 12 attributes.



Prediction methodology

Supervised machine learning models were developed for the classification task using the Support Vector Machine (SVM), Random Forest (RF), Naive Bayes (NB) and Multi-layer Perceptron (MLP) algorithms. Performance metrics, namely accuracy, sensitivity, specificity, F1-score, Matthews correlation coefficient (MCC) and area under the receiver operating characteristic curve (AUROC), were computed and compared. The optimal model was chosen on the basis of the highest MCC value.
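As an illustration of how the threshold-based metrics listed above relate to one another, the following minimal sketch computes them from confusion-matrix counts. The function name and the example counts are hypothetical and are not taken from the study. AUROC is omitted because it needs ranked prediction scores rather than counts.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Compute metrics from confusion-matrix counts.

    Assumes both classes are present and at least one positive
    prediction was made, so that no ratio divides by zero.
    """
    sensitivity = tp / (tp + fn)            # true positive rate (recall)
    specificity = tn / (tn + fp)            # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "mcc": mcc,
    }

# Hypothetical counts chosen for easy arithmetic, not results from the study.
metrics = classification_metrics(tp=90, fp=10, tn=40, fn=10)
print(metrics)
```

Unlike accuracy and F1-score, MCC uses all four cells of the confusion matrix, which is one common reason to prefer it on imbalanced data.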


The training dataset used for cross validation was highly imbalanced, with a positive to negative ratio (P:N) of about 6:1. To handle this imbalance, an undersampling method was used in which the majority (positive) class samples were randomly divided into six disjoint and exhaustive subsets. The minority (negative) class samples were then combined with each positive class subset to obtain six undersampled datasets with P:N of approximately 1:1. A model was trained on each of the six undersampled datasets and the performance metrics were averaged.
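The partitioning step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's code: the function name is hypothetical, integer IDs stand in for spirometry reports, and only the class sizes (1006 positive, 166 negative) come from Table - 1.

```python
import random

def make_undersampled_datasets(positives, negatives, n_subsets=6, seed=0):
    """Split the majority class into disjoint, exhaustive subsets and
    pair each subset with the full minority class."""
    rng = random.Random(seed)
    shuffled = list(positives)
    rng.shuffle(shuffled)
    # Strided slices are disjoint and together cover every positive sample.
    subsets = [shuffled[i::n_subsets] for i in range(n_subsets)]
    return [(subset, list(negatives)) for subset in subsets]

# Placeholder sample IDs; class sizes follow Group A in Table - 1.
positive_ids = list(range(1006))
negative_ids = list(range(1006, 1006 + 166))

datasets = make_undersampled_datasets(positive_ids, negative_ids)
for pos, neg in datasets:
    print(len(pos), len(neg))
```

Because 1006 is not divisible by six, the positive subsets hold 167 or 168 samples each, so each undersampled dataset is only approximately balanced against the 166 negatives.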


Hyperparameters were tuned for each ML algorithm using grid search, an exhaustive search over a parameter grid formed by taking the Cartesian product of pre-specified sets of values for each hyperparameter. Hyperparameter optimization was performed separately for the two sets of models: one trained with the whole training set and the other with the undersampled datasets. The optimal model was saved and is used in this prediction server.
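The Cartesian-product construction of the grid can be sketched with the standard library. The grid values, the function names and the toy scoring function below are assumptions made for illustration. In practice the score would come from cross-validated model performance, for example the MCC.

```python
from itertools import product

def parameter_grid(param_values):
    """Yield every combination of the pre-specified hyperparameter values."""
    names = list(param_values)
    for combo in product(*(param_values[name] for name in names)):
        yield dict(zip(names, combo))

def grid_search(param_values, score_fn):
    """Score every grid point and return the best one."""
    best_params, best_score = None, float("-inf")
    for params in parameter_grid(param_values):
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical grid; the values searched in the study are not listed here.
grid = {"hidden_nodes": [50, 100], "learning_rate": [0.01, 0.001]}

def toy_score(params):
    # Stand-in for a cross-validated score; peaks at (100, 0.001).
    return -abs(params["hidden_nodes"] - 100) - abs(params["learning_rate"] - 0.001)

best_params, best_score = grid_search(grid, toy_score)
print(best_params)
```

The number of grid points is the product of the sizes of the value sets, so the cost of an exhaustive search grows quickly as hyperparameters are added.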


Performance

Table - 2: Performance of models with 5-fold cross validation

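| Dataset | Model | Accuracy | Sensitivity | Specificity | F1-score | MCC |
| --- | --- | --- | --- | --- | --- | --- |
| Whole training dataset | Support Vector Machine (SVM) | 0.835 | 0.837 | 0.826 | 0.897 | 0.532 |
| Whole training dataset | Random Forest (RF) | 0.906 | 0.955 | 0.609 | 0.946 | 0.597 |
| Whole training dataset | Naive Bayes (NB) | 0.870 | 0.915 | 0.602 | 0.924 | 0.495 |
| Whole training dataset | Multi-layer Perceptron (MLP) | 0.918 | 0.966 | 0.626 | 0.953 | 0.645 |
| Under-sampled datasets | Support Vector Machine (SVM) | 0.823 | 0.825 | 0.821 | 0.824 | 0.650 |
| Under-sampled datasets | Random Forest (RF) | 0.822 | 0.832 | 0.811 | 0.824 | 0.647 |
| Under-sampled datasets | Naive Bayes (NB) | 0.800 | 0.864 | 0.737 | 0.813 | 0.607 |
| Under-sampled datasets | Multi-layer Perceptron (MLP) | 0.837 | 0.853 | 0.822 | 0.841 | 0.682 |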

The MLP model trained with the under-sampled datasets showed the best cross-validation performance, with an MCC of 0.68 and an accuracy of 83.7% (Table - 2). This model is used in this prediction server. The hyperparameters chosen by grid search for this MLP model give an architecture with two hidden layers: 100 nodes in the first hidden layer followed by 100 nodes in the second. The input and output layers have 12 nodes and 1 node respectively. The "adam" weight optimizer and the rectified linear unit (ReLU) activation function were used with a constant learning rate of 0.001. The ROC plots of the different models are given in Figure - 1. The performance of the models on the blind dataset (Group B) is given in Table - 3.
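To make the 12-100-100-1 architecture concrete, the sketch below counts its trainable parameters and runs one forward pass with ReLU hidden layers. It is illustrative only: the weights are random rather than trained, the sigmoid on the output node is an assumption (a common choice for a single-output binary classifier), and the function names are hypothetical.

```python
import math
import random

# Input, two hidden layers and output sizes, as described in the text.
LAYER_SIZES = [12, 100, 100, 1]

def count_parameters(layer_sizes):
    """Weights plus biases over all consecutive layer pairs."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

def init_params(layer_sizes, seed=0):
    """Random, untrained weights; biases start at zero."""
    rng = random.Random(seed)
    params = []
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                   for _ in range(n_out)]
        params.append((weights, [0.0] * n_out))
    return params

def forward(params, x):
    """ReLU on hidden layers; sigmoid on the output (an assumption)."""
    activations = x
    last = len(params) - 1
    for index, (weights, biases) in enumerate(params):
        z = [sum(w * a for w, a in zip(row, activations)) + b
             for row, b in zip(weights, biases)]
        if index < last:
            activations = [max(0.0, v) for v in z]
        else:
            activations = [1.0 / (1.0 + math.exp(-v)) for v in z]
    return activations[0]

n_params = count_parameters(LAYER_SIZES)
params = init_params(LAYER_SIZES)
# Placeholder input vector of 12 attribute values.
probability = forward(params, [0.5] * 12)
print(n_params, probability)
```

The parameter count is (12 × 100 + 100) + (100 × 100 + 100) + (100 × 1 + 1) = 11501.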


Figure - 1: Receiver operating characteristic (ROC) plot of different models (σ: standard deviation).


Table - 3: Performance of models on predicting the validation dataset

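| Training dataset | Model | Accuracy | Sensitivity | Specificity | F1-score | MCC |
| --- | --- | --- | --- | --- | --- | --- |
| Whole training dataset | Support Vector Machine (SVM) | 0.853 | 0.897 | 0.765 | 0.891 | 0.667 |
| Whole training dataset | Random Forest (RF) | 0.835 | 0.971 | 0.561 | 0.887 | 0.619 |
| Whole training dataset | Naive Bayes (NB) | 0.823 | 0.944 | 0.580 | 0.877 | 0.586 |
| Whole training dataset | Multi-layer Perceptron (MLP) | 0.857 | 0.986 | 0.596 | 0.902 | 0.677 |
| Under-sampled datasets | Support Vector Machine (SVM) | 0.854 | 0.898 | 0.766 | 0.892 | 0.669 |
| Under-sampled datasets | Random Forest (RF) | 0.862 | 0.902 | 0.781 | 0.897 | 0.687 |
| Under-sampled datasets | Naive Bayes (NB) | 0.855 | 0.926 | 0.712 | 0.895 | 0.665 |
| Under-sampled datasets | Multi-layer Perceptron (MLP) | 0.849 | 0.886 | 0.774 | 0.887 | 0.663 |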


Bhattacharjee S. et al., J Comput Sci (2022), 63:101768. doi: 10.1016/j.jocs.2022.101768. Please contact Dr. Sudipto Saha (ssaha4@jcbose.ac.in) regarding any further queries.