
Liver Patient Dataset Classification Using the Intel® Distribution for Python*


Abstract

This paper focuses on the classification of the Indian Liver Patient dataset using the Intel® Distribution for Python* on the Intel® Xeon® Scalable processor. Various preprocessing steps were applied in order to analyze their effect on the classification problem. Using the available features, the classifier aims to predict whether or not a person has liver disease. Early detection of the disease, without the need for manual effort, could be a great support to people in the medical field. The best results were obtained by using SMOTE as the preprocessing method and the Random Forest algorithm as the classifier.

Introduction

The liver, the largest solid organ in the human body, performs several important functions. Its major functions include manufacturing essential proteins and blood clotting factors; metabolizing fat and carbohydrates; eliminating harmful waste products and detoxifying alcohol and certain drugs; and secreting bile to aid digestion and intestinal absorption. Disorders of the liver can affect the smooth functioning of these activities.

Excessive consumption of alcohol, viral infections, and the intake of contaminated food or drugs are among the major causes of liver disease. Symptoms may or may not be visible in the early stages, and if not attended to properly, liver diseases can lead to life-threatening conditions. It is always better to diagnose the disease at an early stage in order to help ensure a high rate of survival for the patient.

Classification is an effective technique for handling this kind of problem in the medical field. Using the available feature values, a classifier can predict whether or not a person has liver disease, which helps doctors identify the disease in advance. It is generally recommended to reduce the Type I error (rejecting the null hypothesis when it is actually true), because it is better to identify a non-liver patient as a patient than to fail to identify a liver patient as one.

In this experiment, various preprocessing methods were tried prior to model building and training for comparison. Computational libraries such as scikit-learn*, NumPy, and SciPy* from the Intel Distribution for Python on the Intel Xeon Scalable processor were used to create the predictive models.

Environment Setup

Table 1 describes the environment setup that was used to conduct the experiment.

Table 1. Environment setup.

| Setup | Version |
| --- | --- |
| Processor | Intel® Xeon® Gold 6128 processor, 3.40 GHz |
| System | CentOS* (7.4.1708) |
| Core(s) per socket | 6 |
| Anaconda* with Intel channel | 4.3.21 |
| Intel® Distribution for Python* | 3.6.3 |
| Scikit-learn* | 0.19.0 |
| NumPy | 1.13.3 |
| Pandas | 0.20.3 |

Dataset Description

The Indian Liver Patient dataset was collected from the northeast area of the state of Andhra Pradesh, India. This is a binary classification problem whose classes are liver patient (represented as 1 in the dataset) and non-liver patient (represented as 2). There are 10 features, which are listed in table 2.

Table 2. Dataset description.

| Attribute Name | Attribute Description |
| --- | --- |
| V1 | Age of the patient. Any patient whose age exceeded 89 is listed as being age 90. |
| V2 | Gender of the patient |
| V3 | Total bilirubin |
| V4 | Direct bilirubin |
| V5 | Alkphos (alkaline phosphatase) |
| V6 | Sgpt (alanine aminotransferase) |
| V7 | Sgot (aspartate aminotransferase) |
| V8 | Total proteins |
| V9 | Albumin |
| V10 | A/G ratio (albumin to globulin ratio) |
| Class | Liver patient or not |

Methodology

Figure 1. Methodology.

Data Analysis

Before performing any processing on the available data, a data analysis is recommended. This process includes visualizing the data and identifying outliers and skewed predictors. These tasks help to inspect the data and thereby spot missing values and irrelevant information in the dataset. A data cleanup process is performed to handle these issues and to ensure data quality. A better understanding of the dataset helps to identify useful information and supports decision making.

The Indian Liver Patient dataset consists of 583 records, of which 416 are records of people with liver disease and the remaining are records of people without liver disease. The dataset has 10 features, of which only one is categorical (V2, gender of the patient). The last column of the dataset holds the class of each sample (liver patient or not): a value of 1 indicates that the person has liver disease and a 2 indicates that the person does not. There are no missing values in the dataset.
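For illustration, this inspection can be done with pandas. The following is a minimal sketch; the file name ilpd.csv and the V1-V10/Class column labels are assumptions made for this example, not part of the original experiment.

```python
import pandas as pd

# Assumed local copy of the Indian Liver Patient dataset (UCI ILPD).
# The file name and column labels below are illustrative assumptions.
cols = ["V1", "V2", "V3", "V4", "V5",
        "V6", "V7", "V8", "V9", "V10", "Class"]
df = pd.read_csv("ilpd.csv", names=cols)

print(df.shape)                    # expected: (583, 11)
print(df["Class"].value_counts())  # 1 = liver patient, 2 = non-liver patient
print(df.isnull().sum())           # confirm there are no missing values
```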

Figure 2. Visualization: liver patient dataset class.

Figure 3. Visualization: male and female population.

Figure 2 shows a visualization of the number of liver patients and non-liver patients in the dataset, whereas figure 3 represents a visualization of the male and female population in the dataset. Histograms of numerical variables are represented by figure 4.

Figure 4. Visualization of numerical variables in the dataset.

Data Preprocessing

Some datasets contain irrelevant information, noise, missing values, and so on. These datasets should be handled properly to get a better result for the data mining process. Data preprocessing includes data cleaning, preparation, transformation, and dimensionality reduction, which convert the raw data into a form that is suitable for further processing.

The major objective of the experiment is to show the effect of various preprocessing methods on the dataset prior to classification. Different classification algorithms were applied to compare the results.

Some of the preprocessing steps include the following (a minimal code sketch follows the list):

  • Normalization: This process scales each feature into a given range. The preprocessing.MinMaxScaler() function in the sklearn package is used to perform this action.
  • Assigning quantile ranges: The pandas.qcut function is used for quantile-based discretization. Based on the sample quantiles or ranks, the variables are discretized and assigned categorical values.
  • Oversampling: This technique handles an unbalanced dataset by generating new samples in the under-represented class. SMOTE is used for oversampling the data; it has several variants that differ in how the samples to interpolate between are selected. The SMOTE() function from imblearn.over_sampling is used to implement this.
  • Undersampling: Another technique for dealing with unbalanced data is undersampling, which reduces the number of samples in the over-represented class. ClusterCentroids is used for undersampling; it applies the K-means algorithm to reduce the number of samples. The ClusterCentroids() function from the imblearn.under_sampling package is used.
  • Binary encoding: This method converts categorical data into numerical form and is used when the feature column has binary values. In the liver patient dataset, column V2 (gender) has the values male/female, which are binary encoded as "0" and "1".
  • One hot encoding: Categorical features are mapped onto a set of columns that have the value "1" or "0" to represent the presence or absence of that feature. Here, after assigning quantile ranges to some features (V1, V3, V5, V6, V7), one hot encoding is applied to represent them as 1s and 0s.
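The sketch below shows how these steps can be wired together, continuing from the loading snippet above. The gender strings Male/Female and the choice of four quantile bins are assumptions for illustration; depending on the installed imbalanced-learn version, fit_resample may instead be named fit_sample.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import ClusterCentroids

# df as loaded in the earlier sketch.
X = df.drop("Class", axis=1).copy()
y = df["Class"]

# Binary encoding: map the gender strings (assumed Male/Female) to 0/1.
X["V2"] = X["V2"].map({"Male": 0, "Female": 1})

# Quantile-based discretization of age, then one hot encoding of the bins.
X["V1"] = pd.qcut(X["V1"], q=4, labels=False)
X = pd.get_dummies(X, columns=["V1"])

# Normalization: scale each feature into the [0, 1] range.
X_scaled = MinMaxScaler().fit_transform(X)

# Oversampling with SMOTE, or undersampling with cluster centroids.
X_over, y_over = SMOTE(random_state=42).fit_resample(X_scaled, y)
X_under, y_under = ClusterCentroids(random_state=42).fit_resample(X_scaled, y)
```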

Feature Selection

Feature selection is mainly applied to large datasets to reduce high dimensionality. It helps to identify the most important features in the dataset to use for model building. In the Indian Liver Patient dataset, forests of randomized trees are used to estimate feature importance; the ExtraTreesClassifier() function from the sklearn.ensemble package is used for the calculation. Figure 5 shows the feature importance computed with forests of trees. From the figure, it is clear that the most important feature is V5 (alkphos alkaline phosphatase) and the least important is V2 (gender).

Removing the least significant features helps to reduce the processing time. Here V2 (gender of the patient), V8 (total proteins), V9 (albumin), and V10 (A/G ratio) are dropped in order to reduce the number of features for model building.
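A sketch of the importance calculation follows, assuming df as loaded earlier and working on the ten raw features with gender encoded numerically (before the quantile/one hot step); the tree count is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

# Keep the ten raw features, with gender encoded 0/1 (assumed labels).
X = df.drop("Class", axis=1).copy()
X["V2"] = X["V2"].map({"Male": 0, "Female": 1})
y = df["Class"]

# Fit forests of randomized trees and rank the features by importance.
forest = ExtraTreesClassifier(n_estimators=250, random_state=0)
forest.fit(X, y)
for i in np.argsort(forest.feature_importances_)[::-1]:
    print(X.columns[i], forest.feature_importances_[i])

# Drop the least significant features before model building.
X_reduced = X.drop(["V2", "V8", "V9", "V10"], axis=1)
```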

Figure 5. Feature importance with forests of trees.

Model Building

A list of classifiers was used to create various classification models, which can then be used for prediction. Part of the dataset was used for training the model and the rest for testing: in this experiment, 90 percent of the data was used for training and 10 percent for testing. Because StratifiedShuffleSplit (a function in scikit-learn) was used to split the data, the percentage of samples from each class was preserved; that is, 90 percent of the samples from each class went into training and the remaining 10 percent from each class into testing. Classifiers from the scikit-learn package were used for model building.
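A minimal sketch of the split and one model fit, assuming X_reduced and y from the feature-selection step; the random forest here stands in for any of the classifiers compared later.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedShuffleSplit

# 90/10 stratified split: class proportions are preserved in both parts.
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=42)
train_idx, test_idx = next(sss.split(X_reduced, y))
X_train, X_test = X_reduced.iloc[train_idx], X_reduced.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
```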

Prediction

The label of a new input can be predicted using the trained model. The accuracy and F1 score were analyzed to understand how well the model has learned during training.

Evaluation of the model

Several methods can be used to evaluate the performance of a model. Cross validation, the confusion matrix, accuracy, precision, and recall are some of the popular performance evaluation measures.

The performance of a model cannot be assessed by considering accuracy alone, because accuracy by itself can be misleading. Therefore, this experiment considers the F1 score along with the accuracy for evaluation.
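Both measures are available in sklearn.metrics; a short sketch, continuing from the trained model above:

```python
from sklearn.metrics import accuracy_score, f1_score

# Predict on the held-out 10 percent and report accuracy plus per-class F1.
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("F1 (patient):", f1_score(y_test, y_pred, pos_label=1))
print("F1 (non-patient):", f1_score(y_test, y_pred, pos_label=2))
```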

Observation and Results

In order to find out the effect of feature selection on the liver patient dataset, accuracy and F1 score were analyzed with and without feature selection (see table 3).

After analyzing the results, it was inferred that removing the least significant features produced no remarkable change, except in the case of the Random Forest Classifier. Because feature selection reduces the processing time, it was applied before the further processing techniques.

Table 3. Performance with and without feature selection.

| Classifier | Accuracy (without feature selection) | F1 Patient (without) | F1 Non-Patient (without) | Accuracy (with feature selection) | F1 Patient (with) | F1 Non-Patient (with) |
| --- | --- | --- | --- | --- | --- | --- |
| Random Forest Classifier | 71.1186 | 0.81 | 0.37 | 74.5762 | 0.84 | 0.44 |
| Ada Boost Classifier | 74.5762 | 0.83 | 0.52 | 72.8813 | 0.82 | 0.43 |
| Decision Tree Classifier | 66.1016 | 0.76 | 0.41 | 67.7966 | 0.77 | 0.49 |
| Multinomial Naïve Bayes | 47.4576 | 0.47 | 0.47 | 49.1525 | 0.50 | 0.48 |
| Gaussian Naïve Bayes | 62.7118 | 0.65 | 0.61 | 61.0169 | 0.62 | 0.60 |
| K-Neighbors Classifier | 72.8813 | 0.83 | 0.33 | 72.8813 | 0.83 | 0.33 |
| SGD Classifier | 71.1864 | 0.83 | 0.00 | 67.7966 | 0.81 | 0.00 |
| SVC | 71.1864 | 0.83 | 0.00 | 71.1864 | 0.83 | 0.00 |
| OneVsRest Classifier | 62.7118 | 0.77 | 0.08 | 32.2033 | 0.09 | 0.46 |

After feature selection, some preprocessing techniques were applied, including normalization, in which each feature was scaled and translated so that it lies within a given range on the training set. Another preprocessing step assigned quantile ranges to some of the feature values, followed by one hot encoding to represent each resulting column as 1s and 0s. The classification results after normalization and quantile assignment are given in table 4. After analysis, it was clear that this preprocessing did not improve the performance of the model, although one hot encoding did help to speed up model building and prediction.

Table 4. Performance with normalization and quantile ranges.

| Classifier | Accuracy (normalization) | F1 Patient (normalization) | F1 Non-Patient (normalization) | Accuracy (quantile ranges) | F1 Patient (quantile ranges) | F1 Non-Patient (quantile ranges) |
| --- | --- | --- | --- | --- | --- | --- |
| Random Forest Classifier | 72.8813 | 0.82 | 0.43 | 71.1864 | 0.82 | 0.32 |
| Ada Boost Classifier | 72.8813 | 0.82 | 0.43 | 76.2711 | 0.85 | 0.36 |
| Decision Tree Classifier | 67.7966 | 0.77 | 0.49 | 74.5762 | 0.84 | 0.35 |
| Multinomial Naïve Bayes | 71.1864 | 0.83 | 0.00 | 67.7966 | 0.75 | 0.56 |
| Gaussian Naïve Bayes | 57.6271 | 0.58 | 0.58 | 37.2881 | 0.21 | 0.48 |
| K-Neighbors Classifier | 72.8813 | 0.83 | 0.33 | 71.1864 | 0.78 | 0.59 |
| SGD Classifier | 71.1864 | 0.83 | 0.00 | 71.1864 | 0.83 | 0.00 |
| SVC | 71.1864 | 0.83 | 0.00 | 71.1864 | 0.83 | 0.00 |
| OneVsRest Classifier | 71.1864 | 0.83 | 0.00 | 71.1864 | 0.83 | 0.00 |

Another inference is that the F1 score for non-patients is zero in some cases, which is a major problem. In such cases the accuracy may be high, but the model is not reliable because the classifier assigns the whole dataset to one class. The major reason for this is likely data imbalance. To address this issue, undersampling and oversampling techniques were introduced: cluster centroids were used for undersampling and the SMOTE algorithm for oversampling. The results are shown in table 5, and a sketch of how the resampling was applied follows the discussion below.

Table 5. Performance with undersampling (cluster centroids) and oversampling (SMOTE).

| Classifier | Accuracy (undersampling) | F1 Patient (undersampling) | F1 Non-Patient (undersampling) | Accuracy (SMOTE) | F1 Patient (SMOTE) | F1 Non-Patient (SMOTE) |
| --- | --- | --- | --- | --- | --- | --- |
| Random Forest Classifier | 67.7966 | 0.73 | 0.60 | 86.4406 | 0.91 | 0.75 |
| Ada Boost Classifier | 66.1016 | 0.71 | 0.58 | 74.5762 | 0.81 | 0.63 |
| Decision Tree Classifier | 57.6271 | 0.65 | 0.47 | 72.8813 | 0.79 | 0.60 |
| Multinomial Naïve Bayes | 45.7627 | 0.41 | 0.50 | 49.1525 | 0.50 | 0.48 |
| Gaussian Naïve Bayes | 59.3220 | 0.60 | 0.59 | 62.7118 | 0.65 | 0.61 |
| K-Neighbors Classifier | 67.7966 | 0.72 | 0.63 | 71.1864 | 0.80 | 0.51 |
| SGD Classifier | 33.8983 | 0.13 | 0.47 | 69.4915 | 0.81 | 0.18 |
| SVC | 66.1016 | 0.69 | 0.63 | 66.1016 | 0.71 | 0.60 |
| OneVsRest Classifier | 52.5423 | 0.55 | 0.50 | 40.6779 | 0.29 | 0.49 |

Table 5 shows that undersampling and oversampling could handle the data imbalance problem. Using cluster centroids as the undersampling technique did not improve the accuracy, whereas SMOTE gave a substantial improvement. The best accuracy was obtained with the Random Forest Classifier and the Ada Boost Classifier. Processing speed was improved by running the machine learning workload on the Intel® Xeon® Scalable processor, making use of the computational libraries from the Intel Distribution for Python.
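A sketch of how the resampling fits into the pipeline, using the split from the model-building step; note that only the training portion is resampled, while the held-out test data stays untouched (a point the cross-validation discussion below returns to).

```python
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import ClusterCentroids
from sklearn.ensemble import RandomForestClassifier

# Resample only the training data; the test split must stay untouched.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
# Alternative: undersampling with cluster centroids.
# X_res, y_res = ClusterCentroids(random_state=42).fit_resample(X_train, y_train)

model = RandomForestClassifier(random_state=42).fit(X_res, y_res)
print("Test accuracy:", model.score(X_test, y_test))
```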

Figure 6. ROC of Random Forest 5-fold cross validation.

Figure 7. ROC curve for various classifiers.

Figure 6 shows the ROC curve of the best classifier (Random Forest Classifier) for 5-fold cross validation. Higher accuracy was obtained during cross validation because the validation samples were drawn from the training data, which had been oversampled with SMOTE. The accuracy seen during cross validation was not attained during testing because the test data was isolated from the training data before SMOTE was applied.

The ROC curves for various classifiers are given in figure 7. They can be used to compare the output quality of the different classifiers.
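As an illustration, a single ROC curve for the trained random forest can be drawn as follows. Since model.classes_ is [1, 2], predict_proba column 1 holds the probability of class 2 (non-patient), which is treated as the positive label here.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve

# Score the held-out data; class 2 (non-patient) is the positive label.
scores = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, scores, pos_label=2)

plt.plot(fpr, tpr, label="Random Forest (AUC = %.2f)" % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], "k--")  # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend(loc="lower right")
plt.show()
```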

Conclusion

Normalization and quantile-based preprocessing did not improve the accuracy of the model, but handling the data imbalance using SMOTE gave better accuracy for the Random Forest and Ada Boost classifiers. A good model was created using the computational libraries from the Intel Distribution for Python on the Intel Xeon Scalable processor.

Author

Aswathy C is a technical consulting engineer working with the Intel® AI Academy Program.

