
Liver Patient Dataset Classification Using the Intel® Distribution for Python*


Abstract

This paper focuses on the classification of the Indian Liver Patient dataset using the Intel® Distribution for Python* on the Intel® Xeon® Scalable processor. Various preprocessing steps were applied in order to analyze their effect on the classification problem. Using the available features, the classifier aims to predict whether or not a person has liver disease. Early detection of the disease, without the need for manual effort, could be a great support to people in the medical field. The best results were obtained by using SMOTE as the preprocessing method and the Random Forest algorithm as the classifier.

Introduction

The liver, the largest solid organ in the human body, performs several important functions. Its major functions include manufacturing essential proteins and blood clotting factors; metabolizing fat and carbohydrates; eliminating harmful waste products and detoxifying alcohol and certain drugs; and secreting bile to aid digestion and intestinal absorption. Disorders of the liver can affect the smooth functioning of these activities.

Excessive consumption of alcohol, viral infections, and the intake of contaminated food or drugs are among the major causes of liver disease. Symptoms may or may not be visible in the early stages, and if not attended to properly, liver diseases can lead to life-threatening conditions. It is always better to diagnose the disease at an early stage in order to help ensure a high rate of survival for the patient.

Classification is an effective technique for handling this kind of problem in the medical field. Using the available feature values, a classifier can predict whether or not a person has liver disease, which helps doctors identify the disease in advance. It is generally recommended to reduce the Type I error (rejecting the null hypothesis when it is actually true), because it is better to identify a non-liver patient as a patient than to fail to identify a liver patient as one.

In this experiment, various preprocessing methods were tried prior to model building and training for comparison. Computational libraries such as scikit-learn*, NumPy, and SciPy* from the Intel Distribution for Python on the Intel Xeon Scalable processor were used to create the predictive models.

Environment Setup

Table 1 describes the environment setup that was used to conduct the experiment.

Table 1. Environment setup.

| Setup | Version |
| --- | --- |
| Processor | Intel® Xeon® Gold 6128 processor, 3.40 GHz |
| System | CentOS* (7.4.1708) |
| Core(s) per socket | 6 |
| Anaconda* with Intel channel | 4.3.21 |
| Intel® Distribution for Python* | 3.6.3 |
| Scikit-learn* | 0.19.0 |
| NumPy | 1.13.3 |
| Pandas | 0.20.3 |

Dataset Description

The Indian Liver Patient dataset was collected from the northeast area of the state of Andhra Pradesh, India. This is a binary classification problem whose classes are liver patient (represented as 1 in the dataset) and non-liver patient (represented as 2). There are 10 features, which are listed in table 2.

Table 2. Dataset description.

| Attribute Name | Attribute Description |
| --- | --- |
| V1 | Age of the patient. Any patient whose age exceeded 89 is listed as being age 90. |
| V2 | Gender of the patient |
| V3 | Total bilirubin |
| V4 | Direct bilirubin |
| V5 | Alkphos (alkaline phosphatase) |
| V6 | Sgpt (alanine aminotransferase) |
| V7 | Sgot (aspartate aminotransferase) |
| V8 | Total proteins |
| V9 | Albumin |
| V10 | A/G ratio (albumin to globulin ratio) |
| Class | Liver patient or not |

Methodology

Figure 1. Methodology.

Data Analysis

Before performing any processing on the available data, a data analysis is recommended. This process includes visualizing the data and identifying outliers and skewed predictors. These tasks help to inspect the data and thereby spot missing values and irrelevant information in the dataset. A data cleanup process is performed to handle these issues and to ensure data quality. A better understanding of the dataset helps to identify useful information and supports decision making.

The Indian Liver Patient dataset consists of 583 records, of which 416 are records of people with liver disease and the remaining are records of people without liver disease. The dataset has 10 features, of which only one is categorical (V2, gender of the patient). The last column of the dataset holds the class of each sample (liver patient or not): a value of 1 indicates that the person has liver disease and a 2 indicates that the person does not. There are no missing values in the dataset.
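For illustration, this inspection can be done with pandas. The following is a minimal sketch; the file name ilpd.csv and the V1-V10/Class column labels are assumptions made for this example, not part of the original experiment.

```python
import pandas as pd

# Assumed local copy of the Indian Liver Patient dataset (UCI ILPD).
# The file name and column labels below are illustrative assumptions.
cols = ["V1", "V2", "V3", "V4", "V5",
        "V6", "V7", "V8", "V9", "V10", "Class"]
df = pd.read_csv("ilpd.csv", names=cols)

print(df.shape)                    # expected: (583, 11)
print(df["Class"].value_counts())  # 1 = liver patient, 2 = non-liver patient
print(df.isnull().sum())           # confirm there are no missing values
```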

Figure 2. Visualization: liver patient dataset class.

Figure 3. Visualization: male and female population.

Figure 2 shows a visualization of the number of liver patients and non-liver patients in the dataset, whereas figure 3 represents a visualization of the male and female population in the dataset. Histograms of numerical variables are represented by figure 4.

Figure 4. Visualization of numerical variables in the dataset.

Data Preprocessing

Some datasets contain irrelevant information, noise, missing values, and so on. These datasets should be handled properly to get a better result for the data mining process. Data preprocessing includes data cleaning, preparation, transformation, and dimensionality reduction, which convert the raw data into a form that is suitable for further processing.

The major objective of the experiment is to show the effect of various preprocessing methods on the dataset prior to classification. Different classification algorithms were applied to compare the results.

Some of the preprocessing steps include the following (a minimal code sketch follows the list):

  • Normalization: This process scales each feature into a given range. The preprocessing.MinMaxScaler() function in the sklearn package is used to perform this action.
  • Assigning quantile ranges: The pandas.qcut function is used for quantile-based discretization. Based on the sample quantiles or ranks, the variables are discretized and assigned categorical values.
  • Oversampling: This technique handles an unbalanced dataset by generating new samples in the under-represented class. SMOTE is used for oversampling the data; it has several variants that differ in how the samples to interpolate between are selected. The SMOTE() function from imblearn.over_sampling is used to implement this.
  • Undersampling: Another technique for dealing with unbalanced data is undersampling, which reduces the number of samples in the over-represented class. ClusterCentroids is used for undersampling; it applies the K-means algorithm to reduce the number of samples. The ClusterCentroids() function from the imblearn.under_sampling package is used.
  • Binary encoding: This method converts categorical data into numerical form and is used when the feature column has binary values. In the liver patient dataset, column V2 (gender) has the values male/female, which are binary encoded as "0" and "1".
  • One hot encoding: Categorical features are mapped onto a set of columns that have the value "1" or "0" to represent the presence or absence of that feature. Here, after assigning quantile ranges to some features (V1, V3, V5, V6, V7), one hot encoding is applied to represent them as 1s and 0s.
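The sketch below shows how these steps can be wired together, continuing from the loading snippet above. The gender strings Male/Female and the choice of four quantile bins are assumptions for illustration; depending on the installed imbalanced-learn version, fit_resample may instead be named fit_sample.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import ClusterCentroids

# df as loaded in the earlier sketch.
X = df.drop("Class", axis=1).copy()
y = df["Class"]

# Binary encoding: map the gender strings (assumed Male/Female) to 0/1.
X["V2"] = X["V2"].map({"Male": 0, "Female": 1})

# Quantile-based discretization of age, then one hot encoding of the bins.
X["V1"] = pd.qcut(X["V1"], q=4, labels=False)
X = pd.get_dummies(X, columns=["V1"])

# Normalization: scale each feature into the [0, 1] range.
X_scaled = MinMaxScaler().fit_transform(X)

# Oversampling with SMOTE, or undersampling with cluster centroids.
X_over, y_over = SMOTE(random_state=42).fit_resample(X_scaled, y)
X_under, y_under = ClusterCentroids(random_state=42).fit_resample(X_scaled, y)
```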

Feature Selection

Feature selection is mainly applied to large datasets to reduce high dimensionality. It helps to identify the most important features in the dataset to use for model building. In the Indian Liver Patient dataset, forests of randomized trees are used to estimate feature importance; the ExtraTreesClassifier() function from the sklearn.ensemble package is used for the calculation. Figure 5 shows the feature importance computed with forests of trees. From the figure, it is clear that the most important feature is V5 (alkphos alkaline phosphatase) and the least important is V2 (gender).

Removing the least significant features helps to reduce the processing time. Here V2 (gender of the patient), V8 (total proteins), V9 (albumin), and V10 (A/G ratio) are dropped in order to reduce the number of features for model building.
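A sketch of the importance calculation follows, assuming df as loaded earlier and working on the ten raw features with gender encoded numerically (before the quantile/one hot step); the tree count is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

# Keep the ten raw features, with gender encoded 0/1 (assumed labels).
X = df.drop("Class", axis=1).copy()
X["V2"] = X["V2"].map({"Male": 0, "Female": 1})
y = df["Class"]

# Fit forests of randomized trees and rank the features by importance.
forest = ExtraTreesClassifier(n_estimators=250, random_state=0)
forest.fit(X, y)
for i in np.argsort(forest.feature_importances_)[::-1]:
    print(X.columns[i], forest.feature_importances_[i])

# Drop the least significant features before model building.
X_reduced = X.drop(["V2", "V8", "V9", "V10"], axis=1)
```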

Figure 5. Feature importance with forests of trees.

Model Building

A list of classifiers was used to create various classification models, which can then be used for prediction. Part of the dataset was used for training the model and the rest for testing: in this experiment, 90 percent of the data was used for training and 10 percent for testing. Because StratifiedShuffleSplit (a function in scikit-learn) was used to split the data, the percentage of samples from each class was preserved; that is, 90 percent of the samples from each class went into training and the remaining 10 percent from each class into testing. Classifiers from the scikit-learn package were used for model building.
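A minimal sketch of the split and one model fit, assuming X_reduced and y from the feature-selection step; the random forest here stands in for any of the classifiers compared later.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedShuffleSplit

# 90/10 stratified split: class proportions are preserved in both parts.
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=42)
train_idx, test_idx = next(sss.split(X_reduced, y))
X_train, X_test = X_reduced.iloc[train_idx], X_reduced.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
```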

Prediction

The label of a new input can be predicted using the trained model. The accuracy and F1 score were analyzed to understand how well the model has learned during training.

Evaluation of the model

Several methods can be used to evaluate the performance of a model. Cross validation, the confusion matrix, accuracy, precision, and recall are some of the popular performance evaluation measures.

The performance of a model cannot be assessed by considering accuracy alone, because accuracy by itself can be misleading. Therefore, this experiment considers the F1 score along with the accuracy for evaluation.
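Both measures are available in sklearn.metrics; a short sketch, continuing from the trained model above:

```python
from sklearn.metrics import accuracy_score, f1_score

# Predict on the held-out 10 percent and report accuracy plus per-class F1.
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("F1 (patient):", f1_score(y_test, y_pred, pos_label=1))
print("F1 (non-patient):", f1_score(y_test, y_pred, pos_label=2))
```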

Observation and Results

In order to find out the effect of feature selection on the liver patient dataset, accuracy and F1 score were analyzed with and without feature selection (see table 3).

After analyzing the results, it was inferred that removing the least significant features produced no remarkable change, except in the case of the Random Forest Classifier. Because feature selection reduces the processing time, it was applied before the further processing techniques.

Table 3. Performance with and without feature selection.

| Classifier | Accuracy (without feature selection) | F1 Patient (without) | F1 Non-Patient (without) | Accuracy (with feature selection) | F1 Patient (with) | F1 Non-Patient (with) |
| --- | --- | --- | --- | --- | --- | --- |
| Random Forest Classifier | 71.1186 | 0.81 | 0.37 | 74.5762 | 0.84 | 0.44 |
| Ada Boost Classifier | 74.5762 | 0.83 | 0.52 | 72.8813 | 0.82 | 0.43 |
| Decision Tree Classifier | 66.1016 | 0.76 | 0.41 | 67.7966 | 0.77 | 0.49 |
| Multinomial Naïve Bayes | 47.4576 | 0.47 | 0.47 | 49.1525 | 0.50 | 0.48 |
| Gaussian Naïve Bayes | 62.7118 | 0.65 | 0.61 | 61.0169 | 0.62 | 0.60 |
| K-Neighbors Classifier | 72.8813 | 0.83 | 0.33 | 72.8813 | 0.83 | 0.33 |
| SGD Classifier | 71.1864 | 0.83 | 0.00 | 67.7966 | 0.81 | 0.00 |
| SVC | 71.1864 | 0.83 | 0.00 | 71.1864 | 0.83 | 0.00 |
| OneVsRest Classifier | 62.7118 | 0.77 | 0.08 | 32.2033 | 0.09 | 0.46 |

After feature selection, some preprocessing techniques were applied, including normalization, in which each feature was scaled and translated so that it lies within a given range on the training set. Another preprocessing step assigned quantile ranges to some of the feature values, followed by one hot encoding to represent each resulting column as 1s and 0s. The classification results after normalization and quantile assignment are given in table 4. After analysis, it was clear that this preprocessing did not improve the performance of the model, although one hot encoding did help to speed up model building and prediction.

Table 4. Performance with normalization and quantile ranges.

| Classifier | Accuracy (normalization) | F1 Patient (normalization) | F1 Non-Patient (normalization) | Accuracy (quantile ranges) | F1 Patient (quantile ranges) | F1 Non-Patient (quantile ranges) |
| --- | --- | --- | --- | --- | --- | --- |
| Random Forest Classifier | 72.8813 | 0.82 | 0.43 | 71.1864 | 0.82 | 0.32 |
| Ada Boost Classifier | 72.8813 | 0.82 | 0.43 | 76.2711 | 0.85 | 0.36 |
| Decision Tree Classifier | 67.7966 | 0.77 | 0.49 | 74.5762 | 0.84 | 0.35 |
| Multinomial Naïve Bayes | 71.1864 | 0.83 | 0.00 | 67.7966 | 0.75 | 0.56 |
| Gaussian Naïve Bayes | 57.6271 | 0.58 | 0.58 | 37.2881 | 0.21 | 0.48 |
| K-Neighbors Classifier | 72.8813 | 0.83 | 0.33 | 71.1864 | 0.78 | 0.59 |
| SGD Classifier | 71.1864 | 0.83 | 0.00 | 71.1864 | 0.83 | 0.00 |
| SVC | 71.1864 | 0.83 | 0.00 | 71.1864 | 0.83 | 0.00 |
| OneVsRest Classifier | 71.1864 | 0.83 | 0.00 | 71.1864 | 0.83 | 0.00 |

Another inference is that the F1 score for non-patients is zero in some cases, which is a major problem. In such cases the accuracy may be high, but the model is not reliable because the classifier assigns the whole dataset to one class. The major reason for this is likely data imbalance. To address this issue, undersampling and oversampling techniques were introduced: cluster centroids were used for undersampling and the SMOTE algorithm for oversampling. The results are shown in table 5, and a sketch of how the resampling was applied follows the discussion below.

Table 5. Performance with undersampling (cluster centroids) and oversampling (SMOTE).

| Classifier | Accuracy (undersampling) | F1 Patient (undersampling) | F1 Non-Patient (undersampling) | Accuracy (SMOTE) | F1 Patient (SMOTE) | F1 Non-Patient (SMOTE) |
| --- | --- | --- | --- | --- | --- | --- |
| Random Forest Classifier | 67.7966 | 0.73 | 0.60 | 86.4406 | 0.91 | 0.75 |
| Ada Boost Classifier | 66.1016 | 0.71 | 0.58 | 74.5762 | 0.81 | 0.63 |
| Decision Tree Classifier | 57.6271 | 0.65 | 0.47 | 72.8813 | 0.79 | 0.60 |
| Multinomial Naïve Bayes | 45.7627 | 0.41 | 0.50 | 49.1525 | 0.50 | 0.48 |
| Gaussian Naïve Bayes | 59.3220 | 0.60 | 0.59 | 62.7118 | 0.65 | 0.61 |
| K-Neighbors Classifier | 67.7966 | 0.72 | 0.63 | 71.1864 | 0.80 | 0.51 |
| SGD Classifier | 33.8983 | 0.13 | 0.47 | 69.4915 | 0.81 | 0.18 |
| SVC | 66.1016 | 0.69 | 0.63 | 66.1016 | 0.71 | 0.60 |
| OneVsRest Classifier | 52.5423 | 0.55 | 0.50 | 40.6779 | 0.29 | 0.49 |

Table 5 shows that undersampling and oversampling could handle the data imbalance problem. Using cluster centroids as the undersampling technique did not improve the accuracy, whereas SMOTE gave a substantial improvement. The best accuracy was obtained with the Random Forest Classifier and the Ada Boost Classifier. Processing speed was improved by running the machine learning workload on the Intel® Xeon® Scalable processor, making use of the computational libraries from the Intel Distribution for Python.
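A sketch of how the resampling fits into the pipeline, using the split from the model-building step; note that only the training portion is resampled, while the held-out test data stays untouched (a point the cross-validation discussion below returns to).

```python
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import ClusterCentroids
from sklearn.ensemble import RandomForestClassifier

# Resample only the training data; the test split must stay untouched.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
# Alternative: undersampling with cluster centroids.
# X_res, y_res = ClusterCentroids(random_state=42).fit_resample(X_train, y_train)

model = RandomForestClassifier(random_state=42).fit(X_res, y_res)
print("Test accuracy:", model.score(X_test, y_test))
```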

Figure 6. ROC of Random Forest 5-fold cross validation.

Figure 7. ROC curve for various classifiers.

Figure 6 shows the ROC curve of the best classifier (Random Forest Classifier) for 5-fold cross validation. Higher accuracy was obtained during cross validation because the validation samples were drawn from the training data, which had been oversampled with SMOTE. The accuracy seen during cross validation was not attained during testing because the test data was isolated from the training data before SMOTE was applied.

The ROC curves for various classifiers are given in figure 7. They can be used to compare the output quality of the different classifiers.
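As an illustration, a single ROC curve for the trained random forest can be drawn as follows. Since model.classes_ is [1, 2], predict_proba column 1 holds the probability of class 2 (non-patient), which is treated as the positive label here.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve

# Score the held-out data; class 2 (non-patient) is the positive label.
scores = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, scores, pos_label=2)

plt.plot(fpr, tpr, label="Random Forest (AUC = %.2f)" % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], "k--")  # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend(loc="lower right")
plt.show()
```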

Conclusion

Normalization and quantile-based preprocessing did not improve the accuracy of the model, but handling the data imbalance using SMOTE gave better accuracy for the Random Forest and Ada Boost classifiers. A good model was created using the computational libraries from the Intel Distribution for Python on the Intel Xeon Scalable processor.

Author

Aswathy C is a technical consulting engineer working with the Intel® AI Academy Program.

