
Intel® Distribution for Python* versus Non-Optimized Python: Breast Cancer Classification


Abstract

This case study compares the performance of the Intel® Distribution for Python* with that of non-optimized Python on a breast cancer classification problem. The comparison uses machine learning algorithms from the scikit-learn* package in Python.

Introduction

Cancer refers to cells that grow out of control and invade other tissues. This process can also result in a tumor, where there is more cell growth than cell death. There are various types of cancer, including bladder, kidney, lung, and breast cancer. Currently, breast cancer is one of the most prevalent cancers, especially among women. It occurs when the cells in the breast divide and grow uncontrollably. Early detection of breast cancer can save lives. Causes of cancer include inherited genes, hormones, and an individual’s lifestyle.

This article provides a comparative study of the performance of non-optimized Python* and the Intel® Distribution for Python, using breast cancer classification as an example. The classifiers used for breast cancer classification were taken from the scikit-learn* package in Python. The time and accuracy of each classifier under each distribution were calculated and compared.

Dataset Description

The dataset for this study can be accessed from the Breast Cancer Wisconsin (Diagnostic) Data Set. The dataset is provided as a CSV file whose features were computed from a digitized image of a fine needle aspirate of a breast mass; they describe the characteristics of the cell nuclei present in the image. These values serve as the features for classification. Using these features, a sample can be classified into two classes: benign and malignant. Benign refers to a tumor that is not cancerous, whereas a malignant tumor is cancerous. The class distribution is 357 benign and 212 malignant rows. The classification is based on the diagnosis field, which takes the values M (malignant) or B (benign); hence, this is a binary classification.
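Before classification, it can help to verify the class distribution directly. Below is a minimal sketch using pandas; the file name data.csv and the column name diagnosis are assumptions based on the dataset description above.

import pandas as pd

# Load the Breast Cancer Wisconsin (Diagnostic) CSV (file name assumed).
df = pd.read_csv("data.csv")

# Class distribution of the diagnosis field: B = benign, M = malignant.
print(df["diagnosis"].value_counts())  # expected: B 357, M 212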

Hardware Configuration

The experiment used Intel® architecture with the following hardware configuration:

Feature | Specification
Architecture | x86_64
CPU op-mode(s) | 32-bit, 64-bit
Byte order | Little Endian
CPU(s) | 256
On-line CPU(s) list | 0-255
Thread(s) per core | 4
Core(s) per socket | 64
Socket(s) | 1
NUMA node(s) | 2
Vendor ID | GenuineIntel
CPU family | 6
Model | 87
Model name | Intel® Xeon Phi™ processor 7210 @ 1.30 GHz
Stepping | 1
CPU MHz | 1375.917
BogoMIPS | 2593.85
L1d cache | 32K
L1i cache | 32K
L2 cache | 1024K
NUMA node0 CPU(s) | 0-255

Software Configuration

The following software dependencies were used to perform the classification:

Software | Version
Python* | 2.7.13
scikit-learn* | 0.18.2
Anaconda* | 4.3.25

Classifier Implementation Pipeline

The goal was to identify the class (M or B) to which the tumor belonged. The following block diagram shows the classification steps, explained in the following section, for both the Intel Distribution for Python and non-optimized Python.

[Figure: Block diagram of the classification steps for non-optimized Python and the Intel Distribution for Python]

Implementation

The scikit-learn Python library provides a wide variety of machine learning algorithms for classification. Ten classifiers from the package were used for the study: Decision Tree Classifier, Gaussian NB, SGD Classifier, SVC, KNeighbors Classifier, OneVsRest Classifier, Quadratic Discriminant Analysis (QDA), Random Forest Classifier, MLP Classifier, and AdaBoost Classifier.

Create a Python file called classifier_ml.py. The following steps are implemented in this file:

  1. The input data described in the Dataset Description section is passed to preprocessing.
  2. As part of preprocessing, the dataset is checked for categorical values, which (if any) are converted to numerical data using a technique called one-hot encoding. This is important because some classifiers in scikit-learn work only with numerical values. Here, the diagnosis field containing the values "M" and "B" is converted to 1 and 0, respectively. Columns such as "id" are irrelevant for classification and hence can be dropped.
  3. After preprocessing, all columns except the diagnosis field are taken as the features; the diagnosis column is the target.
  4. 70 percent of the data is used for training and 30 percent for testing. The split is done using the StratifiedShuffleSplit function from scikit-learn's model_selection module [1].
  5. Keeping the default environment intact, the accuracy of each classifier is recorded using the scikit-learn package [2].
  6. The file classifier_ml.py is now executed. To reduce measurement noise, the time taken (t_nop) is measured as an average over 10 runs, as follows (a minimal sketch of classifier_ml.py appears after the timing command):

time (cmd="python classifier_ml.py"; for i in $(seq 10); do $cmd; done)
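For reference, the following is a minimal sketch of what classifier_ml.py could look like for steps 1 through 5. The input file name is an assumption, the ten classifiers are used with scikit-learn's default settings, and OneVsRestClassifier is shown wrapping SVC because it requires a base estimator (the article does not specify which one was used).

# classifier_ml.py -- a minimal sketch of steps 1 through 5 above.
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier

# Steps 1-2: load the data (file name assumed), encode the labels,
# and drop the irrelevant "id" column.
df = pd.read_csv("data.csv")
df["diagnosis"] = df["diagnosis"].map({"M": 1, "B": 0})
df = df.drop(["id"], axis=1)

# Step 3: every column except diagnosis is a feature; diagnosis is the target.
X = df.drop("diagnosis", axis=1).values
y = df["diagnosis"].values

# Step 4: stratified 70/30 train/test split.
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(sss.split(X, y))
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

# Step 5: train each classifier with default settings and record its accuracy.
classifiers = [
    DecisionTreeClassifier(),
    GaussianNB(),
    SGDClassifier(),
    SVC(),
    KNeighborsClassifier(),
    OneVsRestClassifier(SVC()),
    QuadraticDiscriminantAnalysis(),
    RandomForestClassifier(),
    MLPClassifier(),
    AdaBoostClassifier(),
]
for clf in classifiers:
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print("%s: %.2f%%" % (clf.__class__.__name__, acc * 100))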

Steps 1 through 6 provide the time and accuracy values for non-optimized Python. Repeat the same steps with the Intel Distribution for Python to obtain its time (t_idp) and accuracy.

To enable the Intel Distribution for Python, follow the steps given in Installing Intel® Distribution for Python* and Intel® Performance Libraries with Anaconda*.
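If Anaconda is already installed, one common route is to pull the full Intel Python metapackage from Intel's conda channel into a fresh environment; the environment name idp below is an arbitrary choice, and the channel and package names follow Intel's published instructions:

conda config --add channels intel
conda create -n idp intelpython2_full python=2
source activate idp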

The results are shown in Table 1.

The accuracy values for each classifier are the same for both non-optimized Python and the Intel Distribution for Python. Therefore, the accuracy values listed in Table 1 are common for both distributions.

The Performance Gain percentage with respect to time is calculated by the following formula:

Performance Gain % = (t_nop - t_idp) / t_nop * 100

From the formula, it is clear that a positive Performance Gain percentage indicates better performance from the Intel Distribution for Python. The higher the value, the better the Intel Distribution for Python performs compared to non-optimized Python.
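As a quick sanity check of the formula, here is a small sketch with hypothetical timings (the numbers below are illustrative, not measurements from the study):

def performance_gain(t_nop, t_idp):
    # Percentage gain of the Intel Distribution for Python over non-optimized Python.
    return (t_nop - t_idp) / t_nop * 100.0

# Hypothetical example: 10 runs take 50 s on non-optimized Python and
# 33 s on the Intel Distribution for Python.
print(performance_gain(50.0, 33.0))  # 34.0 -> positive, so Intel Python is faster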

Results

Table 1 shows the accuracy of each classifier and the percentage of performance gain of the Intel Distribution for Python* relative to non-optimized Python.

Table 1: Gain percentage: non-optimized Python* versus Intel® Distribution for Python*

Classifier | Accuracy (Percent) | Performance Gain (Percent)
DecisionTreeClassifier | 90.64 | 34.69
GaussianNB | 94.74 | 35.01
SGDClassifier | 88.89 | 33.04
SVC | 94.74 | 32.29
KNeighborsClassifier | 92.98 | 34.35
OneVsRestClassifier | 92.40 | 33.00
QuadraticDiscriminantAnalysis | 94.15 | 33.65
RandomForestClassifier | 93.57 | 30.36
MLPClassifier | 65.50 | 32.09
AdaBoostClassifier | 94.74 | 27.09

Conclusion

The performance gain clearly shows that the Intel Distribution for Python performed better, in terms of execution time, than non-optimized Python. As expected, accuracy was identical regardless of whether non-optimized Python or the Intel Distribution for Python was used.

References

  1. Cross Validation - Stratified Shuffle Split: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html
  2. An introduction to machine learning with scikit-learn: http://scikit-learn.org/stable/tutorial/basic/tutorial.html
