Credit Risk Classification: Faster Machine Learning with Intel Optimized Packages

Abstract

This paper explains the importance of using Intel® Performance Libraries to solve a machine-learning problem such as credit risk classification. The objective of solving a credit risk classification problem is to minimize loss from the bank's perspective; for example, when a person applies for a loan, the bank has to decide whether or not to approve it. The case study uses Intel® Distribution for Python* and the Python API for Intel® Data Analytics Acceleration Library (Intel® DAAL), named PyDAAL, to boost machine-learning and data analytics performance. Taking advantage of the optimized scikit-learn* (scikit-learn with Intel DAAL) that comes with Intel® Distribution for Python, we were able to achieve good results for the prediction problem.

Introduction

Credit risk is one of the major financial risks that exist in the banking system. How a bank manages its credit risk is critical for its performance over time; capital depletion through loan losses has been the proximate cause of most institutional failures. Identifying and rating credit risk is the first step in managing it effectively. An applicant is a good credit risk if he or she is likely to repay the loan; conversely, an applicant who is not likely to repay the loan is considered a bad credit risk. The bank has to analyze the applicant's demographic and socio-economic profile before deciding on his or her loan application.

With the help of Intel optimized computational packages and an Intel® Xeon® Gold 6128 processor, a faster predictive model is developed on this data. The machine-learning model helps guide the bank manager in deciding whether to approve a loan to a prospective applicant based on his or her profile.

Hardware Details

The hardware configuration of the Intel® Xeon® Gold 6128 processor is shown in Table 1.

Table 1. Hardware Configuration.

Intel® Xeon® Gold processor based architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel® Xeon® Gold 6128 processor 3.40 GHz
Stepping: 4
CPU MHz: 1199.960
BogoMIPS: 6800.00
Virtualization type: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 19712K
NUMA node0 CPU(s): 0-5,12-17
NUMA node1 CPU(s): 6-11,18-23

Software Configuration

The development of this use case had the following dependencies (Table 2).

Table 2. Software Configuration.

Anaconda* with Intel channel: 4.3.21
Intel® optimized Python*: 3.6.3
Optimized scikit-learn*: 0.19.0
Intel® optimized NumPy: 1.13.3

Dataset Description

The original dataset contains 1,000 entries with 20 categorical and symbolic attributes. Each entry represents a person who applies for a credit loan with a bank. Each person is classified as a good or bad credit risk according to the set of attributes. Table 3 shows the set of attributes in the dataset.

Table 3. Dataset Description.

checking_status: Status of existing checking account, in Deutsche Marks (DM)
duration: Duration in months
credit_history: Credit history (credits taken, paid back duly, delays, critical accounts)
purpose: Purpose of the credit (car, television, etc.)
credit_amount: Credit loan amount, in Deutsche Marks (DM)
savings_status: Status of savings account and bonds, in Deutsche Marks (DM)
employment: Present employment, in number of years
installment_commitment: Installment rate as a percentage of disposable income
personal_status: Personal status (married, single, etc.) and sex
other_parties: Other debtors and guarantors
residence_since: Years at present residence
property_magnitude: Property (e.g., real estate)
age: Age in years
other_payment_plans: Other installment plans (banks, stores, etc.)
housing: Housing (rent, own)
existing_credits: Number of existing credits at this bank
job: Job
num_dependents: Number of people liable to provide maintenance for
own_telephone: Telephone (yes or no)
foreign_worker: Foreign worker (yes or no)
class: Good credit or bad credit

Intel® Distribution for Python*, Optimized scikit-learn*, and PyDAAL module

Machine learning and data analysis using Python get their power from Intel® Distribution for Python [1]. Intel® Distribution for Python is equipped with Intel optimized computational packages [2] such as NumPy, SciPy, scikit-learn*, and PyDAAL (a free package that implements Python bindings to the Intel® Data Analytics Acceleration Library, or Intel® DAAL) [3].

scikit-learn is a commonly used library for machine learning mainly because of its algorithm richness. Intel DAAL is more than just a set of machine-learning algorithms, as it implements and optimizes the full data analysis pipeline, from loading data and transforming and filtering it to analysis and modeling of data with classic statistical and machine-learning techniques, as well as advanced deep-learning methods. scikit-learn algorithms shipped within Intel® Distribution for Python take advantage of PyDAAL, bringing scikit-learn performance to new levels.

Solution Design

The major steps involved in solution design are shown in Figure 1.

Figure 1. Solution design.

A detailed explanation of each step is given below.

Dataset analysis

The primary part of any predictive modeling task is dataset analysis, which is an initial exploration of the data. The goal of dataset analysis is to gain a solid understanding of the data in order to derive a better solution. Gathering good information about the data also helps with feature engineering and feature selection. Data visualization using graphs helps us understand the relations between attributes and the patterns in the data.

The German credit dataset [4] has 21 features, of which 14 are categorical variables and the remaining 7 are numerical. The last column is the label, which denotes the credit risk and has only two possible values: good credit risk and bad credit risk. Since both categorical and numerical variables are included in the dataset, appropriate statistical and distribution analyses are provided.

Numerical variable analysis

Table 4 shows the summary of all the numerical variables in the dataset. It includes count, mean, standard deviation (std), min, quartiles, and max in its output.

Table 4. Numerical Variable Analysis.

Here are a few inferences that were obtained:

  1. There are no missing values in the dataset. The count is 1,000 for all columns, which means that a value is present for each attribute in all 1,000 rows of the dataset.
  2. Most of the credit loan amount requests are in the range of 3,000 DM to 4,000 DM.
  3. The mean duration is about 21 months and the mean age is around 36 years.

Categorical variable analysis

Figure 2. Credit loan amount histogram analysis.

Figure 3. Credit loan amount box plot analysis.

Most of the credit loan amounts are in the range of 2,000 DM to 4,000 DM (Figure 2). The largest amount given is as high as 18,000 DM. Box plot analysis shows that higher credit loan amounts are mostly bad credit risks (Figure 3).

Figure 4. Purpose distribution analysis.

Television and radio or a new car are the top reasons most applicants seek a credit loan. Very few applicants seek a credit loan for education or retraining. This may mean either that education and retraining are not considered worth a credit loan, or that their costs are covered entirely by schools, universities, the government, or in some other way, which seems very unlikely.

Figure 5. Age distribution analysis.

Plotting the frequency of the credit loan against the age groups in Figure 5 shows that credit demands from applicants between the ages of 20 and 39 make up about two-thirds of the total credit demands.

Data pre-processing

Data pre-processing involves the transformations applied to the data before feeding it to the algorithm. Since most of the variables in the dataset are categorical, various techniques need to be applied to convert the categorical variables to numerical ones.

  1. Converting binary categorical data to numerical: In this step, the yes or no columns own_telephone and foreign_worker (refer to Table 3) are converted to 1s and 0s.
  2. One-hot encoding: This technique transforms each categorical feature with n possible values into n binary features, with only one active. The Pandas package in Python supports this via the get_dummies method, named for the dummy/indicator columns it creates. The following columns in the German credit data are one-hot encoded: checking_status, credit_history, savings_status, employment, personal_status, other_parties, job, housing, other_payment_plans, property_magnitude.
  3. Label encoding: Another approach for converting categorical variables to numerical is label encoding, which converts each value in a column to a numeric value. The purpose column in the data is label-encoded. The optimized scikit-learn* has pre-built functionality for this in its preprocessing module. A sketch of all three steps is shown after this list.
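
The following is a minimal sketch of the three conversion steps, assuming the data has been loaded into a pandas DataFrame with the column names from Table 3 (the file name and the literal "yes" value are assumptions):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Load the German credit data (file name is an assumption).
df = pd.read_csv("german_credit.csv")

# 1. Binary categorical to numerical: map the yes/no columns to 1s and 0s
#    (assumes the positive value is literally "yes").
for col in ["own_telephone", "foreign_worker"]:
    df[col] = (df[col] == "yes").astype(int)

# 2. One-hot encoding with pandas get_dummies.
one_hot_cols = ["checking_status", "credit_history", "savings_status",
                "employment", "personal_status", "other_parties", "job",
                "housing", "other_payment_plans", "property_magnitude"]
df = pd.get_dummies(df, columns=one_hot_cols)

# 3. Label encoding for the 'purpose' column.
df["purpose"] = LabelEncoder().fit_transform(df["purpose"])
```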

Feature selection

Datasets may contain irrelevant or redundant features that can make the machine-learning model more complicated. In this step, we aim to remove the irrelevant features, which may otherwise increase run time, generate overly complex patterns, and so on. The resulting subset of features is used for further analysis. Feature selection can be done using, for example, the Random Forest or XGBoost algorithm. In this experiment, the XGBoost algorithm is used to select the best features, namely those with a score above a predefined threshold value; a sketch follows below. Table 5 shows some of the features sorted by score.
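
A minimal sketch of threshold-based selection with XGBoost, assuming X and y hold the pre-processed features and labels; the threshold value here is an assumption, not the one used in the paper:

```python
from xgboost import XGBClassifier
from sklearn.feature_selection import SelectFromModel

# Fit an XGBoost classifier and keep only the features whose importance
# score exceeds the threshold (0.02 is an assumed value).
model = XGBClassifier()
model.fit(X, y)
selector = SelectFromModel(model, threshold=0.02, prefit=True)
X_selected = selector.transform(X)
```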

Table 5. Feature Importance.

credit_amount: 0.1724
duration: 0.1122
age: 0.1057
purpose: 0.0634
checking_status_no checking: 0.0423
installment_commitment: 0.0341
plan_none: 0.0309
employment_4<=X<7: 0.0293
residence_since: 0.0260
credit_history_all paid: 0.0244

Data split

The data is then split into training and test sets for further analysis: 90% of the data is used for training and 10% for testing. The train_test_split function in scikit-learn is used for the split, as sketched below.
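
A minimal sketch, assuming X_selected and y are the feature-selected data and labels; the random_state is an assumption added for reproducibility:

```python
from sklearn.model_selection import train_test_split

# 90/10 train/test split as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.1, random_state=42)
```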

Classifier implementation

The classifiers are implemented using two packages: scikit-learn with Intel DAAL, and PyDAAL.

scikit-learn with Intel DAAL

Balancing the dataset

The dataset is highly imbalanced, with 70% of the entries representing good credit risks. This imbalance is handled with the SMOTE algorithm, which generates a new, balanced (smoted) training dataset. SMOTE artificially generates observations of the minority class using the nearest neighbors of elements of that class, as sketched below.
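
A minimal sketch using the SMOTE implementation from the imbalanced-learn package (an assumption; the paper does not name the implementation used). It is applied to the training data only:

```python
from imblearn.over_sampling import SMOTE

# Oversample the minority class in the training set.
# (Older imbalanced-learn versions name this method fit_sample.)
smote = SMOTE(random_state=42)
X_train_bal, y_train_bal = smote.fit_resample(X_train, y_train)
```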

Model building and training

There are various machine-learning models (algorithms) that have been created by researchers and data scientists over the years. In this stage, machine-learning models are selected for training. All classifiers in scikit-learn use a fit(X, y) method to fit the model to the given training data X and training labels y. To compare the performance of various models, an ensemble of classifiers is used, as sketched below. Once a model is trained, it can be used for prediction.
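
A minimal sketch of training a few of the classifiers compared in Table 6; the hyperparameters are assumptions:

```python
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# A subset of the classifiers evaluated in this study.
classifiers = {
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "Ada Boost": AdaBoostClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
}

# Fit each model on the balanced training data.
for name, clf in classifiers.items():
    clf.fit(X_train_bal, y_train_bal)
```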

Prediction

During this stage, the trained model predicts the output for a given input based on what it has learned; that is, given an unlabeled observation X, predict(X) returns the predicted label y.

Evaluation

In order to measure the performance of a model, various performance evaluation metrics are available. We used accuracy, precision, and recall as our evaluation metrics to choose the best model for the problem, as sketched below.
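
A minimal sketch of the prediction and evaluation steps, reusing the classifiers dictionary from the training sketch; the weighted averaging for the per-class metrics is an assumption:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Predict on the held-out test set and report the evaluation metrics.
for name, clf in classifiers.items():
    y_pred = clf.predict(X_test)
    print(name,
          accuracy_score(y_test, y_pred),
          precision_score(y_test, y_pred, average="weighted"),
          recall_score(y_test, y_pred, average="weighted"),
          f1_score(y_test, y_pred, average="weighted"))
```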

Python API for Intel DAAL (PyDAAL)

Data representation in PyDAAL

All the pre-processing steps up to step 4 (data split) are the same for the PyDAAL implementation. Every algorithm in DAAL accepts input in the form of numeric tables, a generic data type for representing data in memory. Since all the converted features are of the same type, we used HomogenNumericTable for representation. The dataset obtained after feature selection in scikit-learn is a NumPy ndarray, which can easily be converted to a HomogenNumericTable using the built-in functionality in the DAAL data management module [5], as sketched below.
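
A minimal sketch of the conversion, assuming X_train and y_train are NumPy ndarrays produced by the earlier steps:

```python
import numpy as np
from daal.data_management import HomogenNumericTable

# HomogenNumericTable expects a C-contiguous 2-D array of a single dtype;
# labels go into an n-by-1 table.
train_data = HomogenNumericTable(
    np.ascontiguousarray(X_train, dtype=np.float64))
train_labels = HomogenNumericTable(
    np.ascontiguousarray(y_train.reshape(-1, 1), dtype=np.float64))
```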

Model building and training

An algorithm object is created to train the classifier in the appropriate processing mode (batch, online, or distributed) [5]. The two pieces of input (i.e., data and labels) are set using the input.set member methods [6] of the algorithm object. The compute() method is then called to train the model (in online mode it updates a partial model), as sketched below.
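
A minimal sketch in batch processing mode, using the multinomial naive Bayes trainer as an example (an assumption; the paper does not show which PyDAAL classifier its code corresponds to):

```python
from daal.algorithms import classifier
from daal.algorithms.multinomial_naive_bayes import training

nClasses = 2  # good and bad credit risk

# Create the training algorithm object in batch processing mode.
algorithm = training.Batch(nClasses)
algorithm.input.set(classifier.training.data, train_data)
algorithm.input.set(classifier.training.labels, train_labels)

# compute() trains the model on the supplied numeric tables.
trainingResult = algorithm.compute()
model = trainingResult.get(classifier.training.model)
```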

Prediction

After creating the model, a prediction algorithm object is defined. The test dataset and the trained model are passed to the algorithm using the input.setTable() and input.setModel() methods, respectively. The predictions are computed using the compute() method, as sketched below.
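
A sketch of the corresponding prediction step, continuing the multinomial naive Bayes example and assuming test_data is a HomogenNumericTable built like the training tables:

```python
from daal.algorithms import classifier
from daal.algorithms.multinomial_naive_bayes import prediction

# Create the prediction algorithm object in batch processing mode.
algorithm = prediction.Batch(nClasses)
algorithm.input.setTable(classifier.prediction.data, test_data)
algorithm.input.setModel(classifier.prediction.model, model)

# compute() returns a result whose 'prediction' table holds the labels.
predictionResult = algorithm.compute()
predicted_labels = predictionResult.get(classifier.prediction.prediction)
```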

Evaluation

After finding the predictions using the compute() method, the accuracy, precision, recall, and F1 score of the model are calculated.

Experimental Results

Since the problem is binary classification, apart from accuracy, evaluation metrics such as precision, recall, and F1 score are used to identify the best classifier from the ensemble of algorithms. The classifiers are categorized as Best, Good, and Poor based on these metrics: a classifier that gives stable values for accuracy, precision, recall, and F1 score is categorized as a best classifier, whereas a classifier that gives highly varying values for the evaluation metrics is categorized as a poor classifier.

Using scikit-learn with Intel DAAL

Table 6 depicts the results of using various classifiers on the balanced and unbalanced datasets; results for the full and feature-selected datasets are also compared. All the features are pre-processed using the encoding techniques discussed in the data pre-processing section. A receiver operating characteristic (ROC) curve is also plotted for each classifier to analyze its performance; the more closely the curve follows the left-hand border and then the top border of the ROC space, the better the classifier. Figure 6 shows the ROC curves for classifiers in scikit-learn with Intel® DAAL; they demonstrate that the Random Forest and Ada Boost classifiers are the best.

Table 6. Experiment Results of scikit-learn* with Intel® DAAL.

Values are reported as Accuracy/Precision/Recall/F1-score (%), for the unbalanced full training data, the unbalanced feature-selected data, and the balanced feature-selected data, in that order.

Best
  Random Forest: 77/76/77/75; 76/78/76/74; 82/84/82/81
  Ada Boost: 77/77/77/77; 74/74/74/73; 81/81/81/81

Good
  Decision Tree: 69/71/69/70; 66/65/66/65; 68/67/68/67
  Support Vector Machine: 79/79/79/77; 73/71/73/71; 71/74/71/71
  Gaussian Naïve Bayes: 65/67/65/66; 73/75/73/74; 75/75/75/75
  Multinomial Naïve Bayes: 60/61/60/60; 62/63/62/63; 61/61/61/61
  K Nearest Neighbors: 66/63/66/64; 66/63/66/64; 51/54/51/52

Poor
  Stochastic Gradient Descent: 70/49/70/58; 70/49/70/58; 46/72/46/38
  One Vs Rest: 70/49/70/58; 70/49/70/58; 65/78/65/53

Figure 6. ROC curve.

Using PyDAAL

The results of classification using PyDAAL algorithms are given in Table 7.

Table 7. Experiment Results of PyDAAL.

Values are reported as Accuracy/Precision/Recall/F1-score (%).

Best
  Decision Tree: 74/72/74/72
  Support Vector Machine: 65/79/71/75

Good
  Multinomial Naïve Bayes: 62/64/62/63

Poor
  Ada Boost: 70/49/70/58

Observations:

  1. The Random Forest and Ada Boost classifiers in scikit-learn with Intel DAAL are identified as the best classifiers for the given problem.
  2. There was only a slight improvement in classifier performance when the irrelevant features were removed, but there was a significant improvement in run time.
  3. Balancing the dataset using the SMOTE algorithm helped produce better results for scikit-learn, whereas there was no improvement for PyDAAL.
  4. PyDAAL is versatile enough to accommodate data of different memory layouts. Popular libraries used in the data analysis process, such as NumPy, can easily be interfaced with Intel DAAL to create numeric tables.
  5. PyDAAL supports only a few machine-learning algorithms, and its highest score is obtained with decision trees.

Conclusion

The scikit-learn algorithms shipped within Intel® Distribution for Python* take advantage of PyDAAL, bringing scikit-learn performance to new levels. scikit-learn is also a richer library in terms of machine-learning algorithms when compared to PyDAAL. Using Intel optimized performance libraries on an Intel® Xeon® Gold 6128 processor helped the machine-learning application make predictions faster. Higher accuracy is obtained with the optimized scikit-learn algorithms than with the PyDAAL algorithms. PyDAAL mainly operates on numeric table data, which allows a reduced memory footprint and efficient processing.

About the Authors

Nikhila Haridas is a Technical Consulting Engineer working with the Intel® AI Academy Program.

References

  1. Intel Distribution for Python
  2. Intel Optimized Packages for the Intel® Distribution for Python*
  3. PyDAAL
  4. German credit dataset
  5. DAAL: Data structures
  6. DAAL programming guide
