Earlier in the Gentle Introduction series (Volume 1 and Volume 2), we covered the fundamentals of the Intel® Data Analytics Acceleration Library (Intel® DAAL) custom data structures and the basic operations that can be performed on them. Volume 3 focuses on the algorithm component of Intel® DAAL, where the data management component is leveraged to drive analysis and build machine learning models.
Intel® DAAL has classes available to construct a wide range of popular machine learning algorithms for analytics model building, including classification, regression, recommender systems, and neural networks. Training and prediction are separated into two stages in Intel® DAAL model building. This separation allows the user to store and transfer only what is needed for prediction when it comes time for model deployment. A typical machine learning workflow involves:
- A training stage, which identifies patterns in the input data that map the behavior of the data features to a target variable.
- A prediction stage, which employs the trained model on a new data set.
Additionally, Intel® DAAL contains on-board model scoring, in the form of separate classes that evaluate trained model performance and compute standard quality metrics. Different sets of quality metrics are reported based on the type of analytics model built.
To accelerate the model building process on large data sets, Intel® DAAL provides a distributed parallel processing mode, including a programming model that makes it easy for users to implement a master-slave approach. mpi4py can easily be interfaced with PyDAAL, as Intel® DAAL's serialization and deserialization classes enable data exchange between nodes during parallel computation.
Volumes in Gentle Introduction Series
- Vol 1: Data Structures - Covers an introduction to the Data Management component of Intel® DAAL and its custom data structures (Numeric Table and Data Dictionary), with code examples
- Vol 2: Basic Operations on Numeric Tables - Covers an introduction to the operations that can be performed on Intel® DAAL's custom data structures (Numeric Table and Data Dictionary), with code examples
- Vol 3: Analytics Model Building and Deployment - Covers an introduction to analytics model building and evaluation in Intel® DAAL, with serialized deployment and distributed model fitting on large datasets
Analytics Modelling:
1. Batch Learning with PyDAAL
Intel DAAL includes classes that support the following stages in the analytics model building and deployment process:
1.1 Analytics Modelling Training and Prediction Workflow:
1.2 Build and Predict with PyDAAL Analytics Models:
As described earlier, Intel DAAL model building is separated into two stages, with two associated classes: "training" and "prediction".
The training stage usually involves complex computations on possibly very large datasets, calling for an extensive memory footprint. Keeping the two classes separate allows users to perform the training stage on a powerful machine and, optionally, the subsequent prediction stage on a simpler machine. It also lets the user store and transmit only the training stage results that are required for the prediction stage.
Four numeric tables are created at the beginning of the model building process, two in each stage (training and prediction), as listed below:
Stage | Numeric Tables | Description |
---|---|---|
Training | trainData | This includes the feature values/predictors |
Training | trainDependentVariables | This includes the target values (i.e., labels/responses) |
Prediction | testData | This includes the feature values/predictors of test data |
Prediction | testGroundTruth | This includes the target values (i.e., labels/responses) of test data |
Note: See Volume 2 for details on creating and working with numeric tables.
The following gives a high-level overview of the training and prediction stages of the analytics model building process:
Helper Functions: Linear Regression
The next section can be copy/pasted into a user's script or adapted to a specific use case. The helper function block provided below can be used directly to automate the training and prediction stages of DAAL's Linear Regression algorithm. The helper functions are followed by a full usage code example.
```python
'''
training() function
-------------------
Arguments: train data of type numeric table,
           train dependent values of type numeric table
Returns:   training results object
'''
def training(trainData, trainDependentVariables):
    from daal.algorithms.linear_regression import training

    algorithm = training.Batch()
    # Pass a training data set and dependent values to the algorithm
    algorithm.input.set(training.data, trainData)
    algorithm.input.set(training.dependentVariables, trainDependentVariables)
    trainingResult = algorithm.compute()
    return trainingResult

'''
prediction() function
---------------------
Arguments: training result object,
           test data of type numeric table
Returns:   predicted responses of type numeric table
'''
def prediction(trainingResult, testData):
    from daal.algorithms.linear_regression import prediction, training

    algorithm = prediction.Batch()
    # Pass a testing data set and the trained model to the algorithm
    algorithm.input.setTable(prediction.data, testData)
    algorithm.input.setModel(prediction.model, trainingResult.get(training.model))
    predictionResult = algorithm.compute()
    predictedResponses = predictionResult.get(prediction.prediction)
    return predictedResponses
```
To use: copy the complete block of helper functions and call the `training()` and `prediction()` methods.
Usage Example: Linear Regression
Below is a code example implementing the provided training and prediction helper functions:
```python
# Import required modules
from daal.data_management import HomogenNumericTable
import numpy as np
from utils import printNumericTable

seeded = np.random.RandomState(42)

# Set up train and test numeric tables
trainData = HomogenNumericTable(seeded.rand(200, 10))
trainDependentVariables = HomogenNumericTable(seeded.rand(200, 2))
testData = HomogenNumericTable(seeded.rand(50, 10))
testGroundTruth = HomogenNumericTable(seeded.rand(50, 2))

# --------------
# Training Stage
# --------------
trainingResult = training(trainData, trainDependentVariables)

# ----------------
# Prediction Stage
# ----------------
predictionResult = prediction(trainingResult, testData)

# Print and compare results
printNumericTable(predictionResult, "Linear Regression prediction results: (first 10 rows):", 10)
printNumericTable(testGroundTruth, "Ground truth (first 10 rows):", 10)
```
Notes: Helper function classes have been created using Intel DAAL's low-level API to perform the various stages of the model building and deployment process for popular algorithms. These classes contain the training and prediction stages as methods and are available in daaltces's GitHub repository. They require only input arguments to be passed in each stage, as shown in the usage example.
1.3 Trained Model Evaluation and Quality Metrics:
Intel DAAL offers quality metrics classes for binary classifiers, multi-class classifiers, and regression algorithms to measure the quality of a trained model. Different sets of standard metrics are computed for the different types of analytics models.
Binary Classification:
Accuracy, Precision, Recall, F1-score, Specificity, AUC
See the Intel DAAL Developer Guide for more details on notations and definitions.
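For intuition, the binary metrics listed above (except AUC, which requires ranking scores) reduce to a few lines of NumPy. The sketch below uses the standard textbook definitions on hypothetical confusion-matrix counts; it is illustrative only and does not use DAAL's quality metric classes.

```python
import numpy as np

# Hypothetical binary confusion-matrix counts (assumed values for illustration)
tp, fp, fn, tn = 40., 5., 10., 45.

accuracy    = (tp + tn) / (tp + fp + fn + tn)   # fraction of correct predictions
precision   = tp / (tp + fp)                    # correctness of positive predictions
recall      = tp / (tp + fn)                    # coverage of actual positives (sensitivity)
f1_score    = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp)                    # coverage of actual negatives

print(accuracy, precision, recall, f1_score, specificity)
```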
Multi-class Classification:
Average accuracy, Error rate, Micro precision (Precision_μ), Micro recall (Recall_μ), Micro F-score (F-score_μ), Macro precision (Precision_M), Macro recall (Recall_M), Macro F-score (F-score_M)
See the Intel DAAL Developer Guide for more details on notations and definitions.
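The micro/macro distinction is easy to see in code: micro averaging pools true/false positive counts across classes before computing a ratio, while macro averaging computes the metric per class and then averages. Below is a minimal NumPy sketch on a hypothetical confusion matrix, again independent of DAAL's quality metric classes.

```python
import numpy as np

# Hypothetical 3-class confusion matrix: C[i, j] = count of true class i predicted as class j
C = np.array([[50, 3, 2],
              [4, 45, 6],
              [1, 5, 44]])

tp = np.diag(C).astype(float)   # true positives per class
fp = C.sum(axis=0) - tp         # false positives per class (column totals minus diagonal)
fn = C.sum(axis=1) - tp         # false negatives per class (row totals minus diagonal)

# Micro averaging: pool counts over all classes, then compute the ratio
precision_micro = tp.sum() / (tp.sum() + fp.sum())
recall_micro = tp.sum() / (tp.sum() + fn.sum())
f_score_micro = 2 * precision_micro * recall_micro / (precision_micro + recall_micro)

# Macro averaging: compute the per-class metric first, then average
precision_macro = np.mean(tp / (tp + fp))
recall_macro = np.mean(tp / (tp + fn))

print(precision_micro, recall_micro, f_score_micro, precision_macro, recall_macro)
```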
Regression:
For regression models, Intel DAAL computes metrics in two groups:
- Single Beta: computes metrics based on the individual beta coefficients of the trained model:
RMSE, Vector of variances, Variance-covariance matrices, Z-score statistics
- Group Beta: computes metrics based on the group of beta coefficients of the trained model:
Mean and variance of expected responses, Regression Sum of Squares, Sum of Squares of Residuals, Total Sum of Squares, Coefficient of Determination, F-statistic
See the Intel DAAL Developer Guide for more details on notations and definitions.
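For reference, the group-beta sums of squares and the coefficient of determination follow the standard definitions. The sketch below computes them with NumPy on hypothetical response arrays; it is illustrative only and does not use DAAL's quality metric classes.

```python
import numpy as np

# Hypothetical observed and predicted responses (assumed values for illustration)
y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.5])
y_pred = np.array([2.8, 5.3, 2.7, 6.6, 4.9])

ss_residual = np.sum((y_true - y_pred) ** 2)             # Sum of Squares of Residuals
ss_total = np.sum((y_true - y_true.mean()) ** 2)         # Total Sum of Squares
ss_regression = np.sum((y_pred - y_true.mean()) ** 2)    # Regression Sum of Squares
r_squared = 1 - ss_residual / ss_total                   # Coefficient of Determination
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))          # RMSE (from the single-beta list)

print(ss_regression, ss_residual, ss_total, r_squared, rmse)
```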
Notes: Helper function classes have been created using Intel DAAL's low-level API to compute quality metrics for popular algorithms. These classes expose the quality metrics as methods and are available in daaltces's GitHub repository. They require only input arguments to be passed in each stage, as shown in the usage example.
1.4 Trained Model Storage and Portability:
Trained models can be serialized into byte-type numpy arrays and deserialized using Intel DAAL’s data archive classes to:
- Support data transmission between devices.
- Save to disk and restore at a later date, either to predict responses for incoming observations or to re-train the model with a set of new observations.
Optionally, to reduce network traffic and memory footprint, serialized models can be further compressed, and later decompressed during the deserialization step.
Steps to attain model portability in Intel DAAL:
1. Serialization:
   a. Serialize the training stage results (trainingResult) into Intel DAAL's InputDataArchive object
   b. Create an empty byte-type numpy array (bufferArray) of the size of the InputDataArchive object
   c. Populate bufferArray with the InputDataArchive contents
   d. Compress bufferArray (optional)
   e. Save bufferArray to disk as a .npy file (optional)
2. Deserialization:
   a. Load the .npy file from disk into a numpy array (if serialization step 1e was performed)
   b. Decompress the numpy array into bufferArray (if serialization step 1d was performed)
   c. Create Intel DAAL's OutputDataArchive object from the bufferArray contents
   d. Create an empty training stage results object (trainingResult)
   e. Deserialize the OutputDataArchive contents into trainingResult
Note: As mentioned in deserialization step 2d, an empty training results object of the original type is required by Intel DAAL's data archive methods as the target for deserializing the serialized training results.
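To make the mapping to these steps concrete, here is a minimal sketch of serialization steps 1a-1c and deserialization steps 2c-2e, with no compression or disk I/O. It assumes `trainingResult` was computed with the linear regression example from section 1.2, and that an empty `daal.algorithms.linear_regression.training.Result` object can be constructed directly (the same class name the helper functions below reconstruct via `eval`). The complete helper functions follow.

```python
import numpy as np
from daal.data_management import InputDataArchive, OutputDataArchive
from daal.algorithms.linear_regression import training

# --- Serialization ---
dataArch = InputDataArchive()
trainingResult.serialize(dataArch)                                   # 1a: serialize into archive
bufferArray = np.zeros(dataArch.getSizeOfArchive(), dtype=np.ubyte)  # 1b: empty byte-type array
dataArch.copyArchiveToArray(bufferArray)                             # 1c: populate bufferArray

# --- Deserialization ---
outArch = OutputDataArchive(bufferArray)   # 2c: output archive from bufferArray contents
restoredResult = training.Result()         # 2d: empty training results object (assumed constructible)
restoredResult.deserialize(outArch)        # 2e: deserialize archive contents into it
```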
Helper Functions: Linear Regression
The next section can be copy/pasted into a user's script or adapted to a specific use case. The helper function block provided below can be used directly to automate model storage and portability for DAAL's Linear Regression algorithm. The helper functions are followed by a full usage code example.
```python
import numpy as np
import warnings
from daal.data_management import (HomogenNumericTable, InputDataArchive, OutputDataArchive,
                                  Compressor_Zlib, Decompressor_Zlib, level9,
                                  DecompressionStream, CompressionStream)

'''
Arguments: serialized numpy array
Returns:   compressed numpy array
'''
def compress(arrayData):
    compressor = Compressor_Zlib()
    compressor.parameter.gzHeader = True
    compressor.parameter.level = level9
    comprStream = CompressionStream(compressor)
    comprStream.push_back(arrayData)
    compressedData = np.empty(comprStream.getCompressedDataSize(), dtype=np.uint8)
    comprStream.copyCompressedArray(compressedData)
    return compressedData

'''
Arguments: compressed numpy array
Returns:   decompressed numpy array
'''
def decompress(arrayData):
    decompressor = Decompressor_Zlib()
    decompressor.parameter.gzHeader = True
    # Create a stream for decompression
    deComprStream = DecompressionStream(decompressor)
    # Write the compressed data to the decompression stream and decompress it
    deComprStream.push_back(arrayData)
    # Allocate memory to store the decompressed data
    bufferArray = np.empty(deComprStream.getDecompressedDataSize(), dtype=np.uint8)
    # Store the decompressed data
    deComprStream.copyDecompressedArray(bufferArray)
    return bufferArray

# -------------------
# ***Serialization***
# -------------------
'''
Method 1:
    Arguments: data (type nT/model)
    Returns dictionary with serialized array (type object) and object information (type string)
Method 2:
    Arguments: data (type nT/model), fileName (.npy file to save serialized array to disk)
    Saves serialized numpy array as "fileName" argument
    Saves object information as "fileName.txt"
Method 3:
    Arguments: data (type nT/model), useCompression = True
    Returns dictionary with compressed array (type object) and object information (type string)
Method 4:
    Arguments: data (type nT/model), fileName (.npy file to save serialized array to disk), useCompression = True
    Saves compressed numpy array as "fileName" argument
    Saves object information as "fileName.txt"
'''
def serialize(data, fileName=None, useCompression=False):
    buffArrObjName = (str(type(data)).split()[1].split('>')[0] + "()").replace("'", '')
    dataArch = InputDataArchive()
    data.serialize(dataArch)
    length = dataArch.getSizeOfArchive()
    bufferArray = np.zeros(length, dtype=np.ubyte)
    dataArch.copyArchiveToArray(bufferArray)
    if useCompression == True:
        if fileName != None:
            if len(fileName.rsplit(".", 1)) == 2:
                fileName = fileName.rsplit(".", 1)[0]
            compressedData = compress(bufferArray)
            np.save(fileName, compressedData)
        else:
            comBufferArray = compress(bufferArray)
            serialObjectDict = {"Array Object": comBufferArray,
                                "Object Information": buffArrObjName}
            return serialObjectDict
    else:
        if fileName != None:
            if len(fileName.rsplit(".", 1)) == 2:
                fileName = fileName.rsplit(".", 1)[0]
            np.save(fileName, bufferArray)
        else:
            serialObjectDict = {"Array Object": bufferArray,
                                "Object Information": buffArrObjName}
            return serialObjectDict
    # Only reached when fileName was provided: save object information alongside the array
    infoFile = open(fileName + ".txt", "w")
    infoFile.write(buffArrObjName)
    infoFile.close()

# ---------------------
# ***Deserialization***
# ---------------------
'''
Returns deserialized/decompressed numeric table/model.
Input can be a serialized/compressed numpy array or a serialized/compressed .npy file saved to disk
'''
def deserialize(serialObjectDict=None, fileName=None, useCompression=False):
    import daal
    if fileName != None and serialObjectDict == None:
        bufferArray = np.load(fileName)
        buffArrObjName = open(fileName.rsplit(".", 1)[0] + ".txt", "r").read()
    elif fileName == None and any(serialObjectDict):
        bufferArray = serialObjectDict["Array Object"]
        buffArrObjName = serialObjectDict["Object Information"]
    else:
        warnings.warn('Expecting "bufferArray" or "fileName" argument, NOT both')
        raise SystemExit
    if useCompression == True:
        bufferArray = decompress(bufferArray)
    dataArch = OutputDataArchive(bufferArray)
    try:
        deSerialObj = eval(buffArrObjName)
    except AttributeError:
        deSerialObj = HomogenNumericTable()
    deSerialObj.deserialize(dataArch)
    return deSerialObj
```
To use: copy the complete block of helper functions and call the `serialize()` and `deserialize()` methods.
Usage Example: Linear Regression

The example below implements the `serialize()` and `deserialize()` functions on the Linear Regression `trainingResult`. (Refer to the Linear Regression usage example in section 1.2, Build and Predict with PyDAAL Analytics Models, to compute `trainingResult`.)
```python
# Run Usage Example: Linear Regression from section 1.2 first to compute trainingResult

# Serialize
serialTrainingResultArray = serialize(trainingResult)

# Deserialize
deserialTrainingResult = deserialize(serialTrainingResultArray)

# Predict
predictionResult = prediction(deserialTrainingResult, testData)

# Print and compare results
printNumericTable(predictionResult, "Linear Regression deserialized prediction results: (first 10 rows):", 10)
printNumericTable(testGroundTruth, "Ground truth (first 10 rows):", 10)
```
The examples below implement other combinations of the `serialize()` and `deserialize()` methods with different input arguments:
```python
# Compress and serialize
serialTrainingResultArray = serialize(trainingResult, useCompression=True)

# Decompress and deserialize
deserialTrainingResult = deserialize(serialTrainingResultArray, useCompression=True)

# Serialize and save to disk as a numpy array
serialize(trainingResult, fileName="trainingResult")

# Deserialize the file from disk
deserialTrainingResult = deserialize(fileName="trainingResult.npy")
```
Notes: Helper function classes have been created using Intel DAAL's low-level API to perform the various stages of the model building and deployment process for popular algorithms. These classes contain the model storage and portability stages as methods and are available in daaltces's GitHub repository. They require only input arguments to be passed in each stage, as shown in the usage example.
2. Distributed Learning with PyDAAL and MPI:
PyDAAL and mpi4py can be used to easily distribute model training for many of DAAL's algorithm implementations using the Single Program Multiple Data (SPMD) technique. Other Python machine learning libraries allow for trivial parallelization of parameter-tuning grid searches, mainly because grid search is an "embarrassingly parallel" process. What sets Intel DAAL apart is that it includes IA-optimized distributed versions of many of its model training classes, so training of a single model can be accelerated with syntax similar to batch learning. For these implementations, the DAAL engineering team provides a slave method to compute partial training results on row-grouped chunks of data, and a master method to reduce the partial results into a final model result.
Serialization and Message Passing:
Messages passed with mpi4py are passed as serialized objects; under the hood, mpi4py uses the popular Python object serialization library pickle during this process. PyDAAL uses SWIG (Simplified Wrapper and Interface Generator) as its wrapper interface, and unfortunately SWIG objects are not compatible with pickle. Fortunately, DAAL has built-in serialization and deserialization functionality; see the Trained Model Storage and Portability section for details. The code below demonstrates the master and slave methods for the distributed version of PyDAAL's covariance algorithm.
Note: The `serialize` and `deserialize` helper functions are provided in the Trained Model Storage and Portability section of this volume.
The next section can be copy/pasted into a user's script or adapted to a specific use case. The helper function block provided below can be used to carry out distributed computation of the covariance matrix, but can be adapted for fitting other types of models. See the Computation Modes section in the developer's documentation for more details on distributed model fitting. The helper functions are followed by a full usage code example.
Helper Functions: Covariance Matrix
```python
'''
Defined Slave and Master Routines as Python Functions
'''
from CustomUtils import getBlockOfNumericTable, serialize, deserialize
from daal.data_management import HomogenNumericTable
from daal.algorithms.covariance import (
    Distributed_Step1LocalFloat64DefaultDense, data, partialResults,
    Distributed_Step2MasterFloat64DefaultDense
)

# Define slave compute routine
'''
Returns serialized partial model result. Input is a serialized partial numeric table
'''
def computestep1Local(serialnT):
    # Deserialize using helper function
    partialnT = deserialize(serialnT)
    # Create partial model object
    model = Distributed_Step1LocalFloat64DefaultDense()
    # Set input data for the model
    model.input.set(data, partialnT)
    # Get the computed partial estimate result
    partialResult = model.compute()
    # Serialize using helper function
    serialpartialResult = serialize(partialResult)
    return serialpartialResult

# Define master compute routine
'''
Imports global variable finalResult. Computes the master version of the model
and sets the full model result into finalResult.
Inputs are an array of serialized partial results and the MPI world size
'''
def computeOnMasterNode(serialPartialResult, size):
    global finalResult
    # Create master model object
    model = Distributed_Step2MasterFloat64DefaultDense()
    # Add partial results to the distributed master model
    for i in range(size):
        # Deserialize using helper function
        partialResult = deserialize(serialPartialResult[i])
        # Set input objects for the model
        model.input.add(partialResults, partialResult)
    # Recompute a partial estimate after combining partial results
    model.compute()
    # Finalize the result in the distributed processing mode
    finalResult = model.finalizeCompute()
```
Usage Example: Covariance Matrix
The example below uses the complete block of helper functions given above and implements the `computestep1Local()` and `computeOnMasterNode()` functions with mpi4py to construct a covariance matrix.
```python
from mpi4py import MPI
import numpy as np
from CustomUtils import getBlockOfNumericTable, serialize, deserialize
from daal.data_management import HomogenNumericTable

'''
Begin MPI Initialization and Run Options
'''
# Get MPI vars
size = MPI.COMM_WORLD.Get_size()
rank = MPI.COMM_WORLD.Get_rank()
name = MPI.Get_processor_name()

# Initialize result vars to fill
serialPartialResults = [0] * size
finalResult = None

'''
Begin Data Set Creation
'''
# Create random array for demonstration
# numRows, numCols defined by user; the example values below can be used
numRows, numCols = 1000, 100
seeded = np.random.RandomState(42)
fullArray = seeded.rand(numRows, numCols)

# Build seeded random data matrix, and slice into chunks
# rowStart and rowEnd determined by the size of the chunks
sizeOfChunk = int(numRows / size)
rowStart = rank * sizeOfChunk
rowEnd = (rank + 1) * sizeOfChunk  # slice end is exclusive
array = fullArray[rowStart:rowEnd, :]
partialnT = HomogenNumericTable(array)
serialnT = serialize(partialnT)

'''
Begin Distributed Execution
'''
if rank == 0:
    serialPartialResults[rank] = computestep1Local(serialnT)
    if size > 1:
        # Receive slave partial results on master
        # (unpacking overwrites rank with the sender's rank, so each result
        # lands in the sender's slot of serialPartialResults)
        for i in range(1, size):
            rank, size, name, serialPartialResults[rank] = \
                MPI.COMM_WORLD.recv(source=MPI.ANY_SOURCE, tag=1)
    computeOnMasterNode(serialPartialResults, size)
else:
    serialPartialResult = computestep1Local(serialnT)
    MPI.COMM_WORLD.send((rank, size, name, serialPartialResult), dest=0, tag=1)
```
Linux shell commands to run the covariance matrix usage example:

```shell
# Source and activate Intel Distribution of Python (idp) env
source ../anaconda3/bin/activate
source activate idp

# Optionally set MPI environment variable to shared memory mode
export I_MPI_SHM_LMT=shm

# cd to the script directory, and call the Python interpreter inside the mpirun command
cd ../script_directory
mpirun -n <number_of_processes> python script.py
```
Conclusion
Previous volumes (Volume 1 and Volume 2) demonstrated Intel® Data Analytics Acceleration Library's (Intel® DAAL) Numeric Table data structure and basic operations on Numeric Tables. Volume 3 discussed Intel® DAAL's algorithm component and walked through the stages of analytics model building with both batch and distributed processing. It also demonstrated how to achieve model portability (serialization) and perform model evaluation (quality metrics). Furthermore, this volume used Intel® DAAL classes to provide helper functions that deliver a standalone solution for the model fitting and deployment process.