A wide range of classes is available in the Intel® Data Analytics Acceleration Library (Intel® DAAL) to create numeric tables that accommodate various data layouts, dtypes, and frequent access patterns. Volume 1 of this series covers numeric table creation under different scenarios. Once a numeric table is created, Intel® DAAL provides operational methods for visualizing and mutating it. This volume (Volume 2) covers the usage of those operational methods. Volume 3 then gives a brief introduction to the Algorithm section of PyDAAL. Table 1 can be used as a quick reference for basic operations on Intel® DAAL’s numeric tables.
Volumes in Gentle Introduction Series
- Vol 1: Data Structures - Covers introduction to the Data Management component of Intel® DAAL and the available custom Data Structures (Numeric Table and Data Dictionary) with code examples.
- Vol 2: Basic Operations on Numeric Tables - Covers introduction to possible operations that can be performed on Intel® DAAL's custom Data Structure (Numeric Table and Data Dictionary) with code examples.
- Vol 3: Analytics Model Building and Deployment - Covers introduction to analytics model building and evaluation in Intel® DAAL with serialized deployment and distributed model fitting on large datasets.
Table 1. Quick reference table on available methods
Method Description | Usage Syntax |
---|---|
*Print numeric table as stored in memory to represent data layout. Method requires ‘nT’ as input argument | printNumericTable(nT) |
*Quick visualization on multiple numeric tables | printNumericTables(nT1,nT2) |
Check shape of numeric table | nT.getNumberOfRows() (number of rows); nT.getNumberOfColumns() (number of columns) |
Allocate buffer to load block of numeric table for access and manipulation operations. | block = BlockDescriptor_Float64() |
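Retrieve block of rows or column values from numeric table into Block Descriptor for visualization. (Setting rwflag to ‘readOnly’ enables only read access to the buffer.) | nT.getBlockOfRows(firstRow, nRows, readOnly, block); nT.getBlockOfColumnValues(colIndex, firstRow, nRows, readOnly, block) |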
Extract numpy array from Block Descriptor object when loaded with block of values | block.getArray() |
Release block of Rows from buffer | nT.releaseBlockOfRows(block) |
*Print underlying array of numeric table. Method requires ‘np.array’ as input argument. | printArray(block.getArray() , num_printed_cols, num_printed_rows, num_cols, message) |
Check FeatureTypes on each column of numeric table data dictionary | dict[colIndex].featureType |
* denotes functions included in the ‘utils’ folder, which can be found in <install_root>/share/pydaal_examples/examples/python/source/.
Different phases of Numeric Table life cycle
1. Initiate
Let’s begin by constructing a numeric table (nT) directly from a Numpy array. We will use the nT throughout the rest of the code examples in this volume.
    import numpy as np
    from daal.data_management import HomogenNumericTable

    array = np.array([[1, 2, 3, 4],
                      [5, 6, 7, 8]])
    nT = HomogenNumericTable(array)
2. Operate
Once initialized, numeric tables provide various classes and member functions to access and manipulate data similar to a pandas DataFrame. We will dive next into Intel DAAL’s operational methods, after an important note about Intel DAAL’s bookkeeping object called Data Dictionary.
Data Dictionary:
As mentioned in Vol 1 of this series on the creation of Intel DAAL’s numeric tables, these custom data structures must be accompanied by a Data Dictionary to perform operations. When raw data streams into memory to populate the numeric table structure, the table’s Data Dictionary concurrently records metadata. The dictionary is created automatically unless the user specifies that it should not be allocated. Various Data Dictionary methods are available to access and manipulate feature types, dtypes, etc. If a user creates a numeric table without memory allocation, the Data Dictionary values have to be set explicitly with feature types. Note that Intel DAAL’s Data Dictionary is a custom data structure, not a Python dictionary.
More details on working with Intel DAAL Data Dictionaries
2.1 Data Mutation in Numeric Table:
2.1.1 Standardization and Normalization:
Data analysis work is usually preceded by a Data Preprocessing stage that includes data wrangling, quality checks, and assurance to handle null values, outliers, etc. An important preprocessing activity is normalizing input data. Intel DAAL offers routines to support two popular normalization techniques on numeric tables: Z-score standardization and Min-Max normalization.
Currently, Intel DAAL only supports rescaling for descriptive analytics. In the future, support will be added for predictive analytics with the addition of a “transform()” method to be applied to new data.
Z-score Standardization: Rescales numeric table values feature-wise to the number of standard deviations each value deviates from the mean. Below are the steps to use Intel DAAL’s z-score standardization.
    import daal.algorithms.normalization.zscore as zscore

    # Create an algorithm
    algorithm = zscore.Batch(method=zscore.sumDense)
    # Set input object for the algorithm to nT
    algorithm.input.set(zscore.data, nT)
    # Compute Z-score normalization
    res = algorithm.compute()
    # Retrieve normalized nT
    Norm_nT = res.get(zscore.normalizedData)
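To make the rescaling concrete, the sketch below reproduces the z-score arithmetic on the same 2x4 data in plain Python, without Intel DAAL. It assumes the sample standard deviation (n - 1 denominator); which form the library uses is not stated in this volume, so treat the exact values as illustrative. The function name `zscore_columns` is hypothetical.

```python
import statistics

def zscore_columns(rows):
    """Rescale each column to (x - mean) / stdev.

    Assumption of this sketch: stdev is the sample standard deviation (n - 1).
    """
    columns = list(zip(*rows))                       # one tuple per feature
    means = [statistics.mean(c) for c in columns]
    stdevs = [statistics.stdev(c) for c in columns]  # sample standard deviation
    return [
        [(x - m) / s for x, m, s in zip(row, means, stdevs)]
        for row in rows
    ]

data = [[1, 2, 3, 4], [5, 6, 7, 8]]
scaled = zscore_columns(data)
for row in scaled:
    print(" ".join("%.3f" % v for v in row))
```

After rescaling, every column has mean 0, which is a quick sanity check to apply to `Norm_nT` as well.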
Min-Max Normalization: Linearly rescales numeric table values feature-wise to fit a target range such as [0, 1] or [-1, 1]. Below are the steps to use Intel DAAL’s Min-Max normalization.
    import daal.algorithms.normalization.minmax as minmax

    # Create an algorithm
    algorithm = minmax.Batch(method=minmax.defaultDense)
    # Set lower and upper bounds for the algorithm
    algorithm.parameter.lowerBound = -1.0
    algorithm.parameter.upperBound = 1.0
    # Set input object for the algorithm to nT
    algorithm.input.set(minmax.data, nT)
    # Compute Min-Max normalization
    res = algorithm.compute()
    # Get normalized numeric table
    Norm_nT = res.get(minmax.normalizedData)
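As a worked illustration of the linear mapping, the plain-Python sketch below applies x' = lower + (x - min) * (upper - lower) / (max - min) to each column of the same data. It does not call Intel DAAL; `minmax_columns` is a hypothetical name, and the handling of constant columns is an arbitrary choice made for this sketch.

```python
def minmax_columns(rows, lower=0.0, upper=1.0):
    """Linearly map each column's [min, max] onto [lower, upper]."""
    columns = list(zip(*rows))
    mins = [min(c) for c in columns]
    maxs = [max(c) for c in columns]
    scaled = []
    for row in rows:
        new_row = []
        for x, lo, hi in zip(row, mins, maxs):
            if hi == lo:
                # Constant column: nothing to rescale. Returning the lower
                # bound is a convention of this sketch only.
                new_row.append(lower)
            else:
                new_row.append(lower + (x - lo) * (upper - lower) / (hi - lo))
        scaled.append(new_row)
    return scaled

data = [[1, 2, 3, 4], [5, 6, 7, 8]]
scaled = minmax_columns(data, lower=-1.0, upper=1.0)
print(scaled)
```

With only two rows, each column's smaller value maps to the lower bound and the larger value to the upper bound.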
2.1.2 Block Descriptor for Visualization and Mutation:
The contents of a numeric table cannot be accessed directly for visualization or manipulation. Instead, a user must first move the requested block of data values into a memory buffer, which is housed in an object called BlockDescriptor. An Intel DAAL numeric table object has member functions that retrieve blocks of rows or columns and place them in the BlockDescriptor. The rwflag argument sets “readOnly” or “readWrite” mode, depending on whether the user intends to update values in the numeric table when releasing the block. Conveniently, the BlockDescriptor class allows block retrieval of data in specific rows and/or columns. Note: the dtype of the data in the BlockDescriptor buffer is not required to match that of the numeric table that sourced the block.
Access Modes:
The “readOnly” argument sets rwflag to provide read-only access to the numeric table contents, so no updates are made to the table when the block is released from buffer memory.
Syntax and Usage:
    from daal.data_management import BlockDescriptor_Float64, readOnly

    # Allocate a readOnly memory block with double dtype
    block = BlockDescriptor_Float64()
    nT.getBlockOfRows(0, 1, readOnly, block)
The “readWrite” argument sets rwflag to write any changes from the block descriptor object back to the numeric table when the block is released from buffer memory, which enables numeric table mutation through the block descriptor.
Syntax and Usage:
    from daal.data_management import BlockDescriptor_Float64, readWrite

    # Allocate a readWrite memory block with double dtype
    block = BlockDescriptor_Float64()
    nT.getBlockOfRows(0, 1, readWrite, block)
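The difference between the two modes can be pictured with a toy model in plain Python. This sketch only mirrors the semantics described above (changes reach the table on release, and only in readWrite mode); `ToyTable` and its methods are hypothetical and are not Intel DAAL classes, and the library's actual buffering may work differently.

```python
import copy

READ_ONLY, READ_WRITE = "readOnly", "readWrite"

class ToyTable:
    """Hypothetical stand-in for a numeric table; not an Intel DAAL class."""

    def __init__(self, rows):
        self._rows = copy.deepcopy(rows)

    def get_block_of_rows(self, first, count, mode):
        # Hand out a copy of the requested rows plus the bookkeeping
        # needed to write them back on release.
        return {"first": first, "mode": mode,
                "data": copy.deepcopy(self._rows[first:first + count])}

    def release_block_of_rows(self, block):
        # Only a readWrite block is copied back into the table.
        if block["mode"] == READ_WRITE:
            first = block["first"]
            data = copy.deepcopy(block["data"])
            self._rows[first:first + len(data)] = data

    def rows(self):
        return copy.deepcopy(self._rows)

table = ToyTable([[1, 2, 3, 4], [5, 6, 7, 8]])

ro = table.get_block_of_rows(0, 1, READ_ONLY)
ro["data"][0] = [0, 0, 0, 0]
table.release_block_of_rows(ro)
after_read_only = table.rows()    # edit discarded on release

rw = table.get_block_of_rows(0, 1, READ_WRITE)
rw["data"][0] = [0, 0, 0, 0]
table.release_block_of_rows(rw)
after_read_write = table.rows()   # edit written back on release

print(after_read_only)
print(after_read_write)
```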
2.1.3 BlockDescriptor() in “readWrite” mode:
When the rwflag argument is set to “readWrite” in getBlockOfRows() / getBlockOfColumnValues(), the contents of the BlockDescriptor object are written back to the numeric table when the block is released, making edits possible on existing rows/columns in the numeric table.
Let’s create a basic numeric table to explain BlockDescriptor in “readWrite” mode in detail.
    import numpy as np
    from daal.data_management import HomogenNumericTable, readWrite, BlockDescriptor
    from utils import printNumericTable

    array = np.array([[1, 2, 3, 4],
                      [5, 6, 7, 8]])
    nT = HomogenNumericTable(array)
Edit numeric table Row-wise:
    printNumericTable(nT, "Original nT: ")
    # Create buffer object with ntype "double"
    doubleBlock = BlockDescriptor(ntype=np.float64)
    firstRow = 0
    lastRow = nT.getNumberOfRows()
    # getBlockOfRows() in "readWrite" mode retrieves numeric table contents and populates "doubleBlock".
    # The second argument is the number of rows to fetch; it equals lastRow here because firstRow is 0.
    nT.getBlockOfRows(firstRow, lastRow, readWrite, doubleBlock)
    # Access array contents from "doubleBlock" object
    array = doubleBlock.getArray()
    # Mutate 1st row of array to reflect on "doubleBlock" object
    array[0] = [0, 0, 0, 0]
    # Release buffer object and write changes back to numeric table
    nT.releaseBlockOfRows(doubleBlock)
    printNumericTable(nT, "Updated nT: ")
nT was originally created with data [[1,2,3,4],[5,6,7,8]]. After row mutation the first row is now replaced using buffer memory. Updated nT has data [[0,0,0,0],[5,6,7,8]].
Edit numeric table Column-wise:
    printNumericTable(nT, "Original nT: ")
    # Create buffer object with ntype "intc" (the buffer dtype need not match the table dtype)
    intBlock = BlockDescriptor(ntype=np.intc)
    ColIndex = 2
    firstRow = 0
    lastRow = nT.getNumberOfRows()
    # getBlockOfColumnValues() in "readWrite" mode retrieves the ColIndex column and populates "intBlock"
    nT.getBlockOfColumnValues(ColIndex, firstRow, lastRow, readWrite, intBlock)
    # Access array contents from "intBlock" object
    array = intBlock.getArray()
    # Mutate array to reflect on "intBlock" object
    array[:, :] = 0
    # Release buffer object and write changes back to numeric table
    nT.releaseBlockOfColumnValues(intBlock)
    printNumericTable(nT, "Updated nT: ")
nT was originally created with data [[1,2,3,4],[5,6,7,8]]. After the column mutation, the third column is replaced with [0,0] using buffer memory. Updated nT has data [[1,2,0,4],[5,6,0,8]].
2.1.4 Merge numeric table:
Numeric tables can be appended along rows or columns, provided they share the same size along the axis being merged. Two classes are available to merge numeric tables: RowMergedNumericTable() appends tables row-wise, and MergedNumericTable() appends them column-wise.
Merge Row-wise:
Syntax:
    mnT = RowMergedNumericTable()
    mnT.addNumericTable(nT1); mnT.addNumericTable(nT2); mnT.addNumericTable(nT3)
Code Example:
    from daal.data_management import HomogenNumericTable, RowMergedNumericTable
    import numpy as np
    from utils import printNumericTable

    # nT1 and nT2 are 2 numeric tables having equal number of COLUMNS
    array = np.array([[1, 2, 3, 4], [5, 6, 7, 8]], dtype=np.intc)
    nT1 = HomogenNumericTable(array)
    array = np.array([[9, 10, 11, 12], [13, 14, 15, 16]], dtype=np.intc)
    nT2 = HomogenNumericTable(array)
    # Create merged numeric table object
    mnT = RowMergedNumericTable()
    # Add numeric tables to merged numeric table object
    mnT.addNumericTable(nT1); mnT.addNumericTable(nT2)
    printNumericTable(nT1, "Numeric Table nT1: ")
    printNumericTable(nT2, "Numeric Table nT2: ")
    printNumericTable(mnT, "Merged Numeric Table nT1 and nT2: ")
Output:
    1.000 2.000 3.000 4.000
    5.000 6.000 7.000 8.000
    9.000 10.000 11.000 12.000
    13.000 14.000 15.000 16.000
Merge Column-wise:
Syntax:
    mnT = MergedNumericTable()
    mnT.addNumericTable(nT1); mnT.addNumericTable(nT2); mnT.addNumericTable(nT3)
Code Example:
    from daal.data_management import HomogenNumericTable, MergedNumericTable
    import numpy as np
    from utils import printNumericTable

    # nT1 and nT2 are 2 numeric tables having equal number of ROWS
    array = np.array([[1, 2, 3, 4], [5, 6, 7, 8]], dtype=np.intc)
    nT1 = HomogenNumericTable(array)
    array = np.array([[9, 10, 11, 12], [13, 14, 15, 16]], dtype=np.intc)
    nT2 = HomogenNumericTable(array)
    # Create merged numeric table object
    mnT = MergedNumericTable()
    # Add numeric tables to merged numeric table object
    mnT.addNumericTable(nT1); mnT.addNumericTable(nT2)
    printNumericTable(nT1, "Numeric Table nT1: ")
    printNumericTable(nT2, "Numeric Table nT2: ")
    printNumericTable(mnT, "Merged Numeric Table nT1 & nT2: ")
Output:
    1.000 2.000 3.000 4.000 9.000 10.000 11.000 12.000
    5.000 6.000 7.000 8.000 13.000 14.000 15.000 16.000
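The shape requirements for the two merge directions can be checked with a small plain-Python sketch on nested lists. It does not use Intel DAAL; `merge_rows` and `merge_columns` are hypothetical names that only mimic the results shown in the two outputs above.

```python
def merge_rows(a, b):
    """Stack b below a; both tables need the same number of columns."""
    if len(a[0]) != len(b[0]):
        raise ValueError("row-wise merge needs equal column counts")
    return [list(r) for r in a] + [list(r) for r in b]

def merge_columns(a, b):
    """Place b to the right of a; both tables need the same number of rows."""
    if len(a) != len(b):
        raise ValueError("column-wise merge needs equal row counts")
    return [list(ra) + list(rb) for ra, rb in zip(a, b)]

t1 = [[1, 2, 3, 4], [5, 6, 7, 8]]
t2 = [[9, 10, 11, 12], [13, 14, 15, 16]]

by_rows = merge_rows(t1, t2)        # 4 rows x 4 columns
by_columns = merge_columns(t1, t2)  # 2 rows x 8 columns
print(by_rows)
print(by_columns)
```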
2.1.5 Split Numeric table:
Table 1 lists the getBlockOfRows() and getBlockOfColumnValues() methods, which extract sections of a numeric table by row or by column values. Additionally, the helper function getBlockOfNumericTable() provided below extracts a contiguous subset of the table for a selected range of rows and columns. getBlockOfNumericTable() accepts int or list keyword arguments for the row and column ranges, using conventional Python 0-based indexing.
Syntax and Usage: getBlockOfNumericTable(nT, Rows = ‘All’, Columns = ‘All’)
Helper Function:
    def getBlockOfNumericTable(nT, Rows='All', Columns='All'):
        import warnings
        from daal.data_management import HomogenNumericTable, HomogenNumericTable_Float64, \
            MergedNumericTable, readOnly, BlockDescriptor
        #------------------------------------------------------
        # Get first and last row indexes
        lastRow = nT.getNumberOfRows()
        if type(Rows) != str:
            if type(Rows) == list:
                firstRow = Rows[0]
                if len(Rows) == 2:
                    lastRow = min(Rows[1], lastRow)
            else:
                firstRow = 0
                lastRow = Rows
        elif Rows == 'All':
            firstRow = 0
        else:
            warnings.warn('Type error in "Rows" arguments, Can be only int/list type')
            raise SystemExit
        #------------------------------------------------------
        # Get first and last column indexes
        nEndDim = nT.getNumberOfColumns()
        if type(Columns) != str:
            if type(Columns) == list:
                nStartDim = Columns[0]
                if len(Columns) == 2:
                    nEndDim = min(Columns[1], nEndDim)
            else:
                nStartDim = 0
                nEndDim = Columns
        elif Columns == 'All':
            nStartDim = 0
        else:
            warnings.warn('Type error in "Columns" arguments, Can be only int/list type')
            raise SystemExit
        #------------------------------------------------------
        # Retrieve block of column values within first and last rows
        # Merge all the retrieved blocks of column values
        # Return merged numeric table
        mnT = MergedNumericTable()
        for idx in range(nStartDim, nEndDim):
            block = BlockDescriptor()
            nT.getBlockOfColumnValues(idx, firstRow, (lastRow - firstRow), readOnly, block)
            mnT.addNumericTable(HomogenNumericTable_Float64(block.getArray()))
            nT.releaseBlockOfColumnValues(block)
        block = BlockDescriptor()
        mnT.getBlockOfRows(0, mnT.getNumberOfRows(), readOnly, block)
        mnT = HomogenNumericTable(block.getArray())
        return mnT
There are 4 different ways of passing arguments to this function:
- getBlockOfNumericTable(nT) - Extracts a block of the numeric table having all rows and columns of nT.
- getBlockOfNumericTable(nT, Rows = 4, Columns = 5) - Retrieves the first 4 rows and first 5 column values of nT.
- getBlockOfNumericTable(nT, Rows=[2,4], Columns = [1,3]) - Slices the numeric table along the row and column directions using the lower and upper bounds passed in each list.
- getBlockOfNumericTable(nT, Rows=[1,], Columns = [1,]) - Extracts all rows and columns from the lower bound through the last index.
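How the four argument forms resolve to index ranges can be sketched in isolation. The function below is a hypothetical restatement of the helper's argument handling in plain Python, with the upper bound treated as exclusive, as in Python slicing; it is meant for checking expectations, not as part of Intel DAAL.

```python
def resolve_range(arg, total):
    """Return (first, last) for one axis, with last exclusive.

    arg is 'All', an int n (first n entries), or a list [first] / [first, last].
    total is the number of rows or columns in the table.
    """
    if isinstance(arg, str):
        if arg != "All":
            raise TypeError("expected 'All', an int, or a list")
        return 0, total
    if isinstance(arg, list):
        first = arg[0]
        # A two-element list gives an upper bound, clamped to the table size.
        last = min(arg[1], total) if len(arg) == 2 else total
        return first, last
    return 0, arg  # an int n selects the first n entries

total_rows = 10
print(resolve_range("All", total_rows))
print(resolve_range(4, total_rows))
print(resolve_range([2, 4], total_rows))
print(resolve_range([1], total_rows))
```

Under this reading, Rows=[2,4] selects the rows at indexes 2 and 3, and an upper bound larger than the table is clamped to the last row.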
2.1.6 Change feature type:
Numeric table objects have dictionary manipulation methods to get and set feature types in the Data Dictionary for each column. Categorical(0), Ordinal(1), and Continuous(2) are available feature types in Data Dictionary supported by Intel DAAL.
Get dictionary object associated with nT :
Syntax:
nT.getDictionary()
Code Example:
    # Import path for data_feature_utils is assumed; adjust to your PyDAAL version
    from daal.data_management import data_feature_utils

    dict = nT.getDictionary()  # nT is the numeric table created in section 1
    # 'dict' holds the data dictionary of numeric table nT. It can be used to update
    # metadata about the data. The most common use case is to modify the default
    # feature type of nT columns.
    # Print default feature type of 3rd feature (example feature is continuous):
    print(dict[2].featureType)  # outputs "2" (denotes Continuous feature type)
    # Modify feature type from Continuous to Categorical:
    dict[2].featureType = data_feature_utils.DAAL_CATEGORICAL
    print(dict[2].featureType)  # outputs "0" (denotes Categorical feature type)
Set dictionary object associated with nT:
This is the method used to replace current Data Dictionary values or to create new Data Dictionaries, if needed. Also, for batch updates, an existing Data Dictionary can be overwritten in full using the setDictionary() method.
When tables are created without allocating memory for the Data Dictionary, the setDictionary() method must be used to construct metadata for the features in the table. Let us again consider nT created in section 1, which has 4 features.
Syntax:
nT.setDictionary()
Code Example:
    # Import path is assumed; adjust to your PyDAAL version
    from daal.data_management import NumericTableDictionary, data_feature_utils

    # Release any block previously retrieved from nT
    nT.releaseBlockOfRows(block)
    # Create a dictionary object using the numeric table dictionary class with the number of features
    nFeatures = nT.getNumberOfColumns()  # 4 features in nT
    dict = NumericTableDictionary(nFeatures)
    # Assign a feature type to each feature
    dict[0].featureType = data_feature_utils.DAAL_CONTINUOUS
    dict[1].featureType = data_feature_utils.DAAL_CATEGORICAL
    dict[2].featureType = data_feature_utils.DAAL_CONTINUOUS
    dict[3].featureType = data_feature_utils.DAAL_CATEGORICAL
    # Set the nT numeric table dictionary with "dict"
    nT.setDictionary(dict)
2.2 Export Numeric Table to disk:
Numeric tables can be exported and saved to disk as a numpy binary (.npy) file. The following two sections contain helper functions for saving in binary form and for compressing the data on disk.
2.2.1 Serialization
Intel DAAL provides interfaces to serialize numeric table objects into a data archive that can be converted to a numpy array object. The resulting Numpy array, which houses the serialized form of the data, can be saved to disk and subsequently reloaded in the future to reconstruct the source numeric table.
To automate this process, the following helper function can be used to serialize and save to disk.
Helper Function:
    def Serialize(nT):
        # Construct input data archive object
        # Serialize nT contents into data archive object
        # Copy data archive contents to numpy array
        from daal.data_management import InputDataArchive
        import numpy as np
        dataArch = InputDataArchive()
        nT.serialize(dataArch)
        length = dataArch.getSizeOfArchive()
        buffer_array = np.zeros(length, dtype=np.ubyte)
        dataArch.copyArchiveToArray(buffer_array)
        return buffer_array

    buffer_array = Serialize(nT)  # call helper function
    # np.save(<path>, buffer_array)  # optional: save the numpy array as .npy in the path
2.2.2 Compression
Compressor methods are also available in Intel DAAL to reduce the storage footprint when large datasets must be written to disk. A serialized array representation of an Intel DAAL numeric table can be compressed before it is saved, so that it takes less space on disk.
To automate this process, the following helper function can be used to serialize, then compress the resulting serialized array.
Use the helper functions Serialize(nT) and CompressToDisk(nT, path) together to compress and write numeric tables to disk.
Helper Function:
    def CompressToDisk(nT, path):
        from daal.data_management import Compressor_Zlib, level9, CompressionStream
        import numpy as np
        # Serialize nT contents
        buffer = Serialize(nT)
        # Create a compressor object
        compressor = Compressor_Zlib()
        compressor.parameter.gzHeader = True
        compressor.parameter.level = level9
        # Create a stream for compression
        comprStream = CompressionStream(compressor)
        # Write numeric table to the compression stream
        comprStream.push_back(buffer)
        # Allocate memory to store the compressed data
        compressedData = np.empty(comprStream.getCompressedDataSize(), dtype=np.uint8)
        # Store compressed data
        comprStream.copyCompressedArray(compressedData)
        # Save compressed data to disk
        np.save(path, compressedData)

    CompressToDisk(nT, <path>)
2.3 Import Numeric Table from disk:
As mentioned in the previous sections, numeric tables can be stored in the form of either serialized or compressed numpy files. Decompression/ Deserialization methods are available to reconstruct the numeric table.
2.3.1 Deserialization
The helper function below is available to reconstruct a numeric table from serialized array objects.
Helper Function:
    def DeSerialize(buffer_array):
        from daal.data_management import OutputDataArchive, HomogenNumericTable
        # Load serialized contents to construct output data archive object
        dataArch = OutputDataArchive(buffer_array)
        # De-serialize into nT object and return nT
        nT = HomogenNumericTable()
        nT.deserialize(dataArch)
        return nT

    # buffer_array = np.load(path)  # optional: only needed when the serialized contents were saved to disk
    nT = DeSerialize(buffer_array)
2.3.2 Decompression
Because the compression stage involves serializing the numeric table object, the decompression stage includes deserialization. See the DeSerialize helper function above to recover the numeric table; a quick decompression helper function follows.
Use the helper functions DeSerialize(buffer_array) and DeCompressFromDisk(path) together to read and decompress numeric tables from disk.
Helper Function:
    def DeCompressFromDisk(path):
        from daal.data_management import Decompressor_Zlib, DecompressionStream
        import numpy as np
        # Create a decompressor
        decompressor = Decompressor_Zlib()
        decompressor.parameter.gzHeader = True
        # Create a stream for decompression
        deComprStream = DecompressionStream(decompressor)
        # Write the compressed data to the decompression stream and decompress it
        deComprStream.push_back(np.load(path))
        # Allocate memory to store the decompressed data
        deCompressedData = np.empty(deComprStream.getDecompressedDataSize(), dtype=np.uint8)
        # Store the decompressed data
        deComprStream.copyDecompressedArray(deCompressedData)
        # Deserialize
        return DeSerialize(deCompressedData)

    nT = DeCompressFromDisk(<path>)  # path must be a '.npy' file
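The full export and import cycle (serialize, compress, decompress, deserialize) can be rehearsed with the Python standard library alone. The sketch below uses the stdlib `zlib` module rather than Intel DAAL's compressor classes, and the byte layout in `pack_table` is a hypothetical format invented for this example, not the library's archive format.

```python
import struct
import zlib

def pack_table(rows):
    """Serialize a rectangular table: two uint32 dimensions, then float64 values."""
    n_rows, n_cols = len(rows), len(rows[0])
    flat = [float(x) for row in rows for x in row]
    header = struct.pack("<II", n_rows, n_cols)
    return header + struct.pack("<%dd" % len(flat), *flat)

def unpack_table(buf):
    """Rebuild the table from the bytes produced by pack_table."""
    n_rows, n_cols = struct.unpack_from("<II", buf, 0)
    flat = struct.unpack_from("<%dd" % (n_rows * n_cols), buf, 8)
    return [list(flat[r * n_cols:(r + 1) * n_cols]) for r in range(n_rows)]

data = [[1, 2, 3, 4], [5, 6, 7, 8]]

serialized = pack_table(data)              # serialize
compressed = zlib.compress(serialized, 9)  # compress at the highest level
restored = unpack_table(zlib.decompress(compressed))  # decompress, then deserialize

print(restored)
```

The order matters in both directions: compression operates on the serialized bytes, so decompression has to run before deserialization.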
Intel® DAAL also implements several generic compression and decompression methods, including ZLIB, LZO, RLE, and BZIP.
Conclusion
Intel® DAAL’s data management component provides classes and methods to perform common operations on numeric table contents. The basic numeric table operations include access, mutation, export to disk, and import from disk. The helper functions covered in this document help automate the creation of numeric table subsets, as well as the serialization and compression processes.
The next volume (Volume 3) in the Gentle Introduction series gives a brief introduction to the Algorithm section of PyDAAL. Volume 3 focuses on the workflow of important descriptive and predictive algorithms available in Intel® DAAL. Advanced features such as setting hyperparameters, distributing fit calculations, and deploying models as serialized objects will all be covered.
Other Related Links:
- Gentle Introduction to PyDAAL: Vol 2 of 3 Basic Operations on Numeric Tables
- Gentle Introduction to PyDAAL: Vol 3 of 3 Analytics Model Building and Deployment
- Developer Guide for Intel® DAAL
- PyDaal GitHub Tutorials