Have you ever tried to access a website and had to wait a long time, or been unable to reach it at all? If so, that website might have been the victim of a Denial of Service [1] (DoS) attack. A DoS attack occurs when an attacker floods a network with traffic such as spam emails, keeping the network so busy handling that traffic that it cannot serve requests from other users.
To prevent a spam-email DoS attack, a network needs to identify “garbage” (spam) emails and filter them out. One way to do this is to compare the pattern of an incoming email with those in a library of email spam signatures. Incoming patterns that match those in the library are labeled as attacks. Because spam emails come in many shapes and forms, no library can store every pattern. To increase the chance of identifying spam emails, there needs to be a method that restructures the data in a way that makes it simpler to analyze.
This article discusses an unsupervised [2] machine-learning [3] algorithm called principal component analysis [4] (PCA) that can be used to simplify the data. It also describes how the Intel® Data Analytics Acceleration Library (Intel® DAAL) [5] helps optimize this algorithm to improve performance on systems equipped with Intel® Xeon® processors.
What is Principal Component Analysis?
PCA is a popular data analysis method. It reduces the complexity of the data while preserving its important properties, making the data easier to visualize and analyze. Reducing the complexity means reducing the original dimensions to fewer dimensions while preserving the important features of the original dataset. PCA is normally used as a preprocessing step for machine-learning algorithms like K-means [6], resulting in simpler modeling and thus improved performance.
Figures 1–3 illustrate how the PCA algorithm works. To simplify the problem, let’s limit the scope to two-dimensional space.
Figure 1. Original dataset layout.
Figure 1 shows the objects of the dataset. We want to find the direction where the variance is maximal.
Figure 2. The mean and the direction with maximum variance.
Figure 2 shows the mean of the dataset and the direction with maximum variance. The direction with the maximal variance is called the first principal component.
Figure 3. Finding the next principal component.
Figure 3 shows the next principal component, which is the direction with the second-largest variance. Note that the second direction is orthogonal to the first direction.
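The idea in Figures 1–3 can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not how Intel DAAL implements PCA: the function name principal_components and the synthetic two-dimensional dataset are hypothetical, and the directions are obtained from an eigen-decomposition of the sample covariance matrix.

```python
import numpy as np

def principal_components(points):
    """Return (mean, variances, components) for an (n_samples, n_features) array.

    Components are returned as rows, sorted by decreasing variance.
    """
    points = np.asarray(points, dtype=float)
    mean = points.mean(axis=0)
    centered = points - mean                      # move the mean to the origin
    cov = np.cov(centered, rowvar=False)          # sample covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh returns ascending order
    order = np.argsort(eigenvalues)[::-1]         # largest variance first
    return mean, eigenvalues[order], eigenvectors[:, order].T

# Hypothetical 2-D dataset stretched along the diagonal (illustration only)
rng = np.random.default_rng(0)
t = rng.normal(size=200)
noise = rng.normal(scale=0.2, size=200)
data = np.column_stack([t + noise, t - noise])

mean, variances, components = principal_components(data)
p1, p2 = components
print("mean:", mean)
print("first principal component:", p1, "variance:", variances[0])
print("second principal component:", p2, "variance:", variances[1])
print("dot product of the two directions:", float(np.dot(p1, p2)))
```

The dot product printed on the last line should be zero up to rounding error, which reflects the fact that the two directions are perpendicular to each other.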
Figures 4–6 show how the PCA algorithm is used to reduce the dimensions.
Figure 4. Re-orientating the graph.
Figure 4 shows the new graph after rotating it so that the axis (P1) corresponding to the first principal component becomes a horizontal axis.
Figure 5. Projecting the objects to the P1 axis.
Figure 5 shows the objects projected onto the P1 axis, the axis corresponding to the first principal component.
Figure 6. Reducing from two dimensions to one dimension.
Figure 6 shows the effect of using PCA to reduce from two dimensions (P1 and P2) to one dimension (P1) based on the maximal variance. The same concept applies to multi-dimensional datasets: dropping the dimensions with lower variance reduces the number of dimensions while still maintaining much of the dataset's character.
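The rotation, projection, and reduction steps of Figures 4–6 can be sketched as follows. This is a minimal NumPy illustration rather than the Intel DAAL implementation; the function name reduce_dimensions and the five sample points are hypothetical.

```python
import numpy as np

def reduce_dimensions(points, n_components):
    """Project points onto their first n_components principal components."""
    points = np.asarray(points, dtype=float)
    mean = points.mean(axis=0)
    centered = points - mean
    eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centered, rowvar=False))
    order = np.argsort(eigenvalues)[::-1][:n_components]  # keep the largest variances
    components = eigenvectors[:, order]      # shape (n_features, n_components)
    scores = centered @ components           # coordinates along P1, P2, ...
    retained = eigenvalues[order].sum() / eigenvalues.sum()  # share of variance kept
    return scores, components, mean, retained

# Hypothetical points lying close to a straight line (illustration only)
points = np.array([[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8], [5.0, 5.1]])

scores, components, mean, retained = reduce_dimensions(points, n_components=1)
approx = scores @ components.T + mean        # map the 1-D data back to 2-D

print("reduced shape:", scores.shape)
print("fraction of variance retained:", retained)
print("largest reconstruction error:", np.abs(approx - points).max())
```

Because these points lie close to a line, one dimension keeps nearly all of the variance, and mapping the reduced data back to two dimensions gives points close to the originals.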
Information about the mathematical representation of PCA can be found in references 7 and 8.
Applications of PCA
PCA applications include the following:
- Detecting DoS and network probe attacks
- Image compression
- Pattern recognition
- Analyzing medical imaging
Pros and Cons of PCA
The following lists some of the advantages and disadvantages of PCA.
- Pros
  - Fast algorithm
  - Shows the maximal variance of the data
  - Reduces the dimension of the original data
  - Removes noise
- Cons
  - Non-linear structure is hard to model with PCA
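The limitation with non-linear structure can be illustrated with a small NumPy sketch. The example is hypothetical: it places points evenly on a unit circle, which is a one-dimensional curve, and checks how much variance each principal component captures.

```python
import numpy as np

# Hypothetical dataset: 360 points evenly spaced on the unit circle
angles = np.linspace(0.0, 2.0 * np.pi, 360, endpoint=False)
circle = np.column_stack([np.cos(angles), np.sin(angles)])

# Variance captured by each principal component, largest first
eigenvalues = np.linalg.eigvalsh(np.cov(circle, rowvar=False))[::-1]
ratio = eigenvalues / eigenvalues.sum()
print("share of variance per component:", ratio)
```

Both components capture about the same share of the variance, so PCA finds no direction that can be dropped without losing half of the spread, even though a single parameter (the angle) fully describes each point.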
Intel® Data Analytics Acceleration Library
Intel DAAL is a library consisting of many basic building blocks that are optimized for data analytics and machine learning. These basic building blocks are highly optimized for the latest features of the latest Intel® processors. More about Intel DAAL can be found at reference 5.
The next section shows how to use PCA with PyDAAL, the Python* API of Intel DAAL. To install PyDAAL, follow the instructions in reference 9.
Using the PCA Algorithm in Intel Data Analytics Acceleration Library
To invoke the PCA algorithm in Python [10] using Intel DAAL, follow these steps:
- Import the necessary packages using the commands from and import
- Import the necessary functions for loading the data by issuing the following command:
from daal.data_management import HomogenNumericTable
- Import the PCA algorithm using the following commands:
import daal.algorithms.pca as pca
- Import numpy for calculation.
import numpy as np
- Import the createSparseTable function, which creates a numeric table to store the input data read from a file:
from utils import createSparseTable
- Load the input data into a numeric table:
dataTable = createSparseTable(dataFileName)
where dataFileName is the name of the input .csv data file.
- Create an algorithm object for PCA using the correlation method.
pca_alg = pca.Batch_Float64CorrelationDense()
Note: To use the SVD (singular value decomposition) method instead, create the algorithm object with the following command:
pca_alg = pca.Batch_Float64SvdDense()
- Set the input for the algorithm.
pca_alg.input.setDataset(pca.data, dataTable)
- Compute the results.
result = pca_alg.compute()
The results can be retrieved using the following commands:
result.get(pca.eigenvalues)
result.get(pca.eigenvectors)
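For readers who want to see what the eigenvalues and eigenvectors of a correlation-based PCA represent, the following plain NumPy sketch may help. It is not the Intel DAAL implementation and does not call PyDAAL; the function name pca_correlation and the synthetic dataset are hypothetical, and the ordering (largest eigenvalue first, one eigenvector per row) is an assumption made for this sketch, so the layout and signs of the library's output may differ.

```python
import numpy as np

def pca_correlation(data):
    """Eigen-decomposition of the correlation matrix of an (n_samples, n_features) array.

    Returns eigenvalues in decreasing order and the matching eigenvectors as rows.
    """
    data = np.asarray(data, dtype=float)
    corr = np.corrcoef(data, rowvar=False)           # features are columns
    eigenvalues, eigenvectors = np.linalg.eigh(corr)  # eigh returns ascending order
    order = np.argsort(eigenvalues)[::-1]
    return eigenvalues[order], eigenvectors[:, order].T

# Hypothetical dataset: two strongly related features on very different
# scales plus one independent feature (illustration only)
rng = np.random.default_rng(42)
base = rng.normal(size=(500, 1))
data = np.hstack([
    base + 0.1 * rng.normal(size=(500, 1)),
    100.0 * base + 5.0 * rng.normal(size=(500, 1)),
    rng.normal(size=(500, 1)),
])

eigenvalues, eigenvectors = pca_correlation(data)
print("eigenvalues:", eigenvalues)
print("eigenvectors (one per row):")
print(eigenvectors)
print("share of variance per component:", eigenvalues / eigenvalues.sum())
```

Working from the correlation matrix rather than the covariance matrix means each feature is effectively standardized first, so a feature measured on a large scale does not dominate the result. The eigenvalues of a correlation matrix sum to the number of features, which makes each eigenvalue easy to read as a share of the total variance.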
Conclusion
PCA is one of the simplest unsupervised machine-learning algorithms that is used to reduce the dimensions of a dataset. Intel DAAL contains an optimized version of the PCA algorithm. With Intel DAAL, you don’t have to worry about whether your applications will run well on systems equipped with future generations of Intel Xeon processors. Intel DAAL will automatically take advantage of new features in new Intel Xeon processors. All you need to do is link your applications to the latest version of Intel DAAL.
References
3. Wikipedia – machine learning
4. Principal component analysis
7. Principal component analysis for machine learning
8. Principal component analysis tutorial
9. How to install Intel’s distribution for Python
10. Python website