Apache Spark* (http://spark.apache.org/) is a fast, general-purpose engine for large-scale data processing. Since graduating to a top-level Apache project in 2014, Spark has been widely adopted as a Big Data framework because of several advantages over Hadoop MapReduce: fault-tolerant distributed data structures (Resilient Distributed Datasets), a richer set of data processing operations, ease of use (increased developer productivity), support for many types of clusters, and easy connection to many types of data sources.
Spark comes with a stack of powerful libraries, including a popular machine learning library, MLlib (http://spark.apache.org/mllib/). MLlib is full of compute-intensive mathematical algorithms, but its implementations are not necessarily optimized for Intel architecture. Because Big Data infrastructures today are predominantly built on Intel processors, many developers have an interest in making Spark MLlib run faster on Intel-based clusters.
One way to make MLlib run faster is to replace MLlib algorithms with equivalent but more optimized implementations from the Intel® Data Analytics Acceleration Library (Intel® DAAL). Your workflow stays within Spark, so your machine learning runs faster while you still enjoy Spark's other advantages.
Intel DAAL is a software solution for developing data applications in C++, Java, or Python. The library provides a set of optimized building blocks that can be used in all stages of the data analytics workflow. These building blocks include data mining methods such as basic statistical moments, Principal Component Analysis, association rule mining, and anomaly detection, as well as supervised and unsupervised machine learning methods such as linear regression, classification, Support Vector Machines, and clustering.
See the attached presentation for a recipe on how to build faster data applications on Spark using Intel DAAL. A companion ZIP archive contains code samples discussed in the presentation. Download and unzip the archive, and build the samples with these steps:
- Edit pom.xml to set the correct path for 'daal.jar' on the build system. If the DAALROOT environment variable points to your Intel DAAL installation location, 'daal.jar' is at $DAALROOT/daal.jar.
- Build the samples with Maven (version 3.3 or later is required):
mvn clean package -DskipTests
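Before running the build, it can help to confirm the two prerequisites named in the steps above. The following is a minimal sketch, not part of the sample archive. The function names (maven_version_ok, daal_jar_present, build_samples) are hypothetical. It assumes 'daal.jar' sits directly under $DAALROOT as described above, and that `mvn -v` prints a first line of the form "Apache Maven X.Y.Z".

```shell
#!/bin/sh
# Pre-build checks for the samples (illustrative sketch; function names are
# hypothetical and not part of the sample archive).

# Succeed if the version string in $1 (e.g. "3.6.3") is 3.3 or later.
maven_version_ok() {
    major=${1%%.*}      # text before the first dot
    rest=${1#*.}        # text after the first dot
    minor=${rest%%.*}   # second version component
    [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 3 ]; }
}

# Succeed if daal.jar exists directly under the directory given in $1.
# Assumption: the jar location stated in the steps above ($DAALROOT/daal.jar).
daal_jar_present() {
    [ -f "$1/daal.jar" ]
}

# Run both checks, then the build command from the steps above.
build_samples() {
    if ! daal_jar_present "${DAALROOT:-}"; then
        echo "daal.jar not found under '${DAALROOT:-}'; check DAALROOT and pom.xml" >&2
        return 1
    fi
    # Assumption: 'mvn -v' reports a line such as "Apache Maven 3.6.3 (...)".
    mvn_version=$(mvn -v 2>/dev/null | awk '/^Apache Maven/ {print $3; exit}')
    if [ -z "$mvn_version" ] || ! maven_version_ok "$mvn_version"; then
        echo "Maven 3.3 or later is required (found: ${mvn_version:-none})" >&2
        return 1
    fi
    mvn clean package -DskipTests
}
```

To use the sketch, source the file from the directory that contains pom.xml and call build_samples. It does not edit pom.xml, so the path to 'daal.jar' still has to be set there by hand.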
To learn more about Intel DAAL, please visit the product page: https://software.intel.com/en-us/intel-daal
If you have any questions, please ask them on our user forum: https://software.intel.com/en-us/forums/intel-data-analytics-acceleration-library