Accelerating Apache Spark-based Analytics on Intel Architecture
Michael A. Greene (Intel Software and Services Group)
To find new trends and strong patterns in large, complex data sets, a strong analytics foundation is needed. Intel is working closely with Databricks, AMPLab, and the Spark community and ecosystem to advance these analytics capabilities for Spark on Intel® architecture platforms and to accelerate the development of Spark-based applications. Intel architecture offers advanced silicon acceleration and built-in security technologies. By building on this trusted foundation and extending and optimizing the rich capabilities of Spark, we are accelerating the speed at which our customers derive real-time analytics insights and deliver meaningful solutions.
View Michael Greene’s keynote video.
How to Boost 100x Performance for Real World Application w/ Apache Spark
- With the rise of Apache Spark, many big data applications are moving to Spark in pursuit of a better user experience. However, initial performance does not always meet expectations. In this talk, we will share our experience working with several top Chinese internet companies to build their next-generation big data engines on Spark, covering graph analysis, interactive queries, batch OLAP/BI, and real-time analytics. With careful tuning, Spark delivered 5-100x speedups over their original MapReduce implementations. Along the way, we have accumulated practical experience in improving the user experience of real-world Spark applications in production environments. We expect this talk will be useful both for people who want to deploy their own Spark applications and for Spark developers interested in real-world challenges.
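As a flavor of what such tuning touches, here is a minimal sketch using standard Spark configuration properties; the application name and the specific values are illustrative assumptions, not the settings used in the engagements described above.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative tuning sketch; real values depend on the workload and cluster.
val conf = new SparkConf()
  .setAppName("tuning-sketch") // hypothetical application name
  // Kryo serialization is typically faster and more compact than Java serialization.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Size shuffle parallelism for the cluster instead of relying on the default.
  .set("spark.default.parallelism", "400")
  // Compress shuffle output, trading some CPU for less disk and network I/O.
  .set("spark.shuffle.compress", "true")
val sc = new SparkContext(conf)
```

Serializer and parallelism settings like these are common first steps when moving a MapReduce workload to Spark, ahead of workload-specific changes such as partitioning or caching strategy.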
Towards Benchmarking Modern Distributed Streaming Systems
- We present a common benchmark for modern distributed stream computing systems. It helps characterize streaming systems such as Spark Streaming and Storm from the performance, reliability, and availability perspectives. For example, Spark Streaming achieves high throughput and better fault tolerance, while Storm responds quickly but has some weaknesses in complex computation cases. The benchmark can also serve as an integration test suite for evaluating different release candidates.
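To make the measurement idea concrete, the following is a minimal sketch of per-record latency measurement in Spark Streaming, not the benchmark itself; the socket source, port 9999, and the `<eventTimeMillis>,<payload>` record format are assumptions for illustration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical latency probe: records arrive as "<eventTimeMillis>,<payload>".
object LatencySketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-latency-sketch")
    val ssc = new StreamingContext(conf, Seconds(1)) // 1-second micro-batches

    val lines = ssc.socketTextStream("localhost", 9999) // assumed test source
    // End-to-end latency: wall clock at processing time minus embedded event time.
    val latencies = lines.map(line => System.currentTimeMillis() - line.split(",")(0).toLong)

    latencies.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        val n = rdd.count()
        println(s"batch avg latency: ${rdd.sum() / n} ms over $n records")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

A throughput probe follows the same pattern with records-per-batch counted instead of latencies averaged; reliability and availability testing additionally requires fault injection, which is beyond a sketch like this.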
Taming GC Pauses for Humongous Java Heaps in Spark Graph Computing
- Spark nodes are shifting from commodity hardware to more powerful systems with larger memory (200GB+). Because Spark is an in-memory computing framework, popular wisdom has it that large Java heaps result in long garbage collection pauses that slow down Spark's overall throughput. Through several case studies using large Java heaps, we will show it is possible to maintain low GC pauses for better application throughput. In this presentation, we introduce the HotSpot G1 collector as the best GC for Spark solutions running in large-memory environments. We first discuss G1's internal operations and several tuning flags, which can be used to set a desired GC pause target, change adaptive GC thresholds, and adjust GC activity at runtime. We then provide case studies from a Spark graph computing application running an 80GB+ heap to show how tuning these flags removes unpredictable and protracted GC pauses for better application throughput.
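The flags in question are standard HotSpot options. A minimal sketch of passing them to Spark executors through `spark.executor.extraJavaOptions` might look like this; the pause target, occupancy threshold, and region size are illustrative assumptions, not the presenters' exact settings.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative G1 setup for a large-heap executor; flag values are assumptions.
val conf = new SparkConf()
  .setAppName("g1-tuning-sketch")
  .set("spark.executor.memory", "80g") // large heap, per the talk's scenario
  .set("spark.executor.extraJavaOptions",
    "-XX:+UseG1GC " +                          // select the G1 collector
    "-XX:MaxGCPauseMillis=200 " +              // desired GC pause target
    "-XX:InitiatingHeapOccupancyPercent=35 " + // start concurrent marking earlier
    "-XX:G1HeapRegionSize=32m " +              // larger regions for humongous objects
    "-XX:+PrintGCDetails -XX:+PrintGCDateStamps") // log GC activity for analysis
val sc = new SparkContext(conf)
```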
SparkR: The Past, the Present and the Future
- The SparkR project provides language bindings and runtime support that enable users to run scalable computations from R using Apache Spark. SparkR has an active set of contributors from many companies, and a number of recent developments have improved its performance and usability. The improvements include:
- a new R-to-JVM bridge that enables easy deployment to YARN clusters,
- serialization-deserialization routines that enable integration with other Spark components like ML Pipelines,
- a complete RDD API, with support coming for DataFrames, and
- performance improvements for various operations, including shuffles.
- This talk will present an overview of the project, outline some of the technical contributions and discuss new features we will build over the next year. We will also present a demo showcasing how SparkR can be used to seamlessly process large datasets on a cluster directly from the R console.