Try Platform Profiler Today
You are invited to try a free technical preview release. Just follow these simple steps (if you are already registered, skip the registration step):
- Register for the Intel® Parallel Studio XE Beta
- Download and install (the Platform Profiler is a separate download from the Intel Parallel Studio XE Beta)
- Check out the getting started guide, then give Platform Profiler a test drive
- Fill out the online survey
Introduction
Intel® VTune™ Amplifier - Platform Profiler, currently available as a technology preview, is a tool that helps users identify how well an application uses the underlying architecture and how they can optimize the hardware configuration of their system. It displays the high-level system configuration, such as processor, memory, storage layout, PCIe* and network interfaces (see Figure 1), as well as performance metrics observed on the system, such as CPU and memory utilization, CPU frequency, cycles per instruction (CPI), memory and disk input/output (I/O) throughput, power consumption, cache miss rate per instruction, and so on. The performance metrics collected by the tool can be used for deeper analysis and optimization.
There are two primary audiences for Platform Profiler:
- Software developers - Using performance metrics provided by the tool, developers can analyze the behavior of their workload across various platform components such as CPU, memory, disk, and network devices.
- Infrastructure architects - You can monitor your hardware by analyzing long collection runs and finding times when the system exhibits low performance. Moreover, you can optimize the hardware configuration of your system based on the tool's findings. For example, if Platform Profiler shows that, after running a mix of workloads, high processor utilization, memory use, or I/O is limiting application performance, you can add more cores, add more memory, or use more or faster I/O devices.
Figure 1. High-level system configuration view from Platform Profiler
The main difference between Platform Profiler and other VTune Amplifier analyses is that Platform Profiler can profile a platform for long periods of time while incurring very little performance overhead and generating a small amount of data. The current version can run for up to 13 hours and still produces far less data than VTune Amplifier would in the same time. You can simply start a Platform Profiler collection, keep it running for up to 13 hours while using the system as you normally would, and then stop the collection; Platform Profiler gathers the profiled data and displays the system utilization diagrams. VTune Amplifier, on the other hand, cannot run for such a long period of time because it generates gigabytes of profiling data in a matter of minutes, so it is better suited to fine tuning or to analyzing an application rather than a system. "How well is my application using the machine?" or "How well is the machine being used?" are the key questions that Platform Profiler can answer; "How do I fix my application to use the machine better?" is the question that VTune Amplifier answers.
Platform Profiler is composed of two main components: a data collector and a server.
- Data Collector - a standalone package installed on the profiled system. It collects system-level hardware and operating system performance counters.
- Platform Profiler Server - post-processes the collected data into a time-series database, correlates with system topology information, and displays topology diagrams and performance graphs using a web-based interface.
To use Platform Profiler, you first need to install both the server and the data collector components. The installation steps are described below.
Installing the Server Component
- Copy the server package to the system on which you want to install the server.
- Extract the archive to a writeable directory.
- Run the setup script and follow the prompts. On Windows*, run the script using the Administrator Command Prompt. On Linux*, use an account with root (“sudo”) privileges.
Linux example: ./setup.sh
Windows example: setup.bat
By default, the server is installed in the following location:
- On Linux: /opt/intel/vpp
- On Windows: C:\Program Files (x86)\IntelSWTools\VTune Amplifier Platform Profiler
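For reference, a typical Linux installation session might look like the following minimal sketch; the archive name is hypothetical, so substitute the name of the server package you actually downloaded:
mkdir vpp-server-setup
tar -xzf vpp-server-linux.tar.gz -C vpp-server-setup   # hypothetical archive name
cd vpp-server-setup
sudo ./setup.sh   # follow the prompts; installs to /opt/intel/vpp by default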
Installing the Data Collector Component
- Copy the collector package to the target system on which you want to collect platform performance data.
- Extract the archive to a writeable directory on the target system.
- Run the setup script and follow the prompts. On Windows, run the script using the Administrator Command Prompt. On Linux, use an account with root (“sudo”) privileges.
Linux example: ./setup
Windows example: setup.cmd
By default, the collectors are installed in the following location:
- On Linux: /opt/intel/vpp-collector
- On Windows: C:\Intel\vpp-collector
Tool Usage
Starting and Stopping the Server Component
On Linux:
- Run the following commands to start the server manually after initial installation or a system reboot:
- source ./vpp-server-vars.sh
- vpp-server-start
- Run the following commands to stop the server:
- source ./vpp-server-vars.sh
- vpp-server-stop
On Windows:
- Run the following commands to start the server manually after initial installation or a system reboot:
- vpp-server-vars.cmd
- vpp-server-start
- Run the following commands to stop the server:
- vpp-server-vars.cmd
- vpp-server-stop
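Putting the Linux commands together, a typical start-to-stop session might look like the sketch below; it assumes the vpp-server-vars.sh script sits in the default install directory (/opt/intel/vpp) and uses the server home page address described later in this article:
cd /opt/intel/vpp
source ./vpp-server-vars.sh    # set up the server environment
vpp-server-start               # start the server
curl http://localhost:6543     # optional: confirm the home page responds
vpp-server-stop                # stop the server when you are done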
Collecting System Data
Collecting data using Platform Profiler is straightforward. Below are the steps you need to take to collect data using the tool:
- Set up the environment:
- On Linux: source /opt/intel/vpp-collector/vpp-collect-vars.sh
- On Windows: C:\Intel\vpp-collector\vpp-collect-vars.cmd
- Start the data collection: vpp-collect-start [-c “workload description – free text comment”].
- Optionally, you can also add timeline markers to distinguish the time periods between collections: vpp-collect-mark [“an optional label/text/comment”].
- Stop the data collection: vpp-collect-stop. After the collection is stopped, the compressed result file is stored in the current directory.
Note: Inserting timeline markers is useful when you leave Platform Profiler running for a long period of time. For example, suppose you run a Platform Profiler collection for 13 hours straight and, during that time, run various stress tests. To tell the tests apart on the timeline and see how each one affects the system, insert a timeline marker between them.
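As a concrete sketch, a complete Linux collection session might look like the following; the collector is assumed to be in its default location, and the two workload scripts are placeholders for whatever you actually run:
source /opt/intel/vpp-collector/vpp-collect-vars.sh            # set up the collection environment
vpp-collect-start -c "Spark movie recommendation experiments"  # start collecting, with a free-text comment
./run_workload_a.sh                                            # placeholder: first workload
vpp-collect-mark "workload A done, starting workload B"        # timeline marker between workloads
./run_workload_b.sh                                            # placeholder: second workload
vpp-collect-stop                                               # stop; the compressed result file is written to the current directory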
View Results
- From the machine on which the server is installed, point your browser (Google Chrome* recommended) to the server home page: http://localhost:6543.
- Click “View Results”.
- Click the Upload button and select the result file to upload.
- Select the result from the list to open the viewer.
- Navigate through the result to identify areas for optimization.
Tool Demonstration
In the rest of the article, I demonstrate how to navigate and analyze the data that Platform Profiler collects. I use a movie recommendation system application as an example. The movie recommendation code is obtained from the Spark* Training GitHub* website. The underlying platform is a two-socket Haswell server (Intel® Xeon® CPU E5-2699 v3) with Intel® Hyper-Threading Technology enabled, 72 logical cores, and 64 GB of memory, running the Ubuntu* 14.04 operating system.
The code is run in Spark on a single node as follows:
spark-submit --driver-memory 2g --class MovieLensALS --master local[4] movielens-als_2.10-0.1.jar movies movies/test.dat
With the command line above, Spark runs in local mode with four threads, as specified by the --master local[4] option. In local mode there is only one driver, which also acts as the executor, and the executor spawns the threads that execute tasks. Two arguments can be changed before launching the application: the driver memory (--driver-memory 2g) and the number of threads to run with (local[4]). My goal is to see how much these arguments stress the system, and to find out whether I can identify any interesting patterns during execution using Platform Profiler's profiled data.
Here are the four test cases that were run and their corresponding run times:
spark-submit --driver-memory 2g --class MovieLensALS --master local[4] movielens-als_2.10-0.1.jar movies movies/test.dat (16 minutes 11 seconds)
spark-submit --driver-memory 2g --class MovieLensALS --master local[36] movielens-als_2.10-0.1.jar movies movies/test.dat (11 minutes 35 seconds)
spark-submit --driver-memory 8g --class MovieLensALS --master local[36] movielens-als_2.10-0.1.jar movies movies/test.dat (7 minutes 40 seconds)
spark-submit --driver-memory 16g --class MovieLensALS --master local[36] movielens-als_2.10-0.1.jar movies movies/test.dat (8 minutes 14 seconds)
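All four runs can be captured in a single Platform Profiler collection, with a timeline marker before each run so the tests are easy to tell apart in the viewer. Here is a minimal bash sketch, assuming the collector environment has already been set up as described above; the jar and data paths are the ones used in the test cases:
vpp-collect-start -c "Spark MovieLensALS parameter sweep"
for cfg in "2g 4" "2g 36" "8g 36" "16g 36"; do
  set -- $cfg                                     # split into driver memory and thread count
  vpp-collect-mark "driver-memory=$1 threads=$2"  # label the run on the timeline
  spark-submit --driver-memory $1 --class MovieLensALS --master local[$2] \
    movielens-als_2.10-0.1.jar movies movies/test.dat
done
vpp-collect-stop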
Figures 2 and 3 show the observed CPU metrics during the first and second tests, respectively. Figure 2 shows that the CPU is underutilized and that more work can be added, provided the rest of the system is similarly underutilized. The CPU frequency drops frequently, supporting the conclusion that the CPU will not be the limiter of performance. Figure 3 shows that the second test utilizes the CPU more due to the increase in the number of threads, but there is still significant headroom. Moreover, it is interesting to see that increasing the number of threads also decreased the CPI rate, as shown in the CPI chart of Figure 3.
Figure 2. Overview of CPU usage in Test 1.
Figure 3. Overview of CPU usage in Test 2.
Figure 4. Memory read/write throughput on Socket 0 for Test 1.
Comparing Figures 4 and 5 shows that the increase in the number of threads also increased the number of memory accesses. This is expected behavior, and the data collected by Platform Profiler confirms it.
Figure 5. Memory read/write throughput on Socket 0 for Test 2.
Figures 6 and 7 show the L1 and L2 miss rates per instruction for Tests 1 and 2, respectively. Increasing the number of threads in Test 2 drastically decreased the L1 and L2 miss rates, as depicted in Figure 7. We found that the application incurs a lower CPI and lower L1 and L2 miss rates when the code is run with more threads, which means that once data is loaded from memory into the caches, a fairly good amount of data reuse happens, benefiting overall performance.
Figure 6. L1 and L2 miss rate per instruction for Test 1.
Figure 7. L1 and L2 miss rate per instruction for Test 2.
Figure 8 shows the memory usage chart for Test 3. Similar memory usage patterns are observed for all the other tests as well; that is, used memory stays between 15 and 25 percent, whereas cached memory stays between 45 and 60 percent. Spark caches its intermediate results in memory for later processing, hence the high utilization of cached memory.
Figure 8. Memory utilization overview for Test 3.
Finally, Figures 9-12 show the disk utilization overview for all four test runs. As the amount of work increases across the four runs, the data shows that a faster disk would improve the performance of the tests. The number of bytes transferred is not large, but the I/O operations (IOPS) spend significant time waiting for completion, as can be seen in the Queue Depth chart. If the user is unable to change the disk, adding more threads would help tolerate the disk access latency.
Figure 9. Disk utilization overview for Test 1.
Figure 10. Disk utilization overview for Test 2.
Figure 11. Disk utilization overview for Test 3.
Figure 12. Disk utilization overview for Test 4.
Summary
Using Platform Profiler, I was able to understand the execution behavior of the movie recommendation workload and observe how certain performance metrics change with different numbers of threads and driver memory settings. Moreover, I was surprised to find that a lot of disk write operations happen during the execution of the workload, since Spark applications are designed to run in memory. To investigate the code further, I will follow up with VTune Amplifier's Disk Input and Output analysis to find out the details behind the disk I/O behavior.
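For readers who want to try the same follow-up, a command along these lines should work, assuming VTune Amplifier is installed and the disk-io analysis type is available in your version (check amplxe-cl -help collect for the exact analysis names); the result directory name is arbitrary:
amplxe-cl -collect disk-io -r diskio_test3 -- \
  spark-submit --driver-memory 8g --class MovieLensALS --master local[36] \
  movielens-als_2.10-0.1.jar movies movies/test.dat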