Try Platform Profiler Today
You are invited to try a free technical preview release. Just follow these simple steps (if you are already registered, skip the registration step):
- Register for the Intel® Parallel Studio XE Beta
- Download and install (the Platform Profiler is a separate download from the Intel Parallel Studio XE Beta)
- Check out the getting started guide, then give Platform Profiler a test drive
- Fill out the online survey
Introduction
Intel® VTune™ Amplifier - Platform Profiler, currently available as a technology preview, is a tool that helps users identify how well an application uses the underlying architecture and how they can optimize the hardware configuration of their system. It displays the high-level system configuration, such as processor, memory, storage layout, PCIe* and network interfaces (see Figure 1), as well as performance metrics observed on the system, such as CPU and memory utilization, CPU frequency, cycles per instruction (CPI), memory and disk input/output (I/O) throughput, power consumption, cache miss rate per instruction, and so on. The performance metrics collected by the tool can be used for deeper analysis and optimization.
There are two primary audiences for Platform Profiler:
- Software developers - Using performance metrics provided by the tool, developers can analyze the behavior of their workload across various platform components such as CPU, memory, disk, and network devices.
- Infrastructure architects - You can monitor your hardware by analyzing long collection runs and finding times when the system exhibits low performance. Moreover, you can optimize the hardware configuration of your system based on the tool's findings. For example, if Platform Profiler shows that, after running a mix of workloads, high processor utilization, memory use, or I/O is limiting application performance, you can add more cores, add more memory, or use more or faster I/O devices.
Figure 1. High-level system configuration view from Platform Profiler
The main difference between Platform Profiler and other VTune Amplifier analyses is that Platform Profiler can profile a platform for long periods of time while incurring very little performance overhead and generating a small amount of data. The current version can run for up to 13 hours and still produces far less data than VTune Amplifier would in the same time. You can simply start a Platform Profiler collection, keep it running for up to 13 hours while using the system as you normally would, and then stop the collection; Platform Profiler gathers the profiled data and displays the system utilization diagrams. VTune Amplifier, on the other hand, cannot run for such a long period of time because it generates gigabytes of profiling data in a matter of minutes, so it is better suited to fine tuning or to analyzing an application rather than a system. "How well is my application using the machine?" or "How well is the machine being used?" are the key questions that Platform Profiler can answer; "How do I fix my application to use the machine better?" is the question that VTune Amplifier answers.
Platform Profiler is composed of two main components: a data collector and a server.
- Data Collector - a standalone package installed on the profiled system. It collects system-level hardware and operating system performance counters.
- Platform Profiler Server - post-processes the collected data into a time-series database, correlates with system topology information, and displays topology diagrams and performance graphs using a web-based interface.
To use Platform Profiler, you first need to install both the server and the data collector components. The installation steps are described below.
Installing the Server Component
- Copy the server package to the system on which you want to install the server.
- Extract the archive to a writeable directory.
- Run the setup script and follow the prompts. On Windows*, run the script using the Administrator Command Prompt. On Linux*, use an account with root (“sudo”) privileges.
Linux example: ./setup.sh
Windows example: setup.bat
By default, the server is installed in the following location:
- On Linux: /opt/intel/vpp
- On Windows: C:\Program Files (x86)\IntelSWTools\VTune Amplifier Platform Profiler
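For reference, a typical Linux installation session might look like the following minimal sketch; the archive name is hypothetical, so substitute the name of the server package you actually downloaded:
mkdir vpp-server-setup
tar -xzf vpp-server-linux.tar.gz -C vpp-server-setup   # hypothetical archive name
cd vpp-server-setup
sudo ./setup.sh   # follow the prompts; installs to /opt/intel/vpp by default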
Installing the Data Collector Component
- Copy the collector package to the target system on which you want to collect platform performance data.
- Extract the archive to a writeable directory on the target system.
- Run the setup script and follow the prompts. On Windows, run the script using the Administrator Command Prompt. On Linux, use an account with root (“sudo”) privileges.
Linux example: ./setup
Windows example: setup.cmd
By default, the collectors are installed in the following location:
- On Linux: /opt/intel/vpp-collector
- On Windows: C:\Intel\vpp-collector
Tool Usage
Starting and Stopping the Server Component
On Linux:
- Run the following commands to start the server manually after initial installation or a system reboot:
- source ./vpp-server-vars.sh
- vpp-server-start
- Run the following commands to stop the server:
- source ./vpp-server-vars.sh
- vpp-server-stop
On Windows:
- Run the following commands to start the server manually after initial installation or a system reboot:
- vpp-server-vars.cmd
- vpp-server-start
- Run the following commands to stop the server:
- vpp-server-vars.cmd
- vpp-server-stop
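Putting the Linux commands together, a typical start-to-stop session might look like the sketch below; it assumes the vpp-server-vars.sh script sits in the default install directory (/opt/intel/vpp) and uses the server home page address described later in this article:
cd /opt/intel/vpp
source ./vpp-server-vars.sh    # set up the server environment
vpp-server-start               # start the server
curl http://localhost:6543     # optional: confirm the home page responds
vpp-server-stop                # stop the server when you are done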
Collecting System Data
Collecting data using Platform Profiler is straightforward. Below are the steps you need to take to collect data using the tool:
- Set up the environment:
- On Linux: source /opt/intel/vpp-collector/vpp-collect-vars.sh
- On Windows: C:\Intel\vpp-collector\vpp-collect-vars.cmd
- Start the data collection: vpp-collect-start [-c “workload description – free text comment”].
- Optionally, you can also add timeline markers to distinguish the time periods between collections: vpp-collect-mark [“an optional label/text/comment”].
- Stop the data collection: vpp-collect-stop. After the collection is stopped, the compressed result file is stored in the current directory.
Note: Inserting timeline markers is useful when you leave Platform Profiler running for a long period of time. For example, suppose you run a Platform Profiler collection for 13 hours straight and, during that time, run various stress tests. To tell the tests apart on the timeline and see how each one affects the system, insert a timeline marker between them.
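As a concrete sketch, a complete Linux collection session might look like the following; the collector is assumed to be in its default location, and the two workload scripts are placeholders for whatever you actually run:
source /opt/intel/vpp-collector/vpp-collect-vars.sh            # set up the collection environment
vpp-collect-start -c "Spark movie recommendation experiments"  # start collecting, with a free-text comment
./run_workload_a.sh                                            # placeholder: first workload
vpp-collect-mark "workload A done, starting workload B"        # timeline marker between workloads
./run_workload_b.sh                                            # placeholder: second workload
vpp-collect-stop                                               # stop; the compressed result file is written to the current directory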
View Results
- From the machine on which the server is installed, point your browser (Google Chrome* recommended) to the server home page: http://localhost:6543.
- Click “View Results”.
- Click the Upload button and select the result file to upload.
- Select the result from the list to open the viewer.
- Navigate through the result to identify areas for optimization.
Tool Demonstration
In the rest of the article, I demonstrate how to navigate and analyze the data that Platform Profiler collects. I use a movie recommendation system application as an example. The movie recommendation code is obtained from the Spark* Training GitHub* website. The underlying platform is a two-socket Haswell server (Intel® Xeon® CPU E5-2699 v3) with Intel® Hyper-Threading Technology enabled, 72 logical cores, and 64 GB of memory, running the Ubuntu* 14.04 operating system.
The code is run in Spark on a single node as follows:
spark-submit --driver-memory 2g --class MovieLensALS --master local[4] movielens-als_2.10-0.1.jar movies movies/test.dat
With the command line above, Spark runs in local mode with four threads, as specified by the --master local[4] option. In local mode there is only one driver, which also acts as the executor, and the executor spawns the threads that execute tasks. Two arguments can be changed before launching the application: the driver memory (--driver-memory 2g) and the number of threads to run with (local[4]). My goal is to see how much these arguments stress the system, and to find out whether I can identify any interesting patterns during execution using Platform Profiler's profiled data.
Here are the four test cases that were run and their corresponding run times:
spark-submit --driver-memory 2g --class MovieLensALS --master local[4] movielens-als_2.10-0.1.jar movies movies/test.dat (16 minutes 11 seconds)
spark-submit --driver-memory 2g --class MovieLensALS --master local[36] movielens-als_2.10-0.1.jar movies movies/test.dat (11 minutes 35 seconds)
spark-submit --driver-memory 8g --class MovieLensALS --master local[36] movielens-als_2.10-0.1.jar movies movies/test.dat (7 minutes 40 seconds)
spark-submit --driver-memory 16g --class MovieLensALS --master local[36] movielens-als_2.10-0.1.jar movies movies/test.dat (8 minutes 14 seconds)
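All four runs can be captured in a single Platform Profiler collection, with a timeline marker before each run so the tests are easy to tell apart in the viewer. Here is a minimal bash sketch, assuming the collector environment has already been set up as described above; the jar and data paths are the ones used in the test cases:
vpp-collect-start -c "Spark MovieLensALS parameter sweep"
for cfg in "2g 4" "2g 36" "8g 36" "16g 36"; do
  set -- $cfg                                     # split into driver memory and thread count
  vpp-collect-mark "driver-memory=$1 threads=$2"  # label the run on the timeline
  spark-submit --driver-memory $1 --class MovieLensALS --master local[$2] \
    movielens-als_2.10-0.1.jar movies movies/test.dat
done
vpp-collect-stop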
Figures 2 and 3 show the observed CPU metrics during the first and second tests, respectively. Figure 2 shows that the CPU is underutilized and that more work can be added, provided the rest of the system is similarly underutilized. The CPU frequency drops frequently, supporting the conclusion that the CPU will not be the limiter of performance. Figure 3 shows that the second test utilizes the CPU more due to the increase in the number of threads, but there is still significant headroom. Moreover, it is interesting to see that increasing the number of threads also decreased the CPI rate, as shown in the CPI chart of Figure 3.
Figure 2. Overview of CPU usage in Test 1.
Figure 3. Overview of CPU usage in Test 2.
Figure 4. Memory read/write throughput on Socket 0 for Test 1.
Comparing Figures 4 and 5 shows that the increase in the number of threads also increased the number of memory accesses. This is expected behavior, and the data collected by Platform Profiler confirms it.
Figure 5. Memory read/write throughput on Socket 0 for Test 2.
Figures 6 and 7 show the L1 and L2 miss rates per instruction for Tests 1 and 2, respectively. Increasing the number of threads in Test 2 drastically decreased the L1 and L2 miss rates, as depicted in Figure 7. We found that the application incurs a lower CPI and lower L1 and L2 miss rates when the code is run with more threads, which means that once data is loaded from memory into the caches, a fairly good amount of data reuse happens, benefiting overall performance.
Figure 6. L1 and L2 miss rate per instruction for Test 1.
Figure 7. L1 and L2 miss rate per instruction for Test 2.
Figure 8 shows the memory usage chart for Test 3. Similar memory usage patterns are observed for all the other tests as well; that is, used memory stays between 15 and 25 percent, whereas cached memory stays between 45 and 60 percent. Spark caches its intermediate results in memory for later processing, hence the high utilization of cached memory.
Figure 8. Memory utilization overview for Test 3.
Finally, Figures 9-12 show the disk utilization overview for all four test runs. As the amount of work increases across the four runs, the data shows that a faster disk would improve the performance of the tests. The number of bytes transferred is not large, but the I/O operations (IOPS) spend significant time waiting for completion, as can be seen in the Queue Depth chart. If the user is unable to change the disk, adding more threads would help tolerate the disk access latency.
Figure 9. Disk utilization overview for Test 1.
Figure 10. Disk utilization overview for Test 2.
Figure 11. Disk utilization overview for Test 3.
Figure 12. Disk utilization overview for Test 4.
Summary
Using Platform Profiler, I was able to understand the execution behavior of the movie recommendation workload and observe how certain performance metrics change with different numbers of threads and driver memory settings. Moreover, I was surprised to find that a lot of disk write operations happen during the execution of the workload, since Spark applications are designed to run in memory. To investigate the code further, I will follow up with VTune Amplifier's Disk Input and Output analysis to find out the details behind the disk I/O behavior.
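For readers who want to try the same follow-up, a command along these lines should work, assuming VTune Amplifier is installed and the disk-io analysis type is available in your version (check amplxe-cl -help collect for the exact analysis names); the result directory name is arbitrary:
amplxe-cl -collect disk-io -r diskio_test3 -- \
  spark-submit --driver-memory 8g --class MovieLensALS --master local[36] \
  movielens-als_2.10-0.1.jar movies movies/test.dat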