This page provides the Release Notes for the Intel® MPI Library 2018 Beta. Please use the table below to select the version needed.
| | Linux* OS | Windows* OS |
|---|---|---|
| Intel® MPI Library 2018 Beta | English Release Notes | English Release Notes |
Intel® MPI Library is a multi-fabric message passing library based on ANL* MPICH3* and OSU* MVAPICH2*.
Intel® MPI Library implements the Message Passing Interface, version 3.1 (MPI-3) specification. The library is thread-safe and provides MPI standard-compliant multi-threading support.
To receive technical support and updates, you need to register your product copy. See Technical Support below.
The Intel® MPI Library Runtime Environment contains the tools you need to run programs, including shared (`.so`) libraries. The Intel® MPI Library Development Kit additionally includes compilation tools: compiler wrapper scripts (`mpicc`, `mpiicc`, etc.), include files and modules, static (`.a`) libraries, debug libraries, and test codes.

New in this release:

- New environment variables `I_MPI_HARD_FINALIZE` and `I_MPI_MEMORY_SWAP_LOCK`.
- Collective operation tuning controls (the `I_MPI_ADJUST` family).
- Asynchronous progress support (`I_MPI_ASYNC_PROGRESS`).
- Direct receive support (`I_MPI_OFI_DRECV`).
- PMI2 protocol support (`I_MPI_PMI2`).
- Prefork support in the Hydra process manager (`I_MPI_HYDRA_PREFORK`).
- SLURM* integration enhancements (`I_MPI_SLURM_EXT`).
- Lustre* file system stripe awareness (`I_MPI_LUSTRE_STRIPE_AWARE`).
- Debian Almquist (`dash`) shell support in compiler wrapper scripts and `mpitune`.
- The `pvfs2` ADIO driver is disabled.

Known issues and limitations:

- NUMA-related functionality requires the `libnuma.so` library and the `numactl` utility to be installed; `numactl` should include `numactl`, `numactl-devel`, and `numactl-libs`.
- The `I_MPI_JOB_FAST_STARTUP` variable takes effect only when `shm` is selected as the intra-node fabric.
- Stale shared memory collective files can be removed with `rm -r /dev/shm/shm-col-space-*`.
- If the number of available memory mappings is insufficient, increase the limit (`echo 1048576 > /proc/sys/vm/max_map_count` or `sysctl -w vm.max_map_count=1048576`), or set `I_MPI_COLL_INTRANODE=pt2pt`.
- If the Yama* security module restricts `ptrace` attachment, either run `echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope` or set `I_MPI_SHM_LMT=shm`. When using `-gdb`, this behavior must be disabled by setting the `sysctl` value in `/proc/sys/kernel/yama/ptrace_scope` to 0.
- Cross-OS jobs launched via `ssh` from a Windows* host fail. Two workarounds exist: start `pmi_proxy` manually, or run `hydra_persist` on the Linux* host in the background (`hydra_persist &`) and use `-bootstrap service` from the Windows* host. This requires that the Hydra service also be installed and started on the Windows* host.
- … `MPI_Finalize`.
- … `ofi`/`tmi`, and the second type will select `dapl` as the internode fabric. To avoid this, explicitly specify a fabric that is available on all the nodes.
- … set the `I_MPI_FABRICS` variable to the same values for each application to avoid this issue.
- … `dlopen(3)`.
- … the `system(3)`, `fork(2)`, `vfork(2)`, or `clone(2)` system calls. Do not use these system calls or functions based upon them. For example, `system(3)`, with the OFED* DAPL provider with a Linux* kernel version earlier than official version 2.6.16. Set the `RDMAV_FORK_SAFE` environment variable to enable the OFED workaround with a compatible kernel version.
- `MPI_Mprobe`, `MPI_Improbe`, and `MPI_Cancel` are not supported by the TMI and OFI fabrics.
- … the `-checkpoint-interval` option. The error message may look as follows:

      [proxy:0:0@hostname] HYDT_ckpoint_blcr_checkpoint (./tools/ckpoint/blcr/ckpoint_blcr.c:313): cr_poll_checkpoint failed: No such process
      [proxy:0:0@hostname] ckpoint_thread (./tools/ckpoint/ckpoint.c:559): blcr checkpoint returned error
      [proxy:0:0@hostname] HYDT_ckpoint_finalize (./tools/ckpoint/ckpoint.c:878): Error in checkpoint thread 0x7

- … the `/dev/shm` device in the system. To avoid failures related to the inability to create a shared memory segment, make sure the `/dev/shm` device is set up correctly.
- … `stdin` stream to the application. If you redirect a large file, the transfer can take long and cause the communication to hang on the remote side. To avoid this issue, pass large files to the application as command line options.
- … `dapl-2.0.37` or newer.
- When using `I_MPI_SHM_LMT=direct`, the setting has no effect if the Linux* kernel version is lower than 3.2.
- When using the `isolcpus` Linux* boot parameter with an Intel® Xeon Phi™ processor and default MPI settings, an application launch may fail. If possible, change or remove the `isolcpus` boot parameter. If that is not possible, you can try setting `I_MPI_PIN` to `off`.
- … set `I_MPI_ADJUST_ALLGATHER` to a value between 1 and 4 to resolve the issue.

Every purchase of an Intel® Software Development Product includes a year of support services, which provides priority customer support at our Online Support Service Center web site, http://www.intel.com/supporttickets.
In order to get support you need to register your product in the Intel® Registration Center. If your product is not registered, you will not receive priority support.
Intel® Data Analytics Acceleration Library (Intel® DAAL) is the library of Intel® architecture optimized building blocks covering all stages of data analytics: data acquisition from a data source, preprocessing, transformation, data mining, modeling, validation, and decision making.
Algorithms implemented in the library include:
Intel DAAL provides application programming interfaces (APIs) for C++, Java*, and Python* languages.
Download (14.06 MB)
Download (10.32 MB)
Download (7.75 MB)
For the Developer Guide and previous versions of API reference, see Intel® Data Analytics Acceleration Library - Documentation.
Release Notes include important information about each release.
Beta - Release Notes
For additional information such as installation and user guides, visit Intel® Computer Vision SDK.
Previous Versions
Initial Beta (Part of the Intel® Deep Learning SDK) - Release Notes
All files are in PDF format - Adobe Reader* (or compatible) required.
The section below provides links to the Intel® MPI Library 2018 Beta documentation. You can find other documentation, including user guides and reference manuals for current and earlier Intel software product releases in the Intel® Software Documentation Library.
Visit this page for documentation pertaining to the latest stable Intel MPI Library release.
You can also download an offline version of the documentation from the Intel Registration Center > Product List > Intel® Parallel Studio XE Documentation Beta
| Title | Format | Version | Type | Date |
|---|---|---|---|---|
| Developer Guide for Linux* | Online, PDF | 2018 Beta | Developer Guide | Apr 2017 |
| Developer Reference for Linux* | Online, PDF | 2018 Beta | Developer Reference | Apr 2017 |
| Title | Format | Version | Type | Date |
|---|---|---|---|---|
| Developer Guide for Windows* | Online, PDF | 2018 Beta | Developer Guide | Apr 2017 |
| Developer Reference for Windows* | Online, PDF | 2018 Beta | Developer Reference | Apr 2017 |
The section below provides links to the Intel® Trace Analyzer and Collector 2018 Beta documentation. You can find other documentation, including user guides and reference manuals for current and earlier Intel software product releases in the Intel® Software Documentation Library.
Visit this page for documentation pertaining to the latest stable Intel Trace Analyzer and Collector release.
You can also download an offline version of the documentation from the Intel Registration Center > Product List > Intel® Parallel Studio XE Documentation Beta
| Title | Format | Version | Type | Date |
|---|---|---|---|---|
| Intel® Trace Collector User and Reference Guide | Online, PDF | 2018 Beta | User/Reference Guide | Apr 2017 |
| Intel® Trace Analyzer User and Reference Guide | Online, PDF | 2018 Beta | User/Reference Guide | Apr 2017 |
Intel Parallel Studio XE supports only 64-bit Intel® architecture (Intel® 64) hosts.
Systems based on Intel® 64 architecture:
12 GB of disk space (minimum) on a standard installation.
During the installation process, the installer may need up to 12 GB of additional temporary disk storage to manage the intermediate installation files.
The operating systems listed below are supported by all components on Intel® 64 Architecture.
2 GB RAM (minimum)
On macOS*, the Intel® C/C++ Compiler and Intel® Fortran Compiler require a version of Xcode* to be installed. The following versions are currently supported:
You can customize the items listed below during the installation. We will perform a system check to ensure your configuration will work correctly on your system, and help you solve any issues that may be detected. Here are the configurable items:
Extract the contents of the installation package to a directory of your choice.
You can either install with the GUI or use the command line. You will find files for both methods in the main directory of the extracted files.
Open a terminal window and run the file named `install_GUI.sh`.
Open a terminal window and run the file named `install.sh`. Follow the prompts in the CLI to continue installation.
There are international regulations that require Intel to register users who download an application of this nature. Registering will also allow Intel to keep you up to date on the latest releases.
When the installation is complete, you are ready to start developing with the Intel Parallel Studio XE. Start exploring what's available in the studio with the Getting Started Guide.
How you register your floating license depends on how it was issued. Registration is the process of taking ownership of a particular serial number, while Activation is the process of assigning the owned serial number to a license server.
If you have a serial number which has no owner, you may register it by following this process:
If you already have a registration center account, you may login and enter the unregistered serial number in the upper-right serial number box.
If the serial number is already registered, the above process will automatically add you as a user of the license. This grants the ability to download the products available with the license. If you expected to become the license owner, you can contact support to assist with determining the current owner and/or license transfer.
To activate your floating license, you must provide the host ID and host name of the server running the license manager. This can be done in one of two ways:
After the serial number is activated, you may download the license file.
The Intel® Software License Manager uses two ports to serve licenses - one for lmgrd (the main license service) and one for the INTEL vendor daemon. Both ports must be open and not blocked by a firewall.
This is the main process that controls license management, and is provided by FlexNet Publisher, formerly Flexlm. The Intel Software License Manager uses port 28518 as a default to avoid conflicts with other vendors. This can be entered through the Intel Registration Center during activation, or changed for activated licenses by following these steps.
This is the vendor daemon that serves Intel licenses. When lmgrd is started or restarted, it starts the vendor daemon which determines a port to use. At this time, the selected port number is displayed in the startup output, which is either written to a log file or stdout. There is no additional reporting by the license manager utilities on this port, causing it to be overlooked.
As firewalls have become more common, so have reports of issues stemming from the INTEL vendor daemon port being blocked. Even if the port was not previously blocked, restarting the license manager can cause the port number to change and be subsequently blocked. To determine the port number, run a command such as netstat and look for the INTEL daemon.
The INTEL vendor daemon port can be specified by modifying the license file. Change the second line as follows:
VENDOR INTEL port=<port>
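For reference, a minimal sketch of how the top of an edited license file might look; the hostname, host ID, and port numbers below are placeholders for illustration, not values from this document:

```
SERVER mylicserver 001122334455 28518
VENDOR INTEL port=28519
```

The lmgrd port (28518 here) appears at the end of the SERVER line, while the INTEL vendor daemon port is pinned on the VENDOR line.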
Be sure to restart the license manager after any license file changes.
A primer on how to become a data scientist
How do I become a good data scientist? Should I learn R* or Python*? Or both? Do I need to get a PhD? Do I need to take tons of math classes? What soft skills do I need to become successful? What about project management experience? What skills are transferable? Where do I start?
Data science is a popular topic in the tech world today. It is the science that powers many of the trends in this world, from machine learning to artificial intelligence.
In this article, we present what we have learned about data science as a series of steps, so that any product manager or business manager interested in exploring this science will be able to take a first step toward becoming a data scientist, or at least develop a deeper understanding of it.
We have all heard conversations that go something like this: "Look at the data and tell me what you find." This approach may work when the volume of data is small, structured, and limited. But when we are dealing with gigabytes or terabytes of data, it can lead to an endless, daunting detective hunt that provides no answers, because there were no questions to begin with.
As powerful as science is, it's not magic. Inventions in any field of science solve a problem. Similarly, the first step in using data science is to define a problem statement, a hypothesis to be validated, or a question to be answered. It may also focus on a trend to be discovered, an estimate, a prediction to be made, and so on.
For example, take MyFitnessPal*, which is a mobile app for monitoring health and fitness. A few of my friends and I downloaded it about a year ago, and then used it almost daily for a while. But over the past 6 months, most of us have completely stopped using it. If I were a product manager for MyFitnessPal, a problem I might want to solve would be: how can we drive customer engagement and retention for the app?
Today's data scientists access data from several sources. This data may be structured or unstructured. The raw data that we often get is unstructured and/or dirty data, which needs to be cleaned and structured before it can be used for analysis. Most of the common sources of data now offer connectors to import the raw data in R or Python.
Common data sources include the following:
In the data science world, common vocabulary includes:
| Term | Database analogy | Example |
|---|---|---|
| Observations or examples | The rows in a database | A customer record for Joe Allen |
| Variables, signals, or characteristics | The columns | Joe's height |
Several terms are used to refer to data cleaning, such as data munging, data preprocessing, data transformation, and data wrangling. These terms all refer to the process of preparing the raw data to be used for data analysis.
As much as 70–80 percent of the efforts in a data science analysis involve data cleansing.
A data scientist analyzes each variable in the data to evaluate whether it is worthy of being a feature in the model. If including the variable increases the model's predictive power, it is considered a predictor for the model. Such a variable is then considered a feature, and together all the features create a feature vector for the model. This analysis is called feature engineering.
Sometimes a variable may need to be cleaned or transformed to be used as a feature in the model. To do that we write scripts, which are also referred to as munging scripts. Scripts can perform a range of functions.
Sometimes the data has numerical values that vary widely in magnitude, making it difficult to visualize the information. We can resolve this issue using feature scaling. For example, consider the square footage and number of rooms in a house. If we normalize the square footage of a house so that it has a magnitude similar to the number of bedrooms, our analysis becomes easier.
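The normalization described above can be sketched in a few lines of JavaScript; the function name and sample data are illustrative, not from any particular library:

```javascript
// Min-max scaling: rescale each feature to the [0, 1] range so that
// large-magnitude features (square footage) do not dominate
// small-magnitude ones (number of bedrooms).
function minMaxScale(values) {
  const min = Math.min(...values);
  const max = Math.max(...values);
  // If all values are identical, return zeros to avoid division by zero.
  return values.map(v => (max === min ? 0 : (v - min) / (max - min)));
}

const sqft = [1400, 2200, 3000];
const bedrooms = [2, 3, 4];
console.log(minMaxScale(sqft));     // [0, 0.5, 1]
console.log(minMaxScale(bedrooms)); // [0, 0.5, 1]
```

After scaling, both features occupy the same range and can be compared or plotted together directly.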
A series of scripts are applied to the data in an iterative manner until we get data that is clean enough for analysis. To get a continuous supply of data for analysis, the series of data munging scripts need to be rerun on the new raw data. Data pipeline is the term given to this series of processing steps applied to raw data to make it analysis ready.
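The pipeline idea can be sketched as an ordered list of munging functions applied to raw rows; all step names and data here are hypothetical, for illustration only:

```javascript
// Three illustrative munging steps over raw records.
const trimFields = rows => rows.map(r => ({ ...r, name: r.name.trim() }));
const dropEmpty  = rows => rows.filter(r => r.name !== "");
const toNumbers  = rows => rows.map(r => ({ ...r, height: Number(r.height) }));

// A data pipeline is just the steps applied in order; rerunning the same
// pipeline on new raw data keeps the supply of clean data flowing.
const pipeline = [trimFields, dropEmpty, toNumbers];
const runPipeline = (steps, rawRows) =>
  steps.reduce((rows, step) => step(rows), rawRows);

const raw = [
  { name: "  Joe Allen ", height: "180" },
  { name: "",             height: "170" },
];
console.log(runPipeline(pipeline, raw));
// → [{ name: "Joe Allen", height: 180 }]
```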
Now we have clean data and we are ready for analysis. Our next goal is to become familiar with the data using statistical modeling, visualizations, discovery-oriented data analysis, and so on.
For simple problems, we can use simple statistical analysis based on the mean, median, mode, min, max, range, quartiles, and so on.
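As a sketch, these summary statistics are easy to compute directly; plain JavaScript, with no statistics library assumed:

```javascript
// Simple descriptive statistics: mean, median, and range.
function mean(xs) {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}
function median(xs) {
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}
function range(xs) {
  return Math.max(...xs) - Math.min(...xs);
}

const visits = [12, 7, 3, 9, 15]; // made-up daily visit counts
console.log(mean(visits));   // 9.2
console.log(median(visits)); // 9
console.log(range(visits));  // 12
```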
We could also use supervised learning with data sets that gives us access to actual values of response variables (dependent variables) for a given set of feature variables (independent variables). For example, we could find trends based on the tenure, seniority, and title for employees who have left the company (resigned=true) from actual data, and then use those trends to predict whether other employees will resign too. Or we could use historic data to correlate a trend between the number of visitors (an independent variable or a predictor) and revenue generated (a dependent variable or response variable). This correlation could then be used to predict future revenue for the site based on the number of visitors.
The key requirement for supervised learning is the availability of ACTUAL Values and a clear question that needs to be answered. For example: Will this employee leave? How much revenue can we expect? Data scientists often refer to this as "Response variable is labeled for existing data."
Regression is a common tool used for supervised learning. A one-factor regression uses one variable; a multifactor regression uses many variables.
Linear regression assumes that the unknown relation between the factor and the response variable is a linear relation Y = a + bx, where b is the coefficient of x.
A part of the existing data is used as training data to calculate the value of this coefficient. Data scientists often use 60 percent, 80 percent, or at times 90 percent of the data for training. Once the value of the coefficient is calculated for the trained model, the model is tested with the remaining data, also referred to as the test data, to predict the value of the response variable. The difference between the predicted response value and the actual value is the Holy Grail of metrics, referred to as the test error metric.
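A minimal sketch of this workflow, fitting y = a + bx by ordinary least squares on training data and measuring the test error as mean squared error; the visitor/revenue data is made up for illustration:

```javascript
// One-factor linear regression fit by ordinary least squares: y = a + b*x.
function fitLinear(xs, ys) {
  const n = xs.length;
  const mx = xs.reduce((s, v) => s + v, 0) / n;
  const my = ys.reduce((s, v) => s + v, 0) / n;
  let num = 0, den = 0;
  for (let i = 0; i < n; i++) {
    num += (xs[i] - mx) * (ys[i] - my);
    den += (xs[i] - mx) ** 2;
  }
  const b = num / den;   // coefficient of x
  const a = my - b * mx; // intercept
  return { a, b, predict: x => a + b * x };
}

// Test error metric: mean squared error on held-out data.
function meanSquaredError(model, xs, ys) {
  const errs = xs.map((x, i) => ys[i] - model.predict(x));
  return errs.reduce((s, e) => s + e * e, 0) / errs.length;
}

// Hypothetical data: site visitors (x) vs. revenue (y).
// The first 80 percent is training data, the rest is test data.
const x = [10, 20, 30, 40, 50], y = [105, 195, 305, 395, 505];
const model = fitLinear(x.slice(0, 4), y.slice(0, 4));
console.log(meanSquaredError(model, x.slice(4), y.slice(4))); // test error
```

Minimizing this test error, by choosing better features or more training data, is exactly the quest described below.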
Our quest in data science modeling is to minimize the test error metric in order to increase the predictive power of the model.
Unsupervised learning is applied when we are trying to learn the structure of the underlying data itself. There is NO RESPONSE VARIABLE. Data sets are unlabeled and pre-existing insights are unclear. We are not clear about anything ahead of time so we are not trying to predict anything!
This technique is effective for exploratory analysis and can be used to answer open-ended questions about the data.
Analysis of variance (ANOVA) is a common technique used to compare the means of two or more groups. It is named ANOVA because estimates of variance are the main intermediate statistics calculated. The means of the groups are compared using a distance metric, Euclidean distance being a popular one.
Clustering is used to organize observations into similar groups, called clusters. The observations can be classified into these clusters based on their respective predictors.
http://www.statsdirect.com/help/content/analysis_of_variance/anova.htm
Two common clustering approaches are:
- Hierarchical clustering
- K-means clustering
If a stable state is not achieved, we may need to refine the number of clusters (i.e., K) we assumed in the beginning or use a different distance metrics.
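A minimal k-means sketch (one-dimensional data for brevity) showing the assign/update loop and the stable-state check described above; the data and function names are illustrative:

```javascript
// k-means: assign each observation to the nearest centroid by Euclidean
// distance, recompute centroids as cluster means, repeat until stable.
function kMeans(points, centroids, maxIter = 100) {
  let labels = null;
  for (let iter = 0; iter < maxIter; iter++) {
    // Assignment step: index of the nearest centroid for each point.
    labels = points.map(p => {
      let best = 0;
      centroids.forEach((c, i) => {
        if (Math.abs(p - c) < Math.abs(p - centroids[best])) best = i;
      });
      return best;
    });
    // Update step: move each centroid to the mean of its members.
    const next = centroids.map((c, i) => {
      const members = points.filter((_, j) => labels[j] === i);
      return members.length
        ? members.reduce((a, b) => a + b, 0) / members.length
        : c; // empty cluster: leave the centroid in place
    });
    // Stable state reached: centroids no longer move.
    if (next.every((c, i) => c === centroids[i])) {
      return { centroids: next, labels };
    }
    centroids = next;
  }
  // Did not stabilize: consider a different K or distance metric.
  return { centroids, labels };
}

const heights = [150, 152, 155, 180, 182, 185]; // made-up observations
console.log(kMeans(heights, [150, 185]).centroids); // two cluster centers
```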
The final clusters can be visualized for easy communication using tools like Tableau* or graphing libraries.
In my quest to understand data science, I met with practitioners working in companies, including Facebook, eBay, LinkedIn, Uber, and some consulting firms, that are effectively leveraging the power of data. Here are some powerful words of advice I received:
R is a favorite tool of many data scientists and holds a special place in the world of academia, where data science problems are approached from a mathematician's and statistician's perspective. R is a rich, open source language, with about 9,000 additional packages available. The tool used to program in R is called RStudio*. R has a steep learning curve, though its footprint is steadily increasing in the enterprise world, and it owes some of its popularity to the rich and powerful regular-expression-based algorithms already available.
Python is slowly becoming the most extensively used language in the data science community. Like R, it is also an open source language and is used primarily by software engineers who view data science as a tool to solve real customer-facing business problems using data. Python is easier to learn than R, because the language emphasizes readability and productivity. It is also more flexible and simpler.
SQL is the basic language used to interact with databases and is required for all tools.
Below is a list of important soft skills to have, many of which you might already have in your portfolio.
Your goal is to give them direct recommendations based on your solid prediction algorithm and accurate results. We recommend that you create four or five slides where you clearly tell this story: storytelling backed by solid data and solid research.
Visualization. A good data scientist needs to communicate results and recommendations using visualization. You cannot give someone a 200-page report to read. You need to present using pictures, images, charts, and graphs.
Now it's time to decide. What type of data scientist should I become?
Intel® MKL 2018 Beta is now available as part of the Parallel Studio XE 2018 Beta.
Check the Join the Intel® Parallel Studio XE 2018 Beta program post to learn how to join the Beta program and provide your feedback.
What's New in Intel® MKL 2018 Beta:
Optimizations are not dispatched unless explicitly enabled with the mkl_enable_instructions function call or the MKL_ENABLE_INSTRUCTIONS environment variable.
IMPORTANT: the Intel XDK App Designer component (aka the UI layout tool) has been deprecated. It will be retired in an upcoming release. Once retired, existing App Designer projects will continue to work, but you will not be able to create new App Designer projects.
No bug fixes will be implemented for the existing App Designer component nor for any of the UI frameworks that were supported by App Designer.
If you have designed your layout by hand or by using an external tool, there will be no changes to your project. This change ONLY affects projects that have been created using the App Designer UI layout tool. If you are just starting with the Intel XDK we recommend that you do NOT use App Designer to create your layout, since the editor will not be maintained and may eventually be discontinued.
There are many UI frameworks and tools available for creating UI layouts; too many to enumerate here. The vast majority of layout tools that generate standard HTML5 code (HTML/CSS/JavaScript) should work with no issue. The Intel XDK creates standard Cordova CLI (aka PhoneGap) applications, so any UI frameworks and tools that work in the Cordova CLI environment will work with your Intel XDK applications.
There is no "best" UI framework for your application. Each UI framework has pros and cons. You should choose that UI framework which serves your application needs the best. Using App Designer to create your UI is not a requirement to building a mobile app with the Intel XDK. You can create your layout by hand or using any UI framework (by hand) that is compatible with the Cordova CLI (aka PhoneGap) webview environment.
Twitter Bootstrap 3 -- This UI framework has been deprecated and will be retired from App Designer in a future release of the Intel XDK. You can always use this (or any mobile) framework with the Intel XDK, but you will have to do so manually, without the help of the Intel XDK App Designer UI layout tool. If you wish to continue using Twitter Bootstrap please visit the Twitter Bootstrap website and the Twitter Bootstrap GitHub repo for documentation and help.
Framework7 -- This UI framework has been retired from App Designer. You can always use this (or any mobile) framework with the Intel XDK, but you will have to do so manually, without the help of the Intel XDK App Designer UI layout tool. If you wish to continue using Framework7 please visit the Framework7 project page and the Framework7 GitHub repo for documentation and help.
Ionic -- This UI framework has been retired from App Designer. You can always use this (or any mobile) framework with the Intel XDK, but you will have to do so manually, without the help of the Intel XDK App Designer UI layout tool. If you wish to continue using Ionic please visit the Ionic project page and the Ionic GitHub repo for documentation and help.
App Framework 3 -- This UI framework has been retired from App Designer. You can always use this (or any mobile) framework with the Intel XDK, but you will have to do so manually, without the help of the Intel XDK App Designer UI layout tool. If you wish to continue using App Framework please visit the App Framework project page and the App Framework GitHub repo for documentation and help.
Topcoat -- This UI framework has been retired from App Designer. You can always use this (or any mobile) framework with the Intel XDK, but you will have to do so manually, without the help of the Intel XDK App Designer UI layout tool. If you wish to continue using Topcoat please visit the Topcoat project page and the Topcoat GitHub repo for documentation and help.
Ratchet -- This UI framework has been retired from App Designer. You can always use this (or any mobile) framework with the Intel XDK, but you will have to do so manually, without the help of the Intel XDK App Designer UI layout tool. If you wish to continue using Ratchet please visit the Ratchet project page and the Ratchet GitHub repo for documentation and help.
jQuery Mobile -- This UI framework has been retired from App Designer. You can always use this (or any mobile) framework with the Intel XDK, but you will have to do so manually, without the help of the Intel XDK App Designer UI layout tool. If you wish to continue using jQuery Mobile please visit the jQuery Mobile API page and jQuery Mobile GitHub page for documentation and help.
The "center type" parameter defines how the map view is centered in your div. It is used to initialize the map as follows:
This is just for initialization of the map widget. Beyond that you must use the standard Google maps APIs to move and/or modify the map. See the "google_maps.js" code for initialization of the widget and some calls to the Google maps APIs. There is also a pointer to the Google maps API at the beginning of the JS file.
To get the current position, you have to use the Geo API, and then push that into the Maps API to display it. The Google Maps API will not give you any device data, it will only display information for you. Please refer to the Intel XDK "Hello, Cordova" sample app for some help with the Geo API. There are a lot of useful comments and console.log messages.
Trying to implement "pixel perfect" user interfaces with HTML5 apps is not recommended, as there is a wide array of device resolutions and aspect ratios and it is impossible to ensure you are sized properly for every device. Instead, you should use "responsive web design" techniques to build your UI so that it adapts to different sizes automatically. You can also use the CSS media query directive to build CSS rules that are specific to different screen dimensions.
Note: The viewport is sized in CSS pixels (aka virtual pixels or device-independent pixels), so the physical pixel dimensions are not what you will normally be designing for.
The Intel XDK provides you with a way to build HTML5 apps that are run in a webview on the target device. This is analogous to running in an embedded browser (refer to this blog for details). Thus, the programming techniques are the same as those you would use inside a browser, when writing a single-page client-side HTML5 app. You can use the Intel XDK App Designer tool to drag and drop UI elements.
It could be that you are using an outdated version of the App Framework* files. You can find the recent versions here. You can safely replace any App Framework files that App Designer installed in your project with more recent copies as App Designer will not overwrite the new files.
You can replace the App Framework* files that the Intel XDK automatically inserted with more recent versions that can be found here. App designer will not overwrite your replacement.
This FAQ applies only to App Framework 2. App Framework 3 no longer includes a replacement for the jQuery selector library, it expects that you are using standard jQuery.
App Framework is a UI library that implements a subset of the jQuery* selector library. If you wish to use jQuery for XPath manipulation, it is recommended that you use jQuery as your selector library and not App Framework. However, it is also possible to use jQuery with the UI components of App Framework. Please refer to this entry in the App Framework docs.
It would look similar to this:
<script src="lib/jq/jquery.js"></script>
<script src="lib/af/jq.appframework.js"></script>
<script src="lib/af/appframework.ui.js"></script>
Ensure you have upgraded to the latest version of App Framework. If your app was built with the now retired Intel XDK "legacy" build system be sure to set the "Targeted Android Version" to 19 in the Android-Crosswalk build settings. The legacy build targeted Android 4.2.
If you want to, for example, change the theme only on Android*, you can add the following lines of code:
In App Framework the BODY is in the background and the page is in the foreground. If you set the background color on the body, you will see the page's background color. If you set the theme to default App Framework uses a native-like theme based on the device at runtime. Otherwise, it uses the App Framework Theme. This is normally done using the following:
<script>
  $(document).ready(function () {
    $.ui.useOSThemes = false;
  });
</script>
Please see Customizing App Framework UI Skin for additional details.
Currently, you can only create App Designer projects by selecting the blank 'HTML5+Cordova' template with app designer (select the app designer check box at the bottom of the template box) and the blank 'Standard HTML5' template with app designer.
App Designer versions of the layout and user interface templates were removed in the Intel XDK 3088 version.
The jQuery 1 library appears to be incompatible with the latest versions of the cordova-android framework. To fix this issue you can either upgrade your jQuery library to jQuery 2 or use a technique similar to that shown in the following test code fragment to check your AJAX return codes. See this forum thread for more details.
The jQuery site only tests jQuery 2 against Cordova/PhoneGap apps (the Intel XDK builds Cordova apps). See the How to Use It section of the jQuery 2.0 release announcement (https://blog.jquery.com/2013/04/18/jquery-2-0-released/) for more information.
If you built your app using App Designer, it may still be using jQuery 1.x rather than jQuery 2.x, in which case you need to replace the version of jQuery in your project. Simply download and replace the existing copy of jQuery 1.x in your project with the equivalent copy of jQuery 2.x.
Note, in particular, the switch case that checks for zero and 200. This test fragment does not cover all possible AJAX return codes, but should help you if you wish to continue to use a jQuery 1 library as part of your Cordova application.
function jqueryAjaxTest() {
    /* button #botRunAjax */
    $(document).on("click", "#botRunAjax", function (evt) {
        console.log("function started");
        var wpost = "e=132&c=abcdef&s=demoBASICA";
        $.ajax({
            type: "POST",
            crossDomain: true, //;paf; see http://stackoverflow.com/a/25109061/2914328
            url: "http://your.server.url/address",
            data: wpost,
            dataType: 'json',
            timeout: 10000
        })
        .always(function (retorno, textStatus, jqXHR) { //;paf; see http://stackoverflow.com/a/19498463/2914328
            console.log("jQuery version: " + $.fn.jquery);
            console.log("arg1:", retorno);
            console.log("arg2:", textStatus);
            console.log("arg3:", jqXHR);
            if (parseInt($.fn.jquery) === 1) {
                switch (retorno.status) {
                    case 0:
                    case 200:
                        console.log("exit OK");
                        console.log(JSON.stringify(retorno.responseJSON));
                        break;
                    case 404:
                        console.log("exit by FAIL");
                        console.log(JSON.stringify(retorno.responseJSON));
                        break;
                    default:
                        console.log("default switch happened");
                        console.log(JSON.stringify(retorno.responseJSON));
                        break;
                }
            }
            if ((parseInt($.fn.jquery) === 2) && (textStatus === "success")) {
                switch (jqXHR.status) {
                    case 0:
                    case 200:
                        console.log("exit OK");
                        console.log(JSON.stringify(jqXHR.responseJSON));
                        break;
                    case 404:
                        console.log("exit by FAIL");
                        console.log(JSON.stringify(jqXHR.responseJSON));
                        break;
                    default:
                        console.log("default switch happened");
                        console.log(JSON.stringify(jqXHR.responseJSON));
                        break;
                }
            } else {
                console.log("unknown");
            }
        });
    });
}
App Designer adds the data-uib and data-ver properties to many of the UI elements it creates. These property names only appear in the index.html file on various UI elements. There are other similar data properties, like data-sm, that only are required when you are using a service method.
The data-uib and data-ver properties are used only by App Designer. They are not needed by the UI frameworks supported by App Designer; they are used by App Designer to correctly display and apply widget properties when you are operating in the "design" view within App Designer. These properties are not critical to the functioning of your app; however, removing them will cause problems with the "design" view of App Designer.
The data-sm property is inserted by App Designer, and it may be used by data_support.js, along with other support libraries. The data-sm property is relevant to the proper functioning of your app.
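As an illustration, an App Designer widget in index.html might carry these attributes. The element, attribute values, and service-method path below are hypothetical, shown only to indicate where the attributes appear:

```html
<!-- Hypothetical App Designer widget: data-uib and data-ver are design-view -->
<!-- metadata used only by App Designer; data-sm ties the widget to a service -->
<!-- method and is used at runtime by support libraries such as data_support.js. -->
<button class="btn widget uib_w_1" data-uib="twitter%20bootstrap/button"
        data-ver="1" data-sm="srvc/myService">Get Data</button>
```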
If you previously created an App Designer project named 'ui-test' that you then delete and then create another App Designer project using the same name (e.g., 'ui-test'), you will not be given the option to select the UI framework for the new project named 'ui-test.' This is because the Intel XDK remembers a framework name for each project name that has been used and does not delete that entry from the global-settings.xdk file when you delete a project (e.g. if you chose "Framework 7" the first time you created an App Designer project with the name 'ui-test' then deleting 'ui-test' and creating a new 'ui-test' will result in another "Framework 7" project).
Because the UI framework name is not removed from the global-settings.xdk file when you delete the project, you must either use a new unique project name or edit the global-settings.xdk file to delete that old UI framework association. This is a bug that has been reported but has not yet been fixed. As a workaround, locate and delete the entry in global-settings.xdk that looks similar to the following (it will contain your project's path and name):

"FILE-/C/Users/xxx/Downloads/pkg/ui-test/www/index.html": {"canvas_width": 320,"canvas_height": 480,"framework": "framework 7" }
You should now see the list of App Designer framework UI selection options when you create the new project with a previously used project name that you have deleted.
Click the FAQ titles below to view specific FAQ pages.
Getting started as a new user; installing and updating the Intel XDK; questions related to the Brackets editor; differences between mobile platforms, etc.
Using Cordova* APIs; adding and using third-party plugins; selecting plugins for your app; AdMob and in-app purchases; Intel App Security; image capture; camera, etc.
Using the Crosswalk* runtime with Android*; why are Crosswalk app packages so large; controlling audio playback rate; Crosswalk GPU support; Crosswalk options, etc.
Enable testing via wifi with Intel App Preview; limitations of the Intel XDK simulator; debugging third-party plugins over USB, etc.
The App Designer layout editor; the App Framework library; creating and sizing UI elements; widget attributes; updating App Framework versions, etc.
Developing Internet of Things (IoT) NodeJS* apps using the Intel XDK; updating the MRAA library; connecting the Intel XDK to your IoT device; using the WebService API, etc.
IMPORTANT: In February 2017, the Crosswalk Project was retired. Crosswalk 23 was the last version of the Crosswalk library produced by the Crosswalk team. You can continue to build with the Crosswalk library using Cordova CLI or PhoneGap Build, but no further updates to the Crosswalk library will occur.
No bug fixes will be implemented for Crosswalk components.
You can continue to use Crosswalk in your project, but there will be no new releases of the Crosswalk library and the Intel XDK will not add any new versions of Crosswalk to the build settings. If you are deploying your app to Android 5 or greater there is no reason to use the Crosswalk library, since those versions of Android include an upgradeable native Chromium webview that is up-to-date and is as capable and as performant as the Crosswalk webview. If you are still deploying to Android 4.x devices you may want to continue to use Crosswalk for those devices. Unlike the native webview in Android 5+ devices, the native webview in Android 4.x devices cannot be upgraded and is quite limited.
Here is a code snippet that allows you to specify playback rate:
var myAudio = new Audio('/path/to/audio.mp3');
myAudio.play();
myAudio.playbackRate = 1.5;
When your app is built with Crosswalk it will be a minimum of 15-18MB in size because it includes a complete web browser (the Crosswalk runtime or webview) for rendering your app instead of the built-in webview on the device. Despite the additional size, this is the preferred solution for Android, because the built-in webviews on the majority of Android devices are inconsistent and poorly performing.
See these articles for more information:
This is because the apk is a compressed image, so when installed it occupies more space due to being decompressed. Also, when your Crosswalk app starts running on your device it will create some data files for caching purposes which will increase the installed size of the application.
The Intel XDK Crosswalk build system used with CLI 4.1.2 Crosswalk builds does not support the library project format that was introduced in the "com.google.playservices@21.0.0" plugin. Use "com.google.playservices@19.0.0" instead.
There are some Android devices in which the GPU hardware/software subsystem does not work properly. This is typically due to poor design or improper validation by the manufacturer of that Android device. Your problem Android device probably falls under this category.
See the code posted in this forum thread for a solution: /en-us/forums/topic/557191#comment-1827376.
An alternate solution is to add the following lines to your intelxdk.config.additions.xml file:

<!-- disable reset on vertical swipe down -->
<intelxdk:crosswalk xwalk-command-line="--disable-pull-to-refresh-effect" />
The specific versions of Crosswalk that are offered via the Intel XDK are based on what the Crosswalk project releases and the timing of those releases relative to Intel XDK build system updates. This is one of the reasons you do not see every version of Crosswalk supported by our Android-Crosswalk build system.
With the September 2015 release of the Intel XDK, the method used to build embedded Android-Crosswalk versions changed to the "pluggable" webview Cordova build system. This new build system was implemented with the help of the Cordova project and became available with their release of the Android Cordova 4.0 framework (coincident with their Cordova CLI 5 release). With this change to the Android Cordova framework and the Cordova CLI build system, we can now adapt more quickly to new releases of the Crosswalk project. Supporting previous Crosswalk releases required updating a special build system forked from the Cordova Android project; the "pluggable" webview approach lets us use the standard Cordova build system instead, because the Crosswalk library is now included as a "pluggable" component.
The "old" method of building Android-Crosswalk APKs relies on a "forked" version of the Cordova Android framework, based on the Cordova Android 3.6.3 framework, and is used when you select CLI 4.1.2 in the Project tab's build settings page. Only Crosswalk versions 7, 10, 11, 12, and 14 are supported by the Intel XDK when using this build setting.
Selecting CLI 5.1.1 in the build settings will generate a "pluggable" webview built app. A "pluggable" webview app (built with CLI 5.1.1) results in an app built with the Cordova Android 4.1.0 framework. As of the latest update to this FAQ, the CLI 5.1.1 build system supported Crosswalk 15. Future releases of the Intel XDK and the build system will support higher versions of Crosswalk and the Cordova Android framework.
In both cases, above, the net result (when performing an "embedded" build) will be two processor architecture-specific APKs: one for use on an x86 device and one for use on an ARM device. The version codes of those APKs are modified to ensure that both can be uploaded to the Android store under the same app name, ensuring that the appropriate APK is automatically delivered to the matching device (i.e., the x86 APK is delivered to Intel-based Android devices and the ARM APK is delivered to ARM-based Android devices).
For more information regarding Crosswalk and the Intel XDK, please review these documents:
Use the Ionic Keyboard plugin and set the spellcheck attribute to false.
Beginning with the Intel XDK CLI 5.1.1 build system, you must add the --ignore-gpu-blacklist option to your intelxdk.config.additions.xml file if you want the additional performance this option provides to blacklisted devices. See this forum post for additional details.
If you are a Construct2 game developer, please read this blog by another Construct2 game developer regarding how to configure your game for proper Crosswalk performance > How to build optimized Intel XDK Crosswalk app properly? <
Also, you can experiment with the CrosswalkAnimatable option in your intelxdk.config.additions.xml file (details regarding the CrosswalkAnimatable option are available in this Crosswalk Project wiki post: Android SurfaceView vs TextureView).
<!-- Controls configuration of Crosswalk-Android "SurfaceView" or "TextureView" -->
<!-- Default is SurfaceView if >= CW15 and TextureView if <= CW14 -->
<!-- Option can only be used with Intel XDK CLI5+ build systems -->
<!-- SurfaceView is preferred, TextureView should only be used in special cases -->
<!-- Enable Crosswalk-Android TextureView by setting this option to true -->
<preference name="CrosswalkAnimatable" value="false" />
See Chromium Command-Line Options for Crosswalk Builds with the Intel XDK for some additional tools that can be used to modify the Crosswalk webview's runtime parameters, especially the --ignore-gpu-blacklist option.
For full details, please read Android and Crosswalk Cordova Version Code Issues. For a summary, read this FAQ.
There is a change to the version code handling by the Crosswalk and Android build systems based on Cordova CLI 5.0 and later. This change was implemented by the Apache Cordova project. This new version of Cordova CLI automatically modifies the android:versionCode when building for Crosswalk and Android. Because our CLI 5.1.1 build system is now more compatible with standard Cordova CLI, this change results in a discrepancy in the way your android:versionCode is handled when building for Crosswalk (15) or Android with CLI 5.1.1 when compared to building with CLI 4.1.2.
If you have never published an app to an Android store this change will have little or no impact on you. This change might affect attempts to side-load an app onto a device, in which case the simplest solution is to uninstall the previously side-loaded app before installing the new app.
Here's what Cordova CLI 5.1.1 (Cordova-Android 4.x) is doing with the android:versionCode number (which you specify in the App Version Code field within the Build Settings section of the Projects tab):
Cordova-Android 4.x (Intel XDK CLI 5.1.1 for Crosswalk or Android builds) does this:
then, if you are doing a Crosswalk (15) build:
otherwise, if you are performing a standard Android build (non-Crosswalk):
If you HAVE PUBLISHED a Crosswalk app to an Android store this change may impact your ability to publish a newer version of your app! In that case, if you are building for Crosswalk, add 6000 (six with three zeroes) to your existing App Version Code field in the Crosswalk Build Settings section of the Projects tab. If you have only published standard Android apps in the past and are still publishing only standard Android apps you should not have to make any changes to the App Version Code field in the Android Builds Settings section of the Projects tab.
The workaround described above only applies to Crosswalk CLI 5.1.1 and later builds!
When you build a Crosswalk app with CLI 4.1.2 (which uses Cordova-Android 3.6) you will get the old Intel XDK behavior: 60000 or 20000 (six or two followed by four zeros) is added to the android:versionCode for Crosswalk builds, and no change is made to the android:versionCode for standard Android builds.
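The old CLI 4.1.2 offsets described above can be sketched as a small helper. The function name is hypothetical and purely illustrative; in practice the build system applies these offsets for you, not your app:

```javascript
// Sketch of the CLI 4.1.2 (Cordova-Android 3.6) android:versionCode offsets
// described above: 60000 is added for x86 Crosswalk builds, 20000 for ARM
// Crosswalk builds, and standard Android builds are left unchanged.
function cli412VersionCode(appVersionCode, build) {
  if (build === "crosswalk-x86") return 60000 + appVersionCode;
  if (build === "crosswalk-arm") return 20000 + appVersionCode;
  return appVersionCode; // standard Android build: unchanged
}

console.log(cli412VersionCode(123, "crosswalk-x86")); // 60123
console.log(cli412VersionCode(123, "crosswalk-arm")); // 20123
console.log(cli412VersionCode(123, "android"));       // 123
```

This is why a previously published Crosswalk app may need its App Version Code bumped when moving to the CLI 5.1.1 scheme: the two systems compute different final version codes from the same input.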
NOTE:
If you are using the WebGL 2D canvas APIs and your app crashes on some devices because you added the --ignore-gpu-blacklist flag to your intelxdk.config.additions.xml file, you may need to also add the --disable-accelerated-2d-canvas flag. Using the --ignore-gpu-blacklist flag enables the use of the GPU in some problem devices, but can then result in problems with some GPUs that are not blacklisted. The --disable-accelerated-2d-canvas flag allows those non-blacklisted devices to operate properly in the presence of WebGL 2D canvas APIs and the --ignore-gpu-blacklist flag.
You likely have this problem if your app crashes after running a few seconds with an error like the following:
<gsl_ldd_control:364>: ioctl fd 46 code 0xc00c092f (IOCTL_KGSL_GPMEM_ALLOC) failed: errno 12 Out of memory
<ioctl_kgsl_sharedmem_alloc:1176>: ioctl_kgsl_sharedmem_alloc: FATAL ERROR : (null)
See Chromium Command-Line Options for Crosswalk Builds with the Intel XDK for additional info regarding the --ignore-gpu-blacklist flag and other Chromium option flags.
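Based on the xwalk-command-line syntax used elsewhere in the intelxdk.config.additions.xml examples, the two flags could be passed together as shown below. This is a sketch; verify the combined flags against your Crosswalk version and device matrix:

```xml
<!-- Sketch: pass both Chromium flags to the Crosswalk webview in one line. -->
<intelxdk:crosswalk xwalk-command-line="--ignore-gpu-blacklist --disable-accelerated-2d-canvas" />
```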
See this tutorial on the Scirra tutorials site > How to use AdMob and IAP official plugins on Android-Crosswalk/XDK < written by Construct2 developer Kyatric.
Also, see this blog written by a Construct2 game developer regarding how to build a Construct2 app using the Appodeal ad plugin with your Construct2 app and the Intel XDK > How to fix the build error with Intel XDK and Appodeal? <.
The "Target Android API" value (aka android-targetSdkVersion), found in the Build Settings section of the Projects tab, is the version of Android that your app and the libraries associated with your app are tested against; it DOES NOT represent the maximum level of Android onto which you can install and run your app. When building a Crosswalk app, you should set this value to the value recommended by the Crosswalk project.
The recommended "Target Android API" levels for Crosswalk on Android apps are:
As of release 3088 of the Intel XDK, the recommended value for your android-targetSdkVersion is 21. In previous versions of the Intel XDK the recommended value was 19. If you have it set to a higher number (such as 23), we recommend that you change your setting to 21.
As of release 3088 of the Intel XDK, it is possible to build your Crosswalk for Android app using versions of the Crosswalk library that are not listed in the Project tab's Build Settings section. You can override the value that is selected in the Build Settings UI by adding a line to the intelxdk.config.additions.xml file.
NOTE: The process described below is for experts only! By using this process you are effectively disabling the Crosswalk version that is selected in the Build Settings UI and you are overriding the version of Crosswalk that will be used when you build a custom debug module with the Debug tab.
When building a Crosswalk for Android application with CLI 5.x and higher, the Cordova Crosswalk Webview Plugin is used to add the Crosswalk webview library to the build package (the APK). That plugin effectively "includes" the specified Crosswalk library when the app is built. The version of the Crosswalk library selected in the Build Settings UI is reflected in a line in the Android build config file, similar to the following:
<intelxdk:crosswalk version="16"/>
The line above is added automatically to the intelxdk.config.android.xml file by the Intel XDK. If you attempt to change lines in the Android build config file they will be overwritten by the Intel XDK each time you use the Build tab (perform a build) or the Test tab. In order to modify (or override) this line in the Android config file you need to add a line to the intelxdk.config.additions.xml file.
The precise line you include in the intelxdk.config.additions.xml file depends on the version of the Crosswalk library you want to include.
<!-- Set the Crosswalk embedded library to something other than those listed in the UI. -->
<!-- In practice use only one; multiple examples are shown for illustration. -->
<preference name="xwalkVersion" value="17+"/>
<preference name="xwalkVersion" value="14.43.343.24" />
<preference name="xwalkVersion" value="org.xwalk:xwalk_core_library_beta:18+"/>
The first example line in the code snippet above asks the Intel XDK to build with the "last" or "latest" version of the Crosswalk 17 release library (the '+' character means "last available" for the specified version). The second example requests an explicit version of Crosswalk 14 when building the app (e.g., version 14.43.343.24). The third example shows how to request the "latest" version of Crosswalk 18 from the Crosswalk beta Maven repository.
NOTE: only one such "xwalkVersion" preference tag should be used. If you include more than one "xwalkVersion" only the last one specified in the intelxdk.config.additions.xml file will be used.
The specific versions of Crosswalk that you can use can be determined by reviewing the Crosswalk Maven repositories: one for released Crosswalk libraries and one for beta versions of the Crosswalk library.
Not all Crosswalk libraries are guaranteed to work with your built app, especially the beta versions of the Crosswalk library. There may be library dependencies on the specific version of the Cordova Crosswalk Webview Plugin or the Cordova-Android framework. If a library does not work, select a different version.
Detailed instructions on the preference tag being used here are available in the Crosswalk Webview Plugin README.md documentation.
If you are curious when a specific version of Chromium will be supported by Crosswalk, please see the Crosswalk Release Dates wiki published by the Crosswalk Project.
The white box or white bands you see between the ending of the splash screen and the beginning of your app appears to be due to some webview initialization. It also appears in non-Crosswalk apps on Android, but does not show up as white. The white band that does appear can cause an initial "100% image" to bounce up and down momentarily. This issue is not being caused by the splash screen plugin or the Intel XDK; it appears to be interference caused by the Cordova webview initialization.
The following solution appears to work, although there may be some situations that it does not help. As this problem is better understood more information will be provided in this FAQ.
Add the following lines to your intelxdk.config.additions.xml file:
<platform name="android">
    <!-- set Crosswalk default background color -->
    <!-- see http://developer.android.com/reference/android/graphics/Color.html -->
    <preference name="BackgroundColor" value="0x00000000" />
</platform>
The value 0x00000000 configures the webview background color to be "transparent black," according to the Cordova documentation and the Crosswalk webview plugin code. You should be able to set that color to anything you want. However, this color appears to work the best.
You may also want to add the following to your intelxdk.config.additions.xml file:
<platform name="android">
    <!-- following requires the splash screen plugin -->
    <!-- see https://github.com/apache/cordova-plugin-splashscreen for details -->
    <preference name="SplashScreen" value="screen" />
    <preference name="AutoHideSplashScreen" value="false" />
    <!-- <preference name="SplashScreenDelay" value="30000" /> -->
    <preference name="FadeSplashScreen" value="false"/>
    <!-- <preference name="FadeSplashScreenDuration" value="3000"/> -->
    <preference name="ShowSplashScreenSpinner" value="false"/>
    <preference name="SplashMaintainAspectRatio" value="false" />
    <preference name="SplashShowOnlyFirstTime" value="false" />
</platform>
Testing of this fix was done with Crosswalk 17 on Android 4.4, 5.0, and 6.0 devices.
This article gets you started with hands-on development, execution, and profiling of Data Plane Development Kit (DPDK) applications on your own laptop. This approach improves portability and makes it easy to share DPDK with, and teach, developers, customers, and students in a scalable way.
M Jay has worked with the DPDK team since 2009. He joined Intel in 1991 and has been in various roles and divisions: 64-bit CPU front side bus architect, 64 bit HAL developer, among others, before he joined the DPDK team. M Jay holds 21 US patents, both individually and jointly, all issued while working at Intel. M Jay was awarded the Intel Achievement Award in 2016, Intel's highest honor based on innovation and results.
To run and profile DPDK on the Linux* platform, please refer to the article Profiling DPDK Code with Intel® VTune™ Amplifier. If you don’t want to install Linux on your laptop, follow the steps in this article to learn how to configure your Intel® architecture-based Windows* laptop to develop, run and get started profiling DPDK applications.
Intel® VTune™ Amplifier, a performance profiler, will run natively on the Windows* OS so that it can access all the hardware performance registers. Developing and running DPDK applications will be done on an Oracle VM VirtualBox*.
The instructions in this article were tested on an Intel® Xeon® processor-based desktop, server, and laptop. Here we will use a laptop with the Windows OS.
If you have an Apple* laptop, the appendix provides information about systems based on the Mac OS*.
The platform can be any Intel® processor-based platform: desktop, server, laptop, or embedded system.
This article covers the following steps:
This is needed to ensure 64-bit guests can be run; VT-d and VT-x need to be on.
Intel VT-d and Intel VT-x enabling will be found under Advanced CPU settings or Advanced Chipset settings, as described below. First, we need to get into safe mode and look at the BIOS settings.
To get into BIOS, press SHIFT+RESTART.
If you have a laptop installed with Windows* 8, go to safe mode (SHIFT+RESTART).
You will see the following settings. Note that depending on your computer, you may see different options.
To use advanced tools, choose Troubleshoot.
If the following screen displays, choose Enable Safe Mode to access the screen for the BIOS change.
Once you have selected safe mode, you will be able to access additional options, as shown below.
Select UEFI Firmware Settings
Note: In your system, it may be referred to as BIOS setting.
Depending on your vendor and BIOS, you will be able to access the Advanced setting or Advance Chipset Control or Advanced CPU Control. What you need to do is verify whether Intel VT is enabled. In certain BIOS models, it may display as VT-d and VT-x.
Some systems will have both a CPU section (for Intel VT-x) and a chipset section (for Intel VT-d) so you may have to look at both sections to enable virtualization.
Below are two screens: the CPU screen followed by the chipset screen. In this system, only the chipset screen has virtualization control.
Save and then exit.
Now the OS and applications come up.
To access the downloads, go to https://www.virtualbox.org/wiki/Downloads
For Windows:
For OS X*:
Select VirtualBox5.1.8 (or the latest) for OS X hosts.
Why install extension packs? What functionality do they provide?
Extension packs complement the functionality of VirtualBox with features such as support for USB 2.0 and USB 3.0 devices and the VirtualBox Remote Desktop Protocol (VRDP).
Now you are ready to import the VMs.
Note: If you don't see 64-bit versions and see only 32-bit versions, you need to enable Intel VT-d and Intel VT-x correctly. Return to the BIOS setting steps under Step 1: "Make sure Intel VT-d and Intel VT-x are enabled in UEFI firmware/BIOS."
In this article, we assume that you have plugged in a thumb drive with a copy of an exported DPDK application virtual machine that was built on a native Linux platform running DPDK. When you have connected the thumb drive, follow these instructions to import the VM.
Select the VM.
Example, as shown below: Ubuntu Nov 7 VTune DPDK.
Select Import.
You will see the appliance being imported as shown below.
You have successfully imported the DPDK virtual appliance, as shown by the arrow in the screenshot below.
Select the Imported DPDK appliance.
To start the imported DPDK appliance, click Start.
You have successfully launched DPDK running in the Ubuntu guest OS with VirtualBox on your laptop, as shown below.
Now you can start your own development by developing applications, building, and running. To get started, locate the README_FIRST file, as shown in the above screenshot. Click open and you’ll find instructions to run DPDK microbenchmarks and other applications.
Let's say you want to know where cycles are being spent in the system. You can use Task Manager to get a bird's-eye view first. Then you can dig in with Intel VTune Amplifier.
The screenshot below shows the CPU cycles and tasks running, with Windows Task Manager showing CPU utilization running the DPDK application as a guest with VirtualBox.
The next step is to open a terminal as an administrator. It is important to access Intel VTune Amplifier as an administrator.
cd "C:\Program Files (x86)\IntelSWTools\VTune Amplifier XE 2017\bin32"
amplxe-sepreg.exe -c
The following message screen should display.
The screen above indicates that you have successfully verified the dependency checks required to install the sampling driver.
amplxe-sepreg.exe -s
The following message screen should display.
The screen above indicates that the sampling driver loaded successfully.
NOTE: If the sampling driver did NOT successfully load, refer to Appendix 3. Do NOT enter the command in Appendix 3 if you see the above success message.
What’s next?
The default installation path for the Intel VTune™ Amplifier XE is [Program Files (x86)]\IntelSWTools\VTune Amplifier XE.

cd "C:\Program Files (x86)\IntelSWTools\VTune Amplifier XE 2017"

amplxe-vars.bat

Run the batch file as shown above. Once the environment variables are set successfully, you will see output as shown below.
The final step is to run Intel VTune Amplifier.
amplxe-gui
Run the VTune GUI with the command shown above. You will see the welcome screen as shown below.
Be sure to print the items circled below: Getting Started and Discover Performance Snapshots.
Start practicing by clicking New Project (also circled).
To get hands-on practice, please refer to the sections after “Starting Intel VTune Amplifier” in the following article: Profiling DPDK Code with Intel® VTune™ Amplifier
Also refer to the resources given in the reference section of the above article for videos and articles.
With the above hands-on exercise, you have successfully completed your “DPDK-On-The-Go” hands-on exercise.
As your first step, please register on the DPDK mailing list http://www.dpdk.org/ml/listinfo/dev
Also, we encourage you to play an active role in our meetups and DPDK community: www.dpdk.org
Please provide your feedback on this article to Muthurajan.Jayakumar@intel.com within 2 weeks after you go through your hands-on experience.
This article’s instructions were tested on a laptop with the Windows OS. Here are some references for the Mac regarding enabling Intel VT.
http://kb.parallels.com/en/5653
https://support.apple.com/en-us/HT203296
If the sampling driver is not installed but the system is supported by Intel VTune Amplifier, execute the following command with administrative privileges to install the driver:

amplxe-sepreg.exe -i
While Intel VTune Amplifier 2017 runs on Windows and Linux systems, the profiled results can be viewed on OS X.

So you can run DPDK applications with VirtualBox on Mac computers. For profiling, you can use the native tools that come with OS X.

You can also use the viewer described below to view the output Intel VTune Amplifier generated on Windows or Linux machines.
Please refer to the article How to Download and Evaluate the VTune™ Amplifier OS X* Viewer
DPDK-in-a-Box uses Minnowboard Turbot Single Board Computer.
Profiling DPDK Code with Intel® VTune™ Amplifier
Video: Intel® VTune™ and Performance Optimizations
DPDK Performance Optimization Guidelines White Paper
Intel® Cluster Checker verifies the configuration and performance of Linux based clusters and checks compliance with the Intel® Scalable System Framework architecture specification. If issues are found, Intel® Cluster Checker diagnoses the problems and may provide recommendations on how to repair the cluster.
Intel® Cluster Checker has the following features:
Intel® Cluster Checker is installed as part of the following suites:
The following flowchart represents the usage model for working with the Intel® Cluster Checker.
source /opt/intel/clck/2018.0/bin/clckvars.sh
frontend #role: head
node1
node2
node3
node4
Run the following from a command line. The nodefile should be in a shared and writable location.
clck-collect -a -f nodefile
Run this from a command line:
clck-analyze -f nodefile
Resolve any issues reported in step 2 and repeat steps 1 and 2 until you are satisfied with the results.
By default, diagnosed signs are not included in the analyzer output. If the analyzer reports issues, then it will be beneficial to output diagnosed signs on subsequent runs. More data about signs and diagnoses can be found in the User's Guide. Run this from a command line to print diagnosed signs:
clck-analyze -f nodefile -p diagnosed_signs
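The steps above can be collected into one sketch. The clck commands are commented out here because they require an Intel Cluster Checker installation and a running cluster; the nodefile contents are the ones from the example above:

```shell
#!/bin/sh
# Create the example nodefile in a shared, writable location.
cat > nodefile <<'EOF'
frontend #role: head
node1
node2
node3
node4
EOF

# With Intel Cluster Checker installed, the workflow continues:
# source /opt/intel/clck/2018.0/bin/clckvars.sh
# clck-collect -a -f nodefile                  # collect data on all nodes
# clck-analyze -f nodefile                     # analyze the collected data
# clck-analyze -f nodefile -p diagnosed_signs  # also print diagnosed signs

# Show the head node entry to confirm the file was written.
head -n 1 nodefile
```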
There will be occasions where modifications of the default XML configuration file are needed. This can happen when more output is desired, test parameters need to be modified, the log level must be changed, etc. More information can be found in the User's Guide.
Files will be installed into /opt/intel/clck/2018.0.
clck-collect --help
clck-analyze --help
clckdb --help
cp /opt/intel/clck/2018.0/etc/clck.xml ~
clck-analyze -f nodefile -c ~/clck.xml
All of the following documents can be found at https://software.intel.com/en-us/intel-cluster-checker-support/documentation:
Document | Description
---|---
Intel® Cluster Checker Developer's Guide | Contains a breakdown of the following components: the knowledge base, the connector, and the database schema.
Intel® Cluster Checker User's Guide | Contains a description of the product, including the following components and processes: the analyzer, knowledge base, connector, data collection, data providers, and the database schema.
Intel® Cluster Checker Release Notes | Contains a brief overview of the product, new features, system requirements, installation notes, documentation, known limitations, technical support, and the disclaimer and legal information.
Introduction
An Overview of the Classic Matrix Multiplication Algorithm
Total Number of Floating Point Operations
Implementation Complexity
Optimization Techniques
Memory Allocation Schemes
Loop Processing Schemes
Compute Schemes
Error Analysis
Performance on Intel® Xeon Phi™ Processor System
OpenMP* Product Thread Affinity Control
Recommended Intel® C++ Compiler Command-Line Options
Conclusion
References
Downloads
Abbreviations
Appendix A - Technical Specifications of Intel Xeon Phi Processor System
Appendix B - Comparison of Processing Times for MMAs vs. MTA
Appendix C - Error Analysis (Absolute Errors for SP FP Data Type)
Appendix D - Performance of MMAs for Different MASs
About the Author
Matrix multiplication (MM) of two matrices is one of the most fundamental operations in linear algebra. The algorithm for MM is very simple: it can be easily implemented in any programming language, and its performance improves significantly when different optimization techniques are applied.
Several versions of the classic matrix multiplication algorithm (CMMA) to compute a product of square dense matrices are evaluated in four test programs. Performance of these CMMAs is compared to a highly optimized 'cblas_sgemm' function of the Intel® Math Kernel Library (Intel® MKL)7. Tests are completed on a computer system with Intel® Xeon Phi™ processor 72105 running the Red Hat* Linux operating system in 'All2All' Cluster mode and for 'Flat', 'Hybrid 50-50', and 'Cache' MCDRAM modes.
All versions of CMMAs for single and double precision floating point data types described in the article are implemented in the C programming language and compiled with Intel® C++ Compiler versions 17 and 16 for Linux*6.
The article targets experienced C/C++ software engineers and can be considered as a reference on application optimization techniques, analysis of performance, and accuracy of computations related to MMAs.
If needed, the reader may review the contents of References1 or2 for a description of mathematical fundamentals of MM, because theoretical topics related to MM are not covered in this article.
A fundamental property of any algorithm is its asymptotic complexity (AC)3.
In generic form, AC for MMA can be expressed as follows:
MMA AC = O(N^Omega)
where O denotes the order of growth, known in computer science as Big O notation; N is one dimension of the matrix; and omega is the matrix multiplication exponent, which equals 3.0 for CMMA. That is:
CMMA AC = O(N^3)
In order to compute a product of two square matrices using CMMA, a cubic number of floating point (FP) multiplication operations is required. In other words, the CMMA runs in O(N^3) time.
An omega lower than 3.0 is possible, and it means that an MMA computes a product of two matrices faster because an optimization technique, mathematical or programming, is applied and fewer FP multiplication operations are required to compute the product.
A list of several MMAs with different values of omega is as follows:
Algorithm | Omega | Note |
---|---|---|
Francois Le Gall | 2.3728639 | (1) |
Virginia Vassilevska Williams | 2.3728642 | |
Stothers | 2.3740000 | |
Coppersmith-Winograd | 2.3760000 | |
Bini | 2.7790000 | |
Pan | 2.7950000 | |
Strassen | 2.8070000 | (2) |
Strassen-Winograd | 2.8070000 | |
Classic | 3.0000000 | (3) |
Table 1. Algorithms are sorted by omega in ascending order.
Let's assume that:
M x N is a dimension of a matrix A, or A[M,N]
N x P is a dimension of a matrix B, or B[N,P]
M x P is a dimension of a matrix C, or C[M,P]
There are three relations between M, N and P:
Relation #1: A[...,N] = B[N,...]
Relation #2: A[M,...] = C[M,...]
Relation #3: B[...,P] = C[...,P]
If one of these three relations is not met, the product of two matrices cannot be computed.
In this article only square matrices of dimension N, where M = N = P, will be considered. Therefore:
A[N,N] is the same as A[M,N]
B[N,N] is the same as B[N,P]
C[N,N] is the same as C[M,P]
The following table shows how many multiplications are needed to compute a product of two square matrices of different Ns for three algorithms from Table 1 with omega = 2.3728639 (1), omega = 2.807 (2) and omega = 3.0 (3).
N | Omega = 2.3728639 (1) | Omega = 2.807 (2) | Omega = 3.0 (3) |
---|---|---|---|
128 | 100,028 | 822,126 | 2,097,152 |
256 | 518,114 | 5,753,466 | 16,777,216 |
512 | 2,683,668 | 40,264,358 | 134,217,728 |
1024 | 13,900,553 | 281,781,176 | 1,073,741,824 |
2048 | 72,000,465 | 1,971,983,042 | 8,589,934,592 |
4096 | 372,939,611 | 13,800,485,780 | 68,719,476,736 |
8192 | 1,931,709,091 | 96,579,637,673 | 549,755,813,888 |
16384 | 10,005,641,390 | 675,891,165,093 | 4,398,046,511,104 |
32768 | 51,826,053,965 | 4,730,074,351,662 | 35,184,372,088,832 |
65536 | 268,442,548,034 | 33,102,375,837,652 | 281,474,976,710,656 |
Table 2.
For example, to compute a product of two square dense matrices of dimension N equal to 32,768, Francois Le Gall (1) MMA needs ~51,826,053,965 multiplications and Classic (3) MMA needs ~35,184,372,088,832 multiplications.
Imagine that the product of two square matrices with N equal to 32,768 needs to be computed on a very slow computer system. If the Francois Le Gall MMA completes the processing in one day, then the classic MMA will need ~679 days, or almost two years, on the same system. This is because the Francois Le Gall MMA needs ~679x fewer multiplications to compute the product:
~35,184,372,088,832 / ~51,826,053,965 = ~678.9
In the case of using a famous Strassen (2) MMA, ~91 days would be needed:
~4,730,074,351,662 / ~51,826,053,965 = ~91.3
In many software benchmarks the performance of an algorithm, or some processing, is measured in floating point operations per second (FLOPS), and not in elapsed time intervals, like days, hours, minutes, or seconds. That is why it is very important to know an exact total number (TN) of FP operations completed to calculate a FLOPS value.
With modern C++ compilers, it is very difficult to estimate an exact TN of FP operations per unit of time completed at run time due to extensive optimizations of generated binary codes. It means that an analysis of binary codes could be required, and this is outside of the scope of this article.
However, an estimate of the TN of FP operations (multiplications and additions) for CMMA on square matrices can be easily calculated. Here are two simple examples:
Example 1: N = 2
Multiplications = 8    // 2 * 2 * 2 = 2^3
Additions       = 4    // 2 * 2 * 1 = 2^2 * (2-1)
TN FP Ops       = 8 + 4 = 12
Example 2: N = 3
Multiplications = 27   // 3 * 3 * 3 = 3^3
Additions       = 18   // 3 * 3 * 2 = 3^2 * (3-1)
TN FP Ops       = 27 + 18 = 45
It is apparent that the TN of FP operations to compute a product of two square matrices can be calculated using a simple formula:
TN FP Ops = (N^3) + ((N^2) * (N-1))
Note: Take into account that in the versions of the MMA used for sparse matrices, no FP operations are performed if the matrix element at position (i,j) is equal to zero.
In the C programming language only four lines of code are needed to implement a core part of the CMMA:
for( i = 0; i < N; i += 1 )
    for( j = 0; j < N; j += 1 )
        for( k = 0; k < N; k += 1 )
            C[i][j] += A[i][k] * B[k][j];
Therefore, CMMA's implementation complexity (IC) could be rated as very simple.
Declarations of all intermediate variables, memory allocations, and initialization of matrices are usually not taken into account.
More complex versions of MMA, like Strassen or Strassen-Winograd, could have several thousands of code lines.
In computer programming, matrices could be represented in memory as 1-D or 2-D data structures.
Here is a static declaration of matrices A, B, and C as 1-D data structures of a single precision (SP) FP data type (float):
float fA[N*N];
float fB[N*N];
float fC[N*N];
and this is what a core part of the CMMA looks like:
for( i = 0; i < N; i += 1 )
    for( j = 0; j < N; j += 1 )
        for( k = 0; k < N; k += 1 )
            C[N*i+j] += A[N*i+k] * B[N*k+j];
Here is a static declaration of matrices A, B, and C as 2-D data structures of a single precision (SP) FP data type (float):
float fA[N][N];
float fB[N][N];
float fC[N][N];
and this is what the core part of CMMA looks like:
for( i = 0; i < N; i += 1 )
    for( j = 0; j < N; j += 1 )
        for( k = 0; k < N; k += 1 )
            C[i][j] += A[i][k] * B[k][j];
Many other variants of the core part of CMMA are possible and they will be reviewed.
In the previous section of this article, two examples of a static declaration of matrices A, B, and C were given. In the case of dynamic allocation of memory for matrices, explicit calls to memory allocation functions need to be made. In this case, declarations and allocations of memory can look like the following:
Declaration of matrices A, B, and C as 1-D data structures:
__attribute__( ( aligned( 64 ) ) ) float *fA;
__attribute__( ( aligned( 64 ) ) ) float *fB;
__attribute__( ( aligned( 64 ) ) ) float *fC;
and this is how memory needs to be allocated:
fA = ( float * )_mm_malloc( ( size_t )N * N * sizeof( float ), 64 );
fB = ( float * )_mm_malloc( ( size_t )N * N * sizeof( float ), 64 );
fC = ( float * )_mm_malloc( ( size_t )N * N * sizeof( float ), 64 );
Note: Allocated memory blocks are 64-byte aligned, contiguous, and not fragmented by an operating system memory manager; this improves performance of processing.
Declaration of matrices A, B, and C as 2-D data structures:
__attribute__( ( aligned( 64 ) ) ) float **fA;
__attribute__( ( aligned( 64 ) ) ) float **fB;
__attribute__( ( aligned( 64 ) ) ) float **fC;
and this is how memory needs to be allocated:
fA = ( float ** )calloc( N, sizeof( float * ) );
fB = ( float ** )calloc( N, sizeof( float * ) );
fC = ( float ** )calloc( N, sizeof( float * ) );
for( i = 0; i < N; i += 1 )
{
    fA[i] = ( float * )calloc( N, sizeof( float ) );
    fB[i] = ( float * )calloc( N, sizeof( float ) );
    fC[i] = ( float * )calloc( N, sizeof( float ) );
}
Note: Allocated memory blocks are not contiguous and can be fragmented by an operating system memory manager, and fragmentation can degrade performance of processing.
In the previous examples, a DDR4-type RAM memory was allocated for matrices. However, on an Intel Xeon Phi processor system5 a multichannel DRAM (MCDRAM)-type RAM memory could be allocated as well, using functions from a memkind library11 when MCDRAM mode is configured to 'Flat' or 'Hybrid'. For example, this is how an MCDRAM-type RAM memory can be allocated:
fA = ( float * )hbw_malloc( ( size_t )N * N * sizeof( float ) );
fB = ( float * )hbw_malloc( ( size_t )N * N * sizeof( float ) );
fC = ( float * )hbw_malloc( ( size_t )N * N * sizeof( float ) );
Note: An 'hbw_malloc' function of the memkind library was used instead of an '_mm_malloc' function.
On an Intel Xeon Phi processor system, eight variants of memory allocation for matrices A, B, and C are possible:
Matrix A | Matrix B | Matrix C | Note |
---|---|---|---|
DDR4 | DDR4 | DDR4 | (1) |
DDR4 | DDR4 | MCDRAM | (2) |
DDR4 | MCDRAM | DDR4 | |
DDR4 | MCDRAM | MCDRAM | |
MCDRAM | DDR4 | DDR4 | |
MCDRAM | DDR4 | MCDRAM | |
MCDRAM | MCDRAM | DDR4 | |
MCDRAM | MCDRAM | MCDRAM |
Table 3.
It is recommended to use MCDRAM memory as much as possible because its bandwidth is ~400 GB/s, and it is ~5 times faster than the ~80 GB/s bandwidth of DDR4 memory5.
Here is an example of how 'cblas_sgemm' MMA performs for two memory allocation schemes (MASs) (1) and (2):
Matrix multiplication C=A*B where matrix A (32768x32768) and matrix B (32768x32768)

Allocating memory for matrices A, B, C: MAS=DDR4:DDR4:DDR4
Initializing matrix data
Matrix multiplication started
Matrix multiplication completed at 50.918 seconds

Allocating memory for matrices A, B, C: MAS=DDR4:DDR4:MCDRAM
Initializing matrix data
Matrix multiplication started
Matrix multiplication completed at 47.385 seconds
It is clear that there is a performance improvement of ~7 percent when an MCDRAM memory was allocated for matrix C.
A loop processing scheme (LPS) describes what optimization techniques are applied to the 'for' statements of the C language of the core part of CMMA. For example, the following code:
for( i = 0; i < N; i += 1 )         // loop 1
    for( j = 0; j < N; j += 1 )     // loop 2
        for( k = 0; k < N; k += 1 ) // loop 3
            C[i][j] += A[i][k] * B[k][j];
corresponds to an LPS=1:1:1, and it means that loop counters are incremented by 1.
Table 4 below includes short descriptions of different LPSs:
LPS | Note |
---|---|
1:1:1 | Loops not unrolled |
1:1:2 | 3rd loop unrolls to 2-in-1 computations |
1:1:4 | 3rd loop unrolls to 4-in-1 computations |
1:1:8 | 3rd loop unrolls to 8-in-1 computations |
1:2:1 | 2nd loop unrolls to 2-in-1 computations |
1:4:1 | 2nd loop unrolls to 4-in-1 computations |
1:8:1 | 2nd loop unrolls to 8-in-1 computations |
Table 4.
For example, the following code corresponds to an LPS=1:1:2, and it means that counters 'i' and 'j' for loops 1 and 2 are incremented by 1, and counter 'k' for loop 3 is incremented by 2:
for( i = 0; i < N; i += 1 )         // :1
{
    for( j = 0; j < N; j += 1 )     // :1
    {
        for( k = 0; k < N; k += 2 ) // :2 (unrolled loop)
        {
            C[i][j] += A[i][k  ] * B[k  ][j];
            C[i][j] += A[i][k+1] * B[k+1][j];
        }
    }
}
Note: A C++ compiler could also unroll loops if command-line options for unrolling are used. A software engineer should avoid combining compiler unrolling with unrolling in the source code, because doing both can prevent vectorization of inner loops and degrade performance of processing.
Another optimization technique is the loop interchange optimization technique (LIOT). When the LIOT is used, a core part of CMMA looks as follows:
for( i = 0; i < N; i += 1 )         // loop 1
    for( k = 0; k < N; k += 1 )     // loop 2
        for( j = 0; j < N; j += 1 ) // loop 3
            C[i][j] += A[i][k] * B[k][j];
It is worth noting that counters 'j' and 'k' for loops 2 and 3 were exchanged.
Loop unrolling and the LIOT improve performance of processing because elements of matrices A and B are accessed more efficiently.
A compute scheme (CS) describes the computation of final or intermediate values and how elements of matrices are accessed.
In a CMMA an element (i,j) of the matrix C is calculated as follows:
C[i][j] += A[i][k] * B[k][j]
and its CS is ij:ik:kj.
However, elements of matrix B are accessed in a very inefficient way. That is, the next element of matrix B, which needs to be used in the calculation, is located at a distance of (N * sizeof (datatype)) bytes. For very small matrices this is not critical because they can fit into CPU caches. However, for larger matrices it affects performance of computations, which can be significantly degraded, due to cache misses.
In order to solve that problem and improve performance of computations, a very simple optimization technique is used. If matrix B is transposed, the next element that needs to be used in the calculation will be located at a distance of (sizeof (datatype)) bytes. Thus, access to the elements of matrix B will be similar to the access to the elements of matrix A.
In a transpose-based CMMA, an element (i,j) of the matrix C is calculated as follows:
C[i][j] += A[i][k] * B[j][k]
and its CS is ij:ik:jk. Here B[j][k] is used instead of B[k][j].
It is very important to use the fastest possible algorithm for the matrix B transposition before processing is started. In Appendix B an example is given on how much time is needed to transpose a square matrix of 32,768 elements, and how much time is needed to compute the product on an Intel Xeon Phi processor system.
Another optimization technique is the loop blocking optimization technique (LBOT) and it allows the use of smaller subsets of A, B, and C matrices to compute the product. When the LBOT is used, a core part of CMMA looks as follows:
for( i = 0; i < N; i += BlockSize )
{
    for( j = 0; j < N; j += BlockSize )
    {
        for( k = 0; k < N; k += BlockSize )
        {
            for( ii = i; ii < ( i+BlockSize ); ii += 1 )
                for( jj = j; jj < ( j+BlockSize ); jj += 1 )
                    for( kk = k; kk < ( k+BlockSize ); kk += 1 )
                        C[ii][jj] += A[ii][kk] * B[kk][jj];
        }
    }
}
Note: A detailed description of LBOT can be found at10.
Table 5 shows four examples of CSs:
CS | Note |
---|---|
ij:ik:kj | Default |
ij:ik:jk | Transposed |
iijj:iikk:kkjj | Default LBOT |
iijj:iikk:jjkk | Transposed LBOT |
Table 5.
In any version of MMA many FP operations need to be done in order to compute values of elements of matrix C. Since FP data types SP or DP have limited precision4, rounding errors accumulate very quickly. A common misconception is that rounding errors can occur only in cases where large or very large matrices need to be multiplied. This is not true because, in the case of floating point arithmetic (FPA), a rounding error is also a function of the range of an input value, and it is not a function of the size of input matrices.
However, a very simple optimization technique allows improvement in the accuracy of computations.
If matrices A and B are declared as an SP FP data type, then intermediate values could be stored in a variable of DP FP data type:
for( i = 0; i < N; i += 1 )
{
    for( j = 0; j < N; j += 1 )
    {
        double sum = 0.0;
        for( k = 0; k < N; k += 1 )
        {
            sum += ( double )( A[i][k] * B[k][j] );
        }
        C[i][j] = sum;
    }
}
The accuracy of computations will be improved, but performance of processing can be lower.
An error analysis (EA) is completed using the mmatest4.c test program for different sizes of matrices of SP and DP FP data types (see Table 6 in Appendix C with results).
Several versions of the CMMA to compute a product of square dense matrices are evaluated in four test programs. Performance of these CMMAs is compared to a highly optimized 'cblas_sgemm' function of the Intel MKL7. Also see Appendix D for more evaluations.
Figure 1. Performance tests for matrix multiply algorithms on Intel® Xeon Phi™ processor using mmatest1.c with KMP_AFFINITY environment variable set to 'scatter', 'balanced', and 'compact'. A lower bar height means faster processing.
Here are the names of source files with a short description of tests:
mmatest1.c - Performance tests for matrix multiply algorithms on an Intel Xeon Phi processor.
mmatest2.c - Performance tests for matrix multiply algorithms on an Intel Xeon Phi processor in one MCDRAM mode ('Flat') for DDR4:DDR4:DDR4 and DDR4:DDR4:MCDRAM MASs.
mmatest3.c - Performance tests for matrix multiply algorithms on an Intel Xeon Phi processor in three MCDRAM modes ('All2All', 'Flat', and 'Cache') for DDR4:DDR4:DDR4 and MCDRAM:MCDRAM:MCDRAM MASs. Note: In 'Cache' MCDRAM mode, MCDRAM:MCDRAM:MCDRAM MAS cannot be used.
mmatest4.c - Verification of the accuracy of computations of matrix multiply algorithms on an Intel Xeon Phi processor.
OpenMP* compiler directives can easily be used to parallelize and significantly speed up processing. However, it is very important to pin OpenMP threads to different logical CPUs of modern multicore processors in order to utilize their internal resources as fully as possible.
In the case of using the Intel C++ compiler and Intel OpenMP run-time libraries, the KMP_AFFINITY environment variable provides flexibility and simplifies that task. Here are three simple examples of using the KMP_AFFINITY environment variable:
KMP_AFFINITY = scatter
KMP_AFFINITY = balanced
KMP_AFFINITY = compact
These two screenshots of the Htop* utility12 demonstrate how OpenMP threads are assigned (pinned) to Intel Xeon Phi processor 72105 logical CPUs during processing of an MMA using 64 cores of the processor:
Screenshot 1. KMP_AFFINITY = scatter or balanced. Note: Processing is faster when compared to KMP_AFFINITY = compact.
Screenshot 2. KMP_AFFINITY = compact. Note: Processing is slower when compared to KMP_AFFINITY = scatter or balanced.
Here is a list of Intel C++ Compiler command-line options that a software engineer should consider, which can improve performance of processing of CMMAs:
O3
fp-model
parallel
unroll
unroll-aggressive
opt-streaming-stores
opt-mem-layout-trans
Os
openmp
ansi-alias
fma
opt-matmul
opt-block-factor
opt-prefetch
The reader can use 'icpc -help' or 'icc -help' to learn more about these command-line options.
The application of different optimization techniques to the CMMA was reviewed in this article.
Three versions of CMMA to compute a product of square dense matrices were evaluated in four test programs. Performance of these CMMAs was compared to a highly optimized 'cblas_sgemm' function of the Intel MKL7.
Tests were completed on a computer system with an Intel® Xeon Phi processor 72105 running the Red Hat* Linux operating system in 'All2All' Cluster mode and for 'Flat', 'Hybrid 50-50', and 'Cache' MCDRAM modes.
It was demonstrated that CMMA could be used for cases when matrices of small sizes, up to 1,024 x 1,024, need to be multiplied.
It was demonstrated that performance of MMAs is higher when MCDRAM-type RAM memory is allocated for matrices with sizes up to 16,384 x 16,384 instead of DDR4-type RAM memory.
Advantages of using CMMA to compute the product of two matrices are as follows:
Disadvantages of using CMMA to compute a product of two matrices are as follows:
1. Matrix Multiplication on Mathworld
http://mathworld.wolfram.com/MatrixMultiplication.html
2. Matrix Multiplication on Wikipedia
https://en.wikipedia.org/wiki/Matrix_multiplication
3. Asymptotic Complexity of an Algorithm
https://en.wikipedia.org/wiki/Time_complexity
4. The IEEE 754 Standard for Floating Point Arithmetic
5. Intel® Many Integrated Core Architecture
https://software.intel.com/en-us/xeon-phi/x200-processor
http://ark.intel.com/products/94033/Intel-Xeon-Phi-Processor-7210-16GB-1_30-GHz-64-core
https://software.intel.com/en-us/forums/intel-many-integrated-core
6. Intel® C++ Compiler
https://software.intel.com/en-us/c-compilers
https://software.intel.com/en-us/forums/intel-c-compiler
7. Intel® MKL
https://software.intel.com/en-us/intel-mkl
https://software.intel.com/en-us/intel-mkl/benchmarks
https://software.intel.com/en-us/forums/intel-math-kernel-library
8. Intel® Developer Zone Forums
https://software.intel.com/en-us/forum
9. Optimizing Matrix Multiply for Intel® Processor Graphics Architecture Gen 9
https://software.intel.com/en-us/articles/sgemm-ocl-opt
10. Performance Tools for Software Developers Loop Blocking
https://software.intel.com/en-us/articles/performance-tools-for-software-developers-loop-blocking
11. Memkind library
https://github.com/memkind/memkind
12. Htop* monitoring utility
https://sourceforge.net/projects/htop
List of all files (sources, test reports, and so on):
Performance_CMMA_system.pdf - Copy of this paper.
mmatest1.c - Performance tests for matrix multiply algorithms on Intel® Xeon Phi processors.
dataset1.txt - Results of tests.
mmatest2.c - Performance tests for matrix multiply algorithms on Intel® Xeon Phi processors for DDR4:DDR4:DDR4 and DDR4:DDR4:MCDRAM MASs.
dataset2.txt - Results of tests.
mmatest3.c - Performance tests for matrix multiply algorithms on Intel® Xeon Phi processors in three MCDRAM modes for DDR4:DDR4:DDR4 and MCDRAM:MCDRAM:MCDRAM MASs.
dataset3.txt - Results of tests.
mmatest4.c - Verification of the accuracy of computations of matrix multiply algorithms on Intel® Xeon Phi processors.
dataset4.txt - Results of tests.
Note: Intel C++ Compiler versions used to compile tests:
17.0.1 Update 132 for Linux*
16.0.3 Update 210 for Linux*
CPU - Central processing unit
GPU - Graphics processing unit
ISA - Instruction set architecture
MIC - Intel® Many Integrated Core Architecture
RAM - Random access memory
DRAM - Dynamic random access memory
MCDRAM - Multichannel DRAM
HBW - High bandwidth memory
DDR4 - Double data rate (generation) 4
SIMD - Single instruction multiple data
SSE - Streaming SIMD extensions
AVX - Advanced vector extensions
FP - Floating point
FPA - Floating point arithmetic4
SP - Single precision4
DP - Double precision4
FLOPS - Floating point operations per second
MM - Matrix multiplication
MMA - Matrix multiplication algorithm
CMMA - Classic matrix multiplication algorithm
MTA - Matrix transpose algorithm
AC - Asymptotic complexity
IC - Implementation complexity
EA - Error analysis
MAS - Memory allocation scheme
LPS - Loop processing scheme
CS - Compute scheme
LIOT - Loop interchange optimization technique
LBOT - Loop blocking optimization technique
ICC - Intel C++ Compiler6
MKL - Math kernel library7
CBLAS - C basic linear algebra subprograms
IDZ - Intel® Developer Zone8
IEEE - Institute of Electrical and Electronics Engineers4
GB - Gigabytes
TN - Total number
Summary of the Intel Xeon Phi processor system used for testing:
Process technology: 14nm
Processor name: Intel Xeon Phi processor 7210
Frequency: 1.30 GHz
Packages (sockets): 1
Cores: 64
Processors (CPUs): 256
Cores per package: 64
Threads per core: 4
On-Package Memory: 16 GB high bandwidth MCDRAM (bandwidth ~400 GB/s)
DDR4 Memory: 96 GB 6 Channel (Bandwidth ~ 80 GB/s)
ISA: Intel® AVX-512 (Vector length 512-bit)
Detailed processor specifications:
http://ark.intel.com/products/94033/Intel-Xeon-Phi-Processor-7210-16GB-1_30-GHz-64-core
Summary of a Linux operating system:
[guest@... ~]$ uname -a
Linux c002-n002 3.10.0-327.13.1.el7.xppsl_1.4.0.3211.x86_64 #1 SMP
Fri Jul 8 11:44:24 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
[guest@... ~]$ cat /proc/version
Linux version 3.10.0-327.13.1.el7.xppsl_1.4.0.3211.x86_64 (qb_user@89829b4f89a5)
(gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC)) #1 SMP Fri Jul 8 11:44:24 UTC 2016
Comparison of processing times for Intel MKL 'cblas_sgemm' and CMMA vs. MTA:
[Intel MKL & CMMA]
Matrix A [32768 x 32768] Matrix B [32768 x 32768]
Number of OpenMP threads: 64
MKL - Completed in: 51.2515874 seconds
CMMA - Completed in: 866.5838490 seconds
[MTA]
Matrix size: 32768 x 32768
Transpose Classic - Completed in: 1.730 secs
Transpose Diagonal - Completed in: 1.080 secs
Transpose Eklundh - Completed in: 0.910 secs
When comparing the processing time of the MTA to:
MKL 'cblas_sgemm', the transposition takes ~2.42 percent of the processing time.
CMMA, the transposition takes ~0.14 percent of the processing time.
N | MMA | Calculated SP Value | Absolute Error |
---|---|---|---|
8 | MKL | 8.000080 | 0.000000 |
8 | CMMA | 8.000080 | 0.000000 |
16 | MKL | 16.000160 | 0.000000 |
16 | CMMA | 16.000160 | 0.000000 |
32 | MKL | 32.000309 | -0.000011 |
32 | CMMA | 32.000320 | 0.000000 |
64 | MKL | 64.000671 | 0.000031 |
64 | CMMA | 64.000641 | 0.000001 |
128 | MKL | 128.001160 | -0.000120 |
128 | CMMA | 128.001282 | 0.000002 |
256 | MKL | 256.002319 | -0.000241 |
256 | CMMA | 256.002563 | 0.000003 |
512 | MKL | 512.004639 | -0.000481 |
512 | CMMA | 512.005005 | -0.000115 |
1024 | MKL | 1024.009521 | -0.000719 |
1024 | CMMA | 1024.009888 | -0.000352 |
2048 | MKL | 2048.019043 | -0.001437 |
2048 | CMMA | 2048.021484 | 0.001004 |
4096 | MKL | 4096.038574 | -0.002386 |
4096 | CMMA | 4096.037109 | -0.003851 |
8192 | MKL | 8192.074219 | -0.007701 |
8192 | CMMA | 8192.099609 | 0.017689 |
16384 | MKL | 16384.14648 | -0.017356 |
16384 | CMMA | 16384.09961 | -0.064231 |
32768 | MKL | 32768.33594 | 0.008258 |
32768 | CMMA | 32768.10156 | -0.226118 |
65536 | MKL | 65536.71875 | 0.063390 |
65536 | CMMA | 65536.10156 | -0.553798 |
Table 6.
Figure 2. Performance of Intel® MKL 'cblas_sgemm'. KMP_AFFINITY environment variable set to 'scatter'. Cluster mode: 'All2All'. MCDRAM mode: 'Flat'. Test program mmatest2.c. A lower bar height means faster processing.
Figure 3. Performance of Intel® MKL 'cblas_sgemm' vs. CMMA. KMP_AFFINITY environment variable set to 'scatter'. Cluster mode: 'All2All'. MCDRAM mode: 'Flat'. Test program mmatest3.c. A lower bar height means faster processing.
Figure 4. Performance of Intel® MKL 'cblas_sgemm' vs. CMMA. KMP_AFFINITY environment variable set to 'scatter'. Cluster mode: 'All2All'. MCDRAM mode: 'Hybrid 50-50'. Test program mmatest3.c. A lower bar height means faster processing.
Figure 5. Performance of Intel® MKL 'cblas_sgemm' vs. CMMA. KMP_AFFINITY environment variable set to 'scatter'. Cluster mode: 'All2All'. MCDRAM mode: 'Cache'. Test program mmatest3.c. A lower bar height means faster processing.
Sergey Kostrov is a highly experienced C/C++ software engineer and Intel® Black Belt Developer. He is an expert in design and implementation of highly portable C/C++ software for embedded and desktop platforms, scientific algorithms, and high performance computing of big data sets.