
Accelerating Secondary Genome Analysis Using Intel® Reference Architecture


The dramatic reduction in whole human genome sequencing costs, from USD 100 million per genome in 2001 to USD 4,500 per genome in 2014, combined with continuing performance gains in computing technology, is revitalizing the healthcare and life sciences industries in ways only imagined a few years ago.

In fact, the healthcare and life sciences industries are reaching an exciting new inflection point, where they are shifting from population-based healthcare to personalized medicine, and where diagnostics and treatments are prescribed based on each person’s health history and genetic profile.

But many technical and policy challenges remain that must be addressed to enable ubiquitous genomics-based medicine and research. While recent U.S. and European laws have gone a long way in evolving healthcare and healthcare research policy, there is still much work to do on the technical infrastructure to enable ubiquitous genomics at scale.

This paper begins to address one of those technical challenges that illustrates the need for data platform technology innovation.

Download the complete article (PDF).

 


Statistical Analysis of Genome Sequencing Data with Intel® Reference Architecture


Next generation sequencing (NGS) technologies generate vast amounts of variant data, the analysis of which poses a big computational challenge. Many current research efforts, such as population genetics studies or association studies, require computing various statistics and performing statistical tests on genome sequencing data. To facilitate such analyses, Intel has developed a specialized analytics platform, referred to as the Intel Reference Architecture. This platform provides a comprehensive set of solutions that enable convenient storage, manipulation, and analysis of genome sequencing data. The intuitive representation of variant data in a table format and the SQL-like interactive query interface make the Intel Reference Architecture a very attractive alternative to existing NGS analytics tools.

In this study, we present a set of example queries that execute commonly used operations, such as calculating allele and genotype frequencies, testing for Hardy-Weinberg equilibrium, and testing for association between SNPs and a given condition. To illustrate these queries, we used 1000 Genomes data and applied the operations to a set of 12 SNPs known to be associated with type 2 diabetes.
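For context, the statistics involved are simple to state. For a biallelic SNP with allele frequencies p and q = 1 - p, Hardy-Weinberg equilibrium predicts genotype frequencies of p^2, 2pq, and q^2 for the two homozygotes and the heterozygote; the test compares the observed genotype counts against these expected counts with a chi-square statistic. Queries of the kind described in the paper essentially count alleles and genotypes per SNP and feed those counts into such comparisons.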

Download the complete article (PDF).

Tuning Java* Garbage Collection for Spark* Applications


Spark is gaining wide industry adoption due to its superior performance, simple interfaces, and rich libraries for analysis and calculation. Like many projects in the big data ecosystem, Spark runs on the Java Virtual Machine (JVM). Because Spark can store large amounts of data in memory, it relies heavily on Java’s memory management and garbage collection (GC). New initiatives like Project Tungsten will simplify and optimize memory management in future Spark versions. But today, users who understand Java’s GC options and parameters can tune them to eke out the best performance from their Spark applications. This article describes how to configure the JVM’s garbage collector for Spark, and gives actual use cases that explain how to tune GC in order to improve Spark’s performance. We look at key considerations when tuning GC, such as collection throughput and latency.
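As a flavor of the tuning discussed in the article, the usual starting point is to enable GC logging and select a collector through the executor JVM options. The spark-submit invocation below is only an illustrative sketch: the class and jar names are placeholders, and the flags shown are common JVM options rather than recommended settings for any particular workload.

spark-submit --class com.example.MyApp \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps" \
  my-app.jar

The GC logs these flags produce are what you analyze for collection throughput and pause latency before adjusting heap sizes or collector parameters.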

Download the complete article (PDF).

 

Intel® XDK FAQs - General

Q1: How can I get started with Intel XDK?

There are plenty of videos and articles that you can go through here to get started. You could also start with one of our demo apps: pick the one that best fits your app idea, or learn from and take parts of multiple apps.

Having prior understanding of how to program using HTML, CSS and JavaScript* is crucial to using Intel XDK. Intel XDK is primarily a tool for visualizing, debugging and building an app package for distribution.

You can do the following to access our demo apps:

  • Select Project tab
  • Select "Start a New Project"
  • Select "Samples and Demos"
  • Create a new project from a demo

If you have specific questions following that, please post them to our forums.

Q2: Can I use an external editor for development in Intel® XDK?

Yes, you can open your files and edit them in your favorite editor. However, note that you must use Brackets* to use the "Live Layout Editing" feature. Also, if you are using App Designer (the UI layout tool in Intel XDK) it will make many automatic changes to your index.html file, so it is best not to edit that file externally at the same time you have App Designer open.

Some popular editors among our users include:

  • Sublime Text* (Refer to this article for information on the Intel XDK plugin for Sublime Text*)
  • Notepad++* for a lightweight editor
  • Jetbrains* editors (Webstorm*)
  • Vim* the editor
Q3: How do I get code refactoring capability in Brackets*, the code editor in Intel® XDK?

You will have to add the "Rename JavaScript* Identifier" and "Quick Search" extensions in Brackets* to get basic refactoring capability. You can find them in the Extension Manager under the File menu.

Q4: Why doesn’t my app show up in Google* play for tablets?

...to be written...

Q5: What is the global-settings.xdk file and how do I locate it?

global-settings.xdk contains information about all your projects in the Intel XDK, along with many of the settings related to panels under each tab (Emulate, Debug, etc.). For example, you can set the emulator to auto-refresh or no-auto-refresh. Modify this file at your own risk and always keep a backup of the original!

You can locate global-settings.xdk here:

  • Mac OS X*
    ~/Library/Application Support/XDK/global-settings.xdk
  • Microsoft Windows*
    %LocalAppData%\XDK
  • Linux*
    ~/.config/XDK/global-settings.xdk

If you are having trouble locating this file, you can search for it on your system using something like the following:

  • Windows:
    > cd /
    > dir /s global-settings.xdk
  • Mac and Linux:
    $ sudo find / -name global-settings.xdk
Q6: When do I use the intelxdk.js, xhr.js and cordova.js libraries?

The intelxdk and xhr libraries are only needed with the legacy build tiles. The Cordova* library is needed for all builds. When building with the Cordova* tiles, the intelxdk and xhr libraries are ignored and can be omitted.
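For reference, a typical index.html in an Intel XDK project references these libraries with ordinary script tags; the files themselves are typically injected at build time, so they do not need to exist in your source directory. This is only a sketch of the common pattern, and with Cordova* build tiles only the cordova.js line is actually used:

<script src="intelxdk.js"></script>
<script src="xhr.js"></script>
<script src="cordova.js"></script>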

Q7: What is the process if I need a .keystore file?

Please send an email to html5tools@intel.com specifying the email address associated with your Intel XDK account in its contents.

Q8: How do I rename my project that is a duplicate of an existing project?

Make a copy of your existing project directory and delete the .xdk and .xdke files from it. Then import it into Intel XDK using the ‘Import your HTML5 Code Base’ option and give it a new name to create the duplicate.

Q9: How do I try to recover when Intel XDK won't start or hangs?
  • If you are running Intel XDK on Windows* it must be Windows* 7 or higher. It will not run reliably on earlier versions.
  • Delete the "project-name.xdk" file from the project directory that Intel XDK is trying to open when it starts (it will try to open the project that was open during your last session), then try starting Intel XDK. You will have to "import" your project into Intel XDK again. Importing merely creates the "project-name.xdk" file in your project directory and adds that project to the "global-settings.xdk" file.
  • Rename the project directory Intel XDK is trying to open when it starts. Create a new project based on one of the demo apps. Test Intel XDK using that demo app. If everything works, restart Intel XDK and try it again. If it still works, rename your problem project folder back to its original name and open Intel XDK again (it should now open the sample project you previously opened). You may have to re-select your problem project (Intel XDK should have forgotten that project during the previous session).
  • Clear Intel XDK's program cache directories and files.
    On a [Windows*] machine this can be done using the following on a standard command prompt (administrator not required):
    > cd %AppData%\..\Local\XDK
    > del *.* /s/q
    To locate the "XDK cache" directory on [OS X*] and [Linux*] systems, do the following:
    $ sudo find / -name global-settings.xdk
    $ cd <dir found above>
    $ sudo rm -rf *
    You might want to save a copy of the "global-settings.xdk" file before you delete that cache directory and copy it back before you restart Intel XDK. Doing so will save you the effort of rebuilding your list of projects. Please refer to this question for information on how to locate the global-settings.xdk file.
  • If you saved the "global-settings.xdk" file and restored it in the step above and you're still having hang troubles, try deleting the cache directories and files again, this time including the "global-settings.xdk" file, and try once more.
  • Do not store your project directories on a network share (Intel XDK currently has issues with network shares that have not yet been resolved). This includes folders shared between a Virtual machine (VM) guest and its host machine (for example, if you are running Windows* in a VM running on a Mac* host). This network share issue is a known issue with a fix request in place.

Please refer to this post for more details regarding troubles in a VM. It is possible to make this scenario work but it requires diligence and care on your part.

  • There have also been issues with running behind a corporate network proxy or firewall. To check this, try running Intel XDK from your home network where, presumably, you have a simple NAT router and no proxy or firewall. If things work correctly there, then your corporate firewall or proxy may be the source of the problem.
  • Issues with Intel XDK account logins can also cause Intel XDK to hang. To confirm that your login is working correctly, go to the Intel XDK App Center and confirm that you can login with your Intel XDK account. While you are there you might also try deleting the offending project(s) from the App Center.

If you can reliably reproduce the problem, please send us a copy of the "xdk.log" file, which is stored in the same directory as the "global-settings.xdk" file, to html5tools@intel.com.

Q10: Is Intel XDK an open source project? How can I contribute to the Intel XDK community?

No, it is not an open source project. However, it utilizes many open source components that are then assembled into Intel XDK. While you cannot contribute directly to the Intel XDK integration effort, you can contribute to the many open source components that make up Intel XDK.

The following open source components are the major elements that are being used by Intel XDK:

  • Node-Webkit
  • Chromium
  • Ripple* emulator
  • Brackets* editor
  • Weinre* remote debugger
  • Crosswalk*
  • Cordova*
  • App Framework*
Q11: How do I configure Intel XDK to use 9 patch png for Android* apps splash screen?

Intel XDK does support the use of 9-patch PNG images for the Android* app splash screen. You can read more at http://developer.android.com/tools/help/draw9patch.html on how to create a 9-patch PNG image. We also plan to incorporate them in some of our sample apps to illustrate their use.

Q12: How do I stop AVG from popping up the "General Behavioral Detection" window when Intel XDK is launched?

You can try adding nw.exe as the app that needs an exception in AVG.

Q13: What do I specify for "App ID" in Intel XDK under Build Settings?

Your app ID uniquely identifies your app. For example, it can be used to identify your app within Apple’s application services allowing you to use things like in-app purchasing and push notifications.

Here are some useful articles on how to create an App ID for your:

  • iOS* App
  • Android* App
  • Windows* Phone 8 App

Q14: Is it possible to modify Android* Manifest through Intel XDK?

You cannot modify the AndroidManifest.xml file directly with our build system, as it only exists in the cloud. However, you may do so by creating a dummy plugin that only contains a plugin.xml file, which can then be added to the AndroidManifest.xml file during the build process. In essence, you need to change the plugin.xml file of the locally cloned plugin to include directives that will make those modifications to the AndroidManifest.xml file. Here is an example of a plugin that does just that:

<?xml version="1.0" encoding="UTF-8"?>
<plugin xmlns="http://apache.org/cordova/ns/plugins/1.0" id="com.tricaud.webintent" version="1.0.0">
    <name>WebIntentTricaud</name>
    <description>Addition to AndroidManifest.xml</description>
    <license>MIT</license>
    <keywords>android, WebIntent, Intent, Activity</keywords>
    <engines>
        <engine name="cordova" version=">=3.0.0" />
    </engines>
    <!-- android -->
    <platform name="android">
        <config-file target="AndroidManifest.xml" parent="/manifest/application">
            <activity android:configChanges="orientation|keyboardHidden|keyboard|screenSize|locale" android:label="@string/app_name" android:launchMode="singleTop" android:name="testa" android:theme="@android:style/Theme.Black.NoTitleBar">
                <intent-filter>
                    <action android:name="android.intent.action.SEND" />
                    <category android:name="android.intent.category.DEFAULT" />
                    <data android:mimeType="*/*" />
                </intent-filter>
            </activity>
        </config-file>
    </platform>
</plugin>

You can check the AndroidManifest.xml created in the APK using the aapt tool (part of the Android SDK build tools) from the command line:

aapt l -M appli.apk >text.txt

This writes the list of files in the APK, along with the details of the AndroidManifest.xml, to text.txt.

Q15: How can I share my Intel XDK app build?

You can send a link to your project via an email invite from your project settings page. However, a login to your account is required to access the file behind the link. Alternatively, you can download the build from the build page, onto your workstation, and push that built image to some location from which you can send a link to that image. 

Q16: Why does my iOS build fail when I am able to test it successfully on a device and the emulator?

Common reasons include:

  • The App ID specified in the project settings does not match the one you specified in Apple's developer portal.
  • The provisioning profile does not match the cert you uploaded. Double check with Apple's developer site that you are using the correct and current distribution cert and that the provisioning profile is still active. Download the provisioning profile again and add it to your project to confirm.
  • In Project Build Settings, your App Name is invalid. It should be modified to include only letters, spaces, and numbers.
Q17: How do I add multiple domains in Domain Access? 

Here is the primary doc source for that feature.

If you need to insert multiple domain references, you will need to add the extra references in the intelxdk.config.additions.xml file. This StackOverflow entry provides a basic idea, and you can inspect the intelxdk.config.*.xml files that are automatically generated with each build to see the <access origin="xxx" /> line that is generated from what you provide in the "Domain Access" field of the "Build Settings" panel on the Projects tab.
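For illustration only (the domain names below are placeholders), the extra entries added to intelxdk.config.additions.xml follow the same <access origin> form that the build system generates from the "Domain Access" field:

<!-- additional domains beyond the one entered in the Domain Access field -->
<access origin="https://api.example.com" />
<access origin="https://cdn.example.org" />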

Q18: How do I build more than one app using the same Apple developer account?

On the Apple developer site, create a distribution certificate using the "iOS* Certificate Signing Request" key downloaded from the Intel XDK Build tab for the first app only. For subsequent apps, reuse the same certificate and import it into the Build tab as you usually would.

Q19: How do I include search and spotlight icons as part of my app?

Please refer to this article in the Intel XDK documentation. Create an intelxdk.config.additions.xml file in your top level directory (same location as the other intelxdk.*.config.xml files) and add the following lines for supporting icons in Settings and other areas in iOS*.

<!-- Spotlight Icon -->
<icon platform="ios" src="res/ios/icon-40.png" width="40" height="40" />
<icon platform="ios" src="res/ios/icon-40@2x.png" width="80" height="80" />
<icon platform="ios" src="res/ios/icon-40@3x.png" width="120" height="120" />
<!-- iPhone Spotlight and Settings Icon -->
<icon platform="ios" src="res/ios/icon-small.png" width="29" height="29" />
<icon platform="ios" src="res/ios/icon-small@2x.png" width="58" height="58" />
<icon platform="ios" src="res/ios/icon-small@3x.png" width="87" height="87" />
<!-- iPad Spotlight and Settings Icon -->
<icon platform="ios" src="res/ios/icon-50.png" width="50" height="50" />
<icon platform="ios" src="res/ios/icon-50@2x.png" width="100" height="100" />

For more information related to these configurations, visit http://cordova.apache.org/docs/en/3.5.0/config_ref_images.md.html#Icons%20and%20Splash%20Screens.

For accurate information related to iOS icon sizes, visit https://developer.apple.com/library/ios/documentation/UserExperience/Conceptual/MobileHIG/IconMatrix.html

NOTE: The iPhone 6 icons will only be available if iOS* 7 or 8 is the target.

Cordova iOS* 8 support JIRA tracker: https://issues.apache.org/jira/browse/CB-7043

Q20: Does Intel XDK support Modbus TCP communication?

No, since Modbus is a specialized protocol, you need to write either some JavaScript* or native code (in the form of a plugin) to handle the Modbus transactions and protocol.

Q21: How do I sign an Android* app using an existing keystore?

Uploading an existing keystore in Intel XDK is not currently supported but you can send an email to html5tools@intel.com with this request. We can assist you there.

Q22: How do I build separately for different Android* versions?

Under the Projects Panel, you can select the Target Android* version under the Build Settings collapsible panel. You can change this value and build your application multiple times to create numerous versions of your application that are targeted for multiple versions of Android*.

Q23: How do I display the 'Build App Now' button if my display language is not English?

If your display language is not English and the 'Build App Now' button is proving troublesome, change your display language to English. The English language pack can be downloaded through a Windows* update; once it is installed, go to Control Panel > Clock, Language and Region > Region and Language > Change Display Language to switch.

Q24: How do I update my Intel XDK version?

When an Intel XDK update is available, an Update Version dialog box lets you download the update. After the download completes, a similar dialog lets you install it. If you did not download or install an update when prompted (or on older versions), click the package icon next to the orange (?) icon in the upper-right to download or install the update. The installation removes the previous Intel XDK version.

Q25: How do I import my existing HTML5 app into the Intel XDK?

If your project contains an Intel XDK project file (<project-name>.xdk) you should use the "Open an Intel XDK Project" option located at the bottom of the Projects List on the Projects tab (lower left of the screen, round green "eject" icon, on the Projects tab). This would be the case if you copied an existing Intel XDK project from another system or used a tool that exported a complete Intel XDK project.

If your project does not contain an Intel XDK project file (<project-name>.xdk) you must "import" your code into a new Intel XDK project. To import your project, use the "Start a New Project" option located at the bottom of the Projects List on the Projects tab (lower left of the screen, round blue "plus" icon, on the Projects tab). This will open the "Samples, Demos and Templates" page, which includes an option to "Import Your HTML5 Code Base." Point to the root directory of your project. The Intel XDK will attempt to locate a file named index.html in your project and will set the "Source Directory" on the Projects tab to point to the directory that contains this file.

If your imported project did not contain an index.html file, your project may be unstable. In that case, it is best to delete the imported project from the Intel XDK Projects tab ("x" icon in the upper right corner of the screen), rename your "root" or "main" html file to index.html and import the project again. Several components in the Intel XDK depend on the assumption that the main HTML file in your project is named index.html. See Introducing Intel® XDK Development Tools for more details.

It is highly recommended that your "source directory" be located as a sub-directory inside your "project directory." This ensures that non-source files are not included as part of your build package when building your application. If the "source directory" and "project directory" are the same, it results in longer upload times to the build server and unnecessarily large application executable files returned by the build system. See the following images for the recommended project file layout.

Q26: I am unable to login to App Preview with my Intel XDK password.

On some devices you may have trouble entering your Intel XDK login password directly on the device in the App Preview login screen. In particular, sometimes you may have trouble with the first one or two letters getting lost when entering your password.

Try the following if you are having such difficulties:

  • Reset your password, using the Intel XDK, to something short and simple.

  • Confirm that this new short and simple password works with the XDK (logout and login to the Intel XDK).

  • Confirm that this new password works with the Intel Developer Zone login.

  • Make sure you have the most recent version of Intel App Preview installed on your devices. Go to the store on each device to confirm you have the most recent copy of App Preview installed.

  • Try logging into Intel App Preview on each device with this short and simple password. Check the "show password" box so you can see your password as you type it.

If the above works, it confirms that you can log into your Intel XDK account from App Preview (because App Preview and the Intel XDK go to the same place to authenticate your login). When the above works, you can go back to the Intel XDK and reset your password to something else, if you do not like the short and simple password you used for the test.

Q27: How do I completely uninstall the Intel XDK from my system?

See the instructions in this forum post: https://software.intel.com/en-us/forums/topic/542074. Then download and install the latest version from http://xdk.intel.com.

Q28: Is there a tool that can help me highlight syntax issues in Intel XDK?

Yes, you can use the various linting tools that can be added to the Brackets editor to review any syntax issues in your HTML, CSS and JS files. Go to the "File > Extension Manager..." menu item and add the following extensions: JSHint, CSSLint, HTMLHint, XLint for Intel XDK. Then, review your source files by monitoring the small yellow triangle at the bottom of the edit window (a green check mark indicates no issues).

Q29: How do I manage my Apps in Development?

You can manage them by logging into: https://appcenter.html5tools-software.intel.com/csd/controlpanel.aspx. This functionality will eventually be available within Intel XDK after which access to app center will be removed.

Q30: I need help with the App Security API plugin; where do I find it?

Visit the primary documentation book for the App Security API and see this forum post for some additional details.

Q31: When I install my app onto my test device Avast antivirus flags it as a possible virus, why?

If you are receiving a "Suspicious file detected - APK:CloudRep [Susp]" message it is likely due to the fact that you are side-loading the app onto your device (using a download link or by using adb) or you have downloaded your app from an "untrusted" store. See the following official explanation from Avast:

Your application was flagged by our cloud reputation system. "Cloud rep" is a new feature of Avast Mobile Security, which flags apks when the following conditions are met:
  1. The file is not prevalent enough; meaning not enough users of Avast Mobile Security have installed your APK.
  2. The source is not an established market (Google Play is an example of an established market).
If you distribute your app using Google Play (or any other trusted market) your users should not see any warning from Avast.


Intel® Parallel Computing Center at Princeton University, Princeton Neuroscience Institute and Computer Science Dept.


Princeton University

Principal Investigators:

Princeton - Kai Li

Kai Li is a professor in the Computer Science Department at Princeton University. He pioneered Distributed Shared Memory, allowing shared-memory programming on clusters of computers, which won the ACM SIGOPS Hall of Fame Award, and proposed user-level DMA, which evolved into RDMA in the InfiniBand standard. He led the PARSEC project, which became the de facto benchmark for multicore processors. He recently co-led the ImageNet project and propelled the advancement of deep learning methods. He co-founded Data Domain, Inc. (now an EMC division) and led the innovation of deduplication storage system products that displaced the tape automation market. He is an ACM Fellow, an IEEE Fellow, and a member of the National Academy of Engineering.

Princeton - Sebastian Seung

Sebastian Seung is a professor at the Princeton Neuroscience Institute and the Department of Computer Science. Over the past decade, he has helped pioneer the new field of connectomics, developing new computational technologies for mapping the connections between neurons. His lab created EyeWire.org, a site that has recruited 200,000 players from 150 countries to play a game that maps neural connections. His book Connectome: How the Brain's Wiring Makes Us Who We Are was chosen by the Wall Street Journal as one of the Top Ten Nonfiction books of 2012. Before joining the Princeton faculty in 2014, Seung studied at Harvard University, worked at Bell Laboratories, and taught at the Massachusetts Institute of Technology.

Description:

Over the past few years, convolutional neural networks (rebranded as “deep learning”) have become the leading approach to big data. In order to perform well, deep learning requires large amounts of training data and substantial computing power for training and classification. Most deep learning implementations use GPUs instead of general-purpose CPUs because the conventional wisdom is that a GPU is an order of magnitude faster than a CPU for deep learning at a similar cost. As a result, the machine learning community as well as vendors have invested a great deal of effort in developing deep learning packages.

Intel® Xeon Phi™ coprocessors, based on the Many-Integrated-Core (MIC) architecture, offer an alternative to GPUs for deep learning: their peak floating-point performance and cost are on par with a GPU, while they offer several advantages such as ease of programming, binary compatibility with the host processor, and direct access to large host memory. However, it is still challenging to fully take advantage of the hardware capabilities. Doing so requires running many threads in parallel (e.g., 240+ threads for 60+ cores), executing 16 floating-point operations in parallel (for AVX-512), and keeping the working set for each thread small (128 KB of L2 cache per thread).

This center will develop an efficient deep learning package for the Intel® Xeon Phi™ coprocessor. The project is built on the work of Sebastian Seung’s lab on ZNN, a deep learning package (https://github.com/seung-lab/znn-release) based on two key concepts, both of which leverage the advantages of CPUs. (1) FFT-based convolution becomes more efficient when FFTs are cached and reused. This trades memory for speed, and is therefore appropriate for the larger working memory of CPUs. (2) Task parallelism on CPUs can make more efficient use of computing resources than SIMD parallelism on GPUs. Our preliminary results with ZNN are encouraging. We have shown that CPUs can be competitive with GPUs in the speed of deep learning for certain network architectures. Furthermore, an initial port to the Intel® Xeon Phi™ coprocessor (Knights Corner) was done quickly, supporting the idea that CPU implementations are likely to incur relatively low development cost.
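(As a reminder of why caching FFTs pays off: by the convolution theorem, a convolution x * w can be computed as the inverse FFT of the pointwise product of the FFTs of x and w, so once the FFT of a filter or feature map has been computed and cached it can be reused by every convolution that involves it, at the cost of holding the transforms in memory.)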

The proposed optimizations for the future Intel® Xeon Phi™ processor family include trading memory space for computation (transforming convolution networks to reusable FFTs), intelligently choosing direct vs. FFT-based convolution for each layer of the network, choosing the right flavor of task parallelism, intelligent tiling to optimize L2 cache performance, and careful data structure layouts to maximize the utilization of the AVX-512 vector units. We will carefully evaluate the deep learning package with a 2D ImageNet dataset, a 3D electron microscopy image dataset, and a 4D fMRI dataset. We plan to release the software package and datasets into the public domain.

Related websites:

http://www.cs.princeton.edu/~li/

Intel® Parallel Computing Center at King Abdullah University of Science and Technology


King Abdullah University

Principal Investigators:

KAUST - David Keyes

David Keyes is a founding professor of Applied Mathematics and Computational Science at KAUST, where he focuses on high performance implementations of implicit methods for PDEs.  He received a BSE from Princeton and a PhD from Harvard. He has held faculty positions at Yale, Old Dominion, and Columbia Universities and research positions at NASA and DOE laboratories, and has led the scalable solvers initiative of the DOE SciDAC program. He is a Fellow of AMS and SIAM, and recipient of the IEEE Sidney Fernbach Award, the ACM Gordon Bell Prize, and the SIAM Prize for Distinguished Service to the Profession.

KAUST - Hatem Ltaief

Hatem Ltaief is a Senior Research Scientist in the Extreme Computing Research Center at KAUST, where he directs the KBLAS software project for dense and sparse linear algebraic operations on emerging architectures.  He received an MS in computational science from the University of Lyon and an MS in applied mathematics and a PhD in computer science from the University of Houston.  He has been a Research Scientist at the Innovative Computing Laboratory of the University of Tennessee and a Computational Scientist in the KAUST Supercomputing Laboratory. He is a member of the European Exascale Software Initiative (EESI2).

KAUST - Rio Yokota

Rio Yokota is an associate professor in the Global Scientific Information and Computing Center at the Tokyo Institute of Technology and a consultant at KAUST, where he researches fast multipole methods, their implementation on emerging architectures, and their applications in PDEs, BEMs, molecular dynamics, and particle methods. He received his undergraduate and doctoral degrees in Mechanical Engineering from Keio University, and held postdoctoral appointments at the University of Bristol and Boston University and a Research Scientist appointment at KAUST. He is a recipient of the ACM Gordon Bell Prize.

Description:

The Intel® Parallel Computing Center (Intel® PCC) at King Abdullah University of Science and Technology (KAUST) aims to provide scalable software kernels, common to scientific simulation codes, that will adapt well to future architectures, including a scheduled upgrade of KAUST’s Intel-based Cray XC40 system, currently ranked in the global Top 10. In the spirit of co-design, the Intel® PCC at KAUST will also provide feedback that could influence architectural design trade-offs. The Intel® PCC at KAUST is hosted in KAUST’s Extreme Computing Research Center (ECRC), directed by co-PI Keyes, which aims to smooth the architectural transition of KAUST’s simulation-intensive science and engineering code base. Rather than taking a specific application code and optimizing it, the ECRC adopts the strategy of optimizing algorithmic kernels that are shared among many application codes, and of providing the results in open source libraries. Chief among such kernels are Poisson solvers and dense symmetric generalized eigensolvers.

We focus on optimizing two types of scalable hierarchical algorithms, fast multipole methods (FMM) and hierarchical matrices, on next generation Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors. These algorithms have the potential to replace workhorse kernels of molecular dynamics codes (drug/material design), sparse matrix preconditioners (structural/fluid dynamics), and covariance matrix calculations (statistics/big data). Co-PI Yokota is the architect of the open source fast multipole library ExaFMM, which attempts to integrate the best solutions offered by FMM algorithms, including the ability to control the expansion order and octree decomposition strategy independently, to create the fastest inverter that meets a given accuracy requirement for a solver or preconditioner on manycore and heterogeneous architectures. Co-PI Ltaief is the architect of the KBLAS library, which promotes the directed acyclic graph-based dataflow execution model to create NUMA-aware, work-stealing tile algorithms of high concurrency, with an innermost SIMD structure well suited to floating point accelerators. The overall software framework of this Intel® PCC at KAUST, Hierarchical Computations on Manycore Architectures (HiCMA), is built upon these linear solvers and the philosophy that dense blocks of low rank should often be replaced with hierarchical matrices as they arise. Hierarchical matrices are natural algebraic generalizations of fast multipole, and are implementable in data structures similar to those that have made FMM successful on distributed nodes of shared memory cores.

FMM and hierarchical matrix algorithms share a rare combination of O(N) arithmetic complexity and high arithmetic intensity (flops/Byte). This is in contrast to traditional algorithms that have either low arithmetic complexity with low arithmetic intensity (FFT, sparse linear algebra, and stencil application), or high arithmetic intensity with high arithmetic complexity (dense linear algebra, direct N-body summation). In short, FMM and hierarchical matrices are efficient algorithms that will remain compute-bound on future architectures. Furthermore, these methods have a communication complexity of O(log P) for P processors, and permit high asynchronicity in their communication. Therefore, they are amenable to asynchronous programming models that are gaining popularity as architectures approach the exascale.

Related websites:

http://ecrc.kaust.edu.sa

Intel® Parallel Computing Center at Indiana University


Indiana University

Principal Investigators:

Indiana - JudyQiu

Judy Qiu is an assistant professor of Computer Science at Indiana University. Her general area of research is data-intensive computing at the intersection of Cloud and HPC multicore technologies. This includes a specialization in programming models that support iterative computation, ranging from storage to analysis, and that can scalably execute data-intensive applications. Her research has been funded by NSF, NIH, Microsoft, Google, and Indiana University. She is the recipient of an NSF CAREER Award (2012), the Indiana University Trustees Award for Teaching Excellence (2013-2014), and the Indiana University Outstanding Junior Faculty Award (2015).

Indiana - StevenGottlieb

Steven Gottlieb is a Distinguished Professor of Physics at Indiana University. He works in lattice QCD, an area of theoretical high-energy physics that relies on large-scale computing to understand the quantum field theory that describes the strong force. His research has been funded for many years by the US Department of Energy and the National Science Foundation. He received an A.B. degree from Cornell University with majors in mathematics and physics, as well as Master's and Ph.D. degrees in physics from Princeton University. He was a DOE Outstanding Junior Investigator and a recipient of the Indiana University Outstanding Junior Faculty Award.

Description:

The Indiana University Intel® Parallel Computing Center (Intel® PCC) is a multi-component interdisciplinary center. The initial activities involve Center Director Judy Qiu, an Assistant Professor in the School of Informatics and Computing, and Distinguished Professor of Physics Steven Gottlieb. Qiu will be researching novel parallel systems supporting data analytics, and Gottlieb will be adapting the physics simulation code of the MILC Collaboration to the Intel® Xeon Phi™ coprocessor.

More generally, the focus of the Center will be grand challenges in high performance simulation and data analytics with innovative applications, and software development using the Intel architecture. Issues of programmer productivity and performance portability will be studied.

Steven Gottlieb is a founding member of the MILC Collaboration, which studies quantum chromodynamics, the theory of the strong force, one of nature's four fundamental forces. The open source MILC code is part of the SPEC benchmark and has been used as a performance benchmark for a number of supercomputer acquisitions. Gottlieb will be working on restructuring the MILC code to make optimal use of the SIMD vector units and many-core architecture of the Intel® Xeon Phi™ coprocessor. These will be used in upcoming supercomputers at the National Energy Research Scientific Computing Center (NERSC) and the Argonne Leadership Computing Facility (ALCF). The MILC code is currently used for hundreds of millions of core-hours at NSF and DOE supercomputer centers.

Data analysis plays an important role in data-driven scientific discovery and commercial services. Judy Qiu's earlier research has shown that previous, complicated versions of MapReduce can be replaced by Harp (a Hadoop plug-in), which offers both data abstractions useful for high-performance iterative computation and MPI-quality communication that can drive libraries like Mahout, MLlib, and DAAL on HPC and Cloud systems. A subset of machine learning algorithms has been selected and will be implemented with optimal performance using Hadoop/Harp and Intel's DAAL library. The code will be tested on Intel’s Haswell and Xeon Phi™ coprocessor architectures.

Related websites:

http://ipcc.soic.iu.edu/

Pulse Detection with Intel® RealSense™ Technology


1. Introduction

When I first heard of a system that could determine your heart rate without actually touching you, I was sceptical to the point where I dismissed the claim as belonging somewhere between fakery and voodoo. Many moons later I had a reason to look deeper into the techniques required and realized it was not only possible, but had already been achieved, and furthermore had been implemented in the latest version of the Intel® RealSense™ SDK.

It was only when I located and ran the appropriate sample program from the SDK, read the value, and then checked it against my own heart rate by counting the beats in my neck for 15 seconds and multiplying by four, that I realized it actually works! I jumped up and down a little to get my heart rate up, and amazingly, after some seconds, the computer once again accurately calculated my accelerated rate. Of course by this time I was so pumped at the revelation and excited at the prospects of a computer that knows how calm you are, that I could not get my heart rate down below my normal 76 beats per minute to test the lower levels.

 

2. Why Is This Important

Once you begin your journey into the frontier world of hands-free control systems, 3D scanning, and motion detection, you will eventually find yourself asking what else you can do with the Intel® RealSense™ camera. When you move from large clearly defined systems to the more subtle forms of detection, you enter a realm where computers gain abilities never seen before.

Pulse detection, like several other Intel RealSense SDK features, is a much more subtle stream of information that may one day play as critical a role in your daily life as your keyboard or mouse. For example, a keyboard or mouse is no good to you if you’re suffering from RSI (Repetitive Strain Injury), and no amount of clever interfacing will help you if you’re distracted, agitated, sleepy, or simply unhappy. Using the subtle science of reading a user’s physical and possibly emotional condition allows the computer to do something about it for the benefit of the user and improve that experience. Let’s say it’s nine thirty in the morning, the calendar shows a full day of work ahead, and the computer detects the user is sleepy and distracted. Using some pre-agreed recipes, the computer could trigger your favourite ‘wake me up with 5 power ballads’ music, flash up your calendar for the next 4 hours and throw some images of freshly brewed coffee on screen as a gentle reminder to switch up a gear.

Technological innovation isn’t always about what button does what or how we can make things quicker, easier, or smarter, it can also be about improving quality of life and enriching an experience. If your day can be made better because your computer has a sense of what you might need and then takes autonomous steps to help you, that can only be a good thing.

By way of another example and not directly related to pulse detection, imagine your computer is able to detect temperature and notices that when you get hot your work rate drops (i.e., less typing, more distracted, etc.) and also records that when the temperature was cooler, your work level increases. Now imagine it recorded sensor metrics about you on a daily basis, and during a particularly hot morning your computer flashes a remark that two days ago you had also been hot, you left the desk for 20 seconds, and 2 minutes later everything was cool (and your subsequent work level improved that day). Such a prompt might recall a memory that you opened a few windows, or turned on the air conditioning in the next room, and so you follow the advice and your day improves. Allowing the computer to collect this kind of data and experimenting with the ways in which this data can improve your own life will ultimately lead to innovations that will improve life for everyone.

Pulse estimation is just one way in which a computer can extract subtle data from the surrounding world, and as technology evolves, the sophistication of pulse detection will lead to readings as accurate as traditional methods.

 

3. How Is This Even Possible?

My research into precisely how pulse estimation currently works took me on a brief journey through the techniques that have proved successful so far, such as detecting so-called micro-movements of the head.

Detecting micro-movements in the head

You need more than a tape measure to detect micro-movements in the head.

Apparently when your heart beats, a large amount of blood is pumped into your head to keep your brain happy, and this produces an involuntary and minuscule movement that can be detected by a high resolution camera. By counting these movements, filtered to remove normal Doppler and other determinable movements, you can work out how many beats the user is likely to have per minute. Of course, many factors can disrupt this technique, such as natural movements that can be mistaken for micro-movements, shaky footage captured while you are in transit, or simply being cold and shivering. Under regulated conditions, this technique has been proven to work with nothing more than a high resolution color camera and software capable of filtering out visual noise and detecting the pulses.

Another technique that is closer to the method used by the Intel RealSense SDK is the detection of color changes in a live stream and using those color changes to determine if a pulse happened. The frame rate does not have to be particularly high for this technique to work, nor does the camera need to be perfectly still, but the lighting conditions need to be ideal for the best results. This alternative technique has a number of variations, each with varying levels of success, two of which I will briefly cover here.

Did you know your eyes can tell you how fast your heart is beating?

Obviously, the technique works better when you are not wearing glasses, and with a high resolution capture of the eyeball you have an increased chance of detecting subtle changes in the blood vessels of the eye over the course of the detection phase. Unlike veins under the skin that are subject to subsurface scattering and other occlusions, the eye offers a relatively clear window into the vascular system of the head. You do have a few hurdles to overcome, such as locking the pixels for the eye, so you only work with the eye area and not the surrounding skin. You also need to detect blinking and track pupils to ensure no noise gets into the sample, and finally you need to run the sample long enough to get a good sense of background noise that needs to be eliminated before you can magnify the remaining color pixels to help in detecting the pulse.

Your mileage will vary as to how long you need to run the sample, and a lot of noise may force you to throw a sample out, but even running at a modest 30 frames per second you’ll have anywhere from 20 to 30 frames in which to find just one pulse (assuming your subject has a heart rate between 60 and 90 beats per minute).

If you find the color information from the eye is insufficient, such as might occur for users who are sitting a good distance away from the computer, wearing glasses, or meditating, then you need another solution. One more variation on the skin color change method is the use of the IR stream (InfraRed), which is readily provided by the Intel® RealSense™ camera. Unlike color and depth streams, IR streams can be sent to the software at upwards of 300 frames per second, which is quite fast. As suggested before, however, we only need around 30 frames per second of good quality samples to find our elusive pulse, and the IR image you get back from the camera has a special trick to reveal.

Notice the veins in the wrist, made highly visible thanks to Infra-Red

For the purpose of brevity, I will not launch into a detailed description of the properties of IR and its many applications. Suffice it to say that it occupies a specific spectrum of light that the human eye cannot entirely perceive. The upshot is that when we bounce this special light off objects, capture the results, and convert them to something we can see, it reacts a little differently than its neighboring colors higher up the spectrum.

One of the side effects of bouncing IR off a human is that we can detect veins near the surface of the skin and other characteristics such as detecting grease on an otherwise perfectly clean shirt. Given that blood flow is the precise metric we want to measure you might think this approach is perfectly suited to the job of detecting a heart rate. With a little research you will find that IR has indeed been used for the purpose of scanning the human body and detecting the passage of blood around the circulatory system, but only under strict medical conditions. The downside to using IR is that you effectively limit the information you are receiving from the camera and must throw away the equally valuable visible spectrum returned via the regular RGB color stream.

Of course, the ultimate solution is to combine all three sources of information; taking micro-movements, IR blood flow, and full color skin changes to act as a series of checks and balances to reject false positives and produce a reliable pulse reading.

 

4. How Intel® RealSense™ Technology Detects Your Pulse

Now that you know quite a bit about the science of touchless heart rate detection, we are going to explore how you can add this feature to your own software. You are free to scan the raw data coming from the camera and implement one or all of the above techniques, or thanks to the Intel RealSense SDK you can instead implement your own heart rate detection in just a few lines of code.

The first step is not specifically related to the pulse detection function, but for clarity we will cover it here so you have a complete picture of which interfaces you need and which ones you can ignore for now. We first need to create a PXC session, a SenseManager pointer, and a faceModule pointer as we will be using the Face system to eventually detect the heart rate. For a complete version of this source code, the best sample to view and compile against is the Face Tracking example, which contains the code below but with support for additional features such as pose detection.

PXCSession* session = PXCSession_Create();
PXCSenseManager* senseManager = session->CreateSenseManager();
senseManager->EnableFace();
PXCFaceModule* faceModule = senseManager->QueryFace();

Once the housekeeping is done and you have access to the critical faceModule interface, you can make the pulse-specific function calls, starting with the command to enable the pulse detector.

PXCFaceConfiguration* config=faceModule->CreateActiveConfiguration();
config->QueryPulse()->Enable();
config->ApplyChanges();

The ActiveConfiguration object encompasses all the configuration you need for the Face system, but the one line that specifically relates to getting a heart rate reading is the function to QueryPulse()->Enable(), which activates this part of the system and starts it running.

The final set of commands drills down to the value we are after, and as you can see below relies on parsing through all the faces that may have been detected by the system. It does not assume that a single user is sitting at the computer—someone could be looking over your shoulder or standing in the background. Your software must make additional checks, perhaps using the pose data structure, to determine which is the main head (perhaps the closest) and only use the heart rate for that face/user. Below is the code that makes no such distinction and simply moves through all the faces detected and takes the heart rate for each one, though it does nothing with the value in this example.

PXCFaceData* faceOutput = faceModule->CreateOutput();
const int numFaces = faceOutput->QueryNumberOfDetectedFaces();
for (int i = 0; i < numFaces; ++i)
{
	PXCFaceData::Face* trackedFace = faceOutput->QueryFaceByIndex(i);
	const PXCFaceData::PulseData* pulse = trackedFace->QueryPulse();
	if (pulse != NULL)
	{
		pxcF32 hr = pulse->QueryHeartRate();
	}
}

You can ignore most of the code except for the trackedFace->QueryPulse() which asks the system to work out the latest heart rate from the data collected thus far, and if data is available, to use the pulse->QueryHeartRate() to interrogate that data and return the heart rate in beats per minute.
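For completeness, here is a minimal sketch of the frame loop that the Face Tracking sample wraps around the snippets above. It assumes the session, sense manager, face module, and configuration from the earlier listings, reduces error handling to the bare minimum, and simply discards the heart rate value once it has been read:

// initialize the pipeline after the configuration has been applied
if (senseManager->Init() >= PXC_STATUS_NO_ERROR)
{
	PXCFaceData* faceOutput = faceModule->CreateOutput();
	// block until a new frame is available, refresh the face data, then release the frame
	while (senseManager->AcquireFrame(true) >= PXC_STATUS_NO_ERROR)
	{
		faceOutput->Update();
		const int numFaces = faceOutput->QueryNumberOfDetectedFaces();
		for (int i = 0; i < numFaces; ++i)
		{
			PXCFaceData::Face* trackedFace = faceOutput->QueryFaceByIndex(i);
			const PXCFaceData::PulseData* pulse = trackedFace->QueryPulse();
			if (pulse != NULL)
			{
				pxcF32 hr = pulse->QueryHeartRate(); // beats per minute, once enough samples have been gathered
			}
		}
		senseManager->ReleaseFrame();
	}
	faceOutput->Release();
}

The important detail is the call to faceOutput->Update() on every acquired frame; without it, the face data, and therefore the pulse estimate, never refreshes.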

An expression of surprise as the pulse estimate was exactly right.

By running the Face Tracking sample included with the Intel RealSense SDK and deselecting everything from the right except detection and pulse, then pressing start, you will be greeted with your own heart rate after 10 seconds of staying relatively still.

Once you have stripped out the non-pulse code from the above example, you can use it as a good code base for further experiments with the technique. Perhaps draw a graph of the readings over time, or add code to have the app run in the background and produce an audible beep to let you know when you’re getting too relaxed or excited. More seriously, you can monitor the accuracy and resolution of the readings returned to determine whether they are sufficient for your application.

 

5. Tricks and Tips

Do’s

  • For best results not only when detecting your heart rate but for all capture work, use the camera in good lighting conditions (not exposed to sunlight) and stay relatively still during the sampling phase until you get an accurate reading.
  • As the current SDK only provides a single function for the detection of pulse, the door is wide open for innovators to use the range of raw data to obtain more accurate and instant readings from the user. The present heart rate estimate takes over 10 seconds to calculate; can you write one that performs the measurement in less time?
  • If you want to perform heart rate estimation outdoors and want to write your own algorithm to perform the analysis, it is recommended you use the color stream only for detecting skin color changes.

Don’ts

  • Don’t try to detect a heart rate with all the options in Face Tracking activated, as this will reduce the quality of the result or fail to report a value altogether. You will need sufficient processing power available for the Face module to accurately estimate the heart rate.
  • Don’t use an IR detection technique in outdoor spaces, as any amount of direct sunlight will completely obliterate the IR signals returned, rendering any analysis impossible.

 

6. Summary

As touched on at the start of this article, the benefits of heart rate detection are not immediately apparent when compared to the benefits of hands-free controls and 3D scanning, but when combined with other sensory information it can provide invaluable help to the user when they need it most. We’re not yet at the stage where computers can record our heart rate simply as we walk past the doctor’s office window, but we’re halfway there, and it’s only a matter of time, further innovation, and application before we see it take its place in our modern world.

From a personal perspective, living the life of an overworked, old-school, code-crunching relic, my health and general work ethic are more important to me now than in my youth, and I am happy to get help from any quarter that provides it. If that help takes the form of a computer nagging me to ‘wake up, drink coffee, eyes front, don’t forget, open a window, take a break, eat some food, go for a walk, play a game, go to the doctor you don’t have a pulse, listen to Mozart, and go to sleep’—  especially if it’s a nice computer voice—then so be it.

Of course being a computer, if the nagging gets a little persistent you can always switch it off. Something tells me though that we’ll come to appreciate these gentle reminders, knowing that behind all the cold logic, computers are only doing what we asked them to do, and at the end of the day, we can all use a little help, even old-school, code-crunching relics.

 

About The Author

When not writing articles, Lee Bamber is the CEO of The Game Creators (http://www.thegamecreators.com), a British company that specializes in the development and distribution of game creation tools. Established in 1999, the company and surrounding community of game makers are responsible for many popular brands including Dark Basic, The 3D Game Maker, FPS Creator, App Game Kit (AGK) and most recently, Game Guru.


Intel Resources for Game Developers



"Thank you for making games! Intel is a strong supporter of game development, and we've assembled all our best information to help you get your game running great on Intel hardware. Intel® HD Graphics and Intel® IrisTM/IrisTM  Pro Graphics parts are some of the most commonly used graphics solutions in PCs worldwide (see http://store.steampowered.com/hwsurvey/videocard/). By following the advice on these pages and using the tools we provide, you'll ensure that your game is able to be enjoyed by millions of gamers. We want you to be successful! Get what you need here, and if you can't find something, let us know in our game developer forum." - Aaron Coday, Director, Visual Computing Engineering, Intel Corporation

Intel wants to help game developers optimize their programs so they run, look, and play great on the developers' platforms of choice. Intel offers documentation, tools, code samples, case studies, and events to help with development and optimization. The lists below are provided to quickly direct you to information of interest.

Developer Guides and Reference Manuals

Intel provides comprehensive graphics API usage guides for our hardware, going back to 2011, as well as other documentation:

Code Samples (https://software.intel.com/gamecode)

Considered by many developers to be the most helpful resources, Intel's code samples cover a wide variety of topics. Here are some links, but be sure to also see the OpenCL code samples below.

Tools for Optimizing Games and Graphics for Intel® Processors

At Intel, we believe we have created the best processing hardware for computing. We know that it's not always easy to harness that processing power, so we've built powerful software tools for analyzing your application on Intel hardware. Our best tools for graphics are collected here.

Intel® Graphics Performance Analyzers (Intel® GPA)

This tool covers OpenCL and DirectX and shows high-level performance metrics for the CPU. You can capture a frame for detailed analysis and on-the-fly tweaking. Intel GPA Platform View can show you detailed interaction between the CPU and graphics. Articles and Guides on Intel GPA include:

Intel® VTune™ Amplifier XE 2015 

A performance analysis tool targeted at users developing serial and multithreaded applications. The tool is delivered as the VTune Amplifier XE Performance Profiler, and as the VTune Amplifier for Systems Performance Profiler and Intel Energy Profiler.

Analyzing Applications Using Intel® HD Graphics

Intel® Integrated Native Developer Experience (Intel® INDE)

Intel® C++ Compiler 15.0

OpenCL Optimization

At Intel, we know that OpenCL is for more than just image manipulation. Here's our best resources for game developers who want to leverage OpenCL.

OpenCL Code Samples

OpenGL (Android)

Unity

DirectX* 12

Case Studies

Many PC game developers have already had success using our tools to improve their games' performance. For some, we've written about what challenges they faced and what solutions they found. We hope these case studies will be helpful for your own efforts. Maybe we'll be writing about your game next!    https://software.intel.com/en-us/gamedev/learn

Presentations from GDC2015 

Upcoming events- https://software.intel.com/en-us/game-dev/events
 

Achievement Unlocked: Site and Q&A

For more info about game development on Intel processor graphics, visit the Intel® game development community. There you’ll find useful references for everything from multithreading to audio. If you have more questions, including driver questions, head to the forums. If you can’t find the answer to your question above, you can visit the Intel® HD Graphics support page.

Intel Texture Compression Plugin for Photoshop


Achievement Unlocked Badge

Intel is working to extend Photoshop to take advantage of the latest Intel hardware and image compression methods (BCn/DXT). The purpose of this plugin is to provide tools for Artists to access the superior compression methods at Intel accelerated speeds within the well established industry application Photoshop.

Sign Up for Beta

 

Before Compression

Test strip before compression

After BC7 (Fine) Compression

Test Strip After BC7 (Fine) Compression

 


Benefits

Context Menu

  • Access to hardware supported superior compression methods
  • Compression at Intel accelerated speeds
  • Previewing and convenience features to aid productivity
  • Runs within established content tool
  • Pluggable architecture for future compression schemes

 


Key Features

File Menu Export

  • Multiple image format support for BCn
  • Export with DirectX10 extended header for sRGB
  • Choice of Fast and Fine (more accurate) compression
  • Support for alpha maps, color maps, normal maps
  • Support for cube maps with BCn compression
  • Real-time preview to visualize quality trade-offs
  • Photoshop Batch/Action support
  • Extensible

 


Export Formats

Formats

Available formats change based on Texture Type chosen. Helper guidance in simple terms is also provided. Color format list shown at left. Full list shown below.

Format   Color Format   Bits per Pixel   Notes
BC1      RGB            4 BPP
BC1      sRGB           4 BPP
BC3      RGBA           8 BPP
BC3      sRGBA          8 BPP
BC4      R              4 BPP            Grayscale
BC5      RG             8 BPP            2 Channel Tangent Map
BC6H     RGB            8 BPP            Fast Compression
BC6H     RGB            8 BPP            Fine Compression
BC7      RGBA           8 BPP            Fast Compression
BC7      RGBA           8 BPP            Fine Compression
BC7      sRGBA          8 BPP            Fast Compression
BC7      sRGBA          8 BPP            Fine Compression
None     RGBA           32 BPP           Uncompressed

Beta Requirements

  • Windows (32/64-bit) versions 7, 8, and 10
  • Photoshop CS6 through CC2015

Reference


Feedback is Welcome

Sign up on IDZ to Join the Conversation


More Comparisons

Preview BC7 Fast Comparison

Preview BC7 Fine Comparison

Intel Software - Achievement Unlocked

Books - High Performance Parallelism Pearls


The two “Pearls” books contain an outstanding collection of examples of code modernization, complete with discussions by the software developers of how the code was modified, with commentary on what worked as well as what did not! Code for these real-world applications is available for download from http://lotsofcores.com whether you have bought the books or not. The figures are freely available as well, a real bonus for instructors who choose to use these examples when teaching code modernization techniques. The books, edited by James Reinders and Jim Jeffers, had 67 contributors for Volume 1 and 73 contributors for Volume 2.

Experts wrote about their experiences in adding parallelism to their real-world applications. Most examples illustrate their results on Intel® Xeon® processors and on the Intel® Xeon Phi™ coprocessor. The key issues of scaling, locality of reference, and vectorization are recurring themes, as each contributed chapter explains the thinking behind adding parallelism to the application. The actual code is shown and discussed, with step-by-step reasoning and analysis of the results. While OpenMP* and MPI are the dominant methods for parallelism, the books also include usage of TBB, OpenCL, and other models. There is a balance of Fortran, C, and C++ throughout. With such a diverse collection of real-world examples, the opportunities to learn from other experts are quite amazing.

 

Volume 1 includes the following chapters:

Foreword by Sverre Jarp, CERN.

Chapter 1: Introduction

Chapter 2: From ‘Correct’ to ‘Correct & Efficient’: A Hydro2D Case Study with Godunov’s Scheme

Chapter 3: Better Concurrency and SIMD on HBM

Chapter 4: Optimizing for Reacting Navier-Stokes Equations

Chapter 5: Plesiochronous Phasing Barriers

Chapter 6: Parallel Evaluation of Fault Tree Expressions

Chapter 7: Deep-Learning and Numerical Optimization

Chapter 8: Optimizing Gather/Scatter Patterns

Chapter 9: A Many-Core Implementation of the Direct N-body Problem

Chapter 10: N-body Methods

Chapter 11: Dynamic Load Balancing Using OpenMP 4.0

Chapter 12: Concurrent Kernel Offloading

Chapter 13: Heterogeneous Computing with MPI

Chapter 14: Power Analysis on the Intel® Xeon Phi™ Coprocessor

Chapter 15: Integrating Intel Xeon Phi Coprocessors into a Cluster Environment

Chapter 16: Supporting Cluster File Systems on Intel® Xeon Phi™ Coprocessors

Chapter 17: NWChem: Quantum Chemistry Simulations at Scale

Chapter 18: Efficient Nested Parallelism on Large-Scale Systems

Chapter 19: Performance Optimization of Black-Scholes Pricing

Chapter 20: Data Transfer Using the Intel COI Library

Chapter 21: High-Performance Ray Tracing

Chapter 22: Portable Performance with OpenCL

Chapter 23: Characterization and Optimization Methodology Applied to Stencil Computations

Chapter 24: Profiling-Guided Optimization

Chapter 25: Heterogeneous MPI optimization with ITAC

Chapter 26: Scalable Out-of-Core Solvers on a Cluster

Chapter 27: Sparse Matrix-Vector Multiplication: Parallelization and Vectorization

Chapter 28: Morton Order Improves Performance

 

Volume 2 includes the following chapters:

Foreword by Dan Stanzione, TACC

Chapter 1: Introduction

Chapter 2: Numerical Weather Prediction Optimization

Chapter 3: WRF Goddard Microphysics Scheme Optimization

Chapter 4: Pairwise DNA Sequence Alignment Optimization

Chapter 5: Accelerated Structural Bioinformatics for Drug Discovery     

Chapter 6: Amber PME Molecular Dynamics Optimization

Chapter 7: Low Latency Solutions for Financial Services

Chapter 8: Parallel Numerical Methods in Finance    

Chapter 9: Wilson Dslash Kernel From Lattice QCD Optimization

Chapter 10: Cosmic Microwave Background Analysis: Nested Parallelism In Practice  

Chapter 11: Visual Search Optimization

Chapter 12: Radio Frequency Ray Tracing

Chapter 13: Exploring Use of the Reserved Core

Chapter 14: High Performance Python Offloading

Chapter 15: Fast Matrix Computations on Asynchronous Streams 

Chapter 16: MPI-3 Shared Memory Programming Introduction

Chapter 17: Coarse-Grain OpenMP for Scalable Hybrid Parallelism  

Chapter 18: Exploiting Multilevel Parallelism with OpenMP

Chapter 19: OpenCL: There and Back Again

Chapter 20: OpenMP vs. OpenCL: Difference in Performance?      

Chapter 21: Prefetch Tuning Optimizations

Chapter 22: SIMD functions via OpenMP

Chapter 23: Vectorization Advice  

Chapter 24: Portable Explicit Vectorization Intrinsics

Chapter 25: Power Analysis for Applications and Data Centers

 

Step by Step guide to build iOS apps using UI builder of Multi-OS Engine of Intel INDE


Author: Yash Oswal

If you have reached this article, I presume you know what Intel Multi-OS Engine is capable of and its cool new features. If not, refer to these articles - Why Multi-OS Engine? and the Technical Overview of Multi-OS Engine.

This article specifically talks about the UI Builder. Multi-OS Engine (MOE) provides its own UI builder for building iOS apps; it works much like the Android layout editor, but offers view objects that correspond to iOS. In this tutorial I am going to build a simple quiz application for iOS using MOE and its UI builder.

Step 1: To build any MOE application first make a stock Android sample application like this:       

Step 2: Then right-click on the Android project and select Intel MOE Module to create a new module.

Step 3: Then select stock Hello World Application as a starting point for making the app.

Step 4: Name your application and click Finish.

Step 5: These are the files that are created by MOE.

Step 6: Next click on the sample_design.ixml file to start editing your UI.

Step 7: As you can see some parameters are already set as default like initialViewController and its viewController. You can set parameters according to your own needs.

Step 8: After deleting the default label and button, you can start adding your own view objects. Adding an object is quite simple: you just drag and drop, just like in an Android layout.

Step 9: You can set all the parameters related to your UI in the properties tab. The parameters here are quite similar to those seen on Xcode.

Step 10: Here add two labels and two buttons to make the UI of the app.

Step 11: Name the IBOutlets for all the view objects that you place in your UI.

Step 12: Set up your IBActions for the buttons under Events.

Step 13: The MOE auto-generates all the IBOutlets and IBActions in the corresponding Java viewcontroller file.

Step 14: Now add the class variables corresponding to all the view objects in the UI and assign them their corresponding IBOutlets in the viewDidLoad method.

Step 15: Now create a QuizDataSet.java file which can work as a data source for the app and instantiate its object in the AppViewController class.

Step 16: Next set the texts of the Question and Answer Labels in the Action methods showQuestion and showAnswer.

Step 17: After this just run the app. 

Voila! You have designed your first iOS app with the UI designer of the Multi-OS Engine.

New, Exciting Media Transcoding Software Capabilities to Showcase at IBC 2015


See You at International Broadcasters Conference (IBC)

Visit Intel at IBC in Amsterdam, Sept. 11 to 15, to see demos of new media software capabilities—some of which are so special that we can't even list them! Of course, with consumer demand exploding for video content and ultra HD TVs, to stay competitive, media and video solution providers need to innovate now for HEVC, 4K, and UHD. And with Intel hardware and software, we make these transitions so much easier, faster, and more powerful.

Learn how Intel® Media Server Studio, Intel® Video Pro Analyzer, and Intel® Stress Bitstreams and Encoder can help.

 

 

Intel® Digital Random Number Generator (DRNG) Library Implementation and Uses


1. Introduction

Intel® Data Protection Technology with Intel® Secure Key provides two instructions:  RDRAND, which is useful for generating high-quality keys for cryptographic protocols, and RDSEED, which is for seeding software-based pseudo random number generators (PRNGs). More information about the Digital Random Number Generator (DRNG) in Secure Key is available at Intel® Digital Random Number Generator (DRNG) Software Implementation Guide  by John Mechalas.

The RDRAND library has been updated to include RDSEED support and has been renamed to the DRNG library. This document includes the following sections:

Section 2: Structure of DRNG Library and Prerequisites

Section 3: Changes in RDRAND APIs

Section 4: Design decisions for RDSEED APIs

Section 5: Sample user code to use DRNG library

Programmers who are new to the DRNG should first read reference 1, as this documentation only discusses the DRNG library implementation. 

 

2. Structure of DRNG Library and Prerequisites

The DRNG library provides easy access to the RDRAND and RDSEED instructions and is compatible with Linux*, Windows*, and OS X*. In addition, it can be built with the Intel® Compiler, GNU Compiler (GCC), and Microsoft Compiler in 32-bit as well as 64-bit machines.

2.1 Structure

The DRNG library contains files that make it compatible with multiple platforms. Short descriptions of all the files are given below:

Filename        Description
/lib            Folder with precompiled libraries
drng.h          The macros and function definitions used in the library
rdseed.c        The API for the RDSEED instruction
rdrand.c        The API for the RDRAND instruction
main.c          Program to test RDRAND and RDSEED API calls
configure       Used in Linux to generate the Makefile
configure.ac    GNU autoconfig file
drng.vcxproj    Visual Studio* 13 project file for DRNG library
drng.sln        Visual Studio 13 solution file for DRNG library
test.vcxproj    Visual Studio 13 test project
test.sln        Visual Studio 13 solution file for test project
Makefile.in     Makefile template for Linux and OS X systems

 

2.2 Prerequisites

Please make sure the following prerequisites are completed before using the library.

  • Hardware Prerequisite

    Although the DRNG library checks for the support of instructions, it is a good idea to check your system before downloading the library.

    DRNG instructions are available beginning with 3rd generation Intel® Core™ processors and Intel® Xeon® v2 processors.

  • Linux Software Prerequisite

    Make sure you are using GCC version 4.8.

  • Windows Software Prerequisite

    Visual Studio is required to build the test program in Windows [Visual Studio 13 is preferred]

    Reference 2 contains the details to download Visual Studio 13.

  • OS X Software Prerequisite

    Make sure you are using GCC version 4.8

  • Microsoft Compiler

    RDRAND intrinsics are available in Visual Studio 12 and RDSEED intrinsics are available in Visual Studio 13. Hence, it is highly recommended to use Visual Studio 13.

  • Intel Compiler

    Developers can install Intel Compiler version 15.0 for any platform to use DRNG library.

    Reference 3 contains the details about installing the Intel Compiler.

 

3. Changes in the RDRAND API

A few minor changes have been made in the RDRAND library to improve the efficiency of programs. These changes minimally affect users’ API calls.

3.1 The drng header file

The RDRAND header file (rdrand.h) has been replaced by the DRNG header file (drng.h). This change was made so that the new RDSEED API calls can share the existing macros and API calls of the RDRAND library.

Consequently, the macro names in drng.h have been changed accordingly. Some of the changes are listed below:

Old Name            New Name
RDRAND_SUCCESS      DRNG_SUCCESS
RDRAND_SUPPORTED    DRNG_SUPPORTED
RDRAND_NOT_READY    DRNG_NOT_READY
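
For legacy code that still references the old RDRAND_* names, a small compatibility shim can bridge the rename. The shim below is a hypothetical convenience for illustration only; it is not part of the shipped drng.h.

/* Hypothetical compatibility shim: map the old RDRAND_* macro names onto the
   new DRNG_* names so that legacy code keeps compiling. Not part of drng.h. */
#include "drng.h"

#ifndef RDRAND_SUCCESS
#define RDRAND_SUCCESS   DRNG_SUCCESS
#define RDRAND_SUPPORTED DRNG_SUPPORTED
#define RDRAND_NOT_READY DRNG_NOT_READY
#endif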

3.2 Caching of DRNG_SUPPORT

To increase the run time efficiency of the library, the number of calls to __cpuid() has been reduced. Now, the library will call __cpuid() only once during the program scope, and the result will be cached in a variable for further references.
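
The pattern is roughly the following; this is a minimal sketch for GCC/Clang on x86, and the variable and function names are illustrative rather than the library's actual internals:

/* Illustrative sketch of caching the CPUID probe so it runs only once.
   The names are hypothetical; the DRNG library's internals may differ. */
#include <cpuid.h>

static int rdrand_supported = -1;            /* -1 means "not probed yet" */

int check_rdrand_support(void)
{
    if (rdrand_supported == -1) {            /* CPUID query happens only once */
        unsigned int eax, ebx, ecx, edx;
        rdrand_supported = (__get_cpuid(1, &eax, &ebx, &ecx, &edx) &&
                            (ecx & (1u << 30))) ? 1 : 0;  /* CPUID.1:ECX bit 30 = RDRAND */
    }
    return rdrand_supported;                 /* cached result is reused afterwards */
}

RDSEED support would be probed the same way (CPUID leaf 7, EBX bit 18) and cached alongside.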

 

4. Design Decisions for RDSEED

One of the major differences in software implementation between RDSEED and RDRAND is the retry value. RDRAND is guaranteed to generate a random number within 10 retries on a working CPU. On the other hand, because the RDSEED instruction does not have a fairness mechanism built into it, there are no guarantees as to how often a thread should retry the instruction, or how many retries might be needed to obtain a random seed. In practice, the number of retries depends on the number of hardware threads on the CPU and how aggressively they are calling RDSEED.

4.1 Reentrant Code

To mitigate the uncertainty of the RDSEED instruction, the APIs have been designed to be reentrant. This helps the RDSEED instruction to attain 100% success rate without a busy wait.

To use this, a “skip” parameter has been introduced that needs to be passed while calling the RDSEED APIs. Code Snippet 1 illustrates the use.

Code Snippet 1

skip = 0;
skip = rdseed_get_n_32(RDSEED_CUTOFF, array32, skip, MAX_RETRY_LIMIT);

The API returns the total number of seeds generated by the call. So, when calling the API for the first time, the “skip” value should be zero, as no seeds have been generated yet.

If the desired number of seeds was generated from the first call, then the program can continue. Otherwise, the API needs to be called again with an updated value of “skip” until the desired number (RDSEED_CUTOFF) of seeds is generated. Code Snippet 2 illustrates the use. New seeds will be appended to the buffer without affecting the previously generated seeds.

Code Snippet 2

while (skip < RDSEED_CUTOFF)
{
    printf("\nRDSEED numbers %d uint32's:\n", skip);
    skip = rdseed_get_n_32(RDSEED_CUTOFF, array32, skip, MAX_RETRY_LIMIT);
}

4.2 Retry Value

A high “retry” value combined with a call pattern similar to the RDRAND library can keep the processor busy for a long time and hence decrease its efficiency. In the worst case, a retry value of 100 applied to generating ten 32-bit seeds can cause 1,000 retries.

As a solution to this, users can provide a MAX_RETRY_VALUE. This value will determine the total number of retries that will be made by the library if a RDSEED instruction fails to return a value. The scope of MAX_RETRY_VALUE will be a single call to the API. For a simple direct call to generate a single seed value, the API will retry MAX_RETRY_VALUE times to generate the seed. Code Example 1 shows this implementation.

Code Example 1

int rdseed_16(uint16_t* x, int retry)
{
	if (RdSeed_isSupported())
	{
		if (_rdseed16_step(x))
			return DRNG_SUCCESS;
		else
		{
			/* retry_counter is a file-scope variable in the library */
			retry_counter = retry;
			while (retry_counter > 0)
			{
				retry_counter--;
				if (_rdseed16_step(x))
					return DRNG_SUCCESS;
			}

			return DRNG_NOT_READY;
		}
	}

	return DRNG_UNSUPPORTED;
}

While using the rdseed_get_n_32(), rdseed_get_n_64(), or rdseed_get_bytes() APIs, MAX_RETRY_VALUE determines the maximum number of RDSEED instruction attempts in total; that is, the scope of MAX_RETRY_VALUE lasts until the API call exits. Hence the library controls the overall number of retries made within the call instead of passing a fresh retry value to each rdseed_32() or rdseed_64() call. Code Example 2 illustrates the implementation.

Code Example 2

int rdseed_get_n_32(unsigned int n,
                    uint32_t *dest,
                    unsigned int skip,
                    unsigned int retry)
{
	int success;
	unsigned int i;
	unsigned int success_count = 0;
	retry_counter = retry;

	if (skip)
	{
		n = n - skip;
		dest = &(dest[skip]);
		success_count = skip;
	}


	for (i = 0; i<n; i++)
	{
		success = rdseed_32(dest, retry_counter);
		if (success != DRNG_SUCCESS)
        return ((success == DRNG_UNSUPPORTED) ? success : success_count);
		dest = &(dest[1]);
		success_count++;
	}
	return success_count;
}

4.3 Windows Compiler

The Intel Compiler and GNU Compiler allow rdrand_64() and rdseed_64() even in a 32-bit compilation by concatenating two 32-bit values together. The Windows Compiler does not include the 64-bit API calls in its 32-bit compiler intrinsics, hence calls to the 64-bit API should be done using “#ifndefs” or other equivalent solutions.
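
A sketch of how such a guard might look is shown below. The helper function and its 32-bit fallback are illustrative, not part of the library; the fallback simply mirrors the two-32-bit-value concatenation described above.

/* Illustrative guard for 32-bit Windows builds, where rdseed_64() is not
   available because MSVC lacks the 64-bit intrinsic in 32-bit mode. */
#include <stdint.h>
#include "drng.h"

#define MAX_RETRY_LIMIT 75

int get_seed64(uint64_t *out)
{
#if defined(_MSC_VER) && !defined(_M_X64)
    /* 32-bit MSVC: compose a 64-bit seed from two 32-bit seeds instead */
    uint32_t lo, hi;
    if (rdseed_32(&lo, MAX_RETRY_LIMIT) == DRNG_SUCCESS &&
        rdseed_32(&hi, MAX_RETRY_LIMIT) == DRNG_SUCCESS) {
        *out = ((uint64_t)hi << 32) | lo;
        return DRNG_SUCCESS;
    }
    return DRNG_NOT_READY;
#else
    return rdseed_64(out, MAX_RETRY_LIMIT);
#endif
}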

 

5. Sample User Code

The RdRand_isSupported() and RdSeed_isSupported() APIs can be used to determine whether the system supports the RDRAND and RDSEED instructions. If the function returns 1, the instruction is supported by the system; if the return value is 0, the instruction is not supported.
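
A minimal check might look like this, assuming drng.h is on the include path and declares these two helpers:

#include <stdio.h>
#include "drng.h"

int main(void)
{
    /* Report which of the two instructions the CPU provides */
    if (!RdRand_isSupported())
        printf("RDRAND is not supported on this CPU\n");
    if (!RdSeed_isSupported())
        printf("RDSEED is not supported on this CPU\n");
    return 0;
}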

5.1 Inline Assembly

Code Example 3 shows the implementation for 16-, 32-, and 64-bit invocations of RDSEED using inline assembly.

Code Example 3

int rdseed16_step(uint16_t *seed)
{
		unsigned char ok;

		asm volatile ("rdseed /* .byte 0x66; .byte 0x0f; .byte 0xc7 */ %0; setc %1"
				: "=r" (*seed), "=qm" (ok));

		return (int)ok;
}

int rdseed32_step(uint32_t *seed)
{
		unsigned char ok;

		asm volatile ("rdseed /* .byte 0x0f; .byte 0xc7; .byte 0xf8 */ %0; setc %1"
				: "=r" (*seed), "=qm" (ok));

		return (int)ok;
}

int rdseed64_step(uint64_t *seed)
{
		unsigned char ok;

		asm volatile ("rdseed /* .byte 0x48; .byte 0x0f; .byte 0xc7; .byte 0xf8 */ %0; setc %1"
				: "=r" (*seed), "=qm" (ok));

		return (int)ok;
}

5.2 A Simple Call

Code Example 4 shows the implementation for 16-, 32-, and 64-bit invocations of RDSEED using the DRNG library. MAX_RETRY_LIMIT sets the number of retries that will be made before the API call returns failure. The function will return DRNG_SUCCESS (= 1) if successful; otherwise it will return an error.

Code Example 4

/* Maximum retry value of rdseed instruction in a single call*/
#define MAX_RETRY_LIMIT 75

uint16_t u16;
uint32_t u32;
uint64_t u64;

r = rdseed_16(&u16, MAX_RETRY_LIMIT);
if (r != DRNG_SUCCESS) printf("rdseed instruction failed with code %d\n", r);
printf("RDSEED uint16: %u\n", u16);

r = rdseed_32(&u32, MAX_RETRY_LIMIT);
if (r != DRNG_SUCCESS) printf("rdseed instruction failed with code %d\n", r);
printf("RDSEED uint32: %u\n", u32);

r = rdseed_64(&u64, MAX_RETRY_LIMIT);
if (r != DRNG_SUCCESS) printf("rdseed instruction failed with code %d\n", r);
printf("RDSEED uint64: %llu\n", u64);

5.3 Generating n 32- or 64-bit Seeds

Code Example 5 illustrates the use of rdseed_get_n_32 and rdseed_get_n_64. These functions generate 32- or 64-bit seeds and store them in a buffer of size ‘n’. As explained in Section 4, these calls are reentrant, and the results are appended to the buffer if the call is resumed. Also, MAX_RETRY_LIMIT has the scope of one call. The API call returns the total number of seeds generated by the call.

Code Example 5

/* Size in bytes of the output buffer used with rdseed_get_bytes */
#define BUFFSIZE 1275

/* Minimum number of results required*/
#define RDSEED_CUTOFF 10

/* Maximum retry value of rdseed instruction in a single call*/
#define MAX_RETRY_LIMIT 75

	int i;
	int r;
	uint32_t array32[10];
	uint64_t array64[10];
    unsigned char buffer[BUFFSIZE];

		r = 0;
		r = rdseed_get_n_32(RDSEED_CUTOFF, array32, r, MAX_RETRY_LIMIT);
		if (r == DRNG_UNSUPPORTED)
			printf("RDSEED is not supported by system\n");

		while (r < RDSEED_CUTOFF)
		{
			printf("\nRSEED numbers %d uint32's:\n", r);
			r = rdseed_get_n_32(RDSEED_CUTOFF, array32, r, MAX_RETRY_LIMIT);
		}

		printf("\nRDSEED numbers %d uint32's:\n", r);
		for (i = 0; i < RDSEED_CUTOFF; ++i) {
			printf("%u\n", array32[i]);
		}

		r = 0;
		r = rdseed_get_n_64(RDSEED_CUTOFF, array64, r, MAX_RETRY_LIMIT);
		if (r == DRNG_UNSUPPORTED)
			printf("RDSEED is not supported by system\n");

		while (r < RDSEED_CUTOFF)
		{
			printf("\nRDSEED numbers %d uint64's:\n", r);
			r = rdseed_get_n_64(RDSEED_CUTOFF, array64, r, MAX_RETRY_LIMIT);
		}


		printf("\nRDSEED numbers %d uint64's:\n", r);
		for (i = 0; i < RDSEED_CUTOFF; ++i) {
			printf("%llu\n", array64[i]);
		}

5.4 Generating n Bytes of Seed

Similarly, Code Example 6 illustrates the use of rdseed_get_bytes. This function can be used to generate a seed buffer n bytes long. As explained in Section 4, these calls are reentrant, and the results are appended to the buffer if the call is resumed. Also, MAX_RETRY_LIMIT has the scope of one call. The API call returns the number of seed bytes generated by the call.

Code Example 6

memset(buffer, 0, BUFFSIZE);
		r = 0;
		r = rdseed_get_bytes(BUFFSIZE, buffer, r, MAX_RETRY_LIMIT);
		if (r == DRNG_UNSUPPORTED)
			printf("RDSEED is not supported by system\n");

		else
		{

			while (r < BUFFSIZE)
			{
				printf("\nRDSEED generated %d bytes:\n", r);
				r = rdseed_get_bytes(BUFFSIZE, buffer, r, MAX_RETRY_LIMIT);
			}

			int i, j;
			printf("\nTotal generated RDSEED Buffer of %d bytes:\n", r);

			j = 0;
			for (i = 0; i < BUFFSIZE; ++i)
			{
				printf("%02x ", buffer[i]);

				++j;

				if (j == 16) {
					j = 0;
					printf("\n");
				}
				else if (j == 8) printf("");

			}
			printf("\n");
		}

 

Summary

The DRNG library provides random numbers with excellent statistical qualities, highly unpredictable random sequences, and high performance thanks to Intel® Data Protection Technology with Intel® Secure Key. The library is very portable as it is designed to be used on any platform with multiple compilers. Accessible via 2 simple instructions and 12 user-friendly APIs, it is also very easy to use.

References

  1. Intel® Digital Random Number Generator (DRNG) Software Implementation Guide. John Mechalas.

    https://software.intel.com/en-us/articles/intel-digital-random-number-generator-drng-software-implementation-guide
  2. Installing Visual Studio. Microsoft Corp.

    https://msdn.microsoft.com/en-us/library/e2h7fzkw.aspx
  3. Intel Compilers for Windows Silent Installation Guide. Steven Lionel.

    https://software.intel.com/en-us/articles/intel-compilers-for-windows-silent-installation-guides

    Intel Compilers for Linux and OS X* Compiler Installation Help. Ronald W Green.

    https://software.intel.com/en-us/articles/intel-compilers-for-linux-and-mac-os-x-compiler-installation-help

PhonoPaper Optimization using Intel® tools


Download PDF

PhonoPaper is a technology and an application that allows you to play back audio that has been converted into a picture of a special format, called a spectrogram, printed on paper or any other surface. The process is roughly as follows: 10 seconds of audio (voice, song excerpt, etc.) are converted into a picture of a special format. The printed picture can, for example, be stuck to a wall. Passersby, noticing the code, launch the PhonoPaper scanner on their phones, aim the camera at the image, and in an instant begin to hear the sound encoded in it. The user is fully involved in the process—the direction and speed of playback depend on the movement of the user's hands (although there is also an automatic mode). All the necessary information is stored in the image, and Internet access is not required.

An example of Phonopaper in action:

PhonoPaper aroused keen interest among musicians, artists, and fans of unusual experiments. In the 3rd quarter of 2014, the application took first place in the Intel Developer Rating project on the Apps4All.ru website. Thanks to an x86-based Android* tablet Intel loaned me for testing, the app has been improved and optimized. Here, I’ll show how I optimized PhonoPaper.

Video capture

The first thing I did was to connect the Intel® INDE Media for Mobile set of libraries, specifically the GLCapture class, which captures video from an OpenGL ES* surface in real time (in HD quality). Why is this necessary? First, the process of capturing and playing PhonoPaper codes is a fun, exciting spectacle, like playing an unusual musical instrument. Second, PhonoPaper can operate in free mode, when any camera image, such as a carpet or a cat, is converted to sound indiscriminately. Both would be great to record and upload to YouTube*.

Free mode example: any camera image is perceived as a sound spectrum.

PhonoPaper codes, drawn by hand


The process of connecting GLCapture has been described in several articles (1, 2). I'll just mention a few points that are good to know before starting.

You should use Android version 4.3 or later. For older devices, I created a cross-platform MJPEG recorder, whose speed and quality are of course much inferior to the hardware-accelerated GLCapture, which writes MP4. But still, it gets the job done.

The application must be built on the basis of OpenGL ES 2.0. My program historically used version 1.1, so I had to rewrite the code. But the transition to OpenGL ES 2.0 ultimately had a positive impact on productivity, as it gave the opportunity to manually adjust the shaders.

GLCapture can write sound from a microphone. This is good if you want video that accompanies your comments. If you need high-quality sound directly from the application, you should record it in a separate file and then combine it with the mp4. For combining them, you can use MediaComposer with the SubstituteAudioEffect effect from the Media for Mobile set. Another way is recording to WAV, encoding WAV to AAC, and adding the AAC track to the mp4 file using the mp4parser library.

Since PhonoPaper is written in the Pixilang programming language, the video capture function can later be used with other Pixilang-based applications (PixiTracker, PixiVisor, Nature - Oscillator, Virtual ANS) and, most importantly, will be available to all developers using Pixilang. At the same time, access is very easy (there are only a few options: start capture, stop, and save).

Pixilang is an open cross-platform programming language, customized to work with sound and graphics. The language syntax is highly minimalistic and is a hybrid of BASIC and C, which together with other features (the ability to write code without functions and universal containers for storing any data) reduces the entry threshold.

Intel® C++ Compiler and optimization

The next step was to build the x86 Android version of PhonoPaper using the Intel® C++ Compiler and compare the results with GCC 4.8. I use Debian Linux*, and a rather old version at that. Therefore, the first problem was to find the appropriate version of the Intel C++ Compiler. Fortunately, the right installation package was found – Intel® System Studio 2015. Despite a warning during installation, everything worked well and the first build was successful.

The compilation uses the following flags: -xATOM_SSSE3 -ipo -fomit-frame-pointer -fstrict-aliasing -finline-limit=300 -ffunction-sections -restrict. To test the performance of the Pixilang virtual machine (it is the basis of all my applications), small tests were written; their sources and results can be viewed in this archive (zip). As a result, even without any special preparation, some code fragments were accelerated 5-fold(!). This is quite impressive!

In PhonoPaper, most of the load comes from the spectral synthesizer function (wavetable-based, not FFT) – wavetable_generator(). For it, a separate test was written that renders an audio stream with a random spectrum for four seconds. At the end of the test, the highest achievable sampling frequency is reported. Unfortunately, the Intel C++ Compiler did not show a significant gain here: 105 kHz compared to 100 kHz on GCC. By adding the -qopt-report=2 flag during compilation, this message displays in the report:

loop was not vectorized: vector dependence prevents vectorization.

The main loop within our function could not be vectorized because the input data pointers can point to overlapping memory areas:

int* amp = (int*)amp_cont->data

int* amp_delta = (int*)amp_delta_cont->data;


Looking at the code, I can see that no overlap is possible at this point; I just need to tell the compiler. In C/C++ there is a special keyword, restrict, stating that the declared pointer refers to a block of memory that is not pointed to by any other pointer. Therefore, I replace the above code with this:

int* restrict amp = (int*)amp_cont->data;

int* restrict amp_delta = (int*)amp_delta_cont->data;


Then I build the application and see that the loop is successfully vectorized. With some additional changes (in the process, it turned out that a few bit operations could be eliminated), the result is now 190 kHz. With the same amendments, GCC gave 130 kHz—a 1.46-fold performance increase!
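
As a standalone illustration of the same pattern (this is not the PhonoPaper wavetable_generator() code, just the idea in isolation):

#include <stddef.h>

/* With both pointers declared restrict, the compiler may assume the arrays
   do not overlap and is free to vectorize the loop. */
void accumulate(int *restrict amp, const int *restrict amp_delta, size_t n)
{
    for (size_t i = 0; i < n; i++)
        amp[i] += amp_delta[i];
}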

What is next?

As you can see, the results are very positive. PhonoPaper now runs faster (thanks largely to the Intel C++ Compiler) and has been extended with video capture functionality. In addition, video recording will be exposed through a few simple functions in the upcoming Pixilang 3.6 update.

This optimization even makes it practical to do things like recording, printing, and then playing back voice codes.

There’s lots to explore with PhonoPaper!

About the Author

Alexander Zolotov is a software developer, demoscener, composer, and sound and graphics designer. He is the owner of WarmPlace.ru website. He developed several audio-visual applications: SunVox (modular synth and tracker), Virtual ANS, PhonoPaper, PixiTracker, and Pixilang programming language.


OpenCV 3.0.0 ( IPP & TBB enabled ) on Yocto with Intel® Edison with new Yocto image release


For OpenCV 3.0.0 Beta, please see this article.

The following article is for OpenCV 3.0.0 and Intel(R) Edison with the latest (ww25) Yocto Image.

< Overview >

 This article is a tutorial for setting up OpenCV 3.0.0 on Yocto with Intel® Edison. We will build OpenCV 3.0.0 on Edison Breakout/Expansion Board using a Windows/Linux host machine.

 In this article, we will enable Intel® Integrated Performance Primitives ( IPP ) and Intel® Threading Building Blocks ( TBB ) to optimize and parallelize some OpenCV functions. For example, cvHaarDetectObjects(...) , an OpenCV function that detects objects of different sizes in the input image, is parallelized with the TBB library. By doing this, we can fully utilize the dual-core of Edison.

1. Prepare the new Yocto image for your Edison

Go to Intel(R) Edison downloads and download 'Release 2.1 Yocto* complete image' and the 'Flash Tool Lite' that matches your OS. Then refer to the Flash Tool Lite User Manual for Edison to flash the new image. With this new release you don't need to manually enable UVC for webcams, you will have enough storage space for OpenCV 3.0.0, and CMake is already included. To enable UVC by customizing the Linux kernel, or to change the partition settings for a different space configuration, refer to steps 2 and 3 of the previous article.

2. Setup root password and WiFi for ssh and FTP

  Follow Edison Getting Started to connect your host and Edison as you want.

Set up any FTP method for transferring files from your host to your Edison. (For easy file transfer, MobaXterm is recommended for Windows hosts.)

3. OpenCV 3.0.0

When you check your space with 'df -h', you will see a result very similar to the following.

Go to the OpenCV official page and download OpenCV on your host machine. When the download is done, copy the zip file to your Edison through FTP. It is recommended to copy OpenCV to '/home/<User>' and work there, since '/home' has more than 1 GB of free space.

 Unzip the downloaded file by typing 'unzip opencv-3.0.0.zip' and check if your opencv folder is created.

Go to <OpenCV DIR>, type 'cmake .', and take a look at the available options.

We will enable IPP and TBB for better performance. The libraries for IPP and TBB will be downloaded automatically when the corresponding flags are turned on.

 Now, on Edison, go to <OpenCV DIR> and type ( do not forget '.' at the end of the command line )

 root@edison:<OpenCV DIR># cmake -D WITH_IPP=ON -D WITH_TBB=ON -D BUILD_TBB=ON -D WITH_CUDA=OFF -D WITH_OPENCL=OFF -D BUILD_SHARED_LIBS=OFF -D BUILD_PERF_TESTS=OFF -D BUILD_TESTS=OFF .

which turns on the IPP and TBB flags and turns off irrelevant features to keep things simple. With 'BUILD_SHARED_LIBS=OFF', the resulting executables can run without OpenCV installed, which is useful for distribution. (If you don't want IPP and TBB, use WITH_TBB=OFF and WITH_IPP=OFF.)

 In the configuration result, you should see IPP and TBB are enabled.

If you observe no problems, then type

 root@edison:<OpenCV DIR># make -j2

The build will take a while to complete (30 minutes to 1 hour).

If you encounter an 'undefined reference to symbol 'v4l2_munmap' ... libv4l2.so.0 : error adding symbols: DSO missing from command line' error while building OpenCV or the OpenCV samples later, you need to add '-lv4l2' after '-lv4l1' in the corresponding configuration files. This error can occur in more than 50 files, so it's easier to fix them all with a single command:

root@edison:<OpenCV DIR># grep -rl -- -lv4l1 samples/* modules/* | xargs sed -i 's/-lv4l1/-lv4l1 -lv4l2/g'

 

When the build is done, install the result by typing

 root@edison:<OpenCV DIR># make install

 

4. Making applications with OpenCV 3.0.0

 

The easiest way to make a simple OpenCV application is to use the samples that come with the package. Go to '<OpenCV DIR>/samples' and type

 root@edison:<OpenCV DIR>/samples# cmake .

then it will configure and get ready to compile and link the samples. Now you can replace one of the sample code files in 'samples/cpp' and build it using cmake. For example, we can replace 'facedetect.cpp' with our own code. Now at '<OpenCV DIR>/samples' type

 root@edison:<OpenCV DIR>/samples# make example_facedetect

then the build will run automatically and the output file will be placed in 'samples/cpp'.


 

One more thing: since Edison does not have video out, an error will occur if you call display functions such as 'imshow', which creates a window and displays an image or video on the screen. Therefore, before you build samples or examples that include those functions, comment them out.

 

 

 

 

Tutorial: Using Intel® INDE GPA to improve the performance of your Android* game

$
0
0

Download PDF

Introduction

This tutorial presents a step-by-step guide to performance analysis, bottleneck identification, and rendering optimization of an OpenGL ES* 3.0 application on Android*. The sample application, entitled “City Racer,” simulates a road race through a stylized urban setting.  Performance analysis of the application is done using the Intel® INDE Graphics Performance Analyzers (Intel® INDE GPA) tool suite.

City Racer Icon
The combined city and vehicle geometry consists of approximately 230K polygons (690K vertices) with diffuse mapped materials lit by a single non-shadow casting directional light.  The provided source material includes the code, project files, and art assets required to build the application, including the source code optimizations identified throughout this tutorial.

 

Acknowledgements

This tutorial is an Android and OpenGL ES 3.0 version of the Intel Graphics Performance Workshop for 3rd Generation Intel® Core™ Processor (Ivy Bridge) (PDF) created by David Houlton.  It ships with Intel INDE GPA.

Tutorial Organization

This tutorial guides you through four successive optimization steps.  At each step the application is analyzed with Intel INDE GPA to identify specific performance bottlenecks.  An appropriate optimization is then toggled within the application to overcome the bottleneck and it is analyzed again to measure the performance gained.  The optimizations applied are generally in line with the guidelines provided in the Developer’s Guide for Intel® Processor Graphics (PDF).

Over the course of the tutorial, the applied optimizations improve the rendering performance of City Racer by 83%.

Prerequisites

 

City Racer Sample Application

City Racer is logically divided into race simulation and rendering subcomponents.  Race simulation includes modeling vehicle acceleration, braking, turning parameters, and AI for track following and collision avoidance.  The race simulation code is in the track.cpp and vehicle.cpp files and is not affected by any of the optimizations applied over the course of this tutorial.

The rendering component consists of drawing the vehicles and scene geometry using the OpenGL ES 3.0 and our internally developed CPUT framework.  The initial version of the rendering code represents a first-pass effort, containing several performance-limiting design choices.

Mesh and texture assets are loaded from the Media/defaultScene.scene file.  Individual meshes are tagged as either pre-placed scenery items, instanced scenery with per-instance transformation data, or vehicles for which the simulation provides transformation data.  There are several cameras in the scene:  one follows each car and an additional camera allows the user to freely explore the scene.  All performance analysis and code optimizations are targeted at the vehicle-follow camera mode.

For the purpose of this tutorial, City Racer is designed to start in a paused state, which allows you to walk through each profiling step with identical data sets.  City Racer can be unpaused by unchecking the Pause check box in the City Racer HUD or by setting g_Paused = false at the top of CityRacer.cpp.

 

Optimization Potential

Consider the City Racer application as a functional but non-optimized prototype.  In its initial state it provides the visual result desired, but not the rendering performance.  It has a number of techniques and design choices in place that are representative of those you’d find in a typical game-in-development that limits the rendering performance.  The goal of the optimization phase of development is to identify the performance bottlenecks one by one, make code changes to overcome them, and measure the improvements achieved.

Note that this tutorial addresses only a small subset of all possible optimizations that could be applied to City Racer.  In particular, it only considers optimizations that can be applied completely in source code, without any changes to the model or texture assets.  Other asset-changing optimizations are excluded here simply because they become somewhat cumbersome to implement in tutorial format, but they can be identified using Intel INDE GPA tools and should be considered in a real-world game optimization.

Performance numbers shown in this document were captured on an Intel® Atom™ processor-based system (codenamed Bay Trail) running Android.  The numbers may differ on your system, but relative performance relationships should be similar and logically lead to the same performance optimizations.

The optimizations to be applied over the course of the tutorial are found in CityRacer.cpp. They can be toggled through City Racer’s HUD or through direct modification in CityRacer.cpp.

CityRacer.cpp

bool g_Paused = true;
bool g_EnableFrustumCulling = false;
bool g_EnableBarrierInstancing = false;
bool g_EnableFastClear = false;
bool g_DisableColorBufferClear = false;
bool g_EnableSorting = false;

They are enabled one by one as you progress through the optimization steps.  Each variable controls the substitution of one or more code segments to achieve the optimization for that step of the tutorial.

 

Optimization Tutorial

The first step is to build and deploy City Racer on an Android device.  If your Android environment is set up correctly, the buildandroid.bat file located in CityRacer/Game/Code/Android will perform these steps for you. 

Next, launch Intel INDE GPA Monitor, right click the system tray icon, and select System Analyzer.

System Analyzer will show a list of possible platforms to connect to. Choose your Android x86 device and press “Connect.”

System Analyzer - Choose your Android x86 device

When System Analyzer connects to your Android device, it will display a list of applications available for profiling. Choose City Racer and wait for it to launch.

System Analyzer - a list of applications available for profiling

While City Racer is running, press the frame capture button to capture a snapshot of a GPU frame to use for analysis.

Capture a snapshot of a GPU frame to use for analysis

Examine the Frame

Open Frame Analyzer for OpenGL* and choose the City Racer frame you just captured, which will allow you to examine GPU performance in detail.

Open Frame Analyzer for OpenGL* to examine GPU performance

The timeline corresponds to an OpenGL draw call

The timeline at the top is laid out in equally spaced ‘ergs’ of work, each of which usually corresponds to an OpenGL draw call.  For a more traditional timeline display, select GPU Duration on the X and Y axis. This will quickly show us which ergs are consuming the most GPU time and where we should initially focus our efforts.  If no ergs are selected, then the panel on the right shows our GPU time for the entire frame, which is 55ms.

GPU duration

Optimization 1 – Frustum Culling

When viewing all of the draws, we can see that there are many items drawn that are not visually apparent on the screen.  By changing the Y-axis to Post-Clip Primitives the gaps in this view serve to point out which draws are wasted because the geometry is entirely clipped.

A view-frustum culling routine

The buildings in City Racer are combined into groups according to spatial locations. We can cull out the groups not visible and thus eliminate the GPU work associated with them. By toggling the Frustum Culling check box, each draw will be run through a view-frustum culling routine on the CPU before being submitted to the GPU.
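
As a rough sketch of what such a CPU-side test can look like (a generic sphere-versus-frustum check, not the actual City Racer/CPUT code):

/* Generic sphere-vs-frustum test, for illustration only.
   planes[6][4] holds the frustum planes as (a, b, c, d) with normals pointing
   inward; a draw is skipped when its bounding sphere lies fully outside. */
typedef struct { float x, y, z, radius; } BoundingSphere;

int sphere_in_frustum(const float planes[6][4], const BoundingSphere *s)
{
    for (int i = 0; i < 6; i++) {
        float dist = planes[i][0] * s->x + planes[i][1] * s->y +
                     planes[i][2] * s->z + planes[i][3];
        if (dist < -s->radius)
            return 0;   /* completely outside this plane: cull the draw */
    }
    return 1;           /* at least partially inside: submit the draw */
}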

Turn on the Frustum Culling check box and use System Analyzer to capture another frame.  Once the frame is captured, open it again in Frame Analyzer.

Frame Analyzer after frustum culling option enabled

By viewing this frame we can see the number of draws is reduced by 22% from 740 to 576 and our overall GPU time is reduced by 18%.

Frustum Culling Draw calls

Frustum Culling GPU duration

Optimization 2 – Instancing

While frustum culling reduced the overall amount of ergs, there are still a great number of small ergs (highlighted in yellow) which, when taken cumulatively, add up to a non-trivial amount of GPU time.

A non-trivial amount of GPU time

By examining the geometry for these ergs we can see the majority of them are the concrete barriers which line the sides of the track.

Concrete barriers which line the sides of the track

We can eliminate much of the overhead involved in these draws by combining them into a single instanced draw.  By toggling the Barrier Instancing check box the barriers will be combined into a single instanced draw thus removing the need for the CPU to submit each one of them via a draw to the GPU.
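
In OpenGL ES 3.0 terms, collapsing those draws comes down to a single instanced call along these lines (a sketch, not the sample's actual code; the buffer setup and the names barrierIndexCount and barrierInstanceCount are assumed):

#include <GLES3/gl3.h>

/* Sketch: submit every barrier with one instanced draw. The vertex shader
   reads each barrier's transform via gl_InstanceID (e.g., from a uniform
   buffer or an instanced vertex attribute). */
void draw_barriers_instanced(GLuint vao, GLsizei barrierIndexCount,
                             GLsizei barrierInstanceCount)
{
    glBindVertexArray(vao);
    glDrawElementsInstanced(GL_TRIANGLES, barrierIndexCount,
                            GL_UNSIGNED_SHORT, 0, barrierInstanceCount);
    glBindVertexArray(0);
}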

Turn on the Barrier Instancing check box and use System Analyzer to capture another frame.  Once the frame is captured, open it with Frame Analyzer.

Frame Analyzer - after barrier instancing enabled

By viewing this frame we can see the number of draws is reduced by 90% from 576 to 60.

Draw calls before concrete barrier instancing

Draw calls after concrete barrier instancing

Draw calls before concrete barrier instancing (top) and after instancing (bottom)

Additionally, the GPU duration is reduced by 71% to 13ms.

Instancing gpu duration

Optimization 3 – Front to Back Sorting

The term “overdraw” refers to writing to each pixel multiple times; this can impact pixel fill rate and increase frame rendering time.  Examining the Samples Written metric shows us that each pixel is being written approximately 1.8 times per frame (Samples Written / Resolution).

Output Merger

Sorting the draws from front to back before rendering is a relatively straightforward way to reduce overdraw because the GPU pipeline will reject any pixels occluded by previous draws.
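
A sketch of such a sort, assuming each draw record carries a camera-space depth (the DrawRecord type and its fields are placeholders, not the sample's types):

#include <stdlib.h>

typedef struct { float depth; int meshId; } DrawRecord;

/* Nearer draws compare as smaller, so qsort puts them first. */
static int compare_depth(const void *a, const void *b)
{
    const DrawRecord *da = (const DrawRecord *)a;
    const DrawRecord *db = (const DrawRecord *)b;
    if (da->depth < db->depth) return -1;
    if (da->depth > db->depth) return  1;
    return 0;
}

void sort_front_to_back(DrawRecord *draws, size_t count)
{
    qsort(draws, count, sizeof(DrawRecord), compare_depth);
}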

Turn on the Sort Front to Back check box and use System Analyzer to capture another frame.  Once the frame is captured, open it with Frame Analyzer.

Frame Analyzer after enabling sort front to back

By viewing this frame we can see the Samples Written metric decreased by 6% and our overall GPU time is reduced by 8%.

Output Merger after enabling sort front to back

GPU duration after enabling sort front to back

 

Optimization 4 – Fast Clear

A final look at our draw times shows the first erg is taking the longest individual GPU time.  Selecting this erg reveals that it’s not a draw but a glClear call.

First erg taking the longest individual GPU time

glClear call

Intel’s GPU hardware has an optimization path that performs a ‘fast clear’ in a fraction of the time it takes a traditional clear.  A fast clear can be performed by setting the glClearColor to all black or all white (0, 0, 0, 0 or 1, 1, 1, 1).
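
For example (a minimal sketch; the sample toggles this inside its own framework code):

#include <GLES3/gl3.h>

/* An all-zero (or all-one) clear color lets the driver take the fast-clear path. */
void clear_frame(void)
{
    glClearColor(0.0f, 0.0f, 0.0f, 0.0f);
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
}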

Turn on the Fast Clear check box and use System Analyzer to capture another frame.  Once the frame is captured, open it with Frame Analyzer.

Frame Analyzer after enabling fast clear

By viewing this frame we can see the GPU duration for the clear has decreased by 87% over the regular clear, from 1.2ms to 0.2ms.

GPU duration for the clear decreased

GPU duration for the clear decreased

As a result, the overall frame duration of the GPU is decreased by 24% to 9.2ms.

The overall frame duration of the GPU decreased

 

Conclusion

This tutorial has taken a representative early-stage game application and used the Intel INDE GPA to analyze application behavior and make targeted changes to improve performance.  The changes made and improvements realized were:

Optimization                 Before     After     % Improvement
Frustum Culling              55.2 ms    45.0 ms   18%
Instancing                   45.0 ms    13.2 ms   71%
Sorting                      13.2 ms    12.1 ms   8%
Fast Clear                   12.1 ms    9.2 ms    24%
Overall GPU Optimizations    55.2 ms    9.2 ms    83%

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance.

Overall, from the initial implementation of City Racer to the best optimized version, we demonstrate rendering performance improvement of 300%, from 11 fps to 44 fps.  Since this implementation starts out significantly sub-optimal, a developer applying these techniques will probably not see the same absolute performance gain on a real-world game.

Nevertheless, the primary goal of this tutorial is not the optimization of this specific sample application, but the potential performance gains you can find by following the recommendations in Developer’s Guide for Intel Processor Graphics and the usefulness of Intel INDE GPA in finding and measuring those improvements.

Porting Guide for Unity* Game on Intel® Architecture for China Market


Download PDF

Overview

Unity* software is one of the most popular game engines for the mobile environment (Android* and iOS*). As technology improves, especially as GPUs in mobile chips get faster, players are demanding more 3D mobile games. According to Wikipedia, there were over 1.2 billion mobile device users in China in 2014, nearly 4 times that of the United States. With the growth in mobile market share of Intel® processors in the People’s Republic of China (PRC), developers want to know how to enable their Unity-based games for Intel® Architecture on Android phones/tablets in China, a unique booming market where apps are not distributed or sold through Google Play. China has some unique situations that require game developers to take certain measures when developing and porting their games for this most active and fast-growing country, which this article covers.

 

General porting guide for a Unity game

It is very easy to port an ARM*-based Unity game to Intel Architecture if the game doesn’t have any plugins. We will show you how.

First, download Unity version 4.6 or higher. Then open your game project with Unity. On the File menu select Build settings. You will see the window below.

Unity game - Project Build settings

After choosing Android and clicking the Player Settings button, a configuration window will be shown as below.

Unity Build settings - Android player settings

Make sure the device filter option is FAT or x86, then build the project. You will get an APK that supports x86 native.

However, we’re not done yet. Mobile games are a bit more complicated because they contain third-party plugins for performing various tasks.

 

The impact of plugins on game porting

Most Unity games use plugins that provide added-value services. In China, the plugins tend to be the ones listed below:

Plugin Type              Comments
Payment SDK              In-app purchase
Security SDK             Protects the app against decompiling
Exception Handler SDK    Debug the game remotely
Advertisement SDK        Provides advertisements within the game
Platform Access SDK      Provides account services for online games
Data Statistics SDK      Collects user information for a back-end server
Cloud Push SDK           Pushes notifications from a server

More complete descriptions of each plugin are given below.

Payment SDK

Many independent software vendors (ISVs) focus on revenue collection in game development. An effective way to do this is in-app purchase, which needs a payment plugin. In China there are lots of payment vendors, such as Alipay, Caifutong, and Wechat payment, to name a few. Additionally, telecom operators such as China Mobile, China Unicom, and China Telecom have payment plugins for ISVs.

Security SDK

Most game code for the Android platform is developed in Java*, which is compiled to a DEX file. This file is an easy target for decompile hacks, so security protection for the app is needed. Ijiami, 360, Bangbang, and Najia are all major security vendors in the China market.

Exception handler SDK

Debugging software for Android usually involves checking logs, setting breakpoints, and observing running parameters step by step. When a crash happens, it can be more effective to dump the crash information, especially after the application has been released. This information includes platform registers and the call stack. Some ISVs integrate exception handlers into their games for tracking this log information. As an example, the PRC ISV Tencent developed an SDK called Bugly that collects exceptions from all of Tencent's applications and games.

Advertisement SDK

Like in-app purchases, advertising integrated in games is yet another business model for game developers to make revenue.

Platform Access SDK

Many games, especially online games, want players to have accounts in order to record their progress and store scores in back servers. So popular PRC account systems like QQ and Wechat have been embedded in games.

Data Statistics SDK

Data Statistics SDKs record game players' statistics, which developers can then use to optimize and modify the game.

Cloud Push SDK

This SDK sends notification to users from a server.

When enabling an Android game for x86 native, all related SDKs must be ported to x86 before being plugged into Unity; otherwise, the game will not run well on Intel Architecture-based platforms.

Here is a good example of a Unity plugin enabled for x86 native.

Unity plugin to enable to x86 native

 

Case Study: We Fire

We Fire is the first 3D first person shooter mobile game from Tencent Lightspeed Studio. It has more than 137 million registered users and 26 million active users per month; it’s one of the most popular mobile games in China right now.

The ARM version of We Fire has 11 libraries in the lib/armeabi-v7a folder, but only three libraries, libunity.so, libmain.so, and libmono.so, are actually responsible for the gaming rendering. The others are all plugins. For example, libtpnsSecurity.so is a plugin SDK for game security with anti-theft and anti-reverse engineering.

We fire - ARM version has 11 libraries

First, we need to port all libraries in this folder to x86 native like below.

We Fire - porting 11 libraries to x86 native

However, the game crashed when this APK was installed on x86 Android 5.0 due to a libwbsafeedit library, which is an ARM binary hidden in the assets folder.

We Fire - ARM binary hidden in the assets folder

When we enabled this library for x86 as below, the app worked well on an Intel Architecture-based device.

We Fire - app works on an intel architecture based device

But not all libraries can be ported to x86 native easily. Take the library libBugly.so as an example. It is responsible for tracing crash information and uploading it to the cloud. It also gets stack information from NDK C/C++ code, so it involves assembly functions specific to the platform. Before it could be compiled to x86 native successfully, parts of the code had to be rewritten from ARM to x86.

 

Performance Comparison

We ran some performance comparisons between the ARM version and x86 version on an x86 device. The results are below.

Performance Comparison between the ARM version and x86 version

After porting, average power consumption improved by about 10% compared to running via Native Bridge, and CPU and RAM utilization were reduced by 26% and 21%, respectively.

In addition, the APK size of Fat (ARM+x86) is just 9 MB more than the ARM version.

We Fire          APK Size
ARM              234 MB
Fat (x86+ARM)    243 MB

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Configurations: [describe config + what test used + who did testing]. For more information go to http://www.intel.com/performance.

 

Conclusion

When creating Android games with Unity, you'll see a performance benefit on Intel Architecture by porting to x86 native. But the Unity engine is not the only thing that must be ported; all the related plugins must be ported as well. To port a Unity game to x86 native successfully, you should plan the porting of these plugins in advance.

 

About the Author

Tao Peng is a software apps engineer in the Intel Software and Services Group. He currently focuses on gaming enabling and performance optimization, in particular on Android mobile platforms.

 

Adaptive Volumetric Shadow Maps for Android* Using OpenGL* ES 3.1


 

Download PDF

As a follow-up to Adaptive Volumetric Shadow Maps for DirectX* 11, we present a port of the same algorithm adapted for Android* devices that support OpenGL ES* 3.1 and the GL_INTEL_fragment_shader_ordering OpenGL* extension.

Beyond being just a simple port, this version includes a number of optimizations and tradeoffs that allow the algorithm to run on low-power mobile devices (such as tablets and phones), in contrast to the previous sample, which targeted Ultrabook™-class hardware.

The AVSM algorithm allows for generation of dynamic shadowing and self-shadowing for volumetric effects such as smoke, particles, or transparent objects in real-time rendering engines using today’s Intel® hardware.

To achieve this, each texel of the AVSM shadow map stores a compact approximation of the transmittance curve along the corresponding light ray.

Transmittance curve expressed using 4 nodes

The main innovation of the technique is a new streaming compression algorithm that builds a constant-storage, variable-error representation of the visibility curve along the light ray's travel through the media, which can then be used in later shadow lookups. This essentially means that every time a new partial shadow caster (light occluder) is added to the shadow map, the algorithm performs an optimal lossy compression of the transmittance curve – for each texel individually.

Lossy compression: reduction from 4 to 3 nodes by removal of the least visually significant node (A)
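To make the compression step concrete, below is a minimal CPU-side sketch, assuming the curve is stored as a small array of (depth, transmittance) nodes. The error metric used here, the area of the triangle formed by a node and its two neighbours, is one reasonable way to measure the least visually significant node; the actual sample does this per texel in the pixel shader, and its metric may differ.

```cpp
#include <cmath>
#include <cstddef>

// One node of the piecewise-linear transmittance curve: depth along the
// light ray and the transmittance remaining at that depth.
struct AvsmNode {
    float depth;
    float trans;
};

// Change in area under the curve if node b is removed and its neighbours a
// and c are connected directly: the area of triangle (a, b, c).
static float RemovalError(const AvsmNode& a, const AvsmNode& b, const AvsmNode& c) {
    return 0.5f * std::fabs((b.depth - a.depth) * (c.trans - a.trans) -
                            (c.depth - a.depth) * (b.trans - a.trans));
}

// Drop the interior node whose removal distorts the curve the least. The end
// points are always kept so the extent of the curve is preserved.
void CompressByOne(AvsmNode nodes[], std::size_t& count) {
    if (count <= 2) return;
    std::size_t victim = 1;
    float bestErr = RemovalError(nodes[0], nodes[1], nodes[2]);
    for (std::size_t i = 2; i + 1 < count; ++i) {
        float err = RemovalError(nodes[i - 1], nodes[i], nodes[i + 1]);
        if (err < bestErr) { bestErr = err; victim = i; }
    }
    for (std::size_t i = victim; i + 1 < count; ++i) nodes[i] = nodes[i + 1];
    --count;
}
```

Inserting a new occluder then amounts to adding its node to the fixed-size array and running the compression step once, which keeps the per-texel storage constant.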

This algorithm relies on the GL_INTEL_fragment_shader_ordering OpenGL extension, which enforces a deterministic shader execution order at the per-pixel level based on triangle submission order. This accomplishes two important goals needed by the AVSM algorithm:

  • Synchronization, allowing for thread-safe data structure access.
  • Per-pixel shader execution ordering (based on triangle submission order), allowing for deterministic behavior of lossy compression between subsequent frames, which prevents temporal visual artifacts (“flickering”) that otherwise appear.

On DirectX 11 this feature is available through the Intel Pixel Synchronization extension or, more recently, natively through the DirectX 11.3 and DirectX 12 feature called Rasterizer Ordered Views.

Example of AVSM smoke shadows (disabled - left, enabled - right) in Lego Minifigures* Online by Funcom Productions A/S

Below is a list of the main differences between the Android and the original DirectX implementations. The differences mostly focus on optimizing algorithm performance for low-power target hardware such as tablets and phones:

  1. Transparent smoke particles are rendered into a lower resolution frame buffer and blended to the native resolution render target to reduce the blending cost of the high overdraw. This is not an AVSM-specific optimization but was necessary for such an effect to be practical on the target hardware.
  2. In some scenarios, mostly when shadow casters are not moving too quickly relative to the AVSM shadow map matrix, the shadow map can be updated only every second frame to reduce the cost. To balance computation across both frames, some operations (such as clearing the buffers) can then be performed in the alternate frame.
  3. Only every second smoke particle can be added to AVSM map but with twice the opacity. This slightly reduces visual quality but improves insertion performance by a factor of two.
  4. To reduce the cost of sampling the AVSM shadows, the sampling can be moved from per-pixel to per-vertex frequency. The old (DirectX 11) sample uses screen-space tessellation to achieve this at no quality loss compared to per-pixel sampling. This sample, however, uses a geometry shader to output a fixed billboard quad made up of four triangles and five vertices. AVSM sampling and interpolation using a five-vertex quad (one vertex in the middle in addition to the four corners) provides a good balance between quality and performance better suited to the target hardware.
  5. For receiver geometry that is always behind shadow casters (such as the ground), full sampling is unnecessary and replaced by only reading the value of the last node.

Sample running on Tesco hudl2* device with Android*

The sample UI provides ways of toggling on/off or tweaking most of the above listed optimizations, as a way of demonstrating the cost and visual quality tradeoffs.

The sample code will run on any OpenGL ES 3.1 device that supports the GL_INTEL_fragment_shader_ordering extension, such as tablets based on Intel® Atom™ processors (code-named Bay Trail or Cherry Trail).
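If you want to verify support at runtime before enabling the AVSM path, a check along the following lines works on OpenGL ES 3.x. This is a generic sketch rather than code taken from the sample.

```cpp
#include <GLES3/gl31.h>
#include <cstring>

// Returns true if the driver exposes the fragment shader ordering extension
// the AVSM technique depends on; otherwise fall back to a simpler shadow path.
bool HasFragmentShaderOrdering() {
    GLint count = 0;
    glGetIntegerv(GL_NUM_EXTENSIONS, &count);
    for (GLint i = 0; i < count; ++i) {
        const char* ext =
            reinterpret_cast<const char*>(glGetStringi(GL_EXTENSIONS, i));
        if (ext && std::strcmp(ext, "GL_INTEL_fragment_shader_ordering") == 0)
            return true;
    }
    return false;
}
```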

Please refer to README.TXT included in the archive for build instructions.

Based on the Adaptive Volumetric Shadow Maps paper by Marco Salvi, EGSR 2010.

Rasterizer Order Views 101: a Primer


Download PDF

Rasterizer Order Views

Introduction – What are Rasterizer Order Views?

One of the new features of DirectX* 12 is Rasterizer Ordered Views, which allow read/write access to resources, such as buffers, textures, and texture arrays (though without multisampling), from multiple threads without generating memory conflicts through the use of atomic functions. This feature means that certain resources created with Unordered Access Views (UAVs) can be marked in the pixel shader code to follow strict ordering rules, similar to those used to ensure correct pixel blending during draw operations. Rasterizer ordered views (ROVs) allow the creation of a whole range of new algorithms, such as Order Independent Transparency (OIT), Adaptive Volumetric Shadow Maps (AVSM), and custom blending operations, that are not possible in the fixed-function blending pipeline.

How did Rasterizer Order Views come about?

Certain graphics problems, like order independent transparency, are vital for realistic-looking smoke, foliage, hair, and water but don't fit into the traditional rendering pipeline. The flexibility and power of programmable shaders would seem to offer a solution to these kinds of problems, except that, even with atomics, it is not possible to read and modify data in a deterministic manner inside a shader using UAVs, which leads to potential visual artifacts. ROVs are important because they help solve this problem by synchronizing shader execution in triangle submission order.

Order Independent Transparency is vital for realistic-looking smoke.

So, why is Intel talking about rasterizer order views in DirectX 12? Well, two years ago Intel introduced similar functionality with the release of PixelSync as part of 4th generation Intel® Core™ processors.  

Johan Andersson, Technical Director at DICE for Battlefield 4, when asked about what he wanted to see in the next generation of GPUs from all hardware vendors, even mentioned PixelSync: “We have a pretty long list…but one very concrete thing we’d like to see, and actually Intel has already done this on their hardware, they call it PixelSync, which is their method of synchronizing the graphics pipeline in a very efficient way on a per-pixel basis. You can do a lot of cool techniques with it, such as order independent transparency for hair rendering or for foliage rendering. And they can do programmable blending where you want to have full control over the blending instead of using the fixed-function units in the GPU. There’s a lot of cool components that can be enabled by such a programmability primitive there…“. ROVs now bring a standard way of accessing the PixelSync functionality that Johan liked across a wide variety of hardware from different vendors.

DirectX Pipeline and the limitations of UAVs

As mentioned in the introduction, rasterizer order views are important because they allow you to read and modify data in a deterministic manner inside a shader using UAVs. So why isn’t this possible without ROVs? To understand that, you need to understand how data passes through the various stages in the graphics pipeline.

DirectX rendering follows a strict set of rules that ensures triangles are always rendered in the order they are submitted: if two triangles overlap on the screen, the hardware guarantees that Triangle 1 will have its color result blended to the screen before Triangle 2 is processed and blended.

DirectX rendering follows a strict set of rules that ensures triangles are always rendered in the order they are submitted

When triangles are submitted, they run through the input assembler, then through the vertex shader, hull shader, and geometry shader, depending on which stages are enabled, before reaching the rasterizer and then the pixel shader. These shaders run on programmable units called EUs (execution units) on Intel® hardware, with the system designed to run many shaders in parallel across different EUs. Hardware on the back end, called the raster operations pipeline (ROP), is responsible for enforcing the ordering requirement, ensuring pixels from Triangle 1 are rendered before Triangle 2.

However, the ROP is not programmable; it offers only a fixed menu of operations on the color, z, and stencil buffers. Another major limitation of the pipeline shown above is that the input data sources have to be different from the output render targets: a shader can't modify its own incoming data. DirectX 11 introduced a way around this particular limitation with the introduction of UAVs.

UAVs are resources (which include buffers, textures, and texture arrays) that are bound directly to one of the shader stages and are therefore accessed before the output merger, that is, before the part of the pipeline that enforces the in-order behavior between individual triangles.

UAVs are accessed before the part of the pipeline that enforces ordering between individual triangles

So what are the UAV limitations stemming from operating before the output merger? Imagine two triangles entering the pipeline. The pixel shader for Triangle 1 does a read/modify/write (r/m/w) operation on data in a UAV using the screen location as an index; then the pixel shader for Triangle 2 comes along and tries to do its own r/m/w. If the triangles overlap, they can potentially access the exact same location.

UAV limitations stemming from operating before the output merger

This sets up a data race condition in which data from either Triangle 1 or Triangle 2 may get dropped. Triangle 1 could read some data from the UAV and still be in the middle of processing before writing its value back. Triangle 2, because of the race condition, could read the same starting data, do its own set of calculations, and write another result back to the UAV surface, wiping out the result from Triangle 1. The effect would be the same as if Triangle 1 had never run.

Data race condition between triangles
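The same lost-update problem is easy to reproduce on the CPU. The sketch below is only an analogy, not shader code: two threads play the role of two overlapping triangles doing an unsynchronized read/modify/write on one "texel", and either contribution can be silently lost.

```cpp
#include <thread>
#include <iostream>

int main() {
    int texel = 0;  // stands in for one location in a UAV

    // Unsynchronized read/modify/write, like a pixel shader touching a UAV.
    auto shade = [&texel](int contribution) {
        int value = texel;          // read
        value += contribution;      // modify
        std::this_thread::yield();  // widen the race window
        texel = value;              // write: may clobber the other thread
    };

    std::thread triangle1(shade, 1);
    std::thread triangle2(shade, 2);
    triangle1.join();
    triangle2.join();

    // With ordered access the answer would always be 3; here 1 or 2 are
    // also possible because one update can overwrite the other.
    std::cout << "texel = " << texel << "\n";
    return 0;
}
```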

Even if the data is safe, there still are potential issues. Although Triangle 1 starts processing first, there’s no guarantee that the shader that accesses the UAV referred to by Triangle 1 will run first. For example, if Triangle 1 happened to access new data from outside the local cache or took a more complex dynamic path through its shaders, Triangle 2 could actually run the code that accesses the UAV first.

The nondeterministic order of execution can cause significant issues for certain algorithms

Even if the r/m/w operation doesn't cause a race condition, the nondeterministic order of execution can cause significant issues for certain algorithms. Here's a video that shows what happens when there is no pixel synchronization and UAVs are used to implement order independent transparency with a lossy compression algorithm. You can see flickering because the pixels are operated on in the wrong order and the written values change between frames even though the scene isn't changing. It's very obvious to the viewer, and frequently the more powerful the hardware (and the more threads executed in parallel), the more flicker is seen.

We need the hardware to detect dependencies among fragments writing to the same x/y screen coordinate and enforce the same ordering rules normally used by the ROP, but at the point of access to the UAV. This would avoid data races and guarantee primitive ordering for r/m/w operations.

So, what happens when you have the ability to enforce order between triangles? If Triangle 1 is in the middle of its r/m/w operation and Triangle 2 hits the same point in the shader, rather than starting the algorithm Triangle 2 will block and wait until the first triangle is finished, avoiding any race condition. Even if Triangle 2 gets there first, it will wait until Triangle 1 has run and finished before starting, which creates a deterministic order and also avoids race conditions.

A deterministic order that avoids race conditions

When the fragments do not overlap and don't write to the same x/y coordinate, there is no performance impact. Even if two fragments are in flight and reference the same x/y coordinate, the performance cost is minimal if the hardware executes them in the original submission order.

ROV API

The ROV API is a High Level Shading Language (HLSL) construct that builds on the existing UAV support, so the only code changes to make are within the HLSL shaders themselves. Below is a list of the ROV resource types that can be declared within HLSL:

  • RasterizerOrderedBuffer
  • RasterizerOrderedByteAddressBuffer
  • RasterizerOrderedStructuredBuffer
  • RasterizerOrderedTexture1D
  • RasterizerOrderedTexture1DArray
  • RasterizerOrderedTexture2D
  • RasterizerOrderedTexture2DArray
  • RasterizerOrderedTexture3D

Each of these declarations corresponds to a normal UAV resource, such as RWBuffer or RWTexture2D. Outside of the HLSL shader, no changes to the calling C++ code are required: resources are created as normal, and an UnorderedAccessView is created and bound using OMSetRenderTargetsAndUnorderedAccessViews.
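As a rough illustration of that C++ side, the sketch below creates a UAV over an existing texture and binds it next to the render target using OMSetRenderTargetsAndUnorderedAccessViews. The format, slot numbers, and helper name are illustrative choices, not code from a particular sample, and error handling is omitted for brevity.

```cpp
#include <d3d11.h>

// Create a UAV over a texture and bind it alongside the render target.
ID3D11UnorderedAccessView* BindShaderWritableTexture(ID3D11Device* device,
                                                     ID3D11DeviceContext* context,
                                                     ID3D11Texture2D* storageTexture,
                                                     ID3D11RenderTargetView* rtv,
                                                     ID3D11DepthStencilView* dsv) {
    D3D11_UNORDERED_ACCESS_VIEW_DESC desc = {};
    desc.Format = DXGI_FORMAT_R32_UINT;                  // e.g. packed per-pixel data
    desc.ViewDimension = D3D11_UAV_DIMENSION_TEXTURE2D;
    desc.Texture2D.MipSlice = 0;

    ID3D11UnorderedAccessView* uav = nullptr;
    device->CreateUnorderedAccessView(storageTexture, &desc, &uav);

    // One render target in slot 0 and one UAV starting at slot 1 (UAV slots
    // shared with the output merger begin after the bound render targets).
    context->OMSetRenderTargetsAndUnorderedAccessViews(
        1, &rtv, dsv,
        1, 1, &uav, nullptr);

    return uav;
}
```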

Porting between PixelSync and ROVs

While there are a couple of subtle differences between writing shader code with Rasterizer Order Views in DirectX 12 and with Intel's PixelSync, they basically work the same way, and an algorithm created in one is easily transferable to the other. That's useful for developers, because for the last three years Intel has been writing code samples for PixelSync. All of these samples are easily transferable to ROVs, enabling support not just for Intel hardware but for hardware from all vendors. Many of these samples have been used in shipping games; for example, both GRID* 2 from Codemasters and the Total War* series from Sega used OIT to improve foliage. Many of these game development houses continue to use these algorithms in their engines, proving ROVs have the ability to make a real visual difference on current hardware.

The great outdoors in GRID* 2 by Codemasters with OIT applied to the foliage and chain link of fencing 
The great outdoors in GRID* 2 by Codemasters with OIT applied to the foliage and chain link of fencing

Any PixelSync sample on the Intel web site is easy to transfer over to ROV code. Inside all PixelSync code is an include file titled IntelExtensions.hlsl, a declaration of a UAV resource such as RWTexture2D along with its defined binding slot, and a couple of predefined functions such as IntelExt_Init() and IntelExt_BeginPixelOrderingonUAV. The latter defines the actual synchronization point and the UAV surface affected.

The ROV syntax is simpler. An external include file is not needed; instead, the RGBE buffer is declared as a rasterizer ordered texture. The shader compiler automatically inserts a sync point at the first read of the RGBE buffer, and it even knows which UAV slot has been assigned. Moving from PixelSync to ROV is very quick and simple.

Simple steps to move from PixelSync to ROV
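Below is a hedged sketch of what the ROV side of such a shader can look like, kept as a source string so the whole round trip from C++ is visible. The buffer name, entry point, and the ps_5_1 compile target are assumptions rather than the sample's actual code; the comments note what the PixelSync version described above would add.

```cpp
#include <d3d11.h>
#include <d3dcompiler.h>
#pragma comment(lib, "d3dcompiler.lib")

// ROV version of a custom-blend pixel shader. The PixelSync version would
// additionally include IntelExtensions.hlsl and call IntelExt_Init() and
// IntelExt_BeginPixelOrderingonUAV (names as given in the text above) before
// the first UAV access; with the ROV declaration the compiler inserts the
// sync point at the first read automatically.
static const char kRovPixelShader[] = R"(
RasterizerOrderedTexture2D<uint> g_RGBEBuffer : register(u1);

float4 main(float4 pos : SV_Position) : SV_Target
{
    uint2 xy   = uint2(pos.xy);
    uint  prev = g_RGBEBuffer[xy];   // ordered read: earlier triangles are done
    // ... unpack, blend with the incoming fragment, repack ...
    g_RGBEBuffer[xy] = prev;         // ordered write
    return float4(0, 0, 0, 0);
}
)";

ID3DBlob* CompileRovPixelShader() {
    ID3DBlob* code = nullptr;
    ID3DBlob* errors = nullptr;
    // "ps_5_1" is an assumption; the shader model you need depends on the
    // runtime (DirectX 12 or 11.3) and the compiler you target.
    D3DCompile(kRovPixelShader, sizeof(kRovPixelShader) - 1, "rov_ps.hlsl",
               nullptr, nullptr, "main", "ps_5_1", 0, 0, &code, &errors);
    if (errors) errors->Release();
    return code;
}
```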

ROV use cases

The ability to have deterministic, ordered access to read/write buffers within a shader opens up a lot of interesting solutions to current graphics problems. One of the most obvious use cases is programmable blending to replace the fixed-function hardware, which opens the possibility of using custom data types as a render target, for example using a 32-bit render surface to store RGBE data for greater precision, as shown in this Intel code sample. A simple extension of programmable blending is g-buffer blending, where surfaces containing nonlinear data, such as surface normals, can be correctly combined, which is a real benefit for deferred renderers.
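As an example of such a custom format, the sketch below packs three HDR channels into 32 bits with a shared exponent (classic RGBE). Treat it as illustrative only; the packing used in the Intel sample may differ in detail.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Pack three non-negative HDR channels into 8:8:8 mantissas plus a shared
// 8-bit biased exponent taken from the largest channel.
uint32_t PackRGBE(float r, float g, float b) {
    float m = std::max(r, std::max(g, b));
    if (m < 1e-32f) return 0;
    int e;
    float mant = std::frexp(m, &e);    // m = mant * 2^e, mant in [0.5, 1)
    float scale = mant * 256.0f / m;   // maps the largest channel to just under 256
    uint32_t R = (uint32_t)(r * scale);
    uint32_t G = (uint32_t)(g * scale);
    uint32_t B = (uint32_t)(b * scale);
    uint32_t E = (uint32_t)(e + 128);  // biased shared exponent
    return (R << 24) | (G << 16) | (B << 8) | E;
}

void UnpackRGBE(uint32_t p, float& r, float& g, float& b) {
    int e = (int)(p & 0xFF) - 128;
    float scale = std::ldexp(1.0f, e - 8);  // 2^(e-8), undoing the * 256
    r = (float)((p >> 24) & 0xFF) * scale;
    g = (float)((p >> 16) & 0xFF) * scale;
    b = (float)((p >> 8)  & 0xFF) * scale;
}
```

Because the fixed-function output merger cannot blend a packed value like this, the shader has to unpack, blend, and repack it, which is exactly where the ordered access guarantee matters.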

A more complex use case is to create a k-buffer, a generalization of the traditional z-buffer-based framebuffer. Instead of restricting the framebuffer to a single value per pixel, the k-buffer uses memory as a r/m/w pool of k entries whose use is programmatically defined by k-buffer operations. Using ROVs to generate a k-buffer allows single-pass implementations of depth peeling, order independent transparency, and depth-of-field and motion blur effects.

In DirectX 11, these r/m/w operations had undefined behavior, so such algorithms were frequently designed around per-pixel linked lists, which have unbounded memory requirements. In many cases the unbounded memory requirement can be removed entirely, because various forms of lossy compression can be applied as the data is inserted to keep the data set within a fixed size. The implementation of order independent transparency used in GRID 2 and GRID Autosport did exactly this: the first k pixels were stored in the k-buffer, and once its limit was reached any additional transparent pixels were merged with the current data using a routine that minimizes the deviation of the result from the ground truth.
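The sketch below captures that bounded, merge-on-overflow idea for a single pixel. It is a simplified CPU-side illustration with a scalar payload; the shader versions used in the games blend color and transmittance when they merge, and choose the entries to merge so the error against the ground truth stays small.

```cpp
#include <cstddef>

// One transparent fragment: its depth plus a stand-in payload (a real
// k-buffer keeps color and transmittance here).
struct Fragment {
    float depth;
    float value;
};

// Bounded per-pixel k-buffer: keep the K nearest fragments exactly and, once
// full, merge the two farthest entries so storage never grows.
template <std::size_t K>
struct KBuffer {
    static_assert(K >= 2, "need at least two slots to merge");
    Fragment frags[K];
    std::size_t count = 0;

    void Insert(Fragment f) {
        if (count < K) {
            std::size_t i = count++;            // insertion sort, front to back
            while (i > 0 && frags[i - 1].depth > f.depth) {
                frags[i] = frags[i - 1];
                --i;
            }
            frags[i] = f;
            return;
        }
        if (f.depth >= frags[K - 1].depth) {    // new fragment is the farthest
            frags[K - 1].value += f.value;      // merge it straight into the tail
            return;
        }
        frags[K - 2].value += frags[K - 1].value;  // merge the two farthest entries
        frags[K - 2].depth  = frags[K - 1].depth;  // keep the farther depth
        std::size_t i = K - 1;                     // then insert the new fragment
        while (i > 0 && frags[i - 1].depth > f.depth) {
            frags[i] = frags[i - 1];
            --i;
        }
        frags[i] = f;
    }
};
```

With K = 4, for instance, the four nearest transparent fragments per pixel are stored exactly and everything behind them is approximated by the merged tail.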

In addition to the algorithms already mentioned and used in games, ROVs might prove useful in several R&D areas, such as custom anti-aliasing solutions, especially when combined with conservative rasterization (another DirectX 12 feature) and voxelisation.

One problem voxelisation routines often have is inserting data into the 3D voxel grid, which is normally accomplished using atomic operations. ROVs allow much more complex data structures to be modified safely, free from race conditions with other triangles updating the mesh. Using ROVs for voxelisation does require the geometry to be rasterized into the three planes as separate calls, rather than as a single draw submission with the geometry shader choosing the plane to project into, because fragment dependencies cannot be tracked across multiple 2D planes in a single render call. It is an interesting tradeoff for the more complex data that can be managed within the structure.

Summary

Rasterizer order views are a new set of tools that help developers control the 3D pipeline. They are very simple to use and offer new solutions for many long-standing problems like order independent transparency, depth peeling, and volume rendering and blending, all while improving game performance. As a starting point for experimenting with ROVs, refer to articles on related topics like the PixelSync samples or even the OpenGL* extensions that duplicate ROV behavior. With two generations of Intel hardware supporting ROVs and broad support from the rest of the industry, ROVs are a great addition to the traditional pipeline, one you can use now. Good luck with your game coding!

About the Author 

Leigh Davies is a senior application engineer at Intel with over 15 years of programming experience in the PC gaming industry, originally working with several developers in the UK and then with Intel. He is currently a member of the European Visual Computing Software Enabling Team, providing technical support to game developers. Over the last few years Leigh has worked on a wide variety of enabling areas, from graphics (optimization, Order Independent Transparency, and Adaptive Volumetric Shadow Mapping) to multi-core and platform optimizations like touch and sensor controls. Over the last two years Leigh has worked on Windows* (DirectX 11 and 12) and Android* (GLES 3.1).
 

