
Innovative Puzzle Game From Salmi Games Wins Intel® Level Up Contest Game of the Year


If there’s one thing that stands out when playing Ellipsis, a touch-enabled action-puzzler title from Salmi Games, it’s the strict adherence to a time-tested motto for developers: keep it simple. The interface is easy to understand, the premise is natural, the color scheme is clean and crisp with no text to clutter the view, and the environment is vivid and engaging. That’s a perfect formula for success, and the awards just keep piling up.

Recently, Ellipsis was voted Best Action Game and Game of the Year in the 2016 Intel® Level Up Game Developer Contest, making this a great opportunity to dive deeper into the path Salmi Games took to produce this winning title. What makes their case interesting is that the developers were able to apply the lessons they learned from developing a mobile app toward their contest entry, which ran on PCs. Salmi Games is bringing the phrase “iterating for success” to a whole new level.

Clean and Simple Rules the Day

For a two-man team, Salmi Games packs a lot of punch. Stefan Hell and Yacine Salmi share coding, design, and art responsibilities. They also work closely with Filippo Beck Peccoz, a Munich-based audio composer, for the game’s audio design. For such a small team they get a lot done, so we recently spoke with Yacine about his background, the games he played as a kid, and what influenced him the most as he co-developed this award-winning game.

Yacine always knew he wanted to work on video games, calling himself a true “Nintendo* boy.” Growing up, he enjoyed such games as the futuristic shooter Geometry Wars*, the puzzle-game Osmos*, and traditional titles such as Super Mario Brothers* and Legend of Zelda*. He especially liked how clean and simple a game like Zelda could be. “They focus on creating a little sandbox for you to play around in, and then just keep adding new things constantly,” he said. “They had a really big influence on me, and that’s what we did with Ellipsis.”

Yacine focused his education on computer science, and traveled a lot to complete his studies. He grew up in the United States, but completed his undergraduate work in Toronto, Canada, and earned a master’s degree in game programming in the UK. He’s been living in Munich, Germany, for the last nine years, giving him a truly international outlook.

After completing school, he coded at Electronic Arts* for a year, then moved on to Sony*, where he worked on the PS3* launch title MotorStorm*. After a couple of years working on PS3 titles, he switched to engineering on the Havok* Physics* engine team. About five years ago, Yacine started his first game studio, but ran out of money and shelved that project. Two years ago, he started Salmi Games—dedicated to creating innovative, minimalist games—mixing work on contract assignments with developing in-house titles.

Yacine met his partner Stefan Hell (then a local student) at a design event in Munich, where they coded a prototype web app, and bonded over subroutines. Even before Stefan graduated, Yacine would bring him in to help on small projects, and they teamed up formally after that.

“You have to get to know somebody for a while before you do a deeper project together,” Yacine advises. “You can learn their style, and how to communicate with them, and whether their strengths and weaknesses complement yours. Then you can engage in the creative process together.”

Figure 1. Yacine Salmi, right, and Stefan Hell.

Colorful, Creative, and Fun

Because Ellipsis is such a streamlined game, it’s easy to describe how it works. Players start out faced with a very simple screen, which is dominated by a pulsing ring that invites you to touch it and drag it around. When you drag the ring over a token, you release smaller targets that you want to quickly scoop up. When nothing is left, you move to a sanctuary gate, which pulls you in to welcoming safety.

Figure 2. The open, white circle in the lower-left corner is where you place your finger, and you try to capture the tokens that have smaller circles inside. The vertical white bar is the “gate”, where you escape to safety when you’re done.

As the levels progress, there are things chasing you, things firing at you, and collisions to avoid, and docking at the gate provides a rush of completion. The first time you complete a level, there’s no time clock or high score, but if you roll back through a second time, you get additional challenges.

Stylistically, the game offers tantalizing neon colors in an explosion of tracers and light particles, with a subtle collection of rewarding sound effects, and an easy system of tracking your progress. It practically defines elegance—there are no help files, no tutorials, and no strategy guide to memorize. The puzzles are addictive, and you can forge through sixty levels in just a few minutes.

Figure 3. Ellipsis features vivid, neon colors, and plenty of action.

Work on Ellipsis started in 2014, and it was always the goal to create a touch-based game that would tap into mobile markets. Yacine and Stefan prototyped some possibilities, but one day they came at things from a different angle. As Yacine recalls, “How can we approach touch-based games with direct control, instead of just doing a swipe or a tap? What would happen if you put your finger on the tablet and whatever you’re controlling directly follows your finger all the time?” There were a few issues to work out, but they soon found that the basic gameplay idea was clean, simple, and fun. They spent the next year exploring that concept as deeply as possible, but never 100 percent full time. They sandwiched the work in between contracts, working nights and weekends, juggling coding, life, and family with great care.

Yacine gave himself the challenge to polish the game without ever adding any explanatory text. That would make sure there were no translation issues, and keep the screen uncluttered and simple. “I won’t be there to explain the game to people,” he reasoned, so it was very important to just hand a tablet over to a player and say nothing, then watch them start. “I want to embrace the simplicity of the game as much as possible,” Yacine explained.

Unity* Engine Speeds Up Development

Getting to such a clean and polished feel was no easy feat. The team dabbled with creating their own engine, but soon dropped that work and adopted the Unity* engine. “Unity was just the easiest to get going with,” Yacine said. “I've written my own engine before, and it can become a huge time sink, where you’re spending more time on the technology side than on the game itself. Unity was very liberating in that sense.”

Not that there weren’t edge cases that gave the team headaches. They struggled with developing for multiple Android* versions, for example. Some devices suffered from lighting issues that made them unplayable, and the team spent days trying to figure out the problem. Finally, a Unity update arrived that magically fixed the issues, and they moved on. Other times, a new update would cause performance issues that they had to painstakingly sleuth by themselves.

“There’s a give-and-take when you’re working with another engine,” Yacine said. “It’s fantastic to get going quickly, but when you want to do something that it’s not capable of doing, then you start to run into problems.”

Even though he had experience with the Havok Physics engine, the physics in Unity provided all the technology Yacine needed for collisions. But there were issues there, too. During early testing, when the avatar would run into a wall but the player’s finger kept going, the game would lose the connection. Resolving that without making the game slow down significantly took some work. “We had to create a lot of different techniques, with catch-up systems to bring you back to your finger without breaking everything, or without it feeling very sluggish. That was a lot of work to get the ‘feeling’ just right.”
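The general idea behind that kind of catch-up system can be sketched in a few lines. This is only an illustration of the technique, written in plain JavaScript with made-up names; it is not Salmi Games’ actual Unity code.

// Illustrative only: chase the finger position each frame instead of snapping to it.
// "avatar", "fingerPos", and "maxSpeed" are hypothetical names for this sketch.
function updateAvatar(avatar, fingerPos, maxSpeed, dt) {
    var dx = fingerPos.x - avatar.x;
    var dy = fingerPos.y - avatar.y;
    var dist = Math.sqrt(dx * dx + dy * dy);
    if (dist === 0) return;

    // Move a fraction of the remaining gap each frame, capped at maxSpeed, so the
    // avatar catches up smoothly after being blocked without feeling sluggish.
    var step = Math.min(dist * 0.25, maxSpeed * dt);
    avatar.x += (dx / dist) * step;
    avatar.y += (dy / dist) * step;
}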

The team used Adobe* Photoshop* for graphics, after playing around with Adobe Illustrator*, and rejecting it. “We thought our game style lent itself to vector art, but then we’d do a vector render and it was a pain. We just went back to using high-quality sprites.”

Figure 4. Puzzles get increasingly challenging as you progress through the game.

Another technical challenge was to support an easy mechanism to save the game. All the player has to do is lift their finger and the game pauses. Synchronizing the saves to deal with cloud saves and offline saves, and merging those two paths correctly, took a lot of time to work out. It’s now possible to start playing on one device and continue on another, which modern gamers expect.
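One common way to reconcile a cloud save with a local (offline) save is to keep the better result for each level. The sketch below is generic, with an assumed save-file shape, and is not necessarily how Ellipsis implements it:

// Generic sketch: merge two save objects by keeping the better result per level.
// The save shape ({ levels: { id: { completed, bestTime } } }) is an assumption for illustration.
function mergeSaves(cloudSave, localSave) {
    var merged = { levels: {} };
    var ids = Object.keys(cloudSave.levels).concat(Object.keys(localSave.levels));
    ids.forEach(function (id) {
        var a = cloudSave.levels[id];
        var b = localSave.levels[id];
        if (!a || !b) {                                          // level exists in only one save
            merged.levels[id] = a || b;
        } else if (a.completed !== b.completed) {
            merged.levels[id] = a.completed ? a : b;             // prefer a completed entry
        } else {
            merged.levels[id] = (a.bestTime <= b.bestTime) ? a : b;  // then the faster time
        }
    });
    return merged;
}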

Finally, they had to work on their shader code to get the right flare effect. They adapted a shader that ships with Unity, enhancing it, and adding their own twists.

Ready, Set, Level Up

Because touch latencies have continued to fall—making touch-based games more feasible—the team always planned for mobile gaming. Yacine pointed to newer tablets and phones, which have reduced touch latencies to only a few milliseconds of delay. The market seemed wide open, so they jumped in.

Yacine heard about the Intel® Level Up Contest a year prior to entering, but the team wasn’t ready for that year’s deadline. Using the website PromoterApp*, Yacine was already tracking important festivals, conferences, and local events, and when it flagged him about the next Intel Level Up Contest, he was ready.

“You enter these contests, and you just hope they will see the quality in your game,” he said. “When the Intel contest came back up, and we saw it was focused on touch devices and 2-in-1 PCs, we thought this could be a really cool opportunity.”

That was another advantage of Unity—they could quickly port from an Android to a PC version. They polished the game for the contest so that it ran well on touch-screen PCs as well as on PCs with a mouse, and hit the Intel Level Up Contest deadline.

When Yacine learned Ellipsis had not just been named the Best Action Game, but also overall Game of the Year, he admits he started dancing around with joy. All through the day he would get distracted, then remember he’d won the contest, and it would hit him all over again. “It was a really cool surprise. Not only being nominated, but to be considered the Game of the Year is just a really huge and humbling honor.”

Grand-prize winners receive a USD $5,000 cash award, which is always welcome, but there’s much more. They also receive an agency-driven digital marketing creative campaign package, valued at USD $12,000. The campaign will drive targeted visibility through Facebook* or YouTube* over a continuous four-week period.

Yacine plans to coordinate the marketing campaign with the PC release of Ellipsis to maximize exposure. He showed the game off at the 2016 PAX West conference in Seattle. He’s got plenty of judging comments to go through, and is looking forward to incorporating some new libraries and utilities put forward by his Intel account manager to help with recommended specs for more devices.

Lessons Learned, and Next Steps

Yacine expects those Intel optimizations for multiple devices to help with the testing regimen. During their release on iOS* 7, Salmi Games linked to a library that wasn’t supported on earlier versions, and they spent hours trying to figure out why reports came in of serious crashes. So they learned to target the weakest devices, and make that experience as smooth as possible, knowing that the higher-end devices would take care of themselves.

“For the PC version that we’re working on, we’re going to be doing the same thing: targeting low-end devices first,” Yacine said. “We’ll make sure it runs at least 30 frames-per-second on those less-powerful systems, if not 60, and hopefully from that we’ll have a smooth experience across the board.”

Yacine also obsessed over the game’s footprint, cleaning it up so that the game didn’t take up too many megabytes. When the game started to reach a size of 60 megabytes, Yacine decided he needed to spend some serious time looking at file sizes and extra code. “We only had textures and audio, basically, so it shouldn’t get that big. We do have high-quality textures, but throughout development we would keep an eye on these things, and do an optimization pass whenever it was reasonable. Why should it be 60 or 70 megabytes when it could be 30 or 40?”

Figure 5. The Ellipsis display at PAX West 2016 pulled in lots of gamers.

Conclusion

Salmi Games shows yet another variant on the developer’s journey—how to jumpstart your marketing by winning prestigious contests. By taking Game of the Year honors in the Intel Level Up Contest, the team will gain access to more Intel tools and expertise for optimization, particularly when it comes to recommended specs and aspect ratios for various devices. The PC version of their game is due out on general release in early 2017, and will reach a whole new audience.

Figure 6. Yacine left the PAX West 2016 conference in Seattle with some nice new hardware.

Meanwhile, like any good executive, Yacine is already thinking about next steps. Between the concise feedback from the judges, the technical assistance from Intel engineers, and the marketing boost that comes with winning, Yacine feels he is primed for success going forward. The team is playing around with a VR version of Ellipsis on Oculus* Touch* and Samsung* Gear VR* devices, and this entrancing, colorful game combined with an innately immersive environment could well be the perfect match.

Additional Resources

Purchase Ellipsis on Steam*: http://store.steampowered.com/app/514620

Play Ellipsis main site: http://playellipsis.com

2016 Intel Level Up Contest: https://software.intel.com/en-us/blogs/2016/05/27/2016-intel-level-up-contest-by-the-numbers

Unity engine: https://unity3d.com


Driver Support Matrix for Intel® Media SDK and OpenCL™


Developers can access Intel's processor graphics GPU capabilities through the Intel® Media SDK and Intel® SDK for OpenCL™ Applications. This article provides more information on how the software, driver, and hardware layers map together.

 

Delivery Models


There are two different packaging/delivery models:

  1. For Windows* Client: all components needed to run applications written with these SDKs are distributed with the Intel graphics driver. These components are intended to be updated on a separate cadence from Media SDK/OpenCL installs. Drivers are released separately, and moving to the latest available driver is usually encouraged. Use the Intel® Driver Update Utility to keep your system up to date with the latest graphics drivers, or manually update from downloadcenter.intel.com. To verify the driver version installed on the machine, use the System Analyzer tool.
     
  2. For Linux* and Windows Server*: Intel® Media Server Studio is an integrated software tools suite that includes both SDKs, plus a specific version of the driver validated with each release.

Driver Branches

Driver development uses branches covering specific hardware generations, as described in the table below. The general pattern is that each branch covers only the two latest architectures (N and N-1). This means there are two driver branches for each architecture except the newest one. Intel recommends using the most recent branch. If issues are found it is easier to get fixes for newer branches. The most recent branch has the most resources and gets the most frequent updates. Older branches/architectures get successively fewer resources and updates.

Driver Support Matrix

3rd Generation Core, 4th Generation Core (Ivybridge/Haswell): LEGACY ONLY, downloads available but not updated

  • Integrated graphics: Ivybridge (Gen 7 graphics), Haswell (Gen 7.5 graphics)
  • Windows driver: 15.33
    Operating systems: Windows 7, 8, 8.1, 10 (client); Windows Server 2012 R2 (server)
  • Linux driver: 16.3 (Media Server Studio 2015 R1)
    Gold operating systems: Ubuntu 12.04, SLES 11.3

4th Generation Core, 5th Generation Core (Haswell/Broadwell): LEGACY

  • Integrated graphics: Haswell (Gen 7.5 graphics), Broadwell (Gen 8 graphics)
  • Windows driver: 15.36
    Operating systems: Windows 7, 8, 8.1, 10 (client); Windows Server 2012 R2 (server)
  • Linux driver: 16.4 (Media Server Studio 2015/2016)
    Gold operating systems: CentOS 7.1; generic kernel 3.14.5

5th Generation Core, 6th Generation Core (Broadwell/Skylake): CURRENT RELEASE

  • Integrated graphics: Broadwell (Gen 8 graphics), Skylake (Gen 9 graphics)
  • Windows drivers: 15.40 (Broadwell/Skylake, Media Server Studio 2017); 15.45 (Skylake and forward, client)
    Operating systems: Windows 7, 8, 8.1, 10 (client); Windows Server 2012 R2 (server)
  • Linux driver: 16.5 (Media Server Studio 2017)
    Gold operating systems: CentOS 7.2; generic kernel 4.4.0

Windows client note: Many OEMs have specialized drivers with additional validation. If you see a warning during install please check with your OEM for supported drivers for your machine.

 

Hardware details

 

Ivybridge (IVB) is the codename for the 3rd generation Intel processor, based on 22nm manufacturing technology and the Gen 7 graphics architecture.

Ivybridge, Gen 7 graphics, 3rd Generation Core:

  • GT2: Intel® HD Graphics 2500
  • GT2: Intel® HD Graphics 4000



Haswell (HSW) is the codename for the 4th generation Intel processor, based on 22nm manufacturing technology and the Gen 7.5 graphics architecture. It is available in multiple graphics versions: GT2 (20 execution units), GT3 (40 execution units), and GT3e (40 execution units plus eDRAM to provide a faster secondary cache).

Haswell, Gen 7.5 graphics, 4th Generation Core:

  • GT2: Intel® HD Graphics 4200
  • GT2: Intel® HD Graphics 4400
  • GT2: Intel® HD Graphics 4600
  • GT3: Intel® Iris™ Graphics 5000
  • GT3: Intel® Iris™ Graphics 5100
  • GT3e: Intel® Iris™ Pro Graphics 5200

Broadwell (BDW) is the codename for the 5th generation Intel processor, based on a 14nm die shrink of the Haswell architecture and the Gen 8 graphics architecture. It is available in multiple graphics versions: GT2 (24 execution units), GT3 (48 execution units), and GT3e (48 execution units plus eDRAM to provide a faster secondary cache).

Broadwell, Gen 8 graphics, 5th Generation Core:

  • GT2: Intel® HD Graphics 5500
  • GT2: Intel® HD Graphics 5600
  • GT2: Intel® HD Graphics 5700
  • GT3: Intel® Iris™ Graphics 6100
  • GT3e: Intel® Iris™ Pro Graphics 6200

Skylake (SKL) is the codename for the 6th generation Intel processor, based on 14nm manufacturing technology and the Gen 9 graphics architecture. It is available in multiple graphics versions: GT1 (12 execution units), GT2 (24 execution units), GT3 (48 execution units), GT3e (48 execution units plus eDRAM), and GT4e (72 execution units plus eDRAM to provide a faster secondary cache).

Skylake, Gen 9 graphics, 6th Generation Core:

  • GT1: Intel® HD Graphics 510 (12 EUs)
  • GT2: Intel® HD Graphics 520 (24 EUs, 1050 MHz)
  • GT2: Intel® HD Graphics 530 (24 EUs, 1150 MHz)
  • GT3e: Intel® Iris™ Graphics 540 (48 EUs, 1050 MHz, 64 MB eDRAM)
  • GT3e: Intel® Iris™ Graphics 550 (48 EUs, 1100 MHz, 64 MB eDRAM)
  • GT4e: Intel® Iris™ Pro Graphics 580 (72 EUs, 1050 MHz, 128 MB eDRAM)
  • GT4e: Intel® Iris™ Pro Graphics P580 (72 EUs, 1100 MHz, 128 MB eDRAM)

For more details please check

 

 

OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.

 

Intel® XDK FAQs - App Designer


Which App Designer framework should I use? Which Intel XDK layout framework is best?

There is no "best" UI framework for your application. Each UI framework has pros and cons. You should choose that UI framework which serves your application needs the best. Using App Designer to create your UI is not a requirement to building a mobile app with the Intel XDK. You can create your layout by hand or using any UI framework (by hand) that is compatible with the Cordova CLI (aka PhoneGap) webview environment.

At this time there is only one "non-deprecated" UI framework supported for the creation of new App Designer projects. Existing applications that were created using a deprecated UI framework can continue to be modified with the App Designer UI editor; however, they are no longer supported or maintained. In a future release of the Intel XDK, the App Designer UI editor will no longer recognize those existing projects, even for editing.

  • Twitter Bootstrap 3 -- a very clean UI framework that relies primarily on CSS with very little JavaScript trickery. There is a thriving third-party ecosystem with many plugins and add-ons, including themes. This framework is the best place to start, especially for UI beginners. Some advanced mobile UI mechanisms (like swipe delete) are not part of this framework.

  • Framework7 -- This UI framework has been deprecated and will be retired from App Designer in a future release of the Intel XDK. You can always use this (or any mobile) framework with the Intel XDK, but you will have to do so manually, without the help of the Intel XDK App Designer UI layout tool. If you wish to continue using Framework7 please visit the Framework7 project page and the Framework7 GitHub repo for documentation and help.

  • Ionic -- This UI framework has been deprecated and will be retired from App Designer in a future release of the Intel XDK. You can always use this (or any mobile) framework with the Intel XDK, but you will have to do so manually, without the help of the Intel XDK App Designer UI layout tool. If you wish to continue using Ionic please visit the Ionic project page and the Ionic GitHub repo for documentation and help.

  • App Framework 3 -- This UI framework has been deprecated and will be retired from App Designer in a future release of the Intel XDK. You can always use this (or any mobile) framework with the Intel XDK, but you will have to do so manually, without the help of the Intel XDK App Designer UI layout tool. If you wish to continue using App Framework please visit the App Framework project page and the App Framework GitHub repo for documentation and help.

  • Topcoat -- This UI framework has been deprecated and will be retired from App Designer in a future release of the Intel XDK. You can always use this (or any mobile) framework with the Intel XDK, but you will have to do so manually, without the help of the Intel XDK App Designer UI layout tool. If you wish to continue using Topcoat please visit the Topcoat project page and the Topcoat GitHub repo for documentation and help.

  • Ratchet -- This UI framework has been deprecated and will be retired from App Designer in a future release of the Intel XDK. You can always use this (or any mobile) framework with the Intel XDK, but you will have to do so manually, without the help of the Intel XDK App Designer UI layout tool. If you wish to continue using Ratchet please visit the Ratchet project page and the Ratchet GitHub repo for documentation and help.

  • jQuery Mobile -- This UI framework has been deprecated and will be retired from App Designer in a future release of the Intel XDK. You can always use this (or any mobile) framework with the Intel XDK, but you will have to do so manually, without the help of the Intel XDK App Designer UI layout tool. If you wish to continue using jQuery Mobile please visit the jQuery Mobile API page and jQuery Mobile GitHub page for documentation and help.

What do the Google* Map widget’s "center type" attribute and its values "Auto calculate," "Address", and "Lat/Long" mean?

The "center type" parameter defines how the map view is centered in your div. It is used to initialize the map as follows:

  • Lat/Long: center the map on a specific latitude and longitude (that you provide on the properties page)
  • Address: center the map on a specific address (that you provide on the properties page)
  • Auto Calculate: center the map on a collection of markers

This is just for initialization of the map widget. Beyond that you must use the standard Google maps APIs to move and/or modify the map. See the "google_maps.js" code for initialization of the widget and some calls to the Google maps APIs. There is also a pointer to the Google maps API at the beginning of the JS file.
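For reference, the three modes map onto the standard Google Maps JavaScript API roughly as follows; the element id, coordinates, address, and markers array are placeholders, and the generated google_maps.js may organize this differently:

// Placeholders: "map_canvas", the coordinates, the address, and "markers" are examples only.
var map = new google.maps.Map(document.getElementById("map_canvas"), {
    zoom: 12,
    center: { lat: 48.1351, lng: 11.5820 }   // "Lat/Long": center on explicit coordinates
});

// "Address": geocode an address string, then center the map on the result.
new google.maps.Geocoder().geocode({ address: "Munich, Germany" }, function (results, status) {
    if (status === "OK") { map.setCenter(results[0].geometry.location); }
});

// "Auto Calculate": fit the view around a collection of markers.
var bounds = new google.maps.LatLngBounds();
markers.forEach(function (m) { bounds.extend(m.getPosition()); });
map.fitBounds(bounds);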

To get the current position, you have to use the Geo API, and then push that into the Maps API to display it. The Google Maps API will not give you any device data, it will only display information for you. Please refer to the Intel XDK "Hello, Cordova" sample app for some help with the Geo API. There are a lot of useful comments and console.log messages.
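A minimal sketch of that flow, assuming the standard HTML5/Cordova geolocation API and an already-created map object:

// Get the device position with the geolocation API, then push it into the map to display it.
navigator.geolocation.getCurrentPosition(
    function (pos) {
        var here = { lat: pos.coords.latitude, lng: pos.coords.longitude };
        map.setCenter(here);                                   // map: your google.maps.Map object
        new google.maps.Marker({ position: here, map: map });  // optionally mark the position
    },
    function (err) { console.log("geolocation error: " + err.message); },
    { enableHighAccuracy: true, timeout: 10000 }
);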

How do I size UI elements in my project?

Trying to implement "pixel perfect" user interfaces with HTML5 apps is not recommended as there is a wide array of device resolutions and aspect ratios and it is impossible to insure you are sized properly for every device. Instead, you should use "responsive web design" techniques to build your UI so that it adapts to different sizes automatically. You can also use the CSS media query directive to build CSS rules that are specific to different screen dimensions.

Note:The viewport is sized in CSS pixels (aka virtual pixels or device independent pixels) and so the physical pixel dimensions are not what you will normally be designing for.
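For example, the same breakpoint you would write as a CSS media query can also be observed from JavaScript with window.matchMedia; the 480 CSS-pixel breakpoint and the "compact" class name below are arbitrary choices for illustration:

// Arbitrary breakpoint and class name, shown only to illustrate viewport-driven layout.
var phoneQuery = window.matchMedia("(max-width: 480px)");

function applyLayout(mq) {
    // Toggle a class that your CSS rules can key off of.
    document.body.classList.toggle("compact", mq.matches);
}

applyLayout(phoneQuery);              // apply once at startup
phoneQuery.addListener(applyLayout);  // re-apply when the viewport crosses the breakpoint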

How do I create lists, buttons and other UI elements with the Intel XDK?

The Intel XDK provides you with a way to build HTML5 apps that are run in a webview on the target device. This is analogous to running in an embedded browser (refer to this blog for details). Thus, the programming techniques are the same as those you would use inside a browser, when writing a single-page client-side HTML5 app. You can use the Intel XDK App Designer tool to drag and drop UI elements.

Why is the user interface for Chrome on Android* unresponsive?

It could be that you are using an outdated version of the App Framework* files. You can find the recent versions here. You can safely replace any App Framework files that App Designer installed in your project with more recent copies as App Designer will not overwrite the new files.

How do I work with more recent versions of App Framework* since the latest Intel XDK release?

You can replace the App Framework* files that the Intel XDK automatically inserted with more recent versions that can be found here. App designer will not overwrite your replacement.

Is there a replacement to XPATH in App Framework* for selecting nodes from an XML document?

This FAQ applies only to App Framework 2. App Framework 3 no longer includes a replacement for the jQuery selector library; it expects that you are using standard jQuery.

App Framework is a UI library that implements a subset of the jQuery* selector library. If you wish to use jQuery for XPath manipulation, it is recommended that you use jQuery as your selector library and not App Framework. However, it is also possible to use jQuery with the UI components of App Framework. Please refer to this entry in the App Framework docs.

It would look similar to this:

<script src="lib/jq/jquery.js"></script>
<script src="lib/af/jq.appframework.js"></script>
<script src="lib/af/appframework.ui.js"></script>

Why does my App Framework* app that was previously working suddenly start having issues with Android* 4.4?

Ensure you have upgraded to the latest version of App Framework. If your app was built with the now-retired Intel XDK "legacy" build system, be sure to set the "Targeted Android Version" to 19 in the Android-Crosswalk build settings. The legacy build targeted Android 4.2.

How do I manually set a theme?

If you want to, for example, change the theme only on Android*, you can follow these steps (a combined sketch appears after the list):

  1. $.ui.autoLaunch = false; //Stop the App Framework* auto launch right after you load App Framework*
  2. Detect the underlying platform using either navigator.userAgent or intel.xdk.device.platform or window.device.platform. If the platform detected is Android*, set $.ui.useOSThemes = false to disable custom themes and set <div id="afui" class="android light">
  3. Otherwise, set $.ui.useOSThemes=true;
  4. When device ready and document ready have been detected, add $.ui.launch();
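Putting those steps together, a sketch might look like the following. It assumes App Framework has just been loaded in a Cordova webview (so the deviceready event will fire); the exact calls may vary with your App Framework version:

$.ui.autoLaunch = false;                               // 1. stop App Framework's automatic launch

var isAndroid = /Android/i.test(navigator.userAgent);  // 2. detect the platform
$.ui.useOSThemes = !isAndroid;                         // 2./3. disable custom themes on Android only

document.addEventListener("deviceready", function () {
    $(document).ready(function () {
        if (isAndroid) {
            $("#afui").addClass("android light");      // 2. apply the Android light theme classes
        }
        $.ui.launch();                                 // 4. launch once both events have fired
    });
}, false);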

How does page background color work in App Framework?

In App Framework the BODY is in the background and the page is in the foreground. If you set the background color on the body, you will see the page's background color. If you set the theme to default, App Framework uses a native-like theme based on the device at runtime. Otherwise, it uses the App Framework theme. This is normally done using the following:

<script>
    $(document).ready(function () {
        $.ui.useOSThemes = false;
    });
</script>

Please see Customizing App Framework UI Skin for additional details.

What kind of templates can I use to create App Designer projects?

Currently, you can only create App Designer projects by selecting either the blank 'HTML5+Cordova' template or the blank 'Standard HTML5' template and checking the app designer box at the bottom of the template selection.

App designer versions of the layout and user interface templates previously existed, but they were removed in the Intel XDK 3088 release.

My AJAX calls do not work on Android; I'm getting valid JSON data with an invalid return code.

The jQuery 1 library appears to be incompatible with the latest versions of the cordova-android framework. To fix this issue you can either upgrade your jQuery library to jQuery 2 or use a technique similar to that shown in the following test code fragment to check your AJAX return codes. See this forum thread for more details. 

The jQuery site only tests jQuery 2 against Cordova/PhoneGap apps (the Intel XDK builds Cordova apps). See the "How to Use It" section of the jQuery 2.0 release post at https://blog.jquery.com/2013/04/18/jquery-2-0-released/ for more information.

If you built your app using App Designer, it may still be using jQuery 1.x rather than jQuery 2.x, in which case you need to replace the version of jQuery in your project. Simply download and replace the existing copy of jQuery 1.x in your project with the equivalent copy of jQuery 2.x.

Note, in particular, the switch case that checks for zero and 200. This test fragment does not cover all possible AJAX return codes, but should help you if you wish to continue to use a jQuery 1 library as part of your Cordova application.

function jqueryAjaxTest() {

     /* button  #botRunAjax */
     $(document).on("click", "#botRunAjax", function (evt) {
         console.log("function started");
         var wpost = "e=132&c=abcdef&s=demoBASICA";
         $.ajax({
             type: "POST",
             crossDomain: true, //;paf; see http://stackoverflow.com/a/25109061/2914328
             url: "http://your.server.url/address",
             data: wpost,
             dataType: 'json',
             timeout: 10000
         })
         .always(function (retorno, textStatus, jqXHR) { //;paf; see http://stackoverflow.com/a/19498463/2914328
             console.log("jQuery version: " + $.fn.jquery) ;
             console.log("arg1:", retorno) ;
             console.log("arg2:", textStatus) ;
             console.log("arg3:", jqXHR) ;
             if( parseInt($.fn.jquery) === 1 ) {
                 switch (retorno.status) {
                    case 0:
                    case 200:
                        console.log("exit OK");
                        console.log(JSON.stringify(retorno.responseJSON));
                        break;
                    case 404:
                        console.log("exit by FAIL");
                        console.log(JSON.stringify(retorno.responseJSON));
                        break;
                    default:
                        console.log("default switch happened") ;
                        console.log(JSON.stringify(retorno.responseJSON));
                        break ;
                 }
             }
             else if( (parseInt($.fn.jquery) === 2) && (textStatus === "success") ) { // else-if so the jQuery 1 path above does not fall through to "unknown"
                 switch (jqXHR.status) {
                    case 0:
                    case 200:
                        console.log("exit OK");
                        console.log(JSON.stringify(jqXHR.responseJSON));
                        break;
                    case 404:
                        console.log("exit by FAIL");
                        console.log(JSON.stringify(jqXHR.responseJSON));
                        break;
                    default:
                        console.log("default switch happened") ;
                        console.log(JSON.stringify(jqXHR.responseJSON));
                        break ;
                 }
             }
             else {
                console.log("unknown") ;
             }
         });
     });
 }

What do the data-uib and data-ver properties do in an App Designer project?

App Designer adds the data-uib and data-ver properties to many of the UI elements it creates. These property names only appear in the index.html file on various UI elements. There are other similar data properties, like data-sm, that are only required when you are using a service method.

The data-uib and data-ver properties are used only by App Designer. They are not needed by the UI frameworks supported by App Designer; they are used by App Designer to correctly display and apply widget properties when you are operating in the "design" view within App Designer. These properties are not critical to the functioning of your app; however, removing them will cause problems with the "design" view of App Designer.

The data-sm property is inserted by App Designer, and it may be used by data_support.js, along with other support libraries. The data-sm property is relevant to the proper functioning of your app.

Unable to select App Designer UI option when I create a new App Designer project.

If you previously created an App Designer project named 'ui-test' that you then delete and then create another App Designer project using the same name (e.g., 'ui-test'), you will not be given the option to select the UI framework for the new project named 'ui-test.' This is because the Intel XDK remembers a framework name for each project name that has been used and does not delete that entry from the global-settings.xdk file when you delete a project (e.g. if you chose "Framework 7" the first time you created an App Designer project with the name 'ui-test' then deleting 'ui-test' and creating a new 'ui-test' will result in another "Framework 7" project).

Because the UI framework name is not removed from the global-settings.xdk file when you delete the project, you must either use a new unique project name or edit the global-settings.xdk file to delete that old UI framework association. This is a bug that has been reported, but has not been fixed. Following is a workaround:

"FILE-/C/Users/xxx/Downloads/pkg/ui-test/www/index.html": {"canvas_width": 320,"canvas_height": 480,"framework": "framework 7"
}
  • Remove the last line ("framework": "framework 7") from the JSON object (remember to remove the comma at the end of the preceding line or you won't have a proper JSON file and your global-settings.xdk file will be considered corrupt).
  • Save and close the global-settings.xdk file.
  • Launch the Intel XDK.
  • Create a new project with the old name you are reusing.

You should now see the list of App Designer framework UI selection options when you create the new project with a previously used project name that you have deleted.

Back to FAQs Main

Product Licensing FAQ


Services

Q: What types of Attestation for Intel® Software Guard Extensions (Intel® SGX) exist and how do those relate to access?

A: There are two types of attestations for Intel SGX – local attestation and remote attestation. Local attestation occurs between two enclaves on the same client platform and does not require access to Intel’s provisioning or attestation services. Remote attestation involves an enclave proving its trustworthiness to a backend service. Depending on the stage of their development process (see Development / Production Services questions below), once a developer or Licensee obtains access to Intel’s services, they can use the attestation service to facilitate establishing a trust relationship between enclaves.

Q: How frequently should my application request attestation of its enclave?

A: The frequency of requests should be determined by the application requirements. There may be applications that perform attestation once or rarely – an example being a one-time attestation with an application server to download a product license key or other critical sensitive application material (sealed to the client platform). For closed environments, IT could perform a one-time provisioning / attestation of platforms when building the platforms, prior to bringing the systems into the closed environment. On the other end of the spectrum, multiple attestations may occur in DRM- or transaction-based applications. Such applications will likely implement a periodic attestation challenge to client machines when they refresh encryption and licensing keys.

Q: How do “Development Services” differ from “Production Services”?

A: While Intel strives to provide a high level of uptime / availability for all Services environments, high availability of the Production Services environment is prioritized. Other differences include:

  • Requirements for access: Enrollment for Production Services requires a signed Commercial Use License Agreement and the completion of certain technical onboarding steps, including providing a certificate signed by a well-known certificate authority.
  • Utilization: Production services will only be leveraged by shipping applications, thus servicing a potentially large number of clients with a limited number of requests per client. Development services are intended for developers working on attestation service development usages and thus may have a relatively small number of clients but potentially a large number of requests per client – especially for the validation / stress testing platforms.

Q: Can I use my “Development Services access” for my production application?

A: Production applications / solutions should use the Production Services environment endpoints for making service calls.

Q: Does Intel provide a Service Level Agreement (SLA) for the provisioning and attestation services? How do I obtain one? Does Intel charge for the provisioning and attestation services for Intel SGX?

A: A basic level of service is provided for both Development (target 99.0%) and Production (target 99.9%) Services, but is not guaranteed. If your business or company requires a higher service level than the basic level of service, please contact your Intel representative for other options.

Q: I want to run my own attestation service (or infrastructure) rather than use Intel’s. Can I do that?

A: Yes. If you can securely inject a key into an enclave, you can build an attestation infrastructure atop that. Intel does not prevent this type of development. A downside is that if you need to complete a Trusted Computing Base (TCB) recovery, another secure key injection may be required.

Q: How do I enroll into Production Services for Intel SGX?

A. Use the online process to submit the necessary information to Intel. After a commercial use license agreement is in place and technical onboarding is complete, licensees will be provided with the production service endpoints. Get started here.

Commercial Use License Agreement

Q: What does a commercial use license agreement grant me access to?

A. Having a commercial use license agreement in place and completing technical onboarding places the licensee’s key on the whitelist, which in turn allows the licensee’s application to create and run production enclaves.

Q: Why do I have to apply for a commercial use license?

A: A commercial use license agreement is required to use the Production Services environment endpoints. Intel enters into a commercial use license agreement with companies that meet defined development and security standards. This entitles users of Licensee products utilizing Intel SGX to make certain assumptions about the software they are relying upon.

Q: Do I need to have a commercial use license agreement in place if the debug enclave meets my needs?

A: No. But note, the SGX debug instructions (EDBGRD / EDBGWR) could be used to step into the debug enclave and expose / modify enclave content and behavior.

Q: How long does it typically take Intel to review and disposition a commercial use license agreement request?

A: Intel treats requests for establishing a commercial use license agreement as a high priority and will work with you to establish an estimated timeframe for completion once we receive the details of your application. Please note that the actual time to disposition a request depends on the volume of requests being received, the accuracy of the information provided, and responsiveness to any follow-up / clarifications that may be needed.

Q: What criteria does Intel use to determine whether or not someone is granted a commercial use license agreement?

A: Intel is interested in empowering developers to better protect their code and user / application data. Criteria include a developer’s ability to follow industry secure development practices and confirmation of the type of application being developed (avoiding malware, spyware, or other nuisance software). Please refer to the Intel SGX Licensee Guide for additional items.

Q: What can I do if Intel does not approve my commercial use license agreement request?

A: If you believe your application and company have satisfied the listed standards, you can provide the appropriate data to your Intel contact and your application for a commercial use license agreement will be re-examined.

Whitelist

Q: How is the whitelist related to the commercial use license agreement?

A: The whitelist is used as a control point to ensure that an application has authorization to create an enclave (trusted execution environment). Accounts that enter into a commercial use license agreement with Intel must be added to the whitelist before their application leveraging Intel SGX technology is released.

Q: How will I be notified of revocation actions?

A: If necessary, Intel will contact the authorized technical contact at the Licensee.

Intel® SGX Licensee Guide


Licensee Guide

1-Sept-2016
Subject to Revision

Executive Summary

As an Intel SGX licensed developer (Licensee), your use of technology for Intel SGX is expected to comply with the following guidelines. These guidelines have been developed so that users of Intel SGX enabled software can make certain assumptions about the software they are relying upon. Failure to meet these guidelines can result in your license being terminated. Intel may update these guidelines from time to time without notice.

Terms used in this document

Licensee – the developer organization which has accepted the license conditions for Intel SGX.

Platform Software (PSW) – Software for Intel SGX, which includes but is not limited to the enclaves used for EPID provisioning, attestation, launching, and platform services that are required for the SGX architecture to work.

Software Development Kit (SDK) – Software for Intel SGX, used by a Licensee to create their own Intel SGX enabled application software.

Enclave Signing Key – A public/private key pair used by the Licensee to generate a signature over their software enclave in the form of a SIGSTRUCT.

Attestation Service for Intel SGX – A RESTful web service provided and operated by Intel which allows a Licensee to verify attestation evidence (enclave QUOTES) submitted by Service Providers.

Production Attestation Service for Intel SGX – The production version of the Attestation Service, intended for business-relevant traffic. Access requires a signed Commercial Use License Agreement.

Development Attestation Service for Intel SGX – The test version of the Attestation Service, intended for test traffic and functional validation. It does not require a signed Commercial Use License Agreement, but does require providing some basic configuration information to Intel. Customer traffic may be throttled, and the service level and availability may be lower than for the Production Attestation Service.

Trusted Computing Base (TCB) – the critical parts of the platform that maintain the security of the Intel SGX enabled platform.

Secure SW Development Practices

For the most part, software inside an enclave is just like any other software. To achieve optimal security, it should be developed and validated with care. Placing your software in an Intel SGX protected environment does not relieve the software developer of the need to follow good development techniques and secure programming practices. Specific enclave development issues are highlighted in the Developer Guide for Intel SGX that accompanies the Software Development Kit for Intel SGX. In addition, Licensees should:

  • Observe industry secure coding best practices for software development to avoid vulnerabilities (e.g., a secure software development framework, coding standards, data input validation, least access possible, secure logging, and so on).
  • Address and fix significant security vulnerabilities within a reasonable time after becoming aware of the vulnerability.
  • Ensure that the licensed application installer includes the Platform Software (PSW) Installer for Intel SGX.
  • Ensure that end users receive PSW updates via application update mechanism.
  • Observe best industry practices to: (i) not write malware, spyware or other nuisance software; (ii) not write poorly designed software that contains significant security vulnerabilities or that fails to deliver its security promise.
  • Write their application code to minimize the Enclave memory footprint for Intel SGX. Do not stress the enclave memory space or consume an unreasonable amount of it, such that the enclave space or Services for Intel SGX are effectively made unavailable to other Licensees.
  • Construct their Licensed Software Applications to enable complete removal on end user request, including removal of any sealed data.

Responsible Reporting to Intel

Where you, the Licensee, suspect that Intel software may have a bug or security flaw, you must privately inform Intel of the issue in a timely fashion and work with Intel to address the potential issue. Intel will assess the issue and, if necessary, devise a remedy, taking into account the severity of the issue. You must refrain from any public disclosure of the issue prior to reaching agreement with Intel on the timetable and content for such disclosure. For further details on reporting security issues to Intel see https://security-center.intel.com.

System Components for Intel SGX

Platform Software Installer for Intel SGX

The Platform Software (PSW) Installer for Intel SGX is a prerequisite for running Production Applications for Intel SGX on Intel SGX capable systems. As the PSW for Intel SGX may not be pre-installed by OEMs or the Operating System provider, all Licensed Applications are required to include it as a component within their product installer. The PSW installer for Intel SGX performs standard version checking to determine if a current or newer version has been previously installed on the client machine.

Services for Intel SGX

Intel will provide an Attestation Service for Intel SGX for parties relying on Intel SGX to verify quotes received from an Intel SGX-capable platform. Developers should leverage the Development Attestation Service endpoint when developing and validating support for Intel SGX in their application. As a developer transitions to become a Licensee, the Licensee should migrate their application servers to use the Production Attestation Service endpoint.

Intel may introduce new versions of the Attestation Service over time, and will support a specified number of older versions. The Licensee is responsible for transitioning their application / solution to a supported version prior to the “end of support” date of an older version.

Trusted Computing Base (TCB) Updates

Intel may issue periodic updates to parts of the platform that are critical to Intel SGX. Some parts of the TCB are included in the Platform SW Installer, and therefore the Licensee is responsible for the delivery of this update.

Expect communication from Intel on the definition and scope of issues related to a TCB update, including a list of affected components with an SLA for updated component delivery.

A TCB Recovery Event is a strong indicator that an exploitable vulnerability existed and has been remediated. Such an update will coincide with Intel being able to issue a new attestation key to updated platforms.

The Licensee should consider the impact of a TCB update to their relying party software which makes use of the Attestation Service for Intel SGX.

Updates and Communication

As Intel produces updated versions of the PSW for Intel SGX, developers are expected to deploy the PSW update via their application’s product update delivery mechanism. The schedule for performing the update depends on the type.

For Security Updates – Intel will issue a standard information advisory on https://security-center.intel.com should we discover a vulnerability with the distributable components of the PSW or SDK for Intel SGX.

For TCB Updates: evaluate the impact to the Application as well as the Application Server and take necessary action.

For Technology Updates – Intel will from time to time update the PSW, SDK, and Services. As Intel releases updates, we will work with Licensees to determine a transition period.

In general, Developers should subscribe to Intel SGX communications and promptly take the recommended action(s) (e.g., patch/update information, TCB recovery information, SDK / Platform Software updates, Service version updates), notify Intel of issues and follow the documented customer escalation process.

Enclave Signing Key Management

A Licensee’s enclave private signing key, if lost, could be used by malware to sign enclaves, including one that may be used to attest to the developer’s attestation server. Thus, loss of the Licensee’s enclave private signing key not only impacts the Licensee’s reputation but could also pose a significant risk of IP theft.

Licensees must inform Intel as soon as reasonably possible after becoming aware of any risk to, theft, or compromise of an enclave private signing key.

Intel recommends that developers and Licensees use a protected environment, such as an HSM-managed enclave signing system, for production signing. (NOTE: This is only needed for production signing, not development (debug) signing. Production signing of enclaves should be performed after completing code reviews, security reviews, and functional validation.)

Intel also recommends that developers and Licensees use industry best practices for key management to protect the enclave signing private key from theft (by insiders or external agents), compromise (accidental release or discovery due to negligence) or other abuse.

A reference for best practices:

https://www.thawte.com/code-signing/whitepaper/best-practices-for-code-signing-certificates.pdf

Intel® XDK FAQs - Debug & Test


Why is the Debug tab being deprecated and removed from the Intel XDK?

The Debug tab is being retired because, as previously announced and noted in the release notes, future editions of the Intel XDK will focus on the development of IoT (Internet of Things) apps and IoT mobile companion apps. Since we introduced the Intel XDK IoT Edition in September of 2014, the need for accessible IoT app development tools has increased dramatically. At the same time, HTML5 mobile app development tools have matured significantly. Given the maturity of the free open-source HTML5 mobile app development tools, we feel you are best served by using those tools directly.

Similar reasoning applies to the hosted weinre server (on the Test tab) and the Live Development Pane on the Develop tab.

How do I do "rapid debugging" with remote CDT (or weinre or remote Web Inspector) in a built app?

Attempting to debug a built mobile app (with weinre, remote CDT or Safari Web Inspector) seems like a difficult or impossible task. There are, in fact, many things you can do with a built app that do not require rebuilding and reinstalling your app between each source code change.

You can continue to use the Simulate tab for debugging that does not depend on third-party plugins. Then switch to debugging a built app when you need to deal with third-party plugin issues that cannot be resolved using the Simulate tab. The best place to start is with a built Android app installed directly on-device, which provides full JavaScript and CSS debugging, by way of remote Chrome* DevTools*. For those who have access to a Mac, it is also possible to use remote web inspector with Safari to debug a built iOS app. Alternatively, you can use weinre to debug a built app by installing a weinre server directly onto your development system. For additional help on using weinre locally, watch Using WEINRE to Debug an Intel® XDK Cordova* App (beginning at about 14:30 in the video).

The interactive JavaScript console is your "best friend" when debugging with remote CDT, remote Web Inspector or weinre in a built app. Watch this video from ~19:30 for a technique that shows how to modify code during your debug session, without requiring a rebuild and reinstall of your app, via the JavaScript debug console. The video demonstrates this technique using weinre, but the same technique can also be used with a CDT console or a Web Inspector console.
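For example, from the remote console you can overwrite a function in the running app and immediately exercise the new behavior without a rebuild; the showScore function name below is hypothetical:

// Typed into the remote CDT / Web Inspector / weinre console of the running, built app.
// "showScore" is a hypothetical global function in your app, used only for illustration.
var originalShowScore = window.showScore;   // keep a reference to the original
window.showScore = function (points) {
    console.log("patched showScore called with", points);
    // ...experiment with an alternate implementation here, then trigger it from the app UI...
};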

Likewise, use the remote CDT CSS editor to try manipulating CSS rules in order to figure out how to best style your UI. Or, use the Simulate tab or the Brackets* Live Preview feature. The Brackets Live Preview feature utilizes your desktop browser to provide a feature similar to Intel XDK Live Layout Editing. If you use the Google* Chrome browser with Brackets Live Preview you can use the Chrome device emulation feature to simulate a variety of customizable device viewports.

The Intel XDK is not generating a debug module or is not starting my debug module.

There are a variety of things that can go wrong when attempting to use the Debug tab:

  • your test device cannot be seen by the Debug tab
  • the debug module build fails 
  • the debug module builds, but fails to install onto your test device 
  • the debug module builds and installs, but fails to "auto-start" on your test device 
  • your test device has run out of memory or storage and needs to be cleared
  • there is a problem with the adb server on your development system

Other problems may also arise, but the above list represents the most common. Search this FAQ and the forum for solutions to these problems. Also, see the Debug tab documentation for some help with installing and configuring your system to use the adb debug driver with your device.

What are the requirements for Testing on Wi-Fi?

  1. Both Intel XDK and App Preview mobile app must be logged in with the same user credentials.
  2. Both devices must be on the same subnet.

Note: Your computer's Security Settings may be preventing Intel XDK from connecting with devices on your network. Double check your settings for allowing programs through your firewall. At this time, testing on Wi-Fi does not work within virtual machines.

How do I configure app preview to work over Wi-Fi?

  1. Ensure that both Intel XDK and App Preview mobile app are logged in with the same user credentials and are on the same subnet
  2. Launch App Preview on the device
  3. Log into your Intel XDK account
  4. Select "Local Apps" to see a list of all the projects in Intel XDK Projects tab
  5. Select desired app from the list to run over Wi-Fi

Note: Ensure the app source files are referenced from the right source directory. If it isn't, on the Projects Tab, change the 'source' directory so it is the same as the 'project' directory and move everything in the source directory to the project directory. Remove the source directory and try to debug over local Wi-Fi.

How do I clear app preview cache and memory?

[Android*] Simply kill the app running on your device as an Active App on Android* by swiping it away after clicking the "Recent" button in the navigation bar. Alternatively, you can clear data and cache for the app from under Settings App > Apps > ALL > App Preview.

[iOS*] By double tapping the Home button then swiping the app away.

[Windows*] You can use the Windows* Cache Cleaner app to do so.

What are the Android* devices supported by App Preview?

We officially only support and test Android* 4.x and higher, although you can use Cordova for Android* to build for Android* 2.3 and above. For older Android* devices, you can use the build system to build apps and then install and run them on the device to test. To help in your testing, you can include the weinre script tag from the Test tab in your app before you build your app. After your app starts up, you should see the Test tab console light up when it sees the weinre script tag contact the device (push the "begin debugging on device" button to see the console). Remember to remove the weinre script tag before you build for the store.

What do I do if Intel XDK stops detecting my Android* device?

When Intel XDK is not running, kill all adb processes that are running on your workstation and then restart Intel XDK; conflicts between different versions of adb frequently cause such issues. Ensure that applications such as Eclipse that run copies of adb are not running. You may scan your disk for copies of adb:

[Linux*/OS X*]:

$ sudo find / -name adb -type f 

[Windows*]:

> cd \
> dir /s adb.exe

For more information on Android* USB debug, visit the Intel XDK documentation on debugging and testing.

How do I debug an app that contains third party Cordova plugins?

See the Debug and Test Overview doc page for a more complete overview of your debug options.

When using the Test tab with Intel App Preview your app will not include any third-party plugins, only the "core" Cordova plugins.

The Emulate tab will load the JavaScript layer of your third-party plugins, but does not include a simulation of the native code part of those plugins, so it will present you with a generic "return" dialog box to allow you to execute code associated with third-party plugins.

When debugging Android devices with the Debug tab, the Intel XDK creates a custom debug module that is then loaded onto your USB-connected Android device, allowing you to debug your app AND its third-party Cordova plugins. When using the Debug tab with an iOS device only the "core" Cordova plugins are available in the debug module on your USB-connected iOS device.

If the solutions above do not work for you, then your best bet for debugging an app that contains a third-party plugin is to build it and debug the built app installed and running on your device. 

[Android*]

1) For Crosswalk* or Cordova for Android* build, create an intelxdk.config.additions.xml file that contains the following lines:

<!-- Change the debuggable preference to true to build a remote CDT debuggable app for -->
<!-- Crosswalk* apps on Android* 4.0+ devices and Cordova apps on Android* 4.4+ devices. -->
<preference name="debuggable" value="true" />
<!-- Change the debuggable preference to false before you build for the store. -->

and place it in the root directory of your project (in the same location as your other intelxdk.config.*.xml files). Note that this will only work with Crosswalk* on Android* 4.0 or newer devices or, if you use the standard Cordova for Android* build, on Android* 4.4 or greater devices.

2) Build the Android* app

3) Connect your device to your development system via USB and start the app

4) Start Chrome on your development system and type "chrome://inspect" in the Chrome URL bar. You should see your app in the list of apps and tabs presented by Chrome; click the "inspect" link to get a full remote CDT session to your built app. Be sure to close the Intel XDK before you do this: sometimes there is interference between the version of adb used by Chrome and that used by the Intel XDK, which can cause a crash. You might have to kill the adb process before you start Chrome (after you exit the Intel XDK).

[iOS*]

Refer to the instructions on the updated Debug tab docs to get on-device debugging. We do not yet have the ability to build a development version of your iOS* app, so you cannot use this technique to build iOS* apps. However, you can include the weinre script tag from the Test tab in your iOS* app when you build it and use the Test tab to remotely access your built iOS* app. This works best if you include a lot of console.log messages.

[Windows* 8]

You can use the Test tab, which provides a weinre script tag. Include it in the app that you build, run the app, and connect to the weinre server to work with the console.

Alternatively, you can use App Center to set up and access the weinre console (go here and use the "bug" icon).

Another approach is to write console.log messages to a <textarea> screen on your app. See either of these apps for an example of how to do that:

Why does my device show as offline on Intel XDK Debug?

“Media” mode is the default USB connection mode, but for reasons that have not been identified it frequently fails to work over USB on Windows* machines. Configure the USB connection mode on your device for "Camera" instead of "Media" mode.

What do I do if my remote debugger does not launch?

You can try the following to have your app run on the device via debug tab:

  • Place the intelxdk.js library before the </body> tag
  • Place your app specific JavaScript files after it
  • Place the call to initialize your app in the device ready event function

Why do I get an "error installing App Preview Crosswalk" message when trying to debug on device?

You may be running into a RAM or storage problem on your Android device; as in, not enough RAM available to load and install the special App Preview Crosswalk app (APX) that must be installed on your device. See this site (http://www.devicespecifications.com) for information regarding your device. If your device has only 512 MB of RAM, which is a marginal amount for use with the Intel XDK Debug tab, you may have difficulties getting APX to install.

You may have to do one or all of the following:

  • remove as many apps from RAM as possible before installing APX (rebooting the device is the simplest approach)
  • make sure there is sufficient storage space in your device (uninstall any unneeded apps on the device)
  • install APX by hand

The last step is the hardest, but only if you are uncomfortable with the command-line:

  1. while attempting to install APX (above) the XDK downloaded a copy of the APK that must be installed on your Android device
  2. find that APK that contains APX
  3. install that APK manually onto your Android device using adb

To find the APK, on a Mac:

$ cd ~/Library/Application\ Support/XDK
$ find . -name '*.apk'

To find the APK, on a Windows machine:

> cd %LocalAppData%\XDK
> dir /s *.apk

For each version of Crosswalk that you have attempted to use (via the Debug tab), you will find a copy of the APK file (but only if you have attempted to use the Debug tab and the XDK has successfully downloaded the corresponding version of APX). You should find something similar to:

./apx_download/12.0/AppAnalyzer.apk

following the searches, above. Notice the directory that specifies the Crosswalk version (12.0 in this example). The file named AppAnalyzer.apk is APX and is what you need to install onto your Android device.

Before you install onto your Android device, you can double-check to see if APX is already installed:

  • find "Apps" or "Applications" in your Android device's "settings" section
  • find "App Preview Crosswalk" in the list of apps on your device (there can be more than one)

If you found one or more App Preview Crosswalk apps on your device, you can see which versions they are by using adb at the command-line (this assumes, of course, that your device is connected via USB and you can communicate with it using adb):

  1. type adb devices at the command-line to confirm you can see your device
  2. type adb shell 'pm list packages -f' at the command-line
  3. search the output for the word app_analyzer

The specific version(s) of APX installed on your device end with a version ID. For example: com.intel.app_analyzer.v12 means you have APX for Crosswalk 12 installed on your device.

To install a copy of APX manually, cd to the directory containing the version of APX you want to install and then use the following adb command:

$ adb install AppAnalyzer.apk

If you need to remove the v12 copy of APX, due to crowding of available storage space, you can remove it using the following adb command:

$ adb uninstall com.intel.app_analyzer.v12

or

$ adb shell am start -a android.intent.action.DELETE -d package:com.intel.app_analyzer.v12

The second command uses an Android intent to request removal of the app; you'll have to confirm the uninstall request on the Android device's screen. See this SO issue for details. Obviously, if you want to uninstall a different version of APX, specify the package ID corresponding to that version of APX.

Why is Chrome remote debug not working with my Android or Crosswalk app?

For a detailed discussion regarding how to use Chrome on your desktop to debug an app running on a USB-connected device, please read this doc page Remote Chrome* DevTools* (CDT).

Check to be sure the following conditions have been met:

  • The version of Chrome on your desktop is greater than or equal to the version of the Chrome webview in which you are debugging your app.

    For example, Crosswalk 12 uses the Chrome 41 webview, so you must be running Chrome 41 or greater on your desktop to successfully attach a remote Chrome debug session to an app built with Crosswalk 12. The native Chrome webview in an Android 4.4.2 device is Chrome 30, so your desktop Chrome must be greater than or equal to Chrome version 30 to debug an app that is running on that native webview.
  • Your Android device is running Android 4.4 or higher, if you are trying to remote debug an app running in the device's native webview, and it is running Android 4.0 or higher if you are trying to remote debug an app running Crosswalk.

    When debugging against the native webview, remote debug with Chrome requires that the remote webview is also Chrome; this is not guaranteed to be the case if your Android device does not include a license for Google services. Some manufacturers do not have a license agreement with Google for distribution of the Google services on their devices and, therefore, may not include Chrome as their native webview, even if they are an Android 4.4 or greater device.
  • Your app has been built to allow for remote debug.

    Within the intelxdk.config.additions.xml file you must include this line: <preference name="debuggable" value="true" /> to build your app for remote debug. Without this option your app cannot be attached to for remote debug by Chrome on your desktop.

How do I detect if my code is running in the Emulate tab?

In the obsolete intel.xdk APIs there is a property you can test to detect whether your app is running within the Emulate tab or on a device: intel.xdk.isxdk. A simple alternative is to perform the following test:

if (window.tinyHippos) { /* running inside the Emulate tab */ }

If the test passes (the result is true) you are executing in the Emulate tab.

Never ending "Transferring your project files to the Testing Device" message from Debug tab; results in no Chrome DevTools debug console.

This is a known issue but a resolution for the problem has not yet been determined. If you find yourself facing this issue you can do the following to help resolve it.

On a Windows machine, exit the Intel XDK and open a "command prompt" window:

> cd %LocalAppData%\XDK
> rmdir cdt_depot /s /q

On a Mac or Linux machine, exit the Intel XDK and open a "terminal" window:

$ find ~ -name global-settings.xdk
$ cd <location-found-above>
$ rm -Rf cdt_depot

Restart the Intel XDK and try the Debug tab again. This procedure is deleting the cached copies of the Chrome DevTools that were retrieved from the corresponding App Preview debug module that was installed on your test device.

One action that is known to trigger this problem is removing one device from your USB port and attaching a new device for debug. A workaround that sometimes helps when switching between devices is to:

  • switch to the Develop tab
  • close the XDK
  • detach the old device from the USB
  • attach the new device to your USB
  • restart the XDK
  • switch to the Debug tab

Can you integrate the iOS Simulator as a testing platform for Intel XDK projects?

The iOS simulator only runs on Apple Macs, and we're trying to make the Intel XDK accessible to developers on the most popular platforms: Windows, Mac, and Linux. Additionally, the iOS simulator requires a specially built version of your app to run; you can't just load an IPA onto it for simulation.

What is the purpose of having only a partial emulation or simulation in the Emulate tab?

There's no purpose behind it; it's simply difficult to emulate or simulate every feature and quirk of every device.

Not everyone can afford hardware for testing, especially iOS devices; what can I do?

You can buy a used iPod and that works quite well for testing iOS apps. Of course, the screen is smaller and there is no compass or phone feature, but just about everything else works like an iPhone. If you need to do a lot of iOS testing it is worth the investment. A new iPod costs $200 in the US. Used ones should cost less than that. Make sure you get one that can run iOS 8.

Is testing on Crosswalk on a virtual Android device inside VirtualBox good enough?

When you run the Android emulator you are running on a fictitious device, but it is a better emulation than what you get with the iOS simulator and the Intel XDK Emulate tab. The Crosswalk webview further abstracts the system so you get a very good simulation of a real device. However, considering how inexpensive and easy Android devices are to obtain, we highly recommend you use a real device (with the Debug tab), it will be much faster and even more accurate than using the Android emulator.

Why isn't the Intel XDK emulation as good as running on a real device?

The Intel XDK Emulate tab is a Chromium browser, so what you get is the behavior inside that Chromium browser along with some conveniences that make it appear to be a hybrid device. It's poorly named as an emulator, but that was the name given to it by the original Ripple Emulator project. It is most useful for simulating the core Cordova APIs and your basic application logic. After that, it's best to use real devices with the Debug tab.

Why doesn't my custom splash screen show in the emulator or App Preview?

Ensure the splash screen plugin is selected. Custom splash screens only get displayed on a built app. The emulator and app preview will always use Intel XDK splash screens. Please refer to the 9-Patch Splash Screen sample for a better understanding of how splash screens work.

Is there a way to detect if my program has stopped due to using uninitialized variable or an undefined method call?

This is where the remote debug features of the Debug tab are extremely valuable. Using remote CDT (or remote Safari with a Mac and an iOS device) is the only real option for finding such issues. WEINRE and the Test tab do not work well in that situation because when the script stops, WEINRE stops.

Why doesn't the Intel XDK go directly to Debug assuming that I have a device connected via USB?

We are working on streamlining the debug process. There are still obstacles that need to be overcome to ensure the process of connecting to a device over USB is painless.

Can a custom debug module that supports USB debug with third-party plugins be built for iOS devices, or only for Android devices?

The Debug tab, for remote debug over USB can be used with both Android and iOS devices. Android devices work best. However, at this time, debugging with the Debug tab and third-party plugins is only supported with Android devices (running in a Crosswalk webview). We are working on making the iOS option also support debug with third-party plugins, like what you currently get with Android.

Why does my Android debug session not start when I'm using the Debug tab?

Some Android devices include a feature that prevents some applications and services from auto-starting, as a means of conserving power and maximizing available RAM. On Asus devices, for example, there is an app called the "Auto-start Manager" that manages apps that include a service that needs to start when the Android device starts.

If this is the case on your test device, you need to enable the Intel App Preview application as an app that is allowed to auto-start. See the image below for an example of the Asus Auto-start Manager:

Another thing you can try is manually starting Intel App Preview on your test device before starting a debug session with the Debug tab.

How do I share my app for testing in App Preview?

The only way to retrieve a list of apps in App Preview is to login. If you do not wish to share your credentials, you can create an alternate account and push your app to the cloud using App Preview and share that account's credentials, instead.

I am trying to use Live Layout Editing but I get a message saying Chrome is not installed on my system.

The Live Layout Editing feature of the Intel XDK is built on top of the Brackets Live Preview feature. Most of the issues you may experience with Live Layout Editing can be addressed by reviewing this Live Preview Isn't Working FAQ from the Brackets Troubleshooting wiki. In particular, see the section regarding using Chrome with Live Preview.

My AJAX or XHR or Angular $http calls are returning an incorrect return code in App Preview.

Some versions of App Preview include an XHR override library that is designed to deal with issues related to loading file:// URLs outside of the local app filesystem (this is something that is unique to App Preview). Unfortunately, this override appears to cause problems with the return codes for some AJAX, XHR and Angular $http calls. This XHR special handling code can be disabled by adding a "data-noxhrfix" property to your app's <head> tag, in your app's index.html file. For example:

<!DOCTYPE html><html><head data-noxhrfix><meta charset="UTF-8">
...

This override should only apply to situations where the result status is zero and the responseURL is not empty.


Hybrid Parallelism: A MiniFE* Case Study


In my first article, Hybrid Parallelism: Parallel Distributed Memory and Shared Memory Computing, I discussed the chief forms of parallelism: shared memory parallel programming and distributed memory message passing parallel programming. That article explained the basics of threading for shared memory programming and Message Passing Interface* (MPI) message passing. It included an analysis of a hybrid hierarchical version of one of the NAS Parallel Benchmarks*. In that case study the parallelism for threading was done at a lower level than the parallelism for MPI message passing.

This case study examines the situation where the problem decomposition is the same for threading as it is for MPI; that is, the threading parallelism is elevated to the same level as the MPI parallelism. The reasons to elevate threading to the same parallel level as MPI message passing are to see whether performance improves because thread libraries have less overhead than MPI calls, and to check whether memory consumption is reduced by using threads rather than large numbers of MPI ranks.

This paper provides a brief overview of threading and MPI followed by a discussion of the changes made to miniFE*. Performance results are also shared. The performance gains are minimal, but threading consumed less memory, and data sets 15 percent larger could be solved using the threaded model. Indications are that the main benefit is reduced memory usage. This article will be of most interest to those who want to optimize for memory footprint, especially those working on first-generation Intel® Xeon Phi™ coprocessors (code-named Knights Corner), where available memory on the card is limited.

Many examples of hybrid distributed/shared memory parallel programming follow a hierarchical approach: MPI distributed memory programming is done at the top, and shared memory parallel programming is introduced in multiple regions of the software underneath (OpenMP* is a popular choice). Many software developers design good problem decompositions for MPI programming. However, when using OpenMP, some developers fall back to simply placing pragmas around do or for loops without considering overall problem decomposition and data locality. For this reason some say that if they want good parallel threaded code, they would write the MPI code first and then port it to OpenMP, because they are confident this would force them to implement good problem decomposition with good data locality.

This leads to another question: if good performance can be obtained either way, does it matter whether a threading model or MPI is used? There are two things to consider. One, of course, is performance: is MPI or threading inherently faster than the other? The second consideration is memory consumption. When an MPI job begins, an initial number of MPI processes, or ranks, that will cooperate to complete the overall work is specified. As ever larger problem sizes or data sets are run, the number of systems in a cluster dedicated to a particular job increases, and thus the number of MPI ranks increases. As the number of MPI ranks increases, the MPI runtime libraries consume more memory in order to be ready to handle a larger number of potential messages (see Hybrid Parallelism: Parallel Distributed Memory and Shared Memory Computing).

This case study compares both performance and memory consumption. The code used in this case study is miniFE. MiniFE was developed at Sandia National Labs and is now distributed as part of the Mantevo project (see mantevo.org).

Shared Memory Threaded Parallel Programming

In the threading model, all resources belong to the same process. Memory belongs to the process, so sharing memory between threads is easy: each thread simply needs a pointer to the common shared memory location. This means that at least one common pointer or address must be passed into each thread so that each thread can access the shared memory regions. Each thread has its own instruction pointer and stack.
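As a minimal illustration (not miniFE code), the sketch below uses standard C++ threads to show two threads working on one vector owned by the process; the only thing each thread needs is a reference to that shared memory.

#include <cstddef>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
  std::vector<double> shared(8, 0.0);            // memory owned by the process
  auto worker = [&shared](std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i)    // each thread writes only its own slice
      shared[i] = static_cast<double>(i);
  };
  std::thread t0(worker, std::size_t{0}, std::size_t{4});
  std::thread t1(worker, std::size_t{4}, std::size_t{8});
  t0.join();
  t1.join();
  std::printf("shared[5] = %g\n", shared[5]);
  return 0;
}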

A problem with threaded software is the potential for data race conditions. A data race occurs when two or more threads access the same memory address and at least one of the threads alters the value in memory. Whether the writing thread completes its write before or after the reading thread reads the value can alter the results of the computation. Mutexes, barriers, and locks were designed to control execution flow, protect memory, and prevent data races. These constructs create other problems: deadlock can occur, preventing any forward progress in the code, and contention for mutexes or locks can restrict execution flow and become a bottleneck. Mutexes and locks are not a simple cure-all; if they are not used correctly, data races can still exist. Placing locks that protect code segments rather than memory references is the most common error.

Distributed Memory MPI Parallel Programming

Distributed memory parallel programming models offer a range of methods with MPI. The most commonly used elements of MPI are message passing constructs, and the discussion in this case study uses the traditional explicit message passing interface of MPI. Any data that one MPI rank has that may be needed by another MPI rank must be explicitly sent by the first MPI rank to the other ranks that need it. In addition, the receiving MPI rank must explicitly request that the data be received before it can access and use the data sent. The developer must define the buffers used to send and receive data as well as pack or unpack them if necessary; if data is received into its desired location, it doesn't need to be unpacked.
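The sketch below is a minimal, self-contained example (not taken from miniFE) of this explicit send/receive pattern: rank 0 packs a small buffer and sends it, and rank 1 receives it directly into the location where it will be used.

#include <mpi.h>

#include <cstddef>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  std::vector<double> buf(4, 0.0);
  if (rank == 0) {
    for (std::size_t i = 0; i < buf.size(); ++i)
      buf[i] = 1.0 + static_cast<double>(i);     // pack the send buffer
    MPI_Send(buf.data(), static_cast<int>(buf.size()), MPI_DOUBLE,
             /*dest=*/1, /*tag=*/0, MPI_COMM_WORLD);
  } else if (rank == 1) {
    // receive directly into the location where the data will be used,
    // so no separate unpacking step is required
    MPI_Recv(buf.data(), static_cast<int>(buf.size()), MPI_DOUBLE,
             /*source=*/0, /*tag=*/0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  }

  MPI_Finalize();
  return 0;
}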

Finite Element Analysis

In finite element analysis a physical domain whose behavior is modeled by partial differential equations is divided into very small regions called elements. A set of basis functions (often polynomials) is defined for each element. The parameters of the basis functions approximate the solution to the partial differential equations within each element. The solution phase typically involves minimizing the difference between the true value of the physical property and the value approximated by the basis functions. These operations form a linear system of equations for each finite element known as an element stiffness matrix. Each of these element stiffness matrices is added into a global stiffness matrix, which is solved to determine the values of interest. The solution values represent a physical property: displacement, stress, density, velocity, and so on.

 

The miniFE code is representative of more general finite element analysis packages. In a general finite element program the elements may be irregular, of varying size, and have different physical properties. Only one element type and one domain are used within miniFE. The domain selected is always a rectangular prism that is divided into an integral number of elements along the three major axes: x, y, and z. The rectangular prism is then sliced parallel to the principal planes recursively to create smaller sets of the finite elements. This is illustrated in Figure 1.


Figure 1: A rectangular prism divided into finite elements and then split into four subdomains.

 

The domain is divided into several regions or subdomains containing a set of finite elements. Figure 1 illustrates the recursive splitting. The dark purple line near the center of the prism on the left shows where the domain may be divided into two subdomains. Further splitting may occur as shown by the two green lines. The figure on the right shows the global domain split into four subdomains. This example shows the splitting occurring perpendicular to the z-axis. However, the splitting may be done parallel to any plane. The splitting is done recursively to obtain the desired number of subdomains. In the original miniFE code, each of these subdomains is assigned to an MPI rank: one subdomain per each MPI rank. Each MPI rank determines a local numbering of its subdomain and maps that numbering to the numbering of the global domain. For example, an MPI rank may hold a 10×10×10 subdomain. Locally it would number this from 0–9 in the x, y, and z directions. Globally, however, this may belong to the region numbered 100–109 on the x-axis, 220–229 on the y-axis, and 710–719 along the z axis. Each MPI rank determines the MPI ranks with which it shares edges or faces, and then initializes the buffers it will use to send and receive data during the conjugate gradient solver phase used to solve the linear system of equations. There are additional MPI communication costs; when a dot product is formed, each rank calculates its local contribution to the dot product and then the value must be reduced across all MPI ranks.
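The local-to-global mapping described above amounts to adding a per-subdomain offset to each local index. The fragment below is an illustrative sketch (not miniFE code) of that idea; the struct and field names are invented for the example.

// A subdomain stores the global index of its local origin; local indices are
// converted to global ones by adding the offsets.
struct Subdomain {
  int off_x, off_y, off_z;                       // global position of local (0,0,0)
  int gx(int lx) const { return off_x + lx; }
  int gy(int ly) const { return off_y + ly; }
  int gz(int lz) const { return off_z + lz; }
};
// Example: Subdomain s{100, 220, 710}; s.gx(9) == 109 and s.gz(0) == 710.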

The miniFE code has options using threading models to parallelize the for loops surrounding many activities. In this mode the MPI ranks do not further subdivide their subdomain recursively into multiple smaller subdomains for each thread. Instead, for loops within the calculations for each subdomain are divided into parallel tasks using OpenMP*, Intel® Cilk™ Plus, or qthreads*. Initial performance data on the original reference code showed that running calculations for a specified problem set on 10 MPI ranks without threading was much faster than running one MPI rank with 10 threads. 

So the division among threads was not as efficient as the division between MPI ranks. Software optimization should begin with software performance analysis, and I used the TAU* Performance System for this. The data showed that the waxpy operations (vector axpy-style operations) consumed much more time in the hybrid thread/MPI version than in the MPI-only version. The waxpy operation is inherently parallel: it doesn't involve any reduction like a dot product, and there are no potential data-sharing problems that would complicate threading. The only reason for the waxpy operation to consume more time is that the threading models used are all fork-join models. That is, the work for each thread is forked off at the beginning of a for loop, and then all the threads join back again at the end. The effort to initiate computation at the fork and then synchronize at the end adds considerable overhead, which was not present in the MPI-only version of the code.

The original miniFE code divided the domain into a number of subdomains that matches the number of MPI ranks (this number is called numprocs). The identifier for each subdomain was the MPI rank number, called myproc. In the new hybrid code the original domain is divided into a number of subdomains that matches the total number of threads globally; this is the number of MPI ranks times the number of threads per rank (numthreads). This global count of subdomains is called idivs (idivs = numprocs * numthreads). Each thread is given a local identifier, mythread (beginning at zero, of course). The identifier of each subdomain changes from myproc to mypos (mypos = myproc * numthreads + mythread). When there is only one thread per MPI rank, mypos and myproc are equal. Code changes were implemented in each file to change references to numprocs to idivs, and myproc to mypos. A new routine was written between the main program and the routine driver. The main program forks off the number of threads indicated. Each thread begins execution of this new routine, which then calls the driver, and in turn each thread calls all of the subroutines that execute the full code path below the routine driver.
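For clarity, the identifier arithmetic just described can be written as two small helper functions; this is only a sketch using the variable names from the text, not code from miniFE.

// idivs: total number of subdomains (== total number of threads globally)
inline int total_subdomains(int numprocs, int numthreads) {
  return numprocs * numthreads;
}
// mypos: the subdomain owned by thread mythread of MPI rank myproc
inline int subdomain_id(int myproc, int numthreads, int mythread) {
  return myproc * numthreads + mythread;
}
// With one thread per rank (numthreads == 1), subdomain_id(myproc, 1, 0) == myproc.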

The principle of dividing work into compact local regions, or subdomains, remains the same. For example when a subdomain needs to share data with an adjacent subdomain it loops through all its neighbors and shares the necessary data. The code snippets below show the loops for sharing data from one subdomain to another in the original code and the new code. In these code snippets each subdomain is sending data to its adjacent neighbors with which it shares faces or edges. In the original code each subdomain maps to an MPI rank. These code snippets come from the file exchange_externals.hpp.  The original code is shown below in the first text box. Comments are added to increase clarity. 

Original code showing sends for data exchanges:

// prepare data to send to neighbors by copying data into send buffers
for(size_t i=0; i<total_to_be_sent; ++i) {
  send_buffer[i] = x.coefs[elements_to_send[i]];
}
//loop over all adjacent subdomain neighbors – Send data to each neighbor
Scalar* s_buffer = &send_buffer[0];
for(int i=0; i<num_neighbors; ++i) {
  int n_send = send_length[i];
  MPI_Send(s_buffer, n_send, mpi_dtype, neighbors[i],
           MPI_MY_TAG, MPI_COMM_WORLD);
  s_buffer += n_send;
}


New code showing sends for data exchanges:

//loop over all adjacent subdomain neighbors – communicate data to each neighbor
for(int i=0; i<num_neighbors; ++i) {
  int n_send = send_length[i];
  if (neighbors[i]/numthreads != myproc)
  { // neighbor is in a different MPI rank – pack and send data
    for (int ij = ibuf; ij < ibuf + n_send; ++ij)
      send_buffer[ij] = x.coefs[elements_to_send[ij]];
    MPI_Send(s_buffer, n_send, mpi_dtype,
             neighbors[i]/numthreads,
             MPI_MY_TAG+(neighbors[i]*numthreads)+mythread, MPI_COMM_WORLD);
  } else
  { // neighbor is another thread in this MPI rank – wait until the recipient
    // flags it is safe to write, then write
    while (sg_sends[neighbors[i]%numthreads][mythread]);
    stmp = (Scalar *) (sg_recvs[neighbors[i]%numthreads][mythread]);
    for (int ij = ibuf; ij < ibuf + n_send; ++ij)
      stmp[ij-ibuf] = x.coefs[elements_to_send[ij]];
    // set flag that write completed
    sg_sends[neighbors[i]%numthreads][mythread] = 2;
  }
  s_buffer += n_send;
  ibuf += n_send;
}


In the new code each subdomain maps to a thread. So each thread now communicates with threads responsible for neighboring subdomains. These other threads may or may not be in the same MPI rank. The setup of communicating data remains nearly the same. When communication mapping is set up, a vector of pointers is shared within each MPI rank. When communication is between threads in the same MPI rank (process), a buffer is allocated and both threads have access to the pointer to that buffer. When it is time to exchange data, a thread loops through all its neighbors. If the recipient is in another MPI rank, the thread makes a regular MPI send call. If the recipient is in the same process as the sender, the sending thread writes the data to the shared buffer and marks a flag that it completed the write.

Additional changes were also required. By default, MPI assumes only one thread in a process or the MPI rank sends and receives messages. In this new miniFE thread layout each thread may send or receive data from another MPI rank. This required changing MPI_Init to MPI_Init_thread with the setting MPI_THREAD_MULTIPLE. This sets up the MPI runtime library to behave in a thread-safe manner. It is important to remember that MPI message passing is between processes (MPI ranks) not threads, so by default when a thread sends a message to a remote system there is no distinction made between threads on the remote system. One method to handle this would be to create multiple MPI communicators. If there were a separate communicator for each thread in an MPI rank, a developer could control which thread received the message in the other MPI rank by its selection of the communicator. Another method would be to use different tags for each thread so that the tags identify which thread should receive a particular message. The latter was used in this implementation; MPI message tags were used to control which thread received messages. The changes in MPI message tags can be seen in the code snippets as well. In miniFE the sender fills the send buffer in the order the receiver prefers. Thus the receiver does not need to unpack data on receipt and can use the data directly from the receive buffer destination.  
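The fragment below is a minimal sketch (not miniFE source) of the MPI_Init_thread call described above; checking the provided level is good practice because an MPI library is allowed to return less support than was requested.

#include <mpi.h>

#include <cstdio>

int main(int argc, char** argv) {
  int provided = 0;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
  if (provided < MPI_THREAD_MULTIPLE) {
    std::fprintf(stderr, "MPI_THREAD_MULTIPLE not available (got %d)\n", provided);
    MPI_Abort(MPI_COMM_WORLD, 1);
  }
  // ... application work: any thread may now make MPI calls ...
  MPI_Finalize();
  return 0;
}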

Dot products are more complicated, because they are handled in a hierarchical fashion. First, a local sum is made by all threads in the MPI rank. One thread makes the MPI Allreduce call, while the other threads stop at a thread barrier waiting for the MPI Allreduce to be completed and the data recorded in an appropriate location for all threads to get a copy.
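A hedged sketch of that hierarchical dot product is shown below; the names partial, global_dot, and barrier() are illustrative, and barrier() stands for whatever thread barrier the implementation uses (it must also establish the necessary memory ordering).

#include <mpi.h>

#include <functional>
#include <vector>

// partial has one slot per thread; global_dot is shared by all threads of the rank.
double hierarchical_dot(double local_sum, int mythread, int nthreads,
                        std::vector<double>& partial, double& global_dot,
                        const std::function<void()>& barrier) {
  partial[mythread] = local_sum;        // each thread publishes its partial sum
  barrier();                            // all partial sums are now visible
  if (mythread == 0) {
    double rank_sum = 0.0;
    for (int t = 0; t < nthreads; ++t) rank_sum += partial[t];
    MPI_Allreduce(&rank_sum, &global_dot, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  }
  barrier();                            // wait for the reduced value
  return global_dot;
}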

In this initial port, all of the data collected here used Intel® Threading Building Blocks (Intel® TBB) threads. These closely match the C++ thread specification, so it will be trivial to test using standard C++ threads.

Optimizations

The initial threading port achieved the goal of matching the vector axpy operation execution time. Even though this metric improved, prior to some tuning the threading model was initially slower than the MPI version. Three principal optimization steps were applied to improve the threaded code performance.

The first step was to improve parallel operations like dot products. The initial port used a simple method in which each thread accumulated results such as dot products using simple locks. The first attempt replaced POSIX* mutexes with Intel TBB locks and then with atomic operations used as flags. These steps made no appreciable improvement. Although the simple lock method worked for reductions or gathers during quick development using four threads, it did not scale well when there were a couple of hundred threads. A simple tree was created to add some thread parallelism to reductions such as the dot products. Implementing a simple tree for the parallel reduction offered a significant performance gain; further refinements may offer small incremental improvements.
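The sketch below shows the general shape of such a tree reduction across per-thread partial sums; it is an illustration under assumed names (partial, barrier()), not the miniFE implementation.

#include <functional>
#include <vector>

// Pairwise tree reduction over per-thread partial sums. Every thread calls
// this with the same nthreads; the result ends up in partial[0].
void tree_reduce(std::vector<double>& partial, int mythread, int nthreads,
                 const std::function<void()>& barrier) {
  for (int stride = 1; stride < nthreads; stride *= 2) {
    barrier();                                        // previous level finished
    if (mythread % (2 * stride) == 0 && mythread + stride < nthreads)
      partial[mythread] += partial[mythread + stride];
  }
  barrier();                                          // partial[0] holds the sum
}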

The second optimization was to make per-thread copies of some global arrays (these arrays came from an MPI_Allgather). Because none of the threads alters the array values, there is no opportunity for race conditions; from that point on the array is used in a read-only mode, so the initial port shared one copy of the array among all threads. For performance purposes this proved to be the wrong choice: creating a private copy of the array for each thread improved performance. Even after these optimizations, performance of the code with many threads still lagged behind the case with only one thread per MPI rank.

This leads to the third and last optimization step. The slow region was in problem setup and initialization, and the bottleneck turned out to be dynamic memory allocation; a better memory allocator would resolve it. The default memory allocation libraries on Linux* do not scale for numerous threads. Several third-party scalable memory allocation libraries are available to resolve this problem, and all of them work better than the default Linux memory allocation runtime libraries. I used the Intel TBB memory allocator because I am familiar with it and because it can be adopted without any code modification by simply using LD_PRELOAD. Defining LD_PRELOAD at runtime substituted the Intel TBB memory allocator for all of the object and dynamic memory creation. This single step closed the performance gap and provided the biggest performance improvement.
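As an aside, LD_PRELOAD is not the only way to adopt the TBB allocator; Intel TBB also documents a build-time option in which you include its malloc proxy header in one source file and link against the tbbmalloc_proxy library. The sketch below illustrates that alternative (the miniFE experiments described here used the LD_PRELOAD approach).

// Including this header in one source file (and linking the program with the
// tbbmalloc_proxy library) replaces malloc/free and new/delete with the
// Intel TBB scalable allocator.
#include "tbb/tbbmalloc_proxy.h"

#include <vector>

int main() {
  std::vector<double> coefs(1 << 20, 0.0);  // allocation now served by the TBB allocator
  return coefs.empty() ? 1 : 0;
}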

Performance

This new hybrid miniFE code ran on both the Intel® Xeon Phi™ coprocessor and the Intel® Xeon Phi™ processor. The data collected varied the number of MPI ranks and threads using a problem size that nearly consumed all of the system memory on the two platforms. For the first-generation Intel Xeon Phi coprocessor the MPI rank-to-thread ratio varied from 1:244 to 244:1. A problem size of 256×256×512 was used for the tests. The results are shown in Figure 2.

Figure 2. MiniFE* performance on an Intel® Xeon Phi™ coprocessor.

 

The results show variations in performance based on the MPI-to-thread ratio. Each ratio was run at least twice, and the fastest time was selected for reporting. More runs were collected for the ratios with slower execution times. The differences in time proved repeatable. Figure 3 shows the same tests on the Intel Xeon Phi processor using a larger problem size.

Figure 3. MiniFE* performance on an Intel® Xeon Phi™ processor.

 


The performance on the Intel Xeon Phi processor showed less variation than the performance of miniFE on the Intel Xeon Phi coprocessor. No explanation is offered for the differences in runtime across the various ratios of MPI ranks to threads. It may be possible to close those differences by explicitly pinning threads and MPI ranks to specific cores; these tests left process and thread assignment to the OS.

There is much less variation in miniFE performance than was reported for the NAS SP-MZ* benchmark hybrid code discussed in Hybrid Parallelism: Parallel Distributed Memory and Shared Memory Computing. The NAS benchmark code, though, did not create subdomains for each thread as was done in this investigation of miniFE, and the NAS SP-MZ code did not scale as well with threads as it did with MPI. This case study shows that when following the same decomposition, threads do as well as MPI ranks. On the Intel® Xeon Phi™ product family, miniFE performance was slightly better using the maximum number of threads and only one MPI rank than using the maximum number of MPI ranks with only one thread each. The best performance was achieved with a mixture of MPI ranks and threads.

Memory consumption proves to be the most interesting aspect. The Intel Xeon Phi coprocessor is frequently not set up with a disk to swap pages to virtual memory, which provides an ideal platform to evaluate the size of a problem that can be run with the associated runtime libraries. When running the miniFE hybrid code on the Intel Xeon Phi coprocessor, the largest problem size that ran successfully with one MPI rank for each core was 256×256×512. This is a problem size of 33,554,432 elements. The associated global stiffness matrix contained 908,921,857 nonzero entries. When running with only 1 MPI rank and creating a number of threads that match the number of cores, the same number of subdomains are created and a larger problem size—256×296×512—runs to completion. This larger problem contained 38,797,312 elements, and the corresponding global stiffness matrix had 1,050,756,217 nonzero elements. Based on the number of finite elements, the threading model allows developers to run a model 15 percent larger. Based on nonzero elements in the global stiffness matrix, the model solved a matrix that is 15.6 percent larger. The ability to run a larger problem size is a significant advantage that may appeal to some project teams.

There are further opportunities for optimization of the threaded software (for example, pinning threads and MPI ranks to specific cores and improving the parallel reductions). However, the principal tuning has been done, and further tuning would probably yield minimal changes in performance. The principal motivation to follow the same problem decomposition for threading as for MPI is the improvement in memory consumption.

Summary

The effort to write code for both threads and MPI is time consuming. Projects such as the Multi-Processor Computing (MPC) framework (see mpc.hpcframework.paratools.com) may make writing code in MPI and running via threads just as efficient in the future. The one-sided communication features of MPI-3 may allow developers to write code more like the threaded version of miniFE, where one thread writes the necessary data to the other threads' desired locations, minimizing the need for MPI runtime libraries to hold so much memory in reserve. When adding threading to MPI code, remember best practices such as watching for runtime libraries and system calls that may not be scalable or thread-friendly by default, such as memory allocation or rand().

Threaded software performs comparably with MPI when both follow the same parallel layout: one subdomain per MPI rank and one subdomain per thread. In cases like miniFE, threading consumes less memory than the MPI runtime libraries and allows larger problem sizes to be solved on the same system. For this implementation of miniFE, problem sizes 15 percent larger could be run on the same platform. Those seeking to optimize for memory consumption should consider the same parallel layout for both threading and MPI and will likely benefit from the transition.

Notes

Data collected using the Intel® C++ Compiler 16.0 and Intel® MPI Library 5.1.

How Disc Jam* Reached 60 fps on Intel® Processor Graphics using Unreal Engine* 4


Hi everyone! I’m Jay from High Horse Entertainment, a two-man team based out of Los Angeles. We started High Horse to build arcade games with modern graphics, control schemes, and an emphasis on robust online competitive play. Our first project, Disc Jam*, is an arcade action sport in which timing and reflexes are critical to success. If you want to check it out, you can grab a free Steam* key for our pre-alpha release at www.discjamgame.com!

Performance is a key focus for this project because maintaining 60 frames per second is an integral part of Disc Jam’s responsive and fluid gameplay style. As a result, we’ve learned a lot of lessons targeting this framerate using Unreal Engine* 4. Below, I discuss our experience working with integrated graphics processing units from Intel and how we ultimately achieved our performance target without raising our minimum system requirements.

Why Target Integrated GPUs?

One thing that makes PC development tricky when compared with consoles is the lack of hardware standardization. A lot of people play games on their PCs and some of them have dedicated graphics chips, but a significant portion of PC owners rely on Integrated GPUs for gaming. It’s difficult to know exactly how large that market is, but the current hardware statistics made available by Unity Technologies show that around 40 percent of machines are playing using GPUs from Intel, which is higher than any other vendor. While many PC games can get away with high minimum system requirements, it’s critical that Disc Jam scale as low as possible for two major reasons:

Concurrency

Multiplayer games like Disc Jam live and die by their concurrency. If no one is playing, no one can find a match, and our player base will shrink until it disappears entirely. For this reason, it’s important that we support as many hardware configurations as possible.

Performance

Disc Jam is designed to be played at 60 frames per second (fps). If someone is playing on a system that's unable to sustain that framerate, they're not experiencing the game as we intended. This may also impact their teammates' and opponents' experiences, because Disc Jam is an online game first and foremost.

Unreal Engine 4 Scalability and Performance

When deciding on an approach, we first looked at Unreal Engine 4's performance “out of the box” on our target hardware. For this test, we used the binary release of Unreal Engine 4.12.5 and the Shooter Game* example. All tests were run on a laptop equipped with an Intel® Core™ i7-4720HQ processor and Intel® HD Graphics 4600 GPU. Running Shooter Game in the Sanctuary map at 720p yields the following results:


Figure 1. Epic Quality Settings: ~20 fps


Figure 2. Low Quality Settings: ~40 fps

If we were making a game targeting 30 fps this would be great news. Unfortunately, we really want to hit that 60 fps mark, even on our minimum spec. Since it’s difficult to develop scenes that are more optimized than the Shooter Game example, we felt that the Unreal Engine 4 (UE4) desktop renderer had too high of a base performance cost for our hardware target. Luckily, if you’re willing to be creative and get your hands a little dirty, UE4 provides an alternative.

Unreal Engine 4’s Mobile Preview Renderer

Unreal Engine doesn’t just produce high-end desktop and console games. It produces high-end mobile games too! To do so, it features a few different rendering paths in order to support the myriad of mobile devices out there in the market. The highest-end path, the one designed for OpenGL* for Embedded Systems (ES) 3.1 + Android* Extension Pack* (AEP), is the one we’re most interested in. From our tests, this rendering path boasts the best performance-to-quality ratio on integrated GPUs from Intel.

The key here is that UE4 has a feature called Mobile Preview. This feature is designed to reduce iteration time by previewing what a game will look like on mobile devices without having to deploy it. It effectively allows you to render the game on the desktop using the mobile rendering path instead of the full-blown deferred renderer that Unreal typically uses. Using this feature, we see the following results:


Figure 3. OpenGL* ES 3.1 + AEP* Mobile Preview: ~100 fps

Running in Mobile Preview results in a ~2.5x speedup over the desktop renderer on its lowest settings. We are now well in range of hitting our target of 720p @ 60 fps on integrated GPUs! You'll notice that there are some visual differences between the screenshots taken with the desktop renderer and those taken with the mobile renderer. This is because the mobile renderer has some limitations, especially with regard to lighting and shadowing. See the Epic* documentation on Lighting for Mobile Platforms and Performance Guidelines for Mobile Devices for more information.

Multiple Lighting Rigs

In order to solve the problem above and light the court consistently in Disc Jam, we’ve opted to use multiple lighting rigs. We use one rig when rendering with the traditional renderer and another when rendering in Mobile Preview. Every game’s lighting needs will be different, but Disc Jam actually uses the same set of lights for both, with the only difference being the mobility of the primary shadow-casting light. In the high-end version, our primary light is a stationary spotlight. In Mobile Preview, we use a static spotlight so that all of the lighting is pre-baked, which helps us squeeze out even more performance.


Figure 4. Disc Jam* High-End Renderer and Lighting


Figure 5. Disc Jam Low-End Renderer and Lighting

The first thing you’ll encounter when attempting to use multiple light rigs in UE4 is that the baked lighting is stored along with the geometry in the map rather than with the lights. Unfortunately, this means that you will need to duplicate all of your geometry into a second map in order to bake an alternate set of lights.

For Disc Jam, we set up a persistent level in which we placed all of the actors unaffected by lighting. These actors are shared across the high-end and low-end versions of the map and include things like spawn points and collision volumes. We then have both a high-end map and a low-end map that contain the same geometry and differ only in lighting. When the level loads we stream in the correct version:


Figure 6. Disc Jam’s Persistent Level Blueprint

The “Is in Mobile Preview” node used above is a custom C++ function defined as follows:

bool UDiscJamBlueprintFunctionLibrary::IsInMobilePreview()
{
   return GMaxRHIFeatureLevel <= ERHIFeatureLevel::ES3_1;
}
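For reference, the decision made by the Blueprint can also be expressed in C++; the sketch below only illustrates that branch (the level names are hypothetical, and the actual streaming in Disc Jam is done in the persistent level Blueprint shown in Figure 6).

// Illustrative C++ sketch of the branch in Figure 6: pick which lighting
// sub-level to stream based on the active feature level. The level names
// here are invented for the example.
static FName ChooseLightingLevel()
{
	return UDiscJamBlueprintFunctionLibrary::IsInMobilePreview()
		? FName(TEXT("Court_Lighting_LowEnd"))
		: FName(TEXT("Court_Lighting_HighEnd"));
}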

Packaging and Deployment

Please note: The following sections discuss packaging and deployment on Windows*. The steps shown should be relatively similar for other operating systems.

After packaging the game and attempting to run with the command line argument “-FeatureLevelES31” it becomes immediately clear that the necessary shaders haven’t been included in the package. Under Project Settings → Platforms → Windows → Targeted RHIs, you can see that there are checkboxes for selecting which shader variants to package, but unfortunately the OpenGL ES 3.1 shaders are not among them. Adding this requires two simple code changes.

In GenericWindowsTargetPlatform.h, the GetAllPossibleShaderFormats function must be amended to include the OpenGL ES 3.1 shaders:

virtual void GetAllPossibleShaderFormats( TArray<FName>& OutFormats ) const override
{
	// no shaders needed for dedicated server target
	if (!IS_DEDICATED_SERVER)
	{
		static FName NAME_PCD3D_SM5(TEXT("PCD3D_SM5"));
		static FName NAME_PCD3D_SM4( TEXT( "PCD3D_SM4" ) );
		static FName NAME_PCD3D_ES3_1( TEXT( "PCD3D_ES31" ) );
		static FName NAME_GLSL_150(TEXT("GLSL_150"));
		static FName NAME_GLSL_430(TEXT("GLSL_430"));

		OutFormats.AddUnique(NAME_PCD3D_SM5);
		OutFormats.AddUnique(NAME_PCD3D_SM4);
		OutFormats.AddUnique(NAME_PCD3D_ES3_1);
		OutFormats.AddUnique(NAME_GLSL_150);
		OutFormats.AddUnique(NAME_GLSL_430);
	}
}

Then in WindowsTargetSettingsDetails.cpp, we add a friendly name to display in the UI by amending the GetFriendlyNameFromRHIName function:

FText GetFriendlyNameFromRHIName(const FString& InRHIName)
{
	FText FriendlyRHIName = LOCTEXT("UnknownRHI", "UnknownRHI");
	if (InRHIName == TEXT("PCD3D_SM5"))
	{
		FriendlyRHIName = LOCTEXT("DirectX11", "DirectX 11 (SM5)");
	}
	else if (InRHIName == TEXT("PCD3D_SM4"))
	{
		FriendlyRHIName = LOCTEXT("DirectX10", "DirectX 10 (SM4)");
	}
	else if (InRHIName == TEXT("PCD3D_ES31"))
	{
		FriendlyRHIName = LOCTEXT("DirectXES31", "DirectX Mobile Emulation (ES3.1)");
	}
	else if (InRHIName == TEXT("GLSL_150"))
	{
		FriendlyRHIName = LOCTEXT("OpenGL3", "OpenGL 3 (SM4)");
	}
	else if (InRHIName == TEXT("GLSL_430"))
	{
		FriendlyRHIName = LOCTEXT("OpenGL4", "OpenGL 4 (SM5, Experimental)");
	}
	else if (InRHIName == TEXT("SF_VKES31"))
	{
		FriendlyRHIName = LOCTEXT("Vulkan ES31", "Vulkan Mobile (ES3.1, Experimental)");
	}
	else if (InRHIName == TEXT("SF_VULKAN_SM4"))
	{
		FriendlyRHIName = LOCTEXT("VulkanSM4", "Vulkan (SM4)");
	}
	else if (InRHIName == TEXT("SF_VULKAN_SM5"))
	{
		FriendlyRHIName = LOCTEXT("VulkanSM5", "Vulkan (SM5)");
	}

	return FriendlyRHIName;
}

After making those changes and recompiling the engine, all that’s left to do is check the box under the Windows platform settings:


Figure 7. The newly created ‘DirectX* Mobile Emulation (ES3.1)’ now appears

Bonus: Automatically Activate Mobile Preview on GPUs from Intel

Disc Jam allows players to choose which renderer to use on startup through Steam launch options. The low-end renderer choice simply launches the game with the "-FeatureLevelES31" command-line option.


Figure 8. Disc Jam* Steam* Launch Options

However, in the case of GPUs from Intel we have the game default to the Mobile Preview renderer. This is done with another simple code change. In WindowsD3D11Device.cpp, the function FD3D11DynamicRHI::InitD3DDevice() initializes the video adapter. About 100 lines down in that function it checks whether or not you’re using a GPU from Intel in order to correctly configure the video memory. Inside that block, we can set the renderer like so:

if ( IsRHIDeviceIntel() )
{
	// It's all system memory.
	FD3D11GlobalStats::GTotalGraphicsMemory = FD3D11GlobalStats::GDedicatedVideoMemory;
	FD3D11GlobalStats::GTotalGraphicsMemory += FD3D11GlobalStats::GDedicatedSystemMemory;
	FD3D11GlobalStats::GTotalGraphicsMemory += ConsideredSharedSystemMemory;

	GMaxRHIFeatureLevel = ERHIFeatureLevel::ES3_1;
	GMaxRHIShaderPlatform = SP_PCD3D_ES3_1;
}

And that’s all there is to it!

Let us know if this was helpful to you by dropping us a line @HighHorseGames on Twitter. To keep following Disc Jam and its development you can check out our blog at http://www.discjamgame.com.


Smarter Security Camera: A POC Using the Intel® IoT Gateway


Intro

The Internet of Things (IoT) is enabling our lives in new and interesting ways, but with that comes the challenge of analyzing and bringing meaning to the stream of continuously generated data. One IoT trend in the home is the use of multiple security cameras for monitoring purposes, resulting in large amounts of data generated from images and video. For example, one house with twelve cameras taking 180,000 images per day can easily generate 5 GB of data. These large amounts of data make manual analysis impractical. Some cameras have built-in motion sensors that only take images when change is detected, and while this helps to reduce the data, light changes and other insignificant movement will still be picked up and have to be sorted through. To monitor the home for what is actually wanted (for the purposes of this paper, people and faces), OpenCV* presents a promising solution. OpenCV already has a number of pre-defined algorithms to search images for faces, people, and objects, and it can also be trained to recognize new ones.
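As a flavor of what such detection looks like, the sketch below uses OpenCV's C++ API with one of the pre-trained Haar cascades that ship with OpenCV. The gateway described in this article drives OpenCV from Python, where the calls are analogous; the file paths here are assumptions.

#include <opencv2/opencv.hpp>

#include <vector>

int main() {
  // Pre-trained frontal face cascade distributed with OpenCV (path is an assumption).
  cv::CascadeClassifier faces("haarcascade_frontalface_default.xml");
  if (faces.empty()) return 1;

  cv::Mat img = cv::imread("snapshot.jpg");      // image captured from the webcam stream
  if (img.empty()) return 1;

  cv::Mat gray;
  cv::cvtColor(img, gray, cv::COLOR_BGR2GRAY);

  std::vector<cv::Rect> detections;
  faces.detectMultiScale(gray, detections);      // find candidate face regions

  for (const cv::Rect& r : detections)           // draw detection markers
    cv::rectangle(img, r, cv::Scalar(0, 255, 0), 2);

  cv::imwrite("analyzed.jpg", img);
  return 0;
}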

This article is a proof of concept to explore quickly prototyping an analytics solution at the edge using Intel® IoT Gateway computing power to create a Smarter Security Camera.

 


Figure 1.    Analyzed image from webcam with OpenCV* detection markers

Set-up

The setup starts with a Logitech* C270 webcam with HD 720P resolution and a 2.4 GHz Intel® Core™ 2 Duo processor. The webcam plugs into the USB port of the Intel® Edison development board, which turns it into an IP webcam streaming video to a website. Using the webcam with the Intel® Edison development board allows the camera “sensor” to be easily placed in different locations around a home. The Intel® IoT Gateway captures images from the stream and uses OpenCV to analyze them. If the algorithms detect that there is a face or a person in view, it uploads the image to Twitter*.


Figure 2. Intel® Edison board and Webcam setup


Figure 3. Intel® IoT Gateway Device

Capturing the image

The webcam must be USB video class (UVC) compliant to ensure that it is compatible with the Intel® Edison USB drivers. In this case a Logitech C270 webcam is used. For a list of UVC compliant devices, go here: http://www.ideasonboard.org/uvc/#devices. To use the USB slot, the micro switch on the Intel® Edison development board must be toggled up towards the USB slot. Note that this will disable the micro-USB port below it, along with Ethernet, power over micro-USB (the external power supply must now be plugged in instead of using the micro-USB slot as a power source), and Arduino* sketch uploads. Connect the Intel® Edison development board to the Gateway's Wi-Fi* hotspot so that the Gateway can see the webcam.

To check that the USB webcam is working, type the following into a serial connection.

ls -l /dev/video0

A line similar to this one should appear:

crw-rw---- 1 root video 81, 0 May  6 22:36 /dev/video0

Otherwise, this line will appear indicating the camera is not found. 

ls: cannot access /dev/video0: No such file or directory

In the early stages of the project, the Intel® Edison development board used the FFmpeg library to capture an image and then send it over MQTT to the Gateway. This method has drawbacks, as each image takes a few seconds to be saved, which is too slow for practical application. To resolve this problem and make images available to the Gateway on demand, the setup was switched to have the Intel® Edison development board continuously stream a feed that the Gateway could capture from at any time. This was accomplished using the mjpg-streamer library. To install it on the Intel® Edison development board, add the following lines to base-feeds.conf with the following command:

echo "src/gz all http://repo.opkg.net/edison/repo/all
src/gz edison http://repo.opkg.net/edison/repo/edison
src/gz core2-32 http://repo.opkg.net/edison/repo/core2-32">> /etc/opkg/base-feeds.conf

Update the repository index:

opkg update

And install:

opkg install mjpg-streamer

To start the stream:

mjpg_streamer -i "input_uvc.so -n -f 30 -r 800x600" -o "output_http.so -p 8080 -w ./www"

The MJPEG compressed format is used to keep the frame rate high for this project. The YUV format, by contrast, is uncompressed, which leaves more detail for OpenCV. Experiment with the tradeoff to see which one fits best.

To view the stream while on the same Wi-Fi network, visit http://localhost:8080/?action=stream; a still image of the feed can also be viewed at http://localhost:8080/?action=snapshot. Change localhost to the IP address of the Intel® Edison development board, which should be connected to the Gateway’s Wi-Fi. The Intel® IoT Gateway sends an HTTP request to the snapshot address and then saves the image to disk.
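
Outside of Node-RED, the same capture step can be sketched in a few lines of Python. This is a minimal illustration only: the IP address below is a placeholder for your own Intel® Edison board, and the output path simply matches the /root/incoming.jpg file that the analysis script later reads.

import urllib2

# Placeholder address of the Intel Edison board on the Gateway's Wi-Fi network
SNAPSHOT_URL = 'http://192.168.1.100:8080/?action=snapshot'

def save_snapshot(path='/root/incoming.jpg', timeout=5):
    # Request a single frame from mjpg-streamer and write it to disk
    response = urllib2.urlopen(SNAPSHOT_URL, timeout=timeout)
    with open(path, 'wb') as f:
        f.write(response.read())

if __name__ == '__main__':
    save_snapshot()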

Gateway

The brains of the whole security camera are on the Gateway. OpenCV was installed into a virtual Python* environment to keep it cleanly separated from the Gateway’s own Python version and packages. Basic install instructions for OpenCV on Linux can be found here: http://docs.opencv.org/2.4/doc/tutorials/introduction/linux_install/linux_install.html. These instructions need to be modified in order to install OpenCV and its dependencies on the Intel® Wind River* Gateway.

GCC, Git, and python2.7-dev are already installed. 

Install CMake 2.6 or higher:

 

wget http://www.cmake.org/files/v3.2/cmake-3.2.2.tar.gz
tar xf cmake-3.2.2.tar.gz
cd cmake-3.2.2
./configure
make
make install

As the Wind River Linux* environment has no apt-get command, it can be a challenge to install the needed development packages. A workaround is to first install them on another 64-bit Linux* machine (running Ubuntu* in this case) and then manually copy the files to the Gateway. The full file list for each package can be found on the Ubuntu site: http://packages.ubuntu.com/. For example, for the libtiff4-dev package, files in /usr/include/<file> should go to the same location on the Gateway, and files in /usr/lib/x86_64-linux-gnu/<file> should go into /usr/lib/<file>. The full list of files can be found here: http://packages.ubuntu.com/precise/amd64/libtiff4-dev/filelist

Install and copy the files over for packages listed below.

sudo apt-get install  libgtk2.0-dev pkg-config libavcodec-dev libavformat-dev libswscale-dev
sudo apt-get install libjpeg8-dev libpng12-dev libtiff4-dev libjasper-dev  libv4l-dev

Install pip; it will help install a number of the other dependencies.

wget https://bootstrap.pypa.io/get-pip.py
python get-pip.py

Install virtualenv; this will create a separate environment for OpenCV.

pip install virtualenv virtualenvwrapper

Once the virtualenv has been installed, create one called “cv.”

export WORKON_HOME=$HOME/.virtualenvs
mkvirtualenv cv

Note that all of the following steps are done while the “cv” environment is activated. Once “cv” has been created, the environment is activated automatically in the current session; this can be seen at the beginning of the command prompt, e.g., (cv) root@WR-IDP-NAME. For future sessions it can be activated with the following command:

. ~/.virtualenvs/cv/bin/activate

And similarly be deactivated using this command (do not deactivate it yet):

deactivate

Install numpy:

pip install numpy

Get the OpenCV Source Code:

cd ~
git clone https://github.com/Itseez/opencv.git
cd opencv
git checkout 3.0.0

And make it:

mkdir build
cd build
cmake -D CMAKE_BUILD_TYPE=RELEASE \
-D CMAKE_INSTALL_PREFIX=/usr/local \
-D INSTALL_C_EXAMPLES=ON \
-D INSTALL_PYTHON_EXAMPLES=ON \
-D OPENCV_EXTRA_MODULES_PATH=~/opencv_contrib/modules \
-D BUILD_EXAMPLES=ON \
-D PYTHON_INCLUDE_DIR=/usr/include/python2.7/ \
-D PYTHON_INCLUDE_DIR2=/usr/include/python2.7 \
-D PYTHON_LIBRARY=/usr/lib64/libpython2.7.so \
-D PYTHON_PACKAGES_PATH=/usr/lib64/python2.7/site-packages/ \
-D BUILD_NEW_PYTHON_SUPPORT=ON \
-D PYTHON2_LIBRARY=/usr/lib64/libpython2.7.so \
-D BUILD_opencv_python3=OFF \
-D BUILD_opencv_python2=ON ..
make
make install

If the cv2.so file is not created, build OpenCV on the host Linux machine as well and copy the resulting cv2.so over to /usr/lib64/python2.7/site-packages on the Gateway.
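
A quick way to confirm the bindings work is to activate the “cv” environment and import the module. This check is just a suggestion and is not part of the original instructions:

import cv2
print(cv2.__version__)   # should report 3.0.0 for the checkout used above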

Figure 4. Webcam capture of people outside with OpenCV detection markers

To quickly create a program and connect a large number of capabilities and services together, as in this project, Node-RED* was used. Node-RED is a rapid prototyping tool that allows the user to visually wire together hardware devices, APIs, and various services. It comes pre-installed on the Gateway; make sure to update it to the latest version.

Figure 5. Node-RED flow

Once a message is injected at the “Start” node (by clicking on it), the flow loops continuously after either processing the image or encountering an error. A few nodes of note are the http request node, the python script (exec) node, and the function node that builds the tweet. The “Repeat” node simply consolidates the repeat path into one node instead of pointing all three flows back to the beginning.

The “http request” node sends a GET message to the IP webcam’s snapshot URL. If it is successful, the flow saves the image. Otherwise, it tweets an error message about the webcam. 

 

Figure 6. Node-RED http GET request node details

To run the python script, create an “exec” node from the advanced section with the command “/root/.virtualenvs/cv/bin/python2.7 /root/PeopleDetection.py”. This runs the script in the virtual python environment where OpenCV is installed.

Figure 7. Node-RED exec node details

The python script itself is fairly simple. It checks the image for people using the HOG algorithm and then looks for faces using the haarcascade_frontalface_alt classifier that ships with OpenCV. It also saves an image with boxes drawn around any people and faces found. The code provided below is not optimized beyond optionally scaling the image down before analyzing it and tweaking some of the algorithm inputs to suit this proof of concept. It takes the Gateway approximately 0.33 seconds to process an image; in comparison, the Intel® Edison module takes around 10 seconds to process the same image. Depending on where the camera is located, and how far or close people are expected to be, the OpenCV algorithm parameters may need to change to better fit the situation (see the example after the script).

import numpy as np
import cv2
import sys
import datetime

def draw_detections(img, rects, rects2, thickness = 2):
  for x, y, w, h in rects:
    pad_w, pad_h = int(0.15*w), int(0.05*h)
    cv2.rectangle(img, (x+pad_w, y+pad_h), (x+w-pad_w, y+h-pad_h), (0, 255, 0), thickness)
    print("Person Detected")
  for (x,y,w,h) in rects2:
    cv2.rectangle(img,(x,y),(x+w,y+h),(255,0,0),thickness)
    print("Face Detected")

total = datetime.datetime.now()

img = cv2.imread('/root/incoming.jpg')
#optional resize of image to make processing faster
#img = cv2.resize(img, (0,0), fx=0.5, fy=0.5)

# Detect people with OpenCV's default HOG + linear SVM people detector
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())
peopleFound,a=hog.detectMultiScale(img, winStride=(8,8), padding=(16,16), scale=1.3)

# Detect faces with the Haar cascade classifier that ships with OpenCV
faceCascade = cv2.CascadeClassifier('/root/haarcascade_frontalface_alt.xml')
facesFound = faceCascade.detectMultiScale(img,scaleFactor=1.1,minNeighbors=5,minSize=(30,30), flags = cv2.CASCADE_SCALE_IMAGE)

draw_detections(img,peopleFound,facesFound)

cv2.imwrite('/root/out_faceandpeople.jpg',img)

print("[INFO] total took: {}s".format(
 (datetime.datetime.now() - total).total_seconds()))
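
As an illustration of the kind of parameter tuning mentioned above, the snippet below re-runs the HOG people detector with a finer window stride and a smaller scale step, which tends to catch smaller or more distant people at the cost of extra processing time. The values shown are examples only, not the settings used in the original project.

import cv2

img = cv2.imread('/root/incoming.jpg')
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

# Example tuning: finer stride and smaller scale step than the script above
peopleFound, weights = hog.detectMultiScale(
    img,
    winStride=(4, 4),   # step the detection window in smaller increments
    padding=(8, 8),
    scale=1.05)         # smaller image pyramid step: better for distant people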

To send an image to Twitter, the tweet is constructed in a function node, using msg.media for the image and msg.payload for the tweet text.

Figure 8. Node-RED function message node details
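
For readers who prefer to see the same step outside of Node-RED, the sketch below does roughly what the function node and Twitter node accomplish together, using the tweepy package, which is not part of this project. The keys and file path are placeholders.

import tweepy

# Placeholder credentials from a Twitter developer account
auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')
api = tweepy.API(auth)

# Post the annotated image produced by PeopleDetection.py with a short status
api.update_with_media('/root/out_faceandpeople.jpg',
                      status='Person detected by the security camera')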

The system can also take pictures on demand. Node-RED monitors the same Twitter feed for posts that contain “spy” or “Spy”; posting a tweet with the word “spy” in it triggers the Gateway to take a current picture and post it to Twitter.

Figure 9. Node-RED flow for taking pictures on demand

Summary

This concludes the proof of concept to create a Smarter Security Camera using Intel® IoT Gateway computing. The Wind River Linux Gateway comes with a number of tools pre-installed and ready for quick prototyping. From here the project can be further optimized, made more robust with security features, or even expanded to control smart lighting when a person is detected.

About the author

Whitney Foster is a software engineer at Intel in the Software Solutions Group working on scale enabling projects for Internet of Things.

Notices

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.

Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm.

Intel, Wind River, the Intel logo, and Intel RealSense are trademarks of Intel Corporation in the U.S. and/or other countries. 

*Other names and brands may be claimed as the property of others

© 2016 Intel Corporation

 

Intel® 64 and IA-32 Architectures Software Developer Manuals



These manuals describe the architecture and programming environment of the Intel® 64 and IA-32 architectures.

Combined Volume Set of Intel® 64 and IA-32 Architectures Software Developer’s Manuals
Three-Volume Set of Intel® 64 and IA-32 Architectures Software Developer’s Manuals
Nine-Volume Set of Intel® 64 and IA-32 Architectures Software Developer's Manuals
Software Optimization Reference Manual
Related Specifications, Application Notes, and White Papers

Electronic versions of these documents allow you to quickly get to the information you need and print only the pages you want. The Intel® 64 and IA-32 architectures software developer's manuals are now available for download via one combined volume, a three volume set or a nine volume set. All content is identical in each set; see details below.

At present, downloadable PDFs of all volumes are at version 060. The downloadable PDF of the Intel® 64 and IA-32 architectures optimization reference manual is at version 033. Additional related specifications, application notes, and white papers are also available for download.

Note: If you would like to be notified of updates to the Intel® 64 and IA-32 architectures software developer's manuals, you may utilize a third-party service, such as http://www.changedetection.com to be notified of changes to this page (please reference 1 below).

Note: We are no longer offering the Intel® 64 and IA-32 architectures software developer’s manuals on CD-ROM. Hard copy versions of the manual are available for purchase via a print-on-demand fulfillment model through a third-party vendor, Lulu (please reference 1 and 2 below): http://www.lulu.com/spotlight/IntelSDM.

  1. Terms of use
  2. The order price of each volume is set by the print vendor; Intel uploads the finalized master with zero royalty.

Combined Volume Set of Intel® 64 and IA-32 Architectures Software Developer’s Manuals

 

Intel® 64 and IA-32 architectures software developer’s manual combined volumes: 1, 2A, 2B, 2C, 2D, 3A, 3B, 3C, and 3D

This document contains the following:

Volume 1: Describes the architecture and programming environment of processors supporting IA-32 and Intel® 64 architectures.

Volume 2: Includes the full instruction set reference, A-Z, in one volume. Describes the format of the instruction and provides reference pages for instructions.

Volume 3: Includes the full system programming guide, Parts 1, 2, and 3, in one volume. Describes the operating-system support environment of Intel® 64 and IA-32 architectures, including: memory management, protection, task management, interrupt and exception handling, multi-processor support, thermal and power management features, debugging, performance monitoring, system management mode, virtual machine extensions (VMX) instructions, Intel® Virtualization Technology (Intel® VT), and Intel® Software Guard Extensions (Intel® SGX).

Intel® 64 and IA-32 architectures software developer's manual documentation changes

Describes bug fixes made to the Intel® 64 and IA-32 architectures software developer's manual between versions.

NOTE: This change document applies to all Intel® 64 and IA-32 architectures software developer’s manual sets (combined volume set, 3 volume set, and 9 volume set).



Three-Volume Set of Intel® 64 and IA-32 Architectures Software Developer’s Manuals

 

This set consists of volume 1, volume 2 (combined 2A, 2B, 2C, and 2D), and volume 3 (combined 3A, 3B, 3C, and 3D). This set allows for easier navigation of the instruction set reference and system programming guide through functional cross-volume table of contents, references, and index.

Intel® 64 and IA-32 architectures software developer's manual volume 1: Basic architecture

Describes the architecture and programming environment of processors supporting IA-32 and Intel® 64 architectures.

Intel® 64 and IA-32 architectures software developer's manual combined volumes 2A, 2B, 2C, and 2D: Instruction set reference, A-Z

This document contains the full instruction set reference, A-Z, in one volume. Describes the format of the instruction and provides reference pages for instructions. This document allows for easy navigation of the instruction set reference through functional cross-volume table of contents, references, and index.

Intel® 64 and IA-32 architectures software developer's manual combined volumes 3A, 3B, 3C, and 3D: System programming guide

This document contains the full system programming guide, parts 1, 2, 3, and 4, in one volume. Describes the operating-system support environment of Intel® 64 and IA-32 architectures, including: Memory management, protection, task management, interrupt and exception handling, multi-processor support, thermal and power management features, debugging, performance monitoring, system management mode, virtual machine extensions (VMX) instructions, Intel® Virtualization Technology (Intel® VT), and Intel® Software Guard Extensions (Intel® SGX). This document allows for easy navigation of the system programming guide through functional cross-volume table of contents, references, and index.



Nine-Volume Set of Intel® 64 and IA-32 Architectures Software Developer's Manuals

 

This set contains the same information as the three-volume set, but separated into nine smaller PDFs: volume 1, volume 2A, volume 2B, volume 2C, volume 2D, volume 3A, volume 3B, volume 3C, and volume 3D. This set is better suited to those with slower connection speeds.

Intel® 64 and IA-32 architectures software developer's manual volume 1: Basic architecture

Describes the architecture and programming environment of processors supporting IA-32 and Intel® 64 architectures.

Intel® 64 and IA-32 architectures software developer's manual volume 2A: Instruction set reference, A-L

Describes the format of the instruction and provides reference pages for instructions (from A to L). This volume also contains the table of contents for volumes 2A, 2B, 2C, and 2D.

Intel® 64 and IA-32 architectures software developer's manual volume 2B: Instruction set reference, M-U

Provides reference pages for instructions (from M to U).

Intel® 64 and IA-32 architectures software developer's manual volume 2C: Instruction set reference, V-Z

Provides reference pages for instructions (from V to Z).

Intel® 64 and IA-32 architectures software developer's manual volume 2D: Instruction set reference

Includes the safer mode extensions reference. This volume also contains the appendices and index support for volumes 2A, 2B, 2C, and 2D.

Intel® 64 and IA-32 architectures software developer's manual volume 3A: System programming guide, part 1

Describes the operating-system support environment of IA-32 and Intel® 64 architectures, including: memory management, protection, task management, interrupt and exception handling, and multi-processor support. This volume also contains the table of contents for volumes 3A, 3B, and 3C.

Intel® 64 and IA-32 architectures software developer's manual volume 3B: System programming guide, part 2

Continues the coverage on system programming subjects begun in volume 3A. Volume 3B covers thermal and power management features, debugging, and performance monitoring.

Intel® 64 and IA-32 architectures software developer's manual volume 3C: System programming guide, part 3

Continues the coverage on system programming subjects begun in volume 3A and volume 3B. Volume 3C covers system management mode, virtual machine extensions (VMX) instructions, and Intel® Virtualization Technology (Intel® VT).

Intel® 64 and IA-32 architectures software developer's manual volume 3D: System programming guide, part 4

Volume 3D covers system programming with Intel® Software Guard Extensions (Intel® SGX). This volume also contains the appendices and indexing support for volumes 3A, 3B, 3C, and 3D.


Software Optimization Reference Manual

 

 

Intel® 64 and IA-32 architectures optimization reference manual

The Intel® 64 and IA-32 architectures optimization reference manual provides information on Intel® Core™ processors, NetBurst microarchitecture, and other recent Intel® microarchitectures. It describes code optimization techniques to enable you to tune your application for highly optimized results when run on Intel® Atom™, Intel® Core™ i7, Intel® Core™, Intel® Core™2 Duo, Intel® Core™ Duo, Intel® Xeon®, Intel® Pentium® 4, and Intel® Pentium® M processors.


Related Specifications, Application Notes, and White Papers

 

 

Intel® architecture instruction set extensions programming reference

This document covers new instructions slated for future Intel® processors.

Timestamp-Counter Scaling for Virtualization

This paper describes an Intel® Virtualization Technology (Intel® VT) enhancement for future Intel® processors. This feature, referred to as timestamp-counter scaling (TSC scaling), further extends the capability of virtual-machine monitor (VMM) software that employs the TSC-offsetting mechanism by allowing that software finer control over the value of the timestamp counter (TSC) read during guest virtual machine (VM) execution.

Intel® 64 architecture x2APIC specification

Extensions to the xAPIC architecture are intended primarily to increase processor addressability. The x2APIC architecture provides backward compatibility to the xAPIC architecture and forward extendability for future Intel platform innovations.

Intel® 64 and IA-32 architectures application note: TLBs, paging-structure caches, and their invalidation

The information contained in this application note is now part of Intel® 64 and IA-32 architectures software developer's manual volumes 3A and 3B.

Intel® carry-less multiplication instruction and its usage for computing the GCM mode white paper

This paper provides information on the instruction, and its usage for computing the Galois Hash. It also provides code examples for the usage of PCLMULQDQ, together with the Intel® AES New Instructions (Intel® AES-NI), for efficient implementation of AES in Galois Counter Mode (AES-GCM).

Intel® 64 architecture memory ordering white paper

This document has been merged into Volume 3A of Intel® 64 and IA-32 architectures software developer’s manual.

Performance monitoring unit sharing guide

This paper provides a set of guidelines between multiple software agents sharing the PMU hardware on Intel® processors.

Intel® Virtualization Technology FlexMigration (Intel® VT FlexMigration) application note

This application note discusses virtualization capabilities in Intel® processors that support Intel® VT FlexMigration usages.

Intel® Virtualization Technology for Directed I/O architecture specification

This document describes the Intel® Virtualization Technology for Directed I/O.

Page Modification Logging for Virtual Machine Monitor white paper

This paper describes an Intel® Virtualization Technology (Intel® VT) enhancement for future Intel® processors.

Secure Access of Performance Monitoring Unit by User Space Profilers

This paper proposes a software mechanism targeting performance profilers which would run at user space privilege to access performance monitoring hardware; the latter requires privileged access in kernel mode, in a secure manner without causing unintended interference to the software stack.

The New Issue of The Parallel Universe is Out: Modernize Your Code for Intel® Xeon Phi™ Processors


Are you ready for the future of programming?

High-performance computing is changing fast, driven by trends like machine learning and next-generation hardware such as the Intel® Xeon Phi™ processor. To help developers maximize the possibilities, Intel® Parallel Studio XE 2017 delivers a host of new capabilities to support these trends.

To learn all about it, don’t miss the new issue of The Parallel Universe, Intel’s quarterly magazine for developers. Articles include:

  • Modernize Your Code for Intel® Xeon Phi™ Processors: Explore new Intel® Parallel Studio XE 2017 capabilities.
  • Unleash the Power of Big Data Analytics and Machine Learning:  Solve big data era application challenges with Intel® Performance Libraries.
  • Overcome Python* Performance Barriers for Machine Learning: Accelerate and optimize Python machine learning applications.
  • Profiling Java* and Python Code using Intel® VTune™ Amplifier: Get more CPU capability for Java- and Python-based applications.
  • Lightning-Fast R* Machine Learning Algorithms: Get results with the Intel® Data Analytics Acceleration Library and the latest Intel® Xeon Phi™ processor.
  • A Performance Library for Data Analytics and Machine Learning: See how the Intel® Data Analytics Acceleration Library impacts C++ coding for handwritten digit recognition.
  • MeritData Speeds Up its Tempo* Big Data Platform Using Intel® High-Performance Libraries: Case study finds performance improvements and potential for big data algorithms and visualization.

Read it now >

Performance Analysis and Optimization for PC-Based VR Applications: From the CPU’s Perspective


Download Now (PDF 1.61MB)

Virtual Reality (VR) is becoming more and more popular these days as technology advancement following Moore’s Law continues to make this brand new experience technically possible. While VR brings a fantastic immersive experience to users, it also puts significantly greater computing workloads on both the CPU and GPU compared to traditional applications due to dual-screen rendering, low latency, high resolution and high frame rate requirements. As a result, performance issues are especially critical in VR applications since a non-optimized VR experience with insufficient frame rate and high latency could cause nausea for users. In this article, we’ll introduce a general methodology to profile, analyze, and tackle bottlenecks and hotspots in a PC-based VR application regardless of the underlying engine or VR runtime used. We use a PC VR game from Tencent* called Pangu* as an example to showcase the analysis flow.

The rendering pipeline in VR games and conventional games

Before digging into the details of the analysis, we want to explain why the CPU plays an important role in VR and how it affects VR performance. Figure 1 shows the rendering pipeline in conventional games, where CPU and GPU work is processed in parallel in order to maximize hardware utilization. However, this scheme cannot be applied to VR: VR requires low and stable rendering latency, and the conventional pipeline does not meet this requirement.

Take Figure 1 as an example. If we look at the rendering latency of Frame N+2, we find that it is much longer than normal because the GPU has to finish the workload of Frame N+1 before it starts working on Frame N+2, introducing a significant latency to Frame N+2. In addition, the rendering latency varies across Frame N, Frame N+1, and Frame N+2 due to different execution circumstances, which is also unfavorable in VR since it can induce simulation sickness in users.

Figure 1: The rendering pipeline in conventional games.

As a result, the rendering pipeline in VR is changed to the one shown in Figure 2 in order to achieve the shortest latency for each frame. In Figure 2, CPU/GPU parallelism is intentionally broken, trading efficiency for a low and stable rendering latency per frame. In this arrangement the CPU can become a bottleneck, since the GPU has to wait for the CPU to finish pre-rendering jobs (draw call preparation, initialization of dynamic shadowing, occlusion culling, etc.); optimizing the CPU side therefore helps reduce GPU bubbles and improve performance.

Figure 2: The rendering pipeline in VR games.

Background of the Pangu* VR workload

Pangu* is a PC-based VR title from Tencent*. It is a DirectX* 11 first-person shooter VR game developed with Unreal Engine* 4 that supports both Oculus Rift* and HTC Vive*. We worked with Tencent to improve the performance and user experience of the game in order to achieve a best-in-class gaming experience on Intel® Core™ i7 processors. During the development work outlined in this article, the frame rate improved significantly, from 36.4 frames per second (fps) on Oculus Rift* DK2 (1920x1080) during early testing to 71.4 fps on HTC Vive* (2160x1200) at the time of this article. Here are the engines and VR runtimes used at the start and end of the development work:

  • Initial development platform: Oculus v0.8 x64 runtime and Unreal 4.10.2
  • Final development platform: SteamVR* v1463169981 and Unreal 4.11.2

Different VR runtimes were used during development because Pangu was initially developed on the Oculus Rift DK2, since neither the Oculus Rift CV1 nor the HTC Vive had been released at that time; Pangu was then migrated to HTC Vive once the device was officially released. The use of different VR runtimes was evaluated and did not make a significant difference in performance, since both the Oculus and SteamVR runtimes adopt the same VR rendering pipeline shown in Figure 2, and rendering performance is mainly determined by the game engine in this situation. It can also be verified in Figure 5 and Figure 14 that both the Oculus and SteamVR runtimes insert GPU work (for the distortion pass) after the GPU rendering of each frame, which consumes only a small proportion of time with respect to the rendering.

Figure 3 shows screenshots of the game before and after the optimization work. Note that the number of draw calls was reduced by 5X after optimization, and the average GPU execution period for each frame was reduced from 15.1 ms to 9.6 ms in order to fit the 90 fps requirement on HTC Vive*, as seen in Figures 12 and 13:

Figure 3: Screenshots of the game before (left) and after (right) optimization.

The specifications of the test platform:

  • Intel® Core™ i7-6820HK processor (4 cores, 8 threads) @ 2.7GHz
  • NVIDIA GeForce* GTX980 16GB GDDR5
  • Graphics Driver Version:  364.72
  • 16 GB DDR4 RAM
  • Windows* 10 RTM Build 10586.164

Spotting the performance issues

In order to better understand the potential performance issues of Pangu*, we first collected the basic performance metrics of the game, shown in Table 1. All the data in this table were collected using various tools, including GPU-Z, TypePerf, and Unreal Frontend. If we compare the data to system idle, several observations can be made:

  • Relatively low GPU utilization (49.64 percent on the GTX980) with respect to the low frame rate (36.4 fps). If the GPU utilization were improved, a higher frame rate could be achieved.
  • A high number of draw calls. Rendering in DirectX 11 is single threaded and has relatively high draw call overhead in the render thread compared to DirectX 12. Since the game was developed on DirectX 11 and the VR rendering pipeline breaks CPU/GPU concurrency in order to achieve a shorter motion-to-photon (MTP) latency, performance drops significantly if the game is render-thread bound. Fewer draw calls help relieve the render-thread bottleneck in this case.
  • CPU utilization does not seem to be an issue in this table, since it is only 13.6 percent on average. In the following section we show that this impression is misleading: the workload is actually bound by a few CPU threads.
Metric | System Idle | Pangu* on Oculus Rift* DK2 (before optimization)
GPU Core Clock (MHz) | 135 | 1337.6
GPU Memory Clock (MHz) | 162 | 1749.6
GPU Memory Used (MB) | 184 | 1727.71
GPU Load (%) | 0 | 49.64
Average Frame Rate (fps) | N/A | 36.4
Draw Calls (/frame) | 0 | 4437
Processor(_Total)\Processor Time (%) | 1.04 (5.73/0.93/0.49/0.29/0.7/0.37/0.24/0.2) | 13.58 (30.20/10.54/26.72/3.76/12.72/8.16/12.27/4.29)
Processor Information(_Total)\Processor Frequency (MHz) | 800 | 2700

Table 1: Basic performance metrics of the game before optimization.

In the following section, we use GPUView and Windows Performance Analyzer (WPA) from the Windows Assessment Development Kit (ADK) [1] to profile and analyze the bottlenecks in the VR workload.

A deeper look into the performance issues

GPUView [2] is a tool that can be used to investigate the performance interaction between graphics applications, CPU threads, graphics driver, Windows graphics kernel, and related interactions. This tool can also show whether an application is CPU bound or GPU bound in the timeline view. On the other hand, WPA [3] is an analysis tool that creates graphs and data tables of Event Tracing for Windows (ETW) events. It has a flexible UI that can be pivoted to view call stacks, CPU hotspots, context switches, and so on. It can also be used to explore the root cause of performance issues. Both GPUView and WPA can be used to analyze the event trace log (ETL) file captured by Windows Performance Recorder (WPR), which can be run from the user interface (UI) or from the command line, and have built-in profiles that can be used to select the events to be recorded.

For a VR application, it is best to first determine whether the application is bound by the CPU, the GPU, or both, so we can focus our optimization efforts on the most critical bottlenecks and achieve as much performance gain as possible with minimum effort.

Figure 4 shows the timeline view of Pangu* in GPUView before optimization, including the GPU work queue, CPU context queues, and CPU threads. Several facts can be concluded from the chart:

  • The frame rate is about 37 fps.
  • GPU utilization is about 50 percent.
  • The user experience of this VR workload is poor, since the frame rate is far below 90 fps, which can easily induce motion sickness and nausea in end users.
  • As seen in the GPU work queue, only two processes submitted tasks to the GPU: the Oculus VR runtime and the VR workload. The Oculus VR runtime performed work including distortion, chromatic aberration, and time warp at the last stage of frame rendering.
  • The VR workload was bound by both the CPU and the GPU:
    • CPU bound: the GPU was idle for 50 percent of the time (GPU bubbles) and was blocked by the execution of several CPU threads (T1864, T8292, T8288, T4672, T8308), which means that GPU work could not be submitted and executed until the CPU tasks in these threads had finished. If the CPU tasks were optimized, GPU utilization could be greatly improved to allow more work to be accomplished on the GPU, thus achieving a higher frame rate.
    • GPU bound: even if we could eliminate all the GPU bubbles, the GPU execution period of a single frame was still longer than 11.1 ms (about 14.7 ms in this workload), which means that without further optimization on the GPU side the workload cannot run at 90 fps, the required frame rate for premier VR head-mounted displays (HMDs) including Oculus Rift* CV1 and HTC Vive*.

Figure 4: A timeline view of Pangu* in GPUView.

 Preliminary recommendations for improving the frame rate and GPU utilization:

  • Defer some non-urgent CPU work, such as physics and AI, so that graphics rendering jobs are submitted earlier, reducing GPU bubbles during CPU bottlenecks.
  • Apply multithreading techniques efficiently to increase the amount of parallel execution and reduce the CPU bottleneck in the game.
  • Reduce tasks that lead to the CPU bottleneck, such as draw calls, dynamic shadowing, cloth simulation, physics, and AI navigation.
  • Submit the CPU work for the next frame earlier to reduce GPU gaps. Although motion-to-photon latency might be slightly increased, performance and efficiency could be greatly improved.
  • DirectX 11 has high draw call and driver overheads; too many draw calls lead to a serious render-thread CPU bottleneck, so consider migrating to DirectX 12 if possible.
  • Optimize GPU workloads as well (e.g., overdraw, bandwidth, texture fill rate), since the GPU active period for a single frame is longer than a vsync period, leading to dropped frames.

In order to take a deeper look into the bottleneck, we can use WPA to explore the same ETL file analyzed with GPUView. WPA can also be used to identify CPU hotspots in terms of CPU utilization or context switches; readers who are interested in this topic can refer to [4] for more details. Here we introduce the main methodology for CPU bottleneck analysis and optimization.

Look at a single frame of the VR workload that has performance issues. Since the present packet is submitted to the GPU once per frame after rendering, the timing between two succeeding present packets is the period of a single frame, as shown in Figure 5 (26.78 ms, which is equivalent to 37.34 fps).

Figure 5: A timeline view of Pangu* in GPUView for a single frame. Note the CPU threads that lead to GPU bubbles.

Note that there are GPU bubbles in the GPU work queue (for example, 7.37 ms at the beginning of a frame) that were actually caused by CPU-thread bottlenecks in the VR workload, as marked in the red rectangle. This is because CPU tasks such as draw call preparation, culling, and the like must finish before GPU commands are submitted for rendering.

If we use WPA to look at the CPU-bound periods shown in GPUView, we are able to find the key CPU hotspots that prevent the GPU from executing. Figures 6–11 show the utilization and the call stacks of the CPU threads in WPA, within the same time period as in GPUView.

Figure 6: A timeline view of Pangu* in WPA with the same period as Figure 5.

 Let’s look at the bottleneck of each CPU thread.

Figure 7: The call stack of the render thread T1864.

As seen in the call stack, the top three bottlenecks in the render thread are

  1. Base pass rendering for static meshes (50 percent)
  2. Initialization of dynamic shadows (17 percent)
  3. Compute view visibility (17 percent)

These bottlenecks are caused by too many draw calls, state changes, and shadow map rendering in the render thread. Some suggestions to optimize the render thread performance:

  • Apply batching in Unity* or actor merging in Unreal to reduce static mesh drawing. Combine close objects together and use Level of Details (LOD). Using fewer materials and putting separate textures into a larger texture atlas can also help.
  • Use Double Wide Rendering in Unity or Instanced Stereo Rendering in Unreal to reduce draw call submission overhead for stereo rendering.
  • Reduce or turn off real-time shadows. Objects that receive dynamic shadowing will not be batched, thus incurring a severe draw call penalty.
  • Avoid using effects that cause objects to be rendered multiple times (reflections, per-pixel lights, transparent, and multi-material objects). 

Figure 8: The call stack of the game thread T8292.

For the game thread, the top three bottlenecks are

  1. Set up pre-requirements for parallel processing of animation evaluation (36.4 percent)
  2. Redraw view ports (21.2 percent)
  3. Process Mouse Move Event (21.2 percent)

These bottlenecks can be addressed by reducing the number of viewports and the overhead of parallel animation evaluation on the CPU side. Use single-threaded processing instead if only a small number of animation nodes are used, and examine the use of mouse control on the CPU side.

Task threads (T8288, T4672, T8308):

Figure 9: The call stack of the task thread T8288.

Figure 10: The call stack of the task thread T4672.

Figure 11: The call stack of the task thread T8308.

For the task threads, bottlenecks are mostly located in physics-related simulations such as cloth simulation, animation evaluation, and particle system update.

Table 2 shows a summary of the CPU hotspots (percent of clockticks) during GPU bubble periods.

Thread | Function | Clockticks % | Thread total
Render thread | Base pass rendering for static meshes | 13.1% | 22.1%
Render thread | Initialization of dynamic shadows | 4.5% |
Render thread | Compute view visibility | 4.5% |
Game thread | Set up pre-requirements for parallel processing of animation evaluation | 7.7% | 16.7%
Game thread | Redraw view ports | 4.5% |
Game thread | Process Mouse Move Event | 4.5% |
Physics | Cloth simulation | 13.5% | 22%
Physics | Animation evaluation | 4% |
Physics | Particle system update | 4.5% |
Driver | | 4.4% | 4.4%

Table 2: CPU hotspots during GPU bubble periods before optimization.

Optimization

After implementing some of the optimizations, including level of detail (LOD), instanced stereo rendering, dynamic shadow removal, deferred CPU tasks, and optimized physics, the frame rate increased from 36.4 fps on Oculus Rift* DK2 (1920x1080) to 71.4 fps on HTC Vive* (2160x1200); GPU utilization also increased from 54.7 percent to 74.3 percent due to fewer CPU bottlenecks.

Figures 12 and 13 show the GPU utilization of Pangu* before and after optimization, respectively, as seen from the GPU work queue.

Figure 12: The GPU utilization of Pangu* before optimization.

Figure 13: The GPU utilization of Pangu* after optimization.

Figure 14: A timeline view of Pangu* in GPUView after optimization.

Figure 14 shows the Pangu* VR workload in GPUView after optimization. The CPU bottleneck period decreased from 7.37 ms to 2.62 ms, which was achieved by the following optimizations:

  • Running start of the render thread (a method that reduces the CPU bottleneck by introducing extra MTP latency) [5]
  • Reduction in the number of draw calls and their overhead, including the adoption of LOD, instanced stereo rendering, and the removal of dynamic shadowing
  • Deferral of work in the game thread and task threads

Figure 15 shows the call stack of the CPU render thread in the CPU bottleneck period, as marked in the red rectangle in Figure 14.

Figure 15: The call stack of the render thread T10404.

Table 3 shows a summary of the CPU hotspots (percent of clockticks) during GPU bubble periods after optimization. Note that many of the hotspots and threads were removed from the CPU bottleneck as compared to Table 2.

Thread | Function | Clockticks % | Thread total
Render thread | Base pass rendering for static meshes | 44.3% | 52.2%
Render thread | Render occlusion | 7.9% |
Driver | | 38.5% | 38.5%

Table 3: CPU hotspots during GPU bubble periods after optimization.

More optimizations, such as actor merging or using fewer materials, could be applied to the static mesh rendering in the render thread to further improve the frame rate. If the CPU tasks were fully optimized, the processing time of a single frame could be reduced by another 2.62 ms (the CPU bottleneck period within a single frame) to 11.38 ms, which is equivalent to 87.8 fps on average.

Table 4 shows the performance metrics before and after the optimization.

Metric | System Idle | Pangu* on Oculus Rift* DK2 (before optimization) | Pangu* on HTC Vive* (after optimization)
GPU Core Clock (MHz) | 135 | 1337.6 | 1316.8
GPU Memory Clock (MHz) | 162 | 1749.6 | 1749.6
GPU Memory Used (MB) | 184 | 1727.71 | 2253.03
GPU Load (%) | 0 | 49.64 | 78.29
Average Frame Rate (fps) | N/A | 36.4 | 71.4
Draw Calls (/frame) | 0 | 4437 | 845
Processor(_Total)\Processor Time (%) | 1.04 (5.73/0.93/0.49/0.29/0.7/0.37/0.24/0.2) | 13.58 (30.20/10.54/26.72/3.76/12.72/8.16/12.27/4.29) | 31.37 (46.63/27.72/33.34/18.42/39.77/19.04/46.29/19.76)
Processor Information(_Total)\Processor Frequency (MHz) | 800 | 2700 | 2700

Table 4: Basic performance metrics of the game before and after optimization.

Conclusion

In this article, we worked closely with Tencent* to profile and optimize the Pangu* VR workload on premier HMDs in order to work toward 90 fps on Intel® Core™ i7 processors. After implementing some of our recommendations, the frame rate increased from 36.4 fps on Oculus Rift* DK2 (1920x1080) to 71.4 fps on HTC Vive* (2160x1200), and GPU utilization increased from 54.7 percent to 74.3 percent on average due to fewer CPU bottlenecks. The CPU-bound period in a single frame was also reduced from 7.37 ms to 2.62 ms. Additional optimizations such as actor merging and texture atlasing could further improve performance.

Profiling and analyzing a VR application with various tools gives insight into the behavior and bottlenecks of the application, and it is essential to VR performance optimization, since performance metrics alone might not reflect the real bottlenecks. The methodology and tools discussed in this article can be used to analyze VR applications developed with different game engines and VR runtimes, and to determine whether a workload is bound by the CPU, the GPU, or both. Sometimes the CPU has a larger impact on VR performance than the GPU due to draw call preparation, physics simulation, lighting, or shadowing. After analyzing various VR workloads with performance issues, we found that many of them were CPU bound, implying that CPU optimization can help improve the GPU utilization, performance, and user experience of these applications.

References

[1] https://developer.microsoft.com/en-us/windows/hardware/windows-assessment-deployment-kit

[2] http://graphics.stanford.edu/~mdfisher/GPUView.html

[3] https://msdn.microsoft.com/en-us/library/windows/hardware/hh162981.aspx

[4] https://randomascii.wordpress.com/2015/09/24/etw-central/

[5] http://www.gdcvault.com/play/1021771/Advanced-VR

About the author

Finn Wong is a senior application engineer in the Intel Software and Solutions Group (SSG), Developer Relations Division (DRD), Advanced Graphics Enabling Team (AGE Team). He joined Intel in 2012 and has been actively enabling third-party media, graphics, and perceptual computing applications for the company's PC products since then. Before joining Intel, Finn gained seven years of experience and expertise in the fields of video coding, digital image processing, computer vision, algorithms, and performance optimization, with several academic papers published in the literature. Finn holds a bachelor's degree in electrical engineering and a master's degree in communication engineering, both from National Taiwan University.

Creating a B2B App for Enterprise: Start from Pain Points


In our last article, we started talking about B2B, and specifically, how to market a B2B app for small business. But what if your sights are set on bigger businesses or corporations? Before you do anything else, you'll need to understand the pain points of two sets of customers. Once you've done that, you can create a proof of concept and scale up for a successful launch. In this article, the first of a three-article series, we'll focus on how to connect with your customers and really understand their pain points. After all, what you're selling is a business solution—and to do that successfully, you'll need a solid understanding of what the business needs to solve.

Instead of a scheduling app for dentists, the example we used last time, what if you want to build an e-commerce portal for marketers, something to help them manage inventory, maintain a high level of presentation, and increase transactions? Your target customer is no longer a single dentist or an office manager, who you can drive across town and talk to. Instead, your customer is actually multiple people within the context of a larger organization, so you not only need to understand who those people are to reach them, you need to make sure that you're building something that directly addresses their complicated, and varied, pain points.

Not Just One Customer, But Two

Knowing your customer is always important, and most of the standard principles apply, but this is even more critical when it comes to enterprise B2B. Far from one-off, one-click impulse purchases, these are big sales—which means you’ll spend a lot more time working with the organization at every step of the process—from research, to possible custom features, to ongoing support and maintenance.

Not only that, but your “customer” in this scenario is actually multiple people. Most B2B apps will have two main customers or customer groups you’ll need to consider. The first is the user, the person within the corporation who will actually be using the app, and the second is the check-writer, or the executive who will approve the purchase. Those people are working together, and will share bigger organizational philosophies and goals, but when they consider whether or not to purchase your app, they’ll be looking at it from different angles and will have a different take on the key benefits.

In the e-commerce example, your user might be the e-commerce analyst who will actually implement and use the portal in order to manage their online sales, while the check-writer is the executive or CMO who will approve it. At a basic level, the check-writer will be focused on ROI—how will this app pay for itself in terms of increased transactions, or reduced labor costs? The user, on the other hand, will be focused on simplicity and usability—how will this app improve their experience and allow them to do better work?

You’ll need to understand both points of view, because they’ll both need to be on board if you’re going to be successful. A high-priced app that makes the user’s life much easier, but doesn’t affect the bottom line isn’t going to be approved by the check-writer. On the other hand, if something looks good on paper, but doesn’t actually solve the user’s problems, then it won’t result in the ROI the executive is expecting.

What Are the Pain Points? Take Time to Ask--And Listen

Knowing your customer and understanding their pain points isn’t something that begins at the sales or marketing stage, of course. It’s important to research your product as early in the process as possible, even before you write a line of code, if possible. Because the price point for enterprise B2B is so much higher, your product needs to be worth paying a lot for—otherwise it’s not worth making. That means there needs to be a clear ROI, and your product needs to solve meaningful pain points for not just one, but both of your main customers.

The approach here is similar to small business B2B, but in this case you’ll need to make sure you’re contacting and connecting with both users and check-writers. For the e-commerce portal, you would want to reach out to both ecommerce analysts and executives. Talk to at least ten different people at different companies, and spend a couple of hours with each of them. Ask questions, but more than anything else: listen, listen, listen. The conversations you have with them, and the insights they’ll be able to provide will be invaluable as you continue along the product development and sales process.

Now You Have a Plan in Place

At the end of this process you should have a strong start—clarity around the pain points for both of your customer groups, as well as insight into how to communicate the ROI, how to get budget approved, and how to get buy-in from the end user who will interact with your app day to day. You’ll also be laying the groundwork for good working relationships, which will be increasingly important. Your next step will be to line up a couple of reference customers and create a proof of concept. Check back soon for the next article in this series.

Intel® IoT Gateway Developer Hub and Software Suite/Pro Software Suite Release Notes


These are the latest release notes for the Intel® IoT Gateway Developer Hub, Intel® IoT Gateway Software Suite, and Intel® IoT Gateway Pro Software Suite.

Intel® IoT Gateway Developer Hub and Software Suite/Pro Software Suite Release Notes ARCHIVE


Use this ZIP file to access each available version of the release notes for the Intel® IoT Gateway Developer Hub, Intel® IoT Gateway Software Suite, and Intel® IoT Gateway Pro Software Suite, beginning with production version 3.1.0.17 through the currently released version. The release notes include information about the products, new and updated features, compatibility, known issues, and bug fixes.


Using LibRealSense and OpenCV to stream RGB and Depth Data



Introduction 

In this document I will show you how you can use LibRealSense and OpenCV to stream RGB and depth data. This article assumes you have already downloaded and installed both LibRealSense and OpenCV and have them set up properly on Ubuntu. I will be working on Ubuntu 16.04 using the Eclipse Neon IDE, though earlier versions will most likely work fine; Neon just happens to be the version of Eclipse I was working with when this sample was created.

In this article I make the following assumptions that the reader:

  1. Is somewhat familiar with using the Eclipse IDE. The reader should know how to open Eclipse and create a brand new empty C++ project.
  2. Is familiar with C++
  3. Knows how to get around Linux.
  4. Knows what Github is and knows how to at least download a project from a Github repository.

In the end you will have a nice starting point: a code base you can build upon to create your own LibRealSense / OpenCV applications.

Conventions 

LRS = LibRealSense. I get tired of writing it out. It’s that simple. So, if you see LRS, you know what it means.

Software Requirements 

Supported Cameras 

  • RealSense R200

In theory all of the RealSense cameras (R200, F200, SR300) should work with this code sample; however, it was only tested with the R200.

Setting up the Eclipse Project 

As mentioned, I’m going to assume that the reader already is familiar with opening up Eclipse and creating a brand new empty C++ project.

What I would like to show you is the various C++ header and linker settings I used for creating my Eclipse project.

Header file includes 

The following image shows which header directories I’ve included. If you followed the steps for installing LRS, you should have your LibRealSense header files located in the proper location. The same goes for OpenCV

Header file includes

Library file includes 

This image shows the libraries that need to be linked: the one LRS library and three OpenCV libraries. Again, I am assuming you have already set up LRS and OpenCV properly.

Library file includes

The main.cpp source code file contents 

Here is the source code for the example application.

/////////////////////////////////////////////////////////////////////////////

// License: Apache 2.0. See LICENSE file in root directory.

// Copyright(c) 2016 Intel Corporation. All Rights Reserved.

//

//

//

/////////////////////////////////////////////////////////////////////////////

// Authors
// * Rudy Cazabon
// * Rick Blacker
//
// Dependencies
// * LibRealSense
// * OpenCV
//
/////////////////////////////////////////////////////////////////////////////
// This code sample shows how you can use LibRealSense and OpenCV to display
// both an RGB stream as well as Depth stream into two separate OpenCV
// created windows.
//
/////////////////////////////////////////////////////////////////////////////

#include <librealsense/rs.hpp>
#include <opencv2/opencv.hpp>
#include <opencv2/highgui.hpp>

using namespace std;
using namespace rs;


// Window size and frame rate
int const INPUT_WIDTH      = 320;
int const INPUT_HEIGHT     = 240;
int const FRAMERATE        = 60;

// Named windows
char* const WINDOW_DEPTH = "Depth Image";
char* const WINDOW_RGB     = "RGB Image";


context      _rs_ctx;
device&      _rs_camera = *_rs_ctx.get_device( 0 );
intrinsics   _depth_intrin;
intrinsics  _color_intrin;
bool         _loop = true;


// Enable the color and depth streams on the first detected camera and start streaming.
// Returns true on success, false if no camera was found.

bool initialize_streaming( )
{
       bool success = false;
       if( _rs_ctx.get_device_count( ) > 0 )
       {
             _rs_camera.enable_stream( rs::stream::color, INPUT_WIDTH, INPUT_HEIGHT, rs::format::rgb8, FRAMERATE );
             _rs_camera.enable_stream( rs::stream::depth, INPUT_WIDTH, INPUT_HEIGHT, rs::format::z16, FRAMERATE );
             _rs_camera.start( );

             success = true;
       }
       return success;
}




/////////////////////////////////////////////////////////////////////////////
// If the left mouse button was clicked on either image, stop streaming and close windows.
/////////////////////////////////////////////////////////////////////////////
static void onMouse( int event, int x, int y, int, void* window_name )
{
       if( event == cv::EVENT_LBUTTONDOWN )
       {
             _loop = false;
       }
}


/////////////////////////////////////////////////////////////////////////////
// Create the depth and RGB windows, set their mouse callbacks.
// Required if we want to create a window and have the ability to use it in
// different functions
/////////////////////////////////////////////////////////////////////////////
void setup_windows( )
{
       cv::namedWindow( WINDOW_DEPTH, 0 );
       cv::namedWindow( WINDOW_RGB, 0 );

       cv::setMouseCallback( WINDOW_DEPTH, onMouse, WINDOW_DEPTH );
       cv::setMouseCallback( WINDOW_RGB, onMouse, WINDOW_RGB );
}


/////////////////////////////////////////////////////////////////////////////
// Called every frame gets the data from streams and displays them using OpenCV.
/////////////////////////////////////////////////////////////////////////////
bool display_next_frame( )
{

       _depth_intrin       = _rs_camera.get_stream_intrinsics( rs::stream::depth );
       _color_intrin       = _rs_camera.get_stream_intrinsics( rs::stream::color );


       // Create depth image
       cv::Mat depth16( _depth_intrin.height,
                                  _depth_intrin.width,
                                  CV_16U,
                                  (uchar *)_rs_camera.get_frame_data( rs::stream::depth ) );

       // Create color image
       cv::Mat rgb( _color_intrin.height,
                            _color_intrin.width,
                            CV_8UC3,
                            (uchar *)_rs_camera.get_frame_data( rs::stream::color ) );

       // Convert the 16-bit depth data (millimeters) to an 8-bit image for display,
       // scaling so that a depth of 1000 mm maps to the maximum intensity of 255
       cv::Mat depth8u = depth16;
       depth8u.convertTo( depth8u, CV_8UC1, 255.0/1000 );

       imshow( WINDOW_DEPTH, depth8u );
       cv::waitKey( 1 );

       cv::cvtColor( rgb, rgb, cv::COLOR_BGR2RGB );
       imshow( WINDOW_RGB, rgb );
       cv::waitKey( 1 );

       return true;
}

/////////////////////////////////////////////////////////////////////////////
// Main function
/////////////////////////////////////////////////////////////////////////////
int main( ) try
{
       rs::log_to_console( rs::log_severity::warn );

       if( !initialize_streaming( ) )
       {
             std::cout << "Unable to locate a camera"<< std::endl;
             rs::log_to_console( rs::log_severity::fatal );
             return EXIT_FAILURE;
       }

       setup_windows( );

       // Loop until someone left clicks on either of the images in either window.
       while( _loop )
       {
             if( _rs_camera.is_streaming( ) )
                    _rs_camera.wait_for_frames( );

             display_next_frame( );
       }


       _rs_camera.stop( );
       cv::destroyAllWindows( );


       return EXIT_SUCCESS;

}
catch( const rs::error & e )
{
       std::cerr << "RealSense error calling "<< e.get_failed_function() << "("<< e.get_failed_args() << "):\n    "<< e.what() << std::endl;
       return EXIT_FAILURE;
}
catch( const std::exception & e )
{
       std::cerr << e.what() << std::endl;
       return EXIT_FAILURE;
}

Source code explained 

Overview 

The structure is simple: a single source file containing everything we need for the sample, with the header includes at the top. Because this is a sample application, we are not going to worry too much about best practices in defensive software engineering. Yes, we could have better error checking; however, the goal here is to make this sample application as easy to read and understand as possible.

Constants 

Here we define constant values for the width, height, and framerate. These dictate the size of the images we want to stream, the size of the windows we display them in, and the framerate we request from the camera. After that come two string constants, used to name the OpenCV windows.

Global variables 

While I’m not a fan of global variables per se, in a streaming app such as this I don’t mind bending the rules a little. And while simple streaming such as in this sample may not be resource intensive, other things we could add to the app later could be, so any performance we can squeeze out now could be beneficial down the road.

  • _rs_ctx is the LibRealSense context, used to obtain a device (camera). Notice that we hard-code getting the first device; there are ways to detect all devices, but that is out of scope for this article.
  • _rs_camera is the RealSense device (camera) that we are streaming from.
  • _depth_intrin is an LRS intrinsics object that contains information about the current depth frame. In this case we are mostly interested in the size of the image.
  • _color_intrin is an LRS intrinsics object that contains information about the current color frame. Again, we are mostly interested in the size of the image.
  • _loop is simply used to know when to stop processing images. It is initially set to true and is set to false when the user clicks on an image in an OpenCV window.

I want to point out that _depth_intrin and _color_intrin are not strictly necessary. They are not the product of any calculations; they simply collect the intrinsic data in the display_next_frame( ) function, which makes the code easier to read when creating the OpenCV Mat objects. They are global so we don’t have to create these two variables every single frame.

Functions 

main(…)

As the name implies, this is the main function. We don’t need any command-line parameters, so I’ve chosen not to include any. The first thing that happens shows how you can use LRS to log to the console; here we ask LRS to print any warnings. Next we initialize streaming by calling initialize_streaming( ). If no camera is found, we print a message and exit. After that we call setup_windows( ). At this point everything is set up and we can begin streaming, which happens in the while loop: while _loop is true, we check whether the camera is streaming and, if so, wait for the next set of frames, then call display_next_frame( ) to show them.

Once _loop has been set to false, we fall out of the while loop, stop the camera, and tell OpenCV to close all its windows. At this point the app quits.

initialize_streaming(…)

This is where we initially set up the camera for streaming. We have two streams, one depth and one color. The images will be the size specified by the constants, and we also specify the format and framerate of each stream. For future expansion it might be better to add some error checking and handling here; however, to keep things simple we assume the happy path.
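To illustrate, here is a minimal sketch of what such error handling could look like. This variant is not part of the original sample; it assumes the same legacy LibRealSense API used above, where most calls report failure by throwing rs::error:

bool initialize_streaming_checked( )
{
       // Hypothetical variant of initialize_streaming( ); not part of the original sample.
       if( _rs_ctx.get_device_count( ) == 0 )
       {
              std::cerr << "No RealSense camera detected" << std::endl;
              return false;
       }

       try
       {
              _rs_camera.enable_stream( rs::stream::color, INPUT_WIDTH, INPUT_HEIGHT, rs::format::rgb8, FRAMERATE );
              _rs_camera.enable_stream( rs::stream::depth, INPUT_WIDTH, INPUT_HEIGHT, rs::format::z16, FRAMERATE );
              _rs_camera.start( );
       }
       catch( const rs::error & e )
       {
              // enable_stream( ) and start( ) throw rs::error when the camera rejects the configuration.
              std::cerr << "Failed to start streaming: " << e.what( ) << std::endl;
              return false;
       }

       return true;
}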

setup_windows(…)

This is a pretty easy function to understand. We tell OpenCV to create two new named windows, using the string constants WINDOW_DEPTH and WINDOW_RGB for the names. Once we have created them, we associate the mouse callback function onMouse with each window.

onMouse(…)

onMouse is triggered any time the user clicks inside the body of a window, specifically where the image is displayed. We use this function as an easy way to stop the application: all it does is check whether the event was a left-button click and, if so, set the Boolean flag _loop to false. This causes the code to exit the while loop in the main function.

display_next_frame(…)

This function is responsible for displaying the LRS data in the OpenCV windows. We start by getting the intrinsic data from the camera. Next we create the depth and RGB OpenCV Mat objects, specifying their dimensions and format and pointing their buffers at the current frame of each camera stream: the depth Mat object gets the camera’s depth data, and the color Mat object gets the camera’s color data.

The next thing we do is create a new Mat object, depth8u, which scales the 16-bit depth values into the 0-255 range required by OpenCV’s imshow( ) function, since a raw 16-bit depth image would not display usefully.

Once we have converted the depth image, we display it using the OpenCV function imshow, telling it which named window to use via the WINDOW_DEPTH constant and passing it the depth image. waitKey( 1 ) tells OpenCV to pause briefly so other processing, such as key presses and window events, can take place. After the depth window, we move on to the color window: cvtColor converts the rgb Mat object from the camera’s RGB ordering to the BGR ordering OpenCV expects, and then we show the image and call waitKey again.

Wrap up 

In this article, I’ve attempted to show just how easy it is to stream data from a RealSense camera using the LibRealSense open source library and display it in a window using OpenCV. While this sample is simple, it forms a base application from which you can create more complex applications using OpenCV.

OpenCL™ Drivers and Runtimes for Intel® Architecture


What to Download

By downloading a package from this page, you accept the End User License Agreement.

Installation has two parts:

  1. Intel® SDK for OpenCL™ Applications Package
  2. Driver and library(runtime) packages

The SDK includes components to develop applications.  Usually on a development machine the driver/runtime package is also installed for testing.  For deployment you can pick the package that best matches the target environment.
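Once a driver or runtime package is installed, one quick way to confirm that an OpenCL implementation is visible to applications is to enumerate the available platforms and devices. The following is a minimal sketch, not part of the packages described here; it assumes the OpenCL headers and ICD loader are installed and that you link against the OpenCL library (for example, with -lOpenCL on Linux):

#include <stdio.h>
#include <CL/cl.h>

int main( void )
{
    cl_platform_id platforms[ 8 ];
    cl_uint num_platforms = 0;

    // Ask the ICD loader for the installed OpenCL platforms.
    if( clGetPlatformIDs( 8, platforms, &num_platforms ) != CL_SUCCESS || num_platforms == 0 )
    {
        printf( "No OpenCL platforms found - check that a driver/runtime package is installed.\n" );
        return 1;
    }

    for( cl_uint p = 0; p < num_platforms; ++p )
    {
        char name[ 256 ] = { 0 };
        clGetPlatformInfo( platforms[ p ], CL_PLATFORM_NAME, sizeof( name ), name, NULL );
        printf( "Platform: %s\n", name );

        // List every device the platform exposes (GPU, CPU, or other).
        cl_device_id devices[ 8 ];
        cl_uint num_devices = 0;
        if( clGetDeviceIDs( platforms[ p ], CL_DEVICE_TYPE_ALL, 8, devices, &num_devices ) != CL_SUCCESS )
            continue;

        for( cl_uint d = 0; d < num_devices; ++d )
        {
            char dev_name[ 256 ] = { 0 };
            cl_device_type type = 0;
            clGetDeviceInfo( devices[ d ], CL_DEVICE_NAME, sizeof( dev_name ), dev_name, NULL );
            clGetDeviceInfo( devices[ d ], CL_DEVICE_TYPE, sizeof( type ), &type, NULL );
            printf( "  Device: %s (%s)\n", dev_name,
                    ( type & CL_DEVICE_TYPE_GPU ) ? "GPU" : ( type & CL_DEVICE_TYPE_CPU ) ? "CPU" : "other" );
        }
    }
    return 0;
}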

The illustration below shows some example install configurations. 

 

SDK Packages

Please note: a GPU/CPU driver package or a CPU-only runtime package is required in addition to the SDK to execute applications.

Standalone:

Suite: (also includes driver and Intel® Media SDK)

 

 

Driver/Runtime Packages Available

GPU/CPU Driver Packages

CPU-only Runtime Packages  

Deprecated 

 


Intel® SDK for OpenCL™ Applications 2016 R2 for Linux* (64 bit)

This is a standalone release for customers who do not need integration with the Intel® Media Server Studio (MSS).  It provides  components to develop OpenCL applications for Intel processors. 

Visit https://software.intel.com/en-us/intel-opencl to download the version for your platform. For details check out the Release Notes.

Intel® SDK for OpenCL™ Applications 2016 R2 for Windows* (64 bit)

This is a standalone release for customers who do not need integration with the Intel® Media Server Studio (MSS).  The Windows* graphics driver contains the driver and runtime library components necessary to run OpenCL applications. This package provides components for OpenCL development. 

Visit https://software.intel.com/en-us/intel-opencl to download the version for your platform. For details check out Release Notes.


OpenCL™ 2.0 GPU/CPU driver package for Linux* (64-bit)

The Intel intel-opencl-r3.0 (SRB3) Linux driver package  provides access to the GPU and CPU components of these processors:

  • Intel® 5th, 6th or 7th Generation Core™
  • Intel Pentium J4000 and Intel Celeron J3000
  • Intel® Xeon® v4, or Intel® Xeon® v5 Processors with Intel® Graphics Technology (if enabled by OEM in BIOS and motherboard)

Installation instructions.

Intel has validated this package on CentOS 7.2 for the following 64-bit kernels.

  • Linux 4.7 kernel patched for OpenCL 2.0

Supported OpenCL devices:

  • Intel Graphics (GPU)
  • CPU

For detailed information please see the driver package Release Notes.

For Linux drivers covering earlier platforms such as 4th Generation Core please see the versions of Media Server Studio in the Driver Support Matrix.


OpenCL™ Driver for Intel® Iris™ and Intel® HD Graphics for Windows* OS (64-bit and 32-bit)

The Intel® Graphics driver includes components needed to run OpenCL* and Intel® Media SDK applications on processors with Intel® Iris™ Graphics or Intel® HD Graphics on Windows* OS.

You can use the Intel Driver Update Utility to automatically detect and update your drivers and software.  Using the latest available graphics driver for your processor is usually recommended.


See also Identifying your Intel® Graphics Controller.

Supported OpenCL devices:

  • Intel Graphics (GPU)
  • CPU

For the full list of Intel® Architecture processors with OpenCL support on Intel Graphics under Windows*, refer to the Release Notes.

 


OpenCL™ Runtime for Intel® Core™ and Intel® Xeon® Processors

This runtime software package adds OpenCL CPU device support on systems with Intel Core and Intel Xeon processors.

Supported OpenCL devices:

  • CPU

Latest release (16.1.1)

Previous Runtimes (16.1)

Previous Runtimes (15.1)

For the full list of supported Intel Architecture processors, refer to the OpenCL™ Runtime Release Notes.

 


 Deprecated Releases

Note: These releases are no longer maintained or supported by Intel

OpenCL™ Runtime 14.2 for Intel® CPU and Intel® Xeon Phi™ Coprocessors

This runtime software package adds OpenCL support to Intel Core and Xeon processors and Intel Xeon Phi coprocessors.

Supported OpenCL devices:

  • Intel Xeon Phi Coprocessor
  • CPU

Available Runtimes

For the full list of supported Intel Architecture processors, refer to the OpenCL™ Runtime Release Notes.

What's New? - Intel® VTune™ Amplifier XE 2017 Update 1


Intel® VTune™ Amplifier XE 2017 performance profiler

A performance profiler for serial and parallel performance analysis. Overview | training | support.

New for the 2017 Update 1! (Optional update unless you need...)

As compared to 2017 initial release

  • Support for the Average Latency metric in the Memory Access analysis based on the driverless collection
  • Support for locator hardware event metrics for the General Exploration analysis results in the Source/Assembly view that enable you to filter the data by a metric of interest and identify performance-critical code lines/instructions
  • Command line summary report for the HPC Performance Characterization analysis extended to show metrics for CPU, Memory and FPU performance aspects including performance issue descriptions for metrics that exceed the predefined threshold. To hide issue descriptions in the summary report, use a new report-knob show-issues option.
  • Summary view of the General Exploration analysis extended to explicitly display the measure used for the hardware metrics: Clockticks vs. Pipeline Slots
  • GPU Hotspots analysis extended to detect hottest computing tasks bound by GPU L3 bandwidth
  • PREVIEW: New Full Compute event group added to the list of predefined GPU hardware event groups collected for Intel® HD Graphics and Intel® Iris™ Graphics. This group combines metrics from the Overview and Compute Basic presets and allows you to see all detected GPU stalled/idle issues in the same view.
  • Support for hotspot navigation and filtering of stack sampling analysis data by the Total type of values in the Source/Assembly view

Resources

  • Learn (“How to” videos, technical articles, documentation, …)
  • Support (forum, knowledgebase articles, how to contact Intel® Premier Support)
  • Release Notes (pre-requisites, software compatibility, installation instructions, and known issues)

Contents

File: vtune_amplifier_xe_2017_update1.tar.gz

Installer for Intel® VTune™ Amplifier XE 2017 for Linux* Update 1 

File: VTune_Amplifier_XE_2017_update1_setup.exe

Installer for Intel® VTune™ Amplifier XE 2017 for Windows* Update 1 

File: vtune_amplifier_xe_2017_update1.dmg

Installer for Intel® VTune™ Amplifier XE 2017 - OS X* host only Update 1 

* Other names and brands may be claimed as the property of others.

Microsoft, Windows, Visual Studio, Visual C++, and the Windows logo are trademarks, or registered trademarks of Microsoft Corporation in the United States and/or other countries.

Running Intel® Parallel Studio XE Analysis Tools on Clusters with Slurm* / srun


Since HPC applications target high performance, users are interested in analyzing their runtime behavior. To get a representative picture of that behavior, it can be important to gather analysis data at the same scale as regular production runs. Doing so, however, would imply that shared-memory-focused analysis types run on each individual node of the job in parallel. This might not be in the user’s best interest, especially since the behavior of a well-balanced MPI application should be very similar across all nodes. Therefore, users need the ability to run individual shared-memory-focused analysis types on subsets of MPI ranks or compute nodes.

 

There are multiple ways to achieve this, e.g. through

  1. Separating environments for different ranks through the MPI runtime arguments
  2. MPI-library-specific mechanisms for attaching analysis tools, such as “gtool” for the Intel® MPI Library
  3. Batch scheduler parameters that allow separating the environments for different MPI ranks

 

In this article, we want to focus on the third option by using the Slurm* workload manager, which allows us to stay independent of the MPI library implementation being utilized.

The Slurm batch scheduler comes with a job submission utility called srun. A very simple srun job submission could look like the following:

$ srun ./my_application

Now, attaching one of the analysis tools from the Intel® Parallel Studio XE suite (Intel® VTune™ Amplifier XE, Intel® Inspector XE, or Intel® Advisor XE) could look like the following:

$ srun amplxe-cl -c hotspots -r my_result_1 -- ./my_application

The downside of this approach, however, is that the analysis tool (VTune Amplifier in this case) will be attached to each individual MPI rank. Therefore, the user will get at least as many result directories as there are shared-memory nodes in the run.

If the user is only interested in analyzing a subset of MPI ranks or shared-memory nodes, they can leverage the multiple program configuration feature of srun. To do so, the user creates a separate configuration file that defines which MPI ranks will be analyzed:

$ cat > srun_config.conf << EOF
0-98    ./my_application
99      amplxe-cl -c hotspots -r my_result_2 -- ./my_application
100-255 ./my_application
EOF

As one can see from this example configuration, the user runs the target application across 256 MPI ranks, where only the 100th MPI process (i.e., rank #99) will be analyzed with VTune while all other ranks remain unaffected.

Now, the user can execute srun leveraging the created configuration file by using the following command:

$ srun --multi-prog ./srun_config.conf

This way, only one result directory for rank #99 will be created.

*Other names and brands may be claimed as the property of others.

MeritData Speeds Up a Big Data Platform


Being able to analyze massive quantities of data is more important than ever before in today’s data-driven world. Chinese company MeritData helps its customers explore and exploit the value in their data using data analysis algorithms and other powerful tools for data processing, mining, and visualization.

To keep at the top of its game, MeritData has to ensure its data mining algorithms are as efficient as possible. And to do that, MeritData turned to Intel.  Intel worked with MeritData’s algorithm engineers to optimize the company’s multiple data mining algorithms using Intel® Data Analytics Acceleration Library (Intel® DAAL) and Intel® Math Kernel Library (Intel® MKL). The result was average performance improvements ranging from 3x all the way to 14x.

“Through close collaboration with Intel engineers, we adopted the Intel® Data Analytics Acceleration Library and Intel® Math Kernel Library for algorithm optimization in our big data analysis platform (Tempo*),” explained Jin Qiang, data mining algorithm architect at MeritData. “The performance, and customers’ experience, is improved significantly. We really appreciate the collaboration with Intel, and are looking forward to more collaboration.”

Get the whole story in our new case study.
