Channel: Intel Developer Zone Articles

CEMOSoft Delivers Data-Driven Retail from the IoT Edge to the Cloud


CEMOSoft’s customer experience management platform for the cloud — enabled on an Intel® architecture based IoT edge gateway.

Executive Summary

Retailers, along with many other vertical industries, seek to take advantage of the benefits of the Internet of Things (IoT) to improve customer experience management. However, quickly analyzing relevant data to inform decision-making and respond effectively to rapidly changing customer behaviors is challenging. The CEMOSoft* platform is designed to create opportunities while addressing shifting demographics and an evolving IoT landscape. It offers a mobile customer engagement experience that can be dynamically modified on the fly, along with the increased security and intelligence of an Intel® architecture based IoT gateway and Windows® 10 IoT Core. The result is an affordable, flexible, scalable platform that brings ongoing customer insight to many aspects of daily operations.

“Shoppers’ expectations are evolving faster than the retail industry can deliver. The biggest challenge facing retailers is in fact knowing what consumer expectations are and keeping pace with them.” Roop Gill, Senior Research Analyst, IMRA, Intel


Challenges

According to Gartner, customer experience management (CEM) is a top priority for retailers.1 Says Roop Gill, a senior research analyst at Intel, “Shoppers’ expectations are evolving faster than the retail industry can deliver. The biggest challenge facing retailers is in fact knowing what consumer expectations are and keeping pace with them.” Sixty-one percent of customers are more likely to buy from companies delivering custom content.2 Three-quarters of online customers said they expected help within five minutes, have used comparison services for consumer goods, and trusted online reviews as much as personal recommendations.2 Twenty-five percent of customers will defect after just one bad experience.2

A survey of retail CEOs by PWC found that the customer engagement paradigm has changed significantly.3 Customers have unprecedented insight into how a product was produced or a supply chain crafted. They expect and have access to the C-suite, as well as to social platforms such as Yelp where opinions can be quickly shared. Trust and integrity at the C-level are an equalizing force, moving power from top down to peer to peer.4 Retail PR and executive communications divisions need to be fluent in social media to manage brands. At the same time, all employees are potentially involved in the customer journey.

Digitization and the rising use of smartphones are establishing new standards for fast, seamless customer service in all settings. Real-time responsiveness and easy-to-use apps for daily banking chores or ordering groceries are setting a high bar for speed and ease of doing business in business-to-consumer industries, and these expectations are migrating to business-to-business.5

There’s no question that changing demographics, data-driven analytics, and the rapid rise of connected IoT technologies are creating a landscape in which reaching customers and maintaining loyalty presents considerable challenges. These range from the plethora of new, often incompatible systems and technologies to the speed at which customer demands change due to social media networks and e-commerce to the high cost of cybercrime. Legacy infrastructure and processes, from scheduling to inventory to customer rewards programs, can make the cost of an end-to-end IoT solution prohibitive, both in terms of bottom line and business models.

Solution

CEMOSoft has created an innovative customer management platform powered by an Intel® architecture based IoT gateway that allows retailers to get on board extremely quickly and reap the benefits of IoT—without disrupting business operations or replacing legacy infrastructure. The platform is flexible, scalable, customizable, and strategic, directly addressing the needs of retailers and enabling business transformation. It runs Windows® 10 IoT Core, a version of Windows® 10 optimized for smaller devices.

The platform enables key goals of retail and industry CEOs including:

  • Understanding customer demographics and behavior patterns
  • Integrating emerging technologies without disrupting business operations
  • Managing and promoting brands
  • Building trust with customers
  • Engaging customers
  • Innovating business models and services

The CEMOSoft platform includes three standardized, integrated products: Insight*, Rewards*, and Analytics*. Together, they allow retailers to interact with customers in near-real time. This nearly instantaneous data can inform myriad facets of retail management, including planning, inventory, scheduling, customer service, brand awareness, and loyalty programs. A new retail account can be set up by CEMOSoft in under an hour.

The platform is designed to create a mobile experience where customers scan QR codes, enter a minimal amount of personal information, and respond to a few targeted questions. The software automatically generates an instant response, rewarding customers with a relevant coupon or offer. At the same time, the data from the customer interaction is sent to retail staff associates and managers, who can respond in kind: addressing negative feedback, answering questions, or cross-selling based on customer preferences. These preferences are also invaluable for long-term inventory planning. Retailers have direct communication with their customers and, thus, more accurate information about consumer behavior patterns and trends.
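The brief does not publish CEMOSoft's internal interfaces, so the following C++ sketch is only a rough illustration of the decision logic described above: reward every participating shopper instantly and flag negative feedback for a staff alert. All type and function names here are hypothetical.

#include <iostream>
#include <string>

// Hypothetical shape of one customer interaction captured after a QR-code scan.
struct SurveyResponse {
    std::string customerId;   // minimal personal information
    int rating;               // e.g., 1 (poor) to 5 (excellent)
    std::string comment;      // optional free-text answer
};

struct Outcome {
    std::string coupon;       // instant reward sent back to the shopper
    bool alertStaff;          // negative feedback routed to associates and managers
};

// Minimal sketch of the "instant response" step: every participant gets a coupon,
// and low ratings are escalated so staff can follow up in near-real time.
Outcome processResponse(const SurveyResponse& r) {
    Outcome out;
    out.coupon = (r.rating >= 4) ? "10% off your next visit"
                                 : "Free coffee - we want to make it right";
    out.alertStaff = (r.rating <= 2);
    return out;
}

int main() {
    Outcome o = processResponse({"anon-123", 2, "Checkout line was too long"});
    std::cout << "Coupon: " << o.coupon
              << (o.alertStaff ? " (staff alerted)" : "") << "\n";
    return 0;
}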

Experiences are dynamic: questions and responses can be changed on the fly at any time from anywhere. They can be broad or specific, targeting an overall experience across a chain or a particular inventory item in a single store at a defined time.


The CEMOSoft platform combines with the Intel® architecture based IoT gateway and Microsoft Windows® 10 IoT Core OS to deliver a seamless, dynamic, and smart consumer experience

Platform Benefits

With CEMOSoft, retailers can integrate emerging technologies within the framework of existing enterprise systems with minimal or no disruption to business processes.

  • Take advantage of the latest Intel® technology and Microsoft’s IoT operating system
  • Attract and retain current and new millennial and Gen X customers
  • Build brand advocates
  • Deliver personalized experiences and customer service
  • Get near-real-time insight
  • Simplify customer rewards and loyalty programs
  • Engage in “reciprocity commerce”
  • Process data from the device edge to cloud
  • Engage digital consumers and customers via mobile smart devices
  • Help protect customer data and privacy

A Smart Way to Connect with Customers

CEMOSoft* Platform with the Intel® Architecture Based IoT Gateway and Windows® 10 IoT Core

CEMOSoft Insight* - With Insight, retailers determine questions they would like customer input on. Questions can be open ended or specific (e.g., yes/no). They can be used to rate experiences, products, and offers. Most important, questions can be changed on the fly, as retailers identify information that would be useful in a given venue or time period. Research indicates that, in addition to being technology savvy and mobile, millennials want to participate and have a voice in their interactions. The CEMOSoft paradigm is a conversation between retailer and customer that allows millennials to be heard.

CEMOSoft Rewards* - As with the interactive questions, rewards can be shaped to suit the circumstances. They can be seasonal or time-based, tied to related merchandise, or shareable with friends and family. Rewards can be delivered in the form of coupons or a newsletter.

CEMOSoft Analytics* - Near-real-time analytics send customer information and data directly to designated staff associates and managers. Both short- and long-term data and reports are available from the platform to support immediate responses and extended planning based on more accurate business intelligence.

How it Works in Brief

1 The CEMOSoft platform integrates emerging big data, Software-as-a-Service (SaaS), cloud, and mobile technologies into a small form factor, giving retailers the value of analytics without requiring a large footprint in venues with limited space. The CEMOSoft application runs on an Intel® architecture based IoT gateway with Windows® 10 IoT Core.

2 The CEMOSoft business manager dashboard simplifies management and analysis, allowing account management for the Insight, Rewards, and Analytics modules.

3 Rewards can be quickly shared with participating customers through coupons: questions and alerts are set up in Insight, coupons or newsletters are uploaded to Rewards, and analytics are viewed via Analytics.

The Foundation for IoT: Intel works closely with the ecosystem to deliver smart IoT solutions based on standardized, scalable, reliable Intel® architecture. These solutions range from sensors and gateways to server and cloud technologies to data analytics algorithms and applications. Intel provides essential end-to-end capabilities — performance, manageability, connectivity, analytics, and advanced security — to help accelerate innovation and increase revenue for enterprises, service providers, and vertical industries. Intel can help organizations use data to monitor, control, optimize, and benchmark, as well as to share historical and near-real-time information to improve decision-making.

Conclusion

Together, the CEMOSoft platform and Intel® architecture based IoT gateway have the potential to transform retail business models. With the platform’s focus on near real-time exchange and engagement between retailers and customers, whether online or in brick-and-mortar stores, it allows a more agile approach to service and more personalized customer experiences. It is above all a tool to support a process of continuous, relevant feedback well suited to a constantly evolving consumer climate. Running on the IoT gateway, it brings the performance and reliability to support near-real-time processing in a small form factor, help secure customer data and transactions, gather and filter data, conduct analytics at the edge, and scale to meet changing requirements.

The CEMOSoft platform can also be used by a wide range of vertical industries to effectively understand, engage, and serve customers, from healthcare and government to hospitality and consumer packaged goods (CPG).

About CEMOSoft

CEMOSoft is committed to excellence in product management and customer services. Its experts bring broad, diverse, and hands-on experiences across industries, businesses, and processes.

Learn More

  • More information about CEMOSoft
  • More information about Intel® IoT Technology and the Intel® Internet of Things Solutions Alliance


References

  1. Magic Quadrant for CRM and Customer Experience Implementation Services, Worldwide, Gartner, 2016, gartner.com/doc/3525671/magic-quadrant-crm-customer-experience.
  2. The Cost of Crappy Customer Experiences Infographic, Thunderhead, July 5, 2015, thunderhead.com/the-cost-of-crappy-customer-experiences-infographic/.
  3. Total Retail Survey 2017, PWC, pwc.com/gx/en/industries/retail-consumer/total-retail.html.
  4. Four Concerns that Keep CEOs Awake at Night, World Economic Forum CEO Survey, Davos, Switzerland, 2017, weforum.org/agenda/2017/01/4-concerns-that-keep-ceos-awake-at-night/.
  5. Improving the Business-to-Business Customer Experience, McKinsey & Company, mckinsey.com/business-functions/marketing-and-sales/our-insights/improving-the-business-to-business-customer-experience.

Intel® technology features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer, or learn more at intel.com/iot. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.


Smart Transportation Robots Streamline Manufacturing Operations


SEIT* autonomous mobile robots, running on Intel® technology, enable manufacturers to improve flexibility and efficiency of intralogistics transportation.

Executive Summary

To remain competitive, manufacturers must focus on achieving new growth while driving down costs. Key to achieving this is greater flexibility and a dramatic upturn in operational efficiency across the manufacturing process. One area ripe for improvement is intralogistics transportation.

Many manufacturers still rely on automated guided vehicles (AGVs) to undertake repetitive transport tasks; but, rigid in nature, they do not support today’s demand-driven, dynamic manufacturing environments. Intelligent autonomous mobile robots (AMRs), like SEIT* from Milvus Robotics, offer a viable and cost-effective alternative.

This solution brief describes how to solve business challenges through investment in innovative technologies.

If you are responsible for…

  • Business strategy:
    You will better understand how autonomous mobile robots will enable you to successfully meet your business outcomes.

  • Technology decisions:
    You will learn how an autonomous mobile robot solution works to deliver IT and business value.


Figure 1. SEIT AMR from Milvus Robotics

Solution Benefits

  • Efficient operation - Fully autonomous rather than automated, SEIT* AMRs decide the best route to take to optimize workflow and travel time.
  • Safe navigation - SEIT AMRs have the intelligence to navigate safely around people and objects, using LiDAR, additional sensors, and a built-in collision avoidance system.
  • Fast deployment - SEIT AMRs do not depend on physical infrastructure such as wires or tapes, so common failures like gaps in track lines do not occur, costs are reduced, and robots can be up and running in just a couple of hours.

Succeeding in a Fiercely Competitive Sector

Manufacturers operate in a highly challenging market segment. In some low-cost labor countries, wage rates are rising rapidly. Volatile resource prices, a looming shortage of highly skilled talent, and heightened supply-chain and regulatory risks create an environment that is far more uncertain than it was before the Great Recession.1

At the same time, customer expectations are rising and demand for high-quality customized products and services is greater than ever. To compound matters, competition in the manufacturing sector is fierce, particularly within and from Asia. Manufacturers must remain highly focused on achieving new growth and driving down costs to remain competitive.

To realize these ambitions, manufacturers need to dramatically improve operational efficiency. Inflexible legacy equipment struggles to respond quickly to consumer demand and sometimes unpredictable disruptions. Investing in digital technologies is crucial for driving down costs and creating demand-driven and responsive business models.

Industry 4.0, the latest phase in the digitization of the manufacturing sector, is creating new ways for manufacturers to deliver value. Harnessing the power of the Internet of Things (IoT), manufacturers can now automate and track every step of their production, from the receipt of raw materials all the way through to delivery to the customer. They can monitor, collect, process, and analyze huge volumes of data every step of the way. From this data, they can then derive insight to improve operational efficiency and productivity, increase flexibility and agility, and ultimately drive down costs.

Streamlining Intralogistics Transportation

Manufacturers work hard to optimize, automate and integrate the logistical flow of materials within the walls of their fulfillment centers, distribution centers, and warehouses. While some still rely on traditional methods of transportation – forklifts and pallets – many have sought to improve intralogistics by rolling out AGVs.

AGVs reduce the need for workers to carry out non-value-add activities on the shop floor by undertaking repetitive transportation jobs. They follow magnetic or optical guide wires embedded in the floor, or take reference from reflectors placed on the walls, and can tow objects behind them or carry materials on a bed. AGVs are used in nearly every industry, including pulp, paper, metals, newspaper, and general manufacturing.

While they offer many benefits, AGVs require large upfront infrastructure investment and are limited to predefined routes because they need fixed references to operate, all of which brings an innate rigidness. Today’s factories, however, are far from static. As manufacturers adapt to meet customers’ ever-changing desires and needs, flexibility is critical. AGVs, unfortunately, are unable to provide this. More recently, AGVs’ inability to keep up with the demands of the dynamic factory environment has led to a surge in human intervention, which, in turn, has increased transportation costs. Manufacturers needed another solution.

Solution Value: Agile, Cost-effective, Autonomous Transportation

Using the sensory and processing powers enabled through Industry 4.0, SEIT AMRs from Milvus Robotics provide much more flexible, efficient, and integrated transportation than AGVs. Autonomous rather than automated, SEIT AMRs have the intelligence to decide and act according to changing environmental conditions. They decide the best routes to take to optimize workflow and travel time, and can safely navigate around obstacles.

Capable of sharing a space with human workers, SEIT AMRs can integrate with existing management systems and take orders from them. They can also communicate with robotic arms or a conveyor to undertake loading and unloading. Multiple SEIT AMRs can work harmoniously in the same facility, thanks to vehicle tracking and/or fleet management systems. The best robot is selected for each job according to already programmed jobs, distance to destination, and battery level. Thus, throughput can be optimized in facilities where there would otherwise be bottlenecks.
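The brief names the selection criteria (already programmed jobs, distance to destination, and battery level) without describing the actual algorithm, so the following C++ sketch is a hypothetical weighting function along those lines, not Milvus Robotics' implementation; the weights and thresholds are illustrative only.

#include <iostream>
#include <limits>
#include <vector>

// Hypothetical per-robot state a fleet manager could track.
struct RobotState {
    std::size_t queuedJobs;     // already programmed jobs
    double distanceToPickupM;   // distance to the destination, in meters
    double batteryPercent;      // remaining charge, 0-100
};

// Lower score = better candidate for the next job.
double dispatchScore(const RobotState& r) {
    if (r.batteryPercent < 15.0)                       // robots that need charging are skipped
        return std::numeric_limits<double>::infinity();
    return r.distanceToPickupM                         // prefer the closest robot
         + 50.0 * static_cast<double>(r.queuedJobs)    // penalize robots that are already busy
         + (100.0 - r.batteryPercent);                 // mild preference for a fuller battery
}

std::size_t pickRobot(const std::vector<RobotState>& fleet) {
    std::size_t best = 0;
    double bestScore = std::numeric_limits<double>::infinity();
    for (std::size_t i = 0; i < fleet.size(); ++i) {
        double s = dispatchScore(fleet[i]);
        if (s < bestScore) { bestScore = s; best = i; }
    }
    return best;
}

int main() {
    std::vector<RobotState> fleet = {{2, 40.0, 80.0}, {0, 65.0, 90.0}, {1, 10.0, 12.0}};
    std::cout << "Dispatch robot #" << pickRobot(fleet) << "\n";
    return 0;
}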

Because they map the environment in which they are working through a process of natural navigation, SEIT AMRs do not need bands, rails, or any other infrastructure investment. They can be up and running in a couple of hours. Technicians just need to create a map, define destination points, and construct workflows. This process doesn’t require any third-party vendor intervention or additional training.

Built with industrial grade components, SEIT AMRs are designed to withstand the rigors of industrial environments and can safely handle payloads up to 1500 kg with a maximum speed of 1.5 m/s and a zero turning radius.

SEIT AMRs are controlled via Milvus Fleet Manager*, a web-based platform built on a RESTful* API that allows users to request data, form new jobs and mission flows, and trigger actions from any automation platform. It is the main interface for communicating with machines grouped as an M2M network. Any authorized person can access the controls from any WiFi-connected device, such as a cell phone, tablet, or computer. They can get real-time information and connect to the rest of the production systems in the facility to create a fully trackable flow that optimizes productivity. Factories can also create their own custom application modules for communication, data transfer, and tracking over the internet, including conditional dynamic operations.
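Milvus does not document the Fleet Manager endpoints in this brief, so the snippet below is purely illustrative of how an automation platform might submit a transport job over a RESTful API. It uses libcurl; the URL, bearer token, and JSON fields are all assumptions.

#include <curl/curl.h>
#include <iostream>

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL* curl = curl_easy_init();
    if (!curl) return 1;

    // Hypothetical job request: move a pallet from station A3 to dock D1.
    const char* body = "{\"from\":\"A3\",\"to\":\"D1\",\"priority\":\"normal\"}";

    struct curl_slist* headers = nullptr;
    headers = curl_slist_append(headers, "Content-Type: application/json");
    headers = curl_slist_append(headers, "Authorization: Bearer <token>");  // placeholder credential

    curl_easy_setopt(curl, CURLOPT_URL, "https://fleet-manager.example.local/api/jobs");  // assumed endpoint
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body);

    CURLcode rc = curl_easy_perform(curl);
    if (rc != CURLE_OK)
        std::cerr << "Request failed: " << curl_easy_strerror(rc) << "\n";

    curl_slist_free_all(headers);
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return rc == CURLE_OK ? 0 : 1;
}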

Depending on the implementation, AMRs can provide a return on investment after just one or two years, as they increase productivity, streamline operations, reduce accidents and eliminate CAPEX.

Manufacturers across sectors, from FMCG to home appliances, and from Turkey to the United States, have rolled out SEIT AMRs.

Solution Architecture: SEIT AMRs, Running on Intel® Technology


Figure 2. SEIT AMRs, running on Intel® technology, improve operational efficiency

Milvus Robotics collaborates with Intel to optimize the operation of its SEIT AMRs.

Each robot contains an Intel® NUC, which provides the necessary processing power to drive the navigation system. The Intel NUC is a mini PC with the power of a desktop in a 4x4-inch form factor. It features a customizable board and chassis ready to accept the required memory, storage, and operating system. Running on an Intel® Core™ i7 processor, and being small, light, and battery-powered, it fits Milvus Robotics’ requirements perfectly. Built-in WiFi capability on the Intel NUC ensures fast and reliable data transfer and communication between each robot and all other systems for route optimization, while built-in Bluetooth is used to control simpler communications such as door opening.

SEIT AMRs use 2D Light Detection and Ranging (LiDAR) to underpin some safety elements, but LiDAR alone is not enough. To ensure 3D space detection, each robot is also kitted out with Intel® RealSense™ technology. This provides the robot with computer vision so it can recognize objects or people while navigating fulfillment centers, distribution centers, and warehouses.

Conclusion

Manufacturers tasked with keeping pace with ever-changing customer demands for new and personalized products and services, while driving down costs, are looking for ways to increase agility and streamline operations.

Intralogistics transportation has relied on the use of AGVs for nearly fifty years, but they no longer support increasing requirements for highly adaptive manufacturing processes. Cognitive and capable of delivering dynamic and efficient transport in increasingly congested industrial operations, AMRs present a viable and cost-effective alternative to traditional material-handling systems like AGVs.

Solutions Proven By Your Peers

Intel Solutions Architects are technology experts who work with the world’s largest and most successful companies to design business solutions that solve pressing business challenges. These solutions are based on real-world experience gathered from customers who have successfully tested, piloted, and/or deployed these solutions in specific business use cases. Solutions architects and technology experts for this solution brief are listed on the front cover.

Learn More

Solution product company:

Intel products mentioned in the paper:

Find Out How You Could Harness the Power of the Internet of Things

Find the solution that is right for your organization. Contact your Intel representative or visit https://www.intel.co.uk/content/www/uk/en/internet-of-things/overview.html

Infosim’s StableNet* Based on Intel® Architecture Provides Any-to-Any Connectivity for IoT


Enabling data-driven insight and holistic visibility for Telco, service providers, and the enterprise

Perhaps the central challenge as the Internet of Things (IoT) becomes a driving force in the worldwide economy is enabling secure, manageable, seamless connectivity across diverse things, protocols, systems, and infrastructures. The difficulties and risks are similar for organizations from service providers and enterprises to Telcos. On the one hand, there are myriad vendors, applications, protocols, configurations, and management systems that are neither integrated nor interoperable. On the other, the sheer volume of data is ever-increasing, and a digitally savvy global population and workforce are placing new demands on businesses large and small. New management systems can take years to roll out, delaying new services and revenue streams, and increasing maintenance costs. These factors create a set of complex, costly obstacles facing organizations for whom modernization is a competitive necessity.

Infosim focuses on addressing the core IoT challenge with a flexible, innovative platform based on powerful, high-performance Intel® architecture. StableNet* is designed to connect “any-to-any,” providing new levels of assurance and interoperability to both legacy and modern IoT infrastructure. By enabling protocols, networks, databases, and applications to talk to each other securely, and providing holistic, end-to-end visibility, Infosim and Intel are enabling viable, cost-effective connectivity with all the accompanying business and end-customer advantages. StableNet is being utilized by managed service providers (MSPs), enterprises, and Telcos, as well as in energy and manufacturing implementations.


Figure 1. Infosim simplifies the “zoo” of management applications

StableNet is a third-generation, highly automated network and services management system. The key differentiator compared to other types of legacy Operational Support Systems (OSSs) is that StableNet is a unified OSS system with three integrated functions that focus on configuration, fault, and performance management, with automated root cause analysis (RCA). StableNet can be deployed on a multi-tenant, multi-customer, or dedicated platform, and can be operated in a highly dynamic flex-compute environment. A modular licensing model allows companies to pay for what they need, and scale up or down as their business evolves.

StableNet benefits include:

  • Comprehensive management: Unified configuration, fault/RCA, and performance management in a single product.
  • Lower costs: Reduction in OPEX and CAPEX via product consolidation, step-by-step migration, and retirement of existing legacy element management solutions.
  • Streamlined access: Automated service delivery directly from your integrated service catalog.
  • Policy-based: Configuration and policy governance that maximizes service availability and reduces MTTR.
  • Rapid ROI: Due to reduction in OPEX and CAPEX, and customer service credits realized via greater service availability.
  • Service-oriented architecture (SOA): Enables high levels of integration and flexibility.
  • Built for business: An Intel® architecture foundation helps ensure high levels of security, performance, reliability, and scalability, along with a long-term roadmap to maximize investments.


Figure 2. The agile, holistic IoT platform for business and industry worldwide

StableNet is available in two versions:

StableNet* Telco - A comprehensive unified management solution for telecom operators and ISPs:
  • Services and network performance analysis
  • ICT and application performance monitoring
  • Customer service monitoring
  • Service assurance and configuration management
  • Mobile/4G LTE*
  • Inventory tracking
  • Field monitoring
  • Network configuration and change management

StableNet* Enterprise - An advanced, unified, and scalable network management solution for IT and managed service providers:
  • Automated fault and performance management
  • Automated IT infrastructure discovery via devices and/or via CMDB
  • Automated root cause analysis
  • Network configuration and change management
  • Automated reporting
  • Security, audit, integration, and compliance monitoring

Sample use cases

Telco Service Assurance
Telco companies are increasingly tasked with managing not only fixed and mobile networks, but also VoIP and IPTV, internal IT systems, applications, databases, firewalls, and key application services for business customers. Achieving convergence and cross-silo visibility with legacy systems requires time-intensive efforts and large upfront investments. With StableNet Telco running on Intel® platforms, Telco operators and managers gain integrated performance, fault, automated root cause, configuration, and inventory management within one consistent, modular Telco NG OSS solution.

The solution simplifies services-oriented management, including installation, deployment, and operations, and reduces total cost of ownership (TCO). Offerings include quad-play, mobile, high-speed Internet, VoIP (IPT, IPCC), IPTV across carrier Ethernet, metro Ethernet, MPLS, L2/L3 VPNs, multicustomer VRFs, Cloud, and FTTx environments. IPv4 and IPv6 are fully supported.

  • Supports handling of mass-CDRs, large numbers of IP and MPLS network elements, NetFlow, systems, and servers.
  • Scales to very large distributed Telco, MSP, SaaS, and cloud environments.
  • Supports distributed deployment of large numbers of small management agents at small sites. This allows cost-efficient, centralized, end-to-end monitoring for large numbers of distributed small sites and offices.
  • Allows step-by-step implementation and modular licensing for phased replacement of legacy management solutions.

Enterprise Infrastructure Management
Today’s enterprises are competing in a 24/7 service economy, with an infrastructure and application mix that is steadily growing in size and complexity. Outages and service disruption are costly and result in a loss in productivity. Monitoring of networks, applications, and devices is more critical than ever before, with a mix of personal and business devices accessing corporate networks amidst rampant security and identity threats. With StableNet Enterprise running on Intel® platforms, organizations gain proactive management with near-real-time problem identification, remediation, and restoration. This highly automated, consistent, cross-silo IT monitoring and reporting management solution addresses complex large, distributed IT network infrastructures, IT systems, and ICT services challenges in an efficient and secure way.

How it Works in Brief

StableNet’s unified network and services management platform has been addressing IoT management for years. This scalable platform consolidates many discrete network management functions. It can discover, map, monitor, and manage large-scale networks, with reporting, alerts, and visualization of overall IT infrastructure.

It is ideally suited for cloud infrastructure management. StableNet provides an array of functional capabilities required for monitoring and managing the hosting platforms within a cloud environment.

Application performance monitoring (APM) products in the marketplace today have a wealth of functionality that brings additional complexities and ultimately results in a costly operating model. StableNet’s APM functionality is already integrated with other enriched capabilities, making troubleshooting and identification of root causes much easier.

Many applications today are browser- or web-based. StableNet has the ability to configure predefined scripts that interact with a web-based application and perform a suite of metric tests. These are collected and analyzed for specific performance-related issues, as well as problem determination, monitoring, and reporting requirements.
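StableNet's scripting interface is not shown here, so the sketch below only illustrates the general idea of a synthetic web check: fetch a page, time the round trip, and report a pass/fail result that a monitoring platform could alert on. It uses libcurl and standard C++ timing; the URL and latency threshold are placeholders.

#include <curl/curl.h>
#include <chrono>
#include <iostream>

// Discard the response body; only status and latency matter for this check.
static size_t discardBody(char*, size_t size, size_t nmemb, void*) { return size * nmemb; }

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL* curl = curl_easy_init();
    if (!curl) return 1;

    curl_easy_setopt(curl, CURLOPT_URL, "https://intranet.example.com/login");  // page under test (placeholder)
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, discardBody);

    auto start = std::chrono::steady_clock::now();
    CURLcode rc = curl_easy_perform(curl);
    auto elapsedMs = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start).count();

    long status = 0;
    curl_easy_getinfo(curl, CURLINFO_RESPONSE_CODE, &status);

    // A management platform would store this metric and raise an alert on breach.
    bool healthy = (rc == CURLE_OK && status == 200 && elapsedMs < 2000);
    std::cout << "status=" << status << " latency_ms=" << elapsedMs
              << (healthy ? " OK" : " ALERT") << "\n";

    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return healthy ? 0 : 1;
}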

 


Figure 3. The StableNet platform provides a flexible service-oriented architecture

Conclusion

Now organizations can easily and cost-effectively overcome the main hurdle to taking full advantage of smart systems that deliver seamless connectivity and data-driven intelligence. Put simply, with Infosim’s StableNet based on Intel® architecture, the common “zoo” of management systems becomes manageable, interoperable, integrated, and effective. With StableNet, you can manage devices, networks, databases, and services; achieve visibility across systems, vendors, and protocols; and scale to meet evolving requirements. Together, Infosim and Intel are enabling organizations to deploy software-defined networking and SaaS so they can innovate and compete in a connected world.

About Infosim

Infosim is a leading manufacturer of automated service fulfillment and service assurance solutions for Telcos, ISPs, managed service providers, and corporations. Since 2003, Infosim has been developing and providing StableNet to Telco and enterprise customers. Infosim is privately held with offices in Germany (Wuerzburg, HQ; Muenster), United States (Austin), and Singapore.
infosim.net

Resources

The foundation for IoT

Intel works closely with the ecosystem to deliver smart Internet of Things (IoT) solutions based on standardized, scalable, reliable Intel® architecture. These solutions range from sensors and gateways to server and cloud technologies to data analytics algorithms and applications. Intel provides essential end-to-end capabilities—performance, manageability, connectivity, analytics, and advanced security—to help accelerate innovation and increase revenue for enterprises, service providers, and the Telco industry. Intel can help organizations use data to monitor, control, optimize, and benchmark, as well as to share historical and near-real-time information to improve decision-making.

Learn more

Infosim is a general member of the Intel® Internet of Things Solutions Alliance. From modular components to market-ready systems, Intel and the 400+ global member companies of the Alliance provide scalable, interoperable solutions that accelerate deployment of intelligent devices and end-to-end analytics. Close collaboration with Intel and each other enables Alliance members to innovate with the latest IoT technologies, helping developers deliver first-in-market solutions.

More information about Intel® IoT Technology and the Intel® Internet of Things Solutions Alliance

Infiswift Accelerates Connected Agriculture with Intel


The Infiswift* IoT platform based on high-performance Intel® architecture enables more efficient agricultural operations.

The Food and Agriculture Organization of the United Nations predicts that the world will need to produce 70 percent more food to feed the 9.6 billion people who will inhabit the planet by 2050.1 The agriculture industry faces considerable challenges as it strives to sustain itself today and prepare for future demands. These include an increasing need for fresh water — with agriculture consuming 70 percent of the world’s fresh water supply — and the impact of climate change.1

To meet these global demands, food producers and vendors need smarter data collection and simpler data integration. Most data today is of low quality, collected manually, and often cannot be accessed or shared between applications. It is critical for farmers and producers to gather higher quality data and control that data, particularly when third parties are involved. Existing solutions are often not designed for agriculture and, as a result, do not meet the requirements of the industry.

Technology already plays a key role in the modern farm, accessing data from connected things, such as sensors, equipment, and even drones. Analyzing this data at the edge (where sensors, equipment, and devices are located) and in the cloud can help the agriculture industry to more accurately and quickly increase yield, monitor environmental conditions and farm equipment, control fertilizer and pesticide application, assess variables from nutrients to growth patterns, detect disease, and improve resource utilization.

Infiswift is an agriculture-specific IoT platform that provides the foundation and services to help farms build connected solutions that improve operations.

How it works


Figure 1. Infiswift connects physical products to each other and the cloud to enable the agriculture industry to gather, analyze, and act on relevant data.

The Infiswift* IoT Platform

Infiswift is a technology company that focuses on enabling connected services for the agriculture ecosystem. Infiswift’s expertise is in helping the agriculture industry build connected, data-rich solutions that help secure, gather, transmit, analyze, and act on key data. This is all based on the long-term Intel® architecture roadmap to help ensure critical end-to-end security, reliability, and performance.

Infiswift works closely with farms, manufacturers, vendors, service providers, and more to help identify the Internet of Things (IoT) hardware and software solutions that will best meet their specific requirements and address their challenges.

The Infiswift IoT platform combines an innovative edge-to-cloud connectivity and analytics software engine with robust Intel® architecture. It provides a streamlined solution for OEMs and service providers to build and deploy smart solutions for the agriculture industry. At the same time, these solutions bring considerable benefits to farmers, distributors, and others along the food supply chain.

With Infiswift, the agriculture industry can collect better data, make better decisions, and streamline operations. Infiswift provides the “plumbing” to connect and manage devices, users, and cloud-based services. A patent-pending platform architecture powers scalability to potentially billions of endpoints using world-class security. Pre-configured features allow businesses to focus on getting their solution to market, rather than on back-end details.

Key Characteristics Enabling Agriculture to Benefit from IoT

  • Interoperable - Connect any new or legacy device from any vendor
  • Secure - Use cutting-edge security from Intel, such as a trusted platform module or McAfee Embedded Control
  • Cost-effective - Minimize hardware and software costs with a lightweight IoT solution
  • Scalable - Powerful platform architecture and Intel® hardware foundation enable high-performance scalability
  • Simple - Easy-to-use dashboards and interfaces act as a central management portal for holistic visibility into multiple systems
  • Energy efficient - Minimal power requirements for operation of hardware and software at the edge
  • Flexible - A protocol-agnostic platform to connect any device at the edge
  • Integrated - Cloud, professional services, and API integrations streamline development and deployment of agriculture solutions
  • Near-real time - Analyze streaming data to trigger alerts and notifications
  • Distributed - Perform some analysis in the field, sending only certain data to the cloud for quicker action and data cost savings (see the sketch after this list)
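As a rough illustration of the "Distributed" and "Near-real time" characteristics above, the C++ sketch below filters sensor readings at the edge and forwards only significant changes to the cloud, one common way to cut data costs. It is not Infiswift code; the threshold and uplink function are hypothetical.

#include <cmath>
#include <iostream>
#include <string>

// Placeholder for whatever uplink the edge device actually uses (MQTT, HTTPS, etc.).
void forwardToCloud(const std::string& sensorId, double value) {
    std::cout << "uplink " << sensorId << " = " << value << "\n";
}

// Edge-side filter: forward a reading only when it moves more than
// `threshold` away from the last value reported for that sensor.
class ChangeFilter {
public:
    explicit ChangeFilter(double threshold) : threshold_(threshold) {}

    void onReading(const std::string& sensorId, double value) {
        if (!hasLast_ || std::fabs(value - last_) >= threshold_) {
            forwardToCloud(sensorId, value);   // significant change: send it upstream
            last_ = value;
            hasLast_ = true;
        }
        // otherwise the reading stays local (edge analytics, local alerts, ...)
    }

private:
    double threshold_;
    double last_ = 0.0;
    bool hasLast_ = false;
};

int main() {
    ChangeFilter soilMoisture(2.0);   // report changes of at least 2 percentage points
    const double readings[] = {31.0, 31.4, 30.9, 28.5, 28.4, 34.0};
    for (double v : readings)
        soilMoisture.onReading("field-7/soil", v);
    return 0;
}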

Focus on Security

The Infiswift IoT platform simplifies security management by providing end-to-end security including secure-by-default, defense-in-depth strategies at various levels. Security is also built into Intel® processors to increase protection at the hardware level for devices, gateways, routers, and cloud data centers. While each project has different security requirements that must be taken into account, knowing that top-end security is available is important.

With Infiswift, clients can rely on:

  • Industry-leading standards including SSLv3, AES, and SHA-256
  • Industry-approved practices for key rotation and certificate management
  • Protection of privacy through techniques such as Intel® Enhanced Privacy ID (Intel® EPID)
  • Deployment of hardware-based solutions like TPM (Trusted Platform Module) or TEE (Trusted Execution Environment)
  • Tamper-proof designs that can survive hostile host environments

Planting the Next Generation of Agricultural Insights

Agriculture is becoming more precise and efficient, with technology driving many operational changes and automating manual processes. From more effective watering practices to better management of the supply chain, it is crucial that farms maintain an advantage as the industry evolves. The Infiswift IoT platform for agriculture enables farms and vendors to improve operations by connecting devices and making more precise decisions from the resulting insight. It can also help address challenges such as siloed data (e.g., where tractor data cannot be combined with warehouse data); legacy equipment that can’t be easily replaced, but can be made smart; and competitors with optimized operations.

Infiswift’s momentum in the agriculture arena has been growing steadily, with several implementations moving forward. The company is exploring collaborations with equipment manufacturers, service providers, and vendors who recognize the opportunity to implement IoT technology that can improve efficiencies and profits for their larger farming customers. Infiswift works closely with agricultural partners to develop the right connected solution from hardware selection to application development.

A typical farm implementation could include distributed intelligence at the edge — with smart sensors on livestock, machinery, irrigation, and feed systems — as well as centralized control and analytics for ongoing assessment of the operation as a whole. With the flexibility to integrate a wide range of data sources at low cost (due to wireless hardware advances), the infiswift platform can provide a great foundation for getting actionable insight from data. Different users — from owner to equipment operator to distributor — benefit from different types and amounts of information and can access custom dashboards, available via web and mobile, to visualize important data, analyses, and predictive information.

Infiswift Professional Services

Infiswift works directly with customers to create an optimal solution. Key steps in the process include:

  • Situational analysis
  • Platform optimization
  • Hardware recommendations
  • Custom application and dashboard development
  • Training, system maintenance, and customer support

The Foundation for Smart Agriculture

Intel works closely with the agriculture industry to deliver smart Internet of Things (IoT) solutions based on standardized, scalable, reliable Intel® architecture. These solutions range from sensors and gateways to server and cloud technologies to data analytics algorithms and applications. Intel provides essential end-to-end capabilities — performance, manageability, connectivity, analytics, and advanced security — to help increase productivity, efficiency, and quality across the agriculture value chain. Intel can help food producers, OEMs, retailers, and transport companies use data to monitor, control, optimize, benchmark, and share data in near-real time for better decision making.

Common Agriculture Use Cases

  • Food distribution - Align harvest availability with transportation to help eliminate idle time and improve lot traceability.
  • Accurate forecasting - Monitor actual vs. projected harvest in near-real time.
  • Inventory management - Track status of livestock, levels of stored grain, and more in near-real time.
  • Supply chain and distribution - Operate more profitably based on market signals and just-in-time distribution.
  • Weather planning - Integrate weather data to make better decisions.
  • Process automation - Take actions automatically based on data (e.g., schedule sprinklers).
  • Asset management - Monitor farm vehicles and machines to optimize operations and manage preventive maintenance (e.g., optimize harvester routes using GPS).
  • Environmental monitoring - Monitor soil conditions, nutrients, irrigation, and growth patterns; monitor for disease, insect, and weed issues to take preventive measures.
  • Livestock monitoring - Monitor variables such as body temperature, animal activity, pulse, food intake, and GPS position.
  • Notification and alerts - Send automatic alerts or take action based on predefined events (e.g., if a cow is ready for reproduction, identify it for recall from the field).

The greatest value and improvement, however, comes from bringing these use cases together in a secure manner. If a farm can bring data from its supply chain together with real-time harvest and weather data, there’s a lot more potential to make improvements and optimize. This makes interoperability between systems critical to the long-term value of any IoT-based solutions implemented by a farm, manufacturer, or service provider.
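As a concrete, if simplified, illustration of combining use cases, the C++ sketch below merges a soil-moisture reading with a weather forecast before deciding whether to run the sprinklers (the "Process automation" and "Weather planning" rows above). The data sources and thresholds are assumptions, not part of the Infiswift platform.

#include <iostream>

// Hypothetical inputs pulled from separate farm systems.
struct FieldStatus {
    double soilMoisturePct;   // from an in-field sensor
    double rainForecastMm;    // from an integrated weather feed
};

// Combine the two data sources: irrigate only when the soil is dry
// AND no meaningful rain is expected, saving water and pump time.
bool shouldIrrigate(const FieldStatus& f) {
    const double dryThresholdPct = 25.0;   // illustrative threshold
    const double rainSkipMm      = 5.0;    // skip irrigation if >= 5 mm of rain is forecast
    return f.soilMoisturePct < dryThresholdPct && f.rainForecastMm < rainSkipMm;
}

int main() {
    FieldStatus today{22.0, 8.0};   // dry soil, but rain on the way
    std::cout << (shouldIrrigate(today) ? "Run sprinklers\n" : "Hold irrigation\n");
    return 0;
}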

Conclusion

Infiswift speeds time-to-market for smart agriculture solutions with its industry-specific IoT solutions. With a foundation in standardized Intel® architecture, Infiswift brings a powerful and flexible platform to move the agriculture industry forward.

Now there’s a clear pathway to commercialization of connected agricultural solutions for OEMs, manufacturers, vendors, service providers, and farmers.

Learn More

Infiswift is a general member of the Intel® Internet of Things Solutions Alliance. From modular components to market-ready systems, Intel and the 400+ global member companies of the Alliance provide scalable, interoperable solutions that accelerate deployment of intelligent devices and end-to-end analytics. Close collaboration with Intel and each other enables Alliance members to innovate with the latest IoT technologies, helping developers deliver first-in-market solutions.

  • More information about infiswift
  • More information about Intel® IoT Technology and the Intel® Internet of Things Solutions Alliance

References

1. forbes.com/sites/federicoguerrini/2015/02/18/the-future-of-agriculture-smart-farming/#23f0f7ab337c.

Relayr “Any to Any” Connectivity Transforms Smart Buildings


Solving connectivity and interoperability challenges. Increasing building insight and efficiency.

Executive Summary

Smart buildings can improve operations, simplify management, reduce costs, and increase energy efficiency. The challenge is that achieving these benefits means effectively communicating with myriad disparate systems, infrastructures, and protocols. Relayr offers an innovative comprehensive software stack solution that essentially connects “anything to anything,” enabling both legacy and new buildings to reap the considerable advantages of data-driven insight in the era of the Internet of Things (IoT).

Flexibility for the Complex Smart Building Landscape

Building managers seek solutions that bring efficiencies, improve manageability and utilization, and lower costs. But buildings are complex, designed with multiple, often siloed systems from different vendors—from HVAC to security to electrical wiring to environmental controls. Building managers must contend with incompatible systems, multiple interfaces, and disparate analytics—all while managing everything from monitoring and maintenance to space utilization. Building equipment is typically designed to last: An HVAC system can operate for 40 years, making it inefficient and expensive to replace before its projected end of life. Countless systems and devices are already installed, with sensors actively collecting data. However, these discrete systems are often not integrated, making the data difficult to access for analysis. This means building managers do not have the tools or cross-facility insight to create efficiencies or control costs.

The building industry is in flux, with connected facilities offering new services and efficiencies and seizing opportunities to innovate. The process of getting smarter is an evolutionary one that goes beyond smart building installations to analytics-based strategy and optimization. With relayr, building managers and architects now have the agility, flexibility, and support for continuing assessment and improvement.


relayr helps address challenges, including incompatible systems and legacy investments

The Interoperability Advantage

relayr enables smart buildings to connect any device, equipment, or service to the cloud. It offers a full open source software stack with “any to any” integration from sensors to gateways to the cloud. Its innovative middleware solution solves the challenges caused by incompatible protocols and legacy equipment. The outcome is ongoing, comprehensive, cross-facility visibility into building operations, systems, and activity.

Automation and orchestration can be put into place for a wide range of tasks, such as tracking and monitoring building temperature to improve energy efficiency. relayr’s open source SDK and APIs support efficient development of new applications and rapid prototyping, accelerating new services to help buildings compete and deliver tenant satisfaction. Vendor lock-in is eliminated, giving building managers and architects more choices, and preserving legacy infrastructure investments.

Retrofitting existing buildings with relayr helps extend the life of building assets. Data can be tapped for predictive maintenance, preserving equipment and investments, while reducing expenses on costly labor and parts.

Accelerating Digital Transformation

relayr helps smart building customers manage and accelerate digital transformation. Its focus on the Industrial IoT (IIoT) brings deep understanding of the particular challenges of the building management industry and the configurations in both legacy and new building infrastructures and architectures. Now buildings can benefit from the same platform used by technology industry leaders including Intel, SAP, Cisco, IBM, Microsoft, Salesforce, and many others.
  • Manage and control building devices, sensors, equipment, and appliances
  • Get integrated coverage across multiple systems, as well as vendors and cloud services
  • Achieve visibility across building operations with a single integrated dashboard
  • Increase automation of tasks, including machine-to-machine efficiencies
  • Improve orchestration for better system and network utilization
  • Build vertical applications to deliver new services using the relayr technology stack

Key Solution Components

  • IoT cloud platform as a service - Multiprotocol platform delivering data for IoT
  • Open source SDK and API libraries - Powerful open source tools for fast, cost-effective development
  • Open source sensor kit - Rapid prototyping for new solutions

Smart Buildings. Measurable Benefits.

For relayr customers in the building industry, the gains can be significant.

  • Reduce maintenance costs
  • Lower CAPEX versus traditional systems
  • Reduce machine failures
  • Increase operating efficiency
  • Achieve higher revenue productivity per square foot

End-to-end Smart Building Transformation

With relayr, significant investments in legacy infrastructure and long-life systems can be protected, while still realizing the benefits of IoT. Legacy and brownfield buildings can be retrofitted for better ROI. New building construction can be architected to maximize interoperability and connectivity.

Key Smart Building Use Cases

Interoperability
  • Connect multiple systems, protocols, vendors, and data sources
Building management system integration
  • Integrate with existing applications and dashboards
Security
  • Increase security for buildings, data, and systems
  • Permissions-based administration of central dashboard
Environmental monitoring
  • Real-time sensors for comfort index, fire, air, noise, etc.
Digital ceiling
  • Smart power over Ethernet (PoE) lighting and sensor infrastructure
Space optimization and workforce experience
  • Tools to manage people flow, occupancy, and utilization
  • Collaboration tools
Retrofit, predictive maintenance
  • Cognitive building operations
  • Predictive maintenance (e.g., elevators)
Sustainability and lower OpEx
  • Optimize resource usage for energy, power, water, and gas
Custom IoT use cases
  • Asset tracking
  • Physical security


relayr cloud platform architecture

Tech in Brief

relayr onboards any physical object, using any communication method, to the relayr middleware via Cisco networks. SDKs on top of the cloud middleware stack allow applications to be built quickly, securely, and easily. Cloud APIs simplify integration of industrial assets and physical objects (e.g., sensors, elevators, appliances, buildings) into any cloud service, including Cisco, SAP, Nest, and Google. relayr tools and its unique 5-4-3 process support rapid innovation and application development.

relayr offers an agnostic cloud platform for the IoT that works across devices, hardware, and sensors. relayr enterprise-class middleware supports smart connectors, analytic interfaces, standardized engines, and a high-performance cloud. relayr also runs on gateways, protecting buildings from Internet downtime or failure. Powerful, easy-to-use cloud APIs simplify onboarding of any physical object using any communication method, while a flexible, secure SDK and open source tools enable fast, cost-effective development of cloud services and applications.
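The paragraph above notes that relayr also runs on gateways so buildings keep operating through Internet downtime. The C++ sketch below shows one generic way a gateway might buffer readings locally and flush them once the uplink returns; it is not relayr code, and every name in it is illustrative.

#include <deque>
#include <iostream>
#include <string>
#include <utility>

// Stand-in for the real uplink; assume it fails while the Internet connection is down.
bool sendToCloud(const std::pair<std::string, double>& reading, bool online) {
    if (!online) return false;
    std::cout << "sent " << reading.first << " = " << reading.second << "\n";
    return true;
}

class GatewayBuffer {
public:
    // Called for every local reading; local automation can still act on it immediately.
    void record(const std::string& sensor, double value) {
        pending_.emplace_back(sensor, value);
    }

    // Called periodically; drains the backlog whenever connectivity is available.
    void flush(bool online) {
        while (!pending_.empty() && sendToCloud(pending_.front(), online))
            pending_.pop_front();
    }

    std::size_t backlog() const { return pending_.size(); }

private:
    std::deque<std::pair<std::string, double>> pending_;
};

int main() {
    GatewayBuffer buf;
    buf.record("floor3/temp", 21.5);
    buf.flush(false);                                   // uplink down: reading stays buffered
    std::cout << "backlog=" << buf.backlog() << "\n";   // backlog=1
    buf.record("floor3/temp", 21.7);
    buf.flush(true);                                    // uplink restored: backlog drains in order
    std::cout << "backlog=" << buf.backlog() << "\n";   // backlog=0
    return 0;
}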


Agnostic middleware support for connected buildings

A Scalable Foundation for Building Intelligence

Intel®-based IoT gateways gather, filter, and transmit data from sensors and machines to the cloud. Intel® technologies improve data security and manageability, while enabling local machine-to-machine automation. From predictive maintenance to data-based operations, Intel® architecture supports analytics capabilities to optimize processes, equipment, systems, and utilization.

About Relayr

relayr is a rapidly growing enterprise IoT company, providing the enterprise middleware for the digital transformation of industries. As a thought leader in enterprise IoT, relayr develops sustainable IoT solutions, based on the OpenFog IoT reference architecture and its own stack. relayr addresses the central challenge of IoT, digitizing physical objects, with an end-to-end development solution consisting of an IoT cloud platform that communicates from any-to-any (any service, any software, any platform, any sensor); open source software development kits; and a team of IoT experts to support rapid prototyping and implementation.

Learn more: relayr.io

About Intel

Partnering with industry leaders such as relayr, Intel is helping to streamline and simplify how innovative building industry and architecture firms develop smart IoT solutions for an increasingly interconnected world. From the scalable compute power of Intel® Quark™ SoC, Intel® Atom™ processors, and Intel® Core™ vPro™ processors to Intel®-based smart home gateways, Intel delivers the foundational, end-to-end technologies that let you connect, secure, and manage valuable data so that you get more from IoT.

Learn more: intel.com/iot

5-4-3 innovation acceleration

Due to the complexity and dependencies in most buildings, many owners, managers, and architects are unsure where to start—whether moving a building into the digital era or improving existing digital capacity. relayr’s unique 5-4-3 Innovation Acceleration methodology is designed to help buildings rapidly progress from concept to full IoT rollout within one business quarter. In short, the process enables relayr to assess your requirements and identify the top three areas where you can receive the most value for your investment. Prototyping and testing can occur in your actual environment to pilot and validate solutions. Speed, accuracy, and pragmatism are the cornerstones of relayr’s approach to help remove risks, reduce upfront costs, and ensure results are beneficial and on target.

The 5-4-3 approach helps buildings manage and speed IoT transformation with end-to-end support in short, focused stages:

5 days: Kickstart
relayr IoT experts guide clients through the process of developing their own IoT solutions using the relayr toolset. It starts with five days of foundational preparation and a kickstart workshop designed to profile, map, analyze, prioritize, and determine the feasibility of ideas—resulting in three leading ideas to move toward proof of concept.

4 weeks: Accelerate
Four work weeks are devoted to developing a working prototype, prototype testing, and creating a scope of work for the traction phase. The prototype focuses on one use case, from making a digital footprint of an existing or new asset (including retrofitting with sensors, custom connectors, and visualizations) to basic integration with an existing ERP, CRM, or BMS system.

3 months: Traction
Three months of intensive development efforts, including implementation of connectors, full testing, dashboard setup, and sensor consumption package definition.

Launch: Full IoT rollout
The process concludes with a step-by-step rollout of the IoT solution to relevant areas of the building.

Conclusion

relayr brings proven expertise in connecting heterogeneous systems, so buildings can invest in the smart capabilities that will deliver the most benefit. With full interoperability, legacy and new buildings have systems that talk to each other, revealing a holistic profile of operations. Intel-based IoT gateways and technologies provide an added layer of security and intelligence from the edge to the cloud. Turn the complex building industry landscape into opportunity with relayr and Intel.

Innovate FPGA Design Contest Voting


Check Out the Project Abstracts, and Vote for Your Favorites

Over 450 teams from around the world have submitted project abstracts for the 2018 Innovate FPGA design contest. Many are quite impressive and cover a wide range of applications, including:

  • Autonomous driving
  • Medical diagnostics
  • Robotics
  • Computer vision
  • Agriculture automation
  • Intelligent prosthetics

View the Projects

You can still participate by voting for the teams you think should receive development hardware to begin their designs. You must be a community member to vote, however. Every community member can vote for up to three projects in each of the four regions.
Don’t delay, voting closes on January 30, 2018.

Join the Community

Additional Information

80 bit long double math functions from Intel may conflict with Microsoft* Visual C++* 64 bit versions


Reference Number : CMPLRS-43756

Version : Intel® C++ Compiler Version 18.0, 17.0 and earlier; Microsoft* Visual C++* Version 2017 and 2015 (at least)

Operating System : Windows*

Problem Description :  Calls to 80 bit long double versions of standard math functions under the /Qlong-double Intel compiler option may yield unexpected results due to conflicts with 64 bit long double versions in the Microsoft Visual C++ run-time libraries. Because the function prototypes look the same, there is no warning at compile or link time, so this can lead to run-time errors that may be hard to debug.

Cause : The Intel run-time library contains 32 bit, 64 bit and 80 bit implementations of standard math functions, e.g. sqrtf(), sqrt() and sqrtl(). On Microsoft* Windows* systems, the default length of the long double data type is 64 bits. In the Microsoft run-time library, both sqrt() and sqrtl() functions expect 64 bit arguments in an xmm register and return 64 bit results. The Intel compiler converts calls from sqrtl() for long doubles to sqrt() which also expects 64 bit arguments in xmm registers and returns 64 bit results. However, when the Intel Compiler switch /Qlong-double is set, the long double data type becomes 80 bits long and calls to sqrtl() expect an 80 bit argument on the stack and return an 80 bit result. Thus the Intel and Microsoft entry points for long double versions of math functions such as sqrtl have the same name, but pass arguments and return results in different ways.

In general, the Intel compiler driver tries to link the Intel math run-time library ahead of the Microsoft run-time library, so that the Intel versions of math functions pre-empt the Microsoft versions. In some circumstances, a pragma in Microsoft header files can cause the Microsoft run-time library to be linked ahead of the Intel math library. (An example of this is in use_ansi.h, invoked by cstdio). If /Qlong-double has been set, this leads to an ABI mismatch for long double versions of math functions such as sqrtl(). This can lead to unexpected results for such calls, as seen in the following example:

#include <mathimf.h>
#include <cstdio>

int main() {
                long double value = sqrtl(100.0l);
                printf("Value: %f\n", (double)value);

                return 0;
}
>icl /Od /nologo test.cpp
>test
Value: 10.000000

>icl /Od /Qlong-double /nologo test.cpp
>test
Value: -nan(ind)

Solution : Caution should be exercised when using 80 bit long double APIs on Windows. This issue can be avoided by linking the Intel math run-time library explicitly, e.g. with /link /DEFAULTLIB:libmmt for the default of linking with a multithreaded, static run-time library (/MT). For the above example:

>icl /Od /Qlong-double /nologo test.cpp /link /defaultlib:libmmt
>test
Value: 10.000000

 

How to Install Vulkan* APIs for UE4

By Eddie (Edward) Correia

New Paths, New Possibilities

Just as parallelism and multithreaded programming paved the way for the performance strides of multicore CPUs, Vulkan* APIs are poised to forge a future of multithreaded, cross-platform GPU programming, and high-performance rendering, regardless of the target device.

The heir apparent to OpenGL*, Vulkan* gives developers greater control over threading and memory management, and more direct access to the GPU than predecessor APIs, which means more versatility for addressing an array of target platforms. The only costs are a relatively up-to-date processor and a bit more development work up front.

What You’ll Need

The minimum requirement for developing with Vulkan* APIs on Intel Graphics GPUs is a processor from the 6th Generation Intel® Processor Family (introduced in August 2015) running 64-bit Windows* 7, 8.1, or 10. Intel also offers a 64-bit Windows® 10-only driver for 6th-, 7th-, or 8th-generation processors. Vulkan* drivers are now included with Intel® HD Graphics drivers, which helps simplify the setup process.

Using UE4 with Vulkan* APIs requires that the engine be rebuilt, and this must be done after the Vulkan* SDK has been downloaded and installed. Rebuilding the Unreal Engine requires the engine source code, which is freely available on GitHub* to registered users who have linked their Git account with Epic Games. All required steps are covered here.

These instructions are for setting up a development host with Intel® HD Graphics.

Part One: Download Intel® Graphics Driver

1. Visit the Intel Download Center.

2. Select “Graphics Drivers” from the “Select a Product” drop-down menu. 

3. Select the required driver based on the development host.

4. Download the .ZIP version of the driver.

5. Extract all files from the .ZIP into a memorable destination folder.

Part Two: Update Graphics in Windows*

6. In Device Manager, expand “Display adapters,” right-click on the Intel® HD Graphics adapter, and select “Update driver.”

7. Select “Browse my computer…” in the “Update Drivers” screen.

8. Select “Let me pick…”

9. Select “Have Disk…”

10. Navigate to the folder containing the files unzipped in Step 5.

11. If successful, a message like the one below will appear:

Part Three: Configure UE4 for Vulkan*

12. Download and install the Vulkan* SDK.

13. If you’re not already logged into GitHub*, log in now. Then open the Unreal Engine Launcher and click “Get the source code on GitHub*” (UE4 source is free to registered users who have linked their GitHub* account with Epic Games, which we'll cover next).

Clicking the “Grab the source” link brings you to the GitHub* page, as pictured below:

13a. Open your Epic Games Dashboard and link your Epic and GitHub* accounts:

14. Shortly after linking the accounts, a confirmation will arrive in your email inbox. Return to the Epic Games GitHub* page, and look for an invitation to join:

Clicking “View invitation” brings you to the page pictured below. Click “Join Epic Games.”

15. This returns you to the Epic Games GitHub* page, where UE4 repositories will now be available. Click “UnrealEngine” to continue.

16. Select the “master” branch from the “Branch:” button. Then hit the “Clone or download” button.

Important:

If you don’t intend to return engine changes to the community, download the .Zip file. If you do, then fork the master branch, clone it to your hard drive, and proceed from there.

Extract the .Zip (or clone the repo) to a suitable location on your hard drive.

17. When the file download has finished, open the new directory and run “Setup.bat,” and wait for it to finish (this takes a while).

18. In the same directory, run “GenerateProjectFiles.bat” to create the “UE4.sln” project:

Note: On some systems, it might be necessary to activate certain Visual Studio features to enable the “GenerateProjectFiles” script to do its job.

If it fails the first time, open Visual Studio and do the following:

a. Select Tools > "Tools and Features…"

b. In the list of features, check “Game development with C++”

c. In the right-hand panel, check “Unreal Engine Installer”

d. Click “Modify” to save changes

e. Run “GenerateProjectFiles” again

19. Double-click the "UE4.sln" file to open the project in Visual Studio:

20. In VS Solution Explorer, right-click the UE4 project, and select Build.

21. Once the build completes successfully, set up a shortcut for the UE4 Editor that puts it in “Vulkan* mode” using these steps:

a. Go to "C:\<installation_dir>\Engine\Binaries\Win64\"

b. Create a shortcut for the file "UE4Editor.exe"

c. Set the shortcut Target to: "C:\<installation_dir>\Engine\Binaries\Win64\UE4Editor.exe" -vulkan

Your UE4 projects will now build using Vulkan* APIs whenever you start from that shortcut.


Retail Insights with the Intel® Responsive Retail Platform


Platform Overview

The Intel® Responsive Retail Platform (Intel® RRP) enables rapid development and deployment of IoT services that enhance customer experience and drive operational efficiencies from the supply chain through to the store floor. It does this by easing the onboarding of legacy gateways and diverse sensor types, and by allowing multiple applications to run on the same platform (no more data silos). In this way it connects an entire store, giving retailers a comprehensive view of in-store operations along with access to the value of their data. Data-driven insights gained through the platform can be acted upon to help retailers improve operational efficiency and strengthen customer engagement.
Below we outline the modern challenges faced by retail stores and the need to transform their data into business insights. We then discuss the Intel® RRP as a solution that enables retailers to harness the power of their data to meet those challenges and stay competitive.

Who can Benefit from Using the Intel® RRP

The Intel® RRP can be used by both physical and omni-channel retailers (i.e., those with both a physical and an online presence). For omni-channel retailers, Intel® RRP manages multiple channels and supports data fusion by enabling data from digital channels to be combined with data from physical channels.

Challenges Faced by Retailers

Today's consumers are increasingly Internet savvy and equipped with mobile devices that enable them to search online and compare the best deals on products. As a result, many consumers no longer adhere to the idea of "brand loyalty" but instead seek out competing brands and stores. They're increasingly likely to make a purchase from whichever brand or store offers the best deal, and this makes it harder for retailers to anticipate consumer behavior. Adapting to modern consumer behavior (variable purchase and browsing habits), along with the pressure to leverage "big data" for insight into consumers and store operations, is a key business challenge retailers face in order to stay competitive.

Modern Browsing and Purchasing Behavior

Consumers can shop online or in-store but often engage in behavior that combines online browsing with an in-store purchase, or vice versa. According to research conducted by Intel, nearly 75% of consumers browse online before visiting a shop to inspect a product, then go back online to search for the best deal.
The availability and effectiveness of online or in-store help can also affect consumer behavior. The same research suggests that about 90% of consumers will leave a store or website if they're unable to readily find something or no help is available.

Managing Several Data Sources

Omni-channel retail stores that have multiple channels (both physical and digital) are presented with a unique challenge. Not only do they face the task of managing several data sources (data coming from both their online and physical presence) but also finding the time and expertise to manage and then derive insights from these increasingly complex data sources.

The Need for Business Insights

Unlike brick-and-mortar retailers (stores that are not online), online retailers can observe how consumers interact with the design of their apps or websites (where they navigate, what products they browse and for how long, etc.) The advantage of being able to use browser cookies to track what products a consumer considers, views, and eventually purchases (in near real-time) can give online retailers great insight into their operations (which can include following orders from the place of purchase, through distribution, and even into their supply chain).  
But without the benefit of browser cookies to track consumer behavior and interaction, brick-and-mortar retailers struggle to gain the same insights. These retailers have a tougher time honing their customer, workforce, and inventory management as well as store environment strategies, because both the data and data-driven insights are difficult to acquire. 
The Internet of Things, sensor technologies, and business intelligence applications can offer some relief but are typically too complex to implement fully (separate applications have to be cobbled together to share data). Burdened with technology silos, long implementation time-frames, and limited people resources to work with data to extract business insights, brick-and-mortar retailers face great challenges to remain competitive with their online competition.

Data-driven Insights through the Intel® RRP

As a solution to the challenges faced by in-store (brick-and-mortar) retailers, Intel has created the Intel® Responsive Retail Platform (Intel® RRP) which combines gateways, sensors and software to connect an entire store to enable a comprehensive view of a retailer’s in-store operations along with data-driven insights.
Designed to bridge the digital and physical worlds, Intel® Responsive Retail Platform works to:
• Simplify sensor management.
• Eliminate islands of technology.
• Enable interoperable data and events.
• Deliver near real-time alerts and calls-to-action.
Intel® RRP enables the rapid development and deployment of IoT services that enhance the customer experience and drive operational efficiencies from the supply chain to the store floor. 

Ease of Sensor Onboarding with Intel® Context Sensing SDK

Designed to be open to any application (data or sensor type) regardless of manufacturer or software vendor, Intel® RRP uses the open source Intel® Context Sensing SDK along with UPM (sensor library) and MRAA (I/O library) to enable the connection of over 400 sensor types from diverse manufacturers. This allows for faster and less complex integration of sensor and other data types for reading, conversion (raw values from MRAA into human-readable units), and the subsequent creation and identification of events. The Intel® RRP is an on-premises (in-store) edge appliance that can leverage legacy sensing infrastructure with the Intel® Context Sensing SDK, along with UPM and MRAA, to connect devices and other data sources for data integration.
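To make the MRAA piece of this concrete, the short sketch below reads a raw value from an analog sensor and converts it into a human-readable unit. It assumes libmraa's Python bindings are installed and that a light sensor is wired to analog pin 0; the conversion formula is purely illustrative and is not taken from the Intel® RRP or Intel® Context Sensing SDK code.

import time
import mraa

light_sensor = mraa.Aio(0)              # analog input on AIO pin 0

for _ in range(5):
    raw = light_sensor.read()           # raw ADC value (0-1023 on a 10-bit ADC)
    percent = raw / 1023.0 * 100.0      # convert to a simple, human-readable percentage
    print("Light level: {:.1f}%".format(percent))
    time.sleep(1)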

Sensing with Intel® RRP

There are numerous sensor technologies on the market, but in the retail environment we categorize technologies by what they are sensing. 
 
In the diagram above, we outline four categories of things being sensed: 
1. Consumer: Sensing where a customer is could help determine if a retailer needs to send an associate to assist them.
2. Retail associate: Sensing and tracking where associates are located can help determine who is closest and available to assist a customer.
3. Product: Inventory and product placement tracking helps to understand inventory levels and locate misplaced products.
4. Store environment: Monitoring the store environment can help to understand foot-traffic, what aisles or products are receiving a lot of consumer attention and what sections of the store aren’t attracting shoppers.

The Advantage of Interoperable Event Data

Intel® RRP uses Docker* container technology to enable multiple devices and applications to reside on the same platform. This makes the event data created by the Intel® RRP interoperable: devices and applications can share data, interpret that shared data, and eventually present it to the user (for example, the business owner). This interoperability can provide insights into in-store operations, empowering a business to take the right actions to improve operational efficiency (discover opportunities to automate tasks or reduce costs) or strengthen customer engagement (Intel® RRP features accurate inventory management to help retailers better serve their customers).

Transition into the Next Generation of Retail

We’ve presented the Intel® RRP as a solution that enables retailers (both physical and omni-channel) to meet the challenges of evolving consumer trends (for example, modern purchase and browsing behavior) as well as the ability to unlock the value of their retail data. By easing the process of onboarding legacy gateways, diverse sensor types, and having multiple applications running on the same platform, the platform connects an entire retail store to provide businesses with a comprehensive view of their in-store operations along with data-driven insights. Those insights can empower businesses to take appropriate action to improve their in-store operations and also help them to discover opportunities for strengthening customer engagement. 
As a responsive model that enables near real-time data-driven insights, the Intel® RRP can help businesses to transition into the next generation of retail where they have the ability to quickly respond to evolving consumer demands and trends, enhance a customer’s shopping experience, and discover valuable opportunities to reduce operational costs.

Build an Image Classifier in 5 steps on the Intel® Movidius™ Neural Compute Stick


What is Image Classification?

Image classification is a computer vision problem that aims to classify a subject or an object present in an image into predefined classes. A typical real-world example of image classification is showing an image flash card to a toddler and asking the child to recognize the object printed on the card. Traditional approaches to providing such visual perception to machines have relied on complex computer algorithms that use feature descriptors, like edges, corners, colors, and so on, to identify or recognize objects in the image.

Deep learning takes a rather interesting, and by far the most efficient, approach to solving real-world imaging problems. It uses multiple layers of interconnected neurons, where each layer uses a specific computer algorithm to identify and classify a specific descriptor. For example, if you wanted to classify a traffic stop sign, you would use a deep neural network (DNN) that has one layer to detect edges and borders of the sign, another layer to detect the number of corners, the next layer to detect the color red, the next to detect a white border around red, and so on. The ability of a DNN to break down a task into many layers of simple algorithms allows it to work with a larger set of descriptors, which makes DNN-based image processing much more effective in real-world applications.

Stop sign

NOTE: The above image is a simplified representation of how a DNN would identify different descriptors of an object. It is by no means an accurate representation of a DNN used to classify STOP signs.

Image classification is different from object detection. Classification assumes there is only one object in the entire image, sort of like the ‘image flash card for toddlers’ example I referred to above. Object detection, on the other hand, can process multiple objects within the same image. It can also tell you the location of the object within the image.

Practical learning!

You will build...

A program that reads an image from a folder and classifies it into the top 5 categories.

You will learn...

  • How to use pre-trained networks to do image classification
  • How to use Intel® Movidius™ Neural Compute SDK’s API framework to program the Intel Movidius NCS

You will need...

  • An Intel Movidius Neural Compute Stick - Where to buy
  • An x86_64 laptop/desktop running Ubuntu 16.04

If you haven't already done so, install the NCSDK on your development machine. Refer to the NCS Quick Start Guide for installation instructions.

Fasttrack…

If you would like to see the final output before diving into programming, download the code from our sample code repository (NC App Zoo) and run it.

mkdir -p ~/workspace
cd ~/workspace
git clone https://github.com/movidius/ncappzoo
cd ncappzoo/apps/image-classifier
make run

make run downloads and builds all the dependent files, like the pre-trained networks, binary graph file, ILSVRC dataset mean, etc. We have to run make run only the first time, after which we can run python3 image-classifier.py directly.

You should see an output similar to:

------- predictions --------
prediction 1 is n02123159 tiger cat
prediction 2 is n02124075 Egyptian cat
prediction 3 is n02113023 Pembroke, Pembroke Welsh corgi
prediction 4 is n02127052 lynx, catamount
prediction 5 is n02971356 carton

Inferred image

Let’s build!

Thanks to the NCSDK's comprehensive API framework, it only takes a couple of lines of Python to build an image classifier. Below are some of the user-configurable parameters of image-classifier.py (example values are shown after this list):

  1. GRAPH_PATH: Location of the graph file against which we want to run the inference
    • By default it is set to ~/workspace/ncappzoo/caffe/GoogLeNet/graph
  2. IMAGE_PATH: Location of the image we want to classify
    • By default it is set to ~/workspace/ncappzoo/data/images/cat.jpg
  3. IMAGE_DIM: Dimensions of the image as defined by the chosen neural network
    • ex. GoogLeNet uses 224x224 pixels, AlexNet uses 227x227 pixels
  4. IMAGE_STDDEV: Standard deviation (scaling value) as defined by the chosen neural network
    • ex. GoogLeNet uses no scaling factor, InceptionV3 uses 128 (stddev = 1/128)
  5. IMAGE_MEAN: Mean subtraction is a common technique used in deep learning to center the data
    • For the ILSVRC dataset, the mean is Blue = 102, Green = 117, Red = 123
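As an illustration, the parameters above might be set as follows. The paths are the defaults quoted in the list; the variable assignments themselves are a sketch rather than a copy of image-classifier.py.

import os
import numpy

GRAPH_PATH   = os.path.expanduser( '~/workspace/ncappzoo/caffe/GoogLeNet/graph' )
IMAGE_PATH   = os.path.expanduser( '~/workspace/ncappzoo/data/images/cat.jpg' )
IMAGE_DIM    = ( 224, 224 )                        # GoogLeNet expects 224x224 input
IMAGE_STDDEV = 1                                   # GoogLeNet uses no scaling factor
IMAGE_MEAN   = numpy.float16( [ 102, 117, 123 ] )  # ILSVRC per-channel mean (B, G, R)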

Before using the NCSDK API framework, we have to import the mvncapi module from the mvnc library:

import mvnc.mvncapi as mvnc

Step 1: Open the enumerated device

Just like any other USB device, when you plug the NCS into your application processor’s (Ubuntu laptop/desktop) USB port, it enumerates itself as a USB device. We will call an API to look for the enumerated NCS device.

# Look for enumerated Intel Movidius NCS device(s); quit program if none found.
devices = mvnc.EnumerateDevices()
if len( devices ) == 0:
    print( 'No devices found' )
    quit()

Did you know that you can connect multiple Neural Compute Sticks to the same application processor to scale inference performance? More about this in a later article, but for now let’s call the APIs to pick just one NCS and open it (get it ready for operation).

# Get a handle to the first enumerated device and open it
device = mvnc.Device( devices[0] )
device.OpenDevice()

Step 2: Load a graph file onto the NCS

To keep this project simple, we will use a pre-compiled graph of a pre-trained GoogLeNet model, which was downloaded and compiled when you ran make inside the ncappzoo folder. We will learn how to compile a pre-trained network in another blog, but for now let's figure out how to load the graph into the NCS.

# Read the graph file into a buffer
with open( GRAPH_PATH, mode='rb' ) as f:
    blob = f.read()

# Load the graph buffer into the NCS
graph = device.AllocateGraph( blob )

Step 3: Offload a single image onto the Intel Movidius NCS to run inference

The Intel Movidius NCS is powered by the Intel Movidius visual processing unit (VPU). It is the same chip that provides visual intelligence to millions of smart security cameras, gesture-controlled drones, industrial machine vision equipment, and more. Just like the VPU, the NCS acts as a visual co-processor in the entire system. In our case, we will use the Ubuntu system to simply read images from a folder and offload them to the NCS for inference. All of the neural network processing is done solely by the NCS, thereby freeing up the application processor's CPU and memory resources to perform other application-level tasks.

In order to load an image onto the NCS, we will have to pre-process the image.

  1. Resize/crop the image to match the dimensions defined by the pre-trained network.
    • GoogLeNet uses 224x224 pixels, AlexNet uses 227x227 pixels.
  2. Subtract mean per channel (Blue, Green and Red) from the entire dataset.
    • This is a common technique used in deep learning to center the data.
  3. Convert the image into a half-precision floating point (fp16) array and use the LoadTensor function call to load the image onto the NCS.
    • The skimage library can do this in just one line of code.
import numpy
import skimage.io
import skimage.transform

# Read & resize image [Image size is defined during training]
img = print_img = skimage.io.imread( IMAGE_PATH )
img = skimage.transform.resize( img, IMAGE_DIM, preserve_range=True )

# Convert RGB to BGR [skimage reads image in RGB, but Caffe uses BGR]
img = img[:, :, ::-1]

# Mean subtraction & scaling [A common technique used to center the data]
img = img.astype( numpy.float32 )
img = ( img - IMAGE_MEAN ) * IMAGE_STDDEV

# Load the image as a half-precision floating point array
graph.LoadTensor( img.astype( numpy.float16 ), 'user object' )

Step 4: Read and print inference results from the NCS

Depending on how you want to integrate the inference results into your application flow, you can choose to use either a blocking or non-blocking function call to load tensor (previous step) and read inference results. We will learn more about this functionality in a later blog, but for now let’s just use the default, which is a blocking call (no need to call a specific API).

# Get the results from NCS
output, userobj = graph.GetResult()

# Print the results
print('\n------- predictions --------')

labels = numpy.loadtxt( LABELS_FILE_PATH, str, delimiter = '\t' )

order = output.argsort()[::-1][:6]
for i in range( 0, 5 ):
    # i + 1 so the printout matches the 1-based "prediction 1 ... 5" output shown above
    print ('prediction ' + str( i + 1 ) + ' is ' + labels[order[i]])

# Display the image on which inference was performed
skimage.io.imshow( IMAGE_PATH )
skimage.io.show( )

Step 5: Unload the graph and close the device

In order to avoid memory leaks and/or segmentation faults, we should close any open files or resources and deallocate any used memory.

graph.DeallocateGraph()
device.CloseDevice()

Congratulations! You just built a DNN-based image classifier.

Further experiments

  • This example script reads only one image; modify the script to read and infer multiple images from a folder (see the sketch after this list)
  • Use OpenCV to display the image(s) and their inference results on a graphical window
  • Replicate this project on an embedded board like RPI3 or MinnowBoard
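Here is a minimal sketch of the first experiment. It reuses the NCSDK calls and configuration variables (IMAGE_DIM, IMAGE_MEAN, IMAGE_STDDEV) described earlier; the helper name and folder layout are hypothetical.

import glob
import numpy
import skimage.io
import skimage.transform

def infer_folder( graph, labels, images_folder ):
    # Run every .jpg in the folder through the same pre-process / LoadTensor /
    # GetResult flow used for the single-image case.
    for image_file in sorted( glob.glob( images_folder + '/*.jpg' ) ):
        img = skimage.io.imread( image_file )
        img = skimage.transform.resize( img, IMAGE_DIM, preserve_range=True )
        img = img[:, :, ::-1]                       # RGB -> BGR for Caffe models
        img = ( img.astype( numpy.float32 ) - IMAGE_MEAN ) * IMAGE_STDDEV
        graph.LoadTensor( img.astype( numpy.float16 ), image_file )
        output, userobj = graph.GetResult()
        print( userobj, '->', labels[ output.argmax() ] )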

Further reading

Non-standard Math functions invsqrt* Removed from "math.h" from Intel(R) C++ Compiler 18.0


Starting with Intel(R) C++ Compiler 18.0, Intel's "math.h" contains only the functions required by the C99/C11 standards. Non-standard math functions, including invsqrt, invsqrtf, and invsqrtl, have been moved to the Intel-specific math header "mathimf.h".

Developers who called those non-standard functions through "math.h" in previous versions will encounter compiler errors like the following after upgrading to 18.0:

../pair_tersoff_intel.cpp(1375): error: no instance of overloaded function "invsqrt" matches the argument list
            argument types are: (float)
    fvec rikinv = invsqrt(rsq2);
                  ^

There are two ways to resolve the compiler errors:

1. Include "mathimf.h" when compiling with the Intel compiler:

#ifdef __INTEL_COMPILER
#include <mathimf.h>
#endif

 

2. Declare those functions yourself. For example, for C++ code, declare:

extern double invsqrt(double __x);
extern float invsqrtf(float __x);

Then compile and link with icc so that the Intel(R) Math Library is linked.

 

Using the CPU for Effective and Efficient Medical Image Analysis


A Quantitative Report Based on the Alibaba Tianchi Healthcare AI Competition 2017

Overview

This paper is based on the Tianchi Healthcare AI Competition, an online challenge for automatically detecting lung nodules from computed tomography (CT) scans, cosponsored by Alibaba Cloud, Intel, and LinkDoc. In October 2017, this competition successfully concluded after an intense seven-month competition among 2,887 teams across the globe. This competition was hosted on Alibaba’s public cloud service that was completely built upon Intel’s deep learning hardware and software stack. Intel was deeply engaged in the architecture design, hardware and software development, performance optimization, and online support throughout the competition and thus obtained many insights into the medical artificial intelligence (AI) domain. This paper reports on the key findings taken from the experiments.

First, we implemented a 3D convolutional neural network (CNN) model to reflect state-of-the-art lung nodule detection, according to the common philosophy among all the Tianchi participants.

Second, we trained the model with input data of different resolutions and quantitatively proved that a model trained with higher-resolution data achieves higher detection performance, especially for small nodules. At the same time, the higher-resolution model consumed much more memory than the lower-resolution ones.

Third, we compared the behaviors of general-purpose computing on graphics processor units (GPGPU) and CPU and proved that CPU architecture can provide larger memory capacity, which thus enables medical AI developers to explore the higher-resolution designs so as to pursue the optimal detection performance.

We also introduced a customized deep learning framework, called Extended-Caffe*10, the core of Tianchi’s software stack, as an example to demonstrate that CPU architecture can support highly efficient 3D CNN computations, so that people can both effectively and efficiently use CPU for 3D CNN model development.

Background

The Tianchi Healthcare AI Competition1 is the first AI healthcare competition in China and the only one of its kind worldwide in terms of scale and data volume. Sixteen top domestic cancer hospitals in China provided labeled lung CT scans of nearly 3,000 patients for this competition. Lung nodule detection was chosen because the incidence of lung cancer has increased in China during the past 30 years, and lung cancer has already become the number one cause of death among all cancers. Therefore, the early screening of lung nodules is an urgent problem that needs to be addressed immediately2. After an intense seven-month online competition among 2,887 teams across the globe, the team from Peking University won the contest.

This online competition was hosted on Alibaba’s public cloud service that was completely built upon Intel’s deep learning hardware and software stack. The underlying hardware infrastructure is a cluster of Intel® Xeon® and Intel® Xeon Phi™ processor-based platforms, which offer a total of 400+ TFLOPS computing power. Intel also offered a series of deep learning software components to facilitate model training and inference, the core of which was a customized deep learning framework, called Extended-Caffe, which was specifically optimized for medical AI usages. Intel experts also helped hundreds of online participants run their models efficiently, and, in return, obtained valuable insights into the medical AI domain.

This competition revealed that, although deep learning has been applied to computer vision for more than a decade, medical image analysis still poses unique and significant challenges to domain experts and engineers. In particular, almost all state-of-the-art solutions for medical image analysis rely heavily on 3D, or even 4D/5D, CNNs, which call for very different engineering considerations compared to the 2D ones we often see in other areas of computer vision. We found that the CPU platform, compared to the traditional GPGPU platform, can more effectively support 3D CNNs for medical image analysis due to the CPU's advantage of large memory capacity, while keeping high computing efficiency through delicate algorithm implementations for 3D CNN primitives.

Although the Tianchi dataset and models are confidential, this paper discusses our key findings based on self-developing and then experimenting on a 3D CNN model that can reflect state-of-the-art lung nodule detections. The following sections describe how we preprocessed the CT dataset, designed our model, implemented highly efficient 3D primitives on the CPU, conducted our experiments, and then drew conclusions through quantitative analysis.

CT Data Preprocessing

Every raw CT image contains a sequence of 2D images. The interval between 2D images is called the Z-interval. Every 2D image is a matrix of gray-scale pixels, where the horizontal and vertical intervals between pixels are called the X- and Y-intervals, respectively. These intervals are measured in millimeters. Because CT instruments often differ from one another, different raw CT images have different intervals. For example, in the LUNA'16 dataset3, Z-intervals range from 0.625 mm to 2.5 mm. The Tianchi dataset has a similar situation. In order to make a deep learning model work on a unified dataset, we have to interpolate the original pixels in the X, Y, and Z directions using a fixed sampling distance, so that a raw CT is converted to a new 3D image where the pixel-to-pixel intervals in all three directions equal the sampling distance. Then, if we measure things on a pixel basis, the sizes (that is, resolutions) of the new 3D images and of the nodules are determined by the sampling distance. Table 1 shows that smaller sampling distances lead to higher resolutions. Note that unlike ordinary object detection, lung nodule detection suffers from a unique problem: a nodule takes up only about one millionth of the whole CT. Therefore, in order for a model to effectively extract the features of nodules, we must crop the CT image into smaller 3D regions, and then feed those crops one by one into the model. Again, smaller sampling distances lead to bigger crops.

Table 1. Different Sampling Distances Generate Different Resolutions of 3D Data

Sampling Distance (mm) | 3D Image Resolution (Pixel x Pixel x Pixel) | Nodule Resolution (Diameter: Pixel) | Crop Resolution (Pixel x Pixel x Pixel)
1.00 | 249 x 256 x 302 | 3.66 | 128 x 128 x 128
1.33 | 188 x 196 x 231 | 2.74 | 96 x 96 x 96
2.00 | 127 x 136 x 159 | 1.83 | 64 x 64 x 64
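For illustration, the interpolation step described above can be sketched with SciPy. The function and variable names are ours (not from the Tianchi code), and linear interpolation is assumed.

import numpy as np
from scipy.ndimage import zoom

def resample_ct(volume, spacing_zyx, sampling_distance_mm=1.0):
    """volume: 3D numpy array; spacing_zyx: (Z, Y, X) intervals in mm."""
    zoom_factors = [s / sampling_distance_mm for s in spacing_zyx]
    # Linear interpolation (order=1) is usually sufficient for gray-scale CT data.
    return zoom(volume, zoom_factors, order=1)

# Example: a scan with a 2.5 mm Z-interval and 0.7 mm X/Y intervals,
# resampled to an isotropic 1.0 mm grid:
# resampled = resample_ct(raw_ct, (2.5, 0.7, 0.7), 1.0)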

Our 3D CNN Model for Lung Nodule Detection


Figure 1. Our 3D CNN model architecture (crop size = 128 x 128 x 128).

Using the common philosophy of prior networks4‒6 and Tianchi models as our guides, we constructed a 3D CNN model for lung nodule detection, as shown in Figure 1, which is divided into down-sampling and up-sampling parts. The down-sampling part consisted of five 3D residual blocks interleaved with four pooling layers. Each residual block was made up of convolution, batch normalization, ReLU, and other operations, together with a residual structure (C1 and C2 in Figure 1). The up-sampling was done through two de-convolutions (Deconv in Figure 1). We combined the output of each deconvolution with the output of the corresponding down-sampling layer, so as to generate the feature maps, which contained both local and global information of the original input data.

For each input crop (m x m x m), our model generated (m/4) x (m/4) x (m/4) x 3 bounding cubes, called candidates, and then associated each cube with a probability (that is, the possibility that this cube was a nodule), the coordinates of the cube’s center, and the size of the cube.

Usually post-processing includes a false-positive reduction and other steps, following the model, to filter out false positive candidates. However, since this paper focuses on engineering considerations that just impact the effectiveness and efficiency of the model itself, we didn’t develop these. But, even without the enhancements by such post-processing steps, we submitted our trained model (CCELargeCubeCnn9) to the LUNA’16 competition and ranked number 14 in its LeaderBoard, which demonstrates that our model indeed reflects state-of-the-art lung nodule detection.

Highly Efficient 3D CNN Primitives on the CPU

Our experiments compared the effectiveness between the GPGPU and CPU platforms, in terms of running our model with different resolutions and hyperparameters (for example, batch size). Therefore, the computing efficiency on a CPU platform must be guaranteed first, especially for 3D convolution, the most frequently used primitive.

3D Convolution on the CPU

We implemented a highly efficient 3D convolution primitive on the CPU, by leveraging the highly optimized 2D convolution in the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN)7 (see Figure 2). First, we treated the 3D data and kernels as a group of 2D slices. Then, a 3D convolution is equivalent to convolving the corresponding 2D slices (having the same color in Figure 2), and then summing all the intermediate results together. Because the Intel MKL-DNN 2D convolutions are extremely optimized on the CPU, our 3D convolutions can also run highly efficiently on the CPU.


Figure 2. Highly efficient 3D convolution (leveraging the Intel® Math Kernel Library for Deep Neural Networks 2D convolution).
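The small NumPy/SciPy sketch below illustrates the decomposition in Figure 2: each output slice is the sum of 2D correlations between matching input and kernel slices. In Extended-Caffe the per-slice work is done by the Intel MKL-DNN 2D primitive; scipy.signal.correlate2d merely stands in for it here.

import numpy as np
from scipy.signal import correlate2d

def conv3d_from_2d(volume, kernel):
    """'valid' 3D cross-correlation built from 2D slice correlations."""
    D, H, W = volume.shape
    KD, KH, KW = kernel.shape
    out = np.zeros((D - KD + 1, H - KH + 1, W - KW + 1), dtype=np.float32)
    for z in range(out.shape[0]):
        for kz in range(KD):
            # 2D correlation of the matching input and kernel slices,
            # accumulated into the corresponding output slice
            out[z] += correlate2d(volume[z + kz], kernel[kz], mode='valid')
    return out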

In order to show the effect, we also provided a baseline implementation, called single-precision floating-point general matrix multiplication (GEMM) based 3D convolution, which follows a more straightforward philosophy. Figure 3 illustrates how a 2D GEMM-based convolution works, that is, how data and kernels are rearranged so that a matrix multiplication computes the original convolution. Then, we applied the same idea to 3D data and kernels to get a 3D GEMM-based convolution. Since the core computations are matrix multiplications, for which we could leverage the highly optimized SGEMM implementation in the Intel® Math Kernel Library (Intel® MKL)8, this baseline implementation could actually achieve reasonable performance on the CPU.


Figure 3. GEMM-based 2D convolution (leveraging the Intel® Math Kernel Library SGEMM).
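A simple sketch of the GEMM-based baseline in Figure 3: the input is unrolled into a patch matrix (im2col) so that a single matrix multiplication, the role played by Intel MKL SGEMM in the baseline, produces the convolution output. This is an illustration of the idea for the 2D, single-kernel case, not the Extended-Caffe implementation.

import numpy as np

def conv2d_gemm(image, kernel):
    H, W = image.shape
    KH, KW = kernel.shape
    OH, OW = H - KH + 1, W - KW + 1
    # im2col: one row per output position, one column per kernel element
    cols = np.empty((OH * OW, KH * KW), dtype=np.float32)
    for i in range(OH):
        for j in range(OW):
            cols[i * OW + j] = image[i:i + KH, j:j + KW].ravel()
    # One matrix product replaces the sliding-window loop.
    return (cols @ kernel.ravel()).reshape(OH, OW)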

Figure 4 shows that our highly efficient 3D convolution implementation outperformed the GEMM-based one. The forward pass was accelerated by 4X, backward pass by 30 percent, and overall by 2X.


Figure 4. Execution time comparison (highly efficient 3D convolution versus GEMM-based 3D convolution).

Other 3D CNN Primitives on the CPU

In addition to 3D convolution, the efficiency of all related 3D primitives must be guaranteed. Thanks to Intel MKL and Intel MKL-DNN, we implemented highly efficient 3D primitives, including 3D batch normalization, 3D deconvolution, 3D pooling, 3D softmax loss, 3D cross-entropy loss, 3D smooth L1 loss, 3D concat, and so on, and then packaged them into Extended-Caffe10, the core of the Tianchi software stack. Figure 5 shows the overall efficiency improvement for the training and inference of our model.


Figure 5. Overall efficiency improvement (optimized versus unoptimized). Training time is measured per iteration (that is, the time to process one crop), while inference time is measured per CT image (that is, the time to process all the crops of one CT image).

Experimental Results and Quantitative Analysis

Model Training

We used stochastic gradient descent (SGD) with a stepwise learning rate strategy to train our model. All network parameters were initialized randomly, and the initial learning rate was set to 0.01. We trained the model for 100 epochs and downscaled the learning rate by 10X at the 50th and 80th epochs. Figure 6 records the loss trends when we trained on the LUNA'16 dataset (subset 0, as an example). You can see that the model successfully converged on this dataset during the 80th to 100th epochs.


Figure 6. An example of model training (with LUNA’16 subset 0).
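The stepwise schedule described above can be written as a small helper; this is a sketch of the strategy, not the actual training script.

def learning_rate(epoch, base_lr=0.01):
    # Start at 0.01, divide by 10 at epoch 50 and again at epoch 80 of a 100-epoch run.
    if epoch >= 80:
        return base_lr / 100.0
    if epoch >= 50:
        return base_lr / 10.0
    return base_lr

# learning_rate(10) -> 0.01, learning_rate(60) -> 0.001, learning_rate(90) -> 0.0001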

Detection Performance Evaluation

The method that well-known competitions like Tianchi and LUNA'16 use to evaluate a model's detection performance is called FROC (free-response receiver operating characteristic)11. A candidate (that is, a bounding cube) is considered to be a true positive if the distance between the center of this candidate and the center of the ground-truth nodule is shorter than the radius of the nodule. A FROC score is then calculated to quantify the sensitivity of the model versus the average number of false positives.
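For clarity, the hit criterion can be expressed as a short helper; the function name is illustrative.

import numpy as np

def is_true_positive(candidate_center, nodule_center, nodule_diameter):
    # A candidate counts as a hit when its center lies within the
    # ground-truth nodule's radius.
    distance = np.linalg.norm(np.asarray(candidate_center) - np.asarray(nodule_center))
    return distance < nodule_diameter / 2.0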

The Impact of Resolution

Figure 7 shows the different FROC scores of our 3D CNN model versus the different resolutions of the input data that were used for the model training. We can see that the model trained with higher resolutions achieved higher FROC scores. This experiment quantitatively proved that higher resolution can help improve the detection performance of a model.


Figure 7. Higher resolution leads to higher FROC scores.

Because human radiologists can easily detect the large nodules but find it much harder to detect the smaller ones, the capability of an AI solution to detect small nodules is more in demand. Figure 8 compares our model’s accuracies among different resolutions, in terms of detecting the nodules of different sizes. We can see that higher resolutions can especially improve the detection performance on smaller nodules.


Figure 8. Higher resolution improves the detection accuracy on smaller nodules.

Memory Consumption Analysis

We analyzed the memory consumption when training our model with different resolutions, and then compared the behaviors of the CPU platform and the GPGPU platform. Figures 9 (a) and (b) record the cases where the training batch size equals 1 and 4, respectively. When the batch size equals 1, a modern GPGPU with 12 GB of memory can only support up to 128 x 128 x 128 resolution, while a CPU platform with 384 GB of memory can easily support up to 448 x 448 x 448. When the batch size equals 4, the GPGPU gets worse: only up to 96 x 96 x 96 can be supported, while the CPU can easily support up to 256 x 256 x 256.


(a) Batch size=1


(b) Batch size=4

Figure 9. Memory consumption versus different resolutions.

Since a modern CPU server has terabytes of memory capacity, which will be especially true once Intel's Apache Pass technology becomes available, the CPU platform can offer nearly unlimited flexibility for model designers in the medical image analysis domain to explore extremely high-resolution solutions in the pursuit of optimal detection performance.

Summary

The Tianchi Healthcare AI Competition, co-sponsored by Alibaba Cloud, Intel, and LinkDoc, was a cloud-based AI challenge built upon Intel’s deep learning hardware and software. In this paper, derived from Tianchi, we discussed our key insights into the medical AI domain based on our experiments. First, we self-developed a 3D CNN model that reflects the state-of-the-art in lung nodule detections. Next, we quantitatively proved that a model trained with higher-resolution data would achieve higher detection performance, especially for the small nodules. As a result, the higher-resolution model consumed much more memory. We compared the GPGPU and CPU and proved that the CPU platform, thanks to its large memory capacity, can enable medical AI designers to explore much higher-resolution solutions so as to pursue optimal detection performance. We also introduced the Extended-Caffe framework, as an example, to demonstrate that CPU architecture can support highly efficient 3D CNN computations, so that people can both effectively and efficiently use CPU for 3D CNN model development.

Reference

  1.  https://tianchi.aliyun.com/getStart/introduction.htm?raceId=231601
  2. Bush I., Lung nodule detection and classification. Technical report, Stanford Computer Science, 2016.
  3. https://luna16.grand-challenge.org/home/
  4. Girshick, R. Fast R-CNN. Computer Science, 2015.
  5. Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation, Medical Image Computing and Computer-Assisted Intervention — MICCAI 2015. Springer International Publishing, 2015:234‒241.
  6. https://github.com/lfz/DSB2017
  7. https://software.intel.com/en-us/articles/intel-mkl-dnn-part-1-library-overview-and-installation
  8. https://software.intel.com/en-us/mkl
  9. https://luna16.grand-challenge.org/results/
  10. https://github.com/extendedcaffe/extended-caffe
  11. http://devchakraborty.com/Receiver%20operating%20characteristic.pdf

Alternative Platforms to the Intel® Joule™ Module


Overview

For developers interested in alternatives to the Intel® Joule™ platform (discontinued), learn how the Intel® Joule™ platform compares to some of the latest embedded platforms from Intel (formerly codenamed Apollo Lake). We cover feature comparison, design considerations and then a comparison of the Intel® Joule™ Developer Kit with the latest IoT developer kit from Intel (UP Squared* Grove* Development Kit).

Audience

Developers looking for general guidance on how the Intel® Joule™ module compares to some of the latest embedded platforms from Intel (formerly codenamed Apollo Lake).

Feature Comparison

Below is a comparison of features between the Intel® Joule™ module and embedded processor products formerly codenamed Apollo Lake.

Product Name | Intel® Joule™ 550x or 570x modules | Intel® Celeron® and Pentium® processors | Intel Atom® processor E3900 series
Codename | Broxton | Apollo Lake | Apollo Lake
Status | Launched | Launched | Announced
Recommended Customer Pricing | Discontinued (was priced at $149 - $159 or $199 - $209) | $107 or $161 | n/a
Processor Number | n/a | N3350; N4200 | E3930; E3940; E3950
CPU cores | 4 | 2 or 4 | 2 or 4
Processor Base Frequency | 1.5 or 1.7 GHz | 1.1 GHz | 1.3 or 1.6 GHz
Burst Frequency | 2.4 GHz on 570x | 2.4 or 2.5 GHz | 1.8 or 2.0 GHz
Max Memory Size | 3 or 4 GB | 8 GB | 8 GB
Memory Types | LPDDR4 | DDR3L/LPDDR3 or LPDDR4 | DDR3L (ECC and non-ECC) or LPDDR4
Flash Memory | 8 or 16 GB eMMC | Up to 64 GB eMMC | Up to 64 GB eMMC
Cache | 1 MB | 2 MB | 2 MB
# of USB Ports | 1 or 2 USB 3.0 | 8 (6 USB 3.0) | 8 (6 USB 3.0)
Total # of SATA Ports | 0 | 2 | 2
Max # of PCI Express Lanes | 0 or 1 | 6 | 6
Graphics Output | HDMI 1.4B and MIPI-DSI (1x4) | eDP/DP/HDMI*/MIPI-DSI | eDP/DP/HDMI/MIPI-DSI
Processor Graphics | Intel® HD Graphics, Gen 9 | Intel® HD Graphics 500 or 505 | Intel® HD Graphics 500 or 505
OS | Windows® 10 IoT Core; Ubuntu; Reference Linux* OS for IoT | Linux*; Windows® 10 Enterprise | Windows® 10 Enterprise; Windows® 10 IoT Core; Wind River Linux*; VxWorks*; Android*
Intel® High Definition Audio (Intel® HD Audio) Technology | No | Yes | Yes
Operating Temperature Range | 0°C to 70°C | 0°C to 70°C (commercial applications) | -40°C to 85°C (extended temperature range for industrial applications)
Power Delivery | PMIC | PMIC / discrete voltage regulator (VR) | PMIC / discrete voltage regulator (VR)
Sleep States | S0ix | S0ix, S3, S4, S5 | S0ix, S3, S4, S5
Security Features | Intel® AES New Instructions (Intel® AES-NI) | Intel® Trusted Execution Engine (Intel® TXE); Intel® AES-NI | Intel® TXE; Intel® AES-NI
Package Size | 24mm x 48mm | 24mm x 31mm | 24mm x 31mm

Design Considerations

This section presents design considerations for developers who are interested in alternative platforms to the Intel® Joule™ module. You may be a developer interested in transitioning from the Intel® Joule™ platform to the latest embedded processor products from Intel to take advantage of new features. Or you may have been considering developing with the Intel® Joule™ module but, now that it has been discontinued, must choose another platform to develop on. Below we outline important design considerations for both kinds of developers. Here we focus on comparing the Intel® Joule™ platform to the Intel Atom® processor E3900 series.

  • Form factor
    The Intel Atom® processor E3900 series (formerly codenamed Apollo Lake) board area will probably increase because of a larger SoC package size, larger Power Management IC (PMIC) and Voltage Regulator (VR) solution space, and memory down (i.e. not package-on-package).
  • Performance Differences
    Lower operating frequencies on the latest generation of Intel Atom® processor E3900 series and a smaller cache size per core pair (e.g. 2MB vs 1MB) may affect performance. Memory configuration differences may also have an impact, since the Intel Atom® processor E3900 series has higher peak bandwidth but a lower transfer rate.
  • I/O Interface Limitations
    The Intel Atom® processor E3900 series supports a single LPSS SPI port, compared to the Intel® Joule™ module's two LPSS SPI ports. The Intel® Joule™ module supports USB 2.0 and USB 3.0 OTG, while the Intel Atom® processor E3900 series supports USB 2.0 and USB 3.0 dual-role (it does not support OTG).
  • Completing design regulatory testing
    A design with the Intel Atom® processor E3900 series will need to go through various types of emissions certifications, safety certifications, and environmental certifications.
  • Driver Compatibility
    Register compatibility and I/O location compatibility from an Intel® Joule™ module to an Intel Atom® processor E3900 series may require driver changes.
  • Additional Features of the Intel Atom® processor E3900 series
    The Intel Atom® processor E3900 series has some new features and interfaces over Intel® Joule™ modules. Taking advantage of these interfaces and features may extend design and validation time of a migration, when compared to a situation where no new features are added.
  • Wireless Technology
    There is no integrated Wi-Fi and Bluetooth® on the Intel Atom® processor E3900 series.
  • Power Management
    The Intel® Joule™ module does not support traditional PC sleep states (S3, S4, S5), while the Intel Atom® processor E3900 series does.

Comparison with the latest IoT Developer Kit from Intel

Below is a table comparing the features of the Intel® Joule™ module with the latest IoT Developer Kit from Intel, the UP Squared* Grove* IoT Development Kit.

Feature | Intel® Joule™ 550x Developer Kit | UP Squared* Grove* IoT Development Kit
Type | Computer on a module | Single board computer
Price | Discontinued (was ~$250) | Starting from $249

Processor
Processor family | Intel® Atom™ | Intel® Celeron®
Codename | Broxton | Apollo Lake
Processor model |  | N3350
Processor frequency | 1.7 GHz | 1.1 GHz
Processor boost frequency | 2.4 GHz | 2.4 GHz
Processor cores | 4 | 2
64-bit computing | Yes | Yes

Memory
Maximum internal memory | 3 or 4 GB | 2 GB
eMMC | Yes | Yes

Ports and Interfaces
Wi-Fi | Yes | No
Bluetooth | Yes | No
HDMI | Yes | Yes
SATA | No | Yes
Mini PCIe* | No | Yes
M.2 | No | Yes
Raspberry Pi* header | No | Yes

Board and Dimensions
Carrier board | Mandatory. Cost for an Intel development carrier board is around $100 | Not required
Board dimensions | 24 x 48 mm | 86.5 x 90 mm
Sensors and power supply included | No | Yes

Software
Linux operating systems supported | Ubuntu* 16.04, Ubuntu Core | Ubuntu 16.04 (pre-installed), Ubilinux, Yocto
Windows operating systems supported | Windows® 10 IoT Core | Windows® 10, Windows® 10 IoT Core
Android operating systems supported | n/a | Android 6.0
Support for Arduino Create* and Intel® System Studio 2018 | No | Yes

Graphics
On-board graphics | Intel® HD Graphics, Gen9 | Intel® HD Graphics 500, Gen9

Conclusion

The Intel® Joule™ module, now discontinued, was a compact yet powerful modular device that included wireless and video capabilities. The feature comparison and design considerations sections of this paper serve as general guidance for developers interested in alternatives to the Intel® Joule™ platform, and the last section presented the UP Squared* Grove* IoT Development Kit as a complete kit alternative to the Intel® Joule™ platform.

For developers who are not interested in the kit alternative and instead plan to select an individual processor product, please note that you may need to work with a hardware vendor to create a custom board. However, there are some alternate products (modular in nature) available through the Solutions Directory from Intel to consider:

Up Squared Grove IoT Development Kit

The Power of the Personal Assistant


Intel and Amazon Give Voice to Smart Homes of the Future

Our Homes are Becoming Smart

Our living spaces and the technologies in them are becoming smarter every day, working to enrich our daily lives, helping manage household tasks and providing peace of mind.

Sit Down, Speak Up

Voice-controlled technologies that listen, speak and converse are emerging in every corner of the home.

  • Technology like TVs, speakers, alarms, and sprinkler systems will be speech-enabled.
  • Devices can become responsive, perceptive and autonomous.

The Power of a Personal Assistant

The Intel® Speech Enabling Developer Kit is a complete audio front-end solution for far-field voice control, accelerating the design of products while enabling manufacturers to integrate the 8-mic circular array with Amazon* Alexa* Voice Service.

Product developers can add voice to a range of form factors, with far-field voice capabilities, speech recognition, amazing acoustics and low power requirements.

Intel® Speech Enabling Developer Kit

Intel and Amazon have collaborated to make it easier for developers to add intelligent, far-field voice with Alexa Voice Service.

For a complete overview of the kit, refer to the PDF attached to this article.

Order Your Kit to Start Innovating Today

More Info on the Intel® Speech Enabling Developer Kit

Dynamic Device Personalization for Intel® Ethernet 700 Series


Download pcap file

Introduction

To address the ever-changing requirements for both cloud and network functions virtualization, the Intel® Ethernet 700 Series was designed from the ground up to provide increased flexibility and agility. One of the design goals was to take parts of the fixed pipeline used in Intel® Ethernet 500 Series, 82599, X540, and X550, and move to a programmable pipeline allowing the Intel Ethernet 700 Series to be customized to meet a wide variety of customer requirements. This programmability has enabled over 60 unique configurations all based on the same core silicon.

Even with so many configurations being delivered to the market, the expanding role that Intel® architecture is taking in the telecommunications market requires even more custom functionality, the most common of which are new packet classification types that are not currently supported, are customer-specific, or may not even be fully defined yet. To address this, a new capability has been enabled on the Intel Ethernet 700 Series network adapters: Dynamic Device Personalization (DDP).

This article describes how the Data Plane Development Kit (DPDK) is used to program and configure DDP profiles. It focuses on the GTPv1 profile, which can be used to enhance performance and optimize core utilization for virtualized enhanced packet core (vEPC) and multi-access edge computing (MEC) use cases.

DDP allows dynamic reconfiguration of the packet processing pipeline to meet specific use case needs on demand, adding new packet processing pipeline configuration profiles to a network adapter at run time, without resetting or rebooting the server. Software applies these custom profiles in a nonpermanent, transaction-like mode, so the original network controller’s configuration is restored after network adapter reset, or by rolling back profile changes by software. The DPDK provides all APIs to handle DDP packages.

The ability to classify new packet types inline, and distribute these packets to specified queues on the device’s host interface, delivers a number of performance and core utilization optimizations:

  • Removes requirement for CPU cores on the host to perform classification and load balancing of packet types for the specified use case.
  • Increases packet throughput; reduces packet latency for the use case.

In the case that multiple network controllers are present on the server, each controller can have its own pipeline profile, applied without affecting other controllers and software applications using other controllers.

DDP Use Cases

By applying a DDP profile to the network controller, the following use cases can be addressed.

  • New packet classification types (flow types) for offloading packet classification to network controller:
    • IP protocols in addition to TCP/UDP/SCTP; for example, IP ESP (Encapsulating Security Payload), IP AH (authentication header)
    • UDP Protocols; for example, MPLSoUDP (MPLS over UDP) or QUIC (Quick UDP Internet Connections)
    • TCP subtypes, like TCP SYN-no-ACK (Synchronize without Acknowledgment set)
    • Tunnelled protocols, like PPPoE (Point-to-Point Protocol over Ethernet), GTP-C/GTP-U (GPRS Tunnelling Protocol-control plane/-user plane)
    • Specific protocols, like Radio over Ethernet
  • New packet types for packets identification:
    • IP6 (Internet protocol version 6), GTP-U, IP4 (Internet protocol version 4), UDP, PAY4 (Pay 4)
    • IP4, GTP-U, IP6, UDP, PAY4
    • IP4, GTP-U, PAY4
    • IP6, GTP-C, PAY4
    • MPLS (Multiprotocol Label Switching), IP6, TCP, PAY4

DDP GTP Example


Figure 1. Steps to download GTP profile to Intel® Ethernet 700 Series network adapter.

The original firmware configuration profile can be updated in transaction-like mode. After applying a new profile, the network controller reports back the previous configuration, so original functionality can be restored at runtime by rolling back changes introduced by the profile, as shown in Figure 2.


Figure 2. Processing DDP profiles.

Personalization profile processing steps, depicted in Figure 2:

  1. Original firmware configuration; no profile applied.
  2. On applying a new profile, the firmware returns the original configuration in the profile's buffer.
  3. Writing the returned configuration back to the hardware will restore original state.

Firmware and Software Versions

DDP requires an Intel Ethernet 700 Series network adapter with the latest firmware 6.01.

Basic support for applying DDP profiles to Intel Ethernet 700 Series network adapters was added to DPDK 17.05. DPDK 17.08 and 17.11 introduced more advanced DDP APIs, including the ability to report a profile's information without loading a profile to an Intel Ethernet 700 Series network adapter first. These APIs can be used to try out new DDP profiles with DPDK without implementing full support for the protocols in the DPDK rte_flow API.

DPDK APIs

The following three calls are part of DPDK 17.08:

rte_pmd_i40e_process_ddp_package(): This function is used to download a DDP profile and register it, or to roll back a DDP profile and unregister it.

int rte_pmd_i40e_process_ddp_package(
	uint8_t port,  /* DPDK port index to download DDP package to */
	uint8_t *buff, /* buffer with the package in the memory */
	uint32_t size, /* size of the buffer */
	rte_pmd_i40e_package_op op /* operation: add, remove, write profile */
);

rte_pmd_i40e_get_ddp_info(): This function is used to request information about a profile without downloading it to a network adapter.

int rte_pmd_i40e_get_ddp_info(
	uint8_t *pkg_buff,  /* buffer with the package in the memory */
	uint32_t pkg_size,  /* size of the package buffer */
	uint8_t *info_buff, /* buffer to store information to */
	uint32_t info_size, /* size of the information buffer */
	enum rte_pmd_i40e_package_info type /* type of required information */
);

rte_pmd_i40e_get_ddp_list(): This function is used to get the list of registered profiles.

int rte_pmd_i40e_get_ddp_list (
	uint8_t port,  /* DPDK port index to get list from */
	uint8_t *buff, /* buffer to store list of registered profiles */
	uint32_t size  /* size of the buffer */
);

DPDK 17.11 adds some extra DDP-related functionality:

rte_pmd_i40e_get_ddp_info(): Updated to retrieve more information about the profile.

New APIs were added to handle the flow types created by DDP profiles:

rte_pmd_i40e_flow_type_mapping_update(): Used to map hardware-specific packet classification type to DPDK flow types.

int rte_pmd_i40e_flow_type_mapping_update(
	uint8_t port, /* DPDK port index to update map on */
	/* array of the mapping items */
	struct rte_pmd_i40e_flow_type_mapping *mapping_items,
	uint16_t count, /* number of PCTYPEs to map */
	uint8_t exclusive /* 0 to overwrite only referred PCTYPEs */
);

rte_pmd_i40e_flow_type_mapping_get(): Used to retrieve current mapping of hardware-specific packet classification types to DPDK flow types.

int rte_pmd_i40e_flow_type_mapping_get(
	uint8_t port, /* DPDK port index to get mapping from */
	/* pointer to the array of RTE_PMD_I40E_FLOW_TYPE_MAX mapping items*/
	struct rte_pmd_i40e_flow_type_mapping *mapping_items
);

rte_pmd_i40e_flow_type_mapping_reset(): Resets flow type mapping table.

int rte_pmd_i40e_flow_type_mapping_reset(
uint8_t port /* DPDK port index to reset mapping on */
);

Using DDP Profiles with testpmd

To demonstrate the DDP functionality of Intel Ethernet 700 Series network adapters, we will use the GTPv1 profile with testpmd. The profile will be publicly released in early 2018. In the meantime, please contact your local Intel representative to get a copy of this profile.

Although DPDK 17.11 adds GTPv1 with IPv4 payload support at rte_flow API level, we will use lower-level APIs to demonstrate how to work with the Intel Ethernet 700 Series network adapter directly for any new protocols added by DDP and not yet enabled in rte_flow.

For demonstration, we will need GTPv1-U packets with the following configuration:

Source IP                1.1.1.1
Destination IP           2.2.2.2
IP Protocol              17 (UDP)
GTP Source Port          45050
GTP Destination Port     2152
GTP Message type         0xFF
GTP Tunnel id            0x11111111-0xFFFFFFFF random
GTP Sequence number      0x000001
-- Inner IPv4 Configuration --------------
Source IP                3.3.3.1-255 random
Destination IP           4.4.4.1-255 random
IP Protocol              17 (UDP)
UDP Source Port          53244
UDP Destination Port     57069

Figure 3. GTPv1 GTP-U packets configuration.

As you can see, the outer IPv4 header does not provide any entropy for RSS, as its IP addresses and UDP ports are defined statically. The GTPv1 header, however, carries random tunnel endpoint identifier (TEID) values in the range 0x11111111 to 0xFFFFFFFF, and the inner IPv4 packet has IP addresses whose host octet is randomly generated in the range 1 to 255.

A pcap file with synthetic GTPv1-U traffic using the configuration above is provided alongside this article.

We will use the latest version of testpmd from the DPDK 17.11 release. First, start testpmd in receive-only mode with four queues, and enable RSS and verbose mode:

testpmd -w 02:00.0 -- -i --rxq=4 --txq=4 --forward-mode=rxonly
testpmd> port config all rss all
testpmd> set verbose 1
testpmd> start

Figure 4. testpmd startup configuration.

Using any GTP-U capable traffic generator, send four GTP-U packets. A provided pcap file with synthetic GTPv1-U traffic can be used as well.

As all packets have the same outer IP header, they are received on queue 1 and reported as IPv4 UDP packets:

testpmd> port 0/queue 1: received 4 packets
src=3C:FD:FE:A6:21:24 - dst=00:10:20:30:40:50 - type=0x0800 - length=178 - nb_segs=1 - RSS hash=0xd9a562 - RSS queue=0x1 - hw ptype: L2_ETHER L3_IPV4_EXT_UNKNOWN L4_UDP  - sw ptype: L2_ETHER L3_IPV4 L4_UDP  - l2_len=14 - l3_len=20 - l4_len=8 - Receive queue=0x1
  ol_flags: PKT_RX_RSS_HASH PKT_RX_L4_CKSUM_GOOD PKT_RX_IP_CKSUM_GOOD

src=3C:FD:FE:A6:21:24 - dst=00:10:20:30:40:50 - type=0x0800 - length=178 - nb_segs=1 - RSS hash=0xd9a562 - RSS queue=0x1 - hw ptype: L2_ETHER L3_IPV4_EXT_UNKNOWN L4_UDP  - sw ptype: L2_ETHER L3_IPV4 L4_UDP  - l2_len=14 - l3_len=20 - l4_len=8 - Receive queue=0x1
  ol_flags: PKT_RX_RSS_HASH PKT_RX_L4_CKSUM_GOOD PKT_RX_IP_CKSUM_GOOD

src=3C:FD:FE:A6:21:24 - dst=00:10:20:30:40:50 - type=0x0800 - length=178 - nb_segs=1 - RSS hash=0xd9a562 - RSS queue=0x1 - hw ptype: L2_ETHER L3_IPV4_EXT_UNKNOWN L4_UDP  - sw ptype: L2_ETHER L3_IPV4 L4_UDP  - l2_len=14 - l3_len=20 - l4_len=8 - Receive queue=0x1
  ol_flags: PKT_RX_RSS_HASH PKT_RX_L4_CKSUM_GOOD PKT_RX_IP_CKSUM_GOOD

src=3C:FD:FE:A6:21:24 - dst=00:10:20:30:40:50 - type=0x0800 - length=178 - nb_segs=1 - RSS hash=0xd9a562 - RSS queue=0x1 - hw ptype: L2_ETHER L3_IPV4_EXT_UNKNOWN L4_UDP  - sw ptype: L2_ETHER L3_IPV4 L4_UDP  - l2_len=14 - l3_len=20 - l4_len=8 - Receive queue=0x1
  ol_flags: PKT_RX_RSS_HASH PKT_RX_L4_CKSUM_GOOD PKT_RX_IP_CKSUM_GOOD

Figure 5. Distribution of GTP-U packets without GTPv1 profile.

As we can see, hash values for all four packets are the same: 0xD9A562. This happens because IP source/destination addresses and UDP source/destination ports in the outer (tunnel end point) IP header are statically defined and do not change from packet to packet; see Figure 3.

Now we will apply a GTP profile to a network adapter port. For the purpose of the demonstration, we will assume that the profile package file was downloaded and extracted to the /home/pkg folder. The profile will load from the gtp.pkgo file and the original configuration will be stored to the gtp.bak file:

testpmd> stop
testpmd> port stop 0
testpmd> ddp add 0 /home/pkg/gtp.pkgo,/home/pkg/gtp.bak

Figure 6. Applying GTPv1 profile to device.

The 'ddp add 0 /home/pkg/gtp.pkgo,/home/pkg/gtp.bak' command first loads the gtp.pkgo file to the memory buffer, then passes it to rte_pmd_i40e_process_ddp_package() with the RTE_PMD_I40E_PKG_OP_WR_ADD operation, and then saves the original configuration, returned in the same buffer, to the gtp.bak file.
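For readers working at the API level rather than through testpmd, a minimal C sketch of the same sequence is shown below. It is an illustration built on the rte_pmd_i40e_process_ddp_package() signature above (file handling simplified, error checks omitted), not testpmd's actual implementation.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <rte_pmd_i40e.h>

/* Sketch of what 'ddp add <port> <profile>,<backup>' does internally:
 * read the profile file into a memory buffer, pass it to the driver with
 * the WR_ADD operation, and save the original configuration (returned in
 * the same buffer) so it can later be written back with
 * RTE_PMD_I40E_PKG_OP_WR_DEL to roll the change back. */
static int apply_ddp_profile(uint8_t port, const char *pkg_path,
                             const char *backup_path)
{
	FILE *f = fopen(pkg_path, "rb");
	if (f == NULL)
		return -1;

	fseek(f, 0, SEEK_END);
	long size = ftell(f);
	rewind(f);

	uint8_t *buf = malloc(size);
	fread(buf, 1, size, f);
	fclose(f);

	/* Download the profile and register it on the port. */
	int ret = rte_pmd_i40e_process_ddp_package(port, buf, (uint32_t)size,
						   RTE_PMD_I40E_PKG_OP_WR_ADD);
	if (ret == 0) {
		/* Buffer now holds the original configuration; save it for rollback. */
		FILE *bak = fopen(backup_path, "wb");
		fwrite(buf, 1, size, bak);
		fclose(bak);
	}
	free(buf);
	return ret;
}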

We can confirm that the profile was loaded successfully:

testpmd> ddp get list 0
Profile number is: 1

Profile 0:
Track id:     0x80000008
Version:      1.0.0.0
Profile name: GTPv1-C/U IPv4/IPv6 payload

Figure 7. Checking whether the device has any profiles loaded.

The 'ddp get list 0' command calls rte_pmd_i40e_get_ddp_list() and prints the returned information.

The Track ID is the unique identification number of the profile, distinguishing it from any other profile.
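The same check can be performed programmatically. The sketch below is an illustration built on the rte_pmd_i40e_get_ddp_list() signature above; the buffer size is an assumption, and parsing of the returned list (a profile count followed by per-profile track ID, version, and name, as printed by testpmd above) is left out.

#include <stdlib.h>
#include <stdint.h>
#include <rte_pmd_i40e.h>

/* Sketch of what 'ddp get list <port>' does internally. */
static int query_ddp_profiles(uint8_t port)
{
	uint32_t size = 4096;          /* assumption: large enough for the list */
	uint8_t *buf = malloc(size);
	if (buf == NULL)
		return -1;

	int ret = rte_pmd_i40e_get_ddp_list(port, buf, size);
	/* ... parse and print the profile list stored in buf ... */
	free(buf);
	return ret;
}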

To get information about the new packet classification types and packet types created by the profile:

testpmd> ddp get info /home/pkg/gtp.pkgo
Global Track id:       0x80000008
Global Version:        1.0.0.0
Global Package name:   GTPv1-C/U IPv4/IPv6 payload

i40e Profile Track id: 0x80000008
i40e Profile Version:  1.0.0.0
i40e Profile name:     GTPv1-C/U IPv4/IPv6 payload

Package Notes:
This profile enables GTPv1-C/GTPv1-U classification
with IPv4/IPV6 payload
Hash input set for GTPC is TEID
Hash input set for GTPU is TEID and inner IP addresses (no ports)
Flow director input set is TEID

List of supported devices:
  8086:1572 FFFF:FFFF
  8086:1574 FFFF:FFFF
  8086:1580 FFFF:FFFF
  8086:1581 FFFF:FFFF
  8086:1583 FFFF:FFFF
  8086:1584 FFFF:FFFF
  8086:1585 FFFF:FFFF
  8086:1586 FFFF:FFFF
  8086:1587 FFFF:FFFF
  8086:1588 FFFF:FFFF
  8086:1589 FFFF:FFFF
  8086:158A FFFF:FFFF
  8086:158B FFFF:FFFF

List of used protocols:
  12: IPV4
  13: IPV6
  17: TCP
  18: UDP
  19: SCTP
  20: ICMP
  21: GTPU
  22: GTPC
  23: ICMPV6
  34: PAY3
  35: PAY4
  44: IPV4FRAG
  48: IPV6FRAG

List of defined packet classification types:
  22: GTPU IPV4
  23: GTPU IPV6
  24: GTPU
  25: GTPC

List of defined packet types:
  167: IPV4 GTPC PAY4
  168: IPV6 GTPC PAY4
  169: IPV4 GTPU IPV4 PAY3
  170: IPV4 GTPU IPV4FRAG PAY3
  171: IPV4 GTPU IPV4 UDP PAY4
  172: IPV4 GTPU IPV4 TCP PAY4
  173: IPV4 GTPU IPV4 SCTP PAY4
  174: IPV4 GTPU IPV4 ICMP PAY4
  175: IPV6 GTPU IPV4 PAY3
  176: IPV6 GTPU IPV4FRAG PAY3
  177: IPV6 GTPU IPV4 UDP PAY4
  178: IPV6 GTPU IPV4 TCP PAY4
  179: IPV6 GTPU IPV4 SCTP PAY4
  180: IPV6 GTPU IPV4 ICMP PAY4
  181: IPV4 GTPU PAY4
  182: IPV6 GTPU PAY4
  183: IPV4 GTPU IPV6FRAG PAY3
  184: IPV4 GTPU IPV6 PAY3
  185: IPV4 GTPU IPV6 UDP PAY4
  186: IPV4 GTPU IPV6 TCP PAY4
  187: IPV4 GTPU IPV6 SCTP PAY4
  188: IPV4 GTPU IPV6 ICMPV6 PAY4
  189: IPV6 GTPU IPV6 PAY3
  190: IPV6 GTPU IPV6FRAG PAY3
  191: IPV6 GTPU IPV6 UDP PAY4
  113: IPV6 GTPU IPV6 TCP PAY4
  120: IPV6 GTPU IPV6 SCTP PAY4
  128: IPV6 GTPU IPV6 ICMPV6 PAY4

Figure 8. Getting information about the DDP profile.

The 'ddp get info gtp.pkgo' command makes multiple calls to rte_pmd_i40e_get_ddp_info() to retrieve different information about the profile, and prints it.

There is a lot of information, but we are looking for new packet classifier types:

List of defined packet classification types:
  22: GTPU IPV4
  23: GTPU IPV6
  24: GTPU
  25: GTPC

Figure 9. New PCTYPEs defined by GTPv1 profile.

There are four new packet classification types created in addition to the default PCTYPEs available (see Table 7-5, "Packet classifier types and their input sets," in the latest datasheet).

To enable RSS for GTPv1-U with an IPv4 payload, we need to map packet classifier type 22 to a DPDK flow type. Flow types are defined in rte_eth_ctrl.h; the first 21 are already in use in DPDK 17.11, so new PCTYPEs can be mapped to flow types 22 and up. After mapping the flow type, we can start the port again and enable RSS for flow type 22:

testpmd> port config 0 pctype mapping update 22 22
testpmd> port start 0
testpmd> start
testpmd> port config all rss 22

Figure 10. Mapping new PCTYPEs to DPDK flow types.

The 'port config 0 pctype mapping update 22 22' command calls rte_pmd_i40e_flow_type_mapping_update() to map the new packet classifier type 22 to DPDK flow type 22, so that the 'port config all rss 22' command can enable RSS for this flow type.
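As an illustration of the underlying call, the C sketch below maps the GTPv1-U IPv4 PCTYPE to a DPDK flow type. The field semantics of struct rte_pmd_i40e_flow_type_mapping (a flow type plus a PCTYPE bit mask) are assumptions based on the DPDK 17.11 headers rather than something defined in this article.

#include <stdint.h>
#include <rte_pmd_i40e.h>

/* Map hardware PCTYPE 22, created by the GTP profile, to DPDK flow type 22
 * so that RSS can then be configured for that flow type. */
static int map_gtpu_ipv4_pctype(uint8_t port)
{
	struct rte_pmd_i40e_flow_type_mapping mapping;

	mapping.flow_type = 22;        /* new DPDK flow type */
	mapping.pctype = 1ULL << 22;   /* assumption: bit mask of hardware PCTYPE 22 */

	/* exclusive = 0: update only the referenced entry, keep the rest intact */
	return rte_pmd_i40e_flow_type_mapping_update(port, &mapping, 1, 0);
}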

If we send GTP traffic again, we will see that packets are being classified as GTP and distributed to multiple queues:

port 0/queue 1: received 1 packets
  src=00:01:02:03:04:05 - dst=00:10:20:30:40:50 - type=0x0800 - length=178 - nb_segs=1 - RSS hash=0x342ff376 - RSS queue=0x1 - hw ptype: L3_IPV4_EXT_UNKNOWN TUNNEL_GTPU INNER_L3_IPV4_EXT_UNKNOWN INNER_L4_UDP  - sw ptype: L2_ETHER L3_IPV4 L4_UDP  - l2_len=14 - l3_len=20 - l4_len=8 - VXLAN packet: packet type =32912, Destination UDP port =2152, VNI = 3272871 - Receive queue=0x1
  ol_flags: PKT_RX_RSS_HASH PKT_RX_L4_CKSUM_GOOD PKT_RX_IP_CKSUM_GOOD

port 0/queue 2: received 1 packets
  src=00:01:02:03:04:05 - dst=00:10:20:30:40:50 - type=0x0800 - length=178 - nb_segs=1 - RSS hash=0xe3402ba5 - RSS queue=0x2 - hw ptype: L3_IPV4_EXT_UNKNOWN TUNNEL_GTPU INNER_L3_IPV4_EXT_UNKNOWN INNER_L4_UDP  - sw ptype: L2_ETHER L3_IPV4 L4_UDP  - l2_len=14 - l3_len=20 - l4_len=8 - VXLAN packet: packet type =32912, Destination UDP port =2152, VNI = 9072104 - Receive queue=0x2
  ol_flags: PKT_RX_RSS_HASH PKT_RX_L4_CKSUM_GOOD PKT_RX_IP_CKSUM_GOOD

port 0/queue 0: received 1 packets
  src=00:01:02:03:04:05 - dst=00:10:20:30:40:50 - type=0x0800 - length=178 - nb_segs=1 - RSS hash=0x6a97ed3 - RSS queue=0x0 - hw ptype: L3_IPV4_EXT_UNKNOWN TUNNEL_GTPU INNER_L3_IPV4_EXT_UNKNOWN INNER_L4_UDP  - sw ptype: L2_ETHER L3_IPV4 L4_UDP  - l2_len=14 - l3_len=20 - l4_len=8 - VXLAN packet: packet type =32912, Destination UDP port =2152, VNI = 5877304 - Receive queue=0x0
  ol_flags: PKT_RX_RSS_HASH PKT_RX_L4_CKSUM_GOOD PKT_RX_IP_CKSUM_GOOD

port 0/queue 3: received 1 packets
  src=00:01:02:03:04:05 - dst=00:10:20:30:40:50 - type=0x0800 - length=178 - nb_segs=1 - RSS hash=0x7d729284 - RSS queue=0x3 - hw ptype: L3_IPV4_EXT_UNKNOWN TUNNEL_GTPU INNER_L3_IPV4_EXT_UNKNOWN INNER_L4_UDP  - sw ptype: L2_ETHER L3_IPV4 L4_UDP  - l2_len=14 - l3_len=20 - l4_len=8 - VXLAN packet: packet type =32912, Destination UDP port =2152, VNI = 1459946 - Receive queue=0x3
  ol_flags: PKT_RX_RSS_HASH PKT_RX_L4_CKSUM_GOOD PKT_RX_IP_CKSUM_GOOD

Figure 11. Distribution of GTP-U packets with GTPv1 profiles applied to the device.

Now the Intel Ethernet 700 Series parser knows that packets with UDP destination port 2152 should be parsed as a GTP-U tunnel, and that extra fields should be extracted from the GTP and inner IP headers.

If the profile is no longer needed, it can be removed from the network adapter and the original configuration restored:

testpmd> port stop 0
testpmd> ddp del 0 /home/pkg/gtp.bak
testpmd> ddp get list 0
Profile number is: 0

testpmd>

Figure 12. Removing GTPv1 profile from the device.

The 'ddp del 0 gtp.bak' command first loads the gtp.bak file to the memory buffer, then passes it to rte_pmd_i40e_process_ddp_package() but with the RTE_PMD_I40E_PKG_OP_WR_DEL operation, restoring the original configuration.

Summary

This new capability provides the means to accelerate packet processing for different network segments by applying a DDP profile, which adds the needed network controller functionality on demand. The same underlying infrastructure (servers with installed network adapters) can be used for optimized processing of traffic from different network segments (wireline, wireless, enterprise) without resetting network adapters or restarting the server.

About the Authors

Andrey Chilikin: Software Architect working on developing and adoption of new networking technologies and solutions for telecom and enterprise communication industries.

Brian Johnson: Solutions Architect focusing on defining networking solutions and best practices in data center networking, virtualization, and cloud technologies.

Robin Giller: Software Product Manager in the Network Platform Group at Intel.


Hand Gesture Recognition


Abstract

Soldiers communicate with each other through gestures, but sometimes those gestures are not visible because of obstructions or poor lighting. In such cases an instrument is needed to record the gesture and send it to fellow soldiers. The two options for gesture recognition are computer vision and sensors attached to the hands. The first option is not viable in this case, as proper lighting is required for recognition through computer vision; hence the second option, recognition through sensors, has been used. We present a system which recognizes the gestures given in this link.

Construction

The given gestures include motions of the fingers, wrist, and elbow. To detect changes in these joints we used flex sensors, which measure the amount by which each joint is bent. To account for dynamic gestures, an Inertial Measurement Unit (IMU, MPU-9250) was used. The parameters taken from the IMU are acceleration, gyroscopic acceleration, and the angles about all three axes. An Arduino* Mega was used to receive the signals from the sensors and send them to the processor.

A flex sensor is a strip whose resistance is proportional to the amount of strain in the sensor, so it outputs a variable voltage according to the strain. An IMU (MPU-6050) outputs linear acceleration and gyroscopic acceleration along all three axes (x, y, z).

The gestures can be classified into two sub-classes:

  1. Static Gestures
  2. Dynamic Gestures

The features used for the two sub-classes differ:

  1. For static gestures we used the flex sensor values and the angles about all three axes as the features.
  2. For dynamic gestures we used the flex sensor values, linear accelerations, gyroscopic accelerations, and the angles about all three axes.

Algorithm for Static Gesture Recognition

First, the angles have to be calculated from the acceleration values using these formulae.

The angle values contain some noise and have to be filtered to obtain smooth values, so we used a Kalman filter. Both the flex sensor values and the angles are then fed into a pre-trained Support Vector Machine (SVM) with a Radial Basis Function (Gaussian) kernel, which produces the recognized gesture as output.
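As a concrete illustration of the angle-calculation step, the C sketch below uses the standard accelerometer tilt formulas; it is not the project's actual code, and the axis convention depends on how the IMU is mounted on the hand.

#include <math.h>

/* Illustrative sketch: estimate roll and pitch, in degrees, from raw
 * accelerometer readings ax, ay, az using the common tilt formulas.
 * The resulting angles would still be smoothed with a Kalman filter
 * before being fed to the SVM. */
static void accel_to_angles(float ax, float ay, float az,
                            float *roll_deg, float *pitch_deg)
{
	const float rad_to_deg = 57.29578f;  /* 180 / pi */

	*roll_deg  = atan2f(ay, az) * rad_to_deg;
	*pitch_deg = atan2f(-ax, sqrtf(ay * ay + az * az)) * rad_to_deg;
}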

 

Figure 1: Principal Component Analysis of the dataset using all the features. Each colored cluster represents a particular gesture. Because accelerations are also included, the clusters are quite elongated.

Figure 2: Principal Component Analysis of the dataset using just the flex sensor values and angles. Here each colored cluster represents a particular gesture, and these clusters are separable enough to classify.

Algorithm for Dynamic Gesture Classification

The angles, linear accelerations, and gyroscopic accelerations are filtered using a Kalman filter. The values are stored in a temporary file, with each line representing one time point. Every value is then normalized column-wise, 50 time points are sampled from the recording, and the samples are flattened into a single vector of 800 dimensions. This vector is fed into an SVM with a Radial Basis Function (Gaussian) kernel. Because some gestures such as 'Column Formation', 'Vehicle', 'Ammunition', and 'Rally-Point' are similar to each other, we grouped such similar gestures into one class. If the first SVM classifies a sample into one of these groups, it is fed into another SVM that is trained to classify only the gestures in that group.
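A minimal C sketch of this preprocessing is shown below. It assumes 16 features per time point (so 50 samples give the 800-dimensional vector) and uses min-max normalization, since the article does not specify the exact normalization used.

#include <stddef.h>

#define NUM_FEATURES 16
#define NUM_SAMPLES  50

/* Normalize each feature column, resample the recording to NUM_SAMPLES
 * evenly spaced time points, and flatten the result into a single feature
 * vector of NUM_SAMPLES * NUM_FEATURES values for the SVM. */
void build_feature_vector(const float data[][NUM_FEATURES], size_t num_rows,
                          float out[NUM_SAMPLES * NUM_FEATURES])
{
	float min[NUM_FEATURES], max[NUM_FEATURES];

	/* Column-wise minimum and maximum over the whole recording. */
	for (size_t f = 0; f < NUM_FEATURES; f++) {
		min[f] = max[f] = data[0][f];
		for (size_t t = 1; t < num_rows; t++) {
			if (data[t][f] < min[f]) min[f] = data[t][f];
			if (data[t][f] > max[f]) max[f] = data[t][f];
		}
	}

	/* Sample NUM_SAMPLES evenly spaced time points and normalize each value. */
	for (size_t s = 0; s < NUM_SAMPLES; s++) {
		size_t t = (s * (num_rows - 1)) / (NUM_SAMPLES - 1);
		for (size_t f = 0; f < NUM_FEATURES; f++) {
			float range = max[f] - min[f];
			out[s * NUM_FEATURES + f] =
				(range > 0.0f) ? (data[t][f] - min[f]) / range : 0.0f;
		}
	}
}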

Figure 3: Two sample graphs of the x-axis acceleration for the gesture 'door'.

Salient Features of the system:

  1. No hindrance in the motion of the hands.
  2. The system is lightweight.
  3. The system can recognize 27/28 static gestures and 14/15 dynamic gestures.
  4. The system can be improved by gathering more data and using a neural network. A mechanism to record and immediately store new data has therefore been built, making room for more gestures to be recognized.
  5. The size can be reduced considerably by using a custom-made signal processor.

*Since we were required to show the output on a screen, we have not used a Raspberry Pi Zero (microprocessor) for processing. It can, however, be used for that purpose, and we have checked that the algorithm also runs fast enough on that processor.

** We generated our own data for training and testing.

***For detailed documentation and code visit my GitHub. The code and documentation will be uploaded soon.

Emulating Applications with Intel® SDE and Control Flow Enforcement Technology

Review of Architecture and Optimization on Intel® Xeon® Scalable Processors in context of Intel® Optimized TensorFlow* on Intel® AI DevCloud


When I joined the Intel® Student Developer Program in late 2017 I was pretty excited to try the Intel® Xeon® Scalable processors [1] that were part of the Intel® AI DevCloud, which launched at roughly the same time. To get everyone on the same page, I would like to begin with what Intel Xeon Scalable processors are and how they affect computation. Later I will discuss what I learned about optimizing deep learning TensorFlow* [2] code to squeeze the last drop of performance out of this beast.

The Intel Xeon Scalable processor family on the "Purley" platform is a new microarchitecture with many additional features compared to the previous-generation Intel® Xeon® processor E5-2600 v4 product family (formerly the Broadwell microarchitecture). The core reason I believe Intel Xeon Scalable processors are so good at handling artificial intelligence (AI) computations is the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) [3] instruction set. It provides ultra-wide 512-bit vector operations, which cover most of the high-performance computing required for TensorFlow*, since the most basic units of computation in TensorFlow are tensors flowing through operations that are parallelized on the vector processing units. These are called Single Instruction Multiple Data (SIMD) [4] operations. To give an example: if we add two vectors in the natural way, we loop over the dimension and add the corresponding elements, whereas a CPU with vector support adds a whole chunk of the two vectors with a single add operation, reducing latency by a factor of the vector width, which in this case is up to 512 bits at a time. So we get a performance boost of up to 3 to 4x over normal CPUs.
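To make the vector-add example concrete, here is a small C sketch (my own illustration, not code from TensorFlow) that adds two float arrays 16 elements at a time using AVX-512 intrinsics; a compiler targeting AVX-512 (for example, gcc -mavx512f) can generate similar code automatically from the plain scalar loop.

#include <immintrin.h>
#include <stddef.h>

/* Scalar version: one add per element. */
void add_scalar(const float *a, const float *b, float *c, size_t n)
{
	for (size_t i = 0; i < n; i++)
		c[i] = a[i] + b[i];
}

/* AVX-512 version: 16 single-precision adds per instruction (512 bits). */
void add_avx512(const float *a, const float *b, float *c, size_t n)
{
	size_t i = 0;
	for (; i + 16 <= n; i += 16) {
		__m512 va = _mm512_loadu_ps(a + i);
		__m512 vb = _mm512_loadu_ps(b + i);
		_mm512_storeu_ps(c + i, _mm512_add_ps(va, vb));
	}
	for (; i < n; i++)   /* handle the remaining tail elements */
		c[i] = a[i] + b[i];
}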

Now I will talk about the experiments I conducted, which prove my point. When I joined the program I already had a 1500-line codebase for a Neural Image Captioning system to try out, which I had previously run only on the Google* Cloud Platform [5]. A neural captioning system is one that generates captions for images through an encoder-decoder neural network. In my case I followed the work of Vinyals et al. [6] with slight modifications. My encoder for images is a VGG16 model [7], a convolutional neural network presented in the ILSVRC [8] object recognition competition. It turns out to be a good feature extractor, so I removed everything after its 7th fully connected layer and used the final 4096-length vector. This approach is popularly known as transfer learning in the deep learning community. I pre-extracted the features of all the images in the Microsoft COCO [9] dataset and then performed Principal Component Analysis (PCA) over the data to reduce its dimension to 512. I conducted experiments with both datasets (PCA and non-PCA). The work is still in progress and the codebase is in my GitHub repository if you want to take a look.

When I first tried to run my code out of the box on the Intel Xeon processors I got absolutely zero performance increase; in fact it was a bit slower. So I spent the last month hoping I could figure it out before Santa knocked on my door. I would like to share some steps that I found need to be done before you see improved performance from this complicated yet powerful processor.

  1. As we batch through the data over each epoch, avoid any kind of disk read/write as far as possible. On Intel AI DevCloud our home folder is on a Network File System (NFS) shared between the compute nodes and the login node, and reads and writes take a long time on the cluster because the storage is further away than it is on a home PC. So how do we go about it? TensorFlow provides an elegant queue-based mechanism through its Dataset API [10]. The API enables us to build complex input pipelines from reusable pieces of operation: it wraps your data with the pre-processing operations of your choice and batches them together. This drastically reduces latency, as the dataset is cached within the stated limitations and requirements and all of the operations are embedded in the computation graph.
     
  2. There is a very nice paper by Colfax Research [11] from which I would like to share a few tips that directly affected my performance. It deals with the optimization of an object detection network based on YOLO [12] and recommends tuning certain variables that are critical from a performance point of view.
  • KMP_BLOCKTIME: This environment variable controls the behavior of the OpenMP* runtime, the parallel programming interface primarily responsible for multi-threading inside the TensorFlow API. It sets the time, in milliseconds, that an OpenMP thread waits after completing a parallel region before going to sleep. With a large value you keep your data hot, but at the same time you can easily starve other threads of resources, so this variable needs to be tuned to best suit your workload. In my case I kept it at 30.
    os.environ["KMP_BLOCKTIME"] = "30"
  • OMP_NUM_THREADS: This is the number of parallel threads that a TensorFlow operation can use. The recommended setting for TensorFlow is the number of physical cores. I tried 136 and it worked in my case.
    os.environ["OMP_NUM_THREADS"] = "136"
  • KMP_AFFINITY: This provides abstract control over the placement of OpenMP threads on physical cores. The recommended setting for TensorFlow is 'granularity=fine,compact,1,0'. 'fine' prevents thread migration, thereby reducing cache misses. 'compact' places neighboring threads close together. '1' gives priority to placing threads on different free physical cores rather than on the hyper-threaded sibling of an already occupied core; this behavior is similar to how electron orbitals are filled in atoms. '0' is the offset at which the core mapping starts.
    os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"
  • Inter- and Intra-Operation Parallel Threads: These are the variables provided by TensorFlow to control how many operations can run simultaneously and how many parallel threads each operation can use. In my case I kept the former at two and the latter equal to OMP_NUM_THREADS (as recommended).
    tf.app.flags.DEFINE_integer('inter_op', 2, """Inter op parallelism threads""")
    tf.app.flags.DEFINE_integer('intra_op', 136, """Intra op parallelism threads""")

After tuning all the said variables I got a performance increase of up to 4x, reducing my per-epoch time from 2.5 hours to 30 minutes and thereby greatly reducing latency. As I said, Intel Xeon Scalable processors are pretty powerful, and what we get in the Intel AI DevCloud is a theoretical promised performance of 260 TFLOPS, but this can't be expected out of the box unless certain cards fall into the right place.

References

  1. Intel Xeon Scalable Platform
  2. arXiv:1603.04467: TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
  3. Advanced Vector Extension
  4. Single Instruction Multiple Data (SIMD)
  5. Google cloud Platform
  6. arXiv:1411.4555: Show and Tell: A Neural Image Caption Generator
  7. arXiv:1409.1556 Very Deep Convolutional Networks for Large-Scale Image Recognition
  8. ImageNet Large Scale Visual Recognition Competition (ILSVRC)
  9. arXiv:1405.0321v3 Microsoft COCO: Common Objects in Context
  10. TensorFlow’s Dataset API
  11. Optimization of Real time object detection on Intel Xeon Scalable Processor, Colfax
  12. arXiv:1506.02640: You Only Look Once: Unified, Real-Time Object Detection

Intel® Parallel Computing Center at National Renewable Energy Laboratory


National Renewable Energy Laboratory

Principal Investigators:

Michael Sprague is a senior scientist at the National Renewable Energy Laboratory. Mike’s research interests include high-performance computing and computational mechanics. He is leading several projects in wind energy, including a U.S. Department of Energy Exascale Computing Project called ExaWind. He was an assistant professor of applied mathematics at the University of California, Merced (2005-2010), and he was a postdoctoral fellow in applied mathematics at the University of Colorado at Boulder (2003-2005). His degrees are in mechanical engineering, with a B.S. from the University of Wisconsin-Madison (1997) and a Ph.D. from the University of Colorado at Boulder (2002).

Description:

OpenFAST is an open-source software package for wind turbine simulation and analysis. It is supported by the National Renewable Energy Laboratory under the U.S. Department of Energy (DOE) Wind Energy Technologies Office. OpenFAST encompasses models and associated simulation modules for aerodynamics, substructure hydrodynamics for offshore systems, control and electrical systems, and structural dynamics. OpenFAST modules are coupled to allow for nonlinear analysis of aero-hydro-servo-elastic interactions in the time domain.

OpenFAST serves as the high-fidelity turbine model (structures and control system) in DOE-supported efforts to enable predictive high-performance-computing simulations of whole wind farms, for which the complex flow dynamics are simulated with computational fluid dynamics. The DOE-supported efforts include the Exascale Computing Project, ExaWind, which is focused on creating an exascale-ready simulation capability for wind farms, and the Atmosphere-to-Electrons High-Fidelity Modeling project. OpenFAST also serves as a computer-aided-engineering tool. The wind energy industry relies heavily on computer-aided-engineering tools for analyzing wind turbine performance, loading, and stability. For example, under its use in design and optimization, OpenFAST is run thousands or even tens of thousands of times under various conditions, in which each simulation can take several hours.

Parallelizing OpenFAST is the objective of this Intel Parallel Computing Center. At the start of this project, FAST was a serial-computation code, but its modularity presents obvious pathways for parallelization (e.g., each of the three blade solvers can be run in parallel). Beyond modular parallelism, individual modules present additional opportunities for parallelism, e.g., through the nonlinear solvers. In this project, we are parallelizing OpenFAST with a focus on Intel Xeon and Intel Xeon Phi processors under a variety of use cases. Success will be measured by time-to-solution improvements over the baseline serial-computation cases and by strong-scaling tests.

Related Websites:

http://openfast.readthedocs.io/
https://github.com/openfast/openfast
https://www.exawind.org/

Boosting Deep Learning Training & Inference Performance on Intel® Xeon® and Intel® Xeon Phi™ Processors


View PDF

In this work we present how, without a single line of code change in the framework, we can further boost the performance for deep learning training by up to 2X and inference by up to 2.7X on top of the current software optimizations available from open source TensorFlow* and Caffe* on Intel® Xeon® and Intel® Xeon Phi™ processors. Our system level optimizations result in a higher throughput and a reduction in time-to-train for a given batch size per worker compared to the current baseline for image recognition Convolution Neural Networks (CNN) workloads.

Overview

Intel® Xeon® and Intel® Xeon Phi™ processors are extensively used in deep learning and high performance computing applications. Popular deep learning frameworks such as TensorFlow*, Caffe* and MxNet* have been optimized by Intel software teams to deliver optimal performance on Intel platforms for both deep learning training and inference workflows. With Intel and Google’s continuing collaboration, the performance of TensorFlow has significantly improved with Intel® Math Kernel Library (Intel® MKL) and Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN). Similarly, the Intel® Distribution of Caffe* also delivers significant performance gains on Intel Xeon and Intel Xeon Phi processors.

Training deep Convolution Neural Networks (CNN) such as ResNet-50, GoogLeNet-v1, Inception-3, and others involves executing hundreds of compute-intensive functions such as two-dimensional convolutions, matrix multiplication, RELU activation, max-pool and softmax to name a few, for hundreds of thousands of iterations. These function kernels are mapped to libraries such as Intel MKL or Intel MKL-DNN which are highly optimized implementations of these kernels on Intel platforms. In our performance characterization of CNN applications, we have observed that even though Intel optimized deep learning frameworks are multi-threaded, the CPU cores are under-utilized during the execution of CNNs.

Although user controllable configuration parameters are provided in the frameworks, those are not sufficient to achieve optimal performance. TensorFlow, for example, utilizes intra-op and inter-op parallelism. Intra-op controls the size of the thread pool to parallelize kernels in a given operation and inter-op controls the size of thread pool to run operations in parallel. However, these user-level knobs do not provide users with sufficient micro-architectural information on the underlying NUMA configuration in multi-socket Intel Xeon processor-based platforms.

In addition, without the knowledge of CPU socket and NUMA configuration, simple thread affinitization (as in the case of thread pool) does not lead to optimal performance. In fact, it can sometimes prohibitively decrease throughput, as a core from socket 0 might have to continually access cache lines from the memory bank of socket 1 creating increased bandwidth pressure on the Intel® Ultra-Path Interconnect (Intel® UPI). This situation exacerbates with larger number of sockets found in 4, 8, and 16 socket systems. We believe that users need to be aware of system level optimizations in addition to framework specific configuration parameters to achieve the best performance for CNN workloads on CPU platforms.

Improving Deep Learning Performance

In this section we present the methodology (or Best Known Methods – BKMs) on how to optimally run deep learning workloads on multi-socket Intel Xeon platforms. The BKMs achieve the following:

  • Single-node, multi-socket deep learning training, with a Parameter Server (PS) if required
  • Multi-node, multi-socket distributed deep learning training, with a PS if required
  • Single-node, multi-stream deep learning inference

In a later section, we will show that these BKMs are also applicable for Intel Xeon Phi processor-based platforms.

Performance Metrics for Image Recognition

Training Performance Metric

The performance metric for developing a trained neural network model on an image dataset, reaching convergence at a given batch size per worker within a specific number of iterations, is the Time-To-Train (TTT). With a given batch size BSize/worker, an image throughput in images/sec, and assuming tuned hyper-parameters and convergence within a given number of Epochs:

For 1 worker, the TTT is given by:

TTT = (Epochs × Number of images in the dataset) / Throughput (images/sec)

For W workers, the TTT is given by:

TTT = (Epochs × Number of images in the dataset) / (W × Throughput per worker (images/sec))

Baseline Performance for Single and Multi-Node Training

The current methodology is to train with a single worker per node with a batch size BSize. Single-node baseline performance is measured by TTT with 1 Worker/Node. Multi-Node baseline performance on N nodes is measured by TTT with N Workers, 1 worker on each node.

Baseline Inference Performance

The current methodology is to run inference with a single stream of input with a single worker per node. Baseline Inference performance is measured by throughput in Images/sec achieved by a single node at a given batch size BSize.

Deep Learning Training: Partitioning Multi-Socket Intel® Xeon® Processor Based Systems

To improve core utilization and ultimately performance for CNN workloads on multi-socket Intel Xeon platforms, we partition the sockets and the cores on the platform as separate computing devices and run multiple deep learning training instances. The term ‘instances’ refers to deep learning framework worker processes that are working in tandem, each on a local batch size of input data in a synchronous manner on a multi-socket or even a single-socket system. Each worker is process bound to a subset of the total number of cores and threads in the system using core and thread affinity settings.

Figure 1

Figure 1. Sub-Socket Partitioning across Dual-Socket Intel® Xeon® Platform

We use libnumactl to control memory allocations to target NUMA domains and the KMP_AFFINITY environment variable provided by the OpenMP* runtime library to affinitize OpenMP threads to target CPU cores.

If a parameter server (PS) is required to aggregate gradients, it works without any change whether it is spawned locally as a separate thread on the host server or spawned remotely over the network on another server.

Optimized Performance with Multiple Workers on Single- and Multi-Node Training

In this scenario, the single-node optimized performance is measured by TTT with K Workers/Node, each with a batch size BSize per worker. The batch size per node would then be equal to K*BSize. Multi-node optimized performance on N nodes is measured by TTT with K*N Workers, K workers on each node. It is assumed that hyper-parameters for the neural network model are tuned for multiple workers for single and multiple nodes.

Deep Learning Inference: Partitioning Multi-Socket Intel® Xeon® Processor-based Systems

Figure 2

Figure 2. Sub-socket Partitioning across Dual-Socket Intel® Xeon® Platforms for Multiple Inference Streams

Similar methodology can be applied for deep learning inference. We create multiple independent deep learning inference framework instances, and set affinity for each instance to a partitioned set of cores and memory locality on single- or multiple-socket systems. Figure 2 shows an example of 8 framework instances, each concurrently processing a separate stream of input data on affinitized threads with memory locality. Depending on the inference batch size and system memory capacity, one could have an even larger number of framework instances and streams, each mapped to different cores.

Optimized Inference Performance

In this scenario, we have K workers per node. The optimized performance is measured by the total throughput in images/sec per node with K streams of input each at a given batch size BSize and processed by the K workers. The total number of batches per node on K workers for inference would then be equal to K*BSize.

TensorFlow Training Performance

Figure 3 shows deep learning training performance (Images/Sec) relative to the current optimization using TensorFlow 1.4.0 release version across 6 deep learning benchmark topologies. The 3 bars in the chart show the performance improvement on 1, 2, & 4 nodes of dual-socket Intel Xeon Platinum 8168 processor cluster over 10Gbit Ethernet fabric. The figure shows that we can improve the performance up to 2.1X even on a single node with 4 workers/node using core/thread affinity and memory locality optimizations.

Figure 3

Figure 3. TensorFlow 1.4 Training Performance (Projected TTT) Improvement with optimized affinity for cores and memory locality using 4 Workers/Node compared to current baseline with 1 Worker/Node

Caffe Training Performance

Figure 4 shows that using our optimized BKMs for Intel® Distribution of Caffe, we are able to boost the performance of GoogLeNet-v1 by up to 1.2X on top of current optimizations for 1, 2, and 4 node clusters of dual-socket Intel Xeon Platinum 8170 processor-based systems. As the current Caffe available from github is highly optimized for Intel CPUs and able to use cores more efficiently, the improvement is smaller compared to TensorFlow.

Figure 4

Figure 4. Intel® Distribution of Caffe* Training Performance (Projected TTT) Improvement with optimized affinity for cores and memory locality using 2 Caffe Instances/Node compared to current optimized baseline with 1 Instance/Node

TensorFlow Inference Performance

Figure 5

Figure 5. TensorFlow Inference Performance (Images/Sec) Improvement with optimized affinity for cores and memory locality using concurrent multiple 2, 4, & 8 Streams/Node compared to current baseline with equivalent batch-size using 1 Stream/Node

Figure 5 shows deep learning inference performance (Images/Sec) relative to the current optimization using TensorFlow 1.4. The 3 bars in the chart show the performance improvements for global batch sizes of 512 (2 streams, each of batch-size of 256), 1024, and 2048 on a single-node, dual-socket Intel Xeon Platinum 8168 processor-based platform. For the optimized test, we have 2, 4, & 8 workers affinitized to cores and mapped to appropriate memory locality. Multiple streams of input data, each stream per worker is concurrently processed by the workers. E.g., for a global batch size of 2048, we use 8 streams each processing a batch size of 256. Performance data measured shows that we are able to boost inference performance up to 2.7X with our system level optimizations.

Caffe* Inference Performance

Figure 6 shows deep learning Inference performance (Images/Sec) relative to the current optimization using Intel Distribution of Caffe. The 4 bars in the chart show the performance improvements for global batch sizes of 256, 512, 1024, and 2048 on a single-node, dual-socket Intel Xeon Platinum 8170 processor-based platform. We observe that although Caffe is well optimized, we are still able to improve the inference performance up to 1.8X for large batch sizes.

Figure 6

Figure 6. Intel® Distribution of Caffe* Inference Performance (Images/Sec) Improvement with optimized affinity for cores and memory locality using concurrent multiple 2, 4, & 8 Streams/Node compared to current baseline with equivalent batch-size using 1 Stream/Node

Deep Learning Training: Partitioning Single-Socket Intel® Xeon Phi™ Processor Based Systems

We used optimization learnings from the Intel Xeon processor and applied them to single-socket Intel Xeon Phi processor-based platforms. The Intel Xeon Phi processor 7250 has 68 cores with 4 threads/core, resulting in 272 threads. Figure 7 shows a symbolic view of how one could partition the socket for 4 instances of a framework, each instance affinitized to specific cores. The 4 instances run on 64 cores (16 cores/instance) in a distributed training manner, with the remaining 4 cores allocated to I/O and the Parameter Server (if required).

Figure 7

Figure 7. Symbolic Sub-socket Partitioning for Single Socket Intel® Xeon Phi™ Processor 7250

TensorFlow Training Performance 

To support multiple workers on a single Intel Xeon Phi processor-based system, we configure the processor MCDRAM in Cache-Mode at system boot time. Figure 8 shows that we are successfully able to apply the optimizations on the single-socket Intel Xeon Phi 7250 processor, boosting its performance up to 1.4X with 4 workers/node using TensorFlow 1.3 for ResNet-50 neural network benchmark. The optimizations also hold for multiple worker and multi-node (1, 2, and 4) distributed training using Intel® Omni-Path Architecture (Intel® OPA).

Figure 8

Figure 8. TensorFlow Training Performance (Projected TTT) Improvement with optimized affinity for cores and memory locality using 4 Workers/Node compared to current optimized baseline with 1 Worker/Node

Platform Configurations

Intel Xeon Platinum 8168 Processor

Dual-socket Intel Xeon Platinum 8168 processor @ 2.70GHz (24 cores), HT enabled, turbo disabled, scaling governor set to “performance” via intel_pstate driver, 192GB DDR4-2666 ECC RAM. CentOS Linux release 7.3.1611 (Core), Linux kernel 3.10.0-514.10.2.el7.x86_64. SSD: Intel® SSD DC S3700 Series. Multiple nodes connected with 10Gbit Ethernet.

Intel Xeon Gold 6148 Processor

Dual-socket Intel Xeon Gold 6148 processor @ 2.40GHz (20 cores), HT enabled, turbo disabled, scaling governor set to “performance” via intel_pstate driver, 192GB DDR4-2666 ECC RAM. CentOS Linux release 7.3.1611 (Core), Linux kernel 3.10.0-514.10.2.el7.x86_64. Multiple nodes connected with Intel Omni-Path Architecture Host Fabric, Intel OPA Interface Driver version 10.4.2.0.7. SSD: Intel SSD DC S3700 Series.

Intel Xeon Platinum 8170 Processor

Dual-socket Intel Xeon Platinum 8170 processor @ 2.10GHz (26 cores), HT enabled, turbo disabled, scaling governor set to “performance” via intel_pstate driver, 384GB DDR4-2666 ECC RAM. CentOS Linux release 7.3.1611 (Core), Linux kernel 3.10.0-514.16.1.el7.x86_64. Multiple nodes connected with Intel OPA Host Fabric, Intel OPA Interface Driver version 10.4.2.0.7. SSD: Intel SSD 800GB DC S3700 Series.

Intel Xeon Phi Processor 7250

Single-socket Intel Xeon Phi processor 7250, 68 Cores, 4 HW Threads per core, 1.4 GHz, 16GB high-speed MCDRAM set in Cache-Quadrant mode, 32KB L1 data cache per core, 1MB L2 per two-core tile, 96GB DDR4. Multiple nodes connected with Intel OPA Host Fabric, Intel OPA Interface Driver version 10.4.2.0.7, Intel SSD 480GB DC S3500 Series, Software: CentOS Linux release 7.3.1611, Linux kernel 3.10.0-514.10.2.el7.x86_64, Intel® MPI Library 2017 Update 4.

Deep Learning Framework Configurations

TensorFlow

TensorFlow 1.4: https://github.com/tensorflow/tensorflow, Tensorflow 1.4.0, GCC 6.2.0, Intel MKL-DNN. TensorFlow training measured with image data stored on the SSD storage, Inference measured with -forward_only option.

TensorFlow1.3: https://github.com/tensorflow/tensorflow, Tensorflow 1.3.0, GCC 6.2.0, Intel MKL 2017. TensorFlow training measured with image data stored on the SSD storage, Inference measured with --forward_only option.

Intel Distribution of Caffe

Caffe: http://github.com/intel/caffe/, Topology specs from https://github.com/intel/caffe/tree/master/models/intel_optimized_models, image data in memory before training and inference, Intel C++ compiler ver. 17.0.2 20170213, Intel MKL version 2018.0.20170425. Caffe training measured with -train and inference measured with -forward_only option.

Best Known Methods (BKMs)

Intel® Xeon® Processor Performance Optimizations on Top of Currently Optimized Deep Learning Frameworks

In this section we outline our Best Known Methods (BKMs) using TensorFlow and Caffe as examples. We have used Intel Xeon and Intel Xeon Phi processor-based platforms in our examples.

Best Known Methods for TensorFlow

Build Methodology for TensorFlow

For an Intel® optimized TensorFlow build, please follow the BKMs specified by the direct optimizations team or refer to this article: Intel Optimized TensorFlow Wheel Now Available

Optimized Run Time BKM for TensorFlow

We use the tf_cnn_benchmarks at TensorFlow github to test and measure performance improvement using our runtime optimizations:

TensorFlow tf_cnn_benchmarks:

  • tf_cnn_benchmarks code available from GitHub
  • Uses the latest APIs for the input pipeline and gradient updates, and hence is designed to be fast
  • Can be easily integrated with custom CNN topologies

BKM for Single-Node Multi-Socket Distributed Training

Example 1: For 2S Intel Xeon Gold 6148 processor-based systems, multi-socket (sub-socket) with 20 Cores/Socket, single-node distributed training with 4 TensorFlow worker instances per node and 1 Parameter Server (PS) can be specified and launched as follows:

PS_HOST: “hostname1”
ps_list: “hostname1:2218”
WK_HOST= “hostname2”
workers_list : “hostname2:2223,hostname2:2224,hostname2:2225,hostname2:2226”
worker_env:”export OMP_NUM_THREADS=9; export TF_ADJUST_HUE_FUSED=1; export TF_ADJUST_SATURATION_FUSED=1;”
common_args: “--model resnet50 --batch_size 64 --data_format NCHW --num_batches 100 --distortions=True --mkl=True --local_parameter_device cpu --num_warmup_batches 10 --device cpu --data_dir ‘/path-to/TF_Records' --data_name imagenet --server_protocol grpc --optimizer rmsprop --ps_hosts $ps_list --worker_hosts $workers_list --display_every 10 “
ps_args: “$common_args --num_intra_threads 4 --num_inter_threads 2“
worker_args: “$common_args --num_intra_threads 9 --num_inter_threads 4“

To start the Parameter Server:

ssh $PS_HOST; numactl -l python tf_cnn_benchmarks.py $ps_args --job_name ps --task_index 0 --ps_hosts $ps_list  --worker_hosts  $workers_list &

To start the Workers:

ssh $WK_HOST; $worker_env;nohup numactl -m 0 python tf_cnn_benchmarks.py $worker_args --kmp_affinity=“granularity=thread,proclist=[0-9,40-49],explicit,verbose”   --job_name worker --task_index 0 --ps_hosts $ps_list --worker_hosts $workers_list &

ssh $WK_HOST; $worker_env;nohup numactl -m 0 python tf_cnn_benchmarks.py $worker_args --kmp_affinity=“granularity=thread,proclist=[10-19,50-59],explicit,verbose” --job_name worker --task_index 1 --ps_hosts $ps_list  --worker_hosts $workers_list &

ssh $WK_HOST; $worker_env;nohup numactl -m 1 python tf_cnn_benchmarks.py $worker_args --kmp_affinity=“granularity=thread,proclist=[20-29,60-69],explicit,verbose” --job_name worker --task_index 2 --ps_hosts $ps_list  --worker_hosts $workers_list &

ssh $WK_HOST; $worker_env;nohup numactl -m 1 python tf_cnn_benchmarks.py $worker_args --kmp_affinity=“granularity=thread,proclist=[30-39,70-79],explicit,verbose” --job_name worker --task_index 3 --ps_hosts $ps_list  --worker_hosts $workers_list &

Where $ps_list and $workers_list are the comma separated list of hostname:port pairs of the parameter servers and worker hosts respectively. $ps_args are the arguments to the parameter server such as --num_inter_threads and --num_intra_threads. $worker_args are the arguments to the worker such as the model name, batch_size, data_format, data_dir, server_protocol, num_inter_threads and num_intra_threads values etc.

BKM for Multi-Node, Multi-Socket Training

Example 2: For 2S Intel Xeon Gold 6148 processor-based systems, multi-socket (sub-socket) with 20 Cores/Socket 2-node distributed training with 4 TensorFlow worker instances per node and 1 Parameter Server (PS) can be specified and launched as follows:

PS_HOST_0: “hostname1”

ps_list: “hostname1:2218”

WK_HOST_0=hostname2, WK_HOST_1=hostname3

workers_list: “hostname2:2223,hostname2:2224,hostname2:2225,hostname2:2226,hostname3:2227,hostname3:2228,hostname3:2229,hostname3:2230”

worker_env:”export OMP_NUM_THREADS=9; export TF_ADJUST_HUE_FUSED=1; export TF_ADJUST_SATURATION_FUSED=1;”

common_args: “--model resnet50 --batch_size 64 --data_format NCHW --num_batches 100 --distortions=True --mkl=True --local_parameter_device cpu --num_warmup_batches 10 --device cpu --data_dir ‘/path-to/TF_Records' --data_name imagenet --server_protocol grpc --optimizer rmsprop --ps_hosts $ps_list --worker_hosts $workers_list --display_every 10 “

ps_args: “$common_args --num_intra_threads 4 --num_inter_threads 2“

worker_args: “$common_args --num_intra_threads 9 --num_inter_threads 4“

To start the parameter server:

ssh $PS_HOST_0; numactl -l python tf_cnn_benchmarks.py $ps_args --job_name ps --task_index 0 --ps_hosts $ps_list  --worker_hosts  $workers_list &

To start the workers on node 0:

ssh $WK_HOST_0; $worker_env; nohup numactl -m 0 python tf_cnn_benchmarks.py $worker_args --kmp_affinity=“granularity=thread,proclist=[0-9,40-49],explicit,verbose”  --job_name worker --task_index 0 --ps_hosts $ps_list --worker_hosts $workers_list &

ssh $WK_HOST_0; $worker_env; nohup numactl -m 0 python tf_cnn_benchmarks.py $worker_args --kmp_affinity=“granularity=thread,proclist=[10-19,50-59],explicit,verbose” --job_name worker --task_index 1 --ps_hosts $ps_list  --worker_hosts $workers_list &

ssh $WK_HOST_0; $worker_env; nohup numactl -m 1 python tf_cnn_benchmarks.py $worker_args --kmp_affinity=“granularity=thread,proclist=[20-29,60-69],explicit,verbose” --job_name worker --task_index 2 --ps_hosts $ps_list  --worker_hosts $workers_list &

ssh $WK_HOST_0; $worker_env; nohup numactl -m 1 python tf_cnn_benchmarks.py $worker_args --kmp_affinity=“granularity=thread,proclist=[30-39,70-79],explicit,verbose” --job_name worker --task_index 3 --ps_hosts $ps_list  --worker_hosts $workers_list &

To start the workers on node 1:

ssh $WK_HOST_1; $worker_env; nohup numactl -m 0 python tf_cnn_benchmarks.py $worker_args --kmp_affinity=“granularity=thread,proclist=[0-9,40-49],explicit,verbose”  --job_name worker --task_index 4 --ps_hosts $ps_list --worker_hosts $workers_list &

ssh $WK_HOST_1; $worker_env; nohup numactl -m 0 python tf_cnn_benchmarks.py $worker_args --kmp_affinity=“granularity=thread,proclist=[10-19,50-59],explicit,verbose”  --job_name worker --task_index 5 --ps_hosts $ps_list --worker_hosts $workers_list &

ssh $WK_HOST_1; $worker_env; nohup numactl -m 1 python tf_cnn_benchmarks.py $worker_args --kmp_affinity=“granularity=thread,proclist=[20-29,60-69],explicit,verbose”  --job_name worker --task_index 6 --ps_hosts $ps_list --worker_hosts $workers_list &

ssh $WK_HOST_1; $worker_env; nohup numactl -m 1 python tf_cnn_benchmarks.py $worker_args --kmp_affinity=“granularity=thread,proclist=[30-39,70-79],explicit,verbose”  --job_name worker --task_index 7 --ps_hosts $ps_list --worker_hosts $workers_list &

Where $ps_list and $workers_list are the comma separated list of hostname:port pairs of the parameter servers and worker hosts respectively. $ps_args are the arguments to the parameter server such as --num_inter_threads and --num_intra_threads. $worker_args are the arguments to the worker such as the model name, batch_size, data_format, data_dir, server_protocol, num_inter_threads and num_intra_threads values etc.

Multi-Socket Deep Learning Inference on Intel® Xeon® Processor-Based Systems

Example 3: For 2S Intel Xeon Platinum 8170 processor-based systems, multi-socket (sub-socket) with 26 Cores/Socket with 8 TensorFlow instances per node running inference can be launched as follows:

common_args: “--model resnet50 --batch_size 256 --data_format NCHW --num_batches 100 --distortions=True --mkl=True --num_warmup_batches 10 --device cpu --data_dir ~/tensorflow/TF_Records --data_name imagenet --display_every 10 “

WK_HOST= “hostname”

worker_env:”export OMP_NUM_THREADS=6; export TF_ADJUST_HUE_FUSED=1; export TF_ADJUST_SATURATION_FUSED=1;”

inf_args: “$common_args --num_intra_threads 6 --num_inter_threads 4“

To start 4 Inference streams on Socket-0:

ssh $WK_HOST; $worker_env; nohup unbuffer numactl -m 0 python tf_cnn_benchmarks.py --forward_only True $inf_args --kmp_affinity="granularity=thread,proclist=[0-5,52-57],explicit,verbose"&

ssh $WK_HOST; $worker_env; nohup unbuffer numactl -m 0 python tf_cnn_benchmarks.py --forward_only  True $inf_args --kmp_affinity="granularity=thread,proclist=[6-12,58-64],explicit,verbose"&

ssh $WK_HOST; $worker_env; nohup unbuffer numactl -m 0 python tf_cnn_benchmarks.py --forward_only  True $inf_args --kmp_affinity="granularity=thread,proclist=[13-18,65-70],explicit,verbose"&

ssh $WK_HOST; $worker_env; nohup unbuffer numactl -m 0 python tf_cnn_benchmarks.py --forward_only  True $inf_args --kmp_affinity="granularity=thread,proclist=[19-25,71-77],explicit,verbose"&

To start 4 inference streams on Socket-1:

ssh $WK_HOST; $worker_env; nohup unbuffer numactl -m 1 python tf_cnn_benchmarks.py  --forward_only  True $inf_args --kmp_affinity="granularity=thread,proclist=[26-31,78-83],explicit,verbose"&

ssh $WK_HOST; $worker_env; nohup unbuffer numactl -m 1 python tf_cnn_benchmarks.py --forward_only  True $inf_args --kmp_affinity="granularity=thread,proclist=[32-38,84-90],explicit,verbose"&

ssh $WK_HOST; $worker_env; nohup unbuffer numactl -m 1 python tf_cnn_benchmarks.py --forward_only  True $inf_args --kmp_affinity="granularity=thread,proclist=[39-44,91-96],explicit,verbose"&

ssh $WK_HOST; $worker_env; nohup unbuffer numactl -m 1 python tf_cnn_benchmarks.py --forward_only  True $inf_args --kmp_affinity="granularity=thread,proclist=[45-51,96-102],explicit,verbose"&

Where $inf_args are the arguments to the TF instance running inference such as the model name, batch_size, data_format, data_dir, num_inter_threads and num_intra_threads values etc.

Optimized Run Time BKM for TensorFlow for Training on Intel Xeon Phi processor 7250

Example 4: For 1S Intel Xeon Phi processor 7250 based systems, multi-socket (sub-socket) with 68 Cores/Socket 1-node distributed training with 4 TensorFlow worker instances per node and 1 Parameter Server (PS) can be specified and launched as follows. We use 64 Cores for compute and the remaining 4 cores for I/O. We assume that the MCDRAM in the Intel Xeon Phi processor-based system is booted in Cache-Mode.

PS_HOST: “hostname1”

ps_list: “hostname1:2218”

WK_HOST= “hostname2”

workers_list : “hostname2:2223,hostname2:2224,hostname2:2225,hostname2:2226”

worker_env:”export OMP_NUM_THREADS=15; export TF_ADJUST_HUE_FUSED=1; export TF_ADJUST_SATURATION_FUSED=1;”

common_args: “--model resnet50 --batch_size 64 --data_format NCHW --num_batches 100 --distortions=True --mkl=True --local_parameter_device cpu --num_warmup_batches 10 --device cpu --data_dir ‘/path-to/TF_Records' --data_name imagenet --server_protocol grpc --optimizer rmsprop --ps_hosts $ps_list --worker_hosts $workers_list --display_every 10 “

ps_args: “$common_args --num_intra_threads 4 --num_inter_threads 2“

worker_args: “$common_args --num_intra_threads 15 --num_inter_threads 4“

To start the Parameter Server:

ssh $PS_HOST; numactl -l python tf_cnn_benchmarks.py $ps_args --job_name ps --task_index 0 --ps_hosts $ps_list  --worker_hosts  $workers_list &

To start the Workers:

ssh $WK_HOST; $worker_env; nohup numactl -m 0 python tf_cnn_benchmarks.py $worker_args --kmp_affinity="granularity=thread,proclist=[0-15,68-115],explicit,verbose" --job_name worker --task_index 0 --ps_hosts $ps_list --worker_hosts $workers_list &

ssh $WK_HOST; $worker_env; nohup numactl -m 0 python tf_cnn_benchmarks.py $worker_args --kmp_affinity="granularity=thread,proclist=[16-31,116-163],explicit,verbose" --job_name worker --task_index 1 --ps_hosts $ps_list --worker_hosts $workers_list &

ssh $WK_HOST; $worker_env; nohup numactl -m 0 python tf_cnn_benchmarks.py $worker_args --kmp_affinity="granularity=thread,proclist=[32-47,164-211],explicit,verbose" --job_name worker --task_index 2 --ps_hosts $ps_list --worker_hosts $workers_list &

ssh $WK_HOST; $worker_env; nohup numactl -m 0 python tf_cnn_benchmarks.py $worker_args --kmp_affinity="granularity=thread,proclist=[48-63,212-259],explicit,verbose" --job_name worker --task_index 3 --ps_hosts $ps_list --worker_hosts $workers_list &

Here, $ps_list and $workers_list are the comma-separated lists of hostname:port pairs for the parameter servers and worker hosts, respectively. $ps_args holds the arguments to the parameter server, such as --num_inter_threads and --num_intra_threads, while $worker_args holds the arguments to each worker, such as the model name, batch_size, data_format, data_dir, server_protocol, and the num_inter_threads and num_intra_threads values.
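Before launching, it can be useful to confirm the NUMA layout and MCDRAM mode that these affinity settings assume. The checks below use standard Linux tools and are not part of the BKM itself; in cache mode the MCDRAM is transparent to the operating system, whereas in flat mode it appears as one or more additional memory-only NUMA nodes in the numactl output.

# Inspect CPU and memory topology before pinning the parameter server and workers.
numactl --hardware                                             # NUMA nodes; flat-mode MCDRAM shows up as extra memory-only nodes
lscpu | grep -E 'Socket|Core|Thread|NUMA'                      # sockets, cores per socket, threads per core
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list # hardware-thread siblings of core 0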

BKM for Multi-Node Multi-Socket Distributed Training

Example 5: On two 1S Intel Xeon Phi processor 7250 based systems (68 cores per socket), 2-node distributed training with 4 TensorFlow worker instances per node and 1 parameter server (PS), each worker pinned to a sub-socket partition, can be specified and launched as follows:

PS_HOST_0="hostname1"

ps_list="hostname1:2218"

WK_HOST_0="hostname2"; WK_HOST_1="hostname3"

workers_list="hostname2:2223,hostname2:2224,hostname2:2225,hostname2:2226,hostname3:2227,hostname3:2228,hostname3:2229,hostname3:2230"

worker_env="export OMP_NUM_THREADS=15; export TF_ADJUST_HUE_FUSED=1; export TF_ADJUST_SATURATION_FUSED=1;"

common_args="--model resnet50 --batch_size 64 --data_format NCHW --num_batches 100 --distortions=True --mkl=True --local_parameter_device cpu --num_warmup_batches 10 --device cpu --data_dir '/path-to/TF_Records' --data_name imagenet --server_protocol grpc --optimizer rmsprop --ps_hosts $ps_list --worker_hosts $workers_list --display_every 10"

ps_args="$common_args --num_intra_threads 4 --num_inter_threads 2"

worker_args="$common_args --num_intra_threads 15 --num_inter_threads 4"

To start the Parameter Server:

ssh $PS_HOST_0; numactl -l python tf_cnn_benchmarks.py $ps_args --job_name ps --task_index 0 --ps_hosts $ps_list --worker_hosts $workers_list &

To start the workers on node 0:

ssh $WK_HOST_0; $worker_env; nohup numactl -m 0 python tf_cnn_benchmarks.py $worker_args --kmp_affinity="granularity=thread,proclist=[0-15,68-115],explicit,verbose" --job_name worker --task_index 0 --ps_hosts $ps_list --worker_hosts $workers_list &

ssh $WK_HOST_0; $worker_env; nohup numactl -m 0 python tf_cnn_benchmarks.py $worker_args --kmp_affinity="granularity=thread,proclist=[16-31,116-163],explicit,verbose" --job_name worker --task_index 1 --ps_hosts $ps_list --worker_hosts $workers_list &

ssh $WK_HOST_0; $worker_env; nohup numactl -m 0 python tf_cnn_benchmarks.py $worker_args --kmp_affinity="granularity=thread,proclist=[32-47,164-211],explicit,verbose" --job_name worker --task_index 2 --ps_hosts $ps_list --worker_hosts $workers_list &

ssh $WK_HOST_0; $worker_env; nohup numactl -m 0 python tf_cnn_benchmarks.py $worker_args --kmp_affinity="granularity=thread,proclist=[48-63,212-259],explicit,verbose" --job_name worker --task_index 3 --ps_hosts $ps_list --worker_hosts $workers_list &

To start the workers on node 1:

ssh $WK_HOST_1; $worker_env; nohup numactl -m 0 python tf_cnn_benchmarks.py $worker_args --kmp_affinity="granularity=thread,proclist=[0-15,68-115],explicit,verbose" --job_name worker --task_index 4 --ps_hosts $ps_list --worker_hosts $workers_list &

ssh $WK_HOST_1; $worker_env; nohup numactl -m 0 python tf_cnn_benchmarks.py $worker_args --kmp_affinity="granularity=thread,proclist=[16-31,116-163],explicit,verbose" --job_name worker --task_index 5 --ps_hosts $ps_list --worker_hosts $workers_list &

ssh $WK_HOST_1; $worker_env; nohup numactl -m 0 python tf_cnn_benchmarks.py $worker_args --kmp_affinity="granularity=thread,proclist=[32-47,164-211],explicit,verbose" --job_name worker --task_index 6 --ps_hosts $ps_list --worker_hosts $workers_list &

ssh $WK_HOST_1; $worker_env; nohup numactl -m 0 python tf_cnn_benchmarks.py $worker_args --kmp_affinity="granularity=thread,proclist=[48-63,212-259],explicit,verbose" --job_name worker --task_index 7 --ps_hosts $ps_list --worker_hosts $workers_list &

Here again, $ps_list and $workers_list are the comma-separated lists of hostname:port pairs for the parameter servers and worker hosts, respectively. $ps_args holds the arguments to the parameter server, such as --num_inter_threads and --num_intra_threads, while $worker_args holds the arguments to each worker, such as the model name, batch_size, data_format, data_dir, server_protocol, and the num_inter_threads and num_intra_threads values.
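Issuing one ssh command per worker quickly becomes error-prone as the node count grows. The sketch below shows one possible way to generate the eight Example 5 worker launches from nested loops; it assumes password-less ssh to the worker hosts, the variables defined above, and the same hard-coded sub-socket core ranges as the explicit commands.

# Illustrative sketch: start the 8 workers of Example 5 (4 per node) from nested loops.
hosts=("$WK_HOST_0" "$WK_HOST_1")
ranges=("0-15,68-115" "16-31,116-163" "32-47,164-211" "48-63,212-259")
task=0
for host in "${hosts[@]}"; do
  for r in "${ranges[@]}"; do
    cmd="$worker_env nohup numactl -m 0 python tf_cnn_benchmarks.py $worker_args --kmp_affinity=\"granularity=thread,proclist=[$r],explicit,verbose\" --job_name worker --task_index $task --ps_hosts $ps_list --worker_hosts $workers_list > worker_$task.log 2>&1 &"
    ssh "$host" "$cmd"        # variables are expanded locally before the command is sent to the node
    task=$((task + 1))
  done
done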

Best Known Methods for Optimized Intel Distribution of Caffe

Build Methodology for Caffe: To build the Intel Distribution of Caffe, follow the BKMs in the Intel-optimized Caffe repository: https://github.com/intel/caffe
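As a rough orientation only, a Makefile-based build of the Intel Distribution of Caffe typically follows the outline below; the authoritative prerequisites and Makefile.config settings (BLAS/MKL backend, MLSL for multi-node training, Python layer, and so on) are described in the repository README.

# Minimal build sketch; consult https://github.com/intel/caffe for the complete, authoritative steps.
git clone https://github.com/intel/caffe.git intelcaffe
cd intelcaffe
cp Makefile.config.example Makefile.config     # edit for your BLAS/MKL and Python configuration
make all -j"$(nproc)"                          # produces build/tools/caffe used in the examples below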

BKM for Single-Node and Multi-Node Multi-Socket Distributed Training Examples:

Example 6: On a 2S Intel Xeon Platinum 8170 processor-based system (26 cores per socket), distributed training with 2 Caffe worker instances per node (one per socket) can be specified and launched as follows:

WK_HOST="hostname"

CORES_PER_NODE=52

P=2   # MPI processes per node (one per socket)

N=2   # Total number of MPI processes (num_nodes * P; here 1 node * 2 processes)

CORES_PER_MPI_PROCESS=$(($CORES_PER_NODE / $P))

OMPTHREADS=$(($CORES_PER_MPI_PROCESS - 2))

export I_MPI_DEBUG=5; mpiexec.hydra -v -l -ppn $P -n $N -f $WK_HOST -genv OMP_NUM_THREADS $OMPTHREADS -genv KMP_AFFINITY 'granularity=fine,compact,1,0' $CAFFEDIR/build/tools/caffe train -solver $MODELDIR/solver.prototxt -engine MKL2017

Here, OMP_NUM_THREADS is the number of OpenMP threads used per MPI process, CAFFEDIR is the path to the Intel Caffe installation, and MODELDIR is the path to the directory containing the model prototxt files (for example, googlenet). $WK_HOST is passed to the -f option of mpiexec.hydra, which expects a machine file listing the node hostnames (for a single node, a file containing just that hostname).
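The same launch extends to multiple nodes by listing them in the machine file and raising the total process count. The sketch below illustrates a hypothetical two-node run with the same per-socket partitioning; it assumes an MPI/MLSL-enabled Intel Caffe build, Intel MPI's mpiexec.hydra, and placeholder hostnames hostname1 and hostname2.

# Illustrative two-node variant of Example 6 (2 MPI processes per node, one per socket).
printf "hostname1\nhostname2\n" > hostfile     # placeholder worker hostnames, one per line
P=2                                            # MPI processes per node
NODES=2
N=$((NODES * P))                               # total MPI processes
OMPTHREADS=$((52 / P - 2))                     # per-process OpenMP threads, as in Example 6 (cores per process minus 2)
export I_MPI_DEBUG=5
mpiexec.hydra -v -l -ppn $P -n $N -f hostfile -genv OMP_NUM_THREADS $OMPTHREADS -genv KMP_AFFINITY 'granularity=fine,compact,1,0' $CAFFEDIR/build/tools/caffe train -solver $MODELDIR/solver.prototxt -engine MKL2017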

Multi-Socket Inference Example

Example 7: On a 2S Intel Xeon Platinum 8170 processor-based system (26 cores per socket), multi-stream inference with 8 Caffe instances per node, each pinned to a sub-socket partition, can be launched as follows:

OMP_NUM_THREADS=6 KMP_AFFINITY="granularity=thread,proclist=[0-5,52-57],explicit,verbose" numactl -m 0 $CAFFEDIR/build/tools/caffe time -model $MODELDIR/train_val.prototxt -iterations $iters -engine MKL2017 -forward_only &

OMP_NUM_THREADS=7 KMP_AFFINITY="granularity=thread,proclist=[6-12,58-64],explicit,verbose" numactl -m 0 $CAFFEDIR/build/tools/caffe time -model $MODELDIR/train_val.prototxt -iterations $iters -engine MKL2017 -forward_only &

OMP_NUM_THREADS=6 KMP_AFFINITY="granularity=thread,proclist=[13-18,65-70],explicit,verbose" numactl -m 0 $CAFFEDIR/build/tools/caffe time -model $MODELDIR/train_val.prototxt -iterations $iters -engine MKL2017 -forward_only &

OMP_NUM_THREADS=7 KMP_AFFINITY="granularity=thread,proclist=[19-25,71-77],explicit,verbose" numactl -m 0 $CAFFEDIR/build/tools/caffe time -model $MODELDIR/train_val.prototxt -iterations $iters -engine MKL2017 -forward_only &

OMP_NUM_THREADS=6 KMP_AFFINITY="granularity=thread,proclist=[26-31,78-83],explicit,verbose" numactl -m 1 $CAFFEDIR/build/tools/caffe time -model $MODELDIR/train_val.prototxt -iterations $iters -engine MKL2017 -forward_only &

OMP_NUM_THREADS=7 KMP_AFFINITY="granularity=thread,proclist=[32-38,84-90],explicit,verbose" numactl -m 1 $CAFFEDIR/build/tools/caffe time -model $MODELDIR/train_val.prototxt -iterations $iters -engine MKL2017 -forward_only &

OMP_NUM_THREADS=6 KMP_AFFINITY="granularity=thread,proclist=[39-44,91-96],explicit,verbose" numactl -m 1 $CAFFEDIR/build/tools/caffe time -model $MODELDIR/train_val.prototxt -iterations $iters -engine MKL2017 -forward_only &

OMP_NUM_THREADS=7 KMP_AFFINITY="granularity=thread,proclist=[45-51,97-103],explicit,verbose" numactl -m 1 $CAFFEDIR/build/tools/caffe time -model $MODELDIR/train_val.prototxt -iterations $iters -engine MKL2017 -forward_only &
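After starting the streams, it is worth confirming that each process is actually confined to its intended cores and memory node. The checks below use standard Linux tools and are not part of the BKM; the pgrep pattern assumes the caffe binary path shown above.

# Verify CPU affinity and memory locality of the running inference streams.
for pid in $(pgrep -f "build/tools/caffe"); do
  echo "PID $pid"
  taskset -cp "$pid"                       # list of CPUs the process may run on
done
numastat -p $(pgrep -f "build/tools/caffe" | head -1)   # per-NUMA-node memory usage of one stream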

Platform Configurations

Intel Xeon Platinum 8168 Processor

2S Intel Xeon Platinum 8168 CPU @ 2.70GHz (24 cores), HT enabled, turbo disabled, scaling governor set to “performance” via intel_pstate driver, 192GB DDR4-2666 ECC RAM. CentOS Linux release 7.3.1611 (Core), Linux kernel 3.10.0-514.10.2.el7.x86_64. SSD: Intel SSD DC S3700 Series. Multiple nodes connected with 10Gbit Ethernet.

Intel Xeon Gold 6148 Processor

2S Intel Xeon Gold 6148 CPU @ 2.40GHz (20 cores), HT enabled, turbo disabled, scaling governor set to “performance” via intel_pstate driver, 192GB DDR4-2666 ECC RAM. CentOS Linux release 7.3.1611 (Core), Linux kernel 3.10.0-514.10.2.el7.x86_64. Multiple nodes connected with Intel Omni-Path Architecture Host Fabric, Intel OPA Interface Driver version 10.4.2.0.7. SSD: Intel SSD DC S3700 Series.

Intel Xeon Platinum 8170 Processor

2S Intel Xeon Platinum 8170 CPU @ 2.10GHz (26 cores), HT enabled, turbo disabled, scaling governor set to “performance” via intel_pstate driver, 384GB DDR4-2666 ECC RAM. CentOS Linux release 7.3.1611 (Core), Linux kernel 3.10.0-514.16.1.el7.x86_64. Multiple nodes connected with Intel OPA Host Fabric, Intel OPA Interface Driver version 10.4.2.0.7. SSD: Intel SSD 800GB DC S3700 Series.

Intel Xeon Phi Processor 7250

1S Intel Xeon Phi processor 7250, 68 Cores, 4 HW Threads per core, 1.4 GHz, 16GB high-speed MCDRAM set in Cache-Quadrant mode, 32KB L1 data cache per core, 1MB L2 per two-core tile, 96GB DDR4, Multiple nodes connected with Intel OPA Host Fabric, Intel OPA Interface Driver version 10.4.2.0.7, Intel SSD 480GB DC S3500 Series, Software: CentOS Linux release 7.3.1611, Linux kernel 3.10.0-514.10.2.el7.x86_64, Intel MPI Library 2017 Update 4.
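The Intel Xeon processor configurations above set the scaling governor to "performance" via the intel_pstate driver. The commands below show one common way to check and, with root privileges, change this setting on Linux; the cpupower utility may need to be installed separately (for example, from the kernel-tools package).

# Check the active frequency-scaling driver and governor.
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Set the performance governor on all CPUs (requires root).
sudo cpupower frequency-set -g performance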

References

  1. TensorFlow* Optimizations on Modern Intel® Architecture
  2. https://github.com/intel/caffe/
  3. Optimizing Applications for NUMA
  4. http://man7.org/linux/man-pages/man3/numa.3.html
  5. Thread Affinity Interface (Linux* and Windows*)
  6. Process and Thread Affinity for Intel® Xeon Phi™ Processors
  7. https://www.open-mpi.org/doc/v2.0/man1/mpiexec.1.php

Authors

Vikram Saletore is a Principal Engineer and Machine Learning and Deep Learning Performance Architect who leads the Performance Enabling team within Customer Technical Solutions in the Artificial Intelligence Products Group at Intel Corporation, covering Intel® Xeon® and Intel® Nervana™ products. He has delivered optimized parallel database software to ISVs (Oracle, Informix), delivered ML analytics optimizations on Apache Spark to Cloudera, led joint research with HP Labs, and more recently served as co-PI for deep learning research with SURFsara. Prior to Intel, Vikram was a faculty member in Computer Science at OSU, Corvallis, OR, where he led NSF-sponsored ($300K) research in parallel programming and distributed computing and supervised 8 graduate students (PhD, MS). He also worked for AMD and DEC on network and CPU architectures. Vikram received his PhD in Electrical Engineering from the University of Illinois at Urbana-Champaign and an MSEE from Berkeley. He holds six patents, with two pending, and has more than 40 research publications.

Deepthi Karkada is a Machine Learning Engineer on the Performance Enabling team within Customer Solutions in the Artificial Intelligence Products Group at Intel Corporation. She works on deep learning framework and platform optimizations and benchmarking targeting Intel Xeon architecture and Intel Nervana products. Earlier, she worked on the seamless integration of Intel® Math Kernel Library with Apache Spark for machine learning and data analytics on the Cloudera* Distribution of Hadoop*.

Vamsi Sripathi has been a Software Engineer at Intel since 2010. He holds a master's degree in Computer Science from North Carolina State University, USA. During his tenure at Intel, he has worked on the performance optimization of Basic Linear Algebra Subprograms (BLAS) in Intel Math Kernel Library across multiple generations of Intel Xeon and Intel Xeon Phi architectures. More recently, he has been working on the optimization of deep learning algorithms and frameworks for Intel architecture and Intel Nervana products.

Kushal Datta is a Research Scientist on the Performance Enabling team within Customer Solutions in the Artificial Intelligence Products Group at Intel Corporation. His interests are in machine learning, deep learning, systems performance optimization, and CPU micro-architecture. He is one of the lead authors of TileDB, a performant storage library for multi-dimensional arrays, and GenomicsDB, a genomics data storage system used in GATK 4.0. Prior to Intel, Kushal graduated from the University of North Carolina at Charlotte, where he won a $40,000 research grant for developing a cycle-accurate CPU simulator for the SPARC V9 instruction set with Sun Microsystems*. He holds four patents and has several research publications.

Ananth Sankaranarayanan is the Director of Engineering leading the AI Solutions and Applied Machine Learning teams in the AI Products Group at Intel Corporation. He is responsible for enabling and scaling the Intel Xeon and Intel Nervana AI product portfolio worldwide across cloud service providers, enterprise, government, and communication service providers. Ananth has been with Intel since 2001 in various engineering leadership roles; he has received an Intel Achievement Award for delivering Intel's first production high-performance computing capability and more than 30 divisional recognition awards. Ananth earned a B.E. in Computer Science and Engineering and an MBA in Information Systems. He holds two patents and has authored several technical publications.
