
Intel Xeon Phi Coprocessor April 2013 Developer Webinar Q&A Responses


Answers for the questions raised during the April session of our Introduction to High Performance Application Development for Intel® Xeon® processors & Intel® Xeon Phi™ coprocessors class have been assembled.  There were some duplicates and other questions we couldn't decipher, either because of the wording or because of implied context that was not spelled out.  We tried to address the rest, which appear below:

Final Questions and Answers

Q: It would be cool if we were able to click on the references while the presentation is in progress. However, the lack of access to slides during presentations is a bottleneck.
A: The presentations will be made available after the Webinar, along with this Q&A session in which you are participating (if you are reading this, it has arrived :-)).  With the presentations in hand, you will be able to follow or save the links presented.

Q: What happens on not 64-byte aligned access? Exception or wrong data loaded?
A: Neither.  GP faults are issued ONLY if the memory operand linear address is not aligned to the data size granularity dictated by the UpConv or SwizzUpConv mode (whichever is applicable for a particular instruction).  As long as element size boundaries are respected, the vector operation should proceed without fault.  However, 64-byte alignment will be faster, since the whole vector on non-gather loads can be read with a single cache-line read, instead of having to read two adjacent cache lines to capture the full 64-byte result.

Q: Why does not compiler automatically align if it detects that code should be generated for MIC?

A: The memory allocator for C/C++ is in libc, outside the control of the compiler.  Without intervention, programs are at the mercy of the OS, which decides where to allocate the data.  Intel Fortran aligns on 16-byte boundaries by default, although this may change in a future release.  In either case, remember that aligning data is just step 1 of a two-step process: the compiler has to KNOW the data are aligned in the code where they are used.  Data are often not allocated in the same source files where they are used, so a compiler cannot make assumptions about how they are aligned.  That is why step 2 is to use pragmas/directives to tell the compiler that the data are aligned.
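
For illustration, here is a minimal C sketch of the two steps, assuming the Intel compiler; the array name and sizes are hypothetical:

    #include <immintrin.h>   /* _mm_malloc / _mm_free */

    void scale(float *a, int n) {
        __assume_aligned(a, 64);          /* step 2: promise alignment at the use site */
        for (int i = 0; i < n; ++i)
            a[i] *= 2.0f;
    }

    int main(void) {
        int n = 1024;
        float *a = _mm_malloc(n * sizeof(float), 64);  /* step 1: 64-byte aligned allocation */
        scale(a, n);
        _mm_free(a);
        return 0;
    }

Placing #pragma vector aligned immediately before the loop is an alternative way to assert alignment for all arrays accessed in that loop.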

Q: Will the -ax option work on AMD CPUs? -m works, -x does not in my experience.
A: Applications built with -ax… will run on non-Intel processors, but can only execute a single, default code path. Code paths corresponding to additional instruction sets specified by the -ax switch can only be executed on Intel processors.

Q: How may good performance be achieved by the new Vector unit on 8- or 16-bit data?
A: The new vector unit operates on 512-bit data in chunks sized as 32- and 64-bit elements: floats and integers. The coprocessor is not intended for text/byte-stream processing. Check out the zmmintrin.h file in the compiler's include directory -- there you'll find all the intrinsics that cover the entire Instruction Set Architecture. For more details, James Reinders's blog links to the architecture's specification, which you can find easily with a web search. In summary: there are only a few instructions with vector support for small integer types.
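
For instance, here is a minimal sketch using 512-bit intrinsics from that header on the natively supported 64-bit element type (compile natively with -mmic; the pointers must be 64-byte aligned):

    #include <immintrin.h>   /* pulls in zmmintrin.h when targeting the coprocessor */

    /* Add 8 doubles at a time; a, b, and c must be 64-byte aligned. */
    void add8(double *c, const double *a, const double *b) {
        __m512d va = _mm512_load_pd(a);
        __m512d vb = _mm512_load_pd(b);
        _mm512_store_pd(c, _mm512_add_pd(va, vb));
    }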

Q: Cache size: If I have multiple threads per core, the 512K of L2 is shared?  That is, the average per thread cache size is divided by the number of threads?
A: The terms "shared" in this context is usually not a disadvantage since threads are able to freely work in common, and help each other e.g. to just rely on what's loaded from memory already.

Q: L2 transactions: when 4 threads access the same cache line without overlap, how many transactions does it take from L2 to L1 cache?
A: There's a lot left unspecified in this question, which we can only guess about.  Are the 4 threads on a single core?  Are the accesses reads or writes?  If at first access the line is not in the L1, then it's at least 11 cycles to get it from the L2 (assuming it's there), or much longer if memory has to be accessed.  If the other thread accesses are concurrent requests for reads, those requests may also wait on the arrival of the cache line. If those reads come a little later, they may be satisfied by inclusion from the previous request into the local L1.  If on the other hand one of those thread accesses was a write, the L1 store-to-load penalty is 12 clocks.

Q: Is it correct to assume that cache coherence is implemented using a directory and not snooping? With 60+ cores I would think that cache to cache transfers are quite expensive. Is that right? I would also think that sync primitives such as critical regions and spinlocks would be much more expensive than other Xeon systems. Is that correct?
A: The caches run a full MESI protocol for coherence; however, each core has a segment of a Distributed Tag Directory to minimize the impact of snooping on the ring. Full cache coherence is expensive but the programming styles between host and coprocessor would have to be very different to support cache coherence on one but not the other.

Q: How many levels of cache for each Phi processor and what is the size of each level? I am unable to find this information on the internet, but perhaps I haven't looked hard enough.
A: Much of this information is available at the Intel Developer site dedicated to the Intel Xeon Phi coprocessor, http://software.intel.com/en-us/mic-developer (look at the ISA and Software Developers Guide manuals there).  The coprocessor memory hierarchy has two levels of cache: separate 32 KB L1 instruction and data caches, and a 512 KB per-core L2.

Q: Which is the best balance of memory in the server for 2 Xeon Phi cards, if we are using Sandy Bridge processors at 2.6 or 2.7 GHz?
A: It really depends ultimately on the needs of your application.  If you have a large data footprint and especially if you are staging data for delivery to two coprocessors on the node, then having more host memory should help, but we have no obvious rules of thumb.  With two coprocessors, each with 8 GiB of memory onboard, I would expect a 16 GiB host at the minimum.

Q: The most common HPC applications using CPUs take 2 or 4 GB of RAM per core for correct functionality (without using swap). In the case of Teslas it's 4 or 8 GB depending on the number of Teslas, but in the case of Xeon Phi, what is the metric for defining the capacity of RAM in a server with 2 Xeon Phi cards, if we are using Sandy Bridge processors?
A: Well, if you want the measure you're pointing out -- just relate it to the main memory available on current Intel Xeon Phi Coprocessors. It's 8 GiB, though some of the SKUs provide a lesser amount of memory (6 GiB). You can also expect to have a moderate increase in the main memory size on upcoming coprocessors. Having a coprocessor attached to a host will generally require more host memory than if the host were computing alone, to support buffering needs and such, but ultimately it really depends on your application.  The host processor can carry lots more than 8 GiB but how much you need for buffering to the 8 GiB max in the coprocessor depends on the size of data in your application.  Our machines around here typically have 32 GiB on the host to complement the 8 GiB on each coprocessor.  By the way, the E5 platform (dual socket) is able to fit 768 GiB of main memory. 

Q: Can I get a Phi card (PCI, etc) to drop in my desktop to use, or is it only for the Xeon server-type machines?  That is, can I get one to use in a Sandybridge machine?  Est. of cost?
A: There are requirements for attaching the Intel Xeon Phi coprocessor card to a host system that include providing more than the usual PCI Express I/O address space.  There are also power and cooling requirements that must be met.  We suggest that you work with your equipment provider to verify that what you acquire will be a viable system, with adequate power, cooling and interface performance to support the coprocessor.  Usually we expect that coprocessor systems will be integrated packages including hosts and coprocessors matched to each other to ensure compatibility. Please go to http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-coprocessor-where-to-buy.html to find out where to buy Xeon Phi systems.  The vendors there can advise you on what is available within their product lines - servers and workstations - and can also provide you with pricing.

Q: How many co-processor cards are supported in one blade system?
A: There are no limits from the coprocessor on the number of coprocessors per node. However, available power, cooling, PCI Express bandwidth and just physical space may provide a practical limit to the number of coprocessors that can effectively operate on a particular node with a particular configuration.
Q: Is there a maximum limit to the number of co-processors on a node?
A: It is limited by the physical number of PCI-e slots in a system.  There are vendors out there who can configure a node with up to 8 Xeon Phi cards.

Q: Does the Phi processor pipeline use a reorder buffer (ROB)? What is the size of the ROB? Does the processor pipeline do register renaming? How many zmm sized registers can be obtained after taking renaming into account? How many register read ports does each processor have?
A: The current Intel Xeon Phi coprocessor is built on an in-order design, so there is no Re-Order Buffer present in the architecture.  Each core supplies enough registers to support the requirements for four hardware threads.  Read ports are not used for internal registers, only for reading cache lines in from the memory hierarchy.

Q: Is there tool to identify the targeted architecture for a given executable made by -ax <XXX>?
A: There are a lot of CPU utility tools like CPUZ that will give you the CPU ID.  If there are conditional architectures chosen at compile time, then the CPU ID will be read at run time and the appropriate code path selected for execution.
Q: Sorry, I meant -xHost
A: As far as I remember, this works on non-Intel processors. It's also covered by our documentation as well as the command-line help of the compiler. Let me look it up: yes, it works with AMD CPUs!

Q: Do I lose half of the flops if I do not use FMA instructions?
A: Yes! The peak performance is rated based on FMA. Please note, this is the case for almost all comparable hardware (GPGPU). However, we show throughout this course how to exploit the flexibility of Intel Xeon Phi coprocessors using multiple levels of parallelism: instruction-level parallelism, SIMD, and multicore.

Q: What is the theoretical floating point peak performance of the Xeon Phi SE10? What fraction is being attained by dgemm()?
A: One can easily calculate this. Let's assume double precision: a 512-bit vector register fits 512/64 = 8 DP elements. With FMA available, 16 floating-point operations are processed per clock per core; hence 16 x 61 x 1.090909 GHz gives you ~1065 GF/s (61 cores). So the peak performance is about ~1 TF/s.   Recent performance figures for DGEMM from Intel Math Kernel Library can be found at http://software.intel.com/sites/default/files/MKL_1101_KNC_GEMM.png (which is the same as in slide 23 of the Intel MKL presentation provided in this two-day session).

Q: Do you say that each core contains HT threads or physical threads?
A: The Intel Xeon Phi Coprocessor has what is referred to as "smart round-robin multithreading," which has not so far been equated with "Intel Hyper-Threading Technology (Intel HT Technology)," though conceptually they are similar: a context peripheral to the ALUs that tracks registers and instruction flow through the processor.  However, Intel HT Technology was conceived as a scheme to provide more work to an underutilized core and provide additional latency hiding, keeping the CPU busy with one thread while the other was waiting for memory.  It was possible to saturate such a core from a single thread, and HPC workloads often were configured to run with it turned off.  By contrast, multithreading in the new coprocessor is backed by more horsepower in each core and the shared memory hierarchy, and has new smarts in thread scheduling to maximize effective use of the core cycles.  In fact, the nature of the instruction decoder means that a minimum of two threads must be run on each core just to have access to all the core's resources.

Q: Is the instruction set compatible with Itanium Architecture?
A: No. The Intel® Xeon Phi™ core is based on the Pentium® x86 ISA. It even includes x87, to allow a painless transition for existing code -- assumptions made in the past will hold, and even complicated code will just work. However, this compatibility does not extend to any of the SIMD instructions available in previous Intel architectures, nor to the IA-64 (Itanium®) architecture.

Q: If the host has Infiniband, can MIC use Infiniband? 
A: The Intel Xeon Phi coprocessor can take advantage of InfiniBand* hardware that may be present, but be sure that the PCI Express bus on your system is adequate if you try to install multiple coprocessors on a single host node.  We've seen cases where the IB card can only service a single coprocessor because of the configuration of the PCI Express bus.

Q: My intended clients for the Xeon Phi are currently using one or more (up to 3) Graphics Processing Unit PCI-e boards in a system (C2075, C2090s).  How will the Xeon Phi compare against these GPUs?  The application is video imaging.
A: Intel Xeon Phi coprocessors are competitive in performance with modern GPUs but it would be foolhardy to try to speculate how a particular unknown application will fare when ported to the Intel Xeon Phi coprocessor.  It depends a lot on how effectively your code can make use of the coprocessor's features.

Q: Do you say that the Phi core supports SSE instructions?
A: Intel Xeon Phi coprocessors support a vector instruction set, but it is not Intel® Streaming SIMD Extensions (Intel® SSE) or any of the other vector instructions previously provided on Intel architectures.

Q: What is the capacity of the disk space in the Xeon Phi for applications?
A: There is no disk or other such offline storage attached directly to the coprocessor, and in fact the file system you can see is cut from the memory available on the coprocessor.  So ultimately the capacity is 8 GiB; however, using most of that may leave the coprocessor crippled, so we advise against that. 
Q: Do we need to make a mount point in the Xeon Phi's file system for our applications?
A: You don't need to use NFS.  You can copy data as needed and it persists in the RAM disk.  However, it is often convenient to NFS-mount to the coprocessor to avoid the copy and conserve memory.  Note, though, that NFS runs over TCP/IP routed through the PCI Express bus, and performance will certainly be lower than reading from local files housed in the virtual file system.

Q: Regarding the 1+ teraflops demonstration on a single Phi card -- was that done using just one thread per Phi core or more? If only one thread per Phi core, is it correct to assume that the 32*3=96 zmm registers' capacity meant for the three other threads was unused? Thanks!
A: I don't know the particulars on the configuration that showed that performance, but it is unlikely that it was done with only one thread active per core.  As mentioned yesterday, the multi-stage decoder means that you cannot reach peak performance on the coprocessor with only a single thread active per core.  So at least two threads per core would be required to achieve peak performance.  I don't know whether the specific test mentioned could use all four threads per core or instead ran with three (or two).  But it should be a minimum of 2.

Q: Is it one Vector unit for 4 threads on core?
A: Yes, 1 vector unit per core, 4 thread contexts on the core.  The vector unit comes with enough vector registers to support the 4 threads running concurrently and many of the vector instructions supported have a 1 clock throughput and a 4-clock latency, meaning that when any particular thread gets its next CPU cycle, a vector operation initiated during its previous execution slot is likely to be ready now.

Q: Will the new vector instruction set be converged with Xeon vector instruction set in the future? Or the Xeon Vector in the future?
A: We do not talk about unreleased architectures, but I think it is safe to assume, given Intel's history, that vector architectures will continue to evolve as our development efforts move forward.  Stay tuned, to see what comes next!

Q: If I have only one Xeon Phi card is it enough to buy Intel C++ Composer XE 2013 for compiling applications for the Xeon Phi? Conversely, is it correct that Cluster Studio XE is only necessary if I have a cluster of Xeon Phi's?
A: You can buy C++ by itself.   It has full support for programming on the coprocessor.  In addition, if you want Intel MPI and Intel Trace Analyzer and Collector, you will need Intel Cluster Studio, which bundles compilers with these additional tools for cluster support.

Q: What are the steps for cross-compiling and debugging? This question is in context of the last point on slide 12 of the debugger presentation.
A: Cross-compilation of native code is invoked via the compiler switch -mmic.  Then you can just invoke your program via ssh, and debug it under gdb as an ordinary (not offloaded) program.
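
A minimal sketch of that workflow, assuming the card is reachable as mic0 and the Intel compiler is on the host:

    /* hello_mic.c
     *   Cross-compile on the host:   icc -mmic hello_mic.c -o hello_mic
     *   Copy it to the card:         scp hello_mic mic0:/tmp
     *   Debug it like any program:   ssh mic0, then gdb /tmp/hello_mic
     */
    #include <stdio.h>

    int main(void) {
        printf("Hello from the coprocessor\n");
        return 0;
    }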

Q: Is there no support for the DDT parallel debugger?
A: DDT* should work, as do the other parallel debuggers (TotalView*, GDB, and IDB), but you'll have to check with Allinea whether DDT officially supports running on the Intel Xeon Phi coprocessor.

Q: Are there online Xeon Phi machines available that can be used for evaluation and getting a first hands-on impression on how it works and behaves?
A: Unfortunately, no there are not.

Q: How can I ensure 64-byte alignment of data in Fortran?
A: Compiler option (ifort): -align array<N>byte, where <N> is 64, i.e. -align array64byte.
Q: Which report level informs me that a function/subroutine was successfully aligned in Fortran? Is a 'pure' declaration and the same file scope enough?
A: Alignment is not an attribute whose status is provided in any report.  Alignment mostly affects success in vectorization, so using -vec-report5 should provide reasons why particular loops were not vectorized, some of which may be related to alignment issues. "PURE" declares that a user-defined procedure is without side-effects.  It is unrelated to data alignment.  Likewise, sharing file scope has no effect on data alignment.

Q: What about OpenCL support?
A: We have support for OpenCL* on Intel Xeon Phi coprocessors.  You can find out more at http://software.intel.com/en-us/blogs/2012/11/12/introducing-opencl-12-for-intel-xeon-phi-coprocessor.
Q: What are benefits of using OpenCL vs. OpenMP/ Cilk, on Phi?
A: Probably the principal benefit is that if you have already written OpenCL code for some computational operation, you can run it on the Intel Xeon Phi coprocessor alongside your Fortran, C and C++ codes. We are not advocates of OpenCL but we are supporters.  If you find that OpenCL satisfies your requirements for portable computation code, then we want your OpenCL experience on our coprocessor to be stellar.

Q: Is OpenCL supported in an MPI context?
A: There's nothing in our library that prevents running OpenCL code.  But it's not something we test on.  If you're having issues, get in touch with us via Intel Premier Support (http://premier.intel.com) or our online Intel Clusters and HPC forums (http://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/).

Q: Can Gcc be used for generating code for Xeon Phi? Same question for Clang.
A: Yes, in fact we do so in order to compile the embedded Linux that runs on the coprocessor (along with other components of the SW stack). For performance workloads ("end user code"), we believe that more work is required from us and the community in order to meet performance expectations using the GNU GCC tool chain. With respect to LLVM, Intel participates in the development. In fact, we are respected contributors with commit permissions for the LLVM project. The latter also includes LLDB, etc.

Q: Can older compiler versions support co-processor functionality?
A: The Intel Xeon Phi coprocessor is supported in Intel Compilers versions 13.0 and 13.1; no older compilers support the coprocessor.  The 13.0 compiler was first released in September 2012.

Q: How can I disable generation of __svml_* dependencies?
A: For C/C++ use -ffreestanding; for Fortran use -nolib-inline. Alternatively, link with -lc -lm to use libc and libm explicitly.
Q: I'd like to have a library which doesn't depend on Intel's svml library
A: Yes, there is a compiler option to link against the static libsvml.a rather than the dynamic/shared library. Please note, SVML is used to vectorize math functions within loops, which even includes transcendental functions, etc.
Q: Static linkage with libsvml.a is not a solution.
A: Let me just sketch a workaround -- you may be able to use #pragma novector for the loops that contain higher-order math functions or transcendental functions; this way you get rid of calls to those intrinsics. By the way, libsvml should not be a problem on non-Intel processors!
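
A minimal sketch of that workaround (the loop and names are hypothetical):

    #include <math.h>

    void expv(double *a, const double *b, int n) {
        #pragma novector             /* don't vectorize, so exp() stays a scalar libm call */
        for (int i = 0; i < n; ++i)
            a[i] = exp(b[i]);        /* no __svml_exp* reference is generated */
    }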

Q: Do I always need OpenMP support when linking for CAO (compiler-assisted offload) mode?
A: Practically speaking, yes you do. To get good performance out of the Intel® Xeon Phi™ coprocessor with its Intel Many Integrated Core (Intel MIC) architecture you really must make use of parallelism. Native Intel Math Kernel Library (Intel MKL) functions will not provide the performance you seek if you run them in sequential mode. Even if you have used other means of parallelism in your program at a higher level, it is likely that you'll want to try a nested scheme where Intel MKL is using the 4 threads per core that are necessary to keep the coprocessor running efficiently.
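
As a hedged sketch of such a nested scheme, an Intel MKL call placed inside an offload region lets MKL spawn its own OpenMP threads on the card (array names and sizes are hypothetical):

    #include <mkl.h>

    void gemm_on_card(const double *A, const double *B, double *C, int n) {
        #pragma offload target(mic:0) in(A, B : length(n*n)) inout(C : length(n*n))
        {
            /* MKL threads across the coprocessor's cores, up to 4 threads per core */
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        n, n, n, 1.0, A, n, B, n, 0.0, C, n);
        }
    }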

Q: Considering BLAS Level 1 and 2 - you have experience regarding MKL performance vs compiling the kernel from Fortran or C?
A: Intel Math Kernel Library has been tuned for performance on Intel Xeon Phi coprocessors.  An arbitrary kernel, written in Fortran or C, will not start out with the benefit of similar tuning.  Therefore, I would expect that an application already making use of Intel MKL will get to good performance on the coprocessor faster than one whose kernels you must tune by hand to the same degree.

Q: Is a Cluster Studio XE license required for running Intel MPI over PCIe on a single host + MIC system?
A: The MPI runtime for Intel MPI is required and is provided free of charge.  The development tools in Intel Cluster Studio XE are sold separately.

Q: Can you say something about NFS performance when running with up to 240 threads on the Phi accessing files exported with NFS from the host? Should NFS be tuned on the server to handle the parallelism, or must programs be rewritten to channel file access through one/few tasks?
A: The challenge in going to such a high core count is that the multiplier of the number of hardware threads can quickly overwhelm the available resources.  Consider: 1) Assuming the NFS file system is mounted from a remote server, your code will be limited by Ethernet performance.  Even with gigabit Ethernet, that's only around 120 MB/sec total bandwidth.  Divide that between 240 HW threads all demanding a share, and even with optimal scheduling that leaves only 0.5 MB/sec or so per thread, which is not a lot.  2) Add to that the overhead of 240 threads issuing I/O requests to the MPSS kernel on the coprocessor, and you're likely to see a lot of thrash as that kernel tries to deal with the additional load.  Therefore, it's probably advisable to keep the number of I/O processes relatively small, maximize their throughput to get the data to the coprocessor memory, and then distribute the data to the remaining threads via memory operations on the coprocessor.  Remember that a single thread on a (UDP-mounted) NFS share can handle up to 80 MB/sec.  We suggest starting with just four I/O threads feeding the rest of the available threads (reducing their total number to allow room for the I/O threads) and varying from there to find the best performance.

Q: Do we have any pre-execution commands before offloading commands are executed on the coprocessor?
A: Do you mean some "trick" to lower the overhead of offloaded execution? Please note, the offloaded code is statically compiled so there is no JIT invocation or something similar that asks you to do some "compilation".

Q: Fortran allocatable is not available to be offloaded?
A: Allocatable arrays can be offloaded; allocatable arrays inside derived types cannot.  More precisely, while you can offload an allocated array component of a derived type BY ITSELF, you cannot offload the entire structure.

Q: Where will offload functions be run? Which core is chosen? How can this be controlled?
A: The offload pragma/directive supports a target() clause whose argument specifies the offload device and can select between multiple coprocessors.  TARGET(MIC:0) represents the first card and TARGET(MIC:1) the second, whereas TARGET(MIC:-1) means pick any available coprocessor for this offload.  The offload pragma/directive is not a loop construct; it's capable of much more, e.g. offloading an entire call chain.  The latter also implies it's up to your code to parallelize the code running on the coprocessor using whatever programming model you prefer, e.g. OpenMP, Intel TBB, or Intel Cilk Plus.
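
A minimal C sketch combining the target() clause with OpenMP inside the offloaded region (names hypothetical):

    void scale_on_card(const float *a, float *b, int n) {
        #pragma offload target(mic:0) in(a : length(n)) out(b : length(n))
        {
            #pragma omp parallel for        /* your code parallelizes on the card */
            for (int i = 0; i < n; ++i)
                b[i] = 2.0f * a[i];
        }
    }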

Q: I did not mention loops. With two different offloaded functions, where will they run?
A: As Robert and Ron both mentioned, there is a target clause in both explicit offload and the _Cilk_offload construct (_Cilk_offload_to) that specifies the coprocessor. The core that's used is determined by the threading model mentioned in an earlier answer.

Q: In the OpenMP example shown on slide 27, will the data be transferred automatically or is that the user's responsibility?  I didn't see any data transfer statements in the example - sorry if I missed.
A: No, you didn't miss anything.  It's true that the B1 and B2 arrays in the slide in question are not mentioned in the offload statements, but they are mentioned in the offloaded code and presumably are defined in the enclosing scope of the offloaded section.  Without further qualification in the offload pragma, the compiler will allocate space for B1 and B2 on the coprocessor and manage the copying of their entire contents to and from the coprocessor around the boundaries of the offload section.  Remember that the in and out clauses effectively limit the data transferred to or from the coprocessor in an offload section and can be used along with inout, nocopy and offload_transfer/offload_wait to provide additional control (like partial array copies, data persistence between offloads, and asynchronous transfers), as sketched below.
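
As a hedged sketch of those extra controls, data can be staged once and reused across offloads with the alloc_if/free_if modifiers and nocopy (names hypothetical):

    void stage_then_compute(float *B1, int N) {
        /* copy B1 to the card once and leave it allocated there */
        #pragma offload_transfer target(mic:0) in(B1 : length(N) alloc_if(1) free_if(0))

        /* a later offload reuses the resident copy without retransferring it */
        #pragma offload target(mic:0) nocopy(B1 : length(N) alloc_if(0) free_if(1))
        {
            /* ... compute with B1 on the coprocessor ... */
        }
    }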

Q: What are the differences between programming models in terms of performance on Phi?
A: The reason that Intel offers multiple programming models for using Intel Math Kernel Library on Intel Xeon Phi coprocessors is that performance is highly dependent on the program being ported.  A general rule of thumb here is that the amount of performance boost is proportional to the amount of effort put into optimization, though the underlying algorithm is the ultimate limiter of coprocessor performance: start with an algorithm that doesn't adapt well to the coprocessor architecture and no amount of effort will help to tune it.

Q: Naive question: if software prefetch does not jeopardize correctness, why isn't it automatically done by the compiler?
A: In fact, the compiler DOES automatically insert prefetch instructions to try to optimize data availability in loops and similar constructs, but it's a very hard problem.  Knowing how far ahead to prefetch data depends to some degree on the amount of computation in a loop and how long the loop takes to execute: a short loop needs to prefetch data further ahead to have it available in time than a loop that takes longer per iteration.  Our heuristics for automatic determination will improve with time, but still there are patterns that may not be recognized by the compiler.  That's why we provide both controls to adjust how the compiler places its prefetch instructions automatically and explicit prefetch pragmas/intrinsics that can be used for even more custom placement.
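
A minimal sketch of the explicit controls, using the compiler's prefetch pragma (the distances here are arbitrary illustrations, not tuned values):

    void sum(const double *a, double *out, int n) {
        double s = 0.0;
        #pragma prefetch a:1:16   /* prefetch a[] into L2, 16 iterations ahead */
        #pragma prefetch a:0:4    /* and into L1, 4 iterations ahead */
        for (int i = 0; i < n; ++i)
            s += a[i];
        *out = s;
    }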

Q: Is #pragma simd Intel proprietary?
A: No.  In fact, it's almost what you get with OpenMP 4.0 (pragma omp simd, etc.). Intel Composer XE 2013 Update 2 already contains a preview of OpenMP 4.0 including new offload pragmas.
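
For comparison, a minimal sketch of the two spellings (loop and names hypothetical):

    void mul(float *c, const float *a, const float *b, int n) {
        #pragma simd                  /* Intel Cilk Plus form */
        for (int i = 0; i < n; ++i)
            c[i] = a[i] * b[i];
    }

    void mul_omp(float *c, const float *a, const float *b, int n) {
        #pragma omp simd              /* the OpenMP 4.0 counterpart */
        for (int i = 0; i < n; ++i)
            c[i] = a[i] * b[i];
    }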

Q: What does I_MPI_MIC enable?
A: It's just a Boolean that permits MPI to make use of any coprocessor cards that might be present, and provides more flexibility for configuration.  Perhaps your app will work best where only the host is registered for MPI but makes use of offload to incorporate the coprocessor.  In that offload case, setting I_MPI_MIC=enable is not necessary.

Q: If I copy the MPI library onto the MIC, how much of the MIC memory will be used for this?
A: The libraries themselves sum up to about 150 MiB.  If you take a look at the <intel_mpi_install>/mic/lib, you can get an idea of how much space is needed.

Q: What are the options for "debug level"?
A: I suggest starting with I_MPI_DEBUG=3.  That should give you enough info to get you started on any problems.

Q: What does the presenter mean by "accelerating MPI rank"?
A: Unfortunately, I don't remember when I said this, but I probably meant using the Intel Xeon Phi coprocessor as an accelerator when doing offload.
Q: Just to clarify, this question is in context of running MPI natively on Xeon Phi.
A: Don't forget, you only have to copy the files in <intel_mpi_install_dir>/mic/lib and <intel_mpi_install_dir>/mic/bin.  You can do that manually or via some scripts as part of your provisioning system.

Q: Is mpiexec.hydra no longer required for running in symmetric mode?
A: Good question!  mpirun defaults to mpiexec.hydra in our latest version.  I use mpirun in my examples for uniformity but feel free to use mpiexec.hydra if that's how your scripts are set up.

Q: I couldn't run my one-sided application with MPI_Accumulate on Xeon Phi but could do so on Xeon (host). I was only able to run on Xeon Phi by replacing Accumulate with Get/Put and "O0" optimization. Any pointers on why this could be happening?
A: One-sided is certainly supported on the Intel Xeon Phi coprocessor.  You can see the API description in the libmpi documentation.  If you're having problems, I would recommend submitting a bug report at our Intel Premier Support site (http://premier.intel.com) or posting in our online Intel Cluster and HPC forum (http://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/).
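
For reference, a minimal sketch of the one-sided pattern in question, using only standard MPI-2 calls (buffer and rank names are hypothetical):

    #include <mpi.h>

    void accumulate_sum(double *win_buf, double *local, int n, int target_rank) {
        MPI_Win win;
        MPI_Win_create(win_buf, n * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);
        MPI_Win_fence(0, win);
        /* element-wise sum of 'local' into the window on target_rank */
        MPI_Accumulate(local, n, MPI_DOUBLE, target_rank,
                       0, n, MPI_DOUBLE, MPI_SUM, win);
        MPI_Win_fence(0, win);
        MPI_Win_free(&win);
    }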

Q: Is it possible to run Linpack via MPI to make use of multiple MICs?
A: Yes. Take a look at the benchmarks directory within the Intel MKL distribution and search for the ‘offload’ version of our MP LINPACK* benchmark. Our benchmark is also available online: http://software.intel.com/en-us/articles/intel-math-kernel-library-linpack-download

Q: If NFS is not available, then should the MIC executable be explicitly copied to all nodes that would be participating in running the MPI job?
A: Yes, indeed.  As with a traditional cluster, you need the libraries to be available on all nodes in order to run an application.  So make sure all Intel Xeon Phi coprocessors have the MPI libraries.

Q: General question: do we need to install Redhat on the host to accommodate a co-processor? or can we use an open-source Linux distribution such as Fedora or Ubuntu?
A: Officially we only support Redhat and SUSE on the host, but we've heard of CentOS distributions that seem to work fine with the coprocessor.  Beyond that so far, you're on your own.

Q: Please advise with regards to OS requirements
A: Currently Intel Xeon Phi coprocessors are supported by hosts running either Redhat or SUSE Linux, latest releases.

Q: Does it support only Linux?  What about Windows Platform?
A: Windows* early enabling for the coprocessor has just been announced.  Go to http://software.intel.com/en-us/articles/windows-early-enabling-for-intelr-xeon-phitm-coprocessor for more information.

Q: Why Linux only?
A: The initial target was HPC. More workstation-oriented use cases are being covered now.  Feel free to join our Beta program for Intel Composer XE SP1; it will support Windows-hosted Intel Xeon Phi coprocessors.

Q: Will the Xeon Phi run under WIN 7 & Server 20xx but can only be programmed under LINUX?
A: The coprocessor runs using our Intel Manycore Platform Software Stack (Intel MPSS), an open-source variant of Linux.  The host OSes available so far are flavors of Linux, both Redhat and SUSE, with intentions to expand the host support to other OSes soon.

Q: I want to know whether Intel offers support for making the changes needed to get the best performance from an application on the Xeon Phi.
A: Yes, our developer relations division does provide assistance to customers who are trying to optimize their code for coprocessor performance, and our software group / consultants can support this as well. Are you asking about a commercially available service? (So far, I have been talking about enabling work that just happens :-))  We also provide web-based education, documentation, samples, and a User Forum.  If you have an Intel Application Engineer assigned to your company already, they may help you with Intel Xeon Phi coprocessor porting and tuning.  If not, you may have to rely on web-based assistance.

Q: What common applications for HPC are ready for MIC?
A: A list of applications can be found on the web portal, http://software.intel.com/mic-developer.  The programs and libraries mentioned there essentially reflect reported benchmark results, taken from reports by customers who helped us during development of the Intel Xeon Phi coprocessor.
Q: Where we can see a complete list of the common applications like Nwchem, GROMACS, NAMD, etc., that have been optimized to get the best performance in Xeon Phi?
A:  A good starting point could be http://software.intel.com/mic-developer to ask about particular applications of interest, and to share your experience/expectation. Let us know if you are interested in performance of a particular program or workload.

Q: Can I use the Virtual Shared Memory with C structs?
A: Yes, in multiple ways: via the MYO (Mine-Yours-Ours) C API, and via Intel Cilk Plus. Note that Intel Cilk Plus, unlike the original Cilk, is made for both C and C++.
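
A minimal sketch of a shared C struct via those keywords (all names are hypothetical):

    typedef struct { double x, y; } point_t;

    _Cilk_shared point_t p;            /* placed in the virtual shared region */

    _Cilk_shared void move(void) {     /* compiled for both host and coprocessor */
        p.x += 1.0;
    }

    int main(void) {
        _Cilk_offload move();          /* runs on the card; p stays coherent */
        return 0;
    }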

Q: Will Virtual Shared Memory become available in Fortran in the near future?
A: At this time there are no plans for VSM for Fortran.

Q: What is the limit of Virtual Shared Memory?
A: It's half the physical memory available on the coprocessor system, or 4 GiB for a coprocessor that maxes out at 8 GiB of onboard memory.

 

