MIC QuickStart Guide


Aesyle is our testbed for MIC (Many Integrated Core) computing. The node is equipped with two (2x) Sandy Bridge Xeon E5-2630L processors at 2.0 GHz, two (2x) Knights Corner Xeon Phi 5110P coprocessors, 64 GB of memory and one 500 GB hard drive.

It is instructive to compare the double precision peak performance of Xeon E5-2630L vs. that of Xeon Phi 5110P:

  • Xeon E5-2630L: 96 GFLOPS = 2.0 (GHz) x 6 (cores) x 256/64 (AVX lanes) x 2 (simultaneous add + multiply)
  • Xeon Phi 5110P: 1.01 TFLOPS = 1.053 (GHz) x 60 (cores) x 512/64 (vector lanes) x 2 (FMA)

so the Xeon Phi 5110P is roughly 10 times as fast as the Xeon E5-2630L. Keep this ratio in mind when load-balancing between Xeon cores and Xeon Phi cores.

Accessing Aesyle

The node has a public IP address as well as a few private ones. The public hostname is aesyle.ucsc.edu (IP: 128.114.126.227). As long as you have a valid account on Hyades, you can SSH to Aesyle using the same username and password (or SSH key) as on Hyades:

$ ssh -l username aesyle.ucsc.edu

If you are already on Hyades, you can log onto Aesyle simply with:

$ ssh aesyle

To enable passwordless SSH connections to and from the Xeon Phi coprocessors, run the following commands the very first time you log onto Aesyle:

[aesyle]$ ssh-keygen -t rsa -N "" -f $HOME/.ssh/id_rsa -v
[aesyle]$ echo -n 'from="10.*" ' >> $HOME/.ssh/authorized_keys
[aesyle]$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
[aesyle]$ chmod 600 $HOME/.ssh/authorized_keys

Computing Environment

Aesyle's environment is almost identical to that of the other nodes in the Hyades cluster. The NFS shares are served from the FreeBSD file server via the private 10GbE network and mounted at /home and /trove, respectively. The Lustre file system is mounted at /pfs. Three modules are loaded by default on Aesyle, providing a basic environment for MIC computing:

[aesyle]$ module list
Currently Loaded Modulefiles:
  1) intel_mpi/4.1.3          2) intel_compilers/14.0.1   3) intel_mic

Each of the Xeon Phi coprocessors runs an embedded Linux OS in memory. When a coprocessor starts up, its boot loader loads a root file system and Linux kernel that are stored on the host system. I've customized the root file system so that /home and /pfs are mounted on the coprocessors too, providing a consistent user environment on both the host and the coprocessors.

The two Xeon Phi coprocessors are named mic0 and mic1, respectively. Once you are on Aesyle, you can log onto them using SSH:

[aesyle]$ ssh mic0

or

[aesyle]$ ssh mic1

Once you are in, you'll see a mostly familiar Linux environment. Feel free to explore its every nook and cranny. NOTE: the embedded Linux uses BusyBox to provide many standard UNIX tools; their options and behavior may differ slightly from those of the corresponding tools on the host Linux.

It is worth noting that the embedded Linux sees a total of 240 cores/processors in each Xeon Phi 5110P, although there are only 60 physical cores in each coprocessor:

[mic0]$ grep -c ^processor /proc/cpuinfo 
240

The Xeon Phi coprocessors utilize hardware multithreading on each physical core – 4 threads per core – to mask the latencies inherent in their in-order microarchitecture. This should not be confused with Hyper-Threading on Intel Xeon processors, which exists primarily to keep a dynamic execution engine fully fed. On Xeon Phi, the number of threads used per core is a tunable application parameter and should be set based on experience running the application.

MIC Execution Modes

There are three execution modes for running applications on the Xeon Phi coprocessors: native, symmetric and offload[1].

Native Mode

In native mode, applications run directly on the Xeon Phi coprocessors[2]. Here are some examples of how to build a native application that runs directly on an Intel Xeon Phi coprocessor and its embedded Linux operating system.

Serial Programs

To cross-compile hello.c, a serial "Hello world" program written in C, using the Intel C compiler, run:

[aesyle]$ cd /pfs/dong
[aesyle]$ icc -mmic hello.c -o hello.icc.k1om

NOTE

  • -mmic enables cross-compiling of applications for the MIC Knights Corner microarchitecture.
  • The default optimization level for the Intel compilers is -O2.
  • The binary, hello.icc.k1om, can run only on Xeon Phi coprocessors.

Since the Lustre file system is directly mounted on the Xeon Phi coprocessors, you can run the MIC binary natively with:

[aesyle]$ ssh mic0 /pfs/dong/hello.icc.k1om 
Hello, world!

or you can first log onto the coprocessor, then run the application:

[aesyle]$ ssh mic1
[mic1]$ /pfs/dong/hello.icc.k1om 
Hello, world!
[mic1]$ exit

If you prefer GCC, Intel MPSS comes with a few customized GCC utilities for building native MIC applications on the x86_64 host, located in /usr/linux-k1om-4.7/bin/. For example, to cross-compile hello.c using gcc, run:

[aesyle]$ /usr/linux-k1om-4.7/bin/x86_64-k1om-linux-gcc hello.c -o hello.gcc.k1om

NOTE

  • -march=k1om is the default option.
  • When running /usr/linux-k1om-4.7/bin/x86_64-k1om-linux-readelf -h against a MIC ELF binary, it shows the machine type (instruction set architecture) as Intel K1OM (0xB5). For a list of legal values for e_machine (architecture), check /opt/mpss/3.4.1/sysroots/k1om-mpss-linux/usr/include/elf.h.
  • By comparison, the machine type of an x86-64 ELF binary is Advanced Micro Devices X86-64 (0x3E).

OpenMP Programs

To cross-compile omp_hello.c, an OpenMP "Hello world" program written in C, using the Intel C compiler, run:

[aesyle]$ icc -mmic -openmp omp_hello.c -o omp_hello.k1om

where

  • -mmic enables cross-compiling of applications for the MIC Knights Corner microarchitecture.
  • -openmp enables the parallelizer to generate multi-threaded code based on OpenMP* directives.

You can run the OpenMP program natively on the coprocessor with:

[aesyle]$ ssh mic0 /pfs/dong/omp_hello.k1om

or you can first log onto the coprocessor, then run the OpenMP program:

[aesyle]$ ssh mic1
[mic1]$ /pfs/dong/omp_hello.k1om
[mic1]$ exit

MPI Programs

To cross-compile mpi_hello.c, an MPI "Hello world" program written in C, using the Intel C compiler and Intel MPI, run:

[aesyle]$ mpiicc -mmic mpi_hello.c -o mpi_hello.k1om

You can run an MPI session of 60 processes on the coprocessor with:

[aesyle]$ ssh mic0 mpirun -n 60 /pfs/dong/mpi_hello.k1om

or you can first log onto the coprocessor, then run the MPI program:

[aesyle]$ ssh mic1
[mic1]$ mpirun -n 60 /pfs/dong/mpi_hello.k1om
[mic1]$ exit

Symmetric Mode

In symmetric mode, applications run on both the host processors and the coprocessors at the same time.

Compile the MPI code (mpi_hostname.c) for both the x86-64 architecture and the MIC Knights Corner architecture:

[aesyle]$ cd /pfs/dong
[aesyle]$ mpiicc mpi_hostname.c -o mpi_hostname.x86-64
[aesyle]$ mpiicc -mmic mpi_hostname.c -o mpi_hostname.k1om

Run the MPI program on all the processor and coprocessor cores[3]:

[aesyle]$ mpirun -n 12 -host aesyle /pfs/dong/mpi_hostname.x86-64 : \
  -n 60 -host mic0 /pfs/dong/mpi_hostname.k1om : \
  -n 60 -host mic1 /pfs/dong/mpi_hostname.k1om

NOTE

  1. Here we start a total of 132 MPI ranks (processes), with 12 on the host, 60 on mic0 and 60 on mic1.
  2. The host runs the x64 executable mpi_hostname.x86-64 and the coprocessors run the MIC executable mpi_hostname.k1om.
  3. By default, Intel MPI uses InfiniBand if available. You can force the TCP network fabrics with -genv I_MPI_FABRICS shm:tcp.
  4. It is unnecessary to manually copy the executable and its dependencies (Intel libraries, MPI tools, etc.) to the coprocessors. The environment on Aesyle has been configured to ensure everything just works!

A shorthand way of doing this in symmetric mode is to use the -machinefile option of the mpirun command in coordination with the I_MPI_MIC_POSTFIX environment variable.

[aesyle]$ cd /pfs/dong/
[aesyle]$ ln -s mpi_hostname.x86-64 mpi_hostname
[aesyle]$ export I_MPI_MIC_POSTFIX=.k1om

Create a machine file (/pfs/dong/machines), with simple host:ranks pairs on separate lines:

aesyle:12
mic0:60
mic1:60

Now we can run the MPI application in symmetric mode, with a far simpler command:

[aesyle]$ mpirun -machinefile machines /pfs/dong/mpi_hostname

NOTE

  1. The host runs /pfs/dong/mpi_hostname, which is a symbolic link to the x64 executable /pfs/dong/mpi_hostname.x86-64.
  2. The I_MPI_MIC_POSTFIX environment variable instructs mpirun to add the .k1om postfix to the executable name when running on the coprocessors; so the coprocessors run the MIC executable /pfs/dong/mpi_hostname.k1om.
  3. This shorthand may not be easier than the alternatives on Aesyle itself, but it is the preferred way of running applications in symmetric mode in a cluster environment with many Xeon Phi coprocessors.

Offload mode

In offload mode, an application starts execution on the host; as the computation proceeds it offloads part or all of the computation from its processes or threads to the coprocessors. This is the common execution model in other coprocessor/accelerator operating environments, like CUDA, OpenCL and OpenACC.

OpenMP 4.0

OpenMP 4.0[4], released in July 2013, adds support for accelerators by introducing a few target directives. Intel compilers support some, but not all, of the new features in OpenMP 4.0[5]. I've written a sample "Hello world" program (omp4_hello.c) to demonstrate some of the OpenMP 4.0 features.

Compile the code with Intel C compiler:

[aesyle]$ icc -openmp omp4_hello.c -o omp4_hello.x

By default, the program will utilize all host processor cores and coprocessor hardware threads, i.e., it will start 6 threads on each Xeon E5-2630L processor and 236 threads on each Xeon Phi 5110P coprocessor (not 240 threads, because the last core is reserved for the offload daemon coi_daemon). For brevity, let's start only 2 threads on the host and 2 threads on each coprocessor:

[aesyle]$ export OMP_NUM_THREADS=2
[aesyle]$ export MIC_OMP_NUM_THREADS=2
[aesyle]$ ./omp4_hello.x
Hello, world! I am thread 0 on host
Hello, world! I am thread 1 on host
Number of threads = 2 on host

Number of devices = 2

Hello, world! I am thread 0 on mic0
Hello, world! I am thread 1 on mic0
Number of threads = 2 on mic0

Hello, world! I am thread 0 on mic1
Hello, world! I am thread 1 on mic1
Number of threads = 2 on mic1

Intel-specific Pragmas

Intel compilers provide several proprietary pragmas – offload and others prefixed with offload_ – to explicitly direct data movement and code execution[6]. I've written a sample "Hello world" program (offload_hello.c) that is equivalent to the OpenMP 4.0 program above (omp4_hello.c) but uses Intel-specific pragmas.

Compile the code with Intel C compiler:

[aesyle]$ icc -openmp offload_hello.c -o offload_hello.x

By default, the program will utilize all host processor cores and coprocessor hardware threads. For brevity, let's start only 2 threads on the host and 2 threads on each coprocessor:

[aesyle]$ export OMP_NUM_THREADS=2
[aesyle]$ export MIC_OMP_NUM_THREADS=2
[aesyle]$ ./offload_hello.x 
Hello, world! I am thread 0 on host
Hello, world! I am thread 1 on host
Number of threads = 2 on host

Number of devices = 2

Hello, world! I am thread 0 on mic1
Hello, world! I am thread 1 on mic1
Number of threads = 2 on mic1
Hello, world! I am thread 0 on mic0
Hello, world! I am thread 1 on mic0
Number of threads = 2 on mic0

NOTE

  1. For the time being, OpenMP 4.0 support in the Intel compilers is limited; Intel-specific pragmas are currently the most reliable way to program in offload mode.
  2. Here we only touch upon the Explicit Offload model (non-shared-memory model), where we explicitly direct data movement and code execution.
  3. We can also use the Implicit Offload model (virtual-shared-memory model), which is suitable when the data exchanged between the CPU and the coprocessor is more complex than scalars, arrays, and structs that can be copied from one variable to another using a simple memcpy[7].
  4. Sample codes using the explicit memory copy model can be found on Aesyle at:
    Fortran
    /opt/intel/composer_xe_2013_sp1.1.106/Samples/en_US/Fortran/mic_samples/
    C++
    /opt/intel/composer_xe_2013_sp1.1.106/Samples/en_US/C++/mic_samples/

MKL Automatic Offload

Intel Math Kernel Library includes a unique Automatic Offload (AO) feature that enables computationally intensive MKL functions called in user code to benefit automatically and transparently from attached Xeon Phi coprocessors[8]. This feature lets us leverage the additional computational resources of the coprocessors without changing the Intel MKL calls in our code. Data transfer and execution management are completely automatic and transparent to the user.

Because the coprocessor(s) are connected to the host system via PCI Express (PCIe), AO support is provided only for functions that involve sufficiently large problems and have high ratios of computation to data access. As of Intel MKL 11.0, only the following functions are enabled for automatic offload:

  • Level-3 BLAS functions
    •  ?GEMM (for M,N > 2048, k > 256)
    •  ?TRSM (for M,N > 3072)
    •  ?TRMM (for M,N > 3072)
    •  ?SYMM (for M,N > 2048)
  • LAPACK functions
    • LU (M,N > 8192)
    • QR
    • Cholesky

In the above list, the matrix sizes for which MKL decides to offload the computation are given in parentheses.

To enable automatic offload, either call the function mkl_mic_enable() within the source code or set the environment variable MKL_MIC_ENABLE=1. If no Xeon Phi coprocessor is detected, the application runs on the host without penalty.

Build a program for automatic offload the same way as you would build code for the Xeon host:

[aesyle]$ icc -O3 -mkl file.c -o file

By default, the MKL library decides when to offload and also tries to determine the optimal work division between the host and the targets (MKL can take advantage of multiple coprocessors). For the BLAS routines, the user can specify the work division between the host and the coprocessor by calling the routine:

mkl_mic_set_workdivision(MKL_TARGET_MIC, 0, 0.5);

or by setting the environment variable:

[aesyle]$ export MKL_MIC_0_WORKDIVISION=0.5

Both examples specify that 50% of the computation be offloaded, and only to the first coprocessor (mic0).

Further Reading

  1. Intel Xeon Phi Coprocessor
  2. Intel Xeon Phi Coprocessor: Software Developers Guide
  3. Intel C++ Compiler 14.0 - Intel MIC Architecture
  4. Intel Fortran Compiler 14.0 - Intel MIC Architecture
  5. Programming and Compiling for Intel Many Integrated Core Architecture
  6. Debugging Intel Xeon Phi Applications on Linux Host
  7. Intel Xeon Phi Coprocessor High-Performance Programming by James L. Jeffers and James Reinders. The ebook is available at UCSC library.
  8. Intel Xeon Phi Coprocessor Architecture and Tools: The Guide for Application Developers by Rezaur Rahman. The Kindle edition is freely available at Amazon too.

References

  1. Intel Xeon Phi Programming Environment
  2. Building a Native Application for Intel Xeon Phi Coprocessor
  3. Using the Intel MPI Library on Intel Xeon Phi Coprocessor Systems
  4. OpenMP 4.0 Application Program Interface
  5. OpenMP 4.0 Features in Intel Fortran Composer XE 2013
  6. Fortran vs. C offload directives and functions
  7. Intel C++ Compiler 14.0 - Using Shared Virtual Memory
  8. Using Intel® MKL Automatic Offload on Intel Xeon Phi Coprocessors