Aesyle is the one and only Type IIb Compute Node (CN IIb) in our Hyades cluster. We were very fortunate to receive a donation of two (2x) Xeon Phi 5110P processors from Intel in 2013. We've since integrated those 2 Xeon Phi processors into a Dell PowerEdge R720 server, which contains two (2x) 6-core Intel Sandy Bridge Xeon E5-2630L processors at 2.0 GHz, 64 GB memory and one 500GB hard drive.

Intel Xeon Phi Coprocessor 5110P

Many Integrated Core (MIC) Architecture is a coprocessor architecture developed by Intel, and Xeon Phi is the brand name for all products based on the MIC architecture. Salient features of the Intel Xeon Phi Coprocessor 5110P (which belongs to the Knights Corner product line) are:

  • 60 cores (in-order, dual-issue x86 design)
  • 4 threads per core
  • Core speed: 1.053 GHz
  • 512-bit SIMD vector unit (Intel Initial Many Core Instructions, IMCI; not AVX)
  • Double precision peak performance: 1.01 TFLOPS = 1.053 (GHz) x 60 (cores) x 512/64 (SIMD lanes) x 2 (FMA)
  • Memory: 8GB GDDR5; bandwidth: 320 GB/s = 5 (GT/s) x 16 (channels) x 4 (B)
  • PCI Express 2.0 x16; bandwidth: 5 (GT/s) x 8/10 (8b/10b encoding) / 8 (bits/B) x 16 (lanes) = 8 GB/s (16 GB/s duplex)
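
The peak figures in this list can be re-derived with a few lines of shell arithmetic. The sketch below uses awk for the floating-point math; the function names are our own, and the inputs are simply the values quoted in the list.

```shell
#!/bin/sh
# Re-derive the peak figures quoted above. Function names are illustrative only.

# Peak double-precision GFLOPS = GHz x cores x (vector bits / 64) x 2 (FMA)
peak_gflops() {
    awk -v ghz="$1" -v cores="$2" -v bits="$3" \
        'BEGIN { printf "%.2f\n", ghz * cores * (bits / 64) * 2 }'
}

# Peak memory bandwidth in GB/s = GT/s x channels x bytes per transfer
mem_bw_gbs() {
    awk -v gts="$1" -v ch="$2" -v bytes="$3" \
        'BEGIN { printf "%.0f\n", gts * ch * bytes }'
}

# PCIe 2.0 one-way bandwidth in GB/s:
# GT/s x 8/10 (8b/10b encoding) / 8 (bits per byte) x lanes
pcie2_bw_gbs() {
    awk -v gts="$1" -v lanes="$2" \
        'BEGIN { printf "%.0f\n", gts * 8 / 10 / 8 * lanes }'
}

peak_gflops 1.053 60 512   # 1010.88 GFLOPS, i.e. about 1.01 TFLOPS
mem_bw_gbs 5 16 4          # 320 GB/s
pcie2_bw_gbs 5 16          # 8 GB/s in each direction
```

These are theoretical peaks; sustained numbers on real workloads will be lower.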

Detailed specifications for the Intel Xeon Phi Coprocessor 5110P can be found on Intel ARK[1], or listed with the mpssinfo utility:

# mpssinfo 
MpssInfo Utility Log

		Vendor ID 		 : 0x8086
		Device ID 		 : 0x2250
		Subsystem ID 		 : 0x2500
		Coprocessor Stepping ID	 : 3
		PCIe Width 		 : x16
		PCIe Speed 		 : 5 GT/s
		PCIe Max payload size	 : 256 bytes
		PCIe Max read req size	 : 512 bytes
		Coprocessor Model	 : 0x01
		Coprocessor Model Ext	 : 0x00
		Coprocessor Type	 : 0x00
		Coprocessor Family	 : 0x0b
		Coprocessor Family Ext	 : 0x00
		Coprocessor Stepping 	 : B1
		Board SKU 		 : B1PRQ-5110P/5120D
		ECC Mode 		 : Enabled
		SMC HW Revision 	 : Product 225W Passive CS

		Total No of Active Cores : 60
		Voltage 		 : 1004000 uV
		Frequency		 : 1052631 kHz

		GDDR Vendor		 : Elpida
		GDDR Version		 : 0x1
		GDDR Density		 : 2048 Mb
		GDDR Size		 : 7936 MB
		GDDR Technology		 : GDDR5 
		GDDR Speed		 : 5.000000 GT/s 
		GDDR Frequency		 : 2500000 kHz
		GDDR Voltage		 : 1501000 uV
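
When only one or two of these values are needed in a script, the "Name : value" layout is easy to pick apart. Below is a minimal sketch, assuming the output keeps the layout shown above; mpssinfo_field is our own helper, not part of MPSS.

```shell
#!/bin/sh
# Print the value of one field from mpssinfo-style "Name : value" output on stdin.
# The function name is our own, not part of MPSS.
mpssinfo_field() {
    awk -v key="$1" '
        {
            pos = index($0, ":")
            if (pos == 0) next                       # not a "Name : value" line
            name  = substr($0, 1, pos - 1)
            value = substr($0, pos + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", name)        # trim surrounding whitespace
            gsub(/^[ \t]+|[ \t]+$/, "", value)
            if (name == key) { print value; exit }   # exact match on the field name
        }'
}

# Demo on a few lines copied from the listing above.
sample='		Total No of Active Cores : 60
		Frequency		 : 1052631 kHz
		GDDR Size		 : 7936 MB'

printf '%s\n' "$sample" | mpssinfo_field "Total No of Active Cores"
printf '%s\n' "$sample" | mpssinfo_field "GDDR Size"
```

On Aesyle itself the same helper would be fed from the real utility, e.g. mpssinfo | mpssinfo_field "Frequency". Because the match is on the full field name, "Frequency" and "GDDR Frequency" are kept apart.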


The Intel Manycore Platform Software Stack (MPSS)[2] is necessary to run the Intel Xeon Phi Coprocessor. MPSS 3.4.1 was released on October 22, 2014. The kernel version on Aesyle is 2.6.32-358 (RHEL/CentOS 6.4), which is supported by MPSS 3.4.1. Here we document the installation of MPSS 3.4.1 on Aesyle[3].

Download MPSS 3.4.1 and unpack the tar ball:

# cd /scratch/
# wget
# tar xvf mpss-3.4.1-linux.tar

Remove previous installation of Intel MPSS:

# cd mpss-3.4.1
# ./
# rm -rf /var/mpss/*

If not present, generate a pair of SSH keys for root:

# cd ~/.ssh/
# ssh-keygen -t rsa

Install MPSS:

# cp ./modules/*`uname -r`*.rpm .
# yum install *.rpm
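
The cp step silently copies nothing useful if the tarball ships no kernel-module RPM for the running kernel. A small guard like the one below can catch that before yum runs. This is only a sketch: find_kernel_rpms is our own name, and it does nothing more than match the kernel release string against the RPM file names.

```shell
#!/bin/sh
# List the RPMs in a directory whose file name contains a given kernel release.
# Returns non-zero if there is none. The function name is our own.
find_kernel_rpms() {
    # $1: directory holding the module RPMs
    # $2: kernel release (defaults to the running kernel)
    dir=$1
    krel=${2:-$(uname -r)}
    status=1
    for f in "$dir"/*"$krel"*.rpm; do
        [ -e "$f" ] || continue        # unmatched glob stays literal; skip it
        echo "$f"
        status=0
    done
    return $status
}

# Typical use from inside the unpacked MPSS directory:
#   find_kernel_rpms ./modules || echo "no module RPM for $(uname -r)" >&2
```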

Load the mic.ko driver, and then initialize MPSS Default Settings:

# modprobe mic
# micctrl --cleanconfig
# micctrl --initdefaults

Update Flash & SMC

1. Check the status of the coprocessor(s):

# micctrl -s

If any coprocessor is not in the ready state, reset the coprocessor(s):

# micctrl -rw

2. Run:

# /usr/bin/micflash -update -device all

3. Reboot for all flash and SMC changes to take effect.
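
The readiness check in step 1 can be scripted. The sketch below assumes that micctrl -s prints one line per card in the form "micN: state ..."; that layout is an assumption on our part, and cards_not_ready is our own helper.

```shell
#!/bin/sh
# Print the name of every coprocessor whose state is not "ready".
# Assumes one "micN: <state> ..." line per card on stdin (assumed format).
cards_not_ready() {
    awk -F': *' '/^mic[0-9]+:/ {
        split($2, s, " ")              # first word after the colon is the state
        if (s[1] != "ready") print $1
    }'
}

# Demo with made-up status lines; on the host this would be:
#   micctrl -s | cards_not_ready
printf 'mic0: ready\nmic1: online (mode: linux)\n' | cards_not_ready
```

An empty result means every card is ready and the flash update can proceed.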

After reboot, the MPSS service should start automatically. If not, run:

# chkconfig mpss on
# service mpss start

MPSS with OFED Support

Let's first decode some jargon[4]:

  • HCA: Host Channel Adapter for InfiniBand (IB)
  • OpenFabrics Enterprise Distribution (OFED): open-source software for remote direct memory access (RDMA) and kernel bypass applications.
  • Symmetric Communication Interface Framework (SCIF)[5]
    • Sockets-like API for communication between processes on MIC and host within the same system
    • SCIF API provides both send-receive semantics, as well as Remote Memory Access (RMA) semantics
  • Coprocessor Communication Link (CCL)
    • Enables MIC to use IB directly and enables processes on the MIC to talk with the HCA
    • It's an IB proxy through which all privileged operations are staged
    • Resides on the host and makes requests on behalf of the process running on the MIC
    • Data movement calls from the process on the MIC can be made in a direct manner to the HCA using PCIe peer-to-peer copies
    • Intel MPSS implementation of IB verbs over SCIF API
    • This allows processes to use verbs API over a virtual HCA as underlying operations are handled using SCIF

There is one Mellanox QDR InfiniBand adapter in Aesyle, which uses the mlx4 driver.

# lspci | grep InfiniBand
41:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)

# ibstatus 
Infiniband device 'mlx4_0' port 1 status:
	default gid:	 fe80:0000:0000:0000:0002:c903:002b:89eb
	base lid:	 0xd3
	sm lid:		 0x1
	state:		 4: ACTIVE
	phys state:	 5: LinkUp
	rate:		 40 Gb/sec (4X QDR)
	link_layer:	 InfiniBand
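
Before layering MPSS OFED on top, it is worth confirming from a script that the port is really up. Here is a minimal sketch that reads ibstatus output and succeeds only if it sees both an ACTIVE state and a LinkUp physical state. ib_port_up is our own helper, and with several ports it succeeds as soon as any state line and any phys state line match.

```shell
#!/bin/sh
# Succeed if ibstatus-style output on stdin reports ACTIVE and LinkUp.
# The function name is our own.
ib_port_up() {
    awk '
        /^[ \t]*state:/      { if ($0 ~ /ACTIVE/) state = 1 }
        /^[ \t]*phys state:/ { if ($0 ~ /LinkUp/) phys = 1 }
        END { exit !(state && phys) }'
}

# Demo on the two relevant lines from the listing above; on the host:
#   ibstatus | ib_port_up
if printf '\tstate:\t\t 4: ACTIVE\n\tphys state:\t 5: LinkUp\n' | ib_port_up; then
    echo "IB port is up"
fi
```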

RHEL/CentOS 6 includes InfiniBand device drivers, verbs and MPI support, but it doesn't track OFED releases – Red Hat takes upstream packages directly, and takes kernel code from upstream kernels. For Xeon Phi to take advantage of InfiniBand, we need to manually compile and install an OFED distribution that supports Intel MPSS – see Chapter 2 of the MPSS User's Guide[6].

Install OFED

0. Install prerequisites:

# yum install libtool flex tcl-devel

1. Download OFED

# wget
# tar xvf OFED-
# cd OFED-

2. Install the OFED stack:

# perl

During installation, select:

  • Option 2 (Install OFED Software)
  • Option 4 (Customize)
  • ...exclude kernel-ib*, *-debuginfo and dapl* packages...
  • ...exclude MPI packages...
  • "Install 32-bit packages? [y/N]", answer N
  • "Enable ROMIO support [Y/n]", answer Y
  • "Enable shared library support [Y/n]", answer Y
  • "Enable Checkpoint-Restart support [Y/n]", answer N

NOTE: Here we install only the userland libraries and utilities for OFED, not the device drivers / kernel modules.

Install Intel MPSS OFED:

# cd /scratch/mpss-3.4.1/
# cp ofed/modules/*`uname -r`*.rpm ofed
# rpm -Uvh ofed/*.rpm


  1. In this step we install the ofed-driver-`uname -r`-3.4.1-1.x86_64 package, which provides enhanced OFED drivers that support Intel MPSS on the host. Those kernel modules are located in /lib/modules/`uname -r`/updates/; while the stock IB kernel modules, provided by the kernel-`uname -r` package, are located in /lib/modules/`uname -r`/kernel/.
  2. The header files for the enhanced OFED drivers, provided by the ofed-driver-devel-`uname -r`-3.4.1-1.x86_64 package, are installed in /usr/src/ofed-driver/. We'll use those headers, in order to compile Lustre clients for both the host and the Phi coprocessors.
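
Since two copies of the IB kernel modules now coexist, it helps to check which copy a given module resolves to. The sketch below classifies a module file path by the two directories named above. module_origin and its labels are our own, and the path would normally come from modinfo -n.

```shell
#!/bin/sh
# Classify a kernel module path as coming from the MPSS OFED package
# (.../updates/...) or from the stock kernel package (.../kernel/...).
# The function name and labels are our own.
module_origin() {
    case "$1" in
        */updates/*) echo "mpss-ofed" ;;
        */kernel/*)  echo "stock" ;;
        *)           echo "unknown" ;;
    esac
}

# On the host:  module_origin "$(modinfo -n ib_core)"
module_origin /lib/modules/2.6.32-358.el6.x86_64/updates/drivers/infiniband/core/ib_core.ko
module_origin /lib/modules/2.6.32-358.el6.x86_64/kernel/drivers/infiniband/core/ib_core.ko
```

The two demo paths are illustrative; only the updates/ and kernel/ directory components matter to the check.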

Network Configuration

A virtual TCP/IP network connection between the host and the Intel Xeon Phi coprocessor is created over the PCIe bus. By default, the network addresses for the coprocessors are:

  • Host side address of first coprocessor (mic0):
  • IP address of first coprocessor (mic0):
  • Host side address of second coprocessor (mic1):
  • IP address of second coprocessor (mic1):

The drawbacks of the default network configuration are:

  1. The 2 Phi coprocessors can't talk to each other directly. One consequence is that we won't be able to run an MPI program utilizing both coprocessors.
  2. We won't be able to mount the home NFS share on the coprocessors – The NFS share is exported to 2 subnets: (Private GbE) & (Private 10GbE).

Also, by default the virtual HCAs (InfiniBand interfaces) are not enabled on the coprocessors, so we won't be able to mount the Lustre file system on the coprocessors either. In a cluster environment, this will also prevent the processes on the coprocessors from communicating with each other using IB verbs.
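
When planning addresses for the host and the coprocessors, a quick test of whether two addresses share a subnet for a given prefix length (the netbits value) can save some trial and error. The following is a sketch in plain POSIX shell arithmetic; the helper names are our own and the addresses in the demo are made up.

```shell
#!/bin/sh
# Pack a dotted-quad IPv4 address into a single integer.
ip_to_int() {
    _saved_ifs=$IFS
    IFS=.
    set -- $1                          # split "a.b.c.d" into $1..$4
    IFS=$_saved_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Succeed if two IPv4 addresses are in the same subnet.
# $1, $2: addresses; $3: prefix length (netbits)
same_subnet() {
    _mask=$(( (0xFFFFFFFF << (32 - $3)) & 0xFFFFFFFF ))
    _a=$(ip_to_int "$1")
    _b=$(ip_to_int "$2")
    [ $(( _a & _mask )) -eq $(( _b & _mask )) ]
}

# Hypothetical addresses: separate /24 networks, but one shared /16.
same_subnet 10.20.1.1 10.20.2.1 24 || echo "different /24 subnets"
same_subnet 10.20.1.1 10.20.2.1 16 && echo "same /16 subnet"
```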

Let's work out a better network configuration!

External Network Bridge

Here is the old configuration for em2 (10GbE interface):

# cat /etc/sysconfig/network-scripts/ifcfg-em2

We'll create a network bridge with 3 ports, em2, mic0 & mic1, on the host.

# service mpss stop
# umount /home
# umount /trove
# ifdown em2
# micctrl --addbridge=br0 --type=external --ip= --netbits=16  --mtu=9000

We still have to manually add em2 to the bridge. Here is the modified configuration for em2:

# cat /etc/sysconfig/network-scripts/ifcfg-em2

Configure the virtual network interfaces on the Phi coprocessors:

# micctrl --network=static --bridge=br0 --ip= mic0
# micctrl --network=static --bridge=br0 --ip= mic1

Apply the new configurations:

# service network restart


Now that br0 has taken over em2's old IP address, we can remount the NFS shares on the host:

# mount /home
# mount /trove

Add the NFS mount /home to the Phi coprocessors:

# rm -rf /var/mpss/mic?/home/*
# micctrl --addnfs= --dir=/home --option=noatime,nosuid,nolock,soft

which will append the following line to /etc/fstab on the coprocessors:

/home	nfs		noatime,nosuid,nolock,soft		1 1

Start the Intel MPSS service:

# service mpss start

The NFS share is now mounted on the coprocessors and seems to be working fine!
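
Rather than logging in to each card to look, the mount can be verified from the host. Below is a minimal sketch that scans a mount table in /proc/mounts format (device, mount point, file system type, ...). is_mounted is our own helper, and the server name in the demo is made up.

```shell
#!/bin/sh
# Succeed if a mount table on stdin (in /proc/mounts format) contains the
# given mount point, optionally with the given file system type.
# The function name is our own.
is_mounted() {
    awk -v mp="$1" -v fs="$2" '
        $2 == mp && (fs == "" || $3 == fs) { found = 1 }
        END { exit !found }'
}

# Demo with a made-up entry; against a coprocessor this would be:
#   ssh mic0 cat /proc/mounts | is_mounted /home nfs
if printf 'nfs-server:/home /home nfs rw,noatime,nolock,soft 0 0\n' | is_mounted /home nfs; then
    echo "/home is mounted over NFS"
fi
```

The same helper works for the Lustre mount later on, e.g. is_mounted /pfs lustre.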


Modify /etc/mpss/ipoib.conf to look as follows:

mic0_ib0=" netmask"
mic1_ib0=" netmask"

Start Xeon Phi coprocessor specific OFED service on the host:

# chkconfig ofed-mic on
# service ofed-mic start


  1. RHEL/CentOS 6 uses /etc/init.d/rdma to load/unload InfiniBand kernel modules, while the ofed-driver-`uname -r`-3.4.1-1.x86_64 package provides /etc/init.d/openibd to perform essentially the same tasks. Only one script is needed. Either one can be disabled, e.g., by running chkconfig --del rdma.
  2. Here we use CCL-Direct and IPoIB, which currently only works with OFED- on the Mellanox mlx4 driver and hardware.
  3. Since we use CCL-Direct, probably we don't need the ccl-proxy service (mpxyd)?


As of October 2014, the Terascala Lustre Storage runs Lustre server 2.1.5, and almost all nodes in the Hyades cluster run Lustre client 1.8.9. Here we document how to install the latest feature release (2.6.0) of the Lustre client on both the host (Aesyle) and the Phi coprocessors.

Lustre client on the host

Install dependencies:

# yum install libselinux-devel

Unmount the Lustre file system (/pfs):

# service lustre stop

Uninstall Lustre 1.8.9:

# rpm -e --noscripts lustre-modules lustre

Download the source RPM for the latest feature release (2.6.0) of the Lustre client:

$ wget --no-check-certificate

Rebuild RPMs for Lustre client:

$ rpmbuild --rebuild --define "configure_args --with-o2ib=/usr/src/ofed-driver" --without servers lustre-client-2.6.0-2.6.32_431.20.3.el6.x86_64.src.rpm

NOTE We now use Intel MPSS OFED on Aesyle and Lustre should be compiled against Intel MPSS OFED headers (located in /usr/src/ofed-driver/). The option --define "configure_args --with-o2ib=/usr/src/ofed-driver" passes the option --with-o2ib=/usr/src/ofed-driver to configure when building the RPMs.

Install Lustre client on the host:

# cd ~dong/rpmbuild/RPMS/x86_64/
# rpm -Uvh lustre-client-modules-2.6.0-2.6.32_358.el6.x86_64.x86_64.rpm \

Some warnings will be printed, but they can be safely ignored:

WARNING: /lib/modules/2.6.32-358.el6.x86_64/kernel/drivers/infiniband/hw/ipath/ib_ipath.ko needs unknown symbol ib_wq
WARNING: /lib/modules/2.6.32-358.el6.x86_64/updates/drivers/infiniband/ulp/srpt/ib_srpt.ko needs unknown symbol scst_unregister

Remount the Lustre file system on the host:

# service lustre start

Lustre client on Xeon Phi

We mostly follow the guide on how to cross-compile Lustre client for Xeon Phi[7], with some slight variations so as to use the latest MPSS and Lustre releases.

Download Software for Coprocessor OS (k1om):

# cd /scratch/
# wget

Download MPSS source:

# wget

Unpack the tar balls:

# tar xvf mpss-3.4.1-k1om.tar
# tar xvf mpss-src-3.4.1.tar

Prepare the Linux kernel source code:

# tar xvfj ./mpss-3.4.1/src/linux-2.6.38+mpss3.4.1.tar.bz2

which will create a new directory ./linux-2.6.38+mpss3.4.1 containing the Linux kernel source code.

# rpm2cpio ./mpss-3.4.1/k1om/kernel-dev-2.6.38+mpss3.4.1-1.knightscorner.rpm | cpio -idmv

which will create a new directory ./boot containing the files needed to build new kernel modules for Xeon Phi.

# cp ./boot/config- ./linux-2.6.38+mpss3.4.1/.config
# cp ./boot/Module.symvers- ./linux-2.6.38+mpss3.4.1/Module.symvers
# cd ./linux-2.6.38+mpss3.4.1/
# make modules_prepare
# cd ..

Retrieve the Lustre source code:

# git clone git://
# cd lustre-release
# git checkout b2_6

Create the build script /scratch/

set -e
BUILD_DIR=`readlink -f $PWD`
mkdir -p ${DEST_DIR}
export ARCH=k1om
source /opt/mpss/3.4.1/environment-setup-k1om-mpss-linux
export LD=k1om-mpss-linux-ld
cd ${SCM_DIR}
./configure $CONFIGURE_FLAGS \
                                --disable-tests --disable-doc --disable-server \
                                --with-o2ib=/usr/src/ofed-driver/ \
make install DESTDIR=${DEST_DIR}
cd ${DEST_DIR}
mv ./opt/lustre/2.*/k1om-mpss-linux/* .
rm -rf ./opt/lustre
tar cvzf ${BUILD_DIR}/lustre-phi.tar.gz ./

Cross-compile Lustre client for Xeon Phi:

# chmod +x
# ./

which will create a new lustre-phi.tar.gz tarball.

Test Lustre client on Xeon Phi:

# scp lustre-phi.tar.gz mic0:/
# ssh mic0
[root@mic0]# cd /
[root@mic0]# tar xvzf lustre-phi.tar.gz
[root@mic0]# depmod
[root@mic0]# echo 'options lnet networks=o2ib0(ib0)' >> /etc/modprobe.d/lustre.conf
[root@mic0]# modprobe lnet
[root@mic0]# lctl network up
[root@mic0]# mkdir /pfs

Mounting with mount -t lustre fails:

# mount -t lustre /pfs
mount: mounting on /pfs failed: Invalid argument

But mount.lustre seems to work fine:

# /sbin/mount.lustre /pfs

Automate Lustre client on Xeon Phi:

Add Lustre client to the root file system on Xeon Phi:

# tar xvfz lustre-phi.tar.gz -C /var/mpss/common/
# mkdir -p /var/mpss/common/etc/modprobe.d
# echo 'options lnet networks=o2ib0(ib0)' >> /var/mpss/common/etc/modprobe.d/lustre.conf
# mkdir /var/mpss/common/pfs
# rm -f /var/mpss/common/etc/init.d/lnet

Create an init script for Lustre client on Xeon Phi (/var/mpss/common/etc/init.d/lustre):

# system init for lustre
let err=0
case "$1" in
start)
	# mount the Lustre file system
	echo -n "    Starting lustre ... "
	/sbin/mount.lustre /pfs || let err++
	echo "Done."
	;;
stop)
	# kill processes still using /pfs, then unmount it
	echo -n "    Stopping lustre ... "
	fuser -k /pfs/ /pfs/*
	umount /pfs &> /dev/null
	echo "Done."
	;;
restart)
	$0 stop		&&
	$0 start	||
	exit 1
	;;
status)
	mount | grep lustre
	[ $? -ne 0 ] && echo "Lustre is not mounted"
	;;
esac
exit $err

Modify the init script /etc/init.d/ofed-mic on host:

1. Add the following line to near the end of function start_mic():

$ssh $1 /etc/init.d/lustre start &> /dev/null

2. Add the following line to the beginning of function stop_mic():

$ssh $1 /etc/init.d/lustre stop &> /dev/null

3. Replace the following line in start():

	ip address add dev mic0 label mic0:ib

with:

	ip address add dev br0 label br0:ib

4. Replace the following line in stop():

	ip address del dev mic0 2>/dev/null

with:

	ip address del dev br0 2>/dev/null

NOTE: The last 2 changes are necessary because mic0 is now a port on br0 and can no longer be assigned an IP address of its own.

Create a symbolic link /opt/intel on the coprocessors, pointing to /pfs/sw/intel, where Intel compilers and Intel MPI are installed:

# rm -rf /var/mpss/mic?/opt
# cd /var/mpss/common/opt/
# ln -s /pfs/sw/intel

Restart the mpss service:

# service mpss restart

Restart the ofed-mic service:

# service ofed-mic restart

Not sure if we need the ccl-proxy service. Let's start it nonetheless:

# chkconfig --add mpxyd
# service mpxyd start

Voila! InfiniBand and Lustre client now appear to be fully working on both the host and the coprocessors!


SSH environment

When running applications directly on Xeon Phi coprocessors (native mode), we usually take the following steps[8]:

  1. Compile the application for native execution.
  2. Build required libraries for native execution.
  3. Copy the executable and any dependencies, such as runtime libraries, to the target hardware.
  4. Mount file shares to the target hardware for accessing input data sets and saving output data sets.
  5. Connect to the target hardware via console, set up the environment, and run the application.

However, it is very tedious to copy the executable and its dependencies to the coprocessors whenever we need to run a native application, and those files can consume a large chunk of coprocessor memory, a limited resource we would rather reserve for running applications. On Aesyle, the NFS share /share and the Lustre file system /pfs are mounted on both the host and the coprocessors; we'll use those to share files between them.

There is a hurdle to overcome, though. The SSH server on the embedded Linux was compiled with PATH=/usr/bin:/bin:/usr/sbin:/sbin, and Bash on the embedded Linux does not source the ~/.bashrc file in non-login, non-interactive SSH sessions.

[aesyle]$ ssh mic0 echo \$PATH

So we can't use ~/.bashrc to set environment variables like PATH and LD_LIBRARY_PATH. To run the sample MPI "Hello world" program in native mode, we would have to do something like the following:

[aesyle]$ ssh mic0 \
  PATH=/usr/bin:/bin:/pfs/sw/intel/impi/ \
  LD_LIBRARY_PATH=/pfs/sw/intel/composer_xe_2013_sp1.1.106/compiler/lib/mic:/pfs/sw/intel/impi/ \
  mpirun -n 60 /pfs/dong/mpi_hello.k1om

This is tiresome! One possible fix is to use ~/.ssh/environment to set environment variables. For this to work, we enable PermitUserEnvironment in /etc/ssh/sshd_config on the embedded Linux (the default is no):

PermitUserEnvironment yes

but then every user would have to modify his/her ~/.ssh/environment file, which is not ideal. Ordinary users want it to just work. OpenSSH does not offer a way to set the environment globally by itself, but we can easily achieve the goal with a PAM module. If we append the following line to the default /etc/pam.d/sshd on the coprocessors,

session    required readenv=1

SSH sessions will read /etc/environment to set environment variables. This is a better solution, and the one I prefer!
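
One caveat: the readenv option belongs to pam_env, which treats /etc/environment as plain NAME=VALUE pairs and does not expand shell variables, so a line such as PATH=$PATH:/some/dir would be taken literally. The sketch below is a small lint for such a file; check_environment_file is our own helper and its messages are made up.

```shell
#!/bin/sh
# Lint an /etc/environment-style file on stdin.
# Reports lines that are not NAME=VALUE or that contain "$".
# Returns non-zero if any problem is found. The function name is our own.
check_environment_file() {
    awk '
        /^[ \t]*(#|$)/ { next }                      # skip comments and blank lines
        !/^[A-Za-z_][A-Za-z0-9_]*=/ {
            printf "line %d: not NAME=VALUE\n", NR
            bad = 1
            next
        }
        /\$/ {
            printf "line %d: contains $, which is not expanded\n", NR
            bad = 1
        }
        END { exit bad + 0 }'
}

# Demo on a small made-up file; on the host this would be:
#   check_environment_file < /var/mpss/common/etc/environment
if printf 'PATH=/usr/bin:/bin\nLD_LIBRARY_PATH=$HOME/lib\n' | check_environment_file; then
    echo "environment file looks fine"
fi
```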

Modify the root file system on the coprocessors:

# mkdir /var/mpss/common/etc/pam.d

Create /var/mpss/common/etc/pam.d/sshd that reads as follows:

auth       include      common-auth
account    required
account    include      common-account
password   include      common-password
session    optional force revoke
session    include      common-session
session    required
session    required
session    required readenv=1

Create /var/mpss/common/etc/environment:


Restart mpss to apply the new settings to the coprocessors:

# /etc/init.d/ofed-mic stop
# /etc/init.d/mpss restart
# /etc/init.d/ofed-mic start

Now we can run the sample MPI "Hello world" program in native mode, with a much simpler command:

[aesyle]$ ssh mic0 mpirun -n 60 /pfs/dong/mpi_hello.k1om

Bash profile

Now that we've fixed the non-login, non-interactive Bash shell, we'll turn to the interactive login Bash shell. By default, PATH for an interactive login shell is /usr/local/bin:/usr/bin:/bin on the coprocessors. So to run the sample MPI "Hello world" program interactively in native mode, we would have to do something like the following:

[aesyle]$ ssh mic1
[mic1]$ export PATH=$PATH:/pfs/sw/intel/impi/
[mic1]$ export LD_LIBRARY_PATH=/pfs/sw/intel/composer_xe_2013_sp1.1.106/compiler/lib/mic
[mic1]$ mpirun -n 60 /pfs/dong/mpi_hello.k1om
[mic1]$ exit

which can be easily fixed as well!

Modify the root file system on the coprocessors:

# mkdir /var/mpss/common/etc/profile.d

Create /var/mpss/common/etc/profile.d/ that reads as follows:

export PATH=/usr/bin:/bin:/usr/sbin:/sbin:/opt/intel/impi/
export LD_LIBRARY_PATH=/opt/intel/composer_xe_2013_sp1.1.106/compiler/lib/mic:/opt/intel/impi/
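
The two export lines above overwrite PATH and LD_LIBRARY_PATH outright. If a profile script should instead extend whatever is already set, without adding duplicates when sourced twice, a small helper does the job. This is only a sketch of an alternative, not what is deployed on Aesyle; path_append is our own name and the directory in the demo is hypothetical.

```shell
#!/bin/sh
# Append a directory to a colon-separated variable unless it is already there.
# $1: name of the variable (e.g. PATH), $2: directory to append
# The function name is our own.
path_append() {
    eval "_cur=\${$1}"                               # read the current value
    case ":$_cur:" in
        *":$2:"*) ;;                                 # already present: nothing to do
        *) eval "$1=\"\${_cur:+\$_cur:}\$2\"" ;;     # append, with ":" only if non-empty
    esac
}

# Demo on a scratch variable so the real PATH is left alone.
DEMO_PATH=/usr/bin:/bin
path_append DEMO_PATH /opt/intel/bin   # hypothetical directory
path_append DEMO_PATH /opt/intel/bin   # second call is a no-op
echo "$DEMO_PATH"
```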

Restart mpss to apply the new settings to the coprocessors:

# /etc/init.d/ofed-mic stop
# /etc/init.d/mpss restart
# /etc/init.d/ofed-mic start

Now it is much easier to run the sample MPI "Hello world" program interactively in native mode:

[aesyle]$ ssh mic1
[mic1]$ mpirun -n 60 /pfs/dong/mpi_hello.k1om
[mic1]$ exit

intel_mic module

In the intel_mic module, we define the environment variable I_MPI_MIC=enable to enable MPI communication between the host and the coprocessors[9].

We set:


In Symmetric Execution Mode, mpirun will by default replicate the host's environment variables to the coprocessors. Setting LD_LIBRARY_PATH as such will allow the MPI processes to find the appropriate shared libraries on the coprocessors; thus we can use much shorter commands to run programs in symmetric mode. Otherwise, we would have to do something like the following:

[aesyle]$ mpirun -genv I_MPI_FABRICS shm:tcp \
  -n 2 -host `hostname` /pfs/dong/mpi_hello.x86-64 : \
  -env PATH /pfs/sw/intel/impi/ \
  -env LD_LIBRARY_PATH /pfs/sw/intel/composer_xe_2013_sp1.1.106/compiler/lib/mic:/pfs/sw/intel/impi/ \
  -n 60 -host mic0 /pfs/dong/mpi_hello.k1om : \
  -env PATH /pfs/sw/intel/impi/ \
  -env LD_LIBRARY_PATH /pfs/sw/intel/composer_xe_2013_sp1.1.106/compiler/lib/mic:/pfs/sw/intel/impi/ \
  -n 60 -host mic1 /pfs/dong/mpi_hello.k1om

We also set the following for Offload Execution Mode:


"By default, all environment variables defined in the environment of an executing CPU program are replicated to the coprocessor's execution environment when an offload occurs. You can modify this behavior by defining the environment variable MIC_ENV_PREFIX. When you set MIC_ENV_PREFIX to a specific prefix, then not all CPU environment variables are replicated to the coprocessor, but only those environment variables that begin with the value of the MIC_ENV_PREFIX environment variable. The environment variables set on the coprocessor have the prefix value removed. You thus have independent control of OpenMP, Intel Cilk Plus, and other execution environments that use common environment variable names."[10]

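The filtering rule in the quote can be mimicked in a few lines of shell, which makes it easier to predict what a given host environment will look like on the coprocessor. The sketch below assumes the prefix is joined to the variable name with an underscore (so with prefix PHI, PHI_OMP_NUM_THREADS becomes OMP_NUM_THREADS); that detail, the prefix PHI and the helper name mic_env_filter are our assumptions, and the real work is done by the offload runtime, not by this script.

```shell
#!/bin/sh
# Simulate the MIC_ENV_PREFIX rule on NAME=VALUE lines read from stdin:
# keep only the variables that start with "<prefix>_" and strip that prefix.
# Assumes one variable per line (no multi-line values). Names are our own.
mic_env_filter() {
    awk -v p="$1_" 'index($0, p) == 1 { print substr($0, length(p) + 1) }'
}

# Demo: the host uses 12 OpenMP threads, the coprocessor should get 240.
printf 'HOME=/root\nOMP_NUM_THREADS=12\nPHI_OMP_NUM_THREADS=240\n' | mic_env_filter PHI
```

Here 240 is 60 cores x 4 threads per core on the 5110P, and 12 matches the two 6-core host processors. On the host, env | mic_env_filter "$MIC_ENV_PREFIX" gives a rough preview of what would be passed along.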

  1. Intel Xeon Phi Coprocessor 5110P (8GB, 1.053 GHz, 60 core)
  2. Intel Manycore Platform Software Stack (MPSS)
  3. Intel Manycore Platform Software Stack MPSS 3.4.1 README
  4. Communication in a HPC cluster with MIC
  5. Symmetric Communications Interface (SCIF) User's Guide
  6. Intel MPSS User's Guide
  7. How to cross-compile Lustre client for Xeon Phi
  8. Building a Native Application for Intel Xeon Phi Coprocessors
  9. Using the Intel MPI Library on Intel Xeon Phi Coprocessor Systems
  10. Setting Environment Variables on the CPU to Modify the Coprocessor's Execution Environment