Hyades QuickStart Guide

Front & Back of Hyades

Hyades is a supercomputer dedicated to computational astrophysics research at the University of California, Santa Cruz (UCSC). It is supported by a million-dollar grant from the National Science Foundation (award number AST-1229745) and additional matching funds from UCSC.

System Overview

Architecturally, Hyades is a cluster composed of the following components:

Component QTY Description
Master Node 1 Dell PowerEdge R820, 4x 8-core Intel Xeon E5-4620 (2.2 GHz), 128GB memory, 8x 1TB HDDs
Analysis Node 1 Dell PowerEdge R820, 4x 8-core Intel Xeon E5-4640 (2.4 GHz), 512GB memory, 2x 600GB SSDs
Type I Compute Nodes 180 Dell PowerEdge R620, 2x 8-core Intel Xeon E5-2650 (2.0 GHz), 64GB memory, 1TB HDD
Type IIa Compute Nodes 8 Dell PowerEdge C8220x, 2x 8-core Intel Xeon E5-2650 (2.0 GHz), 64GB memory, 2x 500GB HDDs, 1x Nvidia K20
Type IIb Compute Nodes 1 Dell PowerEdge R720, 2x 6-core Intel Xeon E5-2630L (2.0 GHz), 64GB memory, 500GB HDD, 2x Xeon Phi 5110P
Lustre Storage 1 146TB of usable storage served from a Terascala/Dell storage cluster
ZFS Server 1 SuperMicro Server, 2x 4-core Intel Xeon E5-2609V2 (2.5 GHz), 64GB memory, 2x 120GB SSDs, 36x 4TB HDDs
Cloud Storage 1 1PB of raw storage served from a Huawei UDS system
InfiniBand 17 17x Mellanox IS5024 QDR (40Gb/s) InfiniBand switches, configured in a 1:1 non-blocking Fat Tree topology
Gigabit Ethernet 7 7x Dell 6248 GbE switches, stacked in a Ring topology
10-gigabit Ethernet 1 1x Dell 8132F 10GbE switch

Master Node

The Master/Login Node is the entry point to the Hyades cluster. It is a Dell PowerEdge R820 server that contains four (4x) 8-core Intel Sandy Bridge Xeon E5-4620 processors at 2.2 GHz, 128 GB memory and eight (8x) 1TB hard drives in a RAID-6 array. Primary tasks to be performed on the Master Node are:

  • Editing codes and scripts
  • Compiling codes
  • Short test runs and debugging runs
  • Submitting and monitoring jobs

Please do not run computationally intensive jobs on the Master Node. Those jobs should be submitted to the Torque batch system, which will allocate them to run on the compute nodes.

The hostname of the Master Node is hyades.ucsc.edu (IP: 128.114.126.225). To access the Master Node, use an SSH client that supports the SSH-2 protocol. Then execute the following command (replace username with your own real username):

ssh -l username hyades.ucsc.edu

or:

ssh username@hyades.ucsc.edu

Visualization & Analysis Node

The Visualization & Analysis Node is Eudora (hostname: eudora.ucsc.edu; IP: 128.114.126.226). Eudora is another public host of the Hyades cluster. It is a Dell PowerEdge R820 server that contains four (4x) 8-core Intel Sandy Bridge Xeon E5-4640 processors at 2.4 GHz, 512 GB memory and two (2x) 600GB SSDs in a RAID-0 array. Eudora is designed to run jobs that require a lot of memory and/or fast I/O. It is ideal for visualization and data analysis tasks.

To access Eudora, use an SSH client that supports the SSH-2 protocol. Then execute the following command:

ssh -l username eudora.ucsc.edu

or:

ssh username@eudora.ucsc.edu

Compute Nodes

There are 3 types of Compute Nodes in the Hyades cluster.

Type I Compute Nodes (CNs I) are conventional compute nodes (180 in total). Each CN I is a Dell PowerEdge R620 server containing two (2x) 8-core Intel Sandy Bridge Xeon E5-2650 processors at 2.0 GHz, 64 GB memory and one 1TB hard drive. Of the 180 CNs I, two thirds (120 nodes) have Hyper-Threading turned off, so the operating system addresses 16 cores in each node; the remaining third (60 nodes) have Hyper-Threading turned on, so the operating system addresses 32 logical cores in each node. The former belong to the normal queue and the latter to the hyper queue of the Torque batch system.

Type IIa Compute Nodes (CNs IIa) are GPU nodes (8 in total). Each CN IIa is a Dell C8220x server containing two (2x) 8-core Intel Sandy Bridge Xeon E5-2650 processors at 2.0 GHz, 64 GB memory, two (2x) 500GB hard drives, and one Nvidia K20 GPU. All GPU nodes have Hyper-Threading turned off (thus the operating system addresses 16 cores in each node), and they belong to the gpu queue of the Torque batch system.

Many Integrated Core (MIC) Architecture is Intel's response to GPU or Accelerated computing. We were very fortunate that we received a donation of two (2x) Xeon Phi 5110P processors from Intel in 2013. We've since integrated those 2 Xeon Phi processors into a Dell PowerEdge R720 server, which contains two (2x) 6-core Intel Sandy Bridge Xeon E5-2630L processors at 2.0 GHz, 64 GB memory and one 500GB hard drive. That machine is our one and only Type IIb Compute Node (CN IIb), and is christened as Aesyle. To experiment with MIC computing, please consult our MIC QuickStart Guide.

Storage

The Storage subsystem of the Hyades cluster is a rich medley of many interesting technologies.

On each node, we use tmpfs for /tmp; thus up to half of the memory (32GB on compute nodes and 256GB on Eudora) can be used for lightning fast file storage. In addition, part of the local hard drive is made available as scratch space (mounted at /scratch). Used judiciously, those scratch spaces can greatly improve the I/O performance of your simulations. Users are warmly encouraged to explore them. If you do use them, please make sure to clean up after your job is done, as they are temporary by nature.
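The clean-up advice above can be sketched as a small Bash helper. This is a minimal sketch, not a site-provided tool: the function name run_in_scratch and the job.XXXXXX directory pattern are hypothetical; only the /scratch and /tmp mount points come from this guide.

```shell
#!/bin/bash
# Sketch: run a command inside a private scratch directory, then remove it.
# run_in_scratch is a hypothetical helper name, not part of Torque or Hyades.

run_in_scratch() {
    # $1 = scratch base (e.g. /scratch or /tmp); remaining args = command to run
    base=$1
    shift
    workdir=$(mktemp -d "$base/job.XXXXXX") || return 1
    status=0
    # A subshell keeps the caller's working directory unchanged.
    ( cd "$workdir" && "$@" ) || status=$?
    rm -rf "$workdir"   # scratch space is temporary: always clean up
    return $status
}

# Example (inside a job script): run_in_scratch /scratch ./hello.x
```

Note that this sketch does not handle the case where the job is killed before the command returns (for example at the walltime limit); a trap handler would be needed for that.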

The /home partition is served from a ZFS pool on a FreeBSD server. The server is a SuperMicro box containing 2x 4-core Intel Ivy Bridge Xeon E5-2609V2 processors at 2.5 GHz, 64GB memory, 2x 120GB SSDs, 36x 4TB HDDs. Among those 36x HDDs, 12 are in a RAIDZ2 ZFS volume which is NFS-mounted at /home on each node; the remaining 24 are in another RAIDZ2 ZFS volume which is NFS-mounted at /trove on the Master Node & Eudora.

On Hyades, the workhorse file system is Lustre, which is a high performance parallel distributed file system, and is widely used in top supercomputers in the world. The Lustre storage of Hyades is served from a Terascala/Dell storage cluster. It provides 146TB of usable capacity, and is mounted at /pfs on each node.

The last piece of our storage jigsaw is a Huawei Cloud Storage system. In 2013, we were very privileged to collaborate with Huawei on deploying a UDS cloud storage system at UCSC. The Huawei Cloud Storage system provides a petabyte-level data storage, archiving, and sharing platform for Hyades. It is an Object Storage System, utilizing the Amazon S3 protocol. For further details, please refer to the main article Huawei Cloud Storage.

Interconnects

Hyades Network Topology

Here is a bird's eye view of the internetworking of the Hyades cluster.

The InfiniBand fabric is the expressway of Hyades. Its backbone is made up of 17 Mellanox IS5024 QDR InfiniBand switches, which are interconnected to form a 1:1 non-blocking Fat Tree topology. The InfiniBand fabric delivers high bandwidth (40 Gb/s) as well as low latency (~ 1 microsecond). Every Compute Node, the Master Node, Eudora, and the Lustre storage cluster are all plugged into the InfiniBand fabric. By default, the message passing of your MPI programs is conducted through InfiniBand; so is the Lustre file system served to all nodes in the cluster.

Every Compute Node, the Master Node, Eudora, the Lustre storage cluster, and the ZFS file server are also interconnected through a Gigabit Ethernet (GbE) fabric. The backbone of the fabric is made up of 7 Dell 6248 GbE switches, stacked in a Ring topology. The GbE fabric is mostly used for management and Network File System (NFS) traffic. Although it is possible to run MPI programs over Gigabit Ethernet, it is not wise to do so, as the bandwidth is too low (1 Gb/s) and the latency too high (~ a few milliseconds).

All the public hosts are also connected to a Dell 8132F 10-gigabit Ethernet (10GbE) switch, which, via UCSC's 10G routers, exposes Hyades to the chaotic and wild Internet. Moreover, it is worth noting that the Network File Systems (/home & /trove) are served to the Master Node and Eudora via 10GbE.

User Environment

Each user has a home directory at /home/$USER, where $USER is the username. The home directory is NFS-mounted on all the nodes in the Hyades cluster. It has a usable capacity of 36TB and is intended for storing your source codes and configuration files, and some reasonable amount of data as well. But because its I/O performance is relatively sluggish, do not run your jobs from your home directory!

Instead you should run jobs from the Lustre scratch storage, which is mounted at /pfs on all the nodes. For your convenience, a symbolic link pfs (pointing to /pfs/$USER) is also created in your home directory.
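As a minimal sketch of how a script might guard against running from the wrong place, the helper below checks whether a directory lies under /pfs. The function name is_under_pfs is hypothetical. Because the pfs entry in your home directory is a symbolic link, the example resolves the physical path with pwd -P before testing it.

```shell
#!/bin/bash
# Sketch: warn when the current directory is not on the Lustre scratch storage.
# is_under_pfs is a hypothetical helper name.

is_under_pfs() {
    # $1 = absolute path; succeeds only for paths below /pfs
    case "$1" in
        /pfs/*) return 0 ;;
        *)      return 1 ;;
    esac
}

# pwd -P resolves symbolic links such as ~/pfs -> /pfs/$USER
if ! is_under_pfs "$(pwd -P)"; then
    echo "warning: current directory is not under /pfs" >&2
fi
```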

For more details on storage, see the subsection Storage.

Module

      For more details on this topic, see Module.

We use the Environment Modules tool to manage users' software environment, via modulefiles. The Intel Compilers module and the Intel MPI module are loaded by default.

To see what modules are currently loaded, run:

module list

These modules are loaded by default:

  • intel_mpi/4.1.3
  • intel_compilers/14.0.1

To see what modules are available, run:

module avail

To learn the usage of the module tool, run:

module --help
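The module commands above can also be used inside scripts. The following is a minimal sketch, assuming the module command is available in the shell that runs the script; the function name module_loaded is hypothetical. Depending on the version, module list may write its listing to standard error, so the sketch merges both streams before searching.

```shell
#!/bin/bash
# Sketch: test whether a given module is currently loaded.
# module_loaded is a hypothetical helper name.

module_loaded() {
    # $1 = module name, e.g. intel_mpi or intel_mpi/4.1.3
    # 2>&1 merges stderr into stdout, since the listing may go to either.
    module list 2>&1 | grep -qF -- "$1"
}

# Only run the check when the module command exists in this shell.
if command -v module >/dev/null 2>&1; then
    module_loaded intel_mpi || echo "intel_mpi is not loaded" >&2
fi
```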

Compiling codes

      Main article: Compilers

Compiling Serial Programs

Intel Compilers are the default and recommended compilers on Hyades; PGI Compilers and GNU Compiler Collection (GCC) are available as alternatives. The following table summarizes how to compile C/C++ and Fortran 77/90 serial programs using the Intel Compilers.

Compiler Program Type Suffix Example
icc C .c icc [compiler_options] prog.c
icpc C++ .C, .cc, .cpp, .cxx icpc [compiler_options] prog.cpp
ifort Fortran 77 .f, .for, .ftn ifort [compiler_options] prog.f
ifort Fortran 90 .f90, .fpp ifort [compiler_options] prog.f90

Here are a few examples:

To compile hello.c, a serial "Hello world" program written in C, run

icc -o hello.x hello.c

To compile hello.cpp, a serial "Hello world" program written in C++, run

icpc -o hello.x hello.cpp

To compile hello.f, a serial "Hello world" program written in Fortran 77, run

ifort -o hello.x hello.f

To compile hello.f90, a serial "Hello world" program written in Fortran 90, run

ifort -o hello.x hello.f90

Compiling MPI Programs

      Main article: MPI

Intel MPI is the default MPI implementation on Hyades. The Intel MPI Library is a multi-fabric message passing library that implements the Message Passing Interface v2.2 (MPI-2.2) specification. The following table summarizes how to compile MPI programs in C/C++ and Fortran 77/90, using Intel MPI.

MPI Compiler Command Default Compiler Supported Language(s)
mpicc gcc C
mpicxx g++ C/C++
mpifc gfortran Fortran 77/90
mpigcc gcc C
mpigxx g++ C/C++
mpif77 g77 Fortran 77
mpif90 gfortran Fortran 90
mpiicc icc C
mpiicpc icpc C++
mpiifort ifort Fortran 77/90

The mpicmds in the table above are just wrappers of the GNU and Intel compilers. They automatically link startup and message passing libraries for Intel MPI into the executables. Here are a few examples:

To compile mpi_hello.c, an MPI "Hello world" program written in C, run

mpiicc -o mpi_hello.x mpi_hello.c

To compile mpi_hello.cpp, an MPI "Hello world" program written in C++, run

mpiicpc -o mpi_hello.x mpi_hello.cpp

To compile mpi_hello.f, an MPI "Hello world" program written in Fortran 77, run

mpiifort -o mpi_hello.x mpi_hello.f

To compile mpi_hello.f90, an MPI "Hello world" program written in Fortran 90, run

mpiifort -o mpi_hello.x mpi_hello.f90

Compiling OpenMP Programs

To compile OpenMP programs using Intel compilers, option -openmp must be set. Here are a few examples:

To compile omp_hello.c, an OpenMP "Hello world" program written in C, run

icc -openmp -o omp_hello.x omp_hello.c

To compile omp_hello.cpp, an OpenMP "Hello world" program written in C++, run

icpc -openmp -o omp_hello.x omp_hello.cpp

To compile omp_hello.f, an OpenMP "Hello world" program written in Fortran 77, run

ifort -openmp -o omp_hello.x omp_hello.f

To compile omp_hello.f90, an OpenMP "Hello world" program written in Fortran 90, run

ifort -openmp -o omp_hello.x omp_hello.f90

Note: Cache Line Size
For OpenMP programming, it is very important to know the cache line size of the CPU. Here is a tip on how to get the value on a Linux machine:
$ cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
64 (bytes)
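The same lookup can be wrapped in a function with a fallback. This is a minimal sketch: the function name cache_line_size is hypothetical, and the getconf variable LEVEL1_DCACHE_LINESIZE is a glibc extension that may be missing or report 0 on some systems.

```shell
#!/bin/bash
# Sketch: report the cache line size in bytes, with a fallback.
# cache_line_size is a hypothetical helper name.

cache_line_size() {
    f=/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
    if [ -r "$f" ]; then
        cat "$f"
    elif size=$(getconf LEVEL1_DCACHE_LINESIZE 2>/dev/null) && [ -n "$size" ]; then
        # glibc-specific fallback; not available everywhere
        echo "$size"
    else
        echo unknown
    fi
}

cache_line_size
```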

Compiling Hybrid Programs

All the nodes in Hyades are Non-Uniform Memory Access (NUMA) systems. In each of the compute nodes, each of the two (2x) Intel Sandy Bridge Xeon processors has its own integrated memory controller and PCI Express controller, and the 2 processors (16 cores in total) share a single QDR InfiniBand link. There are 2 NUMA nodes per compute node, each processor belonging to one. To extract the maximal performance out of such an architecture, it is often profitable to employ a hybrid programming model, in which we launch only one MPI process on each processor (NUMA node) and then start one thread on each core of the processor. This model often compares favorably with the pure MPI model, in which we launch one MPI process on each processor core.

Note: Two options, -mt_mpi and -openmp, must be set when compiling MPI/OpenMP hybrid programs:
  • -mt_mpi: linking the thread safe version of the Intel MPI Library
  • -openmp: enabling the parallelizer to generate multi-threaded code based on OpenMP directives

Here are a few examples on how to compile MPI/OpenMP hybrid programs:

To compile hybrid_hello.c, a hybrid "Hello world" program written in C, run

mpiicc -mt_mpi -openmp -o hybrid_hello.x hybrid_hello.c

To compile hybrid_hello.cpp, a hybrid "Hello world" program written in C++, run

mpiicpc -mt_mpi -openmp -o hybrid_hello.x hybrid_hello.cpp

To compile hybrid_hello.f, a hybrid "Hello world" program written in Fortran 77, run

mpiifort -mt_mpi -openmp -o hybrid_hello.x hybrid_hello.f

To compile hybrid_hello.f90, a hybrid "Hello world" program written in Fortran 90, run

mpiifort -mt_mpi -openmp -o hybrid_hello.x hybrid_hello.f90

Intel Compiler Options

Compiler options must be used to achieve optimal performance of any application. Generally, the highest impact can be achieved by selecting an appropriate optimization level, by targeting the architecture of the computer (CPU, cache, memory system), and by allowing for interprocedural analysis (inlining, etc.). There is no set of options that gives the highest speed-up for all applications. Consequently, different combinations have to be explored.

The most basic optimization controls that the compiler offers are the -On options, explained below.

Level Description
n = 0 Fast compilation, full debugging support; equivalent to -g
n = 1,2 Low to moderate optimization, partial debugging support:
  • instruction rescheduling
  • copy propagation
  • software pipelining
  • common subexpression elimination
  • prefetching, loop transformations
n = 3+ Aggressive optimization - compile time/space intensive and/or marginal effectiveness; may change code semantics and results (sometimes even breaks code!):
  • enables -O2
  • more aggressive prefetching, loop transformations
The following table lists some of the more important compiler options that affect application performance, based on the target architecture, application behavior, loading, and debugging.

Option Description
-c For compilation of source file only.
-O3 Aggressive optimization (-O2 is default).
-xAVX Optimizes for Intel processors that support AVX (Advanced Vector Extensions) instructions.
-g Debugging information, generates symbol table.
-mp Maintain floating point precision (disables some optimizations).
-mp1 Improve floating-point precision (speed impact is less than -mp).
-ip Enable single-file interprocedural (IP) optimizations (within files).
-ipo Enable multi-file IP optimizations (between files).
-prefetch Enables data prefetching (requires -O3).
-openmp Enable the parallelizer to generate multi-threaded code based on the OpenMP directives.

For more compiler/linker options, check the ifort and icc man pages, or consult Intel's online documentation.
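Since different combinations of options have to be explored, it can help to keep them in one place. The following is a minimal sketch of such a helper: the function name intel_flags and the two build types are hypothetical, and the flag sets are merely one plausible starting point drawn from the tables above, not a recommendation.

```shell
#!/bin/bash
# Sketch: pick a set of Intel compiler options by build type.
# intel_flags and the build-type names are hypothetical.

intel_flags() {
    case "$1" in
        debug)   echo "-O0 -g" ;;           # fast compile, full debugging support
        release) echo "-O3 -xAVX -ip" ;;    # aggressive optimization, AVX, single-file IP
        *)       echo "usage: intel_flags debug|release" >&2
                 return 1 ;;
    esac
}

# Example (requires the Intel compilers module):
#   icc $(intel_flags release) -o hello.x hello.c
```

Because -O3 may change results, it is worth comparing the output of a release build against a build at the default -O2.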

Running Codes

Do not run your codes from your home directory, which is slow in speed and limited in capacity.

Torque

      For more details on this topic, see Torque.

On Hyades we use Torque as the resource manager and Maui as the job scheduler. Torque is an open-source derivative of Portable Batch System (PBS). Commonly used Torque tools include:

  • qsub, for submitting PBS jobs
  • qstat, for monitoring the status of jobs
  • qdel, for terminating jobs prior to completion

Consult the man pages for more detailed information regarding these commands.

Users submit jobs to a queue and wait in line until nodes become available to run the job. There are 3 queues: normal, hyper, and gpu. The default queue is normal: your job will be submitted to it if no queue name is specified. The following table summarizes the queue characteristics (n below is the number of nodes requested for the job):

Queue Total # of nodes resource per node Max Walltime qsub options
normal 120 16 cores 2 days -l nodes=n:ppn=16 -q normal
hyper 60 32 cores 4 days -l nodes=n:ppn=32 -q hyper
gpu 8 16 cores and 1 GPU 10 days -l nodes=n:ppn=16 -q gpu
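The last column of the table can be expressed as a small lookup. This is a minimal sketch; the function name qsub_opts is hypothetical, and the per-queue core counts are copied from the table above.

```shell
#!/bin/bash
# Sketch: build the qsub resource options for a queue and node count.
# qsub_opts is a hypothetical helper name.

qsub_opts() {
    # $1 = queue name (normal, hyper or gpu), $2 = number of nodes
    queue=$1
    nodes=$2
    case "$queue" in
        normal|gpu) ppn=16 ;;   # Hyper-Threading off: 16 cores per node
        hyper)      ppn=32 ;;   # Hyper-Threading on: 32 logical cores per node
        *) echo "unknown queue: $queue" >&2
           return 1 ;;
    esac
    echo "-l nodes=${nodes}:ppn=${ppn} -q ${queue}"
}

# Example: qsub $(qsub_opts hyper 2) job.pbs
```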

To run your code on Hyades, usually you create a PBS job script, and then use the qsub command to submit the job to a queue. A PBS script is a shell script that contains a few extra comments at the beginning specifying directives to Torque/PBS. You are free to use your favorite shell; we use Bash in the following examples.

Running Serial Programs

To run the serial executable hello.x compiled in subsection Compiling Serial Programs, first make sure the executable resides in the Lustre scratch storage (in /pfs/$USER or one of its subdirectories); then create a PBS script named serial.pbs in the same directory, with the following content:

#!/bin/bash
 
#PBS -N serial
#PBS -l ncpus=1
#PBS -l walltime=0:10:00
 
cd $PBS_O_WORKDIR
./hello.x

Annotations of serial.pbs:

  • #PBS -N serial: the job name is serial
  • #PBS -l ncpus=1: we request 1 core for the job; alternatively, we can use #PBS -l nodes=1:ppn=1
  • #PBS -l walltime=0:10:00: we request 10 minutes of run time
  • if we want to submit the job to a queue other than the default normal, add a line like #PBS -q hyper
  • cd $PBS_O_WORKDIR: required; PBS starts the scripts from the home directory on the executing computing node

We submit the job with the following command:

qsub serial.pbs

Torque/PBS will print out the job ID, e.g.:

12345.hyades.ucsc.edu

The standard output of the executable will be saved in a file whose name has the following form: job_name.ojob_ID. When our job is completed, for example, we'll get serial.o12345:

$ cat serial.o12345
Hello, world!
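The naming rule above can be sketched as a one-line function that derives the output file name from the job name and the ID printed by qsub. The function name stdout_file is hypothetical, and the sketch covers plain (non-array) jobs only.

```shell
#!/bin/bash
# Sketch: derive the standard-output file name of a non-array job.
# stdout_file is a hypothetical helper name.

stdout_file() {
    # $1 = job name (from #PBS -N), $2 = job ID as printed by qsub
    # ${2%%.*} strips everything from the first dot, leaving the numeric ID.
    echo "$1.o${2%%.*}"
}

# Example (inside the submit directory):
#   jobid=$(qsub serial.pbs)
#   echo "output will appear in $(stdout_file serial "$jobid")"
```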

Running Embarrassingly Parallel Programs

      For more details on this topic, see PBS Job Array.

Oftentimes we need to run a lot of instances of the same executable simultaneously, but with different parameters. For example, here is a sample serial program (jobarray_hello.x) that takes an integer argument:

./jobarray_hello.x 23
Hello master, I am slave no. 23 running on hyades.ucsc.edu!

For illustration, let's assume that we need to run the following instances:

./jobarray_hello.x 101
./jobarray_hello.x 102
...
./jobarray_hello.x 164

Instead of submitting 64 serial jobs, we can submit one job array. Create a PBS script named jobarray.pbs, with the following content:

#!/bin/bash
 
#PBS -N jobarray
#PBS -l ncpus=1
#PBS -t 101-164
#PBS -l walltime=0:10:00
 
cd $PBS_O_WORKDIR
./jobarray_hello.x $PBS_ARRAYID

Annotations of jobarray.pbs:

  • #PBS -N jobarray: the job name is jobarray
  • #PBS -l ncpus=1: although the job array will use 64 cores in total, each member of the job array will use only 1 core
  • #PBS -t 101-164: task ids of the job array
  • if we want to submit the job to a queue other than the default normal, add a line like #PBS -q hyper
  • $PBS_ARRAYID: each member of the job array is assigned a unique identifier with the option -t above

We submit the jobs with the following command:

qsub jobarray.pbs

Each member's standard output will be saved in a file whose name has the following form: job_name.ojob_ID-task_id. When our job array is completed, for example, we'll get jobarray.o12345-101, ..., jobarray.o12345-164.

$ cat jobarray.o12345-103
Hello master, I am slave no. 103 running on astro-3-5.local!
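When the parameters are not simply the task IDs themselves, a common approach is to keep one parameter set per line in a text file and let each array member pick its own line. The following is a minimal sketch of that idea; the function name param_for_task and the file name params.txt are hypothetical.

```shell
#!/bin/bash
# Sketch: map a job-array task ID to a line of a parameter file.
# param_for_task and params.txt are hypothetical names.

param_for_task() {
    # $1 = parameter file, $2 = task ID, $3 = first task ID of the -t range
    line=$(( $2 - $3 + 1 ))     # task 101 of range 101-164 reads line 1
    sed -n "${line}p" "$1"
}

# Example (inside jobarray.pbs, with #PBS -t 101-164):
#   ./jobarray_hello.x $(param_for_task params.txt "$PBS_ARRAYID" 101)
```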

Running MPI programs

Assume that we've successfully compiled the sample MPI program mpi_hello.c in subsection Compiling MPI Programs, and we want to run the executable mpi_hello.x on 64 cores. First make sure the executable resides in the Lustre scratch storage (in /pfs/$USER or one of its subdirectories); then create a PBS script named impi.pbs in the same directory, with the following content:

#!/bin/bash
 
#PBS -N impi
#PBS -l nodes=4:ppn=16
#PBS -l walltime=0:10:00
 
cd $PBS_O_WORKDIR
mpirun -genv I_MPI_FABRICS shm:ofa -n 64 ./mpi_hello.x

Annotations of impi.pbs:

  • #PBS -N impi: the job name is impi
  • #PBS -l nodes=4:ppn=16: the job will run on 4 nodes (64 cores) in the default normal queue
  • if we want to submit the job to the hyper queue instead, replace #PBS -l nodes=4:ppn=16 with the following 2 lines:
    • #PBS -q hyper
    • #PBS -l nodes=2:ppn=32
  • cd $PBS_O_WORKDIR: required; PBS starts the scripts from the home directory on the executing computing node
  • -genv I_MPI_FABRICS shm:ofa: we use shared memory for intra-node communication and OFED verbs for inter-node communication

We are now ready to submit the job:

qsub impi.pbs
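Rather than hard-coding the process count in the mpirun line, a script can derive it from the node file that Torque provides. This is a minimal sketch, assuming that $PBS_NODEFILE contains one line per allocated core, as is usual for nodes=n:ppn=m requests; the function names count_slots and count_nodes are hypothetical.

```shell
#!/bin/bash
# Sketch: count allocated cores and distinct nodes from a Torque node file.
# count_slots and count_nodes are hypothetical helper names.

count_slots() {
    # One line per allocated core; $(( )) strips any padding from wc output.
    echo $(( $(wc -l < "$1") ))
}

count_nodes() {
    # Each host name is repeated once per core; count the distinct names.
    echo $(( $(sort -u "$1" | wc -l) ))
}

# Example (inside impi.pbs):
#   mpirun -genv I_MPI_FABRICS shm:ofa -n "$(count_slots "$PBS_NODEFILE")" ./mpi_hello.x
```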

Running OpenMP Programs

To run the OpenMP executable omp_hello.x (compiled in subsection Compiling OpenMP Programs) on 16 cores of a compute node, create a PBS script named omp.pbs, with the following content:

#!/bin/bash
 
#PBS -N omp
#PBS -l nodes=1:ppn=16
#PBS -l walltime=0:10:00
 
export OMP_NUM_THREADS=16
cd $PBS_O_WORKDIR
./omp_hello.x

Annotations of omp.pbs:

  • #PBS -N omp: the job name is omp
  • #PBS -l nodes=1:ppn=16: we request 16 cores on a compute node
  • #PBS -l walltime=0:10:00: we request 10 minutes of run time
  • if we want to submit the job to a queue other than the default normal, add a line like #PBS -q hyper
  • export OMP_NUM_THREADS=16: set the maximum number of OpenMP threads to 16
  • cd $PBS_O_WORKDIR: required; PBS starts the scripts from the home directory on the executing computing node

We submit the job with:

qsub omp.pbs
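To avoid keeping the thread count and the ppn request in sync by hand, the thread count can be taken from the job environment. This is a minimal sketch, assuming the installed Torque version exports PBS_NUM_PPN (the ppn value of the request); if it does not, the fallback value is used. The function name omp_threads is hypothetical.

```shell
#!/bin/bash
# Sketch: choose the OpenMP thread count from the job's ppn request.
# omp_threads is a hypothetical helper name; PBS_NUM_PPN is assumed to be
# set by Torque and may be absent on some versions.

omp_threads() {
    # $1 = fallback thread count when PBS_NUM_PPN is not set (default 16)
    echo "${PBS_NUM_PPN:-${1:-16}}"
}

OMP_NUM_THREADS=$(omp_threads 16)
export OMP_NUM_THREADS
```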

Running Hybrid Programs

To run the MPI/OpenMP hybrid hybrid_hello.x (compiled in subsection Compiling Hybrid Programs) on 64 cores (8 MPI processes and 8 OpenMP threads per MPI process), create a PBS script named hybrid.pbs, with the following content:

#!/bin/bash
 
#PBS -N hybrid
#PBS -l nodes=4:ppn=16
#PBS -l walltime=0:10:00
 
cd $PBS_O_WORKDIR
cat $PBS_NODEFILE | sort | uniq > hosts.$PBS_JOBID
export OMP_NUM_THREADS=8
export I_MPI_PIN_DOMAIN=omp
export KMP_AFFINITY=compact
mpirun -machine hosts.$PBS_JOBID -genv I_MPI_FABRICS shm:ofa -n 8 -ppn 2 ./hybrid_hello.x

Annotations of hybrid.pbs:

  • #PBS -N hybrid: the job name is hybrid
  • #PBS -l nodes=4:ppn=16: the job will run on 4 nodes (64 cores) in the default normal queue
  • cd $PBS_O_WORKDIR: required; PBS starts the scripts from the home directory on the executing computing node
  • export OMP_NUM_THREADS=8: set the maximum number of OpenMP threads to 8
  • export I_MPI_PIN_DOMAIN=omp: control process pinning
  • export KMP_AFFINITY=compact: bind OpenMP threads to physical processing units
  • -genv I_MPI_FABRICS shm:ofa: we use shared memory for intra-node communication and OFED verbs for inter-node communication
  • -ppn 2: 2 MPI processes per compute node (one on each processor)
  • -n 8: 8 MPI processes in total
  • if we want to submit the job to the hyper queue instead, use
    • #PBS -q hyper
    • #PBS -l nodes=2:ppn=32
    • mpirun -genv I_MPI_FABRICS shm:ofa -n 8 -ppn 4 ./hybrid_hello.x

We submit the job with:

qsub hybrid.pbs
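The arithmetic behind the -n, -ppn and OMP_NUM_THREADS values can be sketched as follows. The function name hybrid_layout is hypothetical; it simply divides the cores of each node among the MPI processes placed on it.

```shell
#!/bin/bash
# Sketch: compute the hybrid MPI/OpenMP layout for a job.
# hybrid_layout is a hypothetical helper name.

hybrid_layout() {
    # $1 = nodes, $2 = cores per node (ppn), $3 = MPI processes per node
    if [ $(( $2 % $3 )) -ne 0 ]; then
        echo "cores per node must be divisible by processes per node" >&2
        return 1
    fi
    threads=$(( $2 / $3 ))      # OpenMP threads per MPI process
    ranks=$(( $1 * $3 ))        # MPI processes in total
    echo "OMP_NUM_THREADS=$threads -n $ranks -ppn $3"
}

hybrid_layout 4 16 2    # normal queue example from hybrid.pbs
hybrid_layout 2 32 4    # hyper queue variant
```

Both calls yield 8 MPI processes with 8 threads each, matching the two variants of the script above.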