Hyades QuickStart Guide
Hyades is a supercomputer dedicated to Computational Astrophysics research at the University of California, Santa Cruz (UCSC). It is supported by a million-dollar grant from the National Science Foundation (award number AST-1229745) and additional matching funds from UCSC.
- 1 System Overview
- 2 User Environment
- 3 Compiling Codes
- 4 Running Codes
System Overview
Architecturally, Hyades is a cluster composed of the following components:
|Component||Quantity||Description|
|Master Node||1||Dell PowerEdge R820, 4x 8-core Intel Xeon E5-4620 (2.2 GHz), 128GB memory, 8x 1TB HDDs|
|Analysis Node||1||Dell PowerEdge R820, 4x 8-core Intel Xeon E5-4640 (2.4 GHz), 512GB memory, 2x 600GB SSDs|
|Type I Compute Nodes||180||Dell PowerEdge R620, 2x 8-core Intel Xeon E5-2650 (2.0 GHz), 64GB memory, 1TB HDD|
|Type IIa Compute Nodes||8||Dell PowerEdge C8220x, 2x 8-core Intel Xeon E5-2650 (2.0 GHz), 64GB memory, 2x 500GB HDDs, 1x Nvidia K20|
|Type IIb Compute Nodes||1||Dell PowerEdge R720, 2x 6-core Intel Xeon E5-2630L (2.0 GHz), 64GB memory, 500GB HDD, 2x Xeon Phi 5110P|
|Lustre Storage||1||146TB of usable storage served from a Terascala/Dell storage cluster|
|ZFS Server||1||SuperMicro Server, 2x 4-core Intel Xeon E5-2609V2 (2.5 GHz), 64GB memory, 2x 120GB SSDs, 36x 4TB HDDs|
|Cloud Storage||1||1PB of raw storage served from a Huawei UDS system|
|InfiniBand||17||17x Mellanox IS5024 QDR (40Gb/s) InfiniBand switches, configured in a 1:1 non-blocking Fat Tree topology|
|Gigabit Ethernet||7||7x Dell 6248 GbE switches, stacked in a Ring topology|
|10-gigabit Ethernet||1||1x Dell 8132F 10GbE switch|
The Master/Login Node is the entry point to the Hyades cluster. It is a Dell PowerEdge R820 server that contains four (4x) 8-core Intel Sandy Bridge Xeon E5-4620 processors at 2.2 GHz, 128 GB memory and eight (8x) 1TB hard drives in a RAID-6 array. Primary tasks to be performed on the Master Node are:
- Editing codes and scripts
- Compiling codes
- Short test runs and debugging runs
- Submitting and monitoring jobs
Please do not run computationally intensive jobs on the Master Node. Those jobs should be submitted to the Torque batch system, which will allocate them to run on the compute nodes.
The hostname of the Master Node is hyades.ucsc.edu (IP: 126.96.36.199). To access the Master Node, use an SSH client that supports the SSH-2 protocol. Then execute the following command (replace username with your own real username):
ssh -l username hyades.ucsc.edu
Visualization & Analysis Node
The Visualization & Analysis Node is Eudora (hostname: eudora.ucsc.edu; IP: 188.8.131.52). Eudora is another public host of the Hyades cluster. It is a Dell PowerEdge R820 server that contains four (4x) 8-core Intel Sandy Bridge Xeon E5-4640 processors at 2.4 GHz, 512 GB memory and two (2x) 600GB SSDs in a RAID-0 array. Eudora is designed to run jobs that require a lot of memory and/or fast I/O. It is ideal for Visualization & Data Analysis tasks. To access Eudora, run (replace username with your own username):
ssh -l username eudora.ucsc.edu
There are 3 types of Compute Nodes in the Hyades cluster.
▸ Type I Compute Nodes (CNs I) are conventional compute nodes (180 in total). Each CN I is a Dell PowerEdge R620 server containing two (2x) 8-core Intel Sandy Bridge Xeon E5-2650 processors at 2.0 GHz, 64 GB memory and one 1TB hard drive. Of the 180 CNs I, two thirds (120 nodes) have Hyper-Threading turned off, so the operating system addresses 16 cores in each node; the remaining third (60 nodes) have Hyper-Threading turned on, so the operating system addresses 32 virtual or logical cores in each node. The former belong to the normal queue and the latter to the hyper queue of the Torque batch system.
▸ Type IIa Compute Nodes (CNs IIa) are GPU nodes (8 in total). Each CN IIa is a Dell PowerEdge C8220x server containing two (2x) 8-core Intel Sandy Bridge Xeon E5-2650 processors at 2.0 GHz, 64 GB memory, two (2x) 500GB hard drives, and one Nvidia K20 GPU. All GPU nodes have Hyper-Threading turned off (so the operating system addresses 16 cores in each node), and they belong to the gpu queue of the Torque batch system.
▸ Many Integrated Core (MIC) Architecture is Intel's response to GPU or accelerated computing. We were very fortunate to receive a donation of two (2x) Xeon Phi 5110P processors from Intel in 2013. We have since integrated those two Xeon Phi processors into a Dell PowerEdge R720 server, which contains two (2x) 6-core Intel Sandy Bridge Xeon E5-2630L processors at 2.0 GHz, 64 GB memory and one 500GB hard drive. That machine is our one and only Type IIb Compute Node (CN IIb), christened Aesyle. To experiment with MIC computing, please consult our MIC QuickStart Guide.
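To check which kind of Type I or Type IIa node a shell is running on, one can count the logical cores that the operating system addresses. The following is a minimal sketch, assuming a Linux node; the function names are illustrative and the classification only makes sense on the 2x 8-core compute nodes described above.

```shell
#!/bin/bash
# Count the logical cores the operating system addresses on this node.
# Assumes Linux: prefers nproc, falls back to /proc/cpuinfo.
logical_cores() {
    if command -v nproc >/dev/null 2>&1; then
        nproc --all
    else
        grep -c '^processor' /proc/cpuinfo
    fi
}

# Interpret a core count using the figures quoted for 2x 8-core compute nodes.
classify_node() {
    case "$1" in
        16) echo "16 logical cores: Hyper-Threading off" ;;
        32) echo "32 logical cores: Hyper-Threading on" ;;
        *)  echo "$1 logical cores: not a 2x 8-core compute node" ;;
    esac
}

classify_node "$(logical_cores)"
```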
Storage
The Storage subsystem of the Hyades cluster is a rich medley of many interesting technologies.
▸ On each node, we use tmpfs for /tmp; thus up to half of the memory (32GB on compute nodes and 256GB on Eudora) can be used for lightning fast file storage. In addition, part of the local hard drive is made available as scratch space too (mounted at /scratch). Used judiciously, those scratch spaces can greatly improve the I/O performance of our simulations, and users are warmly encouraged to explore them. If you do use them, please make sure to clean them up after your job is done, as they are temporary by nature.
▸ The /home partition is served from a ZFS pool on a FreeBSD server. The server is a SuperMicro box containing 2x 4-core Intel Ivy Bridge Xeon E5-2609V2 processors at 2.5 GHz, 64GB memory, 2x 120GB SSDs, 36x 4TB HDDs. Among those 36x HDDs, 12 are in a RAIDZ2 ZFS volume which is NFS-mounted at /home on each node; the remaining 24 are in another RAIDZ2 ZFS volume which is NFS-mounted at /trove on the Master Node & Eudora.
▸ On Hyades, the workhorse file system is Lustre, which is a high performance parallel distributed file system, and is widely used in top supercomputers in the world. The Lustre storage of Hyades is served from a Terascala/Dell storage cluster. It provides 146TB of usable capacity, and is mounted at /pfs on each node.
▸ The last piece of our storage jigsaw is a Huawei Cloud Storage system. In 2013, we were very privileged to collaborate with Huawei on deploying a UDS cloud storage system at UCSC. The Huawei Cloud Storage system provides a petabyte-level data storage, archiving, and sharing platform for Hyades. It is an Object Storage System, utilizing the Amazon S3 protocol. For further details, please refer to the main article Huawei Cloud Storage.
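For the node-local scratch spaces described in the first item above, the clean-up step is easy to forget. One possible pattern, shown here only as a sketch, is to wrap the work in a helper that creates a private directory and removes it afterwards. SCRATCH_ROOT and run_in_scratch are hypothetical names, not part of the Hyades setup; SCRATCH_ROOT defaults to /tmp and could be set to /scratch.

```shell
#!/bin/bash
# Sketch: stage work in node-local scratch and always clean up afterwards.
# SCRATCH_ROOT is a hypothetical variable; set it to /scratch to use local disk.
SCRATCH_ROOT="${SCRATCH_ROOT:-/tmp}"

run_in_scratch() {
    local workdir status
    # Create a private directory so that concurrent jobs cannot collide.
    workdir=$(mktemp -d "$SCRATCH_ROOT/${USER:-user}.XXXXXX") || return 1
    # Run the given command inside the scratch directory (in a subshell).
    ( cd "$workdir" && "$@" )
    status=$?
    # Remove the directory whether or not the command succeeded.
    rm -rf "$workdir"
    return "$status"
}

# Example: write a temporary file in scratch and report its size in bytes.
run_in_scratch sh -c 'echo "intermediate data" > tmp.dat && wc -c < tmp.dat'
```

Results that must survive the job still have to be copied out (for example to /pfs/$USER) before the helper returns.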
Here is a bird's eye view of the internetworking of the Hyades cluster.
▸ InfiniBand fabric is the expressway of Hyades. The backbone of the InfiniBand fabric is made up of 17 Mellanox IS5024 QDR InfiniBand switches, which are interconnected to form a 1:1 non-blocking Fat Tree topology. The InfiniBand fabric delivers high bandwidth (40 Gb/s) as well as low latency (~ 1 microsecond). Every Compute Node, the Master Node, Eudora, as well as the Lustre storage cluster are all plugged into InfiniBand fabric. By default, the Message Passing of your MPI programs is conducted through InfiniBand; so is the Lustre file system served to all nodes in the cluster.
▸ Every Compute Node, the Master Node, Eudora, the Lustre storage cluster, plus the ZFS file server are all interconnected through a Gigabit Ethernet (GbE) fabric too. The backbone of the fabric is made up of 7 Dell 6248 GbE switches, stacked in a Ring topology. The GbE is mostly used for management and Network File System (NFS) traffic. Although it is possible to run MPI programs through Gigabit Ethernet, it is not wise to do so, as the bandwidth is too low (1 Gb/s, of course) and the latency too high (~ a few milliseconds).
▸ All the public hosts are also connected to a Dell 8132F 10-gigabit Ethernet (10GbE) switch, which, via UCSC's 10G routers, exposes Hyades to the chaotic and wild Internet. Moreover, it is worth noting that the Network File Systems (/home & /trove) are served to the Master Node and Eudora via 10GbE.
User Environment
Each user has a home directory at /home/$USER, where $USER is the username. The home directory is NFS-mounted on all the nodes in the Hyades cluster. It has a usable capacity of 36TB and is intended for storing your source codes and configuration files, and some reasonable amount of data as well. But because its I/O performance is relatively sluggish, do not run your jobs from your home directory!
Instead you should run jobs from the Lustre scratch storage, which is mounted at /pfs on all the nodes. For your convenience, a symbolic link pfs (pointing to /pfs/$USER) is also created in your home directory.
For more details on storage, see the subsection Storage.
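Since jobs should be run from /pfs rather than from the home directory, a job script can guard against being started in the wrong place. The sketch below assumes only the /pfs mount point mentioned above; the function name on_lustre and the PFS_ROOT override are hypothetical.

```shell
#!/bin/bash
# Return success if the given directory (default: the physical current
# directory) lies under the Lustre mount point.
# PFS_ROOT is a hypothetical override, useful for testing elsewhere.
on_lustre() {
    local dir="${1:-$(pwd -P)}" root="${PFS_ROOT:-/pfs}"
    case "$dir/" in
        "$root"/*) return 0 ;;
        *)         return 1 ;;
    esac
}

if on_lustre; then
    echo "OK: running from Lustre scratch"
else
    echo "Warning: $(pwd -P) is not under /pfs"
fi
```

Using pwd -P resolves the pfs symbolic link in the home directory, so a job started from ~/pfs is recognized as running on Lustre.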
For more details on this topic, see Module.
▸ To see what modules are currently loaded, run:
module list
These modules are loaded by default:
▸ To see what modules are available, run:
module avail
▸ To learn the usage of the module tool, run:
module help
Compiling Codes
Main article: Compilers
Compiling Serial Programs
Intel Compilers are the default and recommended compilers on Hyades; PGI Compilers and GNU Compiler Collection (GCC) are available as alternatives. The following table summarizes how to compile C/C++ and Fortran 77/90 serial programs using the Intel Compilers.
|Compiler||Language||File Extension||Example|
|icc||C||.c||icc [compiler_options] prog.c|
|icpc||C++||.C, .cc, .cpp, .cxx||icpc [compiler_options] prog.cpp|
|ifort||Fortran 77||.f, .for, .ftn||ifort [compiler_options] prog.f|
|ifort||Fortran 90||.f90, .fpp||ifort [compiler_options] prog.f90|
Here are a few examples:
▸ To compile hello.c, a serial "Hello world" program written in C, run
icc -o hello.x hello.c
▸ To compile hello.cpp, a serial "Hello world" program written in C++, run
icpc -o hello.x hello.cpp
▸ To compile hello.f, a serial "Hello world" program written in Fortran 77, run
ifort -o hello.x hello.f
▸ To compile hello.f90, a serial "Hello world" program written in Fortran 90, run
ifort -o hello.x hello.f90
Compiling MPI Programs
Main article: MPI
Intel MPI is the default MPI implementation on Hyades. The Intel MPI Library is a multi-fabric message passing library that implements the Message Passing Interface v2.2 (MPI-2.2) specification. The following table summarizes how to compile MPI programs in C/C++ and Fortran 77/90, using Intel MPI.
|MPI Compiler Command||Default Compiler||Supported Language(s)|
The mpicmds in the table above are just wrappers of the GNU and Intel compilers. They automatically link startup and message passing libraries for Intel MPI into the executables. Here are a few examples:
▸ To compile mpi_hello.c, an MPI "Hello world" program written in C, run
mpiicc -o mpi_hello.x mpi_hello.c
▸ To compile mpi_hello.cpp, an MPI "Hello world" program written in C++, run
mpiicpc -o mpi_hello.x mpi_hello.cpp
▸ To compile mpi_hello.f, an MPI "Hello world" program written in Fortran 77, run
mpiifort -o mpi_hello.x mpi_hello.f
▸ To compile mpi_hello.f90, an MPI "Hello world" program written in Fortran 90, run
mpiifort -o mpi_hello.x mpi_hello.f90
Compiling OpenMP Programs
To compile OpenMP programs using Intel compilers, option -openmp must be set. Here are a few examples:
▸ To compile omp_hello.c, an OpenMP "Hello world" program written in C, run
icc -openmp -o omp_hello.x omp_hello.c
▸ To compile omp_hello.cpp, an OpenMP "Hello world" program written in C++, run
icpc -openmp -o omp_hello.x omp_hello.cpp
▸ To compile omp_hello.f, an OpenMP "Hello world" program written in Fortran 77, run
ifort -openmp -o omp_hello.x omp_hello.f
▸ To compile omp_hello.f90, an OpenMP "Hello world" program written in Fortran 90, run
ifort -openmp -o omp_hello.x omp_hello.f90
Compiling Hybrid Programs
All the nodes in Hyades are Non-Uniform Memory Access (NUMA) systems. In each of the compute nodes, each of the two (2x) Intel Sandy Bridge Xeon processors has its own integrated memory controller and PCI Express controller, and the 2 processors (16 cores in total) share a single QDR InfiniBand link. There are 2 NUMA nodes per compute node, each processor belonging to one. To extract the maximal performance out of such an architecture, it is often profitable to employ a hybrid programming model, in which we launch only one MPI process on each processor (NUMA node) and then start one thread on each core of the processor. This model often compares favorably with the pure MPI model, in which we launch one MPI process on each processor core.
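The number of NUMA nodes can be confirmed on a running node. The following is a small sketch that assumes a Linux kernel exposing NUMA information under /sys/devices/system/node; the function name is illustrative.

```shell
#!/bin/bash
# Count the NUMA nodes that the Linux kernel exposes under sysfs.
# Prints 0 if the sysfs directory is absent (e.g. on a non-Linux system).
numa_node_count() {
    local count=0 dir
    for dir in /sys/devices/system/node/node[0-9]*; do
        # An unmatched glob stays literal and fails the directory test.
        if [ -d "$dir" ]; then
            count=$(( count + 1 ))
        fi
    done
    echo "$count"
}

echo "NUMA nodes on ${HOSTNAME:-this host}: $(numa_node_count)"
```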
Here are a few examples on how to compile MPI/OpenMP hybrid programs:
▸ To compile hybrid_hello.c, a hybrid "Hello world" program written in C, run
mpiicc -mt_mpi -openmp -o hybrid_hello.x hybrid_hello.c
▸ To compile hybrid_hello.cpp, a hybrid "Hello world" program written in C++, run
mpiicpc -mt_mpi -openmp -o hybrid_hello.x hybrid_hello.cpp
▸ To compile hybrid_hello.f, a hybrid "Hello world" program written in Fortran 77, run
mpiifort -mt_mpi -openmp -o hybrid_hello.x hybrid_hello.f
▸ To compile hybrid_hello.f90, a hybrid "Hello world" program written in Fortran 90, run
mpiifort -mt_mpi -openmp -o hybrid_hello.x hybrid_hello.f90
Intel Compiler Options
Compiler options must be used to achieve optimal performance of any application. Generally, the highest impact can be achieved by selecting an appropriate optimization level, by targeting the architecture of the computer (CPU, cache, memory system), and by allowing for interprocedural analysis (inlining, etc.). There is no set of options that gives the highest speed-up for all applications. Consequently, different combinations have to be explored.
The most basic optimization control that the compiler offers is the -On family of options, explained below.
|Level||Description|
|n = 0||Fast compilation, full debugging support; enabled automatically by -g unless another -O level is given|
|n = 1,2||Low to moderate optimization, partial debugging support|
|n = 3+||Aggressive optimization: compile time/space intensive and/or marginal effectiveness; may change code semantics and results (sometimes even breaks code!)|
The following table lists some of the more important compiler options that affect application performance, based on the target architecture, application behavior, loading, and debugging.
|Option||Description|
|-c||For compilation of source file only.|
|-O3||Aggressive optimization (-O2 is default).|
|-xAVX||Optimizes for Intel processors that support AVX (Advanced Vector Extensions) instructions.|
|-g||Debugging information, generates symbol table.|
|-mp||Maintain floating point precision (disables some optimizations).|
|-mp1||Improve floating-point precision (speed impact is less than -mp).|
|-ip||Enable single-file interprocedural (IP) optimizations (within files).|
|-ipo||Enable multi-file IP optimizations (between files).|
|-prefetch||Enables data prefetching (requires -O3).|
|-openmp||Enable the parallelizer to generate multi-threaded code based on the OpenMP directives.|
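Because no single set of options is best for every application, it can help to build the same source with a few combinations and time the results. The sketch below is one way to do that; the option sets, the source file name prog.c, and the function name build_variants are illustrative choices, and the script only prints the commands when the compiler or the source file is not found.

```shell
#!/bin/bash
# Sketch: build one source file with several option sets so that the
# resulting executables can be timed against each other.
# The source name and the option sets are illustrative; CC defaults to icc.
build_variants() {
    local src="$1" cc="${CC:-icc}" opts tag exe
    local option_sets=("-O2" "-O3" "-O3 -xAVX" "-O3 -xAVX -ip")
    for opts in "${option_sets[@]}"; do
        # Name each executable after its options, e.g. prog_O3_xAVX.x
        tag=${opts//-/}
        tag=${tag// /_}
        exe="${src%.*}_${tag}.x"
        if command -v "$cc" >/dev/null 2>&1 && [ -f "$src" ]; then
            echo "building: $cc $opts -o $exe $src"
            # $opts is left unquoted on purpose so it splits into separate flags.
            "$cc" $opts -o "$exe" "$src" || echo "build failed: $exe" >&2
        else
            # Compiler or source file not found: only print the command.
            echo "would run: $cc $opts -o $exe $src"
        fi
    done
}

build_variants prog.c
```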
For more compiler/linker options, check the ifort and icc man pages, or consult the following online documentation:
- Intel C++ Compiler XE 14.0 User and Reference Guides
- Intel Fortran Compiler XE 14.0 User and Reference Guides
Running Codes
For more details on this topic, see Torque.
On Hyades we use Torque as the resource manager and Maui as the job scheduler. Torque is an open-source derivative of Portable Batch System (PBS). Commonly used Torque tools include:
- qsub, for submitting PBS jobs
- qstat, for monitoring the status of jobs
- qdel, for terminating jobs prior to completion
Consult the man pages for more detailed information regarding these commands.
Users submit jobs to a queue and wait in line until nodes become available to run the job. There are 3 queues: normal, hyper, and gpu. The default queue is normal; your job will be submitted to it if no queue name is specified. The following table summarizes the queue characteristics (n below is the number of nodes requested for the job):
|Queue||Total # of nodes||resource per node||Max Walltime||qsub options|
|normal||120||16 cores||2 days||-l nodes=n:ppn=16 -q normal|
|hyper||60||32 cores||4 days||-l nodes=n:ppn=32 -q hyper|
|gpu||8||16 cores and 1 GPU||10 days||-l nodes=n:ppn=16 -q gpu|
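The node count n follows from the number of cores a job needs and the cores per node of the chosen queue. As a rough sketch, the helper below turns a queue name and a core count into the qsub options listed in the table; qsub_options is a hypothetical function, and rounding up to whole nodes is an assumption based on the table always requesting all cores of a node.

```shell
#!/bin/bash
# Print the qsub resource options for a queue and a total core count,
# using the per-node core counts from the queue table above.
qsub_options() {
    local queue="$1" cores="$2" ppn nodes
    case "$queue" in
        normal|gpu) ppn=16 ;;
        hyper)      ppn=32 ;;
        *) echo "unknown queue: $queue" >&2; return 1 ;;
    esac
    # Round up to whole nodes (integer ceiling division).
    nodes=$(( (cores + ppn - 1) / ppn ))
    echo "-l nodes=${nodes}:ppn=${ppn} -q ${queue}"
}

qsub_options normal 64    # prints: -l nodes=4:ppn=16 -q normal
qsub_options hyper 100    # prints: -l nodes=4:ppn=32 -q hyper
```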
To run your code on Hyades, usually you create a PBS job script, and then use the qsub command to submit the job to a queue. A PBS script is a shell script that contains a few extra comments at the beginning specifying directives to Torque/PBS. You are free to use your favorite shell; we use Bash in the following examples.
Running Serial Programs
To run the serial executable hello.x compiled in subsection Compiling Serial Programs, first make sure the executable resides in the Lustre scratch storage (in /pfs/$USER or one of its subdirectories); then create a PBS script named serial.pbs in the same directory, with the following content:
#!/bin/bash
#PBS -N serial
#PBS -l ncpus=1
#PBS -l walltime=0:10:00
cd $PBS_O_WORKDIR
./hello.x
We submit the job with the following command:
qsub serial.pbs
Torque/PBS will print out the job ID, e.g.:
The standard output of the executable will be saved in a file whose name has the following form: job_name.ojob_ID. When our job is completed, for example, we'll get serial.o12345:
$ cat serial.o12345
Hello, world!
Running Embarrassingly Parallel Programs
For more details on this topic, see PBS Job Array.
Oftentimes we need to run a lot of instances of the same executable simultaneously, but with different parameters. For example, here is a sample serial program (jobarray_hello.x) that takes an integer argument:
$ ./jobarray_hello.x 23
Hello master, I am slave no. 23 running on hyades.ucsc.edu!
For the sake of illustration, let's assume that we need to run the following instances:
./jobarray_hello.x 101
./jobarray_hello.x 102
...
./jobarray_hello.x 164
Instead of submitting 64 serial jobs, we can submit one job array. Create a PBS script named jobarray.pbs, with the following content:
#!/bin/bash
#PBS -N jobarray
#PBS -l ncpus=1
#PBS -t 101-164
#PBS -l walltime=0:10:00
cd $PBS_O_WORKDIR
./jobarray_hello.x $PBS_ARRAYID
We submit the jobs with the following command:
qsub jobarray.pbs
Each member's standard output will be saved in a file whose name has the following form: job_name.ojob_ID-task_id. When our job array is completed, for example, we'll get jobarray.o12345-101, ..., jobarray.o12345-164.
$ cat jobarray.o12345-103
Hello master, I am slave no. 103 running on astro-3-5.local!
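In practice the parameters are often not consecutive integers. One common approach, sketched here under the assumption that each line of a text file holds the arguments for one task, is to use $PBS_ARRAYID as a line index. The file name params.txt and the function param_for_task are hypothetical.

```shell
#!/bin/bash
# Sketch: translate the array index into a line of a parameter file.
# The first array index (default 101, as in -t 101-164) maps to line 1.
param_for_task() {
    local task_id="$1" file="$2" first_id="${3:-101}"
    sed -n "$(( task_id - first_id + 1 ))p" "$file"
}

# Inside a job array member one might then run (params.txt is hypothetical):
#   ./jobarray_hello.x $(param_for_task "$PBS_ARRAYID" params.txt)

# Small self-contained demonstration with a temporary parameter file.
demo=$(mktemp)
printf '%s\n' 0.1 0.2 0.3 > "$demo"
param_for_task 103 "$demo"    # prints the third line: 0.3
rm -f "$demo"
```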
Running MPI Programs
Assume that we've successfully compiled the sample MPI program mpi_hello.c in subsection Compiling MPI Programs, and we want to run the executable mpi_hello.x on 64 cores. First make sure the executable resides in the Lustre scratch storage (in /pfs/$USER or one of its subdirectories); then create a PBS script named impi.pbs in the same directory, with the following content:
#!/bin/bash
#PBS -N impi
#PBS -l nodes=4:ppn=16
#PBS -l walltime=0:10:00
cd $PBS_O_WORKDIR
mpirun -genv I_MPI_FABRICS shm:ofa -n 64 ./mpi_hello.x
We are now ready to submit the job:
qsub impi.pbs
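The process count given to mpirun has to match the resources requested from Torque. Rather than hard-coding 64, a script can derive it from $PBS_NODEFILE, which Torque fills with one line per allocated core. The function name mpi_ranks below is illustrative, and the snippet is only a sketch of the idea.

```shell
#!/bin/bash
# Count MPI ranks from the Torque node file, which lists one line per
# allocated core (so nodes=4:ppn=16 yields 64 lines).
mpi_ranks() {
    local nodefile="${1:-$PBS_NODEFILE}"
    # grep -c . counts the non-empty lines.
    grep -c . "$nodefile"
}

# In impi.pbs the launch line could then read:
#   mpirun -genv I_MPI_FABRICS shm:ofa -n "$(mpi_ranks)" ./mpi_hello.x

# Outside a batch job PBS_NODEFILE is unset and nothing is printed.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    echo "allocated cores: $(mpi_ranks)"
fi
```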
Running OpenMP Programs
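To run the OpenMP executable omp_hello.x (compiled in subsection Compiling OpenMP Programs) with 16 threads on one node, create a PBS script (named, for example, omp.pbs) with the following content:
#!/bin/bash
#PBS -N omp
#PBS -l nodes=1:ppn=16
#PBS -l walltime=0:10:00
export OMP_NUM_THREADS=16
cd $PBS_O_WORKDIR
./omp_hello.x
We submit the job with:
qsub omp.pbs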
Running Hybrid Programs
To run the MPI/OpenMP hybrid hybrid_hello.x (compiled in subsection Compiling Hybrid Programs) on 64 cores (8 MPI processes and 8 OpenMP threads per MPI process), create a PBS script named hybrid.pbs, with the following content:
#!/bin/bash
#PBS -N hybrid
#PBS -l nodes=4:ppn=16
#PBS -l walltime=0:10:00
cd $PBS_O_WORKDIR
cat $PBS_NODEFILE | sort | uniq > hosts.$PBS_JOBID
export OMP_NUM_THREADS=8
export I_MPI_PIN_DOMAIN=omp
export KMP_AFFINITY=compact
mpirun -machine hosts.$PBS_JOBID -genv I_MPI_FABRICS shm:ofa -n 8 -ppn 2 ./hybrid_hello.x
We submit the job with:
qsub hybrid.pbs
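The numbers in hybrid.pbs (8 MPI processes, 2 per node, 8 threads each) follow from placing one MPI process on each processor of each node. The sketch below recomputes them from the Torque node file, so the same logic carries over to other node counts; hybrid_layout and SOCKETS_PER_NODE are hypothetical names, and the default of 2 processors per node matches the compute nodes described earlier.

```shell
#!/bin/bash
# Derive the hybrid layout (one MPI process per processor/NUMA node) from
# the Torque node file, which lists one line per allocated core.
hybrid_layout() {
    local nodefile="${1:-$PBS_NODEFILE}" sockets="${SOCKETS_PER_NODE:-2}"
    local slots nodes cores_per_node
    slots=$(grep -c . "$nodefile")              # total allocated cores
    nodes=$(sort -u "$nodefile" | grep -c .)    # distinct hosts
    [ "$nodes" -gt 0 ] || return 1
    cores_per_node=$(( slots / nodes ))
    # n: total MPI processes, ppn: processes per node, threads: per process
    echo "n=$(( nodes * sockets )) ppn=${sockets} threads=$(( cores_per_node / sockets ))"
}

# Outside a batch job PBS_NODEFILE is unset and nothing is printed.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    hybrid_layout
fi
```

For the nodes=4:ppn=16 request above this yields n=8, ppn=2 and threads=8, the values passed to mpirun and OMP_NUM_THREADS in hybrid.pbs.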