Archive for August, 2012

pace-stat

Posted on Friday, 31 August, 2012

In answer to the requests made by many for insight on the status of your queues, we’ve developed a new tool for you called ‘pace-stat’ (/opt/pace/bin/pace-stat).

Running pace-stat displays a summary of all available queues, with the following values for each queue:

– The number of jobs you have running, and the total number of running jobs
– The number of jobs you have queued, and the total number of queued jobs
– The total number of cores that all of your running jobs are using
– The total number of cores that all of your queued jobs are requesting
– The current number of unallocated cores free on the queue
– The approximate amount of memory/core that your running jobs are using
– The approximate amount of memory/core that your queued jobs are requesting
– The approximate amount of memory/core currently free in the queue
– The current percentage of the queue that has been allocated (by all running jobs)
– The total number of nodes in the queue
– The maximum wall-time for the queue

Please use pace-stat to help determine resource availability, and where best to submit jobs.
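As a rough illustration of how the free-core and percent-allocated figures relate, here is a minimal sketch. It assumes the percentage is counted in cores; pace-stat's own calculation may differ, and the function name and numbers are made up for the example.

```shell
#!/bin/sh
# percent_allocated TOTAL_CORES FREE_CORES
# Prints the share of cores in use as a whole-number percentage (truncated).
# Assumption: the percentage is measured in cores; pace-stat itself may
# compute it differently.
percent_allocated() {
    total=$1
    free=$2
    if [ "$total" -le 0 ]; then
        echo "total cores must be positive" >&2
        return 1
    fi
    echo $(( (total - free) * 100 / total ))
}

# Made-up example: a 512-core queue with 128 cores free.
percent_allocated 512 128
```

A queue that shows a high percentage but still has enough free cores for your request may start your job sooner than a lightly used queue with too few cores free.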

[updated] new server for job scheduler

Posted on Tuesday, 28 August, 2012

As of about 3:00 this afternoon, we’re back up on the new server. Things look to be performing much better. Please let us know if you have troubles. Positive reports on scheduler performance would be appreciated as well.

Thanks!

–Neil Bright

——————————————————————-

[update: 2:20pm, 8/30/12]

We’ve run into a last-minute issue with the scheduler migration. Rather than rush things going into a long weekend, we will reschedule for next week, 2:30pm Tuesday afternoon.

——————————————————————-

We have made our preparations to move the job scheduler to new hardware, and plan to do so this Thursday (8/30) afternoon at 2:30pm. We expect this to be a very low-impact, low-risk change. All queued jobs should move to the new server, and all executing jobs should continue to run without interruption. You may notice a period during which you cannot submit new jobs and job queries fail, with the usual ‘timeout’ messages from commands like msub and showq.
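If you submit jobs from scripts, a small retry wrapper can ride out a short window of timeouts. The following is a minimal sketch, not a PACE-provided tool: the `retry` function and the `job.pbs` file name are hypothetical, and it assumes the command exits with a non-zero status when it times out.

```shell
#!/bin/sh
# retry MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_TRIES attempts have failed.
retry() {
    max_tries=$1
    delay=$2
    shift 2
    attempt=1
    while ! "$@"; do
        if [ "$attempt" -ge "$max_tries" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Example (job.pbs is a hypothetical script name):
#   retry 5 60 msub job.pbs
```

A submission that times out on your side may still have been accepted by the scheduler, so it is worth checking showq before retrying by hand to avoid duplicate jobs.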

As usual, please direct any concerns to pace-support@oit.gatech.edu.

–Neil Bright

Scratch Storage and Scheduler Concerns

Posted on Monday, 20 August, 2012

The PACE team is urgently working on two ongoing critical issues with the clusters:

Scratch storage
We are aware of access, speed and reliability issues with the high-speed scratch storage system. We are currently working with the vendor to define and implement a solution to these issues. At this time, we are told there is a new version of the storage system firmware just released today that will likely resolve our issues. The PACE team is expecting the arrival of a test unit where we can verify the vendor’s solution. Once we have verified the vendor’s solution, we are considering an emergency maintenance for the entire cluster in order to implement the solution. We appreciate your feedback on this approach and especially the impact upon your research. We will let you know and work with you on scheduling when a known solution is available.

Scheduler
We are presently preparing a new system to host the scheduler software. We expect the more powerful system will alleviate many of the difficulties you are experiencing with the scheduler, especially delays in job scheduling and time-outs when requesting information or scheduling jobs. Once we have a system ready, we will have to suspend the scheduler for a few minutes while we transition services to the new system. We do not anticipate the loss of any jobs currently running or any that are currently queued with this transition.

In both situations, we will provide you with notice well in advance of any potential interruption and work with you to provide the least impact to your research schedule.

– Paul Manno

[Resolved] Unexpected downtime on compute nodes

Posted on Thursday, 16 August, 2012

[update]   We think we’re back up at this point. If you see odd behavior, please send a support request directly to the PACE team via email to pace-support@oit.gatech.edu.

The issue seems to have been an inadvertent switching off of a circuit breaker by an electrician, and is not expected to recur.

====================

We’ve had a power problem in the data center this afternoon that caused a loss of power to three of our racks.  This has affected some (or all) portions of the following clusters:

– Apurimac
– Prometheus
– Cygnus
– Granulous
– ECE
– Monkeys
– Isabella
– CEE
– Aryabhata
– Optimus
– Atlas
– BioCluster

We’re looking into the cause of the problem, and have already started bringing up compute nodes.

[Resolved] Campus DNS Problems

Posted on Wednesday, 15 August, 2012

Update:  We believe that the DNS issues have been resolved. We have checked that all affected servers are functioning as expected. The scheduler has been unpaused and is now scheduling jobs.

Thank you for your patience.

==================

At this time, the campus DNS server is experiencing problems.

The effect on PACE is that some storage servers and compute nodes cannot be accessed since their DNS names cannot be found. No currently running jobs should be affected. Any job currently executing has already succeeded in accessing all needed storage and compute nodes. The scheduler has been paused so that no new jobs can be started. We are working with the campus DNS administrators to resolve this as quickly as possible.
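To check for yourself whether a particular name can be looked up, something like the following can help. This is a minimal sketch: the `resolves` function and the host name in the example are hypothetical, and it assumes the standard `getent` utility is available on the node.

```shell
#!/bin/sh
# resolves HOSTNAME: succeeds if the system resolver can look up HOSTNAME.
# getent consults the sources configured in /etc/nsswitch.conf, which
# normally include /etc/hosts and DNS.
resolves() {
    getent hosts "$1" >/dev/null 2>&1
}

# Example with a hypothetical server name:
#   if resolves storage-server.example.edu; then
#       echo "name resolves"
#   else
#       echo "name does not resolve"
#   fi
```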

When the issue is resolved, the scheduler will be allowed to execute jobs.

We apologize for any problems this has caused you.

New Software: HDF5 (1.8.9), OBSGRID (April 2, 2010), ABINIT (6.12.3), VMD (1.9.1), and NAMD (2.9)

Posted on Tuesday, 14 August, 2012

Several new software packages have been installed on all RHEL6 clusters.

HDF5

HDF5 is a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data.
A previous version of HDF5 (1.8.7) has existed on the RHEL6 clusters for many months.
The 1.8.9 version includes many bug fixes and some new utilities.

The hdf5/1.8.9 module is used differently than the 1.8.7 module.
The 1.8.9 module is able to detect whether an MPI module has been previously loaded and will support the proper serial or MPI version of the library.
The 1.8.7 module was not able to automatically detect MPI vs. non-MPI.

Here are two examples of how to use the new HDF5 module (note that all compilers and MPI installations are usable with HDF5):

$ module load hdf5/1.8.9

or

$ module load intel/12.1.4 mvapich2/1.6 hdf5/1.8.9
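The same serial-versus-MPI decision can be made in your own build scripts. The sketch below is a shell illustration only, not the logic of the hdf5/1.8.9 modulefile. It assumes that MPI modules are named mvapich2/..., openmpi/... or mpich2/..., and that the HDF5 compiler wrappers h5cc (serial) and h5pcc (parallel) are on your PATH once the module is loaded; the function names and `reader.c` are hypothetical.

```shell
#!/bin/sh
# mpi_loaded: succeeds if an MPI module appears in $LOADEDMODULES, the
# colon-separated list that Environment Modules maintains.
# Assumption: MPI modules are named mvapich2/..., openmpi/... or mpich2/...
mpi_loaded() {
    case ":${LOADEDMODULES:-}:" in
        *:mvapich2/*|*:openmpi/*|*:mpich2/*) return 0 ;;
        *) return 1 ;;
    esac
}

# hdf5_wrapper: prints the HDF5 compiler wrapper to use (h5pcc for
# parallel builds, h5cc for serial ones).
hdf5_wrapper() {
    if mpi_loaded; then
        echo h5pcc
    else
        echo h5cc
    fi
}

# Example (reader.c is a hypothetical source file):
#   "$(hdf5_wrapper)" -o reader reader.c
```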

OBSGRID

OBSGRID is an objective re-analysis package for WRF designed to lower the error of analyses that are used to nudge the model toward the observed state.
The first-guess analyses that OBSGRID takes as input are the analyses output by the METGRID part of the WPS package.
Here is how to use obsgrid:

$ module load intel/12.1.4 hdf5/1.8.7/nompi netcdf/4.1.3 ncl/6.1.0-beta obsgrid/04022010
$ obsgrid.exe

ABINIT

ABINIT is a package whose main program allows one to find the total energy, charge density and electronic structure of systems made of electrons and nuclei (molecules and periodic solids) within Density Functional Theory (DFT), using pseudopotentials and a planewave or wavelet basis.
ABINIT 6.8.1 is already installed on the RHEL6 clusters.
There are many changes from 6.8.1 to 6.12.3. See the 6.12.3 release notes for more information.

Here is an example of how to use ABINIT in a job script:

#PBS ...
#PBS -l walltime=8:00:00
#PBS -l nodes=64:ib

cd $PBS_O_WORKDIR
module load intel/12.1.4 mvapich2/1.6 hdf5/1.8.9 netcdf/4.2 mkl/10.3 fftw/3.3 abinit/6.12.3
mpirun -rmk pbs abinit < abinit.input.file > abinit.output.file

VMD

VMD is a molecular visualization program for displaying, animating, and analyzing large biomolecular systems using 3-D graphics and built-in scripting.
VMD has been installed with support for the GCC compilers (versions 4.4.5, 4.6.2, and 4.7.0), NetCDF, Python+NumPy, TCL, and OpenGL.
Here is an example of how to use it:

  1. Log in to a RHEL6 login node (joe-6, biocluster-6, atlas-6, etc.) with X-Forwarding enabled (X-Forwarding is critical for VMD to work).
  2. Load the needed modules:
    $ module load gcc/4.6.2 python/2.7.2 hdf5/1.8.7/nompi netcdf/4.1.3 vmd/1.9.1
  3. Execute “vmd” to start the GUI
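A quick way to confirm that X-Forwarding is active before starting the GUI is to check the DISPLAY variable, which ssh sets when forwarding is enabled (for example with `ssh -X`). The following is a minimal sketch; the `have_display` and `launch_gui` function names are hypothetical.

```shell
#!/bin/sh
# have_display: succeeds if $DISPLAY is set and non-empty, which is what
# ssh arranges when X forwarding (ssh -X or ssh -Y) is active.
have_display() {
    [ -n "${DISPLAY:-}" ]
}

# launch_gui COMMAND [ARGS...]: runs COMMAND only when a display is available.
launch_gui() {
    if have_display; then
        "$@"
    else
        echo "DISPLAY is not set; reconnect with 'ssh -X' and try again" >&2
        return 1
    fi
}

# Example: launch_gui vmd
```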

NAMD

NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems.
Version 2.9 of NAMD has been installed with support for GNU and Intel compilers, MPI, and FFTW3.
CUDA support in NAMD has been disabled.

Here is an example of how to use it in a job script in a RHEL6 queue (biocluster-6, atlas-6, ece, etc.):

#PBS -N NAMD-test
#PBS -l nodes=32
#PBS -l walltime=8:00:00
...
module load gcc/4.6.2 mvapich2/1.7 fftw/3.3 namd/2.9
cd $PBS_O_WORKDIR

mpirun -rmk pbs namd2 input.file

Call for Proposals for Allocations on the Blue Waters High Performance Computing System

Posted on Thursday, 9 August, 2012

FYI – for anybody interested in applying for time on the petaflop Cray being installed at NCSA.

Begin forwarded message:

From: “Gary Crane” <gcrane@sura.org>
To: ITCOMM@sura.org
Sent: Thursday, August 9, 2012 10:51:37 AM
Subject: Call for Proposals for Allocations on the Blue Waters High Performance Computing System
The Great Lakes Consortium for Petascale Computation (GLCPC) has issued a call for proposals for allocations on the Blue Waters system. Principal investigators affiliated with a member of the Great Lakes Consortium for Petascale Computation are eligible to submit a GLCPC allocations proposal. SURA is a member of the GLCPC and PIs from SURA member schools are eligible to submit proposals. Proposals are due October 31, 2012.

The full CFP can be found here: http://www.greatlakesconsortium.org/bluewaters.html

–gary

Gary Crane
Director, SURA IT Initiatives
phone: 315-597-1459
fax: 315-597-1459
cell: 202-577-1272