PACE A Partnership for an Advanced Computing Environment

October 29, 2012

VASP Calculation Errors

Filed under: Uncategorized — Semir Sarajlic @ 12:50 pm

UPDATE: The VASP binaries that generate incorrect results have been DELETED.

One of the versions of VASP installed on all RHEL6 clusters can generate incorrect answers.
The DFT energies calculated are correct, but the forces may not be correct.

The affected vasp binaries are located here:
/usr/local/packages/vasp/5.2.12/mvapich2-1.6/intel-12.0.0.084/bin/vasp
/usr/local/packages/vasp/5.2.12/mvapich2-1.7/intel-12.0.0.084/bin/vasp
/usr/local/packages/vasp/5.2.12/openmpi-1.4.3/intel-12.0.0.084/bin/vasp
/usr/local/packages/vasp/5.2.12/openmpi-1.5.4/intel-12.0.0.084/bin/vasp

All affected binaries were compiled with the intel/12.0.0.084 compiler.

Solution:
Use a different vasp binary – versions compiled with the intel/10.1.018 and intel/11.1.059 compilers have been checked for correctness.
Neither of those compilers generates incorrect answers on the test cases that uncovered the error.

Here is an excerpt from a job script that uses a correct vasp binary:

###########################################################

#PBS -q force-6
#PBS -l walltime=8:00:00

cd $PBS_O_WORKDIR

module load intel/11.1.059 mvapich2/1.6 vasp/5.2.12
which vasp
#This “which vasp” command should print this:
#/usr/local/packages/vasp/5.2.12/mvapich2-1.6/intel-11.1.059/bin/vasp
#If it prints anything other than this, the modules loaded are not as expected, and you are not using the correct vasp.

mpirun -rmk pbs vasp
##########################################################
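To use this, save the full script as, for example, run_vasp.pbs (the filename here is just an example) and submit it with qsub:

qsub run_vasp.pbs

The output of the "which vasp" line will appear in your job's output file, so you can also confirm after the fact that the expected binary was used.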

We now have a test case with known correct results that will be checked every time a new vasp binary is installed.
This step will prevent this particular error from occurring again.
Unless there are strenuous objections, this version of vasp will be deleted from the module that loads it (today) and the binaries will be removed from /usr/local/packages/ (in one week).
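For the curious, the check is conceptually similar to the sketch below. This is only an illustration; the paths, file names, and comparison method are examples, not the actual test harness we use.

###########################################################
# Run the reference test case with the newly installed vasp binary,
# then compare the computed forces against a stored known-good copy.
cd /path/to/vasp-test-case        # example location of the test inputs
mpirun -rmk pbs vasp

# Pull the force block out of OUTCAR ("TOTAL-FORCE" marks it in VASP output).
grep -A 40 "TOTAL-FORCE" OUTCAR > forces.new

# A plain diff demands an exact match; a real check would compare the
# floating-point forces with a small numerical tolerance instead.
if diff -q forces.new forces.reference > /dev/null; then
    echo "PASS: forces match the known-good reference"
else
    echo "FAIL: forces differ from the reference -- do not deploy this binary"
fi
###########################################################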

Thank you, Ambarish, for reporting this issue.

Let us know if you have any questions, concerns, or comments.

October 24, 2012

Regarding the jobs failing around 2am

Filed under: Uncategorized — Semir Sarajlic @ 10:13 pm

We received multiple reports of jobs getting killed around 2:00am. After further investigation, we have found the cause and made the corrections required to prevent this from happening again. Here’s a detailed explanation of what caused the job failures:

Each individual machine in PACE, including workstations in some cases, has an OS software stack that is maintained by a single-sourced service called GTSWD. During maintenance periods we often use the GTSWD service to push out new OS updates, firmware updates, and system service updates.

One of the updates we pushed out during the last maintenance window was a new panfs client, which is responsible for mounting the /scratch filesystem. The process used to update the panfs client came in two stages:

#1 initiate the client installation under the assumption that the node was free (this was a valid assumption at that time because this was being done during the maintenance window).

#2 replace the process that did the client installation with another update process that would first check to see if the change had already been made, but more importantly, would not make the assumption that the node was free.
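To illustrate the difference, stage #2 behaves roughly like the sketch below. This is only a sketch; the package name, version, paths, and busy-node check are examples, not the actual GTSWD implementation.

###########################################################
# Do nothing if the new panfs client is already installed (idempotent check).
TARGET_PKG="panfs-client"              # example package name
TARGET_VER="5.0.0"                     # example version string
if rpm -q "$TARGET_PKG" | grep -q "$TARGET_VER"; then
    exit 0                             # already updated, nothing to do
fi

# Never touch /scratch on a node that is running jobs.
# (Example check: look for active Torque/PBS job files on this node.)
if [ -n "$(ls /var/spool/torque/mom_priv/jobs 2>/dev/null)" ]; then
    exit 0                             # jobs present, defer the update
fi

# Node is free: unmount scratch, install the new client, remount.
umount /scratch
yum -y install "${TARGET_PKG}-${TARGET_VER}"
mount /scratch
###########################################################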

On some of the compute nodes, #2 did not get applied, and so every day at 2am since maintenance day, process #1 has been attempting to unmount and remount /scratch, causing the failure of several user jobs. The reason #2 did not get applied is that the GTSWD service source is now completely overwhelmed by the number of PACE nodes trying to use it, and it had not been able to automatically apply the update to a small subset of our nodes.

For this particular problem, we have manually updated all the RHEL5 nodes that were still using process #1; that will stop any more jobs on RHEL5 queues from getting killed at 2am. For the capacity problem, we are going to add more capacity to the GTSWD service so that it can support all of the PACE nodes. On the RHEL6 nodes, we have updated the instructions on our distribution system so that, once they are updated, they will not attempt to use process #1.

We are sorry for the time this has cost you, and also for not correcting the problem sooner. The failure rate of GTSWD has historically been very low, so it is usually one of the last things we look at when trying to determine the source of a problem.

 

October 17, 2012

NSF’s Major Research Instrumentation Program (MRI)

Filed under: News — Semir Sarajlic @ 7:57 pm

Program Name: Major Research Instrumentation (MRI) Program

Program Website: http://www.nsf.gov/funding/pgm_summ.jsp?pims_id=5260

A copy of this solicitation is available at: http://www.nsf.gov/pubs/2011/nsf11503/nsf11503.htm

Purpose of the Program: the MRI program assists with the acquisition or development of shared research instrumentation that is, in general, too costly and/or not appropriate for support through other NSF programs. Instruments are expected to be operational for regular research use by the end of the award period. For the purposes of the MRI program, proposals must be for either acquisition or development of a single instrument or for equipment that, when combined, serves as an integrated research instrument (physical or virtual).  The MRI program does not support the acquisition or development of a suite of instruments to outfit research laboratories/facilities or to conduct independent research activities simultaneously.  Further guidance on appropriate requests can be found in the MRI Frequently Asked Questions (FAQs) at http://www.nsf.gov/od/oia/programs/mri.

An institution may be either a lead or a sub in up to 3 proposals. Since Georgia Tech anticipates having 3 lead proposals, proposals as subs will only be considered should we fall short of the 3 leads. The deadline for submission to NSF is January 24, 2013, at 5:00 p.m. EST. However, the internal deadline is Friday, November 9, 2012.

Internal Process

If you would like to submit an MRI proposal on behalf of Georgia Tech, please submit a 4-page white paper via T-Square by November 9, 2012.  Submission instructions:

1. Log on to http://www.tsquare.gatech.edu

2. Under the “My Workspace” tab, click on the “Membership” sidebar.

3. Click on “Joinable Sites” and join “MRI Submissions.”

4. Submit your paperwork as one pdf using the Drop Box tool.

5. Direct any questions to Gail Spatt: 404.385.8334 or spatt@gatech.edu

Format

Please use the following format for your white papers:

Pages 1-3

  • ACQUISITION or DEVELOPMENT (clearly indicate)
  • Project Title
  • Principal Investigator(s) and Project Director(s)
  • Project Description to include: research activities enabled; description of the research instrumentation and needs; impact on research and training infrastructure; and management plan.

Page 4

  • Budget and justification
  • Cost-sharing – note that cost-sharing at 30% is required for the MRI program; contributions may be from any non-Federal source, including non-Federal grants/contracts, and may be in cash or in-kind (see Section V. B. Budgetary Information of the Program Solicitation for full details).

As always, this information is also available at this website: http://www.evpr.gatech.edu/faculty-resources-and-funding-opportunities/major-research-instrumentation-program

 

Maintenance Day (October 16, 2012) – complete

Filed under: tech support — admin @ 5:59 am

We have completed our maintenance activities.  Head nodes are online again and queued up jobs are being released.

Our filesystem correction activities on the scratch storage found eight “objects” on the v7 volume to be damaged; these were automatically removed.  Unfortunately, the process provides no indication of which file or directory was problematic.

As always, please follow up with pace-support@oit.gatech.edu about any problems you may see, ideally using the pace-support.sh script discussed here: http://pace.gatech.edu/support.

October 16, 2012

Maintenance Day (October 16, 2012)

Filed under: tech support — Semir Sarajlic @ 12:28 pm

PACE Maintenance Day is underway.
All compute nodes are off, and all login nodes should be inaccessible.

October 11, 2012

campus network maintenance

Filed under: tech support — admin @ 2:03 pm

The Network team will be performing some scheduled maintenance this Saturday morning.  This may impact connectivity from your workstations, laptops, or home, but it should not affect jobs running within PACE.  However, if your job requires access to network services outside of the PACE cluster (e.g. a remote license server), this maintenance may affect your jobs.

For further information please see the maintenance announcement on status.oit.gatech.edu.

October 10, 2012

Check the status of queue(s) using “pace-check-queue”

Filed under: tech support — Semir Sarajlic @ 4:49 pm

Dear PACE Users,

We have a new tool to announce. If you would like to check the status of any PACE queue, you can now run:

pace-check-queue <queuename>

substituting <queuename> with the name of the queue you would like to check. The tool includes a column that tells you whether each node is accepting jobs or not, along with a human-readable explanation (an example invocation appears after the list below). At a glance, it provides the following information:

* Which nodes are included in the queue

* Which nodes accept jobs and which don’t (and if they don’t, why)

* How many cores and how much memory each node has, and what percentage of each is in use

* Overall usage (CPU/Memory) levels for the entire queue.

(This information is refreshed every half an hour)
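For example, to check the force-6 queue (substitute any queue you have access to):

pace-check-queue force-6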

We recently announced another new tool, pace-stat, for checking the status of your queues. These tools complement each other, so feel free to use both. Please report any down or problem nodes that you see in the list to pace-support@oit.gatech.edu.

We hope these new tools provide you with a better HPC environment. Happy computing!

PS: These tools are continuously being developed, therefore your feedback and suggestions for improvements are always welcome!

October 9, 2012

upcoming maintenance day, 10/16 – working on the scratch storage

Filed under: tech support — admin @ 9:49 pm

It’s that time again.  We’ve been working with our scratch storage vendor (Panasas) quite a lot lately, and think we finally have some good news.  Addressing the scratch space will be a major thrust of this quarterly maintenance, and we are cautiously optimistic that we will see improvements.  We will also be applying some VMware tuning to our RHEL5 virtual machines that should increase responsiveness of those head nodes & servers.  Completing upgrades to RHEL6 for a few clusters and a few other minor items round out our activities for the day.

Scratch storage

We have been testing new firmware on our loaner Panasas storage.  Despite our best efforts, we have been unable to replicate our current set of problems after upgrading our loaner equipment to this firmware.  This is good news!  However, simply upgrading is insufficient to fully resolve our issues.  So on maintenance day, we will be performing a number of tasks related to the Panasas.  After the firmware update, we need to perform some basic file integrity checks – the equivalent of a UNIX fsck – on a couple of volumes.  This process requires those volumes to be offline for the duration.  After this, we need to perform reads of every file on the scratch that was created before the firmware upgrade.  Based on our calculations, this will take weeks.  Fortunately, this process can happen in the background, and with the filesystems online and otherwise operating normally.  The net result is that the full impact of our maintenance day improvements to the scratch will not likely be realized for a couple of weeks.  If there are files (particularly large ones) that you no longer need and can delete, this process will go faster.  We will also be upgrading the Panasas client software on all compute nodes to (hopefully) address performance issues.

Finally, we will also be instituting a 20TB per user hard quota in addition to the 10TB per user soft quota currently in place.  Users that exceed the soft quota will receive warning emails, but writes will succeed.  Writes will fail for users that attempt to exceed the hard quota.
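If you are not sure how much scratch space you are using, one quick way to check is a disk-usage summary of your scratch directory. The path below is only an example; substitute the location of your own scratch directory, and note that this walks the whole directory tree, so it can take a while on a large scratch area:

du -sh /scratch/${USER}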

VMware tuning

With some assistance from the Architecture and Infrastructure directorate in OIT, we will be making a number of adjustments to our VMware world, the most significant of which is adjusting the filesystem alignment of our RHEL5 virtual machines.  Users of RHEL5 head nodes are likely to see the most improvement.  We’ll also be installing the VMware tools packages and applying the various tuning parameters enabled by this package.

RHEL6 upgrades

The remaining RHEL5 portions of the clusters below will be upgraded to RHEL6.  After maintenance day, RHEL5 will be unavailable to these clusters.

  • Uranus
  • BioCluster
  • Cygnus

Misc items

  • Configuration updates to redundant network switches serving some project storage
  • Capacity expansion of the ECE file server
  • Serial number updates to a small number of compute nodes lacking serial numbers in the BIOS
  • Interoperability testing of Mellanox Infiniband switches
  • Finish project directory migration of two remaining Optimus users

October 5, 2012

Cygnus FS pc5 online…mostly.

Filed under: tech support — Tags: — Semir Sarajlic @ 8:38 pm

We have been able to bring /nv/pc5 back online, but at a cost to redundancy. One of the network interfaces, cables, or switches is misbehaving, but when we tried disconnecting various combinations of cables, we found one combination that made the filesystem immediately available to all nodes.

Considering how close maintenance day is (10/16/12), spending time isolating the cable/switch/interface problem now only means more time for this filesystem to be offline while equipment gets retested. Waiting until maintenance day will cause the least disruption for Cygnus pc5 users trying to get in a last run of jobs, and it takes some time pressure off of us to make sure we have resolved the issue in its entirety before bringing all resources back online.

Despite the loss of redundancy, functionality is NOT affected. Only in the case of an additional switch or cable failure between now and October 16 will functionality be impacted.

Cygnus File System pc5 offline

Filed under: tech support — Tags: — Semir Sarajlic @ 8:01 pm

It appears that we have an issue with the server housing the /nv/pc5 filesystem, which serves a subset of the Cygnus cluster users. We’re trying to isolate the source of the problem, but we have yet to find a pattern explaining why it is available on some nodes and not on others.
