PACE: A Partnership for an Advanced Computing Environment

April 30, 2015

GPFS storage troubles

Filed under: tech support — admin @ 8:15 pm

Dear PACE users,

As part of last week’s maintenance activities, we upgraded the GPFS client software on head nodes and compute nodes to a level recommended by the vendor.

Following troubling reports from PACE users, we have determined that the new client software has a subtle bug that causes writes to fail under certain circumstances. So far we have identified two reproducible cases: CMake failing to compile codes, and LAMMPS silently exiting after dumping a single line of text.

We have been in close contact with the vendor for an urgent resolution, and have escalated the incident to the highest executive levels. At this point, we have a couple of paths to resolution: either moving forward to a newer release or reverting to the version we were running before last week. We are quickly evaluating the merits of both approaches. Implementing either will likely involve a rolling reboot of compute and head nodes. We understand the inconvenience a downtime will cause, and are working with the vendor to find a way to address this problem with minimal interruption.

One way to find out whether you are using GPFS is to run the "pace-quota" command and check whether any of the listed paths begin with "gpfs", "pme2" or "pet1". If you are running on GPFS and having unexplained problems with your codes, please contact us and, in the meantime, try to use other storage locations to which you have access (e.g. ~/scratch).
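If you have many directories, the check above can be scripted. The Python sketch below is a hypothetical helper, not a PACE-provided tool: it runs pace-quota and reports any path whose name (ignoring a leading "/") begins with "gpfs", "pme2" or "pet1". It assumes that pace-quota prints paths as whitespace-separated tokens starting with "/" or "~"; the real output format may differ, so reading the pace-quota output yourself remains the authoritative check.

```python
import shutil
import subprocess
import sys

# Path prefixes named in the announcement; a path starting with any of them is on GPFS.
GPFS_PREFIXES = ("gpfs", "pme2", "pet1")


def find_gpfs_paths(quota_output: str) -> list[str]:
    """Hypothetical helper: return path-like tokens that appear to live on GPFS.

    Assumption: pace-quota lists paths as whitespace-separated tokens that
    start with "/" or "~". The actual pace-quota format may differ.
    """
    hits = []
    for token in quota_output.split():
        if not token.startswith(("/", "~")):
            continue  # skip headers, sizes, and other non-path text
        # Drop leading "~" and "/" so "/gpfs/..." counts as beginning with "gpfs".
        bare = token.lstrip("~/")
        if bare.startswith(GPFS_PREFIXES):
            hits.append(token)
    return hits


def main() -> None:
    if shutil.which("pace-quota") is None:
        print("pace-quota not found; run this on a PACE head node.", file=sys.stderr)
        return
    # Capture pace-quota's standard output as text and scan it for GPFS paths.
    result = subprocess.run(["pace-quota"], capture_output=True, text=True, check=False)
    paths = find_gpfs_paths(result.stdout)
    if paths:
        print("Paths that appear to be on GPFS:")
        for path in paths:
            print("  " + path)
    else:
        print("No GPFS paths found in pace-quota output.")


if __name__ == "__main__":
    main()
```

If the helper reports nothing but you still suspect GPFS, check the pace-quota output by hand, since the token-based parsing is only a guess at its layout.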

A more detailed description of this bug and the code we used to replicate it can be found here.

We will continue to keep you updated on the progress.
