PACE: A Partnership for an Advanced Computing Environment

September 30, 2016

Headnode GPFS Problem

Filed under: tech support — Semir Sarajlic @ 3:53 am

At about 8:30pm this evening, one of the PACE systems that serves GPFS files to headnodes and other PACE internal systems failed. When this happens, you may see the message “stale file handle” or notice that there are no files under the /gpfs directory. This is a temporary condition that should be fixed shortly.
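If you would like to check whether a headnode is showing these symptoms, the two conditions described above can be tested with a short script. This is only a minimal sketch and not a PACE-provided tool: the function name check_mount and its return labels are hypothetical, and it assumes that the “stale file handle” message corresponds to the standard ESTALE error code.

```python
import errno
import os


def check_mount(path="/gpfs"):
    """Classify a directory as 'ok', 'empty', 'stale', or 'missing'.

    Hypothetical helper for illustration; the labels are not PACE terms.
    """
    try:
        entries = os.listdir(path)
    except FileNotFoundError:
        # The directory itself does not exist on this machine.
        return "missing"
    except OSError as exc:
        # Assumption: ESTALE is the errno behind "stale file handle".
        if exc.errno == errno.ESTALE:
            return "stale"
        raise
    # A mounted, healthy file system should list at least one entry.
    return "ok" if entries else "empty"
```

Calling check_mount() on a headnode and getting "stale" or "empty" would match the temporary condition described here; "ok" suggests the file system is visible again.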

Please note: All files that were already written, and all files accessed or written by any compute node, are unaffected. However, if you were editing a file on a headnode at the time, your most recent changes may have been lost. In addition, any process you had running on a headnode that was using these files may have been killed by this failure.

To prevent this from recurring, PACE ordered, and very recently received, a new computer to replace the system that failed this evening. Our staff will test and install the replacement as soon as possible, and we will post an announcement here once the new system is in service.

We apologize for this inconvenience and thank those users who let us know quickly.

September 22, 2016

TestFlight cluster available with RHEL6.7

Filed under: tech support — admin @ 5:57 pm

The TestFlight cluster is now available with the updated RHEL6.7 load, as well as some recompilations of software in /usr/local. Please log in to testflight-6.pace.gatech.edu and try submitting your jobs to the ‘testflight’ queue. If you have any problems, please send a note to pace-support@oit.gatech.edu.

September 13, 2016

PACE quarterly maintenance – October 2016

Filed under: tech support — admin @ 9:38 pm

Dear PACE users,

Quarterly maintenance is fast approaching. Starting at 6:00am on Thursday, October 13, all resources managed by PACE will be taken offline. Maintenance will continue through Saturday evening unless we are able to finish sooner.

Our major activity this maintenance period will be an operating system upgrade for all compute nodes, head nodes and interactive nodes. This update will take us from Red Hat 6.5 to Red Hat 6.7, and includes important security and bug fix updates to the operating system, a new InfiniBand layer and some recompiled versions of existing /usr/local software. Some applications have shown increased performance as well.

*** IMPORTANT USER ACTION NEEDED ***

PACE staff have been testing this upgrade using various existing applications, but we need your help to ensure a smooth rollout. As of today, we have begun applying these updates to our TestFlight cluster, which is available for all to use. We’ll send out a follow-up communication once the updated cluster is ready. PLEASE, PLEASE, PLEASE use the next few weeks to try your codes on the TestFlight cluster and send feedback to pace-support@oit.gatech.edu. We would especially like to hear of any issues you may have, but reports of working applications would be helpful as well.

Our goal is to provide the best possible conversion to the updated operating system, and we ask that you please help us ensure a smooth transition back into normal operation by trying out the TestFlight cluster.
