PACE A Partnership for an Advanced Computing Environment

September 28, 2018

[RESOLVED] Temporary unavailability of home directories

Filed under: Uncategorized — Semir Sarajlic @ 9:29 pm

The storage servers that export PACE home directories experienced a problem at around 9:10am on September 28. We identified and resolved the issue within 20 minutes of the event.

This problem made home directories temporarily unavailable. Symptoms included hanging commands, codes, and login attempts.

We believe most jobs resumed operation after the issue was resolved, but we can’t be sure. Please check your jobs for crashes and report any problems you notice to pace-support@oit.gatech.edu.

September 19, 2018

[RESOLVED] Temporary unavailability of home directories

Filed under: Uncategorized — Semir Sarajlic @ 11:03 pm

At around 6:10pm on September 19, 2018, the storage servers that export PACE home directories and the software repository experienced a problem. We identified and resolved the issue within 15 minutes of the event.

This problem made home directories and applications temporarily unavailable. Symptoms included hanging commands, codes, and login attempts.

We believe most jobs resumed operation after the issue was resolved, but we can’t be sure. Please check your jobs for crashes and report any problems you notice to pace-support@oit.gatech.edu.

September 12, 2018

Testflight queue transition and unavailability

Filed under: Uncategorized — Semir Sarajlic @ 4:47 pm

As you know, the testflight queue includes nodes that are reserved for testing systems and services planned for future deployment.

As part of our preparations to transition to the next OS (RHEL7), we will take this queue offline, swap its nodes with newly purchased nodes (which better represent the modern systems currently in use), and finally deploy RHEL7 on these new nodes.

Once these preparations are complete, we’ll reach out and ask you to test your codes. Until then, testflight will not be available and job submissions to it will be rejected.

There are currently some jobs running on this queue. We’ll wait until the current jobs complete instead of killing them, but we would like to once again emphasize that the use of testflight for production is against policy. This queue should only be used for testing purposes.

Please let us know if you have any questions.

September 6, 2018

[Resolved] File locking issues causing hanging in codes and login troubles

Filed under: Uncategorized — Semir Sarajlic @ 2:38 pm

If you have been observing mysteriously hanging codes, or trouble logging in on headnodes, please read on!

We started receiving reports of hanging processes, mostly for GPU codes. In addition, users whose default shell is tcsh/csh had difficulty logging in to nodes.

Upon further investigation, we found that a storage problem was affecting the file locking mechanism on home directories (where most applications keep their configuration files, regardless of where they run).

This problem was very subtle: it affected only a small number of processes, and data operations otherwise appeared to be working well.

We addressed this issue this morning (9/6, 10am) and you should no longer see hanging codes. Please report any ongoing issues to pace-support@oit.gatech.edu.
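
If you want to check for yourself whether file locking responds in a given directory before reporting a problem, the following is a minimal sketch in Python. It is not an official PACE tool: the function name lock_probe, the probe file prefix, and the 10-second timeout are all assumptions made for illustration. It tries to take an advisory lock in a child process, so the check can give up if the lock request never returns. It only guards the lock call itself; if the whole filesystem is unresponsive, creating the probe file may hang as well.

```python
import os
import subprocess
import sys
import tempfile

# Child script: take, then release, an exclusive advisory lock on the file
# whose path is passed as the first argument.
_CHILD = """
import fcntl, sys
with open(sys.argv[1], "a") as f:
    fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    fcntl.lockf(f, fcntl.LOCK_UN)
"""


def lock_probe(directory, timeout_s=10.0):
    """Try to lock a scratch file in `directory` (hypothetical helper).

    Returns "ok" if the lock was taken and released, "failed" if the
    child exited with an error, and "hung" if it did not finish in time.
    """
    fd, path = tempfile.mkstemp(prefix=".lockprobe-", dir=directory)
    os.close(fd)
    proc = subprocess.Popen(
        [sys.executable, "-c", _CHILD, path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        returncode = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Best effort: a process stuck inside the kernel may not exit.
        # The probe file is left in place because removing it could hang too.
        proc.kill()
        return "hung"
    os.unlink(path)
    return "ok" if returncode == 0 else "failed"


if __name__ == "__main__":
    # Probe the home directory of the current user.
    print(lock_probe(os.path.expanduser("~")))
```

A result of "hung" or "failed" is only a hint, not a diagnosis; please include it along with the node name and time when you write to pace-support@oit.gatech.edu.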
