PACE A Partnership for an Advanced Computing Environment

April 14, 2020

[RESOLVED] Rich Data/Project and Scratch Storage Slow Performance

Filed under: Uncategorized — Aaron Jezghani @ 1:21 pm

[RESOLVED]:
At this time, the issues experienced earlier have been resolved, and all schedulers have been restored to service. The problems addressed during this fix include:

  • A manual swap to the backup InfiniBand Subnet Manager to correct for a partial failure. This issue caused trouble with the GPFS storage volumes, most notably scratch due to its high activity.
  • TruNAS hpctn1 lost access to drives because a SAS cable was jostled when a drive was replaced as part of a CAB “standard change”, of which we were not informed. The cable was reseated to restore connectivity.
  • Additionally, there was a missing license file on unit 1a of TruNAS hpctn1; the license file was updated accordingly.
  • Failed mounts on compute nodes were cleaned up after the aforementioned storage volumes were brought back online.

Please be advised that, because storage access failed, jobs that were running during this outage may have failed. Please inspect the results of recently completed jobs to ensure correctness; if an unexplained failure occurred (e.g., the job was terminated for a wallclock violation when previous iterations ran without issue), please resubmit the job. If you have any questions or concerns, please contact pace-support@oit.gatech.edu.
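For users with many recent jobs, one possible way to triage them is to scan the job output files for messages that tend to accompany storage or wallclock failures. The following is only a minimal sketch, not an official PACE tool. The function name `find_suspect_logs`, the `*.o*` file pattern, and the marker strings are all assumptions chosen for illustration, so adjust them to match the output your jobs actually produce.

```python
from pathlib import Path

# Hypothetical failure markers (assumptions, not an official list): replace
# them with the messages your scheduler and filesystem actually write.
DEFAULT_MARKERS = ("walltime", "Stale file handle", "Input/output error")


def find_suspect_logs(log_dir, markers=DEFAULT_MARKERS, pattern="*.o*"):
    """Return {log path: [matched markers]} for job logs mentioning a marker.

    The "*.o*" pattern assumes output files named like "jobname.o12345";
    change it if your job output files are named differently.
    """
    suspects = {}
    for path in sorted(Path(log_dir).glob(pattern)):
        if not path.is_file():
            continue
        # Case-insensitive substring match; undecodable bytes are replaced.
        text = path.read_text(errors="replace").lower()
        hits = [m for m in markers if m.lower() in text]
        if hits:
            suspects[str(path)] = hits
    return suspects
```

Calling `find_suspect_logs("path/to/job/output")` returns a dictionary of the files that matched and which markers were found in each. A match only flags a job for a closer look, and a clean scan does not guarantee correct results, so results from jobs that ran during the outage should still be checked.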

Thank you.

[UPDATE]:
This morning’s storage problems are still ongoing. At this point, we have paused all schedulers for Rich-based resources. With the exception of Hive and Testflight-Coda, no new jobs will be started until the situation is resolved. Additionally, because access to data and software repositories is slow, jobs that were already running may fail by reaching wallclock limits or with other errors. Updates will continue to be posted on the blog as they become available. We appreciate your patience as we work to resolve the underlying cause. Again, if you have any questions or concerns, please contact pace-support@oit.gatech.edu.

[Original Post]:
We have identified slow performance in the Rich data/project and scratch storage volumes. Jobs utilizing these volumes may experience problems, so please verify results accordingly. We are actively working to resolve the issue.
