PACE A Partnership for an Advanced Computing Environment

February 19, 2020

[Resolved] Rich InfiniBand Switch Power Failure

Filed under: Uncategorized — Aaron Jezghani @ 3:10 pm

This morning, we discovered a power failure in an InfiniBand switch in the Rich Datacenter that caused GPFS mount failures on a number of compute resources. Power was restored at 9:10 am, and connectivity across the switch has been confirmed. However, jobs that were running before the fix may have experienced problems, including failing to produce results or exiting with an error, due to GPFS access time-outs. Please review the status of any recently run jobs by checking their output and error logs or, for jobs that are still running, the timestamps of their output files for any discrepancies. If an issue appears (e.g., previously successful code exceeded its wallclock limit with no output, or file creation occurred much later than the start of the job), please resubmit the job.
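For anyone who wants to automate the timestamp check, here is a minimal Python sketch. It is not a PACE-provided tool: the function name `find_late_outputs`, the one-hour default threshold, and the idea of passing in the job's start time by hand are all assumptions made for illustration. It uses each file's last-modification time as a stand-in for creation time, so a file that a running job is still writing to will also be flagged; treat the result as a coarse screen only.

```python
from datetime import datetime, timedelta
from pathlib import Path


def find_late_outputs(job_dir, job_start, max_delay=timedelta(hours=1)):
    """Return (path, delay) pairs for files last modified long after job_start.

    Hypothetical helper: job_dir is the job's output directory, job_start is a
    naive local datetime for when the job began, and max_delay is an arbitrary
    threshold chosen for illustration.
    """
    late = []
    for path in sorted(Path(job_dir).rglob("*")):
        if not path.is_file():
            continue
        # st_mtime is the last-modification time, used here as a proxy
        # because creation time is not reliably available on Linux.
        modified = datetime.fromtimestamp(path.stat().st_mtime)
        delay = modified - job_start
        if delay > max_delay:
            late.append((path, delay))
    return late
```

Calling `find_late_outputs(output_dir, start_time)` with the job's output directory and its start time (taken from the scheduler's record of the job) returns the files worth a closer look. An empty list does not prove the job was unaffected, so the output and error logs should still be checked.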

As always, if you have any questions or concerns, please contact pace-support@oit.gatech.edu.
