We experienced a failure in the primary InfiniBand subnet manager (SM) that may have impacted both running and starting jobs. The primary malfunctioned in a way that the backup SM did not detect as a failure. We disabled the primary SM, and the secondary SM then took over as designed. The service outage lasted from 12:56 PM to 1:07 PM today, October 15, 2018. PACE staff will continue to investigate this failure mode and adjust our procedures to help prevent it in the future. Because this brief network interruption may have impacted running and starting jobs, please check whether any of your jobs crashed and report any problems you notice to pace-support@oit.gatech.edu.
October 15, 2018