Update (12/31/2017, 10:15am): We have addressed the issue, and the majority of nodes have started running jobs again. As far as we can tell, this was caused by a network-related “event” internal to the storage system. We are working with the vendor to identify the exact root cause.
Original post: One of the primary storage systems (pace2) went offline today, potentially impacting running jobs that reference that system.
Our automated scripts took PACE nodes offline to prevent new jobs from starting. The nodes will be brought back online once the storage issues are addressed.
The PACE team is currently investigating the problem, and we will keep you updated.
We apologize for any delays caused by limited staff availability during the holidays.