PACE: A Partnership for an Advanced Computing Environment

February 6, 2014

Emergency reboot of compute nodes due to power/cooler outage

Filed under: tech support — Semir Sarajlic @ 8:40 pm

The Rich data center cooling system experienced a power outage today (2/6/2014) at around 9:20am, when both the main and backup power systems failed, requiring an emergency shutdown of all PACE compute nodes. We have since received confirmation from the operations team that the room cooling is now stable, though it is running on the backup chillers while work proceeds to correct the problem. We are currently bringing the compute nodes back online as quickly as possible.

If you had queued jobs before the incident, they should start running as soon as a sufficient number of compute nodes are back online. However, all jobs that were running at the time of the failure were killed and need to be resubmitted. You can monitor node status using the ‘pace-stat’ and ‘pace-check-queue’ commands.

We are sorry for the inconvenience this failure has caused. Please contact us if you have any concerns or questions.

