PACE: A Partnership for an Advanced Computing Environment

July 25, 2022

[Resolved] Utility error prevented new jobs starting on Hive, Phoenix, PACE-ICE, and COC-ICE

Filed under: Uncategorized — Michael Weiner @ 6:30 pm

(updated to reflect that Hive was impacted as well)

Summary: An error in a system utility resulted in the Hive, Phoenix, PACE-ICE, and COC-ICE clusters temporarily not launching new jobs. It has been repaired, and jobs have resumed launching.

Details: An unintended update to the system utility that checks the health of compute nodes caused all Hive, Phoenix, PACE-ICE, and COC-ICE compute nodes to be recorded as down shortly before 4:00 PM today, even when nothing was actually wrong with them. The scheduler will not launch new jobs on nodes marked down. Now that the issue has been corrected, all nodes are again reporting their status correctly, and jobs have resumed launching on all four clusters as of 6:30 PM.

Impact: Because all nodes appeared down, no new jobs could launch; submitted jobs instead remained in the queue. Running jobs were not impacted. Interactive jobs waiting to start might have been cancelled, in which case researchers should resubmit them.

Thank you for your patience this afternoon. Please contact us at pace-support@oit.gatech.edu with any questions.
