PACE A Partnership for an Advanced Computing Environment

April 3, 2023

Phoenix Scratch Storage & Scheduler Outages

Filed under: Uncategorized — Michael Weiner @ 2:29 pm

[Update 4/3/23 5:30 PM]

Phoenix’s scratch storage & scheduler are again fully functional.

The scratch storage system was repaired by 3 PM. We rebooted one of the storage servers, with the redundant controllers taking over the load, and brought it back online to restore responsiveness.

The scheduler outage was caused by a number of communication timeouts, later exacerbated by jobs stuck on scratch storage. After processing the backlog, the scheduler began allowing jobs to start around 4:20 PM. We have been monitoring it since then. At this time, due to high utilization, the Phoenix CPU nodes are almost fully occupied.

We will refund any job that failed after 10:30 AM today due to the outage.

Thank you for your patience today as we repaired the Phoenix cluster. For questions, please contact PACE at pace-support@oit.gatech.edu.

[Original Post 4/3/23 2:30 PM]

Summary: Scratch storage is currently inaccessible on Phoenix. In addition, jobs are not able to start. The login nodes experienced high load earlier today, rendering them non-responsive, which was resolved through a reboot.

Details: Phoenix is currently experiencing multiple issues, and the PACE team is investigating. The scratch storage system is inaccessible because the Lustre service has been timing out since approximately 11:30 AM today. The scheduler is also failing to launch jobs, a problem that began by 10:30 AM today. Finally, we experienced high load on all four Phoenix login nodes around 1:00 PM today; rebooting the login nodes restored them. The PACE team is investigating all of these issues today, including any potential root cause.

Impact: Researchers on login nodes may have been disconnected during the reboots required to restore functionality. Scratch storage is unreachable at this time. Home and project storage are not impacted, and already-running jobs that use those directories should continue. Running jobs that use scratch storage may not be working. New jobs are not launching and will remain in the queue.

Thank you for your patience as we investigate these issues and restore Phoenix to full functionality. For questions, please contact PACE at pace-support@oit.gatech.edu.
