PACE A Partnership for an Advanced Computing Environment

June 28, 2023

Phoenix Cluster Outage and Fix

Filed under: Uncategorized — Jeff Valdez @ 10:42 am

Summary: The scratch file system became unresponsive yesterday evening (~5:50pm) when some of the network controllers stopped working, causing an outage that may have resulted in difficulties logging into login nodes and writing to scratch.

Details: The file system was recovered this morning after restarting the controllers and all the Lustre components. The Slurm scheduler was also paused to troubleshoot issues with the cluster and has been re-released.

Impact: The file system and scheduler should now be fully functional. Users may have had issues accessing the Phoenix cluster yesterday evening and this morning. Compute jobs that were running during the outage may also have been affected, so we recommend reviewing any jobs that ran during that period.
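One way to review jobs from that period is to query Slurm accounting with `sacct`. The following is a minimal sketch, not an official PACE tool: the window boundaries are assumptions taken from the times mentioned in these posts (about 5:50pm on June 27 through the time of this post on June 28), the helper names are hypothetical, and the choice of which job states to flag is only a starting point.

```python
import getpass
import subprocess

# Assumed outage window, based on the times mentioned in the posts.
# Adjust these if your jobs ran slightly before or after.
OUTAGE_START = "2023-06-27T17:50:00"
OUTAGE_END = "2023-06-28T10:42:00"

# Accounting fields requested from sacct, in output order.
FIELDS = ["JobID", "JobName", "State", "Start", "End", "ExitCode"]


def build_sacct_command(user, start=OUTAGE_START, end=OUTAGE_END):
    """Build an sacct call listing one line per job in the window."""
    return [
        "sacct",
        "-X",              # one line per job allocation, no job steps
        "--parsable2",     # pipe-delimited output, no trailing delimiter
        "--noheader",
        "--user", user,
        "--starttime", start,
        "--endtime", end,
        "--format", ",".join(FIELDS),
    ]


def parse_sacct_output(text):
    """Turn pipe-delimited sacct output into a list of dicts."""
    jobs = []
    for line in text.splitlines():
        if not line.strip():
            continue
        jobs.append(dict(zip(FIELDS, line.split("|"))))
    return jobs


def jobs_to_review(jobs):
    """Return jobs whose state is anything other than COMPLETED."""
    return [job for job in jobs if not job["State"].startswith("COMPLETED")]


def review_my_jobs(user=None):
    """Run sacct on a login node and print jobs that may need attention."""
    user = user or getpass.getuser()
    result = subprocess.run(
        build_sacct_command(user), capture_output=True, text=True, check=True
    )
    for job in jobs_to_review(parse_sacct_output(result.stdout)):
        print(job["JobID"], job["JobName"], job["State"], job["ExitCode"])
```

Calling `review_my_jobs()` on a login node would list your jobs from the window that did not finish in the `COMPLETED` state. A job that completed may still have been slowed or may have written incomplete output to scratch, so it is worth checking the output files of those jobs as well.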

Thank you for your patience. Please contact us at pace-support@oit.gatech.edu with any questions.

June 27, 2023

Phoenix cluster outage

Filed under: Uncategorized — Aaron Jezghani @ 10:57 pm

Summary: The Phoenix cluster is currently inaccessible. The status of running jobs cannot be determined at this time. 

Details: Efforts are under way to identify the extent and root cause of the issue.

Impact: Users are unable to access the Phoenix cluster at this time. It is unknown if ongoing compute jobs are affected.

Thank you for your patience. Please contact us at pace-support@oit.gatech.edu with any questions. We will continue to investigate and will follow up with another status message tomorrow morning.

June 7, 2023

Phoenix filesystem intermittent slowness

Filed under: Uncategorized — Michael Weiner @ 4:00 pm

Summary: The Phoenix filesystem's response has been inconsistent starting today. We are seeing high utilization on all the head-nodes.

Details: File access is intermittently slow on home storage, project storage, and scratch. Any command run on the head-node, such as ‘ls’, can be slow to respond. Slowness in file access was first reported by a couple of users around 3pm yesterday, and we have received more reports this afternoon. The PACE team is actively working to identify the root cause and resolve the issue as soon as possible.

Impact: Users may continue to experience intermittent slowness when using the head-node, submitting jobs, compiling code, using interactive sessions, and reading or writing files.

Thank you for your patience. Please contact us at pace-support@oit.gatech.edu with any questions. We will continue to watch the performance and will follow up with another status message tomorrow morning.

06/08/2023 Update

Phoenix home, project storage, and scratch are all fully functional. Filesystem performance has been normal for the last 12 hours. We will continue to investigate the root cause and monitor performance.

Utilization on our servers has now stabilized. The issue has not impacted any running or queued jobs. Users can resume using Phoenix as usual.

For questions, please contact PACE at pace-support@oit.gatech.edu.
