PACE: A Partnership for an Advanced Computing Environment

October 13, 2023

Phoenix Storage and Scheduler Outage

Filed under: Uncategorized — Jeff Valdez @ 9:48 am

[Update 10/13/2023 10:35am] 

Dear Phoenix Users,  

The Lustre scratch storage and the Slurm scheduler on Phoenix went down late yesterday evening (starting around 11pm) and are now back up and available. We have run tests to confirm that both the Lustre storage and the Slurm scheduler are running correctly, and we will continue to monitor them for any further issues.

Preliminary analysis by the storage vendor indicates that the outage was caused by a kernel bug we had believed was previously addressed. As an immediate fix, we have disabled the features of the Lustre storage appliance that trigger the bug, which should avoid another outage; a long-term patch is planned for our upcoming Maintenance Period (October 24-26).

Existing jobs that were queued have already started or will start soon, and you should be able to submit new jobs to the scheduler without issue. Again, we strongly recommend reviewing the output of any jobs that use the Lustre storage (project and scratch directories), as there may be unexpected errors; a sketch of how to check is below. We will refund any jobs that failed due to the outage.
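If you would like to identify jobs that may have been affected, here is a minimal sketch using standard Slurm accounting commands. The time window is taken from this post (roughly 11pm yesterday to 10:35am today), and the listed job states are standard Slurm states; adjust both for your own situation.

    # List your jobs that failed or were lost during the outage window
    sacct -u $USER --starttime=2023-10-12T23:00 --endtime=2023-10-13T10:35 \
          --state=FAILED,NODE_FAIL,CANCELLED \
          --format=JobID,JobName,State,ExitCode,Elapsed

    # Confirm your queued and newly submitted jobs are being scheduled
    squeue -u $USER

Jobs reported as FAILED or NODE_FAIL in this window are the most likely candidates for resubmission and a refund.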

We apologize for the inconvenience. Thank you for your patience today as we repaired the Phoenix cluster. For questions, please contact PACE at pace-support@oit.gatech.edu.

Thank you, 

-The PACE Team 

[Update 10/13/2023 9:48am] 

Dear Phoenix Users, 

Unfortunately, the Lustre storage on Phoenix became unresponsive late yesterday evening (starting around 11pm). As a result of the Lustre storage outage, the Slurm scheduler was also impacted and became unresponsive.

We have restarted the Lustre storage appliance, and the file system is now available. The vendor is currently running tests to make sure the Lustre storage is healthy. We will be running checks on the Slurm scheduler as well.

Jobs that are currently running will likely continue, but we strongly recommend reviewing the output of any jobs that use the Lustre storage (project and scratch directories), as there may be unexpected errors; one way to check is shown below. Jobs waiting in the queue will remain queued until the scheduler resumes.
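One quick way to spot affected jobs is to scan your job output files for the I/O error messages that Lustre clients report. The example below assumes the default slurm-<jobid>.out output naming; adjust the pattern if you set a custom --output path in your job script.

    # Flag job output files that recorded I/O errors during the outage
    grep -lE 'Input/output error|Stale file handle' slurm-*.out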

We will continue to provide updates as we complete testing of the Lustre storage and Slurm scheduler.

Thank you, 

-The PACE Team  
