PACE A Partnership for an Advanced Computing Environment

August 21, 2023

Phoenix Slurm Scheduler Outage

Filed under: Uncategorized — Jeff Valdez @ 11:17 am

[Update 8/21/23 5:02 PM]

Dear Phoenix Users, 

The Slurm scheduler on Phoenix is back up and available. We have applied the patch that was recommended by SchedMD, the developer of Slurm; cleaned the database; and run tests to confirm that the scheduler is running correctly. We will continue to monitor the scheduler database for any other issues.

Existing jobs that have been queued should have already started or will start soon. You should be able to submit new jobs on the scheduler without issue. We will refund any jobs that failed due to the scheduler outage.

Thank you for your patience today as we repaired the Phoenix cluster. For questions, please contact PACE at pace-support@oit.gatech.edu.

Thank you,

-The PACE Team 

[Update 8/21/23 3:20 PM]

Dear Phoenix Users, 

We have been working with the Slurm scheduler vendor, SchedMD, to identify and fix a corrupted association in the scheduler database and to provide a patch. While we were troubleshooting the scheduler this afternoon, some jobs were scheduled. We are pausing the scheduler again so that the database cleanup can finish without disruption from new jobs.

Based on our estimates, we expect to restore the scheduler later tonight. We will provide an update as soon as the scheduler is released.

Thank you, 

-The PACE Team 

[Update 8/21/23 11:17 AM]

Dear Phoenix Users, 

Unfortunately, the Slurm scheduler controller is down due to issues with Slurm’s database, and jobs cannot be scheduled. We have submitted a high-priority service request to SchedMD, the developer of Slurm, and should be able to provide an update soon.

Jobs that are currently running will likely continue to run, but we recommend reviewing their output, as there may be unexpected errors. Jobs waiting in the queue will stay queued until the scheduler is fixed.
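For reviewing job output, one possible approach is a small script that flags suspicious lines in output files. The sketch below is illustrative only: it assumes your jobs write to Slurm's default `slurm-<jobid>.out` files in the current directory, and the keyword list is a guess that you should adapt to your own application. The function names are hypothetical, not part of any PACE or Slurm tool.

```python
import re
from pathlib import Path

# Assumed keywords; adjust these to match the errors your application reports.
ERROR_PATTERN = re.compile(r"error|fail|cancelled|timeout", re.IGNORECASE)


def find_suspect_lines(text):
    """Return (line_number, line) pairs for lines that match ERROR_PATTERN."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if ERROR_PATTERN.search(line)
    ]


def scan_directory(directory=".", pattern="slurm-*.out"):
    """Map each matching output file name to its suspect lines.

    Files with no suspect lines are left out of the result.
    """
    results = {}
    for path in sorted(Path(directory).glob(pattern)):
        # errors="replace" keeps the scan going if a file has odd bytes.
        hits = find_suspect_lines(path.read_text(errors="replace"))
        if hits:
            results[path.name] = hits
    return results


if __name__ == "__main__":
    for name, hits in scan_directory().items():
        print(name)
        for number, line in hits:
            print(f"  line {number}: {line}")
```

A keyword scan like this only narrows down where to look. A job can fail without printing any of these words, so it is still worth checking that the expected result files exist and are complete.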

The rest of the Phoenix cluster infrastructure outside of the scheduler (e.g., login and storage) should be working. We recommend not running commands that require interaction with Slurm (e.g., scheduler commands such as ‘sbatch’, ‘srun’, and ‘sacct’, or the ‘pace-quota’ command), because they will not work at this time.

We will provide more updates soon as we continue working to fix the scheduler.

Thank you, 

-The PACE Team 
