[Update – December 9, 2020 – 11:03am]
Thank you for your continued patience. The scheduler issue was resolved late last night, new jobs from users that were held in queue yesterday have resumed, and we have observed normal operation of the Phoenix scheduler through the night and this morning. The cause of the scheduler issue was the large, sudden influx of user jobs on December 7, as previously reported. The updated timeout parameter will be kept in place to prevent a similar occurrence in the future. Additionally, we are expanding our alert utilities to track additional scheduler metrics, so that we are alerted to similar stressors and can proactively address or mitigate scheduler issues.
The Phoenix cluster is ready for research. Thank you again for your patience as we worked to address this issue in coordination with the vendor.
[Update – December 8, 2020 – 6:31pm]
Thank you for your patience today as we have worked extensively with the vendor to address the scheduler outage on the Phoenix cluster. This is a brief update from today’s joint investigation.
What has PACE done: PACE, along with the vendor, Adaptive Computing, has been conducting an extensive investigation of the scheduler outage. The root cause of the incident is still under investigation; however, we have identified a multi-pronged event that started at 11:39pm on December 7 and was compounded by a rapid influx of nearly 30,000 jobs from users, which led the scheduler to become unresponsive. Given this large influx of jobs, we have increased the timeout setting for the scheduler to allow Moab to process the backlog of submitted jobs. This processing is currently underway.
Who does this message impact: This impacts all users on the Phoenix cluster who have submitted jobs. During this incident, it’s normal for users to see their jobs remain in queue after submission as the scheduler is working through the backlog of job submissions.
What PACE will continue to do: We will continue to monitor the scheduler as it processes the backlog of jobs. This continues to be an active situation, and we will provide updates as further information becomes available.
Thank you again for your patience as we work diligently to address this issue.
The PACE Team
[Original Note – December 8, 2020 – 10:26am]
Dear PACE users,
PACE is investigating a scheduler issue that is impacting the Phoenix cluster. At this time, users are unable to run jobs, and jobs are held in queue.
This is an active situation, and we will follow up with updates as they become available.
Thank you for your attention to this urgent message, and we apologize for this inconvenience.