[Update 6/3/22 4:55 PM]
After the full restart of scheduler services across Hive this afternoon, we have returned to full production status on the cluster. Thank you for your patience this week as we investigated the issue. Please contact us at pace-support@oit.gatech.edu with any questions.
[Update 6/3/22 2:25 PM]
The PACE team is continuing to investigate the partial disruption of the Hive scheduler. We are currently performing a full restart of all scheduler services across the Hive cluster. While this cluster-wide service restart is in progress this afternoon, it is not possible to submit, start, or check the status of any jobs on Hive. Commands such as qsub, qstat, and showq are unavailable. Running jobs are not impacted.
We appreciate your patience during this process. Please contact us at pace-support@oit.gatech.edu with any questions.
[Original Message 5/31/22 5:30 PM]
Summary: The Hive scheduler is currently in a degraded state, and many waiting jobs will not start.
Details: The Torque resource manager and the Moab workload manager, the two components of the Hive scheduler, are currently reporting conflicting information about the resources allocated to running jobs. As a result, the scheduler attempts to place waiting jobs on resources that are already allocated; these attempts fail, and the jobs do not start. The PACE team is actively investigating this situation and working to resolve it.
Impact: Some queued jobs, especially those requesting a large number of resources, may remain in the queue even though tools such as pace-check-queue show resources as available. Interactive jobs may be cancelled by the scheduler while waiting to start. Running jobs are not impacted.
Please contact us at pace-support@oit.gatech.edu with any questions.