[UPDATE – 10/01/20 5:51pm]
We are following up to let you know that the Hive scheduler has been restored to operation, and users may submit new jobs. We appreciate your patience as we conducted our investigation and resolved this matter. We are providing a brief summary of our findings and actions taken to address this issue.
What happened and what we did: Yesterday, a user ran an aggressive script that flooded the scheduler with roughly 30,000 job submissions and extremely frequent client queries to both Moab and Torque. This set off a chain reaction in which the scheduler utilities were fully overwhelmed and produced log files that were hundreds of times beyond normal in both size and number. Additionally, system utilities were stressed as they tried to keep up with backups and archival. Once PACE became aware of the issue, we terminated the user’s script and began cleaning up the scheduler environment. Ultimately, we had to forcibly remove some of the egregious job logs associated with the user. Other users’ jobs that had already been submitted to the scheduler before the incident have operated normally; we did not observe abrupt job cancellations or interruptions during this situation. PACE has also followed up with the user, and we are working with them to improve their workflow and prevent similar issues in the future.
What we continue to do: As we blogged this morning at 10:02am, the scheduler is running and accepting jobs. We have observed some residual effects in system utilities, which we have been addressing and monitoring throughout the day. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.
As always, we appreciate your patience as we worked to address this situation.
[UPDATE – 10/01/20 10:02am]
At about 4:30pm yesterday, we began experiencing degraded performance with the Hive scheduler. Currently, the scheduler is under significant load, and some users may notice their new job submissions hanging, as a couple of users have already reported to us. PACE is investigating the issue, and we will post an update once the scheduler is restored to normal operation.
We apologize for the inconvenience this is causing.