[Update 8/7/23 9:34 PM]
Access to Phoenix scratch was disrupted again beginning at 10:19 PM last night (Sunday). We paused the scheduler and restarted the controller around 6 AM this morning (Monday).
Access to Phoenix scratch has been restored, and the scheduler has resumed allowing new jobs to begin. Jobs that failed due to the scratch outage, which began at 10:19 PM Sunday and ended at 9:24 AM Monday, will be refunded. We continue to work with our storage vendor to identify what caused the controller to freeze.
Thank you for your patience as we restored scratch storage access today. Please contact us at pace-support@oit.gatech.edu with any questions.
[Update 8/6/23 2:25 PM]
Access to Phoenix scratch has been restored, and the scheduler has resumed allowing new jobs to begin. Jobs that failed due to the scratch outage, which began at 9:30 PM Saturday, will be refunded. We continue to work with our storage vendor to identify what caused the controller to freeze.
Thank you for your patience as we restored scratch storage access today. Please contact us at pace-support@oit.gatech.edu with any questions.
[Original Post 8/6/23 1:30 PM]
Summary: Phoenix scratch storage is currently unavailable, which may impact access to directories on other Phoenix storage systems. The Phoenix scheduler is paused, so no new jobs can start.
Details: A storage target controller on the Phoenix scratch system became unresponsive just before midnight on Saturday. The Phoenix scheduler crashed shortly before 7 AM Sunday because of the number of failed attempts to reach scratch directories. PACE restarted the scheduler around 1 PM today (Sunday), restoring access to scheduler commands, and paused it to prevent new jobs from starting.
Impact: The network scratch filesystem on Phoenix is inaccessible. Due to the symbolic link to scratch, an ls of Phoenix home directories may also hang. Access via Globus may also time out. Individual directories on the home storage device may be reachable if an ls of the main home directory is not performed. Scheduler commands, such as squeue, were not available this morning but have now been restored. As the scheduler is paused, any new jobs submitted will not start at this time. There is no impact to project storage.
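Because a listing that touches the scratch symbolic link may hang, it can help to probe a path with a time limit before working in it. The following is a minimal sketch, not a PACE-provided tool: the function name path_responds, the 5-second timeout, and the ~/scratch location are all assumptions chosen for illustration. It runs the standard stat command in a child process and gives up if no answer arrives in time.

```python
import os
import subprocess


def path_responds(path: str, timeout_s: float = 5.0) -> bool:
    """Return True if `stat -L` on `path` succeeds within `timeout_s` seconds.

    Hypothetical helper for illustration. The probe runs in a child process
    so that a stalled filesystem blocks the child rather than this script.
    """
    proc = subprocess.Popen(
        ["stat", "-L", "--", path],  # -L follows symbolic links to their target
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        return proc.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Best effort only: a process blocked on storage I/O may not exit
        # until the filesystem responds, so we do not wait for it here.
        proc.kill()
        return False


if __name__ == "__main__":
    # "~/scratch" is an assumed location; substitute your own scratch path.
    scratch = os.path.expanduser("~/scratch")
    if path_responds(scratch):
        print(f"{scratch} responded")
    else:
        print(f"{scratch} is missing or did not respond in time")
```

A False result means only that the path is missing or did not answer within the limit; it does not distinguish between the two cases.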
Thank you for your patience as we investigate this issue. Please contact us at pace-support@oit.gatech.edu with any questions.