[Updated 2023/03/17, 3:30PM ET]
Phoenix project storage is available again, and we have resumed the scheduler, allowing new jobs to begin. Queued jobs will start as resources become available.
The storage issue arose when one metadata server rebooted shortly after 1:00 PM yesterday, and the high-availability configuration automatically switched to the secondary server, which became overloaded. After extensive investigation yesterday evening and today, in collaboration with our storage vendor, we identified and stopped a specific series of jobs that were heavily taxing storage, and we replaced several cables to fully restore Phoenix project storage availability.
Jobs that were running as of 1:00 PM yesterday and that have failed or will fail because of the project storage outage will be refunded to the charge account provided. Please resubmit these failed jobs to Slurm to continue your research.
Thank you for your patience as we repaired project storage. Please contact us with any questions.
[Updated 2023/03/16, 11:55PM ET]
We’re still experiencing significant slowness of the filesystem. We’re going to keep job scheduling paused for tonight, and the PACE team will resume troubleshooting as early as possible in the morning.
[Updated 2023/03/16, 6:50PM ET]
Troubleshooting continues with the vendor’s assistance. The filesystem is currently stable, but one of the metadata servers continues to show an abnormal workload. We are working to resolve this issue to avoid additional filesystem failures.
[Original post 2023/03/16, 2:48PM ET]
Summary: Phoenix project storage is currently unavailable. The scheduler is paused, preventing any additional jobs from starting until the issue is resolved.
Details: A metadata server (MDS) for the Phoenix Lustre parallel filesystem for project storage has encountered errors and rebooted. The PACE team is investigating at this time and working to restore project storage availability.
Impact: Project storage is slow or unreachable at this time. Home and scratch storage are not impacted, and jobs already running in those directories should continue. Jobs that use project storage may not be working. To avoid further job failures, we have paused the scheduler, so no new jobs will start on Phoenix, regardless of the storage used.
Thank you for your patience as we investigate this issue and restore Phoenix storage to full functionality.
For questions, please contact PACE at pace-support@oit.gatech.edu.