[Update 06/20/2024 04:58 pm]
Dear Phoenix Users,
Summary: The Phoenix cluster is back online. The scheduler has been unpaused, jobs that were put on hold have resumed, and the file system is ready for use.
Details: All appliance components for Phoenix project storage were restarted, and file system consistency was confirmed. We will continue to monitor project storage and run additional consistency checks over the next few days.
Impact: If you were running jobs on Phoenix that used project storage, please verify that your jobs did not run into any issues. We will issue refunds for all impacted jobs, so please reach out to pace-support@oit.gatech.edu if you encountered any issues.
Thank you for your patience,
-The PACE Team
[Update 06/20/2024 01:36 pm]
Summary: The metadata servers for Phoenix project storage (/storage/coda1) are currently down due to degraded performance.
Details: During additional testing with the storage vendor, as part of the investigation into this morning's performance issues, it was necessary to take the storage fully offline rather than resume service.
Impact: We have paused the scheduler for now, so you will not be able to start jobs on Phoenix. We will release the scheduler once we have verified that project storage is stable. Access to project storage (/storage/coda1) is currently interrupted; however, scratch storage (/storage/scratch1) is not affected. If you were running jobs on Phoenix that used project storage, please verify that your jobs did not run into any issues. We will issue refunds for all impacted jobs as usual.
Only project storage on Phoenix is affected. Storage on Hive, ICE, Buzzard, and Firebird is working without issues.
Thank you for your patience as we work with our storage vendor to resolve this outage. We will continue to provide updates as work continues.
Please contact us at pace-support@oit.gatech.edu with any questions.