[Update 7/21/2023 3:30 PM]
Dear Phoenix Users,
The Lustre project storage filesystem on Phoenix is back up and available. We have completed cable replacements, reseated and replaced a couple of hard drives, and restarted the controller. We have run tests to confirm that the storage is functioning correctly. Performance may remain degraded while redundant drives rebuild, but it is better than it has been over the last few days.
Phoenix’s login nodes, which were unresponsive earlier this morning, are available again. We will continue to monitor them for any further issues.
The scheduler has resumed, and you should be able to start jobs without issue. We will refund any job that failed after 8:00 AM this morning due to the outage.
Thank you for your patience today as we repaired the Phoenix cluster. For questions, please contact PACE at pace-support@oit.gatech.edu.
[Original Post 7/21/2023 9:46 AM]
Summary: The Lustre project storage filesystem on Phoenix became unresponsive this morning. Researchers may be unable to access data in their project storage. Multiple Phoenix login nodes have also become unresponsive, which may prevent logins. We have paused the scheduler, preventing new jobs from starting, while we investigate.
Details: The PACE team is currently investigating an outage on the Lustre project storage filesystem for Phoenix. The cause is not yet known, but PACE is working with the vendor to find a resolution.
Impact: The project storage filesystem may be unreachable at this time, so read, write, or ls attempts on project storage may fail, including via Globus. This may also prevent logins. Job scheduling is paused: jobs can still be submitted, but no new jobs will start. Jobs that were already running will continue, though those that access project storage may not progress.
Thank you for your patience as we continue investigating this issue. Please contact us at pace-support@oit.gatech.edu with any questions.