PACE: A Partnership for an Advanced Computing Environment

July 13, 2023

Phoenix Project Storage & Login Node Outage

Filed under: Uncategorized — Michael Weiner @ 10:19 am

[Update 7/18/2023 4:00 PM]

Summary: Phoenix project storage performance is degraded as redundant drives rebuild. The process may continue for several more days. Scratch storage is not impacted, so tasks may proceed more quickly if run on the scratch filesystem.

Details: During and after the storage outage last week, several redundant drives on the Phoenix project storage filesystem failed. The system is rebuilding the redundant array across additional disks, which is expected to take several more days. Researchers may wish to copy necessary files to their scratch directories or to local disk and run jobs from there for faster performance. In addition, we continue working with our storage vendor to identify the cause of last week’s outage.
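As a minimal sketch of the staging step suggested above, the following Python function copies one directory into a scratch location so that jobs can read from the copy. The function name `stage_to_scratch` and any paths passed to it are illustrative assumptions, not PACE-provided tools or actual Phoenix directory names; substitute your own project and scratch locations.

```python
import shutil
from pathlib import Path


def stage_to_scratch(src, scratch_root):
    """Copy the directory `src` into `scratch_root` and return the copy's path.

    Both arguments are hypothetical placeholders for your own project
    and scratch directories.
    """
    src = Path(src)
    dest = Path(scratch_root) / src.name
    # dirs_exist_ok=True (Python 3.8+) lets a repeated run refresh an
    # existing copy instead of raising FileExistsError.
    shutil.copytree(src, dest, dirs_exist_ok=True)
    return dest
```

For example, `stage_to_scratch("/path/to/project/inputs", "/path/to/scratch")` would create `/path/to/scratch/inputs`. Results written to scratch should be copied back to project storage once the job finishes, if they need to be kept there.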

Impact: Phoenix project storage performance is degraded for both reads and writes, and the degradation may continue for several days. Home and scratch storage are not impacted. All data on project storage is accessible.

Thank you for your patience as the process continues. Please contact us at pace-support@oit.gatech.edu with any questions.

[Update 7/13/2023 2:57 PM]

Phoenix’s login nodes, which were unresponsive earlier this morning, have been rebooted and are available again. We will continue to monitor them for any other issues.

Regarding the failed redundant drives, we have replaced the controller cables and reseated a few hard drives. We have run tests to confirm that the storage is running correctly.

You should be able to start jobs on the scheduler without issue. We will refund any job that failed after 8:00 AM due to the outage.

[Update 7/13/2023 12:20 PM]

Failed redundant drives caused an object storage target to become unreachable. We are working to replace controller cables to restore access.

[Original Post 7/13/2023 10:20 AM]

Summary: The Phoenix project storage filesystem became unresponsive this morning. Researchers may be unable to access data in their project storage. We have paused the scheduler, preventing new jobs from starting, while we investigate. Multiple Phoenix login nodes have also become unresponsive, which may have prevented logins.

Details: The PACE team is currently investigating an outage on the Lustre project storage filesystem for Phoenix. The cause is not yet known. We have also rebooted several Phoenix login nodes that had become unresponsive to restore ssh access.

Impact: The project storage filesystem may not be reachable at this time, so read, write, or ls attempts on project storage may fail, including via Globus. Job scheduling is now paused, so jobs can be submitted, but no new jobs will start. Jobs that were already running will continue, though those that read from or write to project storage may not progress. Some login attempts this morning may have hung.

Thank you for your patience as we continue investigating this issue. Please contact us at pace-support@oit.gatech.edu with any questions.
