PACE: A Partnership for an Advanced Computing Environment

July 8, 2024

Phoenix project storage outage

Filed under: Uncategorized — Michael Weiner @ 4:40 pm

[Update 7/9/24 12:00 PM]

Phoenix project storage has been repaired, and the scheduler has resumed. All Phoenix services are now functioning.

To improve stability, we have updated a parameter that throttles the number of operations on the metadata servers.

Please contact us at pace-support@oit.gatech.edu if you encounter any remaining issues.

[Original Post 7/8/24 4:40 PM]

Summary: Phoenix project storage is currently inaccessible. We have paused the Phoenix scheduler, so no new jobs will start.

Details: Phoenix Lustre project storage has been slow and intermittently unresponsive throughout the day. The PACE team identified a few user jobs placing a heavy workload on the storage system, but the load remained high on one metadata server, which eventually stopped responding. Our storage vendor recommended failing over to a different metadata server as part of a repair, but the system has been left fully unresponsive. PACE and our storage vendor continue to work on restoring full access to project storage.

Impact: The Phoenix scheduler has been paused to prevent new jobs from hanging, so no new jobs can start. Currently running jobs may not make progress and should be cancelled if they are stuck. Home and scratch directories remain accessible, but running ls on the full home directory may hang because of the symbolic link to project storage.
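
For users who need to see what is in their home directory while project storage is down, the following is a minimal Python sketch that lists entries without resolving symbolic link targets. The function name is hypothetical and not a PACE tool. It also assumes the hang comes from following the link into project storage, which may not hold for every configuration.

```python
import os


def list_without_following_links(directory):
    """Return sorted (name, kind) pairs for the entries in `directory`.

    Symbolic links are reported as links and their targets are never
    resolved, so an unreachable target is not accessed.
    (Hypothetical helper for illustration; not a PACE-provided tool.)
    """
    entries = []
    with os.scandir(directory) as it:
        for entry in it:
            if entry.is_symlink():
                # Inspects the link itself, not the target it points to.
                kind = "symlink"
            elif entry.is_dir(follow_symlinks=False):
                kind = "dir"
            else:
                kind = "file"
            entries.append((entry.name, kind))
    return sorted(entries)
```

Calling `list_without_following_links(os.path.expanduser("~"))` would return the home directory entries, with the project storage link reported as a symlink rather than traversed.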

Thank you for your patience as we work to restore Phoenix project storage. Please contact us at pace-support@oit.gatech.edu with any questions. You may visit https://status.gatech.edu/ for additional updates.
