PACE: A Partnership for an Advanced Computing Environment

October 1, 2024

DATA CENTER CHILLER FAILURE – 10/1/2024

Filed under: Uncategorized — Eric Coulter @ 9:18 am

[Update 10/1/24 02:06 PM]

Cooling was restored to the data center this morning, and PACE has tested all nodes that were powered off. All working nodes on Phoenix and Hive have been returned to service. We continue to investigate a small number of nodes with issues, including 7 of the cpu-amd nodes on RHEL9, but all clusters are otherwise fully operational.
 
Thank you for your patience during this partial outage!

[Update 10/1/24 09:18 AM]

Our data center hosting provider, DataBank, identified a cooling failure this morning around 8:42 AM. Because temperatures were rising to dangerous levels, we have initiated a partial shutdown. The Phoenix and Hive schedulers have been paused, and all idle compute nodes on Phoenix and Hive have been powered off. Running jobs are not currently impacted. We are continuing to monitor the situation to determine whether additional measures are needed. ICE, Firebird, and Buzzard remain in production at this time.

Access to login nodes and all storage systems remains available. Files can be accessed or retrieved via Globus, the OnDemand web interface, or the login nodes.  

We are working closely with the vendor to restore functionality and will continue to provide updates as the situation evolves.

For any questions, please contact PACE at pace-support@oit.gatech.edu.  
