PACE: A Partnership for an Advanced Computing Environment

August 26, 2023

All PACE Clusters Down Due to Cooling Failure

Filed under: Uncategorized — Michael Weiner @ 9:40 pm

[Update 8/27/23 11:10 AM]

All PACE clusters have returned to service.

The datacenter cooling pump was replaced early this morning. After powering on compute nodes and testing, PACE resumed jobs on all clusters. On clusters that charge for use (Phoenix and Firebird), jobs that were cancelled yesterday evening when compute nodes were turned off will be refunded. Please submit new jobs to resume your work.

Thank you for your patience during this emergency repair.

[Original Post 8/26/23 9:40 PM]

Summary: A pump in the Coda Datacenter cooling system overheated on Saturday evening. All PACE compute nodes across all clusters (Phoenix, Hive, Firebird, Buzzard, and ICE) have been shut down until cooling is restored, stopping all compute jobs.

Details: Databank is reporting an issue with the high-temperature condenser pump in the Research Hall of the Coda Datacenter, which hosts PACE compute nodes. The Research Hall is being powered off so that Databank facilities can replace the pump.

Impact: All PACE compute nodes are unavailable. Running jobs have been cancelled, and no new jobs can start. Login nodes and storage systems remain available. Compute nodes will remain off until the cooling system is repaired.
