WHAT’S HAPPENING?
All PACE clusters must be shut down next week so that repairs can be made in the datacenter.
The repair and the cluster restart will take up to one day to complete, require shutting down all nodes in the research hall, and must be done in the next few days.
This shutdown will NOT affect Globus access, login-node access, or access to any storage locations.
WHEN IS IT HAPPENING?
Tuesday, September 3rd, 2024, starting at 4 PM EDT. Compute nodes are expected to return to availability on the afternoon of Wednesday, September 4th.
WHY IS IT HAPPENING?
Databank, the physical infrastructure provider for our datacenter, detected an issue over the weekend in which multiple cooling doors reported high temperature alerts. They traced the issue to a high temp chiller sensor. The sensor was temporarily bypassed to stop the repeated alerts and needs to be replaced to avoid additional issues.
This outage is necessary to prevent widespread catastrophic failure of the servers in the research hall.
WHO IS AFFECTED?
All PACE users. Any jobs running on any PACE cluster (Phoenix, Hive, Firebird, ICE, and Buzzard) will be stopped at 4 PM EDT on September 3rd, 2024. For Phoenix and Firebird, by default we will provide refunds for interrupted jobs on paid accounts only. Please let us know if the interruption causes a loss of funds significant enough to prevent you from continuing work on your free-tier Phoenix allocation.
WHAT DO YOU NEED TO DO?
No action is required. Please wait patiently; we will notify you as soon as the clusters are ready to resume work.
WHO SHOULD YOU CONTACT FOR QUESTIONS?
For any questions, please contact PACE at pace-support@oit.gatech.edu.