[Update – 10/3/2024 – 12:43am]
The Buzzard cluster has been tested and confirmed functional; all nodes are back in service.
All PACE clusters are back in service, and the impacts of the power outage have been remediated. This outage is over.
[Update – 10/3/2024 – 11:59am]
The Firebird cluster has been fully tested and is available for use. The scheduler has been un-paused and all queued jobs have resumed. Both RHEL7 and RHEL9 environments on Firebird are available for use.
All Firebird nodes are back in service.
Reimbursements will be provided for all paid jobs impacted by the power and cooling outages this week.
[Update – 10/3/2024 – 11:55am]
The Phoenix cluster has been fully tested and is available for use. The scheduler has been un-paused and all queued jobs have resumed. Both RHEL7 and RHEL9 environments on Phoenix are available for use.
PACE continues to investigate 54 nodes, including 19 NVIDIA V100 GPU nodes, that we were unable to power on remotely after the outage.
Reimbursements will be provided for all paid jobs impacted by the power and cooling outages this week. We will provide the details for reimbursement of paid storage to affected users later this week.
We are also doubling the credits for ALL free-tier accounts on Phoenix for the month of October to offset the impacts of these outages. All Georgia Tech free-tier accounts (starting with gts-) will have a balance of $136 for the month of October; all GTRI free-tier accounts (starting with gtris-) will have a balance of $504.
[Update – 10/3/2024 – 9:58am]
The Hive cluster has been fully tested and is available for use. The scheduler has been un-paused and all queued jobs have resumed. Both RHEL7 and RHEL9 environments on Hive are available for use.
PACE continues to investigate 21 CPU nodes, 10 “nvme” nodes, and 4 “himem” nodes on Hive for errors and will return those to service as soon as possible.
The PACE Team is continuing to test the Phoenix, Firebird, and Buzzard clusters, in that order of priority.
[Update – 10/3/2024 – 9:00am]
PACE and the OIT Datacenter teams have brought up the vast majority of the machines that make up the PACE clusters. Roughly 100 of our 2,100 machines remain in a state requiring manual intervention. The PACE team is working to confirm hardware readiness and is beginning to carry out test procedures prior to releasing the clusters. Further updates will be provided as clusters become available for use.
The PACE team is prioritizing the Phoenix and Hive clusters, followed by Firebird and Buzzard. We hope to have the full suite of systems released by mid-afternoon.
[Update – 10/2/2024 – 5:01pm]
The ICE Cluster has been fully powered on, tested, and released for access in order to prioritize educational resources.
PACE and the OIT Datacenter teams are in the process of bringing up the machines that make up the research clusters. Due to the sudden nature of the outage, the usual recovery mechanisms for rapid power-up are not available, which is considerably slowing recovery efforts compared to previous outages. The PACE and OIT Datacenter teams are continuing to check, manually reset, power on, and subsequently test the hundreds of nodes that were left in a bad state by this power outage. Our tests have so far covered slightly over one-fifth of our 2,100 machines. We expect to continue working to bring all machines online through the following day and will provide updates as we’re able to release clusters.
[Initial Post – 10/2/2024 – 12:55pm]
Dear PACE users,
A power outage (related to Georgia Power) impacted Tech Square, including the CODA Datacenter. Due to a secondary failure of the UPS system, all PACE clusters (Phoenix, Hive, Firebird, ICE, and Buzzard) were impacted. Currently, most of the nodes on all clusters are powered off, and the schedulers on all clusters have been paused. The outage started at approximately 11:37am today. At the moment, no new jobs can start, and a large number of jobs that were running when the outage started have been terminated. Access to login nodes and storage remains available due to backup power. We are actively monitoring the situation and will keep you updated on the progress of the restoration of services.
Thank you for your patience,
– The PACE team