[Update – 10/07/2020 – 8:02]
Nearly 28 hours after the initial power outage in the Rich Datacenter, which caused further complications and failures in the networks and systems, we are pleased to report that we have restored the PACE resources in the Rich Datacenter and released user jobs. We understand the impact this has had on your research, and we are very grateful for your patience and understanding as we worked through this emergency. The PACE clusters in the Coda datacenter (Hive, Testflight-Coda, CoC-ICE, PACE-ICE, and Phoenix) were not impacted during this outage.
What we have done: Since the network repairs were completed last night, we have closely monitored the network/fabric and gradually brought the infrastructure back up. We conducted application and fabric testing across the systems to ensure they are operational, and we addressed problematic nodes and scheduler issues. The power and fabric are stable. We have identified the users whose jobs were interrupted by yesterday’s power outage, and we will reach out to them directly. We have released the user jobs that were queued before the power outage, when we paused the schedulers, and jobs are currently running.
What we will continue to do: The PACE team will continue to monitor the systems, and we will report as needed. Some straggling nodes will remain offline, and we will work to bring them back up in the coming days.
Please don’t hesitate to contact us at pace-support@oit.gatech.edu if you have any questions or if you encounter any issues on the clusters. Thank you again for your patience.
[Update – 10/06/2020 – 11:20]
We are following up to update you on the current status of the Rich Datacenter. After a tireless evening, the PACE team, in collaboration with OIT, successfully restored the network at approximately 11:00pm. We replaced a failed management module on the core InfiniBand switch, and the switch is now operational. Preliminary spot checks indicate that the fabric is stable. Out of an abundance of caution, we will monitor the network overnight. In the morning, we aim to conduct additional testing and bring the compute resources in the Rich Datacenter online, followed by releasing the user jobs that are currently paused. The power has remained stable since the repairs were conducted, and the UPS is back at nearly full charge.
As always, thank you for your patience and understanding during this outage. We know how critical these resources are to your research.
If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.
[Update – 10/06/2020 – 6:30]
This is a brief update on the current power outage. Power has been restored in the Rich datacenter, and recovery is underway. Some Ethernet network switches have failed, and replacements and reconfigurations are in progress to restore services. Our core InfiniBand switch has not yet restarted. We will continue to update you as we have more information. For up-to-date information, please check the status and blog pages:
Again, this emergency work does not impact any of the resources in the Coda datacenter.
Thank you for your continued patience and understanding as we work through this emergency.
[Original Post – 10/06/2020 – 4:54]
We have a power outage on a section of campus that includes the Rich datacenter’s 133 computer room. We are urgently shutting down the schedulers and remaining servers in Rich 133. Storage and login nodes in Rich are currently on generator power and will remain safe.
What is happening and what we have done: At 3:45pm the campus power distribution system (not GA Power) experienced a power outage, and at 4:05pm the Rich 133 UPS went out. Power to the chillers and to two-thirds of the computer room in the Rich Datacenter is out. Facilities is on site and investigating the situation, and a high-voltage contractor is en route. We have initiated an urgent shutdown of the schedulers and remaining servers in the Rich datacenter’s 133 computer room. Storage and login nodes are running on generators, but most running user jobs will have been interrupted by this power outage.
What we will continue to do: This is an active situation, and we will follow up with updates as they become available. For the most up-to-date information, please check the status and blog pages:
This emergency work does not impact any of the resources in the Coda datacenter.
Thank you for your attention to this urgent message, and we apologize for this inconvenience.
The PACE Team