
[Resolved] Emergency Shutdown of all Compute Nodes, Schedulers, and Login Nodes in Rich Data Center

This entry was posted on Saturday, 27 June, 2020.

[Update – June 28, 2020, 2:42pm]

We are following up with another update. Campus cooling is currently configured to support buildings as well as possible, but it is not in “normal” operation. Facilities has indicated to us that we should be able to resume operation.

According to the most recent news from Atlanta Water, they have isolated the failure in the 36-inch water main and are working on repairs, which may conclude late on Wednesday at the earliest and Friday at the latest.

State of PACE: We have brought compute nodes online along with the remaining services. As is often the case after a shutdown, a few nodes require specific manual action, and we will continue to work on bringing back those straggling nodes. We will contact the users whose jobs were terminated due to yesterday’s emergency shutdown. We encourage all users to verify their recent jobs. Again, our storage system did not lose data.

Monitoring and Risk: OIT Operations staff will continue to monitor the temperature and cooling systems and will alert us to any major change. PACE will remain on standby in case we need to shut down services again because cooling cannot be maintained.

The Coda data center, which includes the TestFlight-Coda and Hive clusters, and our backup data facilities are not affected by this outage.

Thank you again for your patience while we address emergency operations.

[Update – June 27, 2020, 9:36pm]

Water pressure and cooling have been partially restored at the Rich data center. During this emergency shutdown, our storage did not experience data loss. At this time, we have partially restored services to the cluster login nodes, and we continue to work on restoring the gryphon login node. We have restored storage, schedulers, and data mover/Globus services.

For safety, we will keep the compute nodes offline overnight, and we aim to begin restoring the compute nodes on Sunday, June 28, along with any other services.

Thank you for your patience as we work through this incident.

 

[Original Note – June 27, 2020, 4:22pm]

Dear PACE Users,

There has been a water main break on a 36-inch transmission main at Ferst Dr NW and Hemphill Ave NW, causing a loss of water pressure to the campus chiller plants that provide cooling to the Rich and other data centers. GT Facilities is in the process of shutting down the chiller plants. The Operations team is monitoring temperatures in Rich and is starting to deploy spot chillers.

This issue does not impact the Coda data center (hive and testflight-hive clusters).

We are initiating an emergency shutdown of Rich resources to prevent overheating. This will impact running jobs. We will keep storage systems online as long as possible, but we may need to power them down as the situation requires.

Please save your work if possible, and refrain from submitting new jobs. We’ll keep you updated via email and the PACE blog as we continue to monitor developments.
