PACE A Partnership for an Advanced Computing Environment

June 29, 2018

[Resolved] Datacenter cooling problem with potential impact on PACE systems

Filed under: Uncategorized — Semir Sarajlic @ 5:39 pm

Update (06/29/2018, 3:30 pm): We’re happy to report that the cooling system issues have been largely addressed without any visible impact on systems or running jobs. The schedulers have been resumed and are allocating new jobs as they are submitted. More work is needed to resolve the issue fully, but it can be performed without any disruption to services. You may continue to use PACE systems as usual. If you notice any problems, please contact pace-support@oit.gatech.edu.

For a related status update from OIT, please see: https://status.gatech.edu/incidents/0ykh9wwnw50j

Original post:

The operations team notified PACE of cooling problems that started around noon today, impacting the datacenter housing the storage and virtual machine infrastructure. We immediately started monitoring the temperatures and turning off some non-critical systems as a precautionary step, and paused schedulers to prevent new jobs from running. Submitted jobs will be held until the problem is sufficiently addressed.

Depending on how this issue develops, we may need to power down critical systems such as storage and Virtual Headnodes, but all critical systems are currently online.

We will continue to provide updates as we have them, here on this blog and on the pace-available email list as needed.

Thank you!

June 4, 2018

Possible Water Service Outage May Impact PACE Clusters

Filed under: Uncategorized — Semir Sarajlic @ 10:02 pm
You probably saw the announcement from the Georgia Tech Office of Emergency Management (copied below). Our knowledge of the matter is limited to this message, but as far as we understand, a complete outage is unlikely, though still possible.

Impact on PACE Clusters:

In the event of a large-scale outage, PACE datacenter cooling systems will stop working, and as an emergency step we will need to urgently shut down all systems, including but not limited to compute nodes, login nodes, and storage systems. This will impact all running jobs and active sessions.
We’ll continue to keep you updated. Please check this blog for the most up-to-date information.
Thanks!

—————————————–

Original communication from Georgia Tech Office of Emergency Management:

To the campus community:

Out of an abundance of caution, Georgia Tech Emergency Management and Communications has taken steps to prepare the campus for the possibility of a water outage tonight in light of needed repairs to the City of Atlanta’s water lines.

The City of Atlanta’s Department of Watershed will repair a major water line beginning tonight between 11 p.m. and midnight. The repair is scheduled to be completed this week and should not negatively impact campus. If all goes according to plan, the campus will operate as usual.

In the event the repairs cause a significant loss of water pressure or loss of water service completely, the campus will be closed and personnel will be notified through the Georgia Tech Emergency Notifications System (GTENS).

If GTENS alerts are sent, essential personnel who are pre-identified by department leadership should report even if campus is closed. If the campus loses water, all non-essential activities will be canceled on campus.

Those with specialized research areas need to make arrangements tonight in the event there is a water failure. All lab work and experiments that can be delayed should be planned for later in the week or next week.

In the event of an outage, employees are asked to work with department leadership to work remotely. Employees who can work remotely should prepare before leaving work June 4 to work remotely for several days. Toilets won’t be operational, drinking water will not be available, and air conditioning will not be functioning in buildings on campus and throughout the city.

All who are housed on campus should fill bathtubs and other containers to have water on hand to manually flush toilets should there be a loss in pressure. Plans are underway to relocate campus residents to nearby campuses such as Emory University or Kennesaw State University in the event of a complete loss of water to the campus.

Parking and Transportation Services will continue on-campus transportation as long as the campus is open.

In the event of an outage, additional instructions and information on campus operations will be shared at gatech.edu.
