PACE A Partnership for an Advanced Computing Environment

June 29, 2018

[Resolved] Datacenter cooling problem with potential impact on PACE systems

Filed under: Uncategorized — Semir Sarajlic @ 5:39 pm

Update (06/29/2018, 3:30 pm): We’re happy to report that the issues with the cooling systems have been largely addressed without any visible impact on systems or running jobs. The schedulers have been resumed and are allocating new jobs as they are submitted. More work is needed to resolve the issue fully, but it can be performed without any disruption to services. You may continue to use PACE systems as usual. If you notice any problems, please contact pace-support@oit.gatech.edu.

For a related status update from OIT, please see: https://status.gatech.edu/incidents/0ykh9wwnw50j

Original post:

The operations team notified PACE of cooling problems that started around noon today, impacting the datacenter that houses the storage and virtual machine infrastructure. We immediately began monitoring temperatures, turned off some non-critical systems as a precautionary step, and paused the schedulers to prevent new jobs from starting. Submitted jobs will be held until the problem is sufficiently addressed.

Depending on how this issue develops, we may need to power down critical systems such as storage and Virtual Headnodes, but all critical systems are currently online.

We will continue to provide updates as we have them, here on this blog and on the pace-available email list as needed.

Thank you!