[Update 9/11/24 2:51 PM]
Dear Hive community,
The emergency maintenance on the Coda datacenter has been completed and the Hive cluster has passed our tests. The cluster is back in production and is accepting jobs on BOTH the RHEL7 and RHEL9 environments; all jobs that were held by the scheduler have been released.
[Update 9/11/24 10:52 AM]
Dear Firebird users,
The emergency maintenance on the Coda datacenter has been completed and the Firebird cluster has passed our tests. The cluster is back in production and is accepting jobs on BOTH the RHEL7 and RHEL9 environments; all jobs that have been held by the scheduler have been released.
As a reminder:
RHEL7 Firebird nodes are accessible at the usual address login-<project>.pace.gatech.edu. RHEL9 Firebird nodes can be accessed via ssh at login-<project>-rh9.pace.gatech.edu for testing new software. The majority of our software stack has been rebuilt for the RHEL9 environment. We strongly encourage you to test your software on RHEL9, and please let us know if anything is missing! For more information, please see our Firebird RHEL9 documentation page.
Please take the time to test your software and workflows on the RHEL9 Firebird Environment (accessible via login-<project>-rh9.pace.gatech.edu) and let us know if anything is missing!
The next Maintenance Period will be January 13-16, 2025.
[Update 9/9/24 6:00 PM]
Due to an emergency with a cooling system at the Research Hall, all PACE clusters have been shut down since the morning of Sunday, September 8, 2024. The datacenter provider, Data Bank, has identified an alternate replacement part which has been brought onsite and is in the process of being deployed/tested. At this time, we estimate that Data Bank will have restored cooling to the Research Hall by Tuesday, September 10, 2024, by close of business day. At which point, PACE will begin powering up, testing infrastructure and begin the process to bring services back online. We plan to provide additional updates on the restoration of services by Wednesday, September 11, 2024, evening.
Please visit https://status.gatech.edu for updates.
Access to head nodes and file systems is available.
[Update 9/9/24 9:00 AM]
Due to an emergency with a cooling system at the Research Hall, all PACE clusters have been shut down since the morning of Sunday, September 8, 2024. While a time frame for resolution is currently unknown, we are actively working with the vendor, Data Bank, to resolve the issue and restore service to the data center as soon as possible. We will provide updates as they are available. Please visit https://status.gatech.edu for updates.
Access to login nodes and filesystems (via Globus, OpenOnDemand or direct connection to login nodes) is still available.
[Original Post 9/8/24]
WHAT’S HAPPENING?
Due to an emergency with a cooling system at the Research Hall, all PACE clusters had to be shut down on the morning of Sunday, September 8, 2024.
WHEN IS IT HAPPENING?
Sunday, September 8, 2024, starting at 7.30 AM.EDT.
WHY IS IT HAPPENING?
PACE have been notified by IOC that the temperatures in the CODA building Research Hall are rising due to a failure of a water pump in the cooling system. Emergency shutdown had to be executed in order to protect equipment. The physical infrastructure provider for our datacenter is working on evaluating the situation.
WHO IS AFFECTED?
All PACE Users. Any running jobs on ALL PACE Clusters (Phoenix, Hive, Firebird, ICE, and Buzzard) had to be stopped at 7.30 AM. For Phoenix and Firebird, we will provide refunds for interrupted jobs on paid accounts only by default. Please let us know if this causes a significant loss of funds resulting in inability to continue work on your free-tier Phoenix allocation!
WHAT DO YOU NEED TO DO?
Wait patiently; we will communicate as soon as the clusters are ready to resume work.
WHO SHOULD YOU CONTACT FOR QUESTIONS?
For any questions, please contact PACE at pace-support@oit.gatech.edu.