PACE A Partnership for an Advanced Computing Environment

September 18, 2024

PACE Phoenix Storage Hotfix – Sept 24th, 2024

Filed under: Uncategorized — Eric Coulter @ 3:30 pm

WHAT’S HAPPENING? 

Due to a recent instance of degraded performance in our Project storage system (coda1), we will work with our storage vendor to apply updates to the underlying device on Tuesday, September 24th. We do not expect this to cause an outage, but some operations may be slower during the patch deployment. Because the risk of an outage is not zero, we will work closely with the vendor during this operation and monitor performance throughout. Please let us know if you observe any impact to your work during that time, and we will refund jobs accordingly.

WHEN IS IT HAPPENING? 
The update process will begin on Tuesday morning, Sept 24th, 2024. 
We will send an announcement when the update is complete. 

WHY IS IT HAPPENING? 

The device vendor has recommended patches to the storage devices underlying Phoenix Project storage (coda1) to improve reliability and performance, following recently observed degraded performance of the metadata servers on our Lustre filesystem.

WHO IS AFFECTED? 

Phoenix users *may* experience slower performance of Phoenix Project storage during the update, and there is a low risk of outage. 

WHAT DO YOU NEED TO DO? 

Please let us know if you observe any impact to work that uses the Phoenix Project filesystem (coda1) during that time, and we will refund jobs accordingly.

WHO SHOULD YOU CONTACT FOR QUESTIONS? 

For any questions, please contact PACE at pace-support@oit.gatech.edu.

September 8, 2024

PACE-Wide Emergency Shutdown – September 8, 2024

Filed under: Uncategorized — Grigori Yourganov @ 9:11 pm

[Update 9/11/24 2:51 PM]

Dear Hive community, 

The emergency maintenance on the Coda datacenter has been completed and the Hive cluster has passed our tests. The cluster is back in production and is accepting jobs on BOTH the RHEL7 and RHEL9 environments; all jobs that were held by the scheduler have been released. 

[Update 9/11/24 10:52 AM]

Dear Firebird users,

The emergency maintenance on the Coda datacenter has been completed and the Firebird cluster has passed our tests. The cluster is back in production and is accepting jobs on BOTH the RHEL7 and RHEL9 environments; all jobs that have been held by the scheduler have been released.

As a reminder:

RHEL7 Firebird nodes are accessible at the usual address login-<project>.pace.gatech.edu. RHEL9 Firebird nodes can be accessed via ssh at login-<project>-rh9.pace.gatech.edu for testing new software. The majority of our software stack has been rebuilt for the RHEL9 environment. We strongly encourage you to take the time to test your software and workflows on RHEL9, and please let us know if anything is missing! For more information, please see our Firebird RHEL9 documentation page.

The next Maintenance Period will be January 13-16, 2025.

[Update 9/9/24 6:00 PM]

Due to an emergency with a cooling system at the Research Hall, all PACE clusters have been shut down since the morning of Sunday, September 8, 2024. The datacenter provider, Data Bank, has identified an alternate replacement part, which has been brought onsite and is being deployed and tested. At this time, we estimate that Data Bank will have restored cooling to the Research Hall by close of business on Tuesday, September 10, 2024. At that point, PACE will begin powering up and testing infrastructure and start bringing services back online. We plan to provide additional updates on the restoration of services by the evening of Wednesday, September 11, 2024.

Please visit https://status.gatech.edu for updates.

Access to head nodes and file systems is available.

[Update 9/9/24 9:00 AM]

Due to an emergency with a cooling system at the Research Hall, all PACE clusters have been shut down since the morning of Sunday, September 8, 2024. While a time frame for resolution is currently unknown, we are actively working with the vendor, Data Bank, to resolve the issue and restore service to the data center as soon as possible. We will provide updates as they are available. Please visit https://status.gatech.edu for updates. 

Access to login nodes and filesystems (via Globus, Open OnDemand, or direct connection to login nodes) is still available.

[Original Post 9/8/24]

WHAT’S HAPPENING?  

Due to an emergency with a cooling system at the Research Hall, all PACE clusters had to be shut down on the morning of Sunday, September 8, 2024. 

WHEN IS IT HAPPENING?  

Sunday, September 8, 2024, starting at 7:30 AM EDT.

WHY IS IT HAPPENING?  

PACE was notified by IOC that temperatures in the Coda building Research Hall were rising due to the failure of a water pump in the cooling system. An emergency shutdown had to be executed to protect equipment. The physical infrastructure provider for our datacenter is evaluating the situation.

WHO IS AFFECTED?  

All PACE users. Any running jobs on ALL PACE clusters (Phoenix, Hive, Firebird, ICE, and Buzzard) had to be stopped at 7:30 AM. For Phoenix and Firebird, by default we will provide refunds for interrupted jobs on paid accounts only. Please let us know if this causes a loss of funds significant enough that you cannot continue work on your free-tier Phoenix allocation!

WHAT DO YOU NEED TO DO?  

Wait patiently; we will communicate as soon as the clusters are ready to resume work.  

WHO SHOULD YOU CONTACT FOR QUESTIONS?  

For any questions, please contact PACE at pace-support@oit.gatech.edu.  
