PACE: A Partnership for an Advanced Computing Environment

July 15, 2024

PACE Maintenance Period Aug 06-09 2024

Filed under: Uncategorized — Eric Coulter @ 3:36 pm

[Update 07/31/24 02:23pm]

WHEN IS IT HAPPENING?

PACE’s next Maintenance Period starts at 6:00 AM on Tuesday, August 6th (08/06/2024) and is tentatively scheduled to conclude by 11:59 PM on Friday, August 9th (08/09/2024). The extra day accommodates the additional testing required while both RHEL7 and RHEL9 versions of our systems are in service during the migration to the new operating system. PACE will release each cluster (Phoenix, Hive, Firebird, ICE, and Buzzard, along with their associated RHEL9 environments) as soon as maintenance work and testing are completed. We plan to focus on the largest portion of each system first, to ensure access to data and compute capabilities is restored as soon as possible.

Also, we have CANCELED the November maintenance period for 2024 and do NOT plan to have another maintenance outage until early 2025.

WHAT DO YOU NEED TO DO?   

As usual, the scheduler will hold any job whose requested walltime would overlap the Maintenance Period until after the maintenance is complete. During this Maintenance Period, access to all PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, ICE, and Buzzard. Please plan accordingly for the projected downtime. CEDAR storage will not be affected.
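In practice, this means a job can still run before the window opens if its requested walltime ends before maintenance begins. The sketch below sizes such a request; GNU date is assumed, and the submit time and job script name are hypothetical examples, not PACE-specific values:

```shell
# Sketch: compute how many whole hours remain between a (hypothetical)
# submit time and the maintenance start, so a job's --time request can
# be sized to finish beforehand. GNU date is assumed.
maint_start=$(date -d '2024-08-06 06:00' +%s)
submit_time=$(date -d '2024-08-05 06:00' +%s)
hours_left=$(( (maint_start - submit_time) / 3600 ))
echo "hours available before maintenance: $hours_left"
# A job requesting no more than this walltime would not be held, e.g.:
#   sbatch --time=${hours_left}:00:00 myjob.sbatch
```

Jobs requesting more walltime than remains before the window will simply stay pending until maintenance concludes; they do not need to be resubmitted.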

For Phoenix, we are migrating 427 nodes (~30% of the ~1400 total nodes on Phoenix) from RHEL7 to RHEL9 in August. The new RHEL9 nodes will not be available immediately after the Maintenance Period is completed but will come online the following week (August 12th – 16th). After this migration, about 50% of the Phoenix cluster will be on RHEL9, including all but 20 GPU nodes. Given this, we strongly encourage Phoenix users who have not yet migrated their workflows to RHEL9 to do so as soon as possible.
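A quick way to confirm which operating system a login or compute node is running is to read /etc/os-release. This is a generic sketch (it reports the major version on any Linux host, falling back to "unknown" if the file is absent):

```shell
# Sketch: report the major OS version of the current node by sourcing
# /etc/os-release (present on RHEL and on Linux systems generally).
if [ -r /etc/os-release ]; then
    rhel_major=$(. /etc/os-release; echo "${VERSION_ID%%.*}")
fi
rhel_major=${rhel_major:-unknown}
echo "OS major version: $rhel_major"
```

Running this after logging in makes it easy to verify that a workflow is being tested on an RHEL9 node rather than a remaining RHEL7 one.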

WHAT IS HAPPENING?   

ITEMS REQUIRING USER ACTION: 

  • [Phoenix and Hive] Continue migrating nodes to the RHEL 9 operating system
      • Migrate 427 nodes to RHEL9 in Phoenix
      • Migrate 100 nodes to RHEL9 in Hive
  • [Phoenix, Hive, Firebird, ICE] GPU nodes will receive new versions of the NVIDIA drivers, which *may* impact locally built tools using CUDA. 
  • [Phoenix] H100 GPU users on Phoenix should use the RHEL9 login node to avoid module environment issues.
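Regarding the NVIDIA driver update: locally built CUDA tools can stop working if they were compiled against a runtime that no longer matches the installed driver. The sketch below records both versions so a mismatch can be spotted after maintenance; it degrades gracefully on nodes where nvidia-smi or nvcc is not installed:

```shell
# Sketch: record the NVIDIA driver and CUDA compiler versions so locally
# built CUDA tools can be rebuilt if they no longer match after the update.
driver_ver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n1)
driver_ver=${driver_ver:-unavailable}
nvcc_ver=$(nvcc --version 2>/dev/null | grep -o 'release [0-9.]*' | head -n1)
nvcc_ver=${nvcc_ver:-unavailable}
echo "driver: $driver_ver"
echo "nvcc:   $nvcc_ver"
```

Capturing this output before and after the Maintenance Period makes it straightforward to decide whether a rebuild of locally compiled CUDA code is warranted.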

ITEMS NOT REQUIRING USER ACTION: 

  • [all] Databank cooling loop work, which will require shutdown of all systems 
  • [all] Upgrade to RHEL 9.4 from 9.3 on all RHEL9 nodes – should not impact user-installed software 
  • [all] Research and Enterprise Hall Ethernet switch code upgrade 
  • [all] Upgrade PACE welcome emails 
  • [all] Upgrade Slurm scheduler nodes to RHEL9 
  • [CEDAR] Adding SSSD and IDmap configurations to RHEL7 nodes to allow correct group access across PACE resources 
  • [Phoenix] Updates to Lustre storage to improve stability
      • File consistency checks across all metadata servers, appliance firmware updates, and external metadata server replacement on project storage
  • [Phoenix] Install additional InfiniBand interfaces to HGX servers 
  • [Phoenix] Migrate OOD Phoenix RHEL9 apps 
  • [Phoenix, Hive] Enable Apptainer self-service 
  • [Phoenix, Hive, ICE] Upgrade Phoenix/Hive/ICE subnet managers to RHEL9 
  • [Hive] Upgrade Hive storage for new disk replacement to take effect 
  • [ICE] Updates to Lustre scratch storage to improve stability
      • File consistency checks and appliance firmware updates
  • [ICE] Retire ICE enabling rules for ECE 
  • [ICE] Migrate ondemand-ice server to RHEL9 

WHY IS IT HAPPENING?

Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.

WHO IS AFFECTED?  

All users across all PACE clusters.

WHO SHOULD YOU CONTACT FOR QUESTIONS?   

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

Thank you,

-The PACE Team 

[Update 07/15/24 03:36pm]

WHEN IS IT HAPPENING?  

PACE’s next Maintenance Period starts at 6:00 AM on Tuesday, August 6th (08/06/2024), and is tentatively scheduled to conclude by 11:59 PM on Friday, August 9th (08/09/2024). The additional day accommodates the extra testing required due to the presence of both RHEL7 and RHEL9 versions of our systems as we migrate to the new operating system. PACE will release each cluster (Phoenix, Hive, Firebird, ICE, and Buzzard, along with their associated RHEL9 environments) as soon as maintenance work and testing are completed. We plan to focus on the largest portion of each system first, to ensure access to data and compute capabilities is restored as soon as possible.
 
Additionally, we have cancelled the November maintenance period for 2024 and do not plan to have another maintenance outage until early 2025.

WHAT DO YOU NEED TO DO?   

As usual, the scheduler will hold any job whose requested walltime would overlap the Maintenance Period until after the maintenance is complete. During this Maintenance Period, access to all PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, ICE, and Buzzard. Please plan accordingly for the projected downtime. CEDAR storage will not be affected. 

WHAT IS HAPPENING?   

ITEMS REQUIRING USER ACTION: 

  • [Phoenix and Hive] Continue migrating nodes to the RHEL 9.3 operating system.  

ITEMS NOT REQUIRING USER ACTION: 

  • [all] Databank cooling loop work, which will require shutdown of all systems 
  • [CEDAR] Adding SSSD and IDmap configurations to allow correct group access across PACE resources 
  • [Phoenix] Updates to Lustre storage to improve stability  
  • File consistency checks across all metadata servers, appliance firmware updates, external metadata server replacement on /storage/coda1 

WHY IS IT HAPPENING?  

Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.  

WHO IS AFFECTED?  

All users across all PACE clusters.  

WHO SHOULD YOU CONTACT FOR QUESTIONS?   

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.  

Thank you,  

-The PACE Team 

July 8, 2024

Phoenix project storage outage

Filed under: Uncategorized — Michael Weiner @ 4:40 pm

[Update 7/9/24 12:00 PM]

Phoenix project storage has been repaired, and the scheduler has resumed. All Phoenix services are now functioning.

We have updated a parameter to throttle the number of operations on the metadata servers to improve stability.

Please contact us at pace-support@oit.gatech.edu if you encounter any remaining issues.

[Original Post 7/8/24 4:40 PM]

Summary: Phoenix project storage is currently inaccessible. We have paused the Phoenix scheduler, so no new jobs will start.

Details: Phoenix Lustre project storage has been slow and intermittently unresponsive throughout the day. The PACE team identified a few user jobs causing a high workload on the storage system, but the load remained high on one metadata server, which eventually stopped responding. Our storage vendor recommended a failover to a different metadata server as part of a repair, but the system was left fully unresponsive. PACE and our storage vendor continue to work on restoring full access to project storage.

Impact: The Phoenix scheduler has been paused to prevent new jobs from hanging, so no new jobs can start. Currently running jobs may not make progress and should be cancelled if stuck. Home and scratch directories remain accessible, but an ls of the full home directory may hang due to the symbolic link to project storage.
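To identify and clear stuck jobs, the Slurm client tools can be used as in the sketch below; it requires squeue/scancel (present on PACE login nodes but possibly not elsewhere, so it falls back to an empty list), and 123456 is a hypothetical job ID:

```shell
# Sketch: list your own running jobs so any that are hung on project
# storage can be identified; falls back gracefully if squeue is absent.
my_jobs=$(squeue --me --states=RUNNING --noheader 2>/dev/null || true)
echo "running jobs:"
echo "${my_jobs:-none found (or squeue unavailable)}"
# Cancel a job that is no longer making progress, e.g.:
#   scancel 123456
```

Jobs cancelled this way can simply be resubmitted once project storage is restored and the scheduler is resumed.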

Thank you for your patience as we work to restore Phoenix project storage. Please contact us at pace-support@oit.gatech.edu with any questions. You may visit https://status.gatech.edu/ for additional updates.

July 5, 2024

IDEaS storage Maintenance

Filed under: Uncategorized — Deepa Phanish @ 1:04 pm

WHAT’S HAPPENING?

One of the IDEaS IntelliFlash controller cards needs to be reseated. Before reseating the card, we will fail over all resources to controller B, shut down controller A, pull out the whole enclosure, and reseat the card. This activity takes about two hours to complete.

WHEN IS IT HAPPENING?

Monday, July 8th, 2024, starting at 9 AM EDT.

WHY IS IT HAPPENING?

We are working with the vendor to resolve an issue discovered while debugging the controllers and to restore the system to a healthy state.

WHO IS AFFECTED?

Users of the IDEaS storage system will notice decreased performance, since all services will be switched over to a single controller. Access may be briefly interrupted during the failover.

WHAT DO YOU NEED TO DO?

During the maintenance, data access should be preserved, and we do not expect downtime. However, there have been cases in the past where storage has become inaccessible. If storage becomes unavailable during the replacement, jobs accessing the IDEaS storage may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage is accessible again.

WHO SHOULD YOU CONTACT FOR QUESTIONS?

For any questions, please contact PACE at pace-support@oit.gatech.edu.
