[Update 05/09/24 04:25 PM]
Dear PACE users,
Maintenance on the Phoenix, Hive, Firebird, and OSG Buzzard clusters is complete. These clusters are back in production and ready for research, and all jobs that were held by the scheduler have been released.
The ICE cluster is still under maintenance due to the RHEL9 migration, but we expect it to be ready tomorrow. Instructors teaching summer courses will be notified when it is ready.
The POSIX user group names on the Phoenix, Hive, Firebird, and OSG Buzzard clusters have been updated so that names will start with the “pace-” prefix. If your scripts or workflows rely on POSIX group names, they will need to be updated; otherwise, no action is required on your part. This is a step towards tighter integration of PACE systems with central IAM tools, which will lead to improvements across the board in the PACE user experience.
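For scripts that reference group names directly, the update can be sketched as a text substitution. The Python snippet below is a minimal illustration, not a PACE-provided tool. The function name, the group name `examplegroup`, and the assumption that each new name is simply the old name with “pace-” prepended are all hypothetical; check your actual group names with `id` or `groups` on a login node before changing anything.

```python
import re


def prefix_group_names(script_text, old_groups, prefix="pace-"):
    """Return script_text with each listed group name given the prefix.

    Assumption (not confirmed by PACE): the new name is prefix + old name.
    """
    if not old_groups:
        return script_text
    # Try longer names first so one name that contains another is matched whole.
    ordered = sorted(old_groups, key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in ordered)
    # The lookarounds skip names that already carry the prefix or that are
    # embedded in a longer identifier.
    pattern = re.compile(rf"(?<![\w-])(?:{alternation})(?![\w-])")
    return pattern.sub(lambda match: prefix + match.group(0), script_text)


# "examplegroup" is a hypothetical group name used only for illustration.
print(prefix_group_names("chgrp examplegroup results/", ["examplegroup"]))
# prints: chgrp pace-examplegroup results/
```

Running the function a second time on already-updated text leaves it unchanged, so it is safe to apply to a script more than once.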
Just a reminder that the next Maintenance Period will be August 6-8, 2024.
Thank you for your patience!
-The PACE Team
[Update 05/07/24 06:00 AM]
PACE Maintenance Period starts now at 6:00 AM on Tuesday, 05/07/2024, and is tentatively scheduled to conclude by 11:59 PM on Friday, 05/10/2024.
[Update 05/01/24 06:37 PM]
WHEN IS IT HAPPENING?
PACE’s next Maintenance Period starts at 6:00 AM on Tuesday, May 7th, 05/07/2024, and is tentatively scheduled to conclude by 11:59 PM on Friday, May 10th, 05/10/2024. An extra day is needed to accommodate physical work done by Databank in the Coda Data Center. PACE will release each cluster (Phoenix, Hive, Firebird, ICE, and Buzzard) as soon as maintenance work is complete.
WHAT DO YOU NEED TO DO?
As usual, jobs with resource requests that would be running during the Maintenance Period will be held by the scheduler until after the maintenance. During this Maintenance Period, all PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, ICE, CEDAR, and Buzzard. Please plan accordingly for the projected downtime.
WHAT IS HAPPENING?
ITEMS REQUIRING USER ACTION:
- [all] During the maintenance period, the PACE team will rename all POSIX user groups so that names will start with the “pace-” prefix.
- This will NOT affect numerical GIDs, but if your scripts or workflows rely on group names, they will need to be updated.
- If you don’t use POSIX user group names in your scripts or workflows, no action is required on your part.
- This is a step towards tighter integration of PACE systems with central IAM tools, which will lead to improvements across the board in the PACE user experience.
- NOTE: This item was originally planned for January but was delayed to avoid integration issues with IAM services, which have now been resolved.
- [ICE] Migrate to the RHEL 9.3 operating system – if you need access to ICE for any summer courses, please let us know!
- The ICE login nodes will be updated to RHEL 9.3 as well, and this WILL create new ssh host-keys on ICE login nodes – so please expect a message that “WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!” when you (or your students) next access ICE after maintenance.
- [ICE] We will be retiring 8 of the RTX6000 GPU nodes on ICE to prepare for the addition of several new L40 nodes the week after the Maintenance Period.
- [software] Sync Gaussian and VASP on RHEL7 pace-apps.
- [software] Sync any remaining RHEL9 pace-apps for the OS migration.
- [Phoenix, ICE] Upgrade Nvidia drivers on all HGX/DGX servers.
- [Hive] The scratch deleter will not run in May and June but will resume in July.
- [Phoenix] The scratch deleter will not run in May but will resume in June.
- [ICE] The scratch deleter will run for Spring semester deletion during the week of May 13.
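For the group rename item in this list: because numerical GIDs are not affected, one way to make a workflow independent of the rename is to compare or look up groups by GID rather than by name. The following Python sketch assumes a Unix-like system and uses only the standard library; the function names are illustrative and are not part of any PACE tooling.

```python
import grp
import os


def group_name_for_path(path):
    """Return the current name of the group that owns path, looked up by GID."""
    gid = os.stat(path).st_gid
    try:
        return grp.getgrgid(gid).gr_name
    except KeyError:
        # No name is registered for this GID on this host; fall back to the number.
        return str(gid)


def same_group(path_a, path_b):
    """Compare group ownership by numeric GID, which the rename does not change."""
    return os.stat(path_a).st_gid == os.stat(path_b).st_gid


# Look up the owning group of the current directory without hard-coding a name.
print(group_name_for_path("."))
```

A check written this way should give the same answer before and after the maintenance, whereas a comparison against a hard-coded group name would need to be edited.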
ITEMS NOT REQUIRING USER ACTION:
- [datacenter] Databank maintenance: replace all components of the cold loop water pump that had issues a couple of maintenance periods ago.
- [Hive] Upgrade the underlying GPFS filesystem to version 5.1 in preparation for RHEL9.
- [datacenter] Repairs to one InfiniBand switch and two DDN storage controllers with degraded BBUs (Battery Backup Units).
- [datacenter] Upgrade storage controller firmware for DDN appliances to SFA 12.4.0.
- [Hive] Consolidate all the ICE access entitlements into a single one, all-pace-ice-access.
- [Hive] Upgrade Hive compute nodes to GPFS 5.1.
- [Phoenix] Replace cables for the Phoenix storage server.
- [Firebird] Patch Firebird storage server to 100GbE switch and reconfigure.
- [Firebird, Hive] Deploy Slurm scheduler CLI+Feature bits on Firebird and Hive.
- [datacenter] Configure LDAP on the MANTA NetApp HPCNA SVM.
WHY IS IT HAPPENING?
Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.
WHO IS AFFECTED?
All users across all PACE clusters.
WHO SHOULD YOU CONTACT FOR QUESTIONS?
Please contact PACE at pace-support@oit.gatech.edu with questions or concerns. You may read this message on our blog.
Thank you,
-The PACE Team
[Update 04/22/24 09:53 AM]
WHEN IS IT HAPPENING?
PACE’s next Maintenance Period starts at 6:00 AM on Tuesday, May 7th, 05/07/2024, and is tentatively scheduled to conclude by 11:59 PM on Friday, May 10th, 05/10/2024. The additional day is needed to accommodate physical work carried out by Databank in the Coda datacenter. PACE will release each cluster (Phoenix, Hive, Firebird, ICE, and Buzzard) as soon as maintenance work is complete.
WHAT DO YOU NEED TO DO?
As usual, jobs with resource requests that would be running during the Maintenance Period will be held by the scheduler until after the maintenance. During this Maintenance Period, all PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, ICE, CEDAR, and Buzzard. Please plan accordingly for the projected downtime.
WHAT IS HAPPENING?
ITEMS REQUIRING USER ACTION:
- [all] During the maintenance period, the PACE team will rename all POSIX user groups so that names will start with the “pace-” prefix.
- This will NOT affect numerical GIDs, but if your scripts or workflows rely on group names, they will need to be updated.
- If you don’t use POSIX user group names in your scripts or workflows, no action is required on your part.
- This is a step towards tighter integration of PACE systems with central IAM tools, which will lead to improvements across the board in the PACE user experience.
- NOTE: This item was originally planned for January, but was delayed to avoid integration issues with IAM services, which have now been resolved.
- [ICE] Migrate to the RHEL 9.3 operating system – if you need access to ICE for any summer courses, please let us know!
- Note – This WILL create new ssh host-keys on ICE login nodes – so please expect a message that “WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!” when you (or your students) next access ICE after maintenance.
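For the host-key warning described in the last item: once you have confirmed the change is the expected one from this maintenance, the stale entry can be removed from your known_hosts file so that the next connection records the new key. The Python sketch below wraps OpenSSH's `ssh-keygen -R`, assuming OpenSSH is installed on your own machine. The function names are illustrative, and `ice-login.example.edu` is a placeholder, not the real ICE login address.

```python
import subprocess


def forget_host_key_command(hostname, known_hosts=None):
    """Build the ssh-keygen command that removes all stored keys for hostname."""
    command = ["ssh-keygen", "-R", hostname]
    if known_hosts is not None:
        # -f selects a known_hosts file other than the default ~/.ssh/known_hosts.
        command += ["-f", known_hosts]
    return command


def forget_host_key(hostname, known_hosts=None):
    """Run the removal; raises CalledProcessError if ssh-keygen reports failure."""
    subprocess.run(forget_host_key_command(hostname, known_hosts), check=True)


# Placeholder hostname; substitute the ICE login address you normally use.
print(" ".join(forget_host_key_command("ice-login.example.edu")))
# prints: ssh-keygen -R ice-login.example.edu
```

Only remove a stored key when the change is announced, as it is here; an unexpected host-key warning at any other time should be reported rather than cleared.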
ITEMS NOT REQUIRING USER ACTION:
- [datacenter] Databank maintenance: replace all components of the cold loop water pump that had issues a couple of maintenance periods ago.
- [Hive] Upgrade the underlying GPFS filesystem to version 5.1 in preparation for RHEL9.
- [datacenter] Repairs to one InfiniBand switch and two DDN storage controllers with degraded BBUs (Battery Backup Units).
- [datacenter] Upgrade storage controller firmware for DDN appliances to SFA 12.4.0.
WHY IS IT HAPPENING?
Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.
WHO IS AFFECTED?
All users across all PACE clusters.
WHO SHOULD YOU CONTACT FOR QUESTIONS?
Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.