[Update] May 9, 2025 at 5:34 pm
Dear PACE Community,
While all PACE clusters are up, have passed tests, and are accepting jobs, you may encounter errors due to inconsistencies in the packages installed across our systems. We are aware of minor differences in the packages installed on compute nodes of the same type and are working to address this as quickly as possible.
Please let us know via email to pace-support@oit.gatech.edu if you encounter any unusual job errors.
We will continue working to resolve the situation and provide updates as we learn more.
The PACE Team
[Update] May 9, 2025 at 5:16 pm
Dear Firebird users,
The Firebird cluster is back in production and has resumed running jobs.
As previously mentioned, this cluster is now only running the RHEL9 operating system. Please reference our prior emails about SSH keys on Firebird if you experience any trouble logging in!
One RTX6000 GPU node is currently unavailable; we will work to repair it next week. All other GPU types (A100 and H200) are available.
Thank you for your patience as we continue to work on the Firebird cluster.
Best,
The PACE Team
[Update] May 9, 2025 at 12:15 pm
Dear PACE users,
Maintenance on the Hive, Buzzard, ICE and Phoenix clusters is complete. These clusters are back in production, and all jobs held by the scheduler have been released.
The Firebird cluster is still under maintenance; Firebird users will be notified separately once work is complete.
We are happy to share that all PACE clusters are now running the RHEL9 operating system and that other important security updates are complete.
The update to IDEaS storage is ongoing – the storage is currently accessible, but it is still necessary to use the `newgrp` command to set the order of your group membership just as before maintenance.
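As a sketch of that step (the group name `gts-ideas-example` below is hypothetical; substitute your own IDEaS storage group):

```shell
# Show all groups you belong to; the first group reported by `id -gn`
# is the one currently active for new files.
id

# Hypothetical group name -- substitute your actual IDEaS storage group:
# newgrp gts-ideas-example
# `newgrp` starts a new shell with that group active, so files you create
# on IDEaS storage afterward are owned by the correct group.
```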
If you are building or running MPI applications on Phoenix’s H100/H200 nodes, please be aware that the MVAPICH2 and OpenMPI modules are no longer compatible with system updates to the H100/H200 nodes. We highly recommend using HPC-X for MPI, as it provides numerous benefits for MPI + GPU workloads. To use it, load the nvhpc/24.5 and hpcx/2.19-cuda modules. This will not affect the vast majority of single-node Python workflows, which typically do not use MPI.
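For example, a minimal job-script fragment for an H100/H200 node (only the module names come from the announcement above; the build and launch commands are illustrative placeholders):

```shell
# Load the NVIDIA HPC SDK and HPC-X modules named above:
module load nvhpc/24.5 hpcx/2.19-cuda

# Illustrative build and launch of an MPI application
# ("my_mpi_app" is a placeholder, not a PACE-provided example):
mpicc -o my_mpi_app my_mpi_app.c
srun ./my_mpi_app
```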
Another goal for this maintenance period was the replacement of the problematic cooling system pump. Although the new pump was rigorously tested and calibrated prior to installation, it did not pass inspection once installed, and DataBank datacenter staff were required to remove it and reinstall the original. We share your frustration in this matter. However, operating a safe and reliable datacenter is our utmost priority, and we will continue doing our best to keep PACE resources stable until DataBank is able to successfully replace the cooling pump. We continue to work with Georgia Tech leadership on long-term solutions to improve overall reliability and meet the expectations of our users.
At this time, we have extended the next maintenance period (August 5–8, 2025) to allow for installation of a new cooling pump. We will share additional information as it becomes available.
Thank you,
The PACE Team
[Maintenance] April 28, 2025 at 9:42am
WHEN IS IT HAPPENING?
PACE’s next Maintenance Period starts at 6:00AM on Monday, May 5th (05/05/2025), and is tentatively scheduled to conclude by 11:59PM on Friday, May 9th (05/09/2025). PACE will release each cluster (Phoenix, Hive, Firebird, ICE, and Buzzard) as soon as maintenance work and testing are completed.
WHAT DO YOU NEED TO DO?
As usual, jobs whose resource requests would overlap the Maintenance Period will be held by the scheduler until maintenance is complete. During this Maintenance Period, all PACE-managed computational and storage resources will be unavailable, including Phoenix, Hive, Firebird, ICE, and Buzzard. Please plan accordingly for the projected downtime. CEDAR storage will not be affected.
WHAT IS HAPPENING?
ITEMS REQUIRING USER ACTION:
- [Firebird] The Firebird system will migrate completely to the RHEL9 operating system
ITEMS NOT REQUIRING USER ACTION:
- Change IDEaS storage user authentication from AD to LDAP
- Run filesystem checks on all Lustre filesystems
- Upgrade IDEaS storage
- Upgrade Phoenix Project storage servers and controllers
- Upgrade Phoenix scratch storage servers and controllers
- Upgrade ICE scratch storage servers and controllers
- Move ice-shared from NetApp to VAST storage
- Rebuild ondemand-ice on physical hardware to handle increased usage
- Move ICE pace-apps to separate storage volume
- Firebird storage and scheduler improvements
- Upgrade DDN Insight (for monitoring storage system performance)
- DataBank: replace cooling pump assembly
- DataBank: cooling tower cleanup
WHY IS IT HAPPENING?
Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system. This particular instance is allowing for the complete replacement of a problematic cooling system pump in the datacenter.
WHO IS AFFECTED?
All users across all PACE clusters.
WHO SHOULD YOU CONTACT FOR QUESTIONS?
Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.
Thank you,
-The PACE Team