The maintenance on the Phoenix, Hive, Buzzard, Firebird, and ICE clusters has been completed. All clusters are back in production and ready for research, and all jobs held by the scheduler have been released. The Firebird cluster was released at 12:30 pm on October 30, and the other clusters were released at 2:45 pm on October 27.
Update on the current cooling situation: DataBank performed a temporary repair to restore cooling to the research hosting environment. Cooling capacity in the research hall remains below 100% and is being actively monitored, but we are currently able to run the clusters at full capacity. DataBank plans to install new parts during the next maintenance window, which is scheduled for January 23-25, 2024. Should the situation worsen and a full repair be required sooner, we will do our best to provide at least one week's notice. At this time, we do not expect additional downtime to be needed.
Update on Firebird: We are happy to announce that the Firebird cluster is ready to use after migration to the Slurm scheduler! Again, we greatly appreciate your patience during this extended maintenance period. Over the weekend we were able to investigate a few lingering issues with MPI and the user environment on the cluster, and we have implemented and tested corrections for both.
Firebird users will no longer be able to use the Torque/Moab scheduler and should make sure their workflows run on the Slurm-based cluster. Please contact us if you need additional help moving your workflows to Slurm. PACE provides the Firebird Migration Guide and an additional Firebird-specific Slurm training session [register here] to support a smooth transition of your workflows to Slurm. You are also welcome to join our PACE Consulting Sessions or to email us for support.
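As a starting point, here is a minimal sketch of a Slurm batch script. The resource values are illustrative and the program name and account string are placeholders, so please substitute the directives that match your existing Torque/Moab job; the Firebird Migration Guide lists the full set of equivalents.

    #!/bin/bash
    #SBATCH --job-name=example          # replaces "#PBS -N example"
    #SBATCH --account=<charge account>  # see "Account names" under Changes to Note
    #SBATCH --nodes=1                   # replaces "#PBS -l nodes=1:ppn=4"
    #SBATCH --ntasks-per-node=4
    #SBATCH --time=01:00:00             # replaces "#PBS -l walltime=01:00:00"
    #SBATCH --output=example-%j.out     # %j expands to the Slurm job ID

    cd $SLURM_SUBMIT_DIR                # analogous to $PBS_O_WORKDIR (Slurm starts here by default)
    srun ./my_program                   # launch your application

Submit the script with sbatch (the counterpart of qsub) and check its status with squeue -u $USER (the counterpart of qstat -u <username>).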
[Changes to Note]
- New Hardware: There are 12 new 32-core Intel Cascade Lake CPU nodes with 384 GB of RAM available, in addition to new GPU nodes, each with 4x NVIDIA A100 GPUs, 48-core Intel Xeon Gold CPUs, and 512 GB of RAM.
- Account names: Under Slurm, charge accounts will be named “cgts-<PI username>-<project>-<account>” rather than beginning with the “GT-” prefix.
- Default GPU: If you do not specify a GPU type in your job script, Slurm will default to an NVIDIA A100 node rather than an NVIDIA RTX6000 node; the A100 nodes are more performant but also more expensive, so specify the GPU type explicitly if you want the RTX6000 (see the sketch of directives after this list).
- SSH Keys: When you log in for the first time, you may receive a warning about new host keys, similar to the following:
Warning: the ECDSA host key for ‘login-.pace.gatech.edu’ differs from the key for the IP address ‘xxx.xx.xx.xx’
Offending key for IP in /home/gbrudell3/.ssh/known_hosts:1
Are you sure you want to continue connecting (yes/no)?
This is expected! Simply type “yes” to continue.
- Depending on your local SSH client settings, you may instead be prevented from logging in and need to edit your ~/.ssh/known_hosts file to remove the old key (see the ssh-keygen sketch after this list).
- Jupyter and VNC: We do not currently have replacements for the Jupyter or VNC scripts in the new Slurm environment; we will be working on a solution over the coming weeks.
- MPI: Researchers using mvapich2 under the Slurm environment must add --constraint=core24 or --constraint=core32 (reflecting the number of CPU cores per node) to ensure a homogeneous node allocation for the job; see the sketch of directives after this list.
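To make the account, GPU, and MPI items above concrete, the fragment below shows how the corresponding directives might look in a job script. The --constraint flags and the A100 default come from the notes above; the GPU type string and --gres syntax are assumptions on our part, so please confirm the exact names in use on Firebird (for example, via the Firebird Migration Guide or sinfo output) before relying on them.

    #SBATCH --account=cgts-<PI username>-<project>-<account>   # new charge account format (formerly prefixed with GT-)
    #SBATCH --gres=gpu:RTX_6000:1   # request an RTX6000 explicitly; omit the type and Slurm defaults to an A100 node (type string assumed)
    #SBATCH --constraint=core32     # with mvapich2, restrict the job to 32-core nodes (use core24 for 24-core nodes)

If your SSH client blocks the connection outright because of the changed host key, one common way to clear the stale entry is with ssh-keygen; replace the placeholder below with the login hostname (or IP address) shown in your warning.

    ssh-keygen -R <login hostname from the warning>   # removes the old entry from ~/.ssh/known_hosts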
If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.
Thank you for your patience during this extended outage!
The PACE Team