[Complete 11/5/21 3:15 PM]
Our scheduled maintenance has completed ahead of schedule! All PACE clusters, including Phoenix, Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard, are ready for research. As usual, we have released all user jobs that were held by the scheduler. We appreciate everyone’s patience as we worked through these maintenance activities.
Our next maintenance period is tentatively scheduled to begin at 6:00 AM on Wednesday, February 9, 2022, and conclude by 11:59 PM on Friday, February 11, 2022. We have also tentatively scheduled the remaining maintenance periods for 2022 for May 11-13, August 10-12, and November 2-4.
The following tasks were part of this maintenance period:
• [Complete] TensorFlow upgrade due to security vulnerability. PACE will retire older versions of TensorFlow, and researchers should shift to using the new module. We also request that you replace any self-installed TensorFlow packages. Additional details are available on our blog.
• [Complete][Datacenter] Databank will clean the water cooling tower, requiring that all PACE compute nodes be powered off.
• [Complete][System] Operating system patch installs
• [Complete][Storage/Phoenix] Lustre controller firmware and other upgrades
• [Complete][Storage/Phoenix] Lustre scratch upgrade and expansion
• [Postponed][Storage] Hive GPFS storage upgrade
• [Complete][System] System configuration management updates
• [Complete][System] Updates to NVIDIA drivers and libraries
• [Complete][System] Upgrade some PACE infrastructure nodes to RHEL 7.9
• [Complete][System] Reorder group file
• [Complete][Headnode/ICE] Configure c-group controls on COC-ICE and PACE-ICE headnodes
• [Complete][Scheduler/Hive] Separate Torque & Moab servers to improve scheduler reliability
• [Complete][Network] Update Ethernet switch firmware
• [Complete][Network] Update IP addresses of switches in BCDC
If you have any questions or concerns, please contact us at pace-support@oit.gatech.edu.
[Update 11/1/21 2:00 PM]
C-group controls will be configured on the login nodes for both COC-ICE and PACE-ICE during the maintenance period this week. This should help mitigate overuse of the login node by students running heavy computations, which has slowed the node for others.
Please use compute nodes for all computational work and avoid resource-intensive processes on the login nodes. Students who need an interactive environment should submit an interactive job. Students who are uncertain about how to use the ICE schedulers to work on compute nodes should contact their course’s instructor or TA, who can help with workflows on the cluster. PACE will stop processes that overuse the login nodes in order to restore functionality for all students.
Thank you for your efforts to ensure ICE clusters are an available resource for all students in participating courses.
[Reminder 10/26/21 4:30 PM]
Additional details and instructions for the TensorFlow upgrade are available in another blog post.
[Full announcement 10/20/21 10:30 AM]
As previously announced, our next PACE maintenance period is scheduled to begin at 6:00 AM on Wednesday, November 3, and end at 11:59 PM on Friday, November 5. As usual, jobs that request durations that would extend into the maintenance period will be held by the scheduler to run after maintenance is complete. During the maintenance window, all PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard.
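To avoid having a job held, it can help to work out the longest duration a job can request and still finish before maintenance begins. The following is a minimal sketch of that arithmetic, not a description of how the scheduler itself decides which jobs to hold. The function name `max_walltime` and the 30-minute safety margin are illustrative assumptions, and the calculation does not account for time a job spends waiting in the queue.

```python
from datetime import datetime, timedelta

# Maintenance start from the announcement: 6:00 AM on Wednesday, November 3, 2021.
MAINTENANCE_START = datetime(2021, 11, 3, 6, 0)


def max_walltime(start_time, safety_margin=timedelta(minutes=30)):
    """Return the longest duration (HH:MM:SS) that ends before maintenance.

    start_time    -- when the job is expected to start running
    safety_margin -- hypothetical buffer; not a PACE-specified value
    Returns None if no time remains before the maintenance window.
    """
    remaining = MAINTENANCE_START - start_time - safety_margin
    if remaining <= timedelta(0):
        return None
    total_seconds = int(remaining.total_seconds())
    hours, rem = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


if __name__ == "__main__":
    # A job expected to start exactly one day before maintenance begins.
    print(max_walltime(datetime(2021, 11, 2, 6, 0)))
```

The result is formatted as HH:MM:SS so it can be compared directly with the duration requested in a job script. A job requesting more than this value would extend into the maintenance period and would be expected to wait until maintenance is complete.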
Please see below for a tentative list of activities:
- TensorFlow upgrade due to security vulnerability. PACE will retire older versions of TensorFlow, and researchers should shift to using the new module. We also request that you replace any self-installed TensorFlow packages. Additional details and instructions will follow in a separate message.
- [Datacenter] Databank will clean the water cooling tower, requiring that all PACE compute nodes be powered off.
- [System] Operating system patch installs
- [Storage/Phoenix] Lustre controller firmware and other upgrades
- [Storage/Phoenix] Lustre scratch upgrade and expansion
- [System] System configuration management updates
- [System] Updates to NVIDIA drivers and libraries
- [System] Upgrade some PACE infrastructure nodes to RHEL 7.9
- [System] Reorder group file
- [Headnode/COC-ICE] Configure c-group controls on COC-ICE headnode
- [Scheduler/Hive] Separate Torque & Moab servers to improve scheduler reliability
- [Network] Update Ethernet switch firmware
- [Network] Update IP addresses of switches in BCDC
If you have any questions or concerns, please contact us at pace-support@oit.gatech.edu.
[Early announcement]
Dear PACE Users,
This is a friendly reminder that our next maintenance period is tentatively scheduled to begin at 6:00 AM on Wednesday, 11/03/2021, and to conclude by 11:59 PM on Friday, 11/05/2021. As usual, the scheduler will hold any job whose resource request would have it running during the maintenance period, and will release it after maintenance is complete. During the maintenance period, all PACE-managed computational and storage resources will be unavailable.
As we get closer to the maintenance period, we will communicate the list of activities to be completed and update this blog post.
If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.
The PACE Team