On Friday, May 24, at 4:00pm, our primary InfiniBand (IB) subnet manager (SM) experienced a partial failure that may have impacted running and starting MPI jobs that use IP over IB. Because the primary SM failed only partially, our backup SM did not take over. On Saturday, May 25, at 12:15pm, we switched to a new subnet manager and restored the network. The service outage lasted from Friday, May 24, 4:00pm to Saturday, May 25, 12:15pm. Since this network interruption may have impacted running jobs, please check whether any of your jobs crashed and report any problems to pace-support@oit.gatech.edu.
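As a rough aid for checking your jobs, the following minimal Python sketch flags jobs whose run time intersects the outage window stated above. The job IDs and timestamps are hypothetical placeholders, not real PACE data; substitute the start and end times reported for your own jobs.

```python
from datetime import datetime

# Outage window taken from the announcement (local time).
OUTAGE_START = datetime(2019, 5, 24, 16, 0)
OUTAGE_END = datetime(2019, 5, 25, 12, 15)


def overlaps_outage(job_start, job_end):
    """Return True if the job's run interval intersects the outage window."""
    return job_start < OUTAGE_END and job_end > OUTAGE_START


# Hypothetical job records (job id, start, end); replace with your own.
jobs = [
    ("job-A", datetime(2019, 5, 24, 9, 0), datetime(2019, 5, 24, 15, 30)),
    ("job-B", datetime(2019, 5, 24, 15, 0), datetime(2019, 5, 24, 18, 0)),
    ("job-C", datetime(2019, 5, 25, 13, 0), datetime(2019, 5, 25, 14, 0)),
]

# Collect the ids of jobs that were running at any point during the outage.
affected = [job_id for job_id, start, end in jobs if overlaps_outage(start, end)]
print(affected)
```

A job flagged here is only a candidate for review: it may have completed normally, particularly if it did not use IP over IB.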
May 28, 2019
May 18, 2019
PACE Ready for Research
Our May 2019 maintenance (https://blog.pace.gatech.edu/?p=6473) is complete one day ahead of schedule! We have brought compute nodes online and released previously submitted jobs. Login nodes are accessible and your data are available. We are postponing the replacement of CMOS batteries on the servers due to a scheduling conflict with the vendor. As usual, there are a small number of straggling nodes that we will address over the coming days.
Compute
- (Complete) Upgrade testflight cluster to RHEL 7.6
- (Complete) Upgrade gemini-gpu and gemini-cpu clusters to RHEL7, which will require user action (only for gemini-cpu/gpu clusters' users)
- (Complete) Switch nodes between chemx and gemini-cpu queues
- (Postponed) Replace CMOS batteries on multiple servers
Network
- (Complete) Replace a faulty InfiniBand switch, which affects a single rack with no impact to the complete fabric
- (Complete) Migrate Rich to campus connections to 10Gbps
Storage
- (Complete) Reboot ICE storage servers to correct issues with backup application
- (Complete) Perform detailed performance analysis of the GPFS environment in order to fine-tune parameters and improve performance
Other
- (Postponed) Updates to the submit filters in the schedulers
- (Complete) Update salt master and minions
If you have any questions or concerns, please contact pace-support@oit.gatech.edu
May 7, 2019
[Complete] PACE quarterly maintenance – May 16-18, 2019
[Update – 05/09/2019] The final schedule for our quarterly maintenance includes the following tasks:
Compute
- (no user action needed) Replace CMOS batteries on multiple servers
- (no user action needed) Upgrade testflight cluster to RHEL 7.6
- (some user action needed) Upgrade gemini-gpu and gemini-cpu clusters to RHEL7, which will require user action (only for gemini-cpu/gpu clusters' users)
- (no user action needed) Switch nodes between chemx and gemini-cpu queues
Network
- (no user action needed) Replace a faulty InfiniBand switch, which affects a single rack with no impact to the complete fabric
- (no user action needed) Migrate Rich to campus connections to 10Gbps
Storage
- (no user action needed) Reboot ICE storage servers to correct issues with backup application
- (no user action needed) Perform detailed performance analysis of the GPFS environment in order to fine-tune parameters and improve performance
Other
- (no user action needed) Updates to the submit filters in the schedulers
- (no user action needed) Update salt master and minions
[Original Post – May 7, 2019 – 12:32pm] We are preparing for our quarterly maintenance, which is planned for three days, starting on Thursday, May 16, 2019, and running through Saturday, May 18.
As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete.
In general, we will perform maintenance on the PACE network and migrate from 10Gbps to 40Gbps connections, analyze GPFS storage performance, upgrade the schedulers, replace CMOS batteries, and upgrade the testflight cluster to the latest RHEL 7 kernel, 3.10.0-957.12.1 (i.e., RHEL 7.6).
While we are still finalizing the task list and details, none of these tasks is expected to require any user action.
May 3, 2019
Brief Interruption to PACE VPN During Service Maintenance
[May 3, 2019 – 4:53pm] On May 7, 2019, from 8:00pm to 9:00pm (EST), GT IT will be conducting maintenance of our VPN service. During this period, users who are connected to our ITAR/ASDL/CUI clusters via the VPN (pace.vpn.gatech.edu) will be disconnected. The interruption will be brief (about 5 minutes), after which you may reconnect to the VPN and then to the cluster. This service maintenance will not impact any running batch jobs, but it may impact interactive jobs running during this period. For additional details on the maintenance taking place, please visit the following link.
Thank you for your attention to this maintenance that GT IT is conducting.