[Completed – PACE Clusters Ready for Research] PACE Maintenance Period 5/19/2021-5/21/2021

This entry was posted on Tuesday, 18 May, 2021.

[Update – 05/20/2021, 2:10PM]

Dear PACE Users,

Our scheduled maintenance has completed one day ahead of schedule! All Coda datacenter clusters are ready for research. As usual, we have released all user jobs that were held by the scheduler. We appreciate everyone’s patience as we worked through these maintenance activities.

Our next maintenance period is tentatively scheduled to begin at 6:00AM on Wednesday, 08/11/2021, and to conclude by 11:59PM on Friday, 08/13/2021.

Here is an update on the tasks performed during this maintenance period, which includes an additional task that was added to our list as the maintenance progressed:

New task added during maintenance period:

  • [COMPLETE] [Datacenter/Network] Departmental Firewall firmware upgrade: This task was originally part of an OIT maintenance scheduled for Friday, 05/21/2021 (8:00pm – 2:00am on 5/22). PACE was able to decouple it from the OIT maintenance and include it in our current maintenance period, which avoids a further interruption to the research community after the PACE maintenance period completes.

Items Not Requiring User Action:

  • [COMPLETE] [Network] Replace InfiniBand cables on login-hive1.
  • [COMPLETE] [Network] Upgrade of InfiniBand firmware on PACE admin nodes.
  • [COMPLETE] [Network] Switch all affected CloudBolt Virtual Machines to DHCP.
  • [COMPLETE] [Network] Upgrade DDN SFA14KXE GridScaler and SFA Firmware.
  • [CANCELLED] [Network] Update KVM/qemu hosts in CUI clusters. UPDATE: Cancelled as unnecessary because the firmware is already up to date for our RHEL version.
  • [COMPLETE] [Archive] Removal of InfiniteIO from pace-archive.
  • [COMPLETE] [System] Remove /opt/pace directories everywhere.
  • [COMPLETE] [Firewall] Update the PACE departmental firewall.
  • [COMPLETE] [Scheduler] Upgrade Torque on the Phoenix, Hive, Firebird, PACE-ICE, and COC-ICE schedulers. To improve user experience and prevent a recurrence of the recent Phoenix scheduler outage (May 13), we deployed a recent patch from the vendor. The patch resolves a race condition the scheduler can face with array jobs and jobs with dependencies, which should provide PACE users with a more reliable experience.
  • [COMPLETE] [Hive] Add InfiniBand Topo Map for the Hive Scheduler.
  • [COMPLETE] [Datacenter] Georgia Power Microgrid Testing

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

Happy Computing!
The PACE Team

 

[Update – 05/18/2021, 5:18pm]

We are writing to notify you of our next Maintenance Period, which will begin at 6:00 AM on Wednesday, 5/19/2021, and is scheduled to conclude by 11:59 PM on Friday, 5/21/2021. Because the systems will be powered off during this period, the scheduler will hold any job whose resource request would have it running during the Maintenance Period, and will release it after the Maintenance Period ends. Computational and storage resources will be unavailable during the Maintenance Period.
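To estimate whether a job you submit before the Maintenance Period is likely to be held, you can compare its requested walltime (for example, the value passed to Torque via `-l walltime=`) against the start of the window. The Python sketch below is only an illustration of that arithmetic, not the scheduler's actual logic: the function names `would_be_held` and `longest_safe_walltime` are hypothetical, the times are assumed to be local campus time, and the job is assumed to start at the moment you specify.

```python
from datetime import datetime, timedelta

# Maintenance window taken from the announcement (local time assumed).
MAINT_START = datetime(2021, 5, 19, 6, 0)
MAINT_END = datetime(2021, 5, 21, 23, 59)


def would_be_held(start: datetime, walltime: timedelta) -> bool:
    """Hypothetical check: True if a job starting at `start` with the
    requested walltime would overlap the maintenance window."""
    end = start + walltime
    return start < MAINT_END and end > MAINT_START


def longest_safe_walltime(start: datetime) -> timedelta:
    """Hypothetical helper: longest walltime that still finishes
    before the window opens (zero if the window has already begun)."""
    return max(MAINT_START - start, timedelta(0))


if __name__ == "__main__":
    submit = datetime(2021, 5, 18, 18, 0)  # 6:00 PM the evening before
    print(would_be_held(submit, timedelta(hours=8)))   # ends 2:00 AM on 5/19
    print(would_be_held(submit, timedelta(hours=13)))  # ends 7:00 AM on 5/19
    print(longest_safe_walltime(submit))
```

In this sketch, a job that would start at 6:00 PM on 5/18 fits before the window only if it requests 12 hours or less. In practice, queue wait time also counts against that budget, so a shorter request is safer.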

For your reference, the following tasks are scheduled:

Items Requiring User Action:

  • None currently scheduled.

Items Not Requiring User Action:

  • [Network] Replace InfiniBand cables on login-hive1.
  • [Network] Upgrade of InfiniBand firmware on PACE admin nodes.
  • [Network] Switch all affected CloudBolt Virtual Machines to DHCP.
  • [Network] Upgrade DDN SFA14KXE GridScaler and SFA Firmware.
  • [Network] Update KVM/qemu hosts in CUI clusters.
  • [Archive] Removal of InfiniteIO from pace-archive.
  • [System] Remove /opt/pace directories everywhere.
  • [Firewall] PACE departmental firewall will be updated.
  • [Scheduler] Upgrade Torque on the Phoenix, Hive, Firebird, PACE-ICE, and COC-ICE schedulers. To improve user experience and prevent a recurrence of the recent Phoenix scheduler outage (May 13), we will deploy a recent patch from the vendor. The patch resolves a race condition the scheduler can face with array jobs and jobs with dependencies, which should provide PACE users with a more reliable experience.
  • [Hive] Add InfiniBand Topo Map for the Hive Scheduler.
  • [Datacenter] Georgia Power Microgrid Testing

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.
