PACE A Partnership for an Advanced Computing Environment

May 18, 2021

[Completed – PACE Clusters Ready for Research] PACE Maintenance Period 5/19/2021-5/21/2021

Filed under: Maintenance — Semir Sarajlic @ 5:18 pm

[Update – 05/20/2021, 2:10PM]

Dear PACE Users,

Our scheduled maintenance has completed one day ahead of schedule! All Coda datacenter clusters are ready for research. As usual, we have released all user jobs that were held by the scheduler. We appreciate everyone’s patience as we worked through these maintenance activities.

Our next maintenance period is tentatively scheduled to begin at 6:00AM on Wednesday, 08/11/2021, and to conclude by 11:59PM on Friday, 08/13/2021.

Here is an update on the tasks performed during this maintenance period, which includes an additional task that was added to our list as the maintenance progressed:

New task added during maintenance period:

  • [COMPLETE] [Datacenter/Network] Departmental Firewall firmware upgrade: This task was part of a scheduled OIT maintenance on Friday, 05/21/2021 (8:00pm – 2:00am on 5/22). PACE was able to decouple it from the OIT maintenance period and include it in our current maintenance, which avoids any further interruption to the research community after the PACE maintenance period completes.

Items Not Requiring User Action:

  • [COMPLETE] [Network] Replace InfiniBand cables on login-hive1.
  • [COMPLETE] [Network] Upgrade of InfiniBand firmware on PACE admin nodes.
  • [COMPLETE] [Network] Switch all affected CloudBolt Virtual Machines to DHCP.
  • [COMPLETE] [Network] Upgrade DDN SFA14KXE GridScaler and SFA Firmware.
  • [CANCELLED] [Network] Update KVM/qemu hosts in CUI clusters. UPDATE: Cancelled as it was deemed unnecessary due to firmware being up-to-date for our RHEL version.
  • [COMPLETE] [Archive] Removal of InfiniteIO from pace-archive.
  • [COMPLETE] [System] Remove /opt/pace directories everywhere.
  • [COMPLETE] [Firewall] PACE departmental firewall will be updated.
  • [COMPLETE] [Scheduler] Upgrade Torque on the Phoenix, Hive, Firebird, PACE-ICE, and COC-ICE schedulers. To improve user experience and prevent a recurrence of the recent Phoenix scheduler outage (May 13), we deployed a recent patch from the vendor. The patch resolves a race condition the scheduler can face with array jobs and jobs with dependencies, which should provide PACE users with a more reliable experience.
  • [COMPLETE] [Hive] Add InfiniBand Topo Map for the Hive Scheduler.
  • [COMPLETE] [Datacenter] Georgia Power Microgrid Testing

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

Happy Computing!
The PACE Team

[Update – 05/18/2021, 5:18pm]

We are writing to notify you of our next Maintenance Period, which will begin at 6:00 AM on Wednesday, 5/19/2021, and is scheduled to conclude by 11:59 PM on Friday, 5/21/2021. Because the systems will be powered off during this period, the scheduler will hold any job whose resource request would have it running during the Maintenance Period until the Maintenance Period is over. During the Maintenance Period, the computational and storage resources will be unavailable.
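The hold rule described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `would_be_held` is hypothetical, the check considers the requested walltime only, and the actual scheduler policy on PACE may differ.

```python
from datetime import datetime, timedelta

# Maintenance window taken from the announcement above.
MAINTENANCE_START = datetime(2021, 5, 19, 6, 0)
MAINTENANCE_END = datetime(2021, 5, 21, 23, 59)


def would_be_held(now: datetime, walltime: timedelta) -> bool:
    """Hypothetical helper: return True if a job that started at `now`
    with the requested `walltime` could still be running when the
    maintenance window opens."""
    if now >= MAINTENANCE_END:
        # The window is already over, so nothing is held for it.
        return False
    # The job overlaps the window if its latest possible end time
    # falls after the maintenance start.
    return now + walltime > MAINTENANCE_START
```

Under these assumptions, a job requesting 12 hours at 5:18 PM on 5/18 would end before 6:00 AM on 5/19 and would not be held for this reason, while a 13-hour request at the same time would be held.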

For your reference, the following tasks are scheduled:

Items Requiring User Action:

  • None currently scheduled.

Items Not Requiring User Action:

  • [Network] Replace InfiniBand cables on login-hive1.
  • [Network] Upgrade of InfiniBand firmware on PACE admin nodes.
  • [Network] Switch all affected CloudBolt Virtual Machines to DHCP.
  • [Network] Upgrade DDN SFA14KXE GridScaler and SFA Firmware.
  • [Network] Update KVM/qemu hosts in CUI clusters.
  • [Archive] Removal of InfiniteIO from pace-archive.
  • [System] Remove /opt/pace directories everywhere.
  • [Firewall] PACE departmental firewall will be updated.
  • [Scheduler] Upgrade Torque on the Phoenix, Hive, Firebird, PACE-ICE, and COC-ICE schedulers. In order to improve user experience and prevent a recurrence of the recent Phoenix scheduler outage (May 13), we will deploy this recent patch from the vendor. The patch resolves a race condition the scheduler can face with array jobs and jobs with dependencies, which should provide PACE users with a more reliable experience.
  • [Hive] Add InfiniBand Topo Map for the Hive Scheduler.
  • [Datacenter] Georgia Power Microgrid Testing

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

July 24, 2020

[Resolved]: PACE Maintenance Days 8/6/2020-8/8/2020

Filed under: Maintenance — Semir Sarajlic @ 7:10 pm

Dear PACE Users,

RESOLVED: PACE is now ready for research.

We are preparing for our next PACE maintenance period, which will begin at 6:00 AM on August 6th, 2020 and conclude at 11:59 PM on August 8th, 2020. As usual, jobs with long walltimes will be held by the scheduler to ensure that no active jobs will be running when systems are powered off. These jobs will be released as soon as the maintenance activities are complete.

We are still finalizing planned activities for the maintenance period. Here is a current list:

ITEMS REQUIRING USER ACTION:

– None currently.

ITEMS NOT REQUIRING USER ACTION:

– [Resolved] Coda Lustre Upgrade. This will start on Wednesday, August 5, and will impact testflight-coda only; a scheduler reservation was put in place to prevent any jobs from running past 6:00AM on that day.

– [Resolved] Install additional line cards for CS8500 InfiniBand switch.

– [Resolved] Deploy PBSTools RPM on schedulers

– [Resolved] Upgrade Hive InfiniBand switches firmware to version 3.9.0914

– [Resolved] Upgrade Coda InfiniBand director switches firmware to version 3.9.0914

– [Resolved] Move DNS appliance from Rich to Coda.

– [Resolved] Update coda-apps file system mounts to use qtrees from NetApp on all servers.

– [Deferred] Update NVIDIA GPU Drivers in Coda to support CUDA 11 SDK.

– [Resolved] Reboot of all nodes.

– [Resolved] Rebooted the subnet manager.

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

The PACE Team

December 16, 2019

OIT Network Maintenance 12/18/2019-12/19/2019

Filed under: Maintenance, News — Semir Sarajlic @ 9:41 pm

To Our Valued PACE Research Community,

We are writing to inform our research community of upcoming maintenance, as follows: 

The Office of Information Technology (OIT) will be performing a series of upgrades to the networking infrastructure to improve the performance and reliability of networking operations. Some of these upcoming enhancements may impact PACE users’ ability to connect and interact with computational and storage resources. We do not expect this network maintenance to have any impact on currently running jobs.

12/18/2019 20:00-23:59 (Router Code Upgrade) An upgrade to the software on some routers is scheduled and will include an approximate 30-minute disruption to telecommunication services.  

12/18/2019 20:00 – 12/19/2019 02:00 (Data Center Router Code Upgrade & Routing Engine Upgrade) An upgrade to the software on multiple devices will impact network connectivity across the main campus of the Georgia Institute of Technology. This disruption will include the CODA Building.

OIT Technical Teams will be actively monitoring the progress of upgrades during the maintenance windows described above. These teams will be providing ongoing communications to student, faculty, and staff members of the Institute. A central location for progress communications will be available at http://status.gatech.edu 

Issues during the upgrade may be reported to the OIT Network Operations Center at (404)894-4669. 

We do not expect any impact on running jobs and no changes to the PACE computational and storage resources are part of this OIT Network maintenance. 

Thank you for your time and diligence,

PACE Outreach and Faculty Interaction Team

September 25, 2019

OIT Planned Maintenance

Filed under: Maintenance — Aaron Jezghani @ 2:51 pm

The OIT Network Services team will be performing a software upgrade on our campus Carrier-Grade NAT (CGN) appliances this week; see OIT Status for a full description. The affected subnet is the out-of-band management network of the Hive/MRI servers; additionally, only internet-bound connections are being serviced. As such, no failures are expected for users of the Hive/MRI servers. Nonetheless, if you encounter connectivity issues to Hive resources, please do not hesitate to contact pace-support@oit.gatech.edu for assistance.
