PACE A Partnership for an Advanced Computing Environment

May 18, 2021

[Completed – PACE Clusters Ready for Research] PACE Maintenance Period 5/19/2021-5/21/2021

Filed under: Maintenance — Semir Sarajlic @ 5:18 pm

[Update – 05/20/2021, 2:10PM]

Dear PACE Users,

Our scheduled maintenance has completed one day ahead of schedule! All Coda datacenter clusters are ready for research. As usual, we have released all user jobs that were held by the scheduler. We appreciate everyone’s patience as we worked through these maintenance activities.

Our next maintenance period is tentatively scheduled to begin at 6:00AM on Wednesday, 08/11/2021, and to conclude by 11:59PM on Friday, 08/13/2021.

Here is an update on the tasks performed during this maintenance period, which includes an additional task that was added to our list as the maintenance progressed:

New task added during maintenance period:

  • [COMPLETE] [Datacenter/Network] Departmental Firewall firmware upgrade: This task was part of an OIT maintenance scheduled for Friday, 05/21/2021 (8:00pm – 2:00am on 5/22). PACE was able to decouple it from the OIT maintenance period and include it in our current maintenance, which avoids a further interruption to the research community after the PACE maintenance period completes.

Items Not Requiring User Action:

  • [COMPLETE] [Network] Replace InfiniBand cables on login-hive1.
  • [COMPLETE] [Network] Upgrade of InfiniBand firmware on PACE admin nodes.
  • [COMPLETE] [Network] Switch all affected CloudBolt Virtual Machines to DHCP.
  • [COMPLETE] [Network] Upgrade DDN SFA14KXE GridScaler and SFA Firmware.
  • [CANCELLED] [Network] Update KVM/qemu hosts in CUI clusters. UPDATE: Cancelled as it was deemed unnecessary due to firmware being up-to-date for our RHEL version.
  • [COMPLETE] [Archive] Removal of InfiniteIO from pace-archive.
  • [COMPLETE] [System] Remove /opt/pace directories everywhere.
  • [COMPLETE] [Firewall] PACE departmental firewall will be updated.
  • [COMPLETE] [Scheduler] Upgrade Torque on the Phoenix, Hive, Firebird, PACE-ICE, and COC-ICE schedulers. To improve user experience and prevent a recurrence of the recent Phoenix scheduler outage (May 13), we deployed a recent patch from the vendor. The patch resolves a race condition the scheduler can face with array jobs and jobs with dependencies, which should provide PACE users with a more reliable experience.
  • [COMPLETE] [Hive] Add InfiniBand Topo Map for the Hive Scheduler.
  • [COMPLETE] [Datacenter] Georgia Power Microgrid Testing

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

Happy Computing!
The PACE Team

 

[Update – 05/18/2021, 5:18pm]

We are writing to notify you of our next Maintenance Period, which will begin at 6:00 AM on Wednesday, 5/19/2021, and is scheduled to conclude by 11:59 PM on Friday, 5/21/2021. Because the systems will be powered off during this period, the scheduler will hold any job whose resource request would have it running during the Maintenance Period until after the Maintenance Period. During the Maintenance Period, the computational and storage resources will be unavailable.
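To estimate whether a job you submit before the Maintenance Period is likely to be held, you can compare its requested walltime against the window above. The following is a minimal sketch only: the function name `would_be_held` is hypothetical, it is not a scheduler API, and it assumes the hold decision depends solely on whether the job's requested run time overlaps the announced window.

```python
from datetime import datetime, timedelta

# Maintenance window as announced in this post.
MAINT_START = datetime(2021, 5, 19, 6, 0)
MAINT_END = datetime(2021, 5, 21, 23, 59)


def would_be_held(start: datetime, walltime: timedelta) -> bool:
    """Return True if a job starting at `start` with the requested
    `walltime` would overlap the maintenance window.

    Hypothetical helper: a simplified model of the hold rule, not the
    scheduler's actual logic.
    """
    end = start + walltime
    # Two intervals overlap when each one starts before the other ends.
    return start < MAINT_END and end > MAINT_START


if __name__ == "__main__":
    submit_time = datetime(2021, 5, 18, 18, 0)
    for hours in (8, 24):
        held = would_be_held(submit_time, timedelta(hours=hours))
        print(f"{hours}-hour job starting {submit_time}: held={held}")
```

Under these assumptions, a job that could start at 6:00 PM on 5/18 with an 8-hour walltime request would finish before the window opens, while the same job with a 24-hour request would not.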

For your reference, the following tasks are scheduled:

Items Requiring User Action:

  • None currently scheduled.

Items Not Requiring User Action:

  • [Network] Replace InfiniBand cables on login-hive1.
  • [Network] Upgrade of InfiniBand firmware on PACE admin nodes.
  • [Network] Switch all affected CloudBolt Virtual Machines to DHCP.
  • [Network] Upgrade DDN SFA14KXE GridScaler and SFA Firmware.
  • [Network] Update KVM/qemu hosts in CUI clusters.
  • [Archive] Removal of InfiniteIO from pace-archive.
  • [System] Remove /opt/pace directories everywhere.
  • [Firewall] PACE departmental firewall will be updated.
  • [Scheduler] Upgrade Torque on the Phoenix, Hive, Firebird, PACE-ICE, and COC-ICE schedulers. In order to improve user experience and prevent a recurrence of the recent Phoenix scheduler outage (May 13), we will deploy this recent patch from the vendor. The patch resolves a race condition the scheduler can face with array jobs and jobs with dependencies, which should provide PACE users with a more reliable experience.
  • [Hive] Add InfiniBand Topo Map for the Hive Scheduler.
  • [Datacenter] Georgia Power Microgrid Testing

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

May 13, 2021

[RESOLVED] Phoenix Scheduler is Down

Filed under: Uncategorized — Semir Sarajlic @ 11:58 am

Update (5/13 2:00pm): We are happy to report that the Phoenix Scheduler is now online and accepting jobs.

We are sorry for the inconvenience this has caused. Please let us know if you continue to observe any problems (pace-support@oit.gatech.edu).
----
At around 10:30am this morning, we restarted the Phoenix scheduler to apply a new license file. The scheduler is having trouble coming back online, and we are actively troubleshooting this issue. So far, we know the issue is unrelated to the license; rather, some leftover job files may be causing it. We are working on reviving the scheduler as soon as possible.

This issue doesn’t impact any running jobs, or those submitted before the incident. Only new job submissions will fail with an error.

We’ll update this post (https://blog.pace.gatech.edu/?p=7075) and send a follow-up message once the issue is resolved.

Thank you for your patience, and we apologize for the inconvenience.

 

 

May 6, 2021

OIT Scheduled Service for MATLAB: 05/07/2021, 10:00AM – noon

Filed under: Uncategorized — Semir Sarajlic @ 3:39 pm

OIT will perform work on Georgia Tech’s MATLAB license server tomorrow morning, 05/07/2021, 10:00 AM – noon, which will impact any MATLAB jobs running on PACE at the time of the outage (as well as elsewhere on campus).

During the outage window, attempts to open new MATLAB instances in batch or interactive jobs will fail. In addition, we expect that running MATLAB instances will stop working, although the jobs themselves will continue running.

PACE aims to identify affected jobs tomorrow morning and follow up with the impacted users.

We recommend that you avoid submitting additional MATLAB jobs to PACE that will not finish before 10 AM on Friday (May 7) and instead submit them after the work is complete.

OIT will be providing up-to-date progress on Georgia Tech’s Status page, http://status.gatech.edu. 

If you have any questions, please contact us at pace-support@oit.gatech.edu.

 

 
