Author Archive

Firebird inaccessible

Posted on Monday, 3 October, 2022

[Update 10/3/22 10:45 AM]

Access to Firebird and the PACE VPN has been restored, and all systems should be functioning normally. If you do not see the PACE VPN as an option in the GlobalProtect client, please disconnect from the GT VPN and reconnect for it to appear again.

Urgent maintenance on the GlobalProtect VPN device on Thursday night inadvertently led to the loss of PACE VPN access, which was restored this morning.

Please contact us at pace-support@oit.gatech.edu with questions, or if you are still unable to access Firebird.

 

[Original Message 10/3/22 9:40 AM]

Summary: The Firebird cluster and PACE VPN are currently inaccessible. OIT is working to restore access.

Details: The Firebird cluster was found to be inaccessible over the weekend. PACE is working with OIT colleagues to identify the cause and restore access.

Impact: Researchers are unable to connect to the PACE VPN or access the Firebird cluster.

Thank you for your patience as we work to restore access. Please contact us at pace-support@oit.gatech.edu with questions.

Hive scheduler outage

Posted on Tuesday, 26 July, 2022

Summary: The Hive scheduler became nonresponsive yesterday evening and was restored at approximately 8:30 AM today.

Details: The Torque resource manager on the Hive scheduler stopped responding around 7:00 PM yesterday. The PACE team restored its function around 8:30 AM this morning and is continuing to monitor its status. The scheduler was fully functional for some time after the system utility repair yesterday afternoon, and it is not clear whether the two issues are connected.

Impact: Commands such as “qsub” and “qstat” would not have worked, so new jobs could not be submitted, including via Hive Open OnDemand. Running jobs were not interrupted.

Thank you for your patience last night. Please contact us at pace-support@oit.gatech.edu with any questions.

[Resolved] Utility error prevented new jobs starting on Hive, Phoenix, PACE-ICE, and COC-ICE

Posted on Monday, 25 July, 2022

(updated to reflect that Hive was impacted as well)

Summary: An error in a system utility resulted in the Hive, Phoenix, PACE-ICE, and COC-ICE clusters temporarily not launching new jobs. It has been repaired, and jobs have resumed launching.

Details: An unintended update to the system utility that checks the health of compute nodes resulted in all Hive, Phoenix, PACE-ICE, and COC-ICE compute nodes being recorded as down shortly before 4:00 PM today, even though there was in fact no issue with them. The scheduler will not launch new jobs on nodes marked down. After correcting the issue, all nodes are again correctly reporting their status, and jobs have resumed launching on all four clusters as of 6:30 PM.

Impact: As all nodes appeared down, no new jobs could launch; submitted jobs instead remained in the queue. Running jobs were not impacted. Interactive jobs waiting to start might have been cancelled, in which case the researcher should re-submit.

Thank you for your patience this afternoon. Please contact us at pace-support@oit.gatech.edu with any questions.

Phoenix scheduler outage

Posted on Friday, 8 July, 2022

Summary: The Phoenix scheduler became nonresponsive this afternoon and was restored at approximately 4:50 PM today.

Details: The Torque resource manager on the Phoenix scheduler became overloaded, likely around 2:45 PM. The PACE team restarted the scheduler and restored its function around 4:50 PM.

Impact: Commands such as “qsub” and “qstat” would not have worked, so new jobs could not be submitted, including via Phoenix Open OnDemand. Running jobs were not interrupted.

Thank you for your patience this afternoon. Please contact us at pace-support@oit.gatech.edu with any questions.

Campus ESX Incident Impacting PACE services

Posted on Tuesday, 28 June, 2022

[Update 6/28/22 2:00 PM]

The ESX host issue is resolved, and all PACE services are fully restored. Please contact pace-support@oit.gatech.edu with any questions, or if you encounter further issues.

[Original Post 6/28/22 12:55 PM]

Summary: An issue with an ESX host is affecting multiple campus services, including several PACE services. Open OnDemand and some PACE utilities are currently unavailable. OIT is working to resolve the issue.

Details: The ESX issue affects campus virtual machines, hosting both PACE and other services. Visit https://status.gatech.edu for details.

Impact:

– Open OnDemand websites for all PACE clusters may not load.

– Some PACE utilities may hang, including pace-quota, pace-whoami, and pace-check-queue.

– There may be intermittent unavailability of software licenses.

Thank you for your patience as OIT works to resolve this outage. Please contact us at pace-support@oit.gatech.edu with any questions about the impacted PACE services.

Hive scheduler degraded state

Posted on Tuesday, 31 May, 2022

[Update 6/3/22 4:55 PM]

After the full restart of scheduler services across Hive this afternoon, we have returned to full production status on the cluster. Thank you for your patience this week as we investigated the issue. Please contact us at pace-support@oit.gatech.edu with any questions.

[Update 6/3/22 2:25 PM]

The PACE team is continuing to investigate the partial disruption of the Hive scheduler. We are currently performing a full restart of all scheduler services across the Hive cluster. While this cluster-wide service restart is in progress this afternoon, it is not possible to submit, start, or check the status of any jobs on Hive. Commands such as qsub, qstat, and showq are unavailable. Running jobs are not impacted.

We appreciate your patience during this process. Please contact us at pace-support@oit.gatech.edu with any questions.

[Original Message 5/31/22 5:30 PM]

Summary: The Hive scheduler is currently in a degraded state, and many waiting jobs will not start.

Details: The Torque resource manager and the Moab workload manager, the two components of the Hive scheduler, are currently reporting conflicting information about resources allocated to running jobs. This causes failed attempts to schedule waiting jobs on resources that are already allocated, which prevents the jobs from starting. The PACE team is actively investigating this situation and working to resolve it.

Impact: Some queued jobs, especially those requesting a larger number of resources, may remain in the queue even though resources may appear to be available via tools such as pace-check-queue. Interactive jobs may be cancelled by the scheduler while waiting to start. Running jobs are not impacted.

Please contact us at pace-support@oit.gatech.edu with any questions.

Hive scheduler outage

Posted on Tuesday, 31 May, 2022

Summary: The Hive scheduler stopped launching new jobs on Monday afternoon and was restored at approximately 10:00 AM on Tuesday.

Details: At approximately 12:35 PM on Monday, during the Memorial Day holiday, the Torque resource manager on Hive became nonresponsive due to an error. The PACE team restarted the scheduler and restored its function at 10:00 this morning.

Impact: Commands such as “qsub” and “qstat” would not have worked, so new jobs could not be submitted. Running jobs were not interrupted. Moab commands such as “showq” were not impacted.

Thank you for your patience during the holiday weekend. Please contact us at pace-support@oit.gatech.edu with any questions.

 

[Resolved] Phoenix scheduler timeout

Posted on Friday, 20 May, 2022

Summary: A timeout on the Phoenix scheduler prevented new jobs from beginning earlier today.

Details: A setting caused a timeout issue in the communication between the Torque and Moab portions of the Phoenix scheduler this morning, beginning at 10:20 AM. The PACE team restored communication between the services before 12:20 PM today.

Impact: During the intervening period, no new jobs could start. Running jobs were not interrupted, and job submission to the queue remained functional. Commands such as “qsub” and “qstat” continued to work.

Thank you for your patience this morning. Please contact us at pace-support@oit.gatech.edu with any questions.

PACE Firebird Login Node Outages

Posted on Wednesday, 27 April, 2022

[Update 4/27/22 5:45 PM]

The remaining headnode has been repaired, and service is restored. Thank you for your patience.

[Original Post 4/27/22 5:20 PM]

Summary: A storage server issue made headnodes for two projects on Firebird inaccessible. One has been recovered, while repairs are in progress on the second one.

Details: The storage server housing two Firebird projects had an NFS issue earlier today, which impacted the login nodes for those projects. The PACE team has repaired one project’s login node and is currently repairing the second, which has a more complex issue.

Impact: Researchers on impacted projects are/were not able to log into Firebird today. Running jobs were not impacted, as only the login node is/was affected.

We apologize for the disruption. Please email us at pace-support@oit.gatech.edu with any questions.

Campus network disaster recovery testing June 10-13

Posted on Wednesday, 27 April, 2022

[Update 6/6/22 11:20 AM]

Summary: Revised plans for OIT’s network disaster recovery test remove all expected impact to PACE.

Details: Changes in the disaster recovery test mean that we no longer expect any impact to PACE this weekend, and all PACE clusters should operate normally, including OnDemand and other PACE services. Campus license servers should also remain reachable from PACE. For additional details about the disaster recovery scope, please see https://oit.gatech.edu/recoveryexercisejun22.

Impact: We have removed the scheduler reservations on all PACE clusters, so longer jobs that have been held can now begin. No impact is expected.

Please contact us at pace-support@oit.gatech.edu with any questions or concerns. 

[Update 5/24/22 10:00 AM]

Summary: Updated information lessens impact to Hive and introduces new partial impact to Firebird during disaster recovery testing (June 10-13).

Details: As additional details about the disaster recovery testing have been clarified, we have determined that Hive can remain in production throughout the testing with limited disruptions. These disruptions will also affect Firebird. We will remove the reservation currently in place on Hive for these dates.

Impact:

  • Phoenix, PACE-ICE, and COC-ICE will be disabled from 5:00 PM on Friday, June 10, through the morning of Monday, June 13.
  • Hive and Firebird will remain in production, but some services will be unavailable for much of the weekend:
    • Hive OnDemand will be unavailable.
    • PACE license servers will be unavailable. Intel compilers will not be usable, so no code can be compiled with them, though previously compiled binaries can be executed.
    • License servers from the College of Engineering, providing access to MATLAB, Ansys, Abaqus, and Comsol for the entire campus, will not be reachable. Any batch or interactive jobs that attempt to check out a license for these applications will fail. Researchers are encouraged to avoid such jobs just before the outage and to wait until it is complete before submitting them.
    • A number of PACE utilities, such as pace-quota and pace-check-queue, will not function.
    • Other intermittent disruptions are possible.
  • Buzzard will not be impacted.

Thank you for your understanding and cooperation during this campus network testing. Please contact us at pace-support@oit.gatech.edu with any questions or concerns. 

 

[Original announcement 4/27/22 11:45 AM]

Summary: Campus network disaster recovery testing will disable Phoenix, Hive, PACE-ICE, and COC-ICE from 5:00 PM on Friday, June 10, through 12:00 noon on Monday, June 13.  

Details: In accordance with USG security requirements, OIT will be conducting disaster recovery testing on the Georgia Tech campus network during the weekend of June 11, which will close access to most of PACE’s clusters as well as some other campus resources. PACE’s Phoenix, Hive, PACE-ICE, and COC-ICE clusters will be impacted. Firebird and Buzzard will remain in production.

Impact: PACE will set a reservation to prevent any jobs from running during the downtime. You will not be able to log in, access your data, or run jobs during the outage.

Longer jobs will be held until the testing is complete if their requested walltime would not allow them to finish before the outage, just as they are during quarterly maintenance periods. Researchers who run long jobs should note the duration between PACE’s May maintenance period (May 11-13) and the testing period, beginning June 10. In particular, Hive researchers who submit 30-day jobs to the hive-nvme, hive-sas, or hive-nvme-sas queues should note that any 30-day job submitted after April 12 will not begin until at least June 13. Researchers are encouraged to submit jobs with reduced walltimes whenever feasible to make use of the cluster between maintenance and disaster recovery testing.
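
As a rough illustration of the walltime rule described above, the following Python sketch checks whether a job with a given requested walltime could finish before the testing window opens. The function name `fits_before_outage`, the example start time, and the "finish at or before the outage start" comparison are assumptions made for illustration; the scheduler's actual reservation logic may differ.

```python
from datetime import datetime, timedelta

# Start of the testing window, taken from the announcement:
# 5:00 PM on Friday, June 10, 2022.
OUTAGE_START = datetime(2022, 6, 10, 17, 0)


def fits_before_outage(earliest_start, walltime, outage_start=OUTAGE_START):
    """Return True if a job that starts at earliest_start and runs for its
    full requested walltime would end at or before outage_start.

    Assumption: a job is held whenever its requested walltime would run
    past the start of the outage.
    """
    return earliest_start + walltime <= outage_start


# Hypothetical start time shortly after the May maintenance period.
start = datetime(2022, 5, 14, 9, 0)

print(fits_before_outage(start, timedelta(days=30)))  # 30-day request
print(fits_before_outage(start, timedelta(days=21)))  # 21-day request
```

Under these assumptions, the 30-day request would run past June 10 and be held, while the 21-day request would finish beforehand. This is why reduced walltimes are suggested.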

Thank you for your understanding and cooperation during this campus network testing. Please contact us at pace-support@oit.gatech.edu with any questions or concerns.