[Update 6/6/22 11:20 AM]
Summary: Revised plans for OIT’s network disaster recovery test remove all expected impact to PACE.
Details: Changes to the disaster recovery test mean that we no longer expect any impact to PACE this weekend. All PACE clusters should operate normally, including OnDemand and other PACE services, and campus license servers should remain reachable from PACE. For additional details about the disaster recovery scope, please see https://oit.gatech.edu/recoveryexercisejun22.
Impact: We have removed the scheduler reservations on all PACE clusters, so longer jobs that have been held can now begin. No impact is expected.
Please contact us at pace-support@oit.gatech.edu with any questions or concerns.
[Update 5/24/22 10:00 AM]
Summary: Updated information lessens impact to Hive and introduces new partial impact to Firebird during disaster recovery testing (June 10-13).
Details: As additional details about the disaster recovery testing have been clarified, we have determined that Hive can remain in production throughout the testing with limited disruptions. These same disruptions will also affect Firebird. We will remove the reservation currently in place on Hive for these dates.
Impact:
- Phoenix, PACE-ICE, and COC-ICE will be disabled from 5:00 PM on Friday, June 10, through the morning of Monday, June 13.
- Hive and Firebird will remain in production, but some services will be unavailable for much of the weekend:
  - Hive OnDemand will be unavailable.
  - PACE license servers will be unavailable. No code can be compiled with Intel compilers during this time, though previously compiled binaries can still be executed.
  - License servers from the College of Engineering, which provide access to MATLAB, Ansys, Abaqus, and Comsol for the entire campus, will not be reachable. Any batch or interactive job that attempts to check out a license for these applications will fail. Researchers are encouraged to avoid such jobs just before the outage and to wait until it is complete before submitting them.
  - A number of PACE utilities, such as pace-quota and pace-check-queue, will not function.
  - Other intermittent disruptions are possible.
- Buzzard will not be impacted.
Thank you for your understanding and cooperation during this campus network testing. Please contact us at pace-support@oit.gatech.edu with any questions or concerns.
[Original announcement 4/27/22 11:45 AM]
Summary: Campus network disaster recovery testing will disable Phoenix, Hive, PACE-ICE, and COC-ICE from 5:00 PM on Friday, June 10, through 12:00 noon on Monday, June 13.
Details: In accordance with USG security requirements, OIT will be conducting disaster recovery testing on the Georgia Tech campus network during the weekend of June 11, which will close access to most of PACE’s clusters as well as some other campus resources. PACE’s Phoenix, Hive, PACE-ICE, and COC-ICE clusters will be impacted. Firebird and Buzzard will remain in production.
Impact: PACE will set a reservation to prevent any jobs from running during the downtime. You will not be able to log in, access your data, or run jobs during the outage.
Longer jobs whose requested walltime would extend into the outage will be held until the testing is complete, just as they are during quarterly maintenance periods. Researchers who run long jobs should note the short interval between PACE’s May maintenance period (May 11-13) and the testing period, which begins June 10. In particular, Hive researchers who submit 30-day jobs to the hive-nvme, hive-sas, or hive-nvme-sas queues should note that any 30-day job submitted after April 12 will not begin until at least June 13. Researchers are encouraged to submit jobs with reduced walltimes whenever feasible to make use of the cluster between maintenance and disaster recovery testing.
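The walltime reasoning above can be checked with a short Python sketch. This is an illustration only, not PACE's scheduler logic: the function name earliest_start is hypothetical, the cluster is assumed to be otherwise idle, and the maintenance period is assumed to cover whole days (midnight on May 11 to midnight on May 14), since the announcement gives only dates. The testing window uses the times stated in the summary.

```python
from datetime import datetime, timedelta

# Reservation windows as (start, end) pairs.
RESERVATIONS = [
    # May maintenance: times of day are an assumption (whole days).
    (datetime(2022, 5, 11, 0, 0), datetime(2022, 5, 14, 0, 0)),
    # Disaster recovery testing: times taken from the announcement.
    (datetime(2022, 6, 10, 17, 0), datetime(2022, 6, 13, 12, 0)),
]


def earliest_start(submit, walltime, reservations):
    """Return the earliest time a job can start without its requested
    walltime overlapping any reservation (hypothetical helper, assuming
    an idle cluster and non-overlapping reservations)."""
    start = submit
    for res_start, res_end in sorted(reservations):
        # The job overlaps if it starts before the reservation ends
        # and would still be running when the reservation begins.
        if start < res_end and start + walltime > res_start:
            start = res_end  # hold the job until the reservation ends
    return start


# A 30-day job submitted on April 20 fits neither gap.
long_job = earliest_start(datetime(2022, 4, 20), timedelta(days=30), RESERVATIONS)
print(long_job)

# A 7-day job submitted on May 20 fits between the two windows.
short_job = earliest_start(datetime(2022, 5, 20, 9, 0), timedelta(days=7), RESERVATIONS)
print(short_job)
```

Under these assumptions, the 30-day job is pushed past both windows, while the 7-day job can start as soon as it is submitted, which is why reduced walltimes make better use of the cluster between the two periods.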
Thank you for your understanding and cooperation during this campus network testing. Please contact us at pace-support@oit.gatech.edu with any questions or concerns.