PACE A Partnership for an Advanced Computing Environment

April 27, 2022

PACE Firebird Login Node Outages

Filed under: Uncategorized — Michael Weiner @ 5:23 pm

[Update 4/27/22 5:45 PM]

The remaining headnode has been repaired, and service is restored. Thank you for your patience.

[Original Post 4/27/22 5:20 PM]

Summary: A storage server issue made headnodes for two projects on Firebird inaccessible. One has been recovered, while repairs are in progress on the second one.

Details: The storage server housing two Firebird projects had an NFS issue earlier today, which impacted the login nodes. The PACE team has repaired one project’s login node and is currently repairing the second, which has a more complex issue.

Impact: Researchers on impacted projects are/were not able to log into Firebird today. Running jobs were not impacted, as only the login node is/was affected.

We apologize for the disruption. Please email us at pace-support@oit.gatech.edu with any questions.

Campus network disaster recovery testing June 10-13

Filed under: Uncategorized — Michael Weiner @ 11:42 am

[Update 6/6/22 11:20 AM]

Summary: Revised plans for OIT’s network disaster recovery test remove all expected impact to PACE.

Details: Changes in the disaster recovery test mean that we no longer expect any impact to PACE this weekend, and all PACE clusters should operate normally, including OnDemand and other PACE services. Campus license servers should also remain reachable from PACE. For additional details about the disaster recovery scope, please see https://oit.gatech.edu/recoveryexercisejun22.

Impact: We have removed the scheduler reservations on all PACE clusters, so longer jobs that have been held can now begin. No impact is expected.

Please contact us at pace-support@oit.gatech.edu with any questions or concerns. 

[Update 5/24/22 10:00 AM]

Summary: Updated information lessens impact to Hive and introduces new partial impact to Firebird during disaster recovery testing (June 10-13).

Details: As additional details about the disaster recovery testing have been clarified, we have determined that Hive can remain in production throughout the testing with limited disruptions. These disruptions will also impact Firebird. We will remove the reservation currently in place on Hive for these dates.

Impact:

  • Phoenix, PACE-ICE, and COC-ICE will be disabled from 5:00 PM on Friday, June 10, through the morning of Monday, June 13.
  • Hive and Firebird will remain in production, but some services will be unavailable for much of the weekend:
    • Hive OnDemand will be unavailable.
    • PACE license servers will be unavailable. Intel compilers will not be usable, so no code can be compiled with Intel compilers, though previously-compiled binaries can be executed.
    • License servers from the College of Engineering, providing access to MATLAB, Ansys, Abaqus, and Comsol for the entire campus, will not be reachable. Any batch or interactive jobs that attempt to check out a license for these applications will fail. Researchers are encouraged to avoid such jobs just before the outage and to wait until it is complete before submitting them.
    • A number of PACE utilities, such as pace-quota and pace-check-queue, will not function.
    • Other intermittent disruptions are possible.
  • Buzzard will not be impacted.

Thank you for your understanding and cooperation during this campus network testing. Please contact us at pace-support@oit.gatech.edu with any questions or concerns. 

 

[Original announcement 4/27/22 11:45 AM]

Summary: Campus network disaster recovery testing will disable Phoenix, Hive, PACE-ICE, and COC-ICE from 5:00 PM on Friday, June 10, through 12:00 noon on Monday, June 13.  

Details: In accordance with USG security requirements, OIT will be conducting disaster recovery testing on the Georgia Tech campus network during the weekend of June 11, which will close access to most of PACE’s clusters as well as some other campus resources.  PACE’s Phoenix, Hive, PACE-ICE, and COC-ICE clusters will be impacted. Firebird and Buzzard will remain in production.  

Impact: PACE will set a reservation to prevent any jobs from running during the downtime. You will not be able to log in, access your data, or run jobs during the outage.

Longer jobs whose requested walltime would extend into the outage will be held until the testing is complete, just as they are during quarterly maintenance periods. Researchers who run long jobs should note the interval between PACE’s May maintenance period (May 11-13) and the testing period, which begins June 10. In particular, Hive researchers who submit 30-day jobs to the hive-nvme, hive-sas, or hive-nvme-sas queues should note that any 30-day job submitted after April 12 will not begin until at least June 13. Researchers are encouraged to submit jobs with reduced walltimes whenever feasible to make use of the cluster between maintenance and disaster recovery testing.
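The hold rule for long jobs can be sketched as a short calculation. This is only an illustration in Python, not PACE's scheduler logic. The function name `earliest_start` is hypothetical, the clock times for the May maintenance window are assumed (the announcement gives only the dates), and the sketch assumes that resources are otherwise free, so only reservations delay a job.

```python
from datetime import datetime, timedelta

# Reserved windows as (start, end).
# The testing window is taken from the announcement above.
# The maintenance clock times are ASSUMED for illustration only.
RESERVATIONS = [
    (datetime(2022, 5, 11, 0, 0), datetime(2022, 5, 14, 0, 0)),    # May maintenance (times assumed)
    (datetime(2022, 6, 10, 17, 0), datetime(2022, 6, 13, 12, 0)),  # disaster recovery testing
]


def earliest_start(submitted: datetime, walltime: timedelta) -> datetime:
    """Return the earliest start time that avoids every reservation.

    Simplification: only reservations delay the job; queue wait is ignored.
    """
    start = submitted
    for res_start, res_end in sorted(RESERVATIONS):
        # The job is held if its requested walltime would run into the window.
        if start < res_end and start + walltime > res_start:
            start = res_end
    return start


# A 30-day job submitted April 13 is pushed past both windows.
print(earliest_start(datetime(2022, 4, 13, 9, 0), timedelta(days=30)))
# A 20-day job submitted at the same time can start right away.
print(earliest_start(datetime(2022, 4, 13, 9, 0), timedelta(days=20)))
```

Under these assumptions, the 30-day request cannot finish before the May maintenance and, once that ends, cannot finish before the testing window either, so it waits until noon on June 13. The 20-day request fits before the maintenance and is not held.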

Thank you for your understanding and cooperation during this campus network testing. Please contact us at pace-support@oit.gatech.edu with any questions or concerns. 

April 25, 2022

Hive Gateway Resource Now Available to Campus Champions

Filed under: Uncategorized — Semir Sarajlic @ 2:17 pm

Dear Campus Champion Community,

We are pleased to announce the official release of the Hive Gateway at Georgia Tech’s Partnership for an Advanced Computing Environment (PACE) to the Campus Champion community. The Hive Gateway is powered by Apache Airavata and provides access to a portion of the Hive cluster at GT, an NSF MRI-funded supercomputer that delivers nearly 1 Linpack petaflops of computing power. For more hardware details, see https://docs.pace.gatech.edu/hive/resources/.

The Hive Gateway is available to *any* XSEDE researcher via federated login (i.e., CILogon) and offers a variety of applications, including Abinit, Psi4, NAMD, and a Python environment with TensorFlow and Keras installed, among others.

Hive Gateway is accessible via https://gateway.hive.pace.gatech.edu

Our user guide is available at https://docs.pace.gatech.edu/hiveGateway/gettingStarted/ and contains details on the process of getting access. Briefly, to get access to the Hive Gateway, go to “Log In” on the site and select XSEDE credentials via CILogon. This should log you into the gateway and generate a request to our team to approve your gateway access and enable job submissions on the resource.

Please feel free to stop by the Hive Gateway site, try it out, and/or direct your researchers to it.

Cheers!

– The PACE Team

April 22, 2022

Phoenix scheduler outage

Filed under: Uncategorized — Michael Weiner @ 9:30 am

Summary: The Phoenix scheduler became nonresponsive yesterday evening and was restored at approximately 11:30 PM last night.

Details: Yesterday evening, the Torque resource manager on the Phoenix scheduler became overloaded, likely shortly after 7:30 PM. The PACE team restarted the scheduler and restored its function just before 11:30 PM last night.

Impact: Commands such as “qsub” and “qstat” would not have worked, so new jobs could not be submitted. Running jobs were not interrupted.

Thank you for your patience last night. Please contact us at pace-support@oit.gatech.edu with any questions.

Hive project & scratch storage cable replacement

Filed under: Uncategorized — Michael Weiner @ 9:11 am

Summary: Hive project & scratch storage cable replacement and potential for an outage

Details: A cable connecting the Hive GPFS device, hosting project (data) and scratch storage, to one of its controllers needs to be replaced, beginning around 11:30 AM Tuesday (April 26).

Impact: Since there is a redundant controller, no impact is expected. However, a similar previous replacement caused storage to become unavailable, so this is a possibility. If the redundant controller fails, your job may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored. PACE will monitor Hive GPFS storage throughout this procedure. If a loss of availability occurs, we will update you.

If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

April 19, 2022

Launch of Open OnDemand Portal for PACE’s Phoenix and Hive Clusters

Filed under: Uncategorized — Semir Sarajlic @ 9:50 am

Dear PACE Researchers, 

We are pleased to announce the official release of the Open OnDemand (OOD) portal for PACE’s Phoenix and Hive clusters! The OOD portal allows you to access PACE compute resources through your browser and provides a seamless interface for several interactive applications, including Jupyter, MATLAB, and a general interactive desktop environment. Each PACE cluster has its own portal, allowing access to all your data as usual with the Web interface.

In-depth documentation on OOD at PACE is available at https://docs.pace.gatech.edu/ood/guide, and links to the portal for each PACE cluster are listed below: 

Please note that you will need to be on the GT VPN in order to access the OOD portals.

Thursday’s PACE clusters orientation will feature a demo using OOD. To register for upcoming PACE clusters orientation, visit https://b.gatech.edu/3w6ifqO.  

Please direct any questions about Open OnDemand to our ticketing system via email to pace-support@oit.gatech.edu or by filling out a help request form.  

Cheers! 

– The PACE Team 

April 18, 2022

Phoenix scheduler outage

Filed under: Uncategorized — Michael Weiner @ 9:46 am

Summary: The Phoenix scheduler stopped launching new jobs on Friday evening and was restored at approximately 9:30 AM on Saturday.

Details: At some point after 8 PM on Friday evening, the node hosting the Moab workload manager of the Phoenix scheduler lost its network connection, leaving it unable to communicate with the rest of the cluster. The PACE team repaired the connection just before 9:30 AM on Saturday morning, and functionality was restored.

Impact: While jobs could be submitted via “qsub” and checked via “qstat”, no new jobs would launch; they would instead remain queued. Moab commands such as “showq” would not have worked. Running jobs were not interrupted.

Thank you for your patience over the weekend. Please contact us at pace-support@oit.gatech.edu with any questions.

April 14, 2022

Phoenix scheduler outage

Filed under: Uncategorized — Michael Weiner @ 4:44 pm

Summary: The Phoenix scheduler became nonresponsive overnight and was restored at approximately 9:00 AM today.

Details: Last night, the Phoenix scheduler became nonresponsive, likely shortly after midnight. The PACE team restarted the scheduler and restored its function just before 9:00 this morning.

Impact: Commands such as “qsub” and “qstat” would not have worked, so new jobs could not be submitted early this morning. Running jobs were not interrupted.

Thank you for your patience early this morning. Please contact us at pace-support@oit.gatech.edu with any questions.

April 8, 2022

Hive project & scratch storage cable replacement

Filed under: Uncategorized — Michael Weiner @ 9:41 am

Summary: Hive project & scratch storage cable replacement and potential for an outage

Details: A cable connecting the Hive GPFS device, hosting project (data) and scratch storage, to one of its controllers needs to be replaced, beginning around 10:00 AM Tuesday (April 12).

Impact: Since there is a redundant controller, no impact is expected. However, a similar previous replacement caused storage to become unavailable, so this is a possibility. If the redundant controller fails, your job may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored. PACE will monitor Hive GPFS storage throughout this procedure. If a loss of availability occurs, we will update you.

If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

April 1, 2022

[Resolved] Phoenix Charge Account Authorization

Filed under: Uncategorized — Michael Weiner @ 8:49 am

[Update 4/4/22 12:25 PM]

Summary: [Resolved] Free tier charge account balances did not reset on April 1. A manual reset was performed on April 4.

Details: The deleted Perl library that prevented job submissions last Thursday night and Friday morning also caused an error in the monthly reset of free tier charge account balances at midnight on Friday, April 1. Other accounts that reset on a monthly basis were not impacted. PACE manually reset all free tier account balances just before noon today.

Impact: Job submissions to free tier accounts over the last three days would have succeeded only if sufficient leftover balance from March remained. At this time, all free tier accounts have been reset to their full monthly allocation, and jobs run prior to the reset will not count towards April utilization. All faculty and their teams now have access to their full April free tier allocation. Researchers can run the “pace-quota” command to view their available charge accounts and balances.

We apologize for any disruption this may have caused. Please contact us at pace-support@oit.gatech.edu with any questions.

[Original Message 4/1/22 8:49 AM]

Summary: [Resolved] Phoenix users attempting to submit jobs have received an error message that they are not authorized for their charge account.

Details: Beginning yesterday evening, Phoenix users attempting to submit jobs have at times received an error message indicating that they are not authorized for charge accounts to which they should have access. PACE deployed a temporary repair at 6:45 PM yesterday. The issue recurred at midnight, and the temporary repair was again made at 8:15 AM today. We have now identified the root cause as a deleted Perl library on the scheduler and deployed a permanent fix.

Impact: At this time, researchers are again able to submit jobs. Please resubmit any rejected jobs with the usual charge account. Researchers can run the “pace-quota” command to view their available charge accounts and balances. No running jobs were impacted.

We apologize for this disruption. Please contact us at pace-support@oit.gatech.edu with any questions.

 

 
