PACE A Partnership for an Advanced Computing Environment

October 24, 2022

PACE Maintenance Period (November 2 – 4, 2022)

Filed under: Uncategorized — Jeff Valdez @ 4:58 pm

[11/4/2022 Update]

The Phoenix (Moab/Torque and Slurm), Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard clusters are now ready for research and learning. We have released all jobs that were held by the scheduler. 

The second phase of the Phoenix-Slurm cluster migration, covering 300 additional nodes (for a combined total of 800 nodes [out of 1319]), completed successfully, and researchers can resume using the cluster.

The next maintenance period for all PACE clusters is January 31, 2023, at 6:00 AM through February 2, 2023, at 11:59 PM. Additional maintenance periods are tentatively scheduled for 2023 on May 9-11, August 8-10, and October 31-November 2. Additional phases of the Phoenix-Slurm cluster migration are tentatively scheduled for November 29, 2022, and January 4, 17, and 31, 2023.

Status of activities: 

ITEMS REQUIRING USER ACTION: 

  • [Complete] [Hive] The new Hive login servers have different SSH host keys, so your SSH client may show a security warning when you connect. To clear the warning, remove the old Hive entry from your local SSH host key cache (typically the known_hosts file).
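
For illustration only, clearing the cached key amounts to dropping the stale Hive line from the local known_hosts file. The Python sketch below shows that idea on plain text. The hostname login-hive.example.edu and the function name drop_host_entries are hypothetical placeholders, not PACE-provided values, and the sketch only handles unhashed entries. In practice, OpenSSH's `ssh-keygen -R <hostname>` performs the same cleanup and also handles hashed entries.

```python
def drop_host_entries(known_hosts_text: str, host: str) -> str:
    """Return known_hosts text without the unhashed entries for `host`."""
    kept = []
    for line in known_hosts_text.splitlines():
        fields = line.split()
        # Blank lines and comments are not host entries; keep them untouched.
        is_entry = bool(fields) and not fields[0].startswith("#")
        if is_entry:
            # An optional marker such as @cert-authority precedes the host field.
            if fields[0].startswith("@") and len(fields) > 1:
                host_field = fields[1]
            else:
                host_field = fields[0]
            # The host field is a comma-separated list of names and addresses.
            if host in host_field.split(","):
                continue  # stale entry for this host: drop it
        kept.append(line)
    return "\n".join(kept)
```

After the old entry is gone, the next connection should prompt you to accept the new host key instead of showing the mismatch warning.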

ITEMS NOT REQUIRING USER ACTION: 

  • [Complete] [Phoenix] Slurm migration for second phase of Phoenix cluster (300 additional nodes for a combined total of 800 nodes [out of 1319]) 
  • [Complete] [Phoenix] Reconfigure Phoenix in PACE DB 
  • [Complete] [Hive][Storage] Cable replacement for GPFS (project/scratch) controller 
  • [Complete] [Firebird][Storage] Migrate some Firebird projects from current file servers to new file server 
  • [Complete] [Firebird] Reconfigure Firebird in PACE DB 
  • [Complete] [OSG] Update Nvidia drivers 
  • [Complete] [OSG][Network] Remove IB drivers on osg-login2 
  • [Complete] [Datacenter] Transformer repairs 
  • [Complete] [Network] Update VRF configuration on compute racks 
  • [Complete] [Storage] Upgrade Globus to 5.4.50 for new CA 

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

[11/2/2022 Update]

This is a reminder that our PACE Maintenance period has now begun and is scheduled to end at 11:59PM on Friday, 11/04/2022. During this Maintenance Period, all PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard.

Tentative list of activities: 

ITEMS REQUIRING USER ACTION: 

  • [Hive] The new Hive login servers have different SSH host keys, so your SSH client may show a security warning when you connect. To clear the warning, remove the old Hive entry from your local SSH host key cache (typically the known_hosts file).

ITEMS NOT REQUIRING USER ACTION: 

  • [Phoenix] Slurm migration for second phase of Phoenix cluster (300 additional nodes for a combined total of 800 nodes [out of 1319]) 
  • [Phoenix] Reconfigure Phoenix in PACE DB 
  • [Hive][Storage] Cable replacement for GPFS (project/scratch) controller 
  • [Firebird][Storage] Migrate some Firebird projects from current file servers to new file server 
  • [Firebird] Reconfigure Firebird in PACE DB 
  • [OSG] Update Nvidia drivers 
  • [OSG][Network] Remove IB drivers on osg-login2 
  • [Datacenter] Transformer repairs 
  • [Network] Update VRF configuration on compute racks 
  • [Storage] Upgrade Globus to 5.4.50 for new CA 

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

[10/31/2022 Update]

This is a reminder that our next PACE Maintenance period is scheduled to begin later this week at 6:00AM on Wednesday, 11/02/2022, and it is tentatively scheduled to conclude by 11:59PM on Friday, 11/04/2022. As usual, the scheduler will hold any job whose requested resources would keep it running into the Maintenance Period until the maintenance is complete. During this Maintenance Period, all PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard.
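
As a rough illustration of the hold rule described above, the following Python sketch checks whether a job's requested walltime would overlap the maintenance window. It is an assumption-laden sketch, not the scheduler's actual implementation: the function name would_be_held and the simple overlap test are hypothetical, and only the window times come from this announcement.

```python
from datetime import datetime, timedelta

# Window from this announcement: 6:00AM on 11/02/2022 through 11:59PM on 11/04/2022.
MAINTENANCE_START = datetime(2022, 11, 2, 6, 0)
MAINTENANCE_END = datetime(2022, 11, 4, 23, 59)


def would_be_held(earliest_start: datetime, walltime: timedelta) -> bool:
    """Return True if a job starting at `earliest_start` and running for its
    full requested `walltime` would overlap the maintenance window."""
    requested_end = earliest_start + walltime
    return earliest_start < MAINTENANCE_END and requested_end > MAINTENANCE_START
```

Under this rule, requesting a shorter walltime that finishes before 6:00AM on 11/02/2022 would let a job start before the maintenance rather than wait until afterward.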

Tentative list of activities: 

ITEMS REQUIRING USER ACTION: 

  • [Hive] The new Hive login servers have different SSH host keys, so your SSH client may show a security warning when you connect. To clear the warning, remove the old Hive entry from your local SSH host key cache (typically the known_hosts file).

ITEMS NOT REQUIRING USER ACTION: 

  • [Phoenix] Slurm migration for second phase of Phoenix cluster (300 additional nodes for a combined total of 800 nodes [out of 1319]) 
  • [Phoenix] Reconfigure Phoenix in PACE DB 
  • [Hive][Storage] Cable replacement for GPFS (project/scratch) controller 
  • [Firebird][Storage] Migrate some Firebird projects from current file servers to new file server 
  • [Firebird] Reconfigure Firebird in PACE DB 
  • [OSG] Update Nvidia drivers 
  • [OSG][Network] Remove IB drivers on osg-login2 
  • [Datacenter] Transformer repairs 
  • [Network] Update VRF configuration on compute racks 
  • [Storage] Upgrade Globus to 5.4.50 for new CA 

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

[10/24/2022 Early Reminder]

Dear PACE Users,

This is a friendly reminder that our next PACE Maintenance period is scheduled to begin at 6:00AM on Wednesday, 11/02/2022, and it is tentatively scheduled to conclude by 11:59PM on Friday, 11/04/2022. As usual, the scheduler will hold any job whose requested resources would keep it running into the Maintenance Period until the maintenance is complete. During this Maintenance Period, all PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard.

Tentative list of activities:

ITEMS REQUIRING USER ACTION:

  • [Hive] The new Hive login servers have different SSH host keys, so your SSH client may show a security warning when you connect. To clear the warning, remove the old Hive entry from your local SSH host key cache (typically the known_hosts file).

ITEMS NOT REQUIRING USER ACTION:

  • [Phoenix] Slurm migration for second phase of Phoenix cluster (300 additional nodes for a combined total of 800 nodes [out of 1319])
  • [Phoenix] Reconfigure Phoenix in PACE DB
  • [Hive][Storage] Cable replacement for GPFS (project/scratch) controller
  • [Firebird][Storage] Migrate some Firebird projects from current file servers to new file server
  • [Firebird] Reconfigure Firebird in PACE DB
  • [OSG] Update Nvidia drivers
  • [OSG][Network] Remove IB drivers on osg-login2
  • [Datacenter] Transformer repairs
  • [Network] Update VRF configuration on compute racks
  • [Storage] Upgrade Globus to 5.4.50 for new CA

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

Best,

The PACE Team

October 20, 2022

Phoenix Scheduler Outage

Filed under: Uncategorized — Jeff Valdez @ 10:56 am

Summary: The Phoenix scheduler was non-responsive between Wed 10/19/2022 9:30pm and Thurs 10/20/2022 12:30am.

Details: The Torque resource manager on the Phoenix scheduler became non-responsive around 9:30pm last night. We restarted the scheduler at 12:30am this morning.

Impact: Running jobs were not interrupted, but jobs could not be submitted or cancelled while the scheduler was down, including via Phoenix Open OnDemand. Commands such as “qsub” and “qstat” were impacted as well.

Thank you for your patience this morning. Please contact us at pace-support@oit.gatech.edu with any questions.

October 19, 2022

Firebird Storage Outage

Filed under: Uncategorized — Marian Zvada @ 9:48 am
[Update 2022/10/21, 10:00am]

Summary: The Firebird storage outage recurred this morning at approximately 3:45 AM, and repairs were completed at approximately 9:15 AM. ASDL, LANNS, and Montecarlo projects were affected. Orbit and RAMC were not affected.

Details: Storage for three Firebird projects became unavailable this morning, and PACE has now restored the system. Jobs that failed at the time of the outage will be refunded. At this time, we have adjusted several settings, and we continue investigating the root cause of the issue.

Impact: Researchers on ASDL, LANNS, and Montecarlo would have been unable to access Firebird this morning. Running jobs on these projects would have failed as well. Please resubmit any failed job to run it again.

Thank you for your patience as we restored the system this morning. Please contact us at pace-support@oit.gatech.edu if you have any questions.

[Update 2022/10/19, 10:00am CST]

Everything is back to normal on Firebird. We apologize for any inconvenience.

[Original post]

We are having an issue with Firebird storage. Jobs on ASDL, LANNS, and Montecarlo are affected. Rebooting the storage server is also causing login node issues on LANNS and Montecarlo. We are actively working to resolve these issues and expect them to be resolved by noon today.

Orbit and RAMC are not affected by this storage outage.

Please contact us at pace-support@oit.gatech.edu if you have any questions.

October 3, 2022

Firebird inaccessible

Filed under: Uncategorized — Michael Weiner @ 9:41 am

[Update 10/3/22 10:45 AM]

Access to Firebird and the PACE VPN has been restored, and all systems should be functioning normally. If you do not see the PACE VPN as an option in the GlobalProtect client, please disconnect from the GT VPN and reconnect for it to appear again.

Urgent maintenance on the GlobalProtect VPN device on Thursday night inadvertently led to the loss of PACE VPN access, which was restored this morning.

Please contact us at pace-support@oit.gatech.edu with questions, or if you are still unable to access Firebird.

 

[Original Message 10/3/22 9:40 AM]

Summary: The Firebird cluster and PACE VPN are currently inaccessible. OIT is working to restore access.

Details: The Firebird cluster was found to be inaccessible over the weekend. PACE is working with OIT colleagues to identify the cause and restore access.

Impact: Researchers are unable to connect to the PACE VPN or access the Firebird cluster.

Thank you for your patience as we work to restore access. Please contact us at pace-support@oit.gatech.edu with questions.
