PACE A Partnership for an Advanced Computing Environment

September 29, 2022

Phoenix Project & Scratch Storage Cables Replacement

Filed under: Uncategorized — Marian Zvada @ 5:25 pm

[Update 2022/10/05, 12:40PM CST]

Work has been completed on one cable, and the associated systems connecting to the storage have been restored to normal. We will assess the stability of the system after this first cable replacement and schedule the second cable replacement for sometime next week.

 

[Update 2022/10/05, 10:10AM CST]

The work is still ongoing, and we are experiencing issues with one of the cable replacements. While a redundant controller is still in place, we have already identified an impact on some users whose data are not currently accessible. To minimize the impact on the system, we have decided to pause the scheduler to prevent new jobs from starting and crashing. Running jobs may be impacted by the storage outage.

Please be mindful about opening a new ticket with pace-support@oit.gatech.edu if your issue is storage-related.

 

[Original post]

Summary: Phoenix project & scratch storage cable replacement potential outage and subsequent temporary decreased performance

Details: Two cables connecting enclosures of the Phoenix Lustre device, which hosts project and scratch storage, both need to be replaced, beginning around 10AM Wednesday, October 5th, 2022. The cables will be replaced one at a time, and the work is expected to take about 4 hours. After the replacement, pools will need to rebuild over the course of about a day.

Impact: Since there is a redundant controller while work is done on one cable, there should not be an outage during the cable replacement. However, a similar previous replacement caused storage to become unavailable, so this is a possibility. If this happens, your job may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored. In addition, performance will be slower than usual for a day following the repair as pools rebuild. Jobs may progress more slowly than normal. If your job runs out of wall time and is cancelled by the scheduler, please resubmit it to run again. PACE will monitor Phoenix Lustre storage throughout this procedure. In the event that a loss of availability occurs, we will update you.

Please accept our sincere apology for any inconvenience that this temporary limitation may cause you. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

September 19, 2022

Phoenix Cluster Migration to Slurm Scheduler

Filed under: Uncategorized — Jeff Valdez @ 5:15 pm
Dear Phoenix researchers,

The Phoenix cluster will be migrating to the Slurm scheduler over the next couple of months with the first phase scheduled for October 10! PACE has worked closely with the PACE Advisory Committee (PAC) on a plan for the migration to ensure minimum interruption to research. Slurm is a widely popular, open-source scheduler on many research computing clusters, so you may have experienced it elsewhere. If commands like ‘sbatch’ and ‘squeue’ sound familiar to you, then you have used Slurm! Phoenix will be the second cluster (after Hive) in PACE’s transition from Torque/Moab to Slurm. We expect the new scheduler to provide improved stability and reliability, offering a better user experience. We will be updating our software stack at the same time and offering support with orientation and consulting sessions to facilitate this migration.

Phased Migration
The phased transition is planned in collaboration with the faculty-led PACE Advisory Committee, which is composed of a representative group of PACE and faculty members. We are planning a staggered, phased migration for the Phoenix cluster. The six phases include the following dates and numbers of nodes:
  • October 10, 2022 – 500 nodes
  • November 2, 2022 (PACE Maintenance Period) – 300 nodes
  • November 29, 2022 – 200 nodes
  • January 4, 2023 – 100 nodes
  • January 17, 2023 – 100 nodes
  • January 31, 2023 (PACE Maintenance Period) – 119 nodes

The first phase will begin October 10, during which 500 Phoenix compute nodes (of 1319 total) will join our new “Phoenix-Slurm” cluster while the rest will remain on the existing Phoenix cluster. The 500 nodes will represent each existing node type proportionally. Following the first phase, we strongly encourage all researchers to begin shifting their workflows over to the Slurm-based side of Phoenix to take advantage of the improved features and queue wait times. As part of the phased migration approach, researchers will continue to have access to the existing Phoenix cluster until the final phase of this migration to ensure minimum interruption to research. Users will receive detailed communication on how to connect to the Phoenix-Slurm cluster, along with other documentation and training.
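As a quick check of the schedule above, the six phases can be tallied to confirm that they add up to the 1319 nodes mentioned and to see how much of the cluster is on Slurm after each phase. This is only an illustrative sketch; the function and variable names are our own and are not part of any PACE tool.

```python
from datetime import date
from itertools import accumulate

# Phase dates and node counts as listed in the announcement.
PHASES = [
    (date(2022, 10, 10), 500),
    (date(2022, 11, 2), 300),
    (date(2022, 11, 29), 200),
    (date(2023, 1, 4), 100),
    (date(2023, 1, 17), 100),
    (date(2023, 1, 31), 119),
]
TOTAL_NODES = 1319  # total Phoenix compute nodes stated in the announcement


def cumulative_migration(phases=PHASES, total=TOTAL_NODES):
    """Return (date, nodes_on_slurm, percent_of_total) after each phase."""
    running = accumulate(count for _, count in phases)
    return [
        (day, migrated, round(100 * migrated / total, 1))
        for (day, _), migrated in zip(phases, running)
    ]


if __name__ == "__main__":
    for day, migrated, pct in cumulative_migration():
        print(f"{day.isoformat()}: {migrated} of {TOTAL_NODES} nodes on Phoenix-Slurm ({pct}%)")
```

The running totals come to 500, 800, 1000, 1100, 1200, and 1319 nodes, so the final phase completes the migration.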

Software Stack
In addition to the scheduler migration, another significant change for researchers on Phoenix will be an update to the PACE Apps software stack. The Phoenix-Slurm cluster will feature a new set of provided applications listed in our documentation. Please review this list of software we plan to offer on Phoenix post-migration and let us know via email (pace-support@oit.gatech.edu) if any software you are currently using on Phoenix is missing from that list. We encourage you to let us know as soon as possible to avoid any potential delay to your research as the migration process concludes. We have reviewed batch job logs to determine packages in use and upgraded them to the latest version. Researchers installing or writing their own software will also need to recompile applications to reflect new MPI and other libraries.

Starting after the November PACE Maintenance period (November 2), we will no longer be accepting software installation requests for new software on the existing Phoenix cluster with Torque/Moab. All software requests after November 2 will be for Phoenix-Slurm. Additionally, all new researcher groups joining PACE after November 2 will be onboarded onto Phoenix-Slurm only.

Billing
You will notice a few other changes to Phoenix in the new environment. As with the current Phoenix cluster, faculty and their research teams will receive the full free tier monthly allocation, equivalent to 10,000 CPU*hours on base hardware and usable on all architectures, on Phoenix-Slurm (in addition to the one on the existing Phoenix cluster) as well as access to Embers, our free backfill queue. We will be charging users for jobs on Phoenix-Slurm.

For prepaid accounts (including CODA20 refresh accounts), PACE will split your account balances 50/50 on the Phoenix-Slurm and existing Phoenix (with Torque/Moab) clusters during the migration. For new computing credits purchased after Nov 1st, 75% will be allocated to the Phoenix-Slurm cluster. For new computing credits purchased after Jan 3, 100% will be allocated to the Phoenix-Slurm cluster.
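The allocation rule for new prepaid credits can be sketched as a small function. This is a hedged illustration, not PACE's billing logic. It assumes the cutoff years are 2022 and 2023 (the post gives only "Nov 1st" and "Jan 3"), that "after" means strictly later than the cutoff date, and that credits purchased on or before Nov 1 follow the same 50/50 split as existing balances. All names here are hypothetical.

```python
from datetime import date

# Cutoff dates from the announcement. The years are an assumption inferred
# from the migration timeline; the post itself gives only month and day.
CUTOFF_75 = date(2022, 11, 1)
CUTOFF_100 = date(2023, 1, 3)


def slurm_fraction(purchase_date):
    """Fraction of a new credit purchase allocated to Phoenix-Slurm.

    Assumptions: "after" means strictly later than the cutoff date, and
    purchases on or before Nov 1 follow the 50/50 split of existing balances.
    """
    if purchase_date > CUTOFF_100:
        return 1.0
    if purchase_date > CUTOFF_75:
        return 0.75
    return 0.5


def split_credits(amount, purchase_date):
    """Return (phoenix_slurm_share, existing_phoenix_share) for a purchase."""
    slurm_share = round(amount * slurm_fraction(purchase_date), 2)
    return slurm_share, round(amount - slurm_share, 2)
```

Under these assumptions, a purchase of 1000 credits on December 1, 2022 would be split as 750 on Phoenix-Slurm and 250 on the existing Phoenix cluster.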

For postpaid (monthly) accounts, PACE will set the same limit on Phoenix-Slurm as on the existing Phoenix cluster, based on existing credits. Please be aware that for postpaid accounts this could lead to monthly overcharges if users were to run on both clusters up to 100% of the limit. However, we wanted to allow researchers to have access to their full monthly limit for flexibility. For postpaid accounts, Principal Investigators and users are responsible for tracking their spending on the Phoenix-Slurm and Phoenix clusters to avoid going over budget.

Support
PACE will provide documentation, training sessions, and additional support (e.g., increased frequency of PACE consulting sessions) to aid you as you transition your workflows to Slurm. Prior to the launch, we will have updated documentation as well as a guide for converting job scripts from PBS to Slurm-based commands. We will also offer specialized training virtual sessions (PACE Slurm Orientation) on the use of Slurm on Phoenix. Additionally, we have increased the frequency of our PACE consulting sessions during this migration phase for the Fall and Spring semesters, and you are invited to join our PACE Consulting Sessions or to email us for support. The schedule for the upcoming PACE Phoenix-Slurm orientation sessions will be provided in future communications.
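To give a rough idea of what converting a PBS job script involves, here is a minimal sketch that maps a few common Torque/PBS commands and directives to their usual Slurm equivalents. It is not PACE's conversion guide, and it covers only a handful of simple cases. The function name and tables are our own, and queue names on the existing cluster may not correspond one-to-one to Slurm partition names, so please rely on the official documentation once it is available.

```python
# Commonly cited Torque/PBS to Slurm command equivalents.
COMMANDS = {"qsub": "sbatch", "qstat": "squeue", "qdel": "scancel"}

# A few simple "#PBS" flags with direct "#SBATCH" counterparts.
DIRECTIVE_FLAGS = {"-N": "--job-name", "-q": "--partition", "-A": "--account"}


def translate_directive(line):
    """Translate one '#PBS' line to '#SBATCH' where the mapping is simple.

    Lines that are not recognized are returned unchanged so they can be
    reviewed by hand.
    """
    parts = line.split()
    if len(parts) == 3 and parts[0] == "#PBS":
        flag, value = parts[1], parts[2]
        if flag in DIRECTIVE_FLAGS:
            return f"#SBATCH {DIRECTIVE_FLAGS[flag]}={value}"
        if flag == "-l" and value.startswith("walltime="):
            return "#SBATCH --time=" + value[len("walltime="):]
    return line
```

For example, translate_directive("#PBS -l walltime=01:00:00") returns "#SBATCH --time=01:00:00", while resource requests such as "#PBS -l nodes=2:ppn=4" are left untouched because they need to be rewritten manually.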

We are excited to launch Slurm on Phoenix as we continue to improve Georgia Tech’s research computing infrastructure, and we will be providing additional information and support in the coming weeks through documentation, support tickets, and live sessions. Please contact us with any questions or concerns about this transition.

Best,
-The PACE Team

September 12, 2022

Phoenix Project & Scratch Storage Cable Replacement

Filed under: Uncategorized — Marian Zvada @ 3:44 pm

[Update 09/16/2022 2:18 PM]

Work was completed on Sep 15 as scheduled in the original post.

[Original post: 09/12/2022 3:40PM]

Summary: Phoenix project & scratch storage cable replacement potential outage and subsequent temporary decreased performance

Details: A cable connecting one enclosure of the Phoenix Lustre device, hosting project and scratch storage, to one of its controllers needs to be replaced, beginning around 1PM Thursday, September 15th, 2022. After the replacement, pools will need to rebuild over the course of about a day.

Impact: Since there is a redundant controller, there should not be an outage during the cable replacement. However, a similar previous replacement caused storage to become unavailable, so this is a possibility. If this happens, your job may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored. In addition, performance will be slower than usual for a day following the repair as pools rebuild. Jobs may progress more slowly than normal. If your job runs out of wall time and is cancelled by the scheduler, please resubmit it to run again. PACE will monitor Phoenix Lustre storage throughout this procedure. In the event that a loss of availability occurs, we will update you.

Please accept our sincere apology for any inconvenience that this temporary limitation may cause you. If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

Hive Project & Scratch Storage Cable Replacement

Filed under: Uncategorized — Marian Zvada @ 3:40 pm

[Update 09/16/2022 2:18 PM]

Work was completed on Sep 15 as scheduled in the original post.

[Original post: 09/12/2022 3:40PM]

Summary: Hive project & scratch storage cable replacement and potential for an outage.

Details: Two cables connecting the Hive GPFS device, which hosts project (data) and scratch storage, to one of its controllers need to be replaced, beginning around 10AM Thursday, September 15th, 2022.

Impact: Since there is a redundant controller, no impact is expected. However, a similar previous replacement caused storage to become unavailable, so this is a possibility. If the redundant controller fails, your job may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored. PACE will monitor Hive GPFS storage throughout this procedure. In the event that a loss of availability occurs, we will update you.

If you have any questions or concerns, please direct them to pace-support@oit.gatech.edu.

Phoenix Scheduler Outage

Filed under: Uncategorized — Deepa Phanish @ 11:24 am

Summary: The Phoenix scheduler became non-responsive on Friday 9/9/2022 between 7:30pm and 10pm.

Details: The Torque resource manager on the Phoenix scheduler crashed unexpectedly around 7:30 PM. A bad GPU node with the same error message caused a segmentation fault on the server, and the crashing scheduler corrupted a handful of queued jobs with dependencies, requiring some pruning of those records from the system. Around 10pm, the node causing issues was purged from the scheduler and the corrupted jobs were removed, restoring normal operations.

Impact: Running jobs were not interrupted. Commands such as “qsub” and “qstat” were impacted, so no new jobs could be submitted while the scheduler was down, including via Phoenix Open OnDemand. Corrupted jobs in the queue were cancelled.

Thank you for your patience. Please contact us at pace-support@oit.gatech.edu with any questions.

September 5, 2022

Data Center Power outage

Filed under: Uncategorized — Marian Zvada @ 6:42 pm

[Update 09/07/2022 10:23 AM]

Cooling has been fully restored to the datacenter, and operations have resumed on all PACE clusters, including Phoenix, Hive, Firebird, Buzzard, PACE-ICE, and COC-ICE. Jobs are now running, and new jobs may be submitted via the command line or Open OnDemand.

Any jobs that were running at the time of the outage have been cancelled and must be resubmitted by the researcher.

Refunds will be issued for jobs that were cancelled on clusters with charge accounts (Phoenix and Firebird).

Thank you for your patience as emergency repairs were completed. Please contact us at pace-support@oit.gatech.edu with any questions.

 

[Update 09/06/2022 4:51 PM]

Summary: Unfortunately, the same issue that happened yesterday with the primary cooling loop happened today with the secondary cooling loop. OIT operations and Databank requested that we power off all the compute nodes to repair the secondary cooling loop. We have captured all job information for Phoenix, Hive, COC-ICE, PACE-ICE, and Firebird and we will refund all running jobs on Firebird and Phoenix.

Impact: Researchers won’t be able to submit new jobs to the clusters either from the command line or from the Open OnDemand web server. Existing running jobs will be killed, and computing credits will be refunded where applicable. Partially run jobs will lose their progress if checkpointing is not supported in the job.

Sorry for the inconvenience and thank you for your patience; please contact us at pace-support@oit.gatech.edu with any questions. We will keep you posted with any updates.

 

[Update 09/06/2022 11:10 AM]

Cooling has been restored to the datacenter, and operations have resumed on all PACE clusters, including Phoenix, Hive, Firebird, Buzzard, PACE-ICE, and COC-ICE. Jobs are now running, and new jobs may be submitted via the command line or Open OnDemand. Any jobs that were running at the time of the outage have been cancelled and must be resubmitted by the researcher. Refunds will be issued for jobs that were cancelled on clusters with charge accounts (Phoenix and Firebird).

Thank you for your patience as emergency repairs were completed to avoid damage to the datacenter. Please contact us at pace-support@oit.gatech.edu with any questions.

 

[Update 09/06/2022 9:58 AM]

Around 8:00PM on 09/05/2022, OIT operations and Databank requested that PACE power off all the compute nodes to avoid additional issues. PACE captured all job information for Phoenix, Hive, COC-ICE, PACE-ICE, and Firebird, and we will refund all running jobs on Firebird and Phoenix after the clusters have been brought back online.

Cooling has now been restored in the Coda datacenter, and as of about 7:00am (09/06/22), PACE has been cleared to bring the clusters online and test them before releasing them to users.

 

[Update 09/05/2022 8:45 PM]

Summary: Unfortunately, cooling tower issues continue. OIT operations and Databank requested that we power off all the compute nodes to avoid additional issues. We have captured all job information for Phoenix, Hive, COC-ICE, PACE-ICE, and Firebird and we will refund all running jobs on Firebird and Phoenix.

Impact: Researchers won’t be able to submit new jobs to the clusters either from the command line or from the Open OnDemand web server. Existing running jobs will be killed, and computing credits will be refunded where applicable. Partially run jobs will lose their progress if checkpointing is not supported in the job.

Sorry for the inconvenience and thank you for your patience; please contact us at pace-support@oit.gatech.edu with any questions. We will keep you posted with any updates.

[Original post]

Summary: One of the cooling towers in the CODA data center has issues, and the temperature is rising. We need to pause all PACE cluster schedulers, and possibly power down all compute nodes.

Impact: Researchers won’t be able to submit new jobs to the clusters either from the command line or the Open OnDemand web server; the existing jobs should continue running.

Thank you for your patience this afternoon; please contact us at pace-support@oit.gatech.edu with any questions. We will keep you posted with any updates.

September 2, 2022

Phoenix scheduler outage

Filed under: Uncategorized — Marian Zvada @ 4:33 pm

[Update 9/2/22 5:23 PM]

The PACE team has identified an issue with the Phoenix scheduler and restored functionality. The scheduler is currently back up and running and new jobs can be submitted. We will continue to monitor scheduler performance and we appreciate your patience as we work through this. Please contact us at pace-support@oit.gatech.edu with any questions.

[Original Post 9/2/22 4:33 PM]

Summary: The Phoenix scheduler became non-responsive this afternoon around 4pm.

Details: The Torque resource manager on the Phoenix scheduler shut down unexpectedly around 4:00 PM. The PACE team is working on a resolution.

Impact: Commands such as “qsub” and “qstat” are impacted, so new jobs cannot be submitted, including via Phoenix Open OnDemand, and currently running jobs cannot be queried. Running jobs are not expected to be interrupted.

Thank you for your patience this afternoon. Please contact us at pace-support@oit.gatech.edu with any questions.
