PACE: A Partnership for an Advanced Computing Environment

August 31, 2023

Upcoming Firebird Slurm Migration Announcement

Filed under: Uncategorized — Marian Zvada @ 10:50 am

The Firebird cluster will be migrating to the Slurm scheduler on October 24-26, 2023. PACE has developed a plan to transition researchers' workflows smoothly. As you may be aware, PACE began the Slurm migration in July 2022, and we have already successfully migrated the Hive, Phoenix, and ICE clusters. Firebird is the last cluster in PACE's transition from Torque/Moab to Slurm, bringing increased job throughput and better scheduling policy enforcement. The new scheduler will better support the new hardware to be added soon to Firebird. We will be updating our software stack at the same time and offering support with orientation and consulting sessions to facilitate this migration.

Software Stack 

In addition to the scheduler migration, the PACE Apps central software stack will also be updated. This software stack supports the Slurm scheduler and runs successfully on Phoenix/Hive/ICE. The Firebird cluster will feature the provided applications listed in our documentation. Please review this list of non-CUI software we will offer on Firebird post-migration and let us know via email (pace-support@oit.gatech.edu) if any PACE-installed software you are currently using on Firebird is missing from the list. If you already submitted a reply to the application survey sent to Firebird PIs, there is no need to repeat requests. Researchers installing or writing custom software will need to recompile applications against the new MPI and other libraries once the new system is ready.
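As a sketch of what that recompilation might look like once the new stack is live (the module names and versions here are illustrative, not Firebird's actual stack; check 'module avail' after migration):

    # Load the new compiler and MPI modules (names are hypothetical)
    module load gcc mvapich2
    # Recompile the application against the new MPI libraries
    mpicc -O2 -o my_app my_app.c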
 
Effective September 1st, 2023, we will freeze new software installations in the PACE central software stack on the Torque-based system. You can continue installing software under your local/shared space without interruption.
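For reference, a typical installation into your own space follows the usual pattern below (paths and package names are placeholders for illustration):

    # Build and install a package under your home directory
    ./configure --prefix=$HOME/sw/mypkg
    make
    make install
    # Make the locally installed binaries available in your shell
    export PATH=$HOME/sw/mypkg/bin:$PATH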

No Test Environment 

Due to security and capacity constraints, it is infeasible to use a progressive rollout approach as we did for Phoenix and Hive, so there will not be a test environment. For researchers installing or writing their own software, we highly recommend the following:

  • For those with access to Phoenix, compile non-CUI software on Phoenix now and report any issues you encounter so that we can help you before the migration.
  • Please report any self-installed CUI software you need that cannot be tested on Phoenix. We will do our best to have all dependent libraries ready and will give higher priority to assisting with reinstallation immediately after the Slurm migration.

Support 

PACE will provide documentation, training sessions [register here], and support (consulting sessions and 1-1 sessions) to aid your workflow transition to Slurm. Documentation and a guide for converting job scripts from PBS to Slurm-based commands will be ready before the migration. We will offer Slurm training right after the migration; future communications will provide the schedule. You are welcome to join our PACE Consulting Sessions or to email us for support.
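As a preview of the kind of conversion the guide will cover, here is a minimal PBS job script alongside a Slurm equivalent (job names and resource values are illustrative, not Firebird's actual configuration):

    # PBS (Torque/Moab) version, submitted with 'qsub'
    #PBS -N myjob
    #PBS -l nodes=1:ppn=8
    #PBS -l walltime=02:00:00
    cd $PBS_O_WORKDIR
    ./my_app

    # Slurm version of the same job, submitted with 'sbatch'
    #!/bin/bash
    #SBATCH -J myjob
    #SBATCH -N 1
    #SBATCH --ntasks-per-node=8
    #SBATCH -t 02:00:00
    cd $SLURM_SUBMIT_DIR
    ./my_app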

We are excited to launch Slurm on Firebird to improve Georgia Tech’s research computing infrastructure! Please contact us with any questions or concerns about this transition. 

August 26, 2023

All PACE Clusters Down Due to Cooling Failure

Filed under: Uncategorized — Michael Weiner @ 9:40 pm

[Update 8/27/23 11:10 AM]

All PACE clusters have returned to service.

The datacenter cooling pump was replaced early this morning. After powering on compute nodes and testing, PACE resumed jobs on all clusters. On clusters which charge for use (Phoenix and Firebird), jobs that were cancelled yesterday evening when compute nodes were turned off will be refunded. Please submit new jobs to resume your work.
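If you are unsure which of your jobs were affected, Slurm's accounting query can list jobs cancelled during the outage window on the Slurm-based clusters such as Phoenix (adjust the dates as needed):

    # List your jobs cancelled during the outage window
    sacct -u $USER -S 2023-08-26 -E 2023-08-27 \
          --state=CANCELLED --format=JobID,JobName,State,Elapsed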

Thank you for your patience during this emergency repair.

[Original Post 8/26/23 9:40 PM]

Summary: A pump in the Coda Datacenter cooling system overheated on Saturday evening. All PACE compute nodes across all clusters (Phoenix, Hive, Firebird, Buzzard, and ICE) have been shut down until cooling is restored, stopping all compute jobs.

Details: Databank has reported an issue with the high-temperature condenser pump in the Research Hall of the Coda datacenter, which hosts PACE compute nodes. The Research Hall is being powered off so that Databank facilities can replace the pump.

Impact: All PACE compute nodes are unavailable. Running jobs have been cancelled, and no new jobs can start. Login nodes and storage systems remain available. Compute nodes will remain off until the cooling system is repaired.

August 21, 2023

Phoenix Storage Cables and Hard Drive Replacement

Filed under: Uncategorized — Jeff Valdez @ 1:08 pm

WHAT’S HAPPENING?

Two SAS cables and one hard drive in Phoenix's Lustre storage need to be replaced. The replacement will take about 2 hours to complete.

WHEN IS IT HAPPENING?

Tuesday, August 22nd, 2023, starting at 10 AM EDT.

WHY IS IT HAPPENING?

Required maintenance.

WHO IS AFFECTED?

All users may experience a storage access outage and subsequent temporarily decreased performance.

WHAT DO YOU NEED TO DO?

During the cable replacement, one of the controllers will be shut down and the redundant controller will take all the traffic. Data access should be preserved, but there have been cases where storage has become inaccessible. If storage becomes unavailable during the replacement, your job may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored.
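A minimal sketch of that recovery step, assuming a Slurm job (the job ID and script name below are placeholders):

    # Cancel the stalled job
    scancel 123456
    # Resubmit once storage access is restored
    sbatch my_job.sbatch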

WHO SHOULD YOU CONTACT FOR QUESTIONS?

For any questions, please contact PACE at pace-support@oit.gatech.edu.

Phoenix Slurm Scheduler Outage

Filed under: Uncategorized — Jeff Valdez @ 11:17 am

[Update 8/21/23 5:02 PM]

Dear Phoenix Users, 

The Slurm scheduler on Phoenix is back up and available. We have applied the patch that was recommended by SchedMD, the developer of Slurm; cleaned the database; and run tests to confirm that the scheduler is running correctly. We will continue to monitor the scheduler database for any other issues.

Existing jobs that have been queued should have already started or will start soon. You should be able to submit new jobs on the scheduler without issue. We will refund any jobs that failed due to the scheduler outage.
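A quick way to confirm submissions are working again (the script name is a placeholder):

    sbatch my_job.sbatch   # submit a job as usual
    squeue -u $USER        # the new job should appear as pending or running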

Thank you for your patience today as we repaired the Phoenix cluster. For questions, please contact PACE at pace-support@oit.gatech.edu.

Thank you,

-The PACE Team 

[Update 8/21/23 3:20 PM]

Dear Phoenix Users, 

We have been working with the Slurm scheduler vendor, SchedMD, to identify and fix a corrupted association in the scheduler database and provide a patch. In troubleshooting the scheduler this afternoon, some jobs were able to be scheduled. We are going to pause the scheduler again to make sure the database cleanup can be completed without disruption from new jobs. 

Based on our estimates, we expect to restore the scheduler later tonight. We will provide an update as soon as the scheduler is back in service.

Thank you, 

-The PACE Team 

[Update 8/21/23 11:17 AM]

Dear Phoenix Users, 

Unfortunately, the Slurm scheduler controller is down due to issues with Slurm’s database and jobs are not able to be scheduled. We have submitted a high-priority service request to SchedMD, the developer of Slurm, and should be able to provide an update soon. 

Jobs currently running will likely complete, but we recommend reviewing their output as there may be unexpected errors. Jobs waiting in the queue will stay in the queue until the scheduler is fixed.

The rest of the Phoenix cluster infrastructure (e.g., login nodes and storage) outside of the scheduler should be working. We recommend not running commands that interact with Slurm (e.g., 'sbatch', 'srun', 'sacct', or 'pace-quota'), because they will not work at this time.
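One safe way to check whether the controller is reachable again, without queuing anything, is 'scontrol ping', which reports the controller's status and simply returns an error while it is down:

    # Check whether the Slurm controller is responding
    scontrol ping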

We will provide updates soon as we work on fixing the scheduler. 

Thank you, 

-The PACE Team 

August 6, 2023

Phoenix Scratch Storage Outage

Filed under: Uncategorized — Michael Weiner @ 1:33 pm

[Update 8/7/23 9:34 PM]

Access to Phoenix scratch developed issues again as of 10:19 PM last night (Sunday). We paused the scheduler and restarted the controller around 6 AM this morning (Monday).

Access to Phoenix scratch has been restored, and the scheduler has resumed allowing new jobs to begin. Jobs that failed due to the scratch outage, which began at 10:19 PM Sunday and ended this morning at 9:24 AM Monday, will be refunded. We continue to work with our storage vendor to identify what caused the controller to freeze.

Thank you for your patience as we restored scratch storage access today. Please contact us at pace-support@oit.gatech.edu with any questions.

[Update 8/6/23 2:25 PM]

Access to Phoenix scratch has been restored, and the scheduler has resumed allowing new jobs to begin. Jobs that failed due to the scratch outage, which began at 9:30 PM Saturday, will be refunded. We continue to work with our storage vendor to identify what caused the controller to freeze.

Thank you for your patience as we restored scratch storage access today. Please contact us at pace-support@oit.gatech.edu with any questions.

[Original Post 8/6/23 1:30 PM]

Summary: Phoenix scratch storage is currently unavailable, which may impact access to directories on other Phoenix storage systems. The Phoenix scheduler is paused, so no new jobs can start.

Details: A storage target controller on the Phoenix scratch system became unresponsive just before midnight on Saturday evening. The Phoenix scheduler crashed shortly before 7 AM Sunday morning due to the number of failures to reach scratch directories. PACE restarted the scheduler around 1 PM today (Sunday), restoring access, while also pausing it to prevent new jobs from starting.

Impact: The network scratch filesystem on Phoenix is inaccessible. Due to the symbolic link to scratch, an ls of Phoenix home directories may also hang. Access via Globus may also time out. Individual directories on the home storage device may be reachable if an ls of the main home directory is not performed. Scheduler commands, such as squeue, were not available this morning but have now been restored. As the scheduler is paused, any new jobs submitted will not start at this time. There is no impact to project storage.
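As a workaround sketch consistent with the above, listing a known subdirectory directly avoids touching the scratch symlink that a listing of the whole home directory would reach (the subdirectory name is a placeholder):

    # May hang while scratch is down, because it touches the scratch symlink:
    ls ~
    # Reaching a specific subdirectory directly should still work:
    ls ~/my_project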

Thank you for your patience as we investigate this issue. Please contact us at pace-support@oit.gatech.edu with any questions.
