PACE: A Partnership for an Advanced Computing Environment

November 6, 2023

Resolved – Scratch Space Outage on the Phoenix Cluster

Filed under: Uncategorized — Jeff Valdez @ 9:34 am

[Update 11/6/2023 at 12:26 pm]

Dear Phoenix users,

Summary: The Phoenix cluster is back online. The scheduler has been unpaused, and jobs that were held have now resumed.

Details: The PACE support team has upgraded components of the scratch storage system (controller software and disk firmware) according to the plan provided by the hardware vendor (DDN). We have tested the performance of the file system, and the tests have passed.

Impact: Please continue using the Phoenix cluster as usual. In case of issues, please contact us at pace-support@oit.gatech.edu. Also, please keep in mind that the cluster will be offline tomorrow (November 7) from 8am until 8pm so the PACE team can work on fixing the project storage (which is an unrelated issue). 

Thank you and have a great day!

The PACE Team

[Update 11/6/2023 at 9:27 am]

Dear Phoenix users, 

Summary: Storage performance on Phoenix scratch space is degraded. 

Details: Around 11pm on Saturday (November 4, 2023), the scratch space on the Phoenix cluster became unresponsive, and it is currently inaccessible to users. The PACE team is investigating the situation and applying an upgrade recommended by the vendor to improve stability. To prevent additional job failures, we paused the scheduler on Phoenix at 8:13am on Monday, November 6. The upgrade is estimated to take until 12pm on Monday. After the upgrade is installed, the scheduler will be released, and paused jobs will resume. This issue is not related to the Phoenix project storage slowness reported last week, which will be addressed during the Phoenix outage tomorrow (November 7).

Impact: Users of the Phoenix cluster are currently unable to access the scratch storage. Jobs on the Phoenix cluster have been paused, and new jobs will not start until the scheduler is resumed. Other PACE clusters (ICE, Hive, Firebird, Buzzard) are not affected.

We apologize for the multiple storage-access issues observed on the Phoenix cluster. We are continuing to engage with the storage vendor to improve the performance of our system. The recommended upgrade is in progress, and the cluster will be offline tomorrow to address the project filesystem issue.

Thank you for your patience!

The PACE Team

November 3, 2023

Degraded Phoenix Project storage performance

Filed under: Uncategorized — Jeff Valdez @ 1:53 pm

[Update 11/12/2023 11:15 PM]

The rebuild process completed on Sunday afternoon, and the system has returned to normal performance.

[Update 11/11/2023 6:40 PM]

Unfortunately, the rebuild is still in progress. Another drive has failed, which is slowing down the rebuild. We continue to monitor the situation closely.

[Update 11/10/2023 4:30 PM]

Summary: The project storage on Phoenix (/storage/coda1) is degraded due to hard drive failures. Access to the data is not affected; the scheduler continues to accept and process jobs.

Details: Two hard drives that are part of the Phoenix storage space failed on the morning of Friday, November 10, 2023 (the first drive failed at 8:05 am, and the second drive failed at 11:10 am). The operating system automatically activated some spare drives and started rebuilding the pool. During this process, file read and write operations by Phoenix users will take longer than usual. The rebuild is expected to end around 3 am on Saturday, November 11, 2023 (our original estimate of 7pm, Nov 10 2023, was too optimistic).  

Impact: During the rebuild, file input/output operations are slower than usual. Access to the Phoenix cluster is not impacted, and the scheduler is processing jobs at a normal rate.

We thank you for your patience as we work to resolve the problem.

[Update 11/8/2023 at 10:00 am]

Summary: Phoenix project storage experienced degraded performance overnight. PACE and our storage vendor made an additional configuration change this morning to restore performance.  

Details: Following yesterday’s upgrade, Phoenix project storage became degraded overnight, though to a lesser extent than prior to the upgrade. Early this morning, the PACE team found that performance was slower than normal and began working with our storage vendor to identify the cause. We adjusted a parameter that handles migration of data between disk pools, and performance was restored.  

Impact: Reading or writing files on the Phoenix project filesystem (coda1) may have been slower than usual last night and this morning. The prior upgrade mitigated the problem, so the slowdown was less severe than before. Home and scratch directories were not affected.

Thank you for your patience as we completed this repair.

[Update 11/7/2023 at 2:53 pm]

Dear Phoenix users, 

Summary: The hardware upgrade of the Phoenix cluster storage was completed successfully, and the cluster is back in operation. The scheduler has been unpaused, and jobs that were held have resumed. Globus transfer jobs have also been resumed.

Details: To fix the slow response of the project storage, we brought the Phoenix cluster offline from 8am until 2:50pm and upgraded several firmware and software components of the /storage/coda1 file system. Engineers from the storage vendor worked with us through the upgrade and helped ensure that the storage is operating correctly.

Impact: Storage on the Phoenix cluster can be accessed as usual, and jobs can be submitted. The Globus and Open OnDemand services are working as expected. If you have any issues, please contact us at pace-support@oit.gatech.edu.

Thank you for your patience! 

The PACE team

[Update 11/3/2023 at 3:53 pm]

Dear Phoenix users,

Summary: Given the significant impact the storage issues are having on the community, the Phoenix cluster will be taken offline on Tuesday, November 7, 2023, to fix problems with the project storage. The offline period will start at 8am and is planned to end by 8pm.

Details: To implement the fixes to the firmware and software libraries for the storage appliance controllers, we need to pause the Phoenix cluster for 12 hours, starting at 8am. Access to the file system /storage/coda1 will be interrupted while the work is in progress; Globus transfer jobs will also be paused while the fix is implemented. These fixes are expected to help improve the performance of the project storage, which has been below the normal baseline since Monday, October 30.

The date of Tuesday, November 7, 2023 was selected to ensure that an engineer from our storage vendor will be available to assist our team in performing the upgrade tasks and monitoring the health of the storage.

Impact: On November 7, 2023, no new jobs will start on the Phoenix cluster from 8 am until 8 pm. The job queue will resume after 8 pm. If your job fails after the cluster is released at 8 pm, please resubmit it. This only affects the Phoenix cluster; the other PACE clusters (Firebird, Hive, ICE, and Buzzard) will be online and operating as usual.

Again, we greatly appreciate your patience as we work to resolve the issue. Please contact us at pace-support@oit.gatech.edu with any questions!

[Update 11/3/2023 1:53 pm]

Dear Phoenix users,  

Summary: Storage performance on Phoenix coda1 project space continues to be degraded. 

Details: Intermittent performance issues continue on the project file system on the Phoenix cluster, /storage/coda1. This was first observed on the afternoon of Monday, October 30. 

Our storage vendor found versioning issues with firmware and software libraries on the storage appliance controllers that might be causing additional delays with data retransmissions. The mismatch was created when a hardware component was replaced during a scheduled maintenance period; the replacement required the rest of the system to be upgraded to matching versions, but that step was omitted from the installation and upgrade instructions.

We continue to work with the vendor to define a plan to update all of the components and correct this situation. We may need to pause cluster operations to avoid problems while the fix is implemented; during such a pause, jobs would be held and then resumed when the cluster is released. We are working with the vendor to confirm all of the details before scheduling the work, and we will provide information on when the fix will be applied and what to expect of cluster performance and operations.

Impact: Simple file operations, including listing the files in a directory, reading from a file, and saving a file, are intermittently taking longer than usual (at its worst, an operation that normally takes a few milliseconds can take about 10 seconds). This affects the /storage/coda1/ project storage directories, but not scratch storage or any of the other PACE clusters.

We greatly appreciate your patience as we work to resolve the issue. Please contact us at pace-support@oit.gatech.edu with any questions!

Thank you and have a great day!  

-The PACE Team 

October 30, 2023

PACE Maintenance Period (Oct 24 – Oct 30, 2023) is over

Filed under: Uncategorized — Grigori Yourganov @ 12:33 pm

The maintenance on the Phoenix, Hive, Buzzard, Firebird, and ICE clusters has been completed. All clusters are back in production and ready for research, and all jobs that were held by the scheduler have been released. The Firebird cluster was released at 12:30 pm on October 30, and the other clusters were released at 2:45 pm on October 27.

Update on the current cooling situation: DataBank performed a temporary repair to restore cooling to the research hosting environment. Cooling capacity in the research hall is at less than 100% and is being actively monitored, but we are currently able to run the clusters at full capacity. The plan is for DataBank to install new parts during the next maintenance window, which is scheduled for January 23-25, 2024. Should the situation worsen and a full repair be required sooner, we will do our best to provide at least one week of notice. At this time, we do not expect the need for additional downtime.

Update on Firebird: We are happy to announce that the Firebird cluster is ready to use after migration to the Slurm scheduler! Again, we greatly appreciate your patience during this extended maintenance period. Over the weekend we investigated a few lingering issues with MPI and the user environment on the cluster and have implemented and tested corrections.

Firebird users will no longer be able to use the Torque/Moab scheduler and should make sure their workflows work on the Slurm-based cluster. Please contact us if you need additional help shifting your workflows to the Slurm-based cluster. PACE provides the Firebird Migration Guide, and an additional Firebird-specific Slurm training session [register here] to support the smooth transition of your workflows to Slurm. You are also welcome to join our PACE Consulting Sessions or to email us for support.  

 
[Changes to Note] 

  • New Hardware: There are 12 new 32-core Intel Cascade Lake CPU nodes with 384 GB of RAM, in addition to new GPU nodes with 4x NVIDIA A100 GPUs, 48-core Intel Xeon Gold CPUs, and 512 GB of RAM.  
  • Account names: Under Slurm, charge account names will use the prefix “cgts-” (in the form “cgts-<PI username>-<project>-<account>”) rather than “GT-”. 
  • Default GPU: If you do not specify a GPU type in your job script, Slurm will default to an NVIDIA A100 node rather than an NVIDIA RTX6000 node; the A100 nodes are more expensive but more performant.  
  • SSH Keys: When you log in for the first time, you may receive a warning about new host keys, similar to the following: 
    Warning: the ECDSA host key for ‘login-.pace.gatech.edu’ differs from the key for the IP address ‘xxx.xx.xx.xx’ 
    Offending key for IP in /home/gbrudell3/.ssh/known_hosts:1 
    Are you sure you want to continue connecting (yes/no)? 
    This is expected! Simply type “yes” to continue!
    • You may also be prevented from logging in and have to edit your ~/.ssh/known_hosts to remove the old key, depending on your local SSH client settings. 
  • Jupyter and VNC: We do not currently have a replacement for Jupyter or VNC scripts for the new Slurm environment; we will be working on a solution to these needs over the coming weeks. 
  • MPI: For researchers using mvapich2 under the Slurm environment, adding --constraint=core24 or --constraint=core32 to your job is necessary to ensure a homogeneous node allocation (these constraints reflect the number of CPU cores per node); see the example script below.  
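
For reference, here is a minimal Slurm batch script sketch that pulls together the changes above. The charge account name, module name, and application binary are placeholders; the Firebird Migration Guide has the authoritative values for your project.

    #!/bin/bash
    #SBATCH -J my-test-job               # job name
    #SBATCH -A cgts-pi-project-account   # placeholder: use your new "cgts-" charge account
    #SBATCH -N 2                         # two nodes
    #SBATCH --ntasks-per-node=32         # 32 MPI tasks per node
    #SBATCH --constraint=core32          # keep mvapich2 jobs on homogeneous 32-core nodes
    #SBATCH -t 1:00:00                   # one-hour walltime

    module load mvapich2                 # module name is illustrative
    srun ./my_mpi_app                    # placeholder application binary

To request a GPU type other than the default A100, name the type explicitly in your GPU request (for example, a directive along the lines of #SBATCH --gres=gpu:RTX_6000:1; check the migration guide for the exact type strings available on Firebird). If a stale host key blocks your login entirely, running ssh-keygen -R <login hostname> on your local machine removes the old entry from ~/.ssh/known_hosts.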

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu

Thank you for your patience during this extended outage!

The PACE Team

October 18, 2023

PACE Maintenance Period (Oct 24 – Oct 26, 2023) 

Filed under: Uncategorized — Grigori Yourganov @ 10:31 am

WHEN IS IT HAPPENING?

This is a reminder that PACE’s next Maintenance Period starts at 6:00AM on Tuesday, 10/24/2023, and is tentatively scheduled to conclude by 11:59PM on Thursday, 10/26/2023. PACE will release each cluster (Phoenix, Hive, Firebird, ICE, and Buzzard) as soon as maintenance work is complete.

WHAT DO YOU NEED TO DO?

As usual, the scheduler will hold any job whose requested resources (including walltime) would overlap the Maintenance Period until after the maintenance is complete. During the Maintenance Period, access to all PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, ICE, and Buzzard. Please plan accordingly for the projected downtime.
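
If you are unsure whether a pending job will be held, the scheduler can report its estimated start time, and a job whose requested walltime ends before the maintenance window begins can still run. A short sketch (the job ID and script name are placeholders):

    # Estimated start time of a pending job; jobs whose walltime would overlap
    # the Maintenance Period will not start until after it completes.
    squeue -j 1234567 --start

    # A job whose walltime finishes before 6:00 AM on 10/24 can still be
    # scheduled, e.g. a 12-hour job submitted the morning of 10/23:
    sbatch -t 12:00:00 my_job.sbatch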

WHAT IS HAPPENING?

ITEMS REQUIRING USER ACTION:

•     [Firebird] Migrate from the Moab/Torque scheduler to the Slurm scheduler. If you are a Firebird user, we will get in touch with you and provide assistance with rewriting your batch scripts and adjusting your workflow to Slurm.

ITEMS NOT REQUIRING USER ACTION:

•     [Network] Upgrade network switches

•     [Network][Hive] Configure redundancy on Hive racks

•     [Network] Upgrade firmware on InfiniBand network switches

•     [Storage][Phoenix] Reconfigure old scratch storage

•     [Storage][Phoenix] Upgrade Lustre controller and disk firmware, apply patches

•     [Datacenter] Datacenter cooling tower cleaning

WHY IS IT HAPPENING?

Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.

WHO IS AFFECTED?

All users across all PACE clusters.

WHO SHOULD YOU CONTACT FOR QUESTIONS?  

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

October 13, 2023

Phoenix Storage and Scheduler Outage

Filed under: Uncategorized — Jeff Valdez @ 9:48 am

[Update 10/13/2023 10:35am] 

Dear Phoenix Users,  

The Lustre scratch storage and the Slurm scheduler on Phoenix went down late yesterday evening (starting around 11pm) and are now back up and available. We have run tests to confirm that the Lustre storage and the Slurm scheduler are running correctly. We will continue to monitor the storage and scheduler for any other issues. 

Preliminary analysis by the storage vendor indicates that the outage was caused by a kernel bug we thought had previously been addressed. As an immediate fix, we have disabled features of the Lustre storage appliance that should avoid triggering another outage; a long-term patch is planned for our upcoming Maintenance Period (October 24-26). 

Existing jobs that were queued have already started or will start soon. You should be able to submit new jobs to the scheduler without issue. As noted earlier, we strongly recommend reviewing the output of any jobs that use the Lustre storage (project and scratch directories), as there may be unexpected errors. We will refund any jobs that failed due to the outage. 
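
If you would like to identify which of your jobs may have been affected, sacct can list jobs that ended in a failed state since the outage began. A sketch (the time window approximates the start of the outage; adjust as needed):

    sacct -u $USER -S 2023-10-12T23:00:00 -E now \
          --state=FAILED,NODE_FAIL,CANCELLED \
          --format=JobID,JobName%20,State,Elapsed,End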

We apologize for the inconvenience. Thank you for your patience today as we repaired the Phoenix cluster. For questions, please contact PACE at pace-support@oit.gatech.edu.

Thank you, 

-The PACE Team 

[Update 10/13/2023 9:48am] 

Dear Phoenix Users, 

Unfortunately, the Lustre storage on Phoenix became unresponsive late yesterday evening (starting around 11pm). As a result of the storage outage, the Slurm scheduler was also impacted and became unresponsive. 

We have restarted the Lustre storage appliance, and the file system is now available. The vendor is currently running tests to make sure the Lustre storage is healthy, and we will also be running checks on the Slurm scheduler.

Jobs currently running will likely continue running, but we strongly recommend reviewing the output of any jobs that use the Lustre storage (project and scratch), as there may be unexpected errors. Jobs waiting in the queue will stay queued until the scheduler has resumed. 

We will provide updates as soon as we complete testing of the Lustre storage and Slurm scheduler. 

Thank you, 

-The PACE Team  

September 13, 2023

Phoenix Storage Cables and Hard Drive Replacement

Filed under: Uncategorized — Jeff Valdez @ 5:54 pm

[Update 9/14/2023 1:02pm]
The cables have been replaced on Phoenix and Hive storage with no interruption to production.

[Original Post 9/13/2023 5:54pm]

WHAT’S HAPPENING?

Two cables on Phoenix’s Lustre storage and one cable on Hive’s storage need to be replaced. The cable replacement will take about 2 hours to complete.

WHEN IS IT HAPPENING?

Thursday, September 14th, 2023 starting at 10 AM EDT.

WHY IS IT HAPPENING?

Required maintenance.

WHO IS AFFECTED?

All users may experience a storage access outage and a subsequent temporary decrease in performance.

WHAT DO YOU NEED TO DO?

During cable replacement on the Phoenix and Hive clusters, one of the storage controllers will be shut down and the redundant controller will take over all traffic. Data access should be preserved, but there have been cases where storage has become inaccessible. If storage becomes unavailable during the replacement, your job may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored.
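
If one of your jobs does appear hung because storage became unavailable, canceling and resubmitting it looks like the following (the job ID and script name are placeholders):

    scancel 1234567            # cancel the stalled job
    sbatch my_job.sbatch       # resubmit once storage access is restored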

WHO SHOULD YOU CONTACT FOR QUESTIONS?

For any questions, please contact PACE at pace-support@oit.gatech.edu.

August 31, 2023

Upcoming Firebird Slurm Migration Announcement

Filed under: Uncategorized — Marian Zvada @ 10:50 am

The Firebird cluster will be migrating to the Slurm scheduler on October 24-26, 2023. PACE has developed a plan to transition researchers’ workflows smoothly. As you may be aware, PACE began the Slurm migration in July 2022 and has already migrated the Hive, Phoenix, and ICE clusters. Firebird is the last cluster in PACE’s transition from Torque/Moab to Slurm, which brings increased job throughput and better scheduling policy enforcement. The new scheduler will better support the new hardware to be added to Firebird soon. We will be updating our software stack at the same time and offering orientation and consulting sessions to support this migration. 

Software Stack 

In addition to the scheduler migration, the PACE Apps central software stack will also be updated. This software stack supports the Slurm scheduler and already runs successfully on Phoenix, Hive, and ICE. The Firebird cluster will feature the applications listed in our documentation. Please review this list of non-CUI software we will offer on Firebird post-migration and let us know via email (pace-support@oit.gatech.edu) if any PACE-installed software you are currently using on Firebird is missing from the list. If you already submitted a reply to the application survey sent to Firebird PIs, there is no need to repeat your requests. Researchers installing or writing custom software will need to recompile their applications against the new MPI and other libraries once the new system is ready.   
 
We will freeze new software installations in the PACE central (Torque-based) software stack starting September 1, 2023. You can continue installing software in your local or shared space without interruption. 
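
As an illustration, a typical self-managed installation into your home or shared project space might look like the sketch below; the package name, version, and install prefix are placeholders.

    tar xf mytool-1.2.3.tar.gz && cd mytool-1.2.3
    ./configure --prefix=$HOME/apps/mytool/1.2.3    # install into your own space
    make -j 8 && make install
    export PATH=$HOME/apps/mytool/1.2.3/bin:$PATH   # add the tool to your PATH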

No Test Environment 

Due to security and capacity constraints, it is infeasible to use a progressive rollout approach as we did for Phoenix and Hive, so there will not be a test environment. For researchers installing or writing their own software, we highly recommend the following: 

  • For those with access to Phoenix, compile Non-CUI software on Phoenix now and report any issue you encounter so that we can help you before migration. 
  • Please report any self-installed CUI software you need which cannot be tested on Phoenix. We will try our best to make all dependent libraries ready and give higher priority to assisting with reinstallation immediately after the Slurm migration.  

Support 

PACE will provide documentation, training sessions [register here], and support (consulting sessions and 1-1 sessions) to aid your workflow transition to Slurm. Documentation and a guide for converting job scripts from PBS to Slurm-based commands will be ready before the migration. We will offer Slurm training right after the migration; future communications will provide the schedule. You are welcome to join our PACE Consulting Sessions or to email us for support.  
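
Until the official guide is published, the general shape of the conversion is sketched below; the queue, account, and resource values are placeholders, and PACE's documentation will have the exact Firebird names.

    # Torque/Moab (PBS) directive        # Slurm equivalent
    #PBS -N myjob                        #SBATCH -J myjob
    #PBS -q myqueue                      #SBATCH -p myqueue
    #PBS -l nodes=2:ppn=24               #SBATCH -N 2 --ntasks-per-node=24
    #PBS -l walltime=08:00:00            #SBATCH -t 08:00:00
    #PBS -A GT-myaccount                 #SBATCH -A myaccount

    # Common command equivalents:
    #   qsub script.pbs    ->  sbatch script.sbatch
    #   qstat -u $USER     ->  squeue -u $USER
    #   qdel <jobid>       ->  scancel <jobid>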

We are excited to launch Slurm on Firebird to improve Georgia Tech’s research computing infrastructure! Please contact us with any questions or concerns about this transition. 

August 26, 2023

All PACE Clusters Down Due to Cooling Failure

Filed under: Uncategorized — Michael Weiner @ 9:40 pm

[Update 8/27/23 11:10 AM]

All PACE clusters have returned to service.

The datacenter cooling pump was replaced early this morning. After powering on compute nodes and testing, PACE resumed jobs on all clusters. On clusters which charge for use (Phoenix and Firebird), jobs that were cancelled yesterday evening when compute nodes were turned off will be refunded. Please submit new jobs to resume your work.

Thank you for your patience during this emergency repair.

[Original Post 8/26/23 9:40 PM]

Summary: A pump in the Coda Datacenter cooling system overheated on Saturday evening. All PACE compute nodes across all clusters (Phoenix, Hive, Firebird, Buzzard, and ICE) have been shut down until cooling is restored, stopping all compute jobs.

Details: DataBank is reporting an issue with the high-temperature condenser pump in the Research Hall of the Coda data center, which hosts PACE compute nodes. The Research Hall is being powered off so that DataBank facilities can replace the pump.

Impact: All PACE compute nodes are unavailable. Running jobs have been cancelled, and no new jobs can start. Login nodes and storage systems remain available. Compute nodes will remain off until the cooling system is repaired.

August 21, 2023

Phoenix Storage Cables and Hard Drive Replacement

Filed under: Uncategorized — Jeff Valdez @ 1:08 pm

WHAT’S HAPPENING?

Two SAS cables and one hard drive for Phoenix’s Lustre storage need to be replaced. Cable and hard drive replacement will take about 2 hours to complete the work.

WHEN IS IT HAPPENING?

Tuesday, August 22nd, 2023, starting at 10 AM EDT.

WHY IS IT HAPPENING?

Required maintenance.

WHO IS AFFECTED?

All users may experience a storage access outage and a subsequent temporary decrease in performance.

WHAT DO YOU NEED TO DO?

During cable replacement, one of the storage controllers will be shut down and the redundant controller will take over all traffic. Data access should be preserved, but there have been cases where storage has become inaccessible. If storage becomes unavailable during the replacement, your job may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage availability is restored.

WHO SHOULD YOU CONTACT FOR QUESTIONS?

For any questions, please contact PACE at pace-support@oit.gatech.edu.

Phoenix Slurm Scheduler Outage

Filed under: Uncategorized — Jeff Valdez @ 11:17 am

[Update 8/21/23 5:02 PM]

Dear Phoenix Users, 

The Slurm scheduler on Phoenix is back up and available. We have applied the patch that was recommended by SchedMD, the developer of Slurm; cleaned the database; and run tests to confirm that the scheduler is running correctly. We will continue to monitor the scheduler database for any other issues.

Existing jobs that have been queued should have already started or will start soon. You should be able to submit new jobs on the scheduler without issue. We will refund any jobs that failed due to the scheduler outage.

Thank you for your patience today as we repaired the Phoenix cluster. For questions, please contact PACE at pace-support@oit.gatech.edu.

Thank you,

-The PACE Team 

[Update 8/21/23 3:20 PM]

Dear Phoenix Users, 

We have been working with the Slurm scheduler vendor, SchedMD, to identify and fix a corrupted association in the scheduler database and to provide a patch. While we were troubleshooting the scheduler this afternoon, some jobs were able to start. We are going to pause the scheduler again so that the database cleanup can be completed without disruption from new jobs. 

Based on our estimates, we expect to restore the scheduler later tonight. We will provide an update as soon as the scheduler is released.

Thank you, 

-The PACE Team 

[Update 8/21/23 11:17 AM]

Dear Phoenix Users, 

Unfortunately, the Slurm scheduler controller is down due to issues with Slurm’s database, and jobs cannot be scheduled. We have submitted a high-priority service request to SchedMD, the developer of Slurm, and should be able to provide an update soon. 

Jobs currently running will likely continue to run, but we recommend reviewing their output, as there may be unexpected errors. Jobs waiting in the queue will stay queued until the scheduler is fixed. 

The rest of the Phoenix cluster infrastructure outside of the scheduler (login nodes, storage, etc.) should be working. We recommend not running commands that interact with Slurm (e.g., ‘sbatch’, ‘srun’, ‘sacct’, or ‘pace-quota’), because they will not work at this time. 
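
Once we announce that the scheduler has been restored, a quick way to confirm that the controller is responding before resubmitting work (standard Slurm commands):

    scontrol ping         # reports whether the Slurm controller (slurmctld) is UP
    squeue -u $USER       # your queued jobs should reappear once the controller is back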

We will provide updates as we work on fixing the scheduler. 

Thank you, 

-The PACE Team 
