PACE: A Partnership for an Advanced Computing Environment

January 31, 2024

Phoenix Scheduler Outage

Filed under: Uncategorized — Michael Weiner @ 11:44 am

Summary: The Slurm scheduler on Phoenix is experiencing an intermittent outage.

Details: The scheduler is repeatedly freezing due to a problematic input. The PACE team has identified the likely cause and is attempting to restore functionality.

Impact: Commands like squeue and sinfo may report errors, and new jobs may not start on Phoenix. Already-running jobs are not impacted. Other clusters (Hive, ICE, Firebird, Buzzard) are not impacted.

Thank you for your patience as we work to restore Phoenix to full functionality. Please contact us at pace-support@oit.gatech.edu with any questions. You may track the status of this outage on the GT Status page.

January 25, 2024

PACE Maintenance Period (Jan 23 – Jan 25, 2024) is over 

Filed under: Uncategorized — Grigori Yourganov @ 10:55 am

Dear PACE users,  

The maintenance on the Phoenix, Hive, Firebird, and ICE clusters has been completed; the OSG Buzzard cluster is still under maintenance, and we expect it to be ready next week. The Phoenix, Hive, Firebird, and ICE clusters are back in production and ready for research; all jobs that have been held by the scheduler have been released.   

  

The POSIX group names on the Phoenix, Hive, Firebird, and ICE clusters have not been updated, due to factors within the IAM team. This update is now scheduled for our next maintenance period, May 7-9, 2024.

Thank you for your patience!   

  

The PACE Team 

January 18, 2024

NetApp Storage Outage

Filed under: Uncategorized — Michael Weiner @ 5:22 pm

[Update 1/18/24 6:30 PM]

Access to storage has been restored, and all systems have full functionality. The Phoenix and ICE schedulers have been resumed, and queued jobs will now start.

Please resubmit any jobs that may have failed. If a running job is no longer progressing, please cancel and resubmit.
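
If you are unsure which of your jobs were affected, the standard Slurm commands below can help you check and resubmit (the job ID and script name are placeholders):

    squeue -u $USER        # list your jobs and their current states
    scancel 123456         # cancel a job that is no longer progressing (placeholder job ID)
    sbatch my_job.sbatch   # resubmit your batch script (placeholder script name)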

The outage was traced to an update made this afternoon to resolve a specific permissions issue affecting some users of the ICE shared directories. That update has been reverted.

Thank you for your patience as we resolved this issue.

[Original Post 1/18/24 5:20 PM]

Summary: An outage on PACE NetApp storage devices is affecting the Phoenix and ICE clusters. Home directories and software are not accessible.

Details: At approximately 5:00 PM, an issue began affecting access to NetApp storage devices on PACE. The PACE team is investigating at this time.

Impact: All storage devices provided by NetApp services are currently unreachable. This includes home directories on Phoenix and ICE, the pace-apps software repository on Phoenix and ICE, and course shared directories on ICE. Users may encounter errors upon login due to inaccessible home directories. We have paused the schedulers on Phoenix and ICE, so no new jobs will start. The Hive and Firebird clusters are not affected.

Please contact us at pace-support@oit.gatech.edu with any questions.

January 12, 2024

PACE Maintenance Period (Jan 23 – Jan 25, 2024) 

Filed under: Uncategorized — Grigori Yourganov @ 3:53 pm

WHEN IS IT HAPPENING?  

PACE’s next Maintenance Period starts at 6:00AM on Tuesday, 01/23/2024, and is tentatively scheduled to conclude by 11:59PM on Thursday, 01/25/2024. PACE will release each cluster (Phoenix, Hive, Firebird, ICE, and Buzzard) as soon as maintenance work is complete. 

WHAT DO YOU NEED TO DO?   

As usual, jobs whose resource requests would have them running during the Maintenance Period will be held by the scheduler until after the maintenance. During this Maintenance Period, access to all PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, ICE, CEDAR, and Buzzard. Please plan accordingly for the projected downtime.

WHAT IS HAPPENING?   

ITEMS REQUIRING USER ACTION: 

  • [all] During the maintenance period, the PACE team will rename all POSIX user groups so that names will start with the “pace-” prefix.
    • This will NOT affect numerical GIDs, but if your scripts or workflows rely on group names, they will need to be updated (see the sketch after this list). 
    • If you don’t use POSIX user group names in your scripts or workflows, no action is required on your part. 
    • This is a step towards tighter integration of PACE systems with central IAM tools, which will lead to improvements across the board in the PACE user experience.  
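
As a brief, hypothetical illustration of the change (the group name and directory below are examples, not actual PACE groups or paths): a workflow that sets group ownership by name would need the new “pace-” prefixed name, while anything that uses numerical GIDs keeps working unchanged.

    # Hypothetical example: group names gain the "pace-" prefix after the rename
    chgrp -R coc-examplelab /storage/coda1/shared_data        # before the maintenance period
    chgrp -R pace-coc-examplelab /storage/coda1/shared_data   # after the maintenance period

    # Check which group names your account currently belongs to
    id -Gn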

ITEMS NOT REQUIRING USER ACTION: 

  • [datacenter] Databank maintenance: Replace pump impeller, cooling tower maintenance 
  • [storage] Install NFS over RDMA kernel module to enable pNFS for access to VAST storage test machine 
  • [storage] Replace two UPS units for the SFA14KXE controllers 
  • [storage] Upgrade DDN SFA14KXE controller firmware 
  • [storage] Upgrade DDN 400NV ICE storage controllers and servers 
  • [Phoenix, Hive, ICE, Firebird] Upgrade all clusters to Slurm version 23.11.X 

WHY IS IT HAPPENING?  

Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.  

WHO IS AFFECTED?  

All users across all PACE clusters.  

WHO SHOULD YOU CONTACT FOR QUESTIONS?   

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

Thank you,  

-The PACE Team 

December 14, 2023

PACE Winter Break Schedule

Filed under: Uncategorized — Craig Moseley @ 4:33 pm

Thank you for being a PACE user. Please be mindful that we are closed during the official GT Winter Break, providing only emergency services, and will have limited availability the week of Dec 18th-22nd. If you have an urgent incident, please be specific about the request, including any deadlines. While we cannot make any guarantees, we will do our best. We hope you enjoy your holiday, stay safe, and best wishes for the new year!

November 6, 2023

Resolved – Scratch Space Outage on the Phoenix Cluster

Filed under: Uncategorized — Jeff Valdez @ 9:34 am

[Update 11/6/2023 at 12:26 pm]

Dear Phoenix users,

Summary: The Phoenix cluster is back online. The scheduler is unpaused, and the jobs that were put on hold have now resumed.  

Details: The PACE support team has upgraded several components of the scratch storage system (controller software, disk firmware) according to the plan provided by the hardware vendor (DDN). We have tested the performance of the file system, and the tests passed.  

Impact: Please continue using the Phoenix cluster as usual. If you encounter issues, please contact us at pace-support@oit.gatech.edu. Also, please keep in mind that the cluster will be offline tomorrow (November 7) from 8am until 8pm so the PACE team can work on fixing the project storage (an unrelated issue). 

Thank you and have a great day!

The PACE Team

[Update 11/6/2023 at 9:27 am]

Dear Phoenix users, 

Summary: Storage performance on Phoenix scratch space is degraded. 

Details: Around 11pm on Saturday (November 4, 2023), the scratch space on the Phoenix cluster became unresponsive, and it is currently inaccessible to users. The PACE team is investigating the situation and applying an upgrade recommended by the vendor to improve stability. The PACE team paused the scheduler on Phoenix at 8:13am on Monday, November 6, to prevent additional job failures. The upgrade is estimated to take until 12pm on Monday. After the upgrade is installed, the scheduler will be released, and held jobs will resume. This issue is not related to the slowness of the Phoenix project storage reported last week, which will be addressed during the Phoenix outage tomorrow (November 7). 

Impact: Phoenix users are currently unable to access scratch storage. Jobs on the Phoenix cluster have been paused, and new jobs will not start until the scheduler is resumed. Other PACE clusters (ICE, Hive, Firebird, Buzzard) are not affected. 

We apologize for the multiple storage-access issues that have affected the Phoenix cluster. We are continuing to engage with the storage vendor to improve the performance of our system. The recommended upgrade is in progress, and the cluster will be offline tomorrow to address the project filesystem issue. 

Thank you for your patience!

The PACE Team

November 3, 2023

Degraded Phoenix Project storage performance

Filed under: Uncategorized — Jeff Valdez @ 1:53 pm

[Update 11/12/2023 11:15 PM]

The rebuild process completed on Sunday afternoon, and the system has returned to normal performance.

[Update 11/11/2023 6:40 PM]

Unfortunately, the rebuild is still in progress. Another drive has failed, which is slowing down the rebuild. We continue to monitor the situation closely.

[Update 11/10/2023 4:30 PM]

Summary: The project storage on Phoenix (/storage/coda1) is degraded due to hard drive failures. Access to the data is not affected; the scheduler continues to accept and process jobs. 

Details: Two hard drives that are part of the Phoenix storage space failed on the morning of Friday, November 10, 2023 (the first drive failed at 8:05 am, and the second drive failed at 11:10 am). The operating system automatically activated some spare drives and started rebuilding the pool. During this process, file read and write operations by Phoenix users will take longer than usual. The rebuild is expected to end around 3 am on Saturday, November 11, 2023 (our original estimate of 7pm, Nov 10 2023, was too optimistic).  

Impact: During the rebuild, file input/output operations are slower than usual. Access to the Phoenix cluster is not impacted, and the scheduler is processing jobs at a normal rate. 

Thank you for your patience as we work to resolve the problem. 

[Update 11/8/2023 at 10:00 am]

Summary: Phoenix project storage experienced degraded performance overnight. PACE and our storage vendor made an additional configuration change this morning to restore performance.  

Details: Following yesterday’s upgrade, Phoenix project storage became degraded overnight, though to a lesser extent than prior to the upgrade. Early this morning, the PACE team found that performance was slower than normal and began working with our storage vendor to identify the cause. We adjusted a parameter that handles migration of data between disk pools, and performance was restored.  

Impact: Reading or writing files on the Phoenix project filesystem (coda1) may have been slower than usual last night and this morning. The prior upgrade mitigated the impact, so performance was less severely impacted. Home and scratch directories were not affected. 

Thank you for your patience as we completed this repair.

[Update 11/7/2023 at 2:53 pm]

Dear Phoenix users, 

Summary: The hardware upgrade of the Phoenix cluster storage was completed successfully, and the cluster is back in operation. The scheduler has been unpaused, and jobs that were held have now resumed. Globus transfer jobs have also been resumed.

Details: To fix the slow response of the project storage, we brought the Phoenix cluster offline from 8am until 2:50pm and upgraded firmware and software libraries on the /storage/coda1 file system. Engineers from the storage vendor worked with us throughout the upgrade and helped us ensure that the storage is operating correctly.  

Impact: Storage on the Phoenix cluster can be accessed as usual, and jobs can be submitted. The Globus and Open OnDemand services are working as expected. If you have any issues, please contact us at pace-support@oit.gatech.edu.

Thank you for your patience! 

The PACE team

[Update 11/3/2023 at 3:53 pm]

Dear Phoenix users,  
 

Summary: Based on the significant impact the storage issues are causing to the community, the Phoenix cluster will be taken offline on Tuesday, November 7, 2023, to fix problems with the project storage. The offline period will start at 8am and is planned to end by 8pm.   

Details: To implement the fixes to the firmware and software libraries for the storage appliance controllers, we need to pause the Phoenix cluster for 12 hours, starting at 8am. Access to the file system /storage/coda1 will be interrupted while the work is in progress; Globus transfer jobs will also be paused while the fix is implemented. These fixes are expected to help improve the performance of the project storage, which has been below the normal baseline since Monday, October 30.

The date of Tuesday, November 7, 2023 was selected to ensure that an engineer from our storage vendor will be available to help our team perform the upgrade tasks and monitor the health of the storage. 

Impact: On November 7, 2023, no new jobs will start on the Phoenix cluster from 8 am until 8 pm. The job queue will resume after 8 pm. If your job fails after the cluster is released at 8 pm, please resubmit it. This only affects the Phoenix cluster; the other PACE clusters (Firebird, Hive, ICE, and Buzzard) will be online and operating as usual. 

Again, we greatly appreciate your patience as we work to resolve the issue. Please contact us at pace-support@oit.gatech.edu with any questions!

[Update 11/3/2023 1:53 pm]

Dear Phoenix users,  

Summary: Storage performance on Phoenix coda1 project space continues to be degraded. 

Details: Intermittent performance issues continue on the project file system on the Phoenix cluster, /storage/coda1. This was first observed on the afternoon of Monday, October 30. 

Our storage vendor found versioning issues with firmware and software libraries on the storage appliance controllers that might be causing additional delays with data retransmissions. The mismatch was created when a hardware component was replaced during a scheduled maintenance period; the replacement required the rest of the system to be upgraded to the same versions, but unfortunately that step was omitted from the installation and upgrade instructions. 

We continue to work with the vendor to define a proper plan to update all the components and correct this situation. It is possible we will need to pause cluster operations while the fix is implemented; during this pause, jobs will be put on hold and will resume when the cluster is released. Again, we are working with the vendor to make sure we have all the details before scheduling the implementation. We will provide information on when the fix will be applied and what to expect of cluster performance and operations. 

Impact: Simple file operations, including listing the files in a directory, reading from a file, saving a file, etc., are intermittently taking longer than usual (at its worst, an operation expected to take a few milliseconds runs in about 10 seconds). This affects the /storage/coda1/ project storage directories, but not scratch storage or any of the other PACE clusters.    

We greatly appreciate your patience as we are trying to resolve the issue. Please contact us at pace-support@oit.gatech.edu with any questions!  

Thank you and have a great day!  

-The PACE Team 

October 30, 2023

PACE Maintenance Period (Oct 24 – Oct 30, 2023) is over

Filed under: Uncategorized — Grigori Yourganov @ 12:33 pm

The maintenance on the Phoenix, Hive, Buzzard, Firebird, and ICE clusters has been completed. All clusters are back in production and ready for research; all jobs that have been held by the scheduler have been released. The Firebird cluster was released at 12:30 pm on October 30, and the other clusters were released at 2:45 pm on October 27.  

Update on the current cooling situation: DataBank performed a temporary repair to restore cooling to the research hosting environment. Cooling capacity in the research hall is at less than 100% and is being actively monitored. We are currently able to run the clusters at full capacity. The plan is for DataBank to install new parts during the next Maintenance window, scheduled for Jan 23rd-25th, 2024. Should the situation worsen and a full repair be required sooner, we will do our best to provide at least one week of notice. At this time, we do not expect the need for additional downtime.  

Update on Firebird: We are happy to announce that the Firebird cluster is ready to use after migration to the Slurm scheduler! Again, we greatly appreciate your patience during this extended maintenance period. Over the weekend, we were able to investigate a few lingering issues with MPI and the user environment on the cluster and have implemented and tested corrections.  
 

Firebird users will no longer be able to use the Torque/Moab scheduler and should make sure their workflows run on the Slurm-based cluster. Please contact us if you need additional help shifting your workflows to Slurm. PACE provides the Firebird Migration Guide and an additional Firebird-specific Slurm training session [register here] to support a smooth transition of your workflows to Slurm. You are also welcome to join our PACE Consulting Sessions or to email us for support.  
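
As a rough sketch of what the translation looks like (the job name, resource amounts, account placeholder, and program below are illustrative only; the Firebird Migration Guide remains the authoritative reference):

    #!/bin/bash
    # Common Torque/Moab directives and their approximate Slurm equivalents
    #SBATCH --job-name=myjob             # was: #PBS -N myjob
    #SBATCH --nodes=1                    # was: #PBS -l nodes=1:ppn=8
    #SBATCH --ntasks-per-node=8
    #SBATCH --time=02:00:00              # was: #PBS -l walltime=02:00:00
    #SBATCH --account=<charge-account>   # placeholder; see the migration guide for your account name

    cd $SLURM_SUBMIT_DIR                 # was: cd $PBS_O_WORKDIR
    srun ./my_program                    # placeholder executable

Submission also changes from qsub to sbatch (e.g., “sbatch myjob.sbatch”), and queue monitoring from qstat to squeue.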

 
[Changes to Note] 

  • New Hardware: There are 12 new 32-core Intel Cascade Lake CPU nodes with 384 GB of RAM available, in addition to new GPU nodes with 4x NVIDIA A100 GPUs, 48-core Intel Xeon Gold CPUs, and 512 GB of RAM.  
  • Account names: Under Slurm, charge accounts will have the prefix “cgts-<PI username>-<project>-<account>” rather than “GT-” 
  • Default GPU: If you do not specify a GPU type in your job script, Slurm will default to using an NVIDIA A100 node, rather than an NVIDIA RTX6000 node; the A100 nodes are more expensive but more performant.  
  • SSH Keys: When you log in for the first time, you may receive a warning about new host keys, similar to the following: 
    Warning: the ECDSA host key for ‘login-.pace.gatech.edu’ differs from the key for the IP address ‘xxx.xx.xx.xx’ 
    Offending key for IP in /home/gbrudell3/.ssh/known_hosts:1 
    Are you sure you want to continue connecting (yes/no)? 
    This is expected! Simply type “yes” to continue!
    • You may also be prevented from logging in and have to edit your .ssh/known_hosts file to remove the old key, depending on your local SSH client settings. 
  • Jupyter and VNC: We do not currently have a replacement for Jupyter or VNC scripts for the new Slurm environment; we will be working on a solution to these needs over the coming weeks. 
  • MPI: For researchers using mvapich2 under the Slurm environment, specifying the additional --constraint=core24 or --constraint=core32 flag is necessary to ensure a homogeneous node allocation for the job (these reflect the number of CPUs per node). A sketch combining the GPU and constraint options follows this list.  
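
A minimal sketch combining these options (the GPU type string, module name, and executable are placeholders, not confirmed Firebird values):

    #!/bin/bash
    #SBATCH --gres=gpu:<GPU-type>:1   # <GPU-type> is a placeholder; omitting the type defaults to an A100 node
    #SBATCH --constraint=core32       # for mvapich2 jobs: request homogeneous 32-core nodes (or core24)

    module load mvapich2              # module name is illustrative; check “module avail” for exact versions
    srun ./my_mpi_program             # placeholder executable

If your SSH client refuses to connect because of the changed host key, removing the stale entry with “ssh-keygen -R <hostname>” (or editing ~/.ssh/known_hosts by hand) is a standard remedy.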

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu

Thank you for your patience during this extended outage!

The PACE Team

October 18, 2023

PACE Maintenance Period (Oct 24 – Oct 26, 2023) 

Filed under: Uncategorized — Grigori Yourganov @ 10:31 am

WHEN IS IT HAPPENING?

This is a reminder that PACE’s next Maintenance Period starts at 6:00AM on Tuesday, 10/24/2023, and is tentatively scheduled to conclude by 11:59PM on Thursday, 10/26/2023. PACE will release each cluster (Phoenix, Hive, Firebird, ICE, and Buzzard) as soon as maintenance work is complete.

WHAT DO YOU NEED TO DO?

As usual, jobs whose resource requests would have them running during the Maintenance Period will be held by the scheduler until after the maintenance. During this Maintenance Period, access to all PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, ICE, and Buzzard. Please plan accordingly for the projected downtime.

WHAT IS HAPPENING?

ITEMS REQUIRING USER ACTION:

•     [Firebird] Migrate from the Moab/Torque scheduler to the Slurm scheduler. If you are a Firebird user, we will get in touch with you and provide assistance with rewriting your batch scripts and adjusting your workflow to Slurm.

ITEMS NOT REQUIRING USER ACTION:

•     [Network] Upgrade network switches

•     [Network][Hive] Configure redundancy on Hive racks

•     [Network] Upgrade firmware on InfiniBand network switches

•     [Storage][Phoenix] Reconfigure old scratch storage

•     [Storage][Phoenix] Upgrade Lustre controller and disk firmware, apply patches

•     [Datacenter] Datacenter cooling tower cleaning

WHY IS IT HAPPENING?

Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.

WHO IS AFFECTED?

All users across all PACE clusters.

WHO SHOULD YOU CONTACT FOR QUESTIONS?  

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

October 13, 2023

Phoenix Storage and Scheduler Outage

Filed under: Uncategorized — Jeff Valdez @ 9:48 am

[Update 10/13/2023 10:35am] 

Dear Phoenix Users,  

The Lustre scratch storage and the Slurm scheduler on Phoenix went down late yesterday evening (starting around 11pm) and are now back up and available. We have run tests to confirm that the Lustre storage and the Slurm scheduler are running correctly. We will continue to monitor the storage and scheduler for any other issues. 

Preliminary analysis by the storage vendor indicates that the outage was caused by a kernel bug we thought had previously been addressed. As an immediate fix, we have disabled features on the Lustre storage appliance that should avoid triggering another outage; a long-term patch is planned for our upcoming Maintenance Period (October 24-26). 

Existing jobs that were queued have already started or will start soon. You should be able to submit new jobs on the scheduler without issue. Again, we strongly recommend reviewing the output of your jobs that use the Lustre storage (project and scratch directories) as there may be unexpected errors.  We will refund any jobs that failed due to the outage. 
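
If you would like to review which of your jobs ran during the outage window, a generic Slurm accounting query such as the one below can help (the start time is approximate):

    # List your jobs since the evening of October 12 with their final states and exit codes
    sacct -u $USER --starttime=2023-10-12T22:00:00 --format=JobID,JobName,State,ExitCode,Elapsed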

We apologize for the inconvenience. Thank you for your patience today as we repaired the Phoenix cluster. For questions, please contact PACE at pace-support@oit.gatech.edu.

Thank you, 

-The PACE Team 

[Update 10/13/2023 9:48am] 

Dear Phoenix Users, 

Unfortunately, the Lustre storage on Phoenix became unresponsive late yesterday evening (starting around 11pm). As a result of the Lustre storage outage, the Slurm scheduler was also impacted and became unresponsive. 

We have restarted the Lustre storage appliance, and the file system is now available. The vendor is currently running tests to make sure the Lustre storage is healthy. We will also be running checks on the Slurm scheduler.

Jobs currently running will likely continue running, but we strongly recommend reviewing the output of your jobs that use the Lustre storage (project and scratch) as there may be unexpected errors. Jobs waiting in the queue will stay in the queue until the scheduler has resumed. 

We will continue to provide updates as soon as we complete testing of the Lustre storage and Slurm scheduler. 

Thank you, 

-The PACE Team  

