PACE: A Partnership for an Advanced Computing Environment

October 30, 2023

PACE Maintenance Period (Oct 24 – Oct 30, 2023) is over

Filed under: Uncategorized — Grigori Yourganov @ 12:33 pm

The maintenance on the Phoenix, Hive, Buzzard, Firebird, and ICE clusters has been completed. All clusters are back in production and ready for research, and all jobs held by the scheduler during maintenance have been released. The Firebird cluster was released at 12:30 pm on October 30, and the other clusters were released at 2:45 pm on October 27.

Update on the current cooling situation: DataBank has performed a temporary repair to restore cooling to the research hosting environment. Cooling capacity in the research hall is below 100% and is being actively monitored, but we are currently able to run the clusters at full capacity. DataBank plans to install new parts during the next maintenance window, which is scheduled for January 23rd-25th, 2024. Should the situation worsen and a full repair be required sooner, we will do our best to provide at least one week's notice. At this time, we do not expect the need for additional downtime.

Update on Firebird: We are happy to announce that the Firebird cluster is ready to use after migration to the Slurm scheduler! Again, we greatly appreciate your patience during this extended maintenance period. Over the weekend we investigated a few lingering issues with MPI and the user environment on the cluster, and we have implemented and tested corrections for both.
 

Firebird users will no longer be able to use the Torque/Moab scheduler and should make sure their workflows run on the Slurm-based cluster. PACE provides the Firebird Migration Guide and an additional Firebird-specific Slurm training session [register here] to support a smooth transition of your workflows to Slurm. Please contact us if you need additional help with the migration; you are also welcome to join our PACE Consulting Sessions or to email us for support.
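For orientation, here is a minimal sketch of how a simple Torque/Moab batch script might translate to Slurm. The job name, node counts, walltime, account string, and application are placeholders for illustration; please consult the Firebird Migration Guide for the authoritative mapping for your workflow.

    #!/bin/bash
    #SBATCH -J my_job                      # was: #PBS -N my_job
    #SBATCH -N 2                           # was: #PBS -l nodes=2:ppn=24
    #SBATCH --ntasks-per-node=24           #      (ppn maps to --ntasks-per-node)
    #SBATCH -t 12:00:00                    # was: #PBS -l walltime=12:00:00
    #SBATCH -A cgts-<PI username>-<project>-<account>   # new charge account format

    cd $SLURM_SUBMIT_DIR                   # was: cd $PBS_O_WORKDIR
    srun ./my_app                          # was: mpirun ./my_app

The script would then be submitted with sbatch instead of qsub (for example, sbatch my_job.sbatch).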

 
[Changes to Note] 

  • New Hardware: There are 12 new 32-core Intel Cascade Lake CPU nodes with 384 GB of RAM available, in addition to new GPU nodes with 4x NVIDIA A100 GPUs, 48-core Intel Xeon Gold CPUs, and 512 GB of RAM.
  • Account names: Under Slurm, charge accounts will use the prefix “cgts-” (in the format “cgts-<PI username>-<project>-<account>”) rather than “GT-”.
  • Default GPU: If you do not specify a GPU type in your job script, Slurm will default to using an NVIDIA A100 node, rather than an NVIDIA RTX6000 node; the A100 nodes are more expensive but more performant.  
  • SSH Keys: When you log in for the first time, you may receive a warning about new host keys, similar to the following:
    Warning: the ECDSA host key for ‘login-.pace.gatech.edu’ differs from the key for the IP address ‘xxx.xx.xx.xx’ 
    Offending key for IP in /home/gbrudell3/.ssh/known_hosts:1 
    Are you sure you want to continue connecting (yes/no)? 
    This is expected! Simply type “yes” to continue!
    • Depending on your local SSH client settings, you may also be prevented from logging in and need to edit your .ssh/known_hosts to remove the old key (for example, with ssh-keygen -R <hostname>).
  • Jupyter and VNC: We do not currently have replacement Jupyter or VNC scripts for the new Slurm environment; we will be working on a solution over the coming weeks.
  • MPI: For researchers using mvapich2 under the Slurm environment, adding --constraint=core24 or --constraint=core32 to your job is necessary to ensure a homogeneous node allocation (the constraint names reflect the number of CPUs per node); see the example job header after this list.
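To tie the scheduler-related items above together, a job header on the new Slurm environment might look like the following sketch. The account string, node counts, walltime, and module name are placeholders, and the note on GPU type strings is an assumption to be confirmed against PACE documentation.

    #!/bin/bash
    #SBATCH -J mpi_example
    #SBATCH -A cgts-<PI username>-<project>-<account>   # new account format replacing "GT-"
    #SBATCH -N 4
    #SBATCH --ntasks-per-node=24
    #SBATCH --constraint=core24       # keeps mvapich2 jobs on homogeneous 24-core nodes
    #SBATCH -t 02:00:00
    # For GPU jobs, request a GPU type explicitly (e.g. --gres=gpu:<type>:1) if you
    # want to avoid the default A100 nodes; check PACE documentation for the exact
    # type strings available.

    module load mvapich2              # module name/version is a placeholder
    srun ./my_mpi_app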

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

Thank you for your patience during this extended outage!

The PACE Team

October 18, 2023

PACE Maintenance Period (Oct 24 – Oct 26, 2023) 

Filed under: Uncategorized — Grigori Yourganov @ 10:31 am

WHEN IS IT HAPPENING?

This is a reminder that PACE’s next Maintenance Period starts at 6:00AM on Tuesday, 10/24/2023, and is tentatively scheduled to conclude by 11:59PM on Thursday, 10/26/2023. PACE will release each cluster (Phoenix, Hive, Firebird, ICE, and Buzzard) as soon as maintenance work is complete.

WHAT DO YOU NEED TO DO?

As usual, jobs whose resource requests would have them running during the Maintenance Period will be held by the scheduler until after the maintenance. During this Maintenance Period, access to all PACE-managed computational and storage resources will be unavailable, including Phoenix, Hive, Firebird, ICE, and Buzzard. Please plan accordingly for the projected downtime.
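If you would like to check whether one of your queued jobs will be held, the sketch below may help. It assumes the hold is implemented as a Slurm maintenance reservation, which is how such holds commonly appear; the exact reservation details and pending reason shown on PACE may differ.

    # List any reservations the scheduler has defined (a maintenance reservation
    # covering 10/24-10/26 would show up here).
    scontrol show reservation

    # Show your queued jobs with state and pending reason; jobs held for maintenance
    # typically report a reason such as "ReqNodeNotAvail, Reserved for maintenance".
    squeue -u $USER -o "%.10i %.9P %.20j %.10T %.30R"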

WHAT IS HAPPENING?

ITEMS REQUIRING USER ACTION:

•     [Firebird] Migrate from the Moab/Torque scheduler to the Slurm scheduler. If you are a Firebird user, we will get in touch with you and provide assistance with rewriting your batch scripts and adjusting your workflows to Slurm (a brief command mapping is sketched below).
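For orientation, the most common Torque/Moab commands map to Slurm roughly as follows; this is a general sketch, and the Firebird Migration Guide and the assistance mentioned above remain the authoritative resources.

    qsub my_job.pbs    ->  sbatch my_job.sbatch        # submit a batch job
    qstat -u $USER     ->  squeue -u $USER             # list your queued/running jobs
    qdel <jobid>       ->  scancel <jobid>             # cancel a job
    checkjob <jobid>   ->  scontrol show job <jobid>   # inspect a job's details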

ITEMS NOT REQUIRING USER ACTION:

•     [Network] Upgrade network switches

•     [Network][Hive] Configure redundancy on Hive racks

•     [Network] Upgrade firmware on InfiniBand network switches

•     [Storage][Phoenix] Reconfigure old scratch storage

•     [Storage][Phoenix] Upgrade Lustre controller and disk firmware, apply patches

•     [Datacenter] Datacenter cooling tower cleaning

WHY IS IT HAPPENING?

Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.

WHO IS AFFECTED?

All users across all PACE clusters.

WHO SHOULD YOU CONTACT FOR QUESTIONS?  

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

October 13, 2023

Phoenix Storage and Scheduler Outage

Filed under: Uncategorized — Jeff Valdez @ 9:48 am

[Update 10/13/2023 10:35am] 

Dear Phoenix Users,  

The Lustre scratch storage and the Slurm scheduler on Phoenix went down late yesterday evening (starting around 11pm) and are now back up and available. We have run tests to confirm that the Lustre storage and the Slurm scheduler are running correctly, and we will continue to monitor both for any other issues.

Preliminary analysis by the storage vendor indicates that the outage was caused by a kernel bug that we thought had previously been addressed. As an immediate fix, we have disabled features on the Lustre storage appliance that should avoid triggering another outage, and a long-term patch is planned for our upcoming Maintenance Period (October 24-26).

Existing jobs that were queued have already started or will start soon. You should be able to submit new jobs on the scheduler without issue. Again, we strongly recommend reviewing the output of your jobs that use the Lustre storage (project and scratch directories) as there may be unexpected errors.  We will refund any jobs that failed due to the outage. 
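If you would like to check whether any of your jobs failed during the outage window, a query along the following lines may help. This is a hedged sketch assuming standard Slurm accounting (sacct) is available; adjust the time window and states as needed.

    # List your jobs that ended in a failed or cancelled state around the outage
    # (roughly 11pm on Oct 12 through the morning of Oct 13).
    sacct -u $USER -S 2023-10-12T23:00:00 -E 2023-10-13T11:00:00 \
          --state=FAILED,NODE_FAIL,CANCELLED \
          --format=JobID,JobName,Partition,State,ExitCode,Elapsed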

We apologize for the inconvenience. Thank you for your patience today as we repaired the Phoenix cluster. For questions, please contact PACE at pace-support@oit.gatech.edu.

Thank you, 

-The PACE Team 

[Update 10/13/2023 9:48am] 

Dear Phoenix Users, 

Unfortunately, the Lustre storage on Phoenix became unresponsive late yesterday evening (starting around 11pm). As a result of the storage outage, the Slurm scheduler was also impacted and has become unresponsive.

We have restarted the Lustre storage appliance, and the file system is now available. The vendor is currently running tests to make sure the Lustre storage is healthy, and we will also be running checks on the Slurm scheduler.

Jobs currently running will likely continue running, but we strongly recommend reviewing the output of your jobs that use the Lustre storage (project and scratch) as there may be unexpected errors. Jobs waiting in the queue will remain queued until the scheduler has resumed.

We will continue to provide updates as we complete testing of the Lustre storage and Slurm scheduler.

Thank you, 

-The PACE Team  
