PACE A Partnership for an Advanced Computing Environment

April 22, 2024

PACE Maintenance Period (May 07 – May 10, 2024) 

Filed under: Uncategorized — Eric Coulter @ 9:53 am

[Update 05/09/24 04:25 PM]

Dear PACE users,   

Maintenance on the Phoenix, Hive, Firebird, and OSG Buzzard clusters has been completed, and all four clusters are back in production and ready for research; all jobs that were held by the scheduler have been released. 

The ICE cluster is still under maintenance due to the RHEL9 migration, but we expect it to be ready tomorrow. Instructors teaching summer courses will be notified when it is ready. 

The POSIX user group names on the Phoenix, Hive, Firebird, and OSG Buzzard clusters have been updated so that they now start with the “pace-” prefix. If your scripts or workflows rely on POSIX group names, they will need to be updated; otherwise, no action is required on your part. This is a step towards tighter integration of PACE systems with central IAM tools, which will lead to improvements across the board in the PACE user experience.
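If you want to check whether you are affected, a minimal sketch along these lines may help (the script path and the group names are illustrative, not actual PACE names):

    # List the group names your account belongs to; renamed groups now
    # carry the "pace-" prefix, while numerical GIDs are unchanged.
    id -Gn

    # Search your scripts for hard-coded group names, e.g. in chgrp,
    # newgrp, or sg calls ("~/scripts" is an illustrative path).
    grep -rnE 'chgrp|newgrp|sg ' ~/scripts

    # Example fix, assuming a group that was renamed from "gt-example"
    # to "pace-gt-example" (both names are hypothetical).
    chgrp -R pace-gt-example ~/shared_results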

As a reminder, the next Maintenance Period will be August 6-8, 2024.

Thank you for your patience! 

-The PACE Team 

[Update 05/07/24 06:00 AM]

The PACE Maintenance Period begins now, at 6:00 AM on Tuesday, 05/07/2024, and is tentatively scheduled to conclude by 11:59 PM on Friday, 05/10/2024.

[Update 05/01/24 06:37 PM]

WHEN IS IT HAPPENING?

PACE’s next Maintenance Period starts at 6:00 AM on Tuesday, May 7th, 05/07/2024, and is tentatively scheduled to conclude by 11:59 PM on Friday, May 10th, 05/10/2024. An extra day is needed to accommodate physical work done by Databank in the Coda Data Center. PACE will release each cluster (Phoenix, Hive, Firebird, ICE, and Buzzard) as soon as maintenance work is complete. 

WHAT DO YOU NEED TO DO?

As usual, the scheduler will hold any job whose requested wall time would overlap the Maintenance Period, releasing it once maintenance concludes. During this Maintenance Period, access to all PACE-managed computational and storage resources will be unavailable, including Phoenix, Hive, Firebird, ICE, CEDAR, and Buzzard. Please plan accordingly for the projected downtime. 
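If you would like a job to start before the window rather than be held, one option is to request a wall time that ends before maintenance begins; a minimal sketch, with an illustrative time limit and workload:

    #!/bin/bash
    # Submitted the day before maintenance, this job can still start
    # because its 12-hour limit expires before 6:00 AM on 05/07/2024;
    # a longer request would be held until after maintenance.
    #SBATCH --job-name=pre-maintenance-run
    #SBATCH --time=12:00:00

    ./my_analysis    # hypothetical workload

Jobs held for the window typically show a pending reason in squeue output that mentions the maintenance reservation.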

WHAT IS HAPPENING?

ITEMS REQUIRING USER ACTION: 

  • [all] During the maintenance period, the PACE team will rename all POSIX user groups so that names will start with the “pace-” prefix.
    • This will NOT affect numerical GIDs, but if your scripts or workflows rely on group names, they will need to be updated.
    • If you don’t use POSIX user group names in your scripts or workflows, no action is required on your part.
    • This is a step towards tighter integration of PACE systems with central IAM tools, which will lead to improvements across the board in the PACE user experience.
    • NOTE: This item was originally planned for January but was delayed to avoid integration issues with IAM services, which have now been resolved.
  • [ICE] Migrate to the RHEL 9.3 operating system – if you need access to ICE for any summer courses, please let us know! 
    • The ICE login nodes will be updated to RHEL 9.3 as well, and this WILL create new ssh host-keys on the ICE login nodes – so please expect a “WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!” message when you (or your students) next access ICE after maintenance; see the sketch after this list for clearing the old key. 
  • [ICE] We will be retiring 8 of the RTX6000 GPU nodes on ICE to prepare for the addition of several new L40 nodes the week after the maintenance period. 
  • [software] Sync Gaussian and VASP on RHEL7 pace-apps.
  • [software] Sync any remaining RHEL9 pace-apps for the OS migration.
  • [Phoenix, ICE] Upgrade Nvidia drivers on all HGX/DGX servers.
  • [Hive] The scratch deleter will not run in May and June but will resume in July.
  • [Phoenix] The scratch deleter will not run in May but will resume in June.
  • [ICE] The scratch deleter will run for Spring semester deletion during the week of May 13.
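
For the ICE host-key change above, clearing the stale key before reconnecting keeps the warning from blocking your login; a minimal sketch (the hostname is illustrative – use the ICE login hostname you normally connect to):

    # Remove the old ICE host key from your known_hosts file.
    ssh-keygen -R login-ice.pace.gatech.edu

    # Reconnect and accept the new host key when prompted.
    ssh yourusername@login-ice.pace.gatech.edu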

ITEMS NOT REQUIRING USER ACTION: 

  • [datacenter] Databank maintenance: replace all components of the cold-loop water pump that had issues a couple of maintenance periods ago.  
  • [Hive] Upgrade the underlying GPFS filesystem to version 5.1 in preparation for RHEL9.
  • [datacenter] Repairs to one InfiniBand switch and two DDN storage controllers with degraded BBUs (Battery Backup Units).
  • [datacenter] Upgrade storage controller firmware for DDN appliances to SFA 12.4.0. 
  • [Hive] Consolidate all the ICE access entitlements into a single one, all-pace-ice-access.
  • [Hive] Upgrade Hive compute nodes to GPFS 5.1.
  • [Phoenix] Replace cables for the Phoenix storage server.
  • [Firebird] Patch the Firebird storage server to the 100GbE switch and reconfigure.
  • [Firebird, Hive] Deploy Slurm scheduler CLI+Feature bits on Firebird and Hive. 
  • [datacenter] Configure LDAP on the MANTA NetApp HPCNA SVM.

WHY IS IT HAPPENING?  

Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.  

WHO IS AFFECTED?  

All users across all PACE clusters.  

WHO SHOULD YOU CONTACT FOR QUESTIONS?   

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns. You may read this message on our blog.

Thank you,  

-The PACE Team 

[Update 04/22/24 09:53 AM]

WHEN IS IT HAPPENING?  

PACE’s next Maintenance Period starts at 6:00 AM on Tuesday, May 7th, 05/07/2024, and is tentatively scheduled to conclude by 11:59 PM on Friday, May 10th, 05/10/2024. The additional day is needed to accommodate physical work carried out by Databank in the Coda datacenter. PACE will release each cluster (Phoenix, Hive, Firebird, ICE, and Buzzard) as soon as maintenance work is complete. 

WHAT DO YOU NEED TO DO?   

As usual, the scheduler will hold any job whose requested wall time would overlap the Maintenance Period, releasing it once maintenance concludes. During this Maintenance Period, access to all PACE-managed computational and storage resources will be unavailable, including Phoenix, Hive, Firebird, ICE, CEDAR, and Buzzard. Please plan accordingly for the projected downtime. 

WHAT IS HAPPENING?   

ITEMS REQUIRING USER ACTION: 

  • [all] During the maintenance period, the PACE team will rename all POSIX user groups so that names will start with the “pace-” prefix.  
    • This will NOT affect numerical GIDs, but if your scripts or workflows rely on group names, they will need to be updated. 
    • If you don’t use POSIX user group names in your scripts or workflows, no action is required on your part. 
    • This is a step towards tighter integration of PACE systems with central IAM tools, which will lead to improvements across the board in the PACE user experience.  
    • NOTE: This item was originally planned for January, but was delayed to avoid integration issues with IAM services, which have now been resolved.
  • [ICE] Migrate to the RHEL 9.3 operating system – if you need access to ICE for any summer courses, please let us know! 
    • Note – This WILL create new ssh host-keys on ICE login nodes – so please expect a message that “WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!” when you (or your students) next access ICE after maintenance.

ITEMS NOT REQUIRING USER ACTION: 

  • [datacenter] Databank maintenance: replace all components of cold loop water pump that had issues a couple of maintenance periods ago.  
  • [Hive] Upgrade the underlying GPFS filesystem to version 5.1 in preparation for RHEL9. 
  • [datacenter] Repairs to one InfiniBand switch and two DDN storage controllers with degraded BBUs (Battery Backup Units). 
  • [datacenter] Upgrade storage controller firmware for DDN appliances to SFA 12.4.0. 

WHY IS IT HAPPENING?  

Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.  

WHO IS AFFECTED?  

All users across all PACE clusters.  

WHO SHOULD YOU CONTACT FOR QUESTIONS?   

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

April 8, 2024

Phoenix A100 CPU:GPU Ratio Change

Filed under: Uncategorized — Michael Weiner @ 5:45 pm

On Phoenix, the default number of CPUs assigned to jobs requesting an Nvidia Tensor Core A100 GPU has recently changed. Now, jobs requesting one or more A100 GPUs will be assigned 8 cores per GPU by default, rather than 32 cores per GPU. You may still request up to 32 cores per GPU if you wish by using the --ntasks-per-node flag in your SBATCH script or salloc command to specify the number of CPUs per node your job requires. Any request with a CPU:GPU ratio of at most 32 will be honored.
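For example, a minimal SBATCH sketch requesting the full 32-core ratio (the GPU type string and QOS name are illustrative – check the PACE documentation for the exact values used on Phoenix):

    #!/bin/bash
    # Request one A100 GPU and explicitly ask for 32 CPUs instead of
    # the new default of 8; any ratio up to 32 CPUs per GPU is honored.
    #SBATCH --gres=gpu:A100:1
    #SBATCH --ntasks-per-node=32
    #SBATCH --qos=inferno

    ./my_gpu_job    # hypothetical workload

The same request can be made interactively, e.g. salloc --gres=gpu:A100:1 --ntasks-per-node=32.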

12 of our Phoenix A100 nodes host 2 GPUs and 64 CPUs (AMD Epyc 7513), supporting a CPU:GPU ratio of up to 32, and can be allocated through both the inferno (default priority) and embers (free backfill) QOSs. We have recently added 1 more A100 node with 8 GPUs and 64 CPUs (AMD Epyc 7543); since its 64 CPUs are shared among 8 GPUs, only 8 CPUs are available per GPU on this node, which required the change to the default ratio. This new node is available only to jobs using the embers QOS due to the funding for its purchase.

Please visit our documentation to learn more about GPU requests and QOS or about compute resources on Phoenix and contact us with any questions about this change.

April 4, 2024

PACE clusters unreachable on the morning of April 4, 2024

Filed under: Uncategorized — Grigori Yourganov @ 10:54 am

The PACE clusters were not accepting new connections from 4 AM until 10 AM today (April 4, 2024). As part of the preparations to migrate the clusters to a new version of the operating system (Red Hat Enterprise Linux 9), an entry in the configuration management system from the development environment was accidentally applied to production, which placed the /etc/nologin file on the head nodes. This has been fixed, and additional controls are in place to prevent a recurrence. 
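For context, the mechanism involved is standard Linux behavior rather than anything PACE-specific; a minimal illustration (not something to run on a shared system):

    # When /etc/nologin exists, PAM's pam_nologin module refuses new
    # non-root logins and shows the file's contents, while existing
    # sessions and running jobs continue unaffected.
    echo "System under maintenance" | sudo tee /etc/nologin

    # Removing the file restores normal logins.
    sudo rm /etc/nologin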

Jobs and data transfers running during that period were not affected, nor were interactive sessions that started before the configuration change. 

The clusters are now back online, and the scheduler is accepting jobs. We sincerely apologize for this accidental disruption. 
