PACE A Partnership for an Advanced Computing Environment

January 31, 2024

Phoenix Scheduler Outage

Filed under: Uncategorized — Michael Weiner @ 11:44 am

Summary: The Slurm scheduler on Phoenix is experiencing an intermittent outage.

Details: The scheduler is repeatedly freezing due to a problematic input. The PACE team has identified the likely cause and is attempting to restore functionality.

Impact: Commands like squeue and sinfo may report errors, and new jobs may not start on Phoenix. Already-running jobs are not impacted. Other clusters (Hive, ICE, Firebird, Buzzard) are not impacted.

Thank you for your patience as we work to restore Phoenix to full functionality. Please contact us at pace-support@oit.gatech.edu with any questions. You may track the status of this outage on the GT Status page.

January 25, 2024

PACE Maintenance Period (Jan 23 – Jan 25, 2024) is over 

Filed under: Uncategorized — Grigori Yourganov @ 10:55 am

Dear PACE users,  

The maintenance on the Phoenix, Hive, Firebird, and ICE clusters has been completed, and these clusters are back in production and ready for research. All jobs that were held by the scheduler have been released. The OSG Buzzard cluster is still under maintenance, and we expect it to be ready next week.

The POSIX group names on the Phoenix, Hive, Firebird, and ICE clusters have not been updated, due to factors within the IAM team. This update is now scheduled for our next maintenance period, May 7-9, 2024.

Thank you for your patience!   

  

The PACE Team 

January 18, 2024

NetApp Storage Outage

Filed under: Uncategorized — Michael Weiner @ 5:22 pm

[Update 1/18/24 6:30 PM]

Access to storage has been restored, and all systems have full functionality. The Phoenix and ICE schedulers have been resumed, and queued jobs will now start.

Please resubmit any jobs that may have failed. If a running job is no longer progressing, please cancel and resubmit.

The cause of the outage was identified as an update this afternoon to resolve a specific permissions issue affecting some users on the ICE shared directories. The update has been reverted.

Thank you for your patience as we resolved this issue.

[Original Post 1/18/24 5:20 PM]

Summary: An outage on PACE NetApp storage devices is affecting the Phoenix and ICE clusters. Home directories and software are not accessible.

Details: At approximately 5:00 PM, an issue began affecting access to NetApp storage devices on PACE. The PACE team is investigating at this time.

Impact: All storage devices provided by NetApp services are currently unreachable. This includes home directories on Phoenix and ICE, the pace-apps software repository on Phoenix and ICE, and course shared directories on ICE. Users may encounter errors upon login due to inaccessible home directories. We have paused the schedulers on Phoenix and ICE, so no new jobs will start. The Hive and Firebird clusters are not affected.

Please contact us at pace-support@oit.gatech.edu with any questions.

January 12, 2024

PACE Maintenance Period (Jan 23 – Jan 25, 2024) 

Filed under: Uncategorized — Grigori Yourganov @ 3:53 pm

WHEN IS IT HAPPENING?  

PACE’s next Maintenance Period starts at 6:00AM on Tuesday, 01/23/2024, and is tentatively scheduled to conclude by 11:59PM on Thursday, 01/25/2024. PACE will release each cluster (Phoenix, Hive, Firebird, ICE, and Buzzard) as soon as maintenance work is complete. 

WHAT DO YOU NEED TO DO?   

As usual, jobs with resource requests that would have them running during the Maintenance Period will be held by the scheduler until after the maintenance. During this Maintenance Period, all PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, ICE, CEDAR, and Buzzard. Please plan accordingly for the projected downtime.

WHAT IS HAPPENING?   

ITEMS REQUIRING USER ACTION: 

  • [all] During the maintenance period, the PACE team will rename all POSIX user groups so that names will start with the “pace-” prefix.
    • This will NOT affect numerical GIDs, but if your scripts or workflows rely on group names, they will need to be updated. 
    • If you don’t use POSIX user group names in your scripts or workflows, no action is required on your part. 
    • This is a step towards tighter integration of PACE systems with central IAM tools, which will lead to improvements across the board in the PACE user experience.  
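For workflows that currently depend on group names, one option is to key on the numerical GID instead, since the GID is stated above to be unaffected by the rename. The following is a minimal Python sketch under that assumption; the helper names `group_name_for_gid` and `file_in_group` are hypothetical and are not part of any PACE tooling.

```python
import grp
import os


def group_name_for_gid(gid):
    """Return the current name of the group with this numerical GID.

    Returns None if the system has no group entry for the GID.
    """
    try:
        return grp.getgrgid(gid).gr_name
    except KeyError:
        return None


def file_in_group(path, gid):
    """Check group ownership of a path by GID rather than by group name."""
    return os.stat(path).st_gid == gid


if __name__ == "__main__":
    # Look up the name of the caller's primary group at run time instead of
    # hard-coding it, so the script keeps working if the name changes.
    gid = os.getgid()
    print(gid, group_name_for_gid(gid))
```

Because the name is resolved from the GID each time the script runs, a renamed group should not require any change to code written this way.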

ITEMS NOT REQUIRING USER ACTION: 

  • [datacenter] Databank maintenance: Replace pump impeller, cooling tower maintenance 
  • [storage] Install NFS over RDMA kernel module to enable pNFS for access to VAST storage test machine 
  • Replace two UPS units for SFA14KXE controllers 
  • [storage] Upgrade firmware on DDN SFA14KXE controllers 
  • [storage] Upgrade DDN 400NV ICE storage controllers and servers 
  • [Phoenix, Hive, ICE, Firebird] Upgrade all clusters to Slurm version 23.11.X 

WHY IS IT HAPPENING?  

Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.  

WHO IS AFFECTED?  

All users across all PACE clusters.  

WHO SHOULD YOU CONTACT FOR QUESTIONS?   

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

Thank you,  

-The PACE Team 
