PACE A Partnership for an Advanced Computing Environment

February 19, 2025

[Advance Notice] Planned Spring-Break (March 17-21st) Downtime

Filed under: Uncategorized — Eric Coulter @ 5:10 pm

[Update 3/6/25]

Summary: All PACE compute nodes will be unavailable from 4:00 PM on Friday, March 14, through Tuesday, March 18, to repair a water leak in the Coda Datacenter Research Hall cooling system. Access to login nodes and data will remain available.  

Details: Due to a water leak discovered last month, a seal in the cooling system will be replaced at the start of Spring Break. A full replacement of the pump is planned for the May maintenance period, once all parts are available; that maintenance period will be extended by one day and is now planned for May 6-9, 2025. Non-compute nodes in the Enterprise Hall will not be impacted by the Spring Break repair.

Impact: During the outage, it will not be possible to run any compute jobs on any PACE cluster (Phoenix, Hive, ICE, Firebird, Buzzard). Login nodes and storage systems will remain available. A reservation has been placed on all schedulers to prevent any jobs from starting if their walltime request extends past 4:00 PM on March 14; these jobs will be held until maintenance is complete.  
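To check whether a job's walltime request would run into the reservation, the arithmetic can be sketched as below. This is only an illustration, not a PACE tool: the function names are hypothetical, the job is assumed to start at the time you pass in (a queued job would start later and have less room), and a job ending exactly at 4:00 PM is treated as not overlapping, which the notice does not specify. The formatted value uses the days-hours:minutes:seconds form that Slurm accepts for a time limit.

```python
from datetime import datetime, timedelta

# Start of the scheduler reservation, taken from the notice (local time).
RESERVATION_START = datetime(2025, 3, 14, 16, 0)


def max_walltime(start_time: datetime) -> timedelta:
    """Longest walltime a job starting at start_time can request
    and still finish by the reservation start (never negative)."""
    return max(RESERVATION_START - start_time, timedelta(0))


def will_be_held(start_time: datetime, walltime: timedelta) -> bool:
    """True if a job starting at start_time with this walltime
    would extend past the reservation start (assumption: strict overlap)."""
    return start_time + walltime > RESERVATION_START


def format_walltime(delta: timedelta) -> str:
    """Format a duration as D-HH:MM:SS."""
    total = int(delta.total_seconds())
    days, rem = divmod(total, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{days}-{hours:02d}:{minutes:02d}:{seconds:02d}"


# Example: a job that would start at 10:00 AM on Thursday, March 13.
start = datetime(2025, 3, 13, 10, 0)
print(format_walltime(max_walltime(start)))      # 1-06:00:00
print(will_be_held(start, timedelta(hours=48)))  # True
```

Requesting a walltime at or below the printed value lets a job in this example run before the outage instead of being held until maintenance is complete.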

Thank you for your patience as we work to restore full functionality of the cooling system. You can read this message on our blog.  

Best, 

-The PACE Team 

[Original Post 2/19/25]

Summary: A water leak has occurred in the Coda Datacenter Research Hall cooling system due to the failure of a pump seal. As a result, we are planning a two- or three-day outage (pending confirmation from Databank and the mechanical contractor) during the week of March 17-21, which we hope will have less impact because it falls during Spring Break. Access to login nodes and data will remain available, as these live in a different part of the datacenter. No compute services (Phoenix, Hive, ICE, Firebird, or Buzzard compute nodes) will be available. We will follow up once the exact days of the outage are finalized.

Details: A pump seal in the Coda Research Hall cooling system failed on Feb 16th. The leak is not currently impacting operation of any PACE resources. Databank is working on a full pump replacement (“flange-to-flange”) plan to address the issue. Databank is actively sourcing the pump and associated parts and coordinating with a new mechanical contractor. We are currently targeting Spring Break (March 17-21) for the pump replacement. However, this target date could change based on supply chain constraints. The mechanical work is estimated to take one to two days (depending on whether additional damage or issues are identified during the pump replacement). Upon completion of the work, the PACE team will need one business day to conduct all necessary testing on the ~2,000 systems and release the five clusters currently hosted in the Research Hall (Phoenix, Hive, ICE/AI Makerspace, Firebird, and OSG Buzzard).

Being able to perform the work during Spring Break represents a best-case scenario. Databank is actively monitoring the leak and the overall health of the cooling system. Should the situation deteriorate quickly or a catastrophic failure occur, Databank will coordinate emergency repair work to replace the pump seal itself using available on-site spare parts. Under this scenario, a complete pump replacement would be coordinated during the planned PACE Maintenance period in May.  
 
We are striving to keep the shutdown as short as possible. A reservation has been placed on the clusters so that no running jobs are cancelled by the shutdown; as a result, some jobs will be held until the outage is over.

Thank you for your patience as we work to recover from this situation.

Best, 

-The PACE Team 

February 4, 2025

[Notice] Phoenix Scheduler Account Issue

Filed under: Uncategorized — Eric Coulter @ 4:55 pm

Following the Slurm upgrade during the January 2025 maintenance window, the monthly usage reset did not execute as scheduled on February 1. Consequently, reported balances were lower than expected, because the reported usage still included January utilization. After identifying the issue, the PACE team manually reset usage across all accounts at 12:00 PM on February 4. Additionally, the vendor has been notified of the bug so that a patch can be provided before the next monthly cycle.

Any jobs that ran from the beginning of February until now will not count towards February usage, and any overages of pre-set limits will be refunded.  

We sincerely apologize for the inconvenience.  

Thank you and have a great day! 

PACE team 
