PACE - A Partnership for an Advanced Computing Environment

July 26, 2023

PACE Maintenance Period (Aug 8 – Aug 10, 2023) 

Filed under: Uncategorized — Jeff Valdez @ 4:39 pm

[Update 8/11/2023 8:33pm]

The controller replacement on the scratch storage system successfully passed four rounds of testing. Phoenix is back in production and is ready for research. We have released all jobs that were held by the scheduler. Please let us know if you have any problems.

I apologize for the inconvenience, but I believe this delayed return to production will help decrease future downtime.

The next planned maintenance period for all PACE clusters is October 24, 2023, at 6:00 AM through October 26, 2023, at 11:59 PM. Additional maintenance periods are tentatively scheduled for January 23-25, 2024, and May 7-9, 2024.

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

Thank you, 

Pam Buffington 

PACE Director 

[Update 8/10/2023 5:00pm]

The Hive, ICE, Firebird, and Buzzard clusters are now ready for research. We have released all jobs that were held by the scheduler. 

Unfortunately, Phoenix storage issues continue. All maintenance work was completed, but the scratch storage failed initial stress tests. The vendor is sending us a replacement controller, which will arrive and be installed early tomorrow. We will then stress-test the storage again. If it passes, Phoenix will be brought back into production. If it fails, we will revert to the scratch infrastructure in use prior to May 2023 while we look for a new solution. We have already begun syncing data back to that system, but transferring roughly 800TB will take approximately one week, so Phoenix would return to production while the scratch filesystem is still syncing. Not all files will be present immediately; they will reappear as the sync completes. In the meantime, you may also encounter files that were in your scratch directory prior to the May maintenance period but have since been deleted; these will disappear as the sync finishes.  

The monthly deletion of old scratch directories scheduled for next week is canceled. Please disregard the notification you may have received last week.  

I apologize for the inconvenience, but I believe this delay will help decrease future downtime.  

The next planned maintenance period for all PACE clusters is October 24, 2023, at 6:00 AM through October 26, 2023, at 11:59 PM. 

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

Thank you, 

Pam Buffington 

PACE Director 

[Update 8/8/2023 6:00am]

The PACE Maintenance Period is starting now, at 6:00AM on Tuesday, 08/08/2023, and is tentatively scheduled to conclude by 11:59PM on Thursday, 08/10/2023.

[Update 8/7/2023 12:00pm]

This is a reminder that PACE’s next Maintenance Period starts at 6:00AM on Tuesday, 08/08/2023, and is tentatively scheduled to conclude by 11:59PM on Thursday, 08/10/2023.

[Update 8/2/2023 1:43pm]

WHEN IS IT HAPPENING?

This is a reminder that PACE’s next Maintenance Period starts at 6:00AM on Tuesday, 08/08/2023, and is tentatively scheduled to conclude by 11:59PM on Thursday, 08/10/2023. PACE will release each cluster (Phoenix, Hive, Firebird, ICE, and Buzzard) as soon as maintenance work is complete.

WHAT DO YOU NEED TO DO?  

As usual, any job whose requested walltime would overlap the Maintenance Period will be held by the scheduler until the maintenance is complete. During this Maintenance Period, all PACE-managed computational and storage resources will be unavailable, including Phoenix, Hive, Firebird, ICE, and Buzzard. Please plan accordingly for the projected downtime.
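
For example, on the Slurm-based clusters you can check whether a pending job is being held for the maintenance window and, if there is still enough time, resubmit it with a shorter walltime so it can finish before maintenance begins. This is only a general sketch: the job script name is a placeholder, and the exact reason string reported by the scheduler may vary.

    # Show your pending jobs with their time limits and the reason they have not started;
    # jobs blocked by the maintenance window typically show a reservation-related reason.
    squeue -u $USER -t PENDING -o "%.10i %.15j %.10l %.25r"

    # Resubmit with a walltime short enough (here, 4 hours) to complete before 6:00AM on 08/08.
    sbatch --time=04:00:00 my_job.sbatch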

WHAT IS HAPPENING? 

ITEMS REQUIRING USER ACTION: 

  • [Phoenix] Create Interactive CPU and GPU partitions on Phoenix 

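Once the interactive partitions are in place, researchers will be able to request interactive sessions on them through Slurm. The sketch below is illustrative only: the partition names shown (cpu-interactive and gpu-interactive) are placeholders, and the actual names and any required accounting flags will be announced in the PACE documentation.

    # Hypothetical example: request a 1-hour interactive CPU session
    salloc --partition=cpu-interactive --ntasks=1 --time=01:00:00

    # Hypothetical example: request a 1-hour interactive session with one GPU
    salloc --partition=gpu-interactive --gres=gpu:1 --ntasks=1 --time=01:00:00
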
ITEMS NOT REQUIRING USER ACTION: 

  • [Phoenix, Hive, ICE] Re-image login nodes to increase /tmp to 20GB 
  • [Phoenix, Hive, ICE] Open XDMoD to campus 
  • [Phoenix] Replace Phoenix project storage controller 
  • [Firebird] Upgrade firewall device firmware supporting CUI 
  • [Firebird] Add additional InfiniBand switches and cables to increase redundancy and capacity 
  • [OSG][Network] Move ScienceDMZ VRF to new network fabric 
  • [Network] Install leaf module to InfiniBand director switch 
  • [Network] Configure VPC pair redundancy to Research hall network switches 
  • [Network][Firebird] Install high-speed IB NIC on storage appliance for improved performance and capacity 
  • [Storage] DDN Controller firmware & Disk firmware upgrade
  • [Storage] Reboot the backup controller to synchronize with the main controller 
  • [Storage] Increase storage capacity for PACE backup servers 
  • [Storage] Increase storage capacity for EAS group storage servers 
  • [Storage] Replace cables on storage controller
  • [Software] Move pace-apps to Slurm on admin nodes 
  • [Datacenter] Datacenter cooling maintenance

WHY IS IT HAPPENING? 

Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.

WHO IS AFFECTED? 

All users across all PACE clusters.

WHO SHOULD YOU CONTACT FOR QUESTIONS?  

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

Thank you, 

-The PACE Team 

[Update 7/26/2023 4:39pm]

WHEN IS IT HAPPENING? 

PACE’s next Maintenance Period starts at 6:00AM on Tuesday, 08/08/2023, and is tentatively scheduled to conclude by 11:59PM on Thursday, 08/10/2023. PACE will release each cluster (Phoenix, Hive, Firebird, ICE, and Buzzard) as soon as maintenance work is complete. 

WHAT DO YOU NEED TO DO?  

As usual, any job whose requested walltime would overlap the Maintenance Period will be held by the scheduler until the maintenance is complete. During this Maintenance Period, all PACE-managed computational and storage resources will be unavailable, including Phoenix, Hive, Firebird, ICE, and Buzzard. Please plan accordingly for the projected downtime. 

WHAT IS HAPPENING? 

ITEMS NOT REQUIRING USER ACTION: 

  • [Phoenix, Hive, ICE] Re-image login nodes to increase /tmp to 20GB 
  • [Phoenix, Hive, ICE] Open XDMoD to campus 
  • [Phoenix] Replace Phoenix project storage controller 
  • [Firebird] Upgrade firewall device firmware supporting CUI 
  • [Firebird] Add additional InfiniBand switches and cables to increase redundancy and capacity 
  • [OSG][Network] Move ScienceDMZ VRF to new network fabric 
  • [Network] Install leaf module to InfiniBand director switch 
  • [Network] Configure VPC pair redundancy to Research hall network switches 
  • [Network][Firebird] Install high-speed IB NIC on storage appliance for improved performance and capacity 
  • [Storage] Reboot the backup controller to synchronize with the main controller 
  • [Storage] Increase storage capacity for PACE backup servers 
  • [Storage] Increase storage capacity for EAS group storage servers 
  • [Storage] Replace cables on storage controller 
  • [Datacenter] Datacenter cooling maintenance 

WHY IS IT HAPPENING? 

Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system. 

WHO IS AFFECTED? 

All users across all PACE clusters.

WHO SHOULD YOU CONTACT FOR QUESTIONS?  

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

Thank you, 

-The PACE Team 

July 24, 2023

Hive Storage SAS Cable Replacement

Filed under: Uncategorized — Jeff Valdez @ 3:13 pm

[Update 7/25/2023 1:04pm]
The SAS cable has been replaced with no interruption to production.

[Update 7/24/2023 3:13pm]
Hive Storage SAS Cable Replacement

WHAT’S HAPPENING?

One SAS cable between the storage enclosure and the controller for Hive storage needs to be replaced. The replacement will take about 2 hours.

WHEN IS IT HAPPENING?

Tuesday, July 25th, 2023 starting at 10AM EDT.

WHY IS IT HAPPENING?

Required maintenance.

WHO IS AFFECTED?

All Hive users. There is a potential for a storage access outage during the replacement, followed by temporarily decreased performance.

WHAT DO YOU NEED TO DO?

During the cable replacement, one of the controllers will be shut down and the redundant controller will handle all traffic. Data access should be preserved, but there have been cases where storage became inaccessible. If storage becomes unavailable during the replacement, your job may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage access is restored.
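
If you need to do this, a minimal Slurm sketch follows; the job ID and script name are placeholders.

    # List your jobs and note the ID of any job that is failing or not making progress
    squeue -u $USER

    # Cancel the affected job (replace 123456 with the actual job ID)
    scancel 123456

    # Resubmit once storage access is restored
    sbatch my_job.sbatch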

WHO SHOULD YOU CONTACT FOR QUESTIONS?

For any questions, please contact PACE at pace-support@oit.gatech.edu.

July 21, 2023

Phoenix Project Storage & Login Node Outage

Filed under: Uncategorized — Jeff Valdez @ 9:46 am

[Update 7/21/2023 3:30pm]

Dear Phoenix Users,

The Lustre project storage filesystem on Phoenix is back up and available. We have completed cable replacements, reseated and replaced a couple of hard drives, and restarted the controller. We have run tests to confirm that the storage is running correctly. Performance may still be degraded while redundant drives rebuild, but it is better than it was over the last few days.

Phoenix’s head nodes, which were unresponsive earlier this morning, are available again without issue. We will continue to monitor the login nodes for any other issues.

You should be able to start jobs on the scheduler without issue. We will refund any job that failed after 8:00 AM this morning due to the outage.

Thank you for your patience today as we repaired the Phoenix cluster. For questions, please contact PACE at pace-support@oit.gatech.edu.

[Original Post 7/21/2023 9:46 am]

Summary: The Lustre project storage filesystem on Phoenix became unresponsive this morning. Researchers may be unable to access data in their project storage. Multiple Phoenix login nodes have also become unresponsive, which may also prevent logins. We have paused the scheduler, preventing new jobs from starting, while we investigate.

Details: The PACE team is currently investigating an outage on the Lustre project storage filesystem for Phoenix. The cause is not yet known, but PACE is working with the vendor to find a resolution.

Impact: The project storage filesystem may not be reachable at this time, so read, write, or ls attempts on project storage may fail, including via Globus. This may impact logins as well. Job scheduling is now paused, so jobs can be submitted, but no new jobs will start. Jobs that were already running will continue, though jobs using project storage may not progress.

Thank you for your patience as we continue investigating this issue. Please contact us at pace-support@oit.gatech.edu with any questions.

July 13, 2023

Phoenix Project Storage & Login Node Outage

Filed under: Uncategorized — Michael Weiner @ 10:19 am

[ Update 7/18/2023 4:00 PM]

Summary: Phoenix project storage performance is degraded as redundant drives rebuild. The process may continue for several more days. Scratch storage is not impacted, so tasks may proceed more quickly if run on the scratch filesystem.

Details: During and after the storage outage last week, several redundant drives on the Phoenix project storage filesystem failed. The system is rebuilding the redundant array across additional disks, which is expected to take several more days. Researchers may wish to copy necessary files to their scratch directories or to local disk and run jobs from there for faster performance. In addition, we continue working with our storage vendor to identify the cause of last week’s outage.
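
As a rough sketch, copying input data from project storage to scratch and submitting from there might look like the following; the directory and script names are placeholders, and Globus can also be used for large transfers.

    # Copy input data from project storage to your scratch directory (paths are placeholders)
    rsync -a ~/p-mylab-0/my_dataset/ ~/scratch/my_dataset/

    # Run the job from the scratch copy so reads and writes avoid project storage during the rebuild
    cd ~/scratch/my_dataset
    sbatch my_job.sbatch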

Impact: Phoenix project storage performance is degraded for both read & write, which may continue for several days. Home and scratch storage are not impacted. All data on project storage is accessible.

Thank you for your patience as the process continues. Please contact us at pace-support@oit.gatech.edu with any questions.

[Update 7/13/2023 2:57 PM]

Phoenix’s head nodes, which were unresponsive earlier this morning, have been rebooted and are available again without issue. We will continue to monitor the login nodes for any other issues.

Regarding the failed redundant drives, we have replaced the control cables and reseated a few hard drives. We have run tests to confirm that the storage is running correctly.

You should be able to start jobs on the scheduler without issue. We will refund any job that failed after 8:00 AM due to the outage.

[Update 7/13/2023 12:20 PM]

Failed redundant drives caused an object storage target to become unreachable. We are replacing controller cables to restore access.

[Original Post 7/13/2023 10:20 AM]

Summary: The Phoenix project storage filesystem became unresponsive this morning. Researchers may be unable to access data in their project storage. We have paused the scheduler, preventing new jobs from starting, while we investigate. Multiple Phoenix login nodes have also become unresponsive, which may have prevented logins.

Details: The PACE team is currently investigating an outage on the Lustre project storage filesystem for Phoenix. The cause is not yet known. We have also rebooted several Phoenix login nodes that had become unresponsive to restore ssh access.

Impact: The project storage filesystem may not be reachable at this time, so read, write, or ls attempts on project storage may fail, including via Globus. Job scheduling is now paused, so jobs can be submitted, but no new jobs will start. Jobs that were already running will continue, though jobs using project storage may not progress. Some login attempts this morning may have hung.

Thank you for your patience as we continue investigating this issue. Please contact us at pace-support@oit.gatech.edu with any questions.
