[Update 8/11/2023 8:33pm]
The controller replacement on the scratch storage system successfully passed four rounds of testing. Phoenix is back in production and is ready for research. We have released all jobs that were held by the scheduler. Please let us know if you have any problems.
I apologize for the inconvenience, but I believe this delayed return to production will help decrease future downtime.
The next planned maintenance period for all PACE clusters is October 24, 2023, at 6:00 AM through October 26, 2023, at 11:59 PM. Additional maintenance periods are tentatively scheduled for January 23-25, 2024, and May 7-9, 2024.
Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.
Thank you,
Pam Buffington
PACE Director
[Update 8/10/2023 5:00pm]
The Hive, ICE, Firebird, and Buzzard clusters are now ready for research. We have released all jobs that were held by the scheduler.
Unfortunately, Phoenix storage issues continue. All work was completed, but the scratch storage failed its initial stress tests. The vendor is sending us a replacement controller, which will arrive and be installed early tomorrow. We will then stress-test the storage again. If it passes, Phoenix will be brought into production. If it fails, we will revert to the old scratch infrastructure in use prior to May 2023 while we look for a new solution. We have begun syncing data, but this will take time: if we revert, Phoenix will be brought into production with the scratch file system still syncing while 800TB is transferred, which may take approximately 1 week. During the sync, not all of your files will be present at first, but they will reappear as the transfer progresses. In the meantime, you may also encounter files that were present in your scratch directory prior to the May maintenance period but have since been deleted; these will disappear as the sync completes.
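For a sense of scale, the figures above (800TB in approximately 1 week) imply a sustained average transfer rate that can be estimated with simple arithmetic. The sketch below is only an illustration: it assumes decimal terabytes, exactly 7 days, and a steady rate, none of which is stated in this announcement, and the function name is hypothetical.

```python
# Rough arithmetic only. Assumptions (not from the announcement):
# decimal terabytes, exactly 7 days, and a constant transfer rate.
TB = 10**12  # bytes per terabyte (decimal)


def average_rate_bytes_per_s(total_bytes: float, days: float) -> float:
    """Average rate needed to move total_bytes in the given number of days."""
    return total_bytes / (days * 24 * 60 * 60)


rate = average_rate_bytes_per_s(800 * TB, 7)
print(f"{rate / 1e9:.2f} GB/s, {rate * 8 / 1e9:.1f} Gbit/s")
# prints "1.32 GB/s, 10.6 Gbit/s"
```

The actual rate will vary with file sizes and system load, so the 1-week figure should be read as an estimate.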
The monthly deletion of old scratch directories scheduled for next week is canceled. Please disregard the notification you may have received last week.
I apologize for the inconvenience, but I believe this delay will help decrease future downtime.
The next planned maintenance period for all PACE clusters is October 24, 2023, at 6:00 AM through October 26, 2023, at 11:59 PM.
Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.
Thank you,
Pam Buffington
PACE Director
[Update 8/8/2023 6:00am]
The PACE Maintenance Period starts now, at 6:00AM on Tuesday, 08/08/2023, and is tentatively scheduled to conclude by 11:59PM on Thursday, 08/10/2023.
[Update 8/7/2023 12:00pm]
This is a reminder that PACE’s next Maintenance Period starts at 6:00AM on Tuesday, 08/08/2023, and is tentatively scheduled to conclude by 11:59PM on Thursday, 08/10/2023.
[Update 8/2/2023 1:43pm]
WHEN IS IT HAPPENING?
This is a reminder that PACE’s next Maintenance Period starts at 6:00AM on Tuesday, 08/08/2023, and is tentatively scheduled to conclude by 11:59PM on Thursday, 08/10/2023. PACE will release each cluster (Phoenix, Hive, Firebird, ICE, and Buzzard) as soon as maintenance work is complete.
WHAT DO YOU NEED TO DO?
As usual, the scheduler will hold any job whose resource request would have it running during the Maintenance Period until after the maintenance is complete. During this Maintenance Period, all PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, ICE, and Buzzard. Please plan accordingly for the projected downtime.
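To plan around the hold, it can help to check whether a job's requested walltime would carry it past the start of the Maintenance Period. The sketch below is a simplified illustration, not the scheduler's actual logic: it assumes the job starts the moment it is submitted (ignoring queue wait), and the function name is hypothetical.

```python
from datetime import datetime, timedelta

# Start of the Maintenance Period, taken from this announcement.
MAINTENANCE_START = datetime(2023, 8, 8, 6, 0)


def would_be_held(start_time: datetime, walltime: timedelta) -> bool:
    """True if a job starting at start_time could still be running when
    maintenance begins. Assumes the job starts immediately (no queue wait)."""
    return start_time + walltime > MAINTENANCE_START


monday_noon = datetime(2023, 8, 7, 12, 0)
# A 24-hour request would run past 6:00AM Tuesday.
print(would_be_held(monday_noon, timedelta(hours=24)))  # True
# A 12-hour request would end at midnight, before maintenance starts.
print(would_be_held(monday_noon, timedelta(hours=12)))  # False
```

Requesting a shorter walltime, where the work allows it, is one way to let a job finish before the downtime instead of being held.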
WHAT IS HAPPENING?
ITEMS REQUIRING USER ACTION:
- [Phoenix] Create Interactive CPU and GPU partitions on Phoenix
ITEMS NOT REQUIRING USER ACTION:
- [Phoenix, Hive, ICE] Re-image login nodes to increase /tmp to 20GB
- [Phoenix, Hive, ICE] Open XDMoD to campus
- [Phoenix] Replace Phoenix project storage controller
- [Firebird] Upgrade firewall device firmware supporting CUI
- [Firebird] Add additional InfiniBand switches and cables to increase redundancy and capacity
- [OSG][Network] Move ScienceDMZ VRF to new network fabric
- [Network] Install leaf module to InfiniBand director switch
- [Network] Configure VPC pair redundancy to Research hall network switches
- [Network][Firebird] Install high-speed IB NIC on storage appliance for improved performance and capacity
- [Storage] Upgrade DDN controller firmware and disk firmware
- [Storage] Reboot the backup controller to synchronize with the main controller
- [Storage] Increase storage capacity for PACE backup servers
- [Storage] Increase storage capacity for EAS group storage servers
- [Storage] Replace cables on storage controller
- [Software] Move pace-apps to Slurm on admin nodes
- [Datacenter] Datacenter cooling maintenance
WHY IS IT HAPPENING?
Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.
WHO IS AFFECTED?
All users across all PACE clusters.
WHO SHOULD YOU CONTACT FOR QUESTIONS?
Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.
Thank you,
-The PACE Team
[Update 7/26/2023 4:39pm]
WHEN IS IT HAPPENING?
PACE’s next Maintenance Period starts at 6:00AM on Tuesday, 08/08/2023, and is tentatively scheduled to conclude by 11:59PM on Thursday, 08/10/2023. PACE will release each cluster (Phoenix, Hive, Firebird, ICE, and Buzzard) as soon as maintenance work is complete.
WHAT DO YOU NEED TO DO?
As usual, the scheduler will hold any job whose resource request would have it running during the Maintenance Period until after the maintenance is complete. During this Maintenance Period, all PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, ICE, and Buzzard. Please plan accordingly for the projected downtime.
WHAT IS HAPPENING?
ITEMS NOT REQUIRING USER ACTION:
- [Phoenix, Hive, ICE] Re-image login nodes to increase /tmp to 20GB
- [Phoenix, Hive, ICE] Open XDMoD to campus
- [Phoenix] Replace Phoenix project storage controller
- [Firebird] Upgrade firewall device firmware supporting CUI
- [Firebird] Add additional InfiniBand switches and cables to increase redundancy and capacity
- [OSG][Network] Move ScienceDMZ VRF to new network fabric
- [Network] Install leaf module to InfiniBand director switch
- [Network] Configure VPC pair redundancy to Research hall network switches
- [Network][Firebird] Install high-speed IB NIC on storage appliance for improved performance and capacity
- [Storage] Reboot the backup controller to synchronize with the main controller
- [Storage] Increase storage capacity for PACE backup servers
- [Storage] Increase storage capacity for EAS group storage servers
- [Storage] Replace cables on storage controller
- [Datacenter] Datacenter cooling maintenance
WHY IS IT HAPPENING?
Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.
WHO IS AFFECTED?
All users across all PACE clusters.
WHO SHOULD YOU CONTACT FOR QUESTIONS?
Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.
Thank you,
-The PACE Team