[Complete] PACE Maintenance Period: May 11 – 13, 2022

This entry was posted on Tuesday, 3 May, 2022.

[Update 5/16/22 9:20 AM]

All PACE clusters, including Phoenix, are now ready for research and learning. We have restored stability of the Phoenix Lustre storage system and released jobs on Phoenix.

Thank you for your patience as we worked to restore Lustre project & scratch storage on the Phoenix cluster. In working with our support vendor, we identified a scanning tool that was causing instability on the scratch filesystem and impacting the entire storage system. This has been disabled pending further investigation.

Due to the complications, we will not proceed with monthly deletions of old files on the Phoenix & Hive scratch filesystems tomorrow. Although only Phoenix was impacted, we will also delay Hive to avoid confusion. Files for which researchers were notified this month will not be deleted at this time, and you will receive another notification prior to any future deletion. Researchers are still encouraged to delete unneeded scratch files to preserve space on the system.

Campus network disaster recovery testing will disable Phoenix, Hive, PACE-ICE, and COC-ICE from 5:00 PM on Friday, June 10, through 12:00 noon on Monday, June 13. The next maintenance period for all PACE clusters is August 10, 2022, at 6:00 AM through August 12, 2022, at 11:59 PM. An additional maintenance period is tentatively scheduled for November 2-4.

Status of activities:

ITEMS REQUIRING USER ACTION:

  • None expected on research clusters

ITEMS NOT REQUIRING USER ACTION:

  • [Complete][ICE only][System] PACE-ICE and COC-ICE instructional clusters will receive an operating system upgrade to RHEL7.9, to match the research clusters. Visit our documentation for a guide on potential impacts. A testflight environment is not available for ICE.
  • [Postponed][Phoenix, Hive][Open OnDemand] Deploy R 8.3 on Open OnDemand
  • [Complete][Phoenix][Storage] Multiple upgrades to Lustre project and scratch storage
  • [Complete][Hive][Storage] Replace cable connecting GPFS project and scratch storage
  • [Complete][Network] Upgrade interfaces to 100 GbE on Globus Vapor endpoint and border storage
  • [Complete][Network] Add redundant 100 GbE switch to storage servers, increasing capacity
  • [Complete][System] Install operating system patches
  • [Complete][System] Update operating system on administrative servers
  • [Complete][Network] Move BCDC DNS appliance to new IP address
  • [Complete][Hive][System] Upgrade CUDA and NVIDIA drivers on Hive to match other clusters with CUDA 11.5
  • [Complete][System] Remove unused nouveau graphics kernel from GPU nodes
  • [Complete][Network] Set static IP addresses on schedulers to improve reliability
  • [Complete][Datacenter] Cooling loop maintenance
  • [Complete][Datacenter] Georgia Power Microgrid testing

If you have any questions or concerns, please contact us at pace-support@oit.gatech.edu.

[Update 5/13/22 3:25 PM]

The PACE team and our support vendor’s engineers continue working to restore functionality of the Phoenix Lustre filesystem following the upgrade. Testing and remediation will continue today and through the weekend. At this time, we hope to be able to open Phoenix for research on Monday. We appreciate your patience as our maintenance period is extended. If you have any questions or concerns, please contact us at pace-support@oit.gatech.edu.

[Update 5/13/22 2:00 PM]

PACE maintenance continues on Phoenix, while the Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard clusters are now ready for research and learning.

Phoenix remains under maintenance, as complications arose following the upgrade of Lustre project and scratch storage. PACE and our storage vendor are working to resolve the issue at this time. We will update you when Phoenix is ready for research.

Jobs on the Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard clusters have been released.

Campus network disaster recovery testing will disable Phoenix, Hive, PACE-ICE, and COC-ICE from 5:00 PM on Friday, June 10, through 12:00 noon on Monday, June 13. The next maintenance period for all PACE clusters is August 10, 2022, at 6:00 AM through August 12, 2022, at 11:59 PM. An additional maintenance period is tentatively scheduled for November 2-4.

Status of activities:

ITEMS REQUIRING USER ACTION:

  • None expected on research clusters

ITEMS NOT REQUIRING USER ACTION:

  • [Complete][ICE only][System] PACE-ICE and COC-ICE instructional clusters will receive an operating system upgrade to RHEL7.9, to match the research clusters. Visit our documentation for a guide on potential impacts. A testflight environment is not available for ICE.
  • [Postponed][Phoenix, Hive][Open OnDemand] Deploy R 8.3 on Open OnDemand
  • [In progress][Phoenix][Storage] Multiple upgrades to Lustre project and scratch storage
  • [Complete][Hive][Storage] Replace cable connecting GPFS project and scratch storage
  • [Complete][Network] Upgrade interfaces to 100 GbE on Globus Vapor endpoint and border storage
  • [Complete][Network] Add redundant 100 GbE switch to storage servers, increasing capacity
  • [Complete][System] Install operating system patches
  • [Complete][System] Update operating system on administrative servers
  • [Complete][Network] Move BCDC DNS appliance to new IP address
  • [Complete][Hive][System] Upgrade CUDA and NVIDIA drivers on Hive to match other clusters with CUDA 11.5
  • [Complete][System] Remove unused nouveau graphics kernel from GPU nodes
  • [Complete][Network] Set static IP addresses on schedulers to improve reliability
  • [Complete][Datacenter] Cooling loop maintenance
  • [Complete][Datacenter] Georgia Power Microgrid testing

If you have any questions or concerns, please contact us at pace-support@oit.gatech.edu.

[Detailed announcement 5/3/22]

As previously announced, our next PACE maintenance period is scheduled to begin at 6:00 AM on Wednesday, May 11, and end at 11:59 PM on Friday, May 13. As usual, jobs that request durations that would extend into the maintenance period will be held by the scheduler to run after maintenance is complete. During the maintenance window, access to all PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, PACE-ICE, COC-ICE, and Buzzard.
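The hold rule described above can be checked by hand before submitting. The following is a minimal sketch, not the scheduler's actual logic: the function names are hypothetical, times are treated as local, and it assumes a job could start the moment it is submitted, which a real queue does not guarantee.

```python
from datetime import datetime, timedelta

# Maintenance window taken from the announcement (local time assumed).
MAINT_START = datetime(2022, 5, 11, 6, 0)
MAINT_END = datetime(2022, 5, 13, 23, 59)


def would_be_held(start_time: datetime, walltime: timedelta) -> bool:
    """Return True if a job starting at start_time with the requested
    walltime would extend into (or begin during) the maintenance window.

    Hypothetical helper for illustration only.
    """
    if start_time >= MAINT_END:
        # Maintenance is already over; nothing to overlap with.
        return False
    return start_time + walltime > MAINT_START


def longest_safe_walltime(start_time: datetime) -> timedelta:
    """Largest walltime request that still finishes before maintenance
    begins, or zero if maintenance has already started."""
    return max(MAINT_START - start_time, timedelta(0))


if __name__ == "__main__":
    start = datetime(2022, 5, 10, 6, 0)
    print(would_be_held(start, timedelta(hours=12)))
    print(would_be_held(start, timedelta(hours=36)))
    print(longest_safe_walltime(start))
```

Under these assumptions, a job that needs to run before maintenance would request a walltime no longer than the value returned by `longest_safe_walltime`.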

Tentative list of activities:

ITEMS REQUIRING USER ACTION:

  • None expected on research clusters

ITEMS NOT REQUIRING USER ACTION:

  • [ICE only][System] PACE-ICE and COC-ICE instructional clusters will receive an operating system upgrade to RHEL7.9, to match the research clusters. Visit our documentation for a guide on potential impacts. A testflight environment is not available for ICE.
  • [Phoenix, Hive][Open OnDemand] Deploy R 8.3 on Open OnDemand
  • [Phoenix][Storage] Multiple upgrades to Lustre project and scratch storage
  • [Hive][Storage] Replace cable connecting GPFS project and scratch storage
  • [Network] Upgrade interfaces to 100 GbE on Globus Vapor endpoint and border storage
  • [Network] Add redundant 100 GbE switch to storage servers, increasing capacity
  • [System] Install operating system patches
  • [System] Update operating system on administrative servers
  • [Network] Move BCDC DNS appliance to new IP address
  • [Hive][System] Upgrade CUDA and NVIDIA drivers on Hive to match other clusters with CUDA 11.5
  • [System] Remove unused nouveau graphics kernel from GPU nodes
  • [Network] Set static IP addresses on schedulers to improve reliability
  • [Datacenter] Cooling loop maintenance
  • [Datacenter] Georgia Power Microgrid testing

If you have any questions or concerns, please contact us at pace-support@oit.gatech.edu.

[Early announcement]

Dear PACE Users,

This is a friendly reminder that our next maintenance period is scheduled to begin at 6:00 AM on Wednesday, 05/11/2022, and is tentatively scheduled to conclude by 11:59 PM on Friday, 05/13/2022. As usual, the scheduler will hold any job whose resource request would have it running during the maintenance period, and the job will run after maintenance is complete. During the maintenance period, access to all PACE-managed computational and storage resources will be unavailable.

If you have any questions or concerns, please do not hesitate to contact us at pace-support@oit.gatech.edu.

Best,

The PACE Team
